This is more 'data exploration' than visualization, but still some interesting results.
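Word2vec is an algorithm that takes the words in a corpus and places them in a multidimensional space. A simplified way to picture it is that every word in the vocabulary becomes its own dimension. The algorithm then takes each individual word and creates a 'word vector' for it, based on the words it's found next to (within a given window). In practice word2vec learns a much more compact space of a few hundred dimensions, but the intuition carries over.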
So if you said 'I like cats' and 'I like dogs', the vectors for the words 'cats' and 'dogs' would both be pointing (roughly) towards the 'I' and 'like' dimensions. The cool part is that these are still 1-dimensional lines, so you can compare them by taking the cosine of the angle they make.
If two words occur in the exact same contexts, they'll have a cosine of 1. If they're completely independent of each other, they'll be orthogonal and have a cosine of 0. If they occur inversely to one another, you'll get a cosine of -1.
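To make those three cases concrete, here's a minimal sketch in Python. The vectors are hand-made toy values, not real word2vec output, and the dimension order ('I', 'like', 'hate') is just an assumption for illustration. Raw co-occurrence counts can't go negative, so the -1 case uses a made-up vector with negative components, which learned word vectors can have.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Toy dimensions (assumed order): ("I", "like", "hate")
cats = [1, 1, 0]       # seen next to "I" and "like"
dogs = [1, 1, 0]       # same contexts as "cats"
spiders = [0, 0, 1]    # only seen next to "hate"
anti_cats = [-1, -1, 0]  # hypothetical vector pointing the opposite way

same = cosine(cats, dogs)            # close to 1
unrelated = cosine(cats, spiders)    # 0
opposite = cosine(cats, anti_cats)   # close to -1
```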
When done on real data sets with real words, we do actually see that words with similar contexts have very strong semantic relationships (just google 'word2vec' and you'll see what I mean).
So, we know word2vec can approximate semantic groupings based on the corpus of data it's been fed. But, I wondered, would it be able to detect biased semantic groupings based on biased data? Like, say, political speeches given in parliament?
Swedish parliament data
The Swedish parliament has all of its speech data available here. I downloaded the XML for speeches between 2006 and early 2018. I did some basic cleanup (capitalization, punctuation, etc.) and grouped the speeches by political party. The assumption is that training on the sum of a political party's speeches might surface the bias in that party.
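For the curious, that kind of cleanup can be sketched roughly like this. It's not the exact code I ran: the `(party, text)` input format and the function names are assumptions for illustration, and the XML parsing is left out entirely.

```python
import re
from collections import defaultdict

def clean(text):
    """Lowercase, drop punctuation (keeping hyphens inside words), and tokenize."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so å, ä and ö survive
    text = re.sub(r"[^\w\s-]", " ", text)
    tokens = (tok.strip("-") for tok in text.split())
    return [tok for tok in tokens if tok]

def group_by_party(speeches):
    """speeches: iterable of (party, text) pairs (a hypothetical format)."""
    grouped = defaultdict(list)
    for party, text in speeches:
        grouped[party].append(clean(text))
    return dict(grouped)

example = group_by_party([
    ("S", "Sverige, Norge och Danmark!"),
    ("MP", "Miljön är viktig."),
    ("S", "En 12-åring - och en fiende?"),
])
```

Each party then ends up with a list of tokenized speeches, which is the shape most word2vec implementations expect as training input.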
Obviously, there are tons of caveats here. Some political parties are older than others, the memberships change over time, my cleanup is overly simplistic (not combining singulars and plurals, for instance), this is nowhere near enough data, etc.
All that having been said though, what does this 'first attempt' yield?
I made a fairly simple data explorer. You pick the political party, type in a word, and you get the top 25 or so words matching the given word (so the 25 words with the highest cosine similarity).
Hovering over a word will give you its cosine similarity to 2 decimal places, multiplied by 100 (so a similarity of 0.5539650 is shown as 55). You can click on any of the words returned to search for that word in turn, and so on.
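The lookup itself boils down to something like the sketch below. The tiny 2-dimensional vectors are made up for illustration (real ones come out of the trained model), and `most_similar` and `display_score` are hypothetical names. I'm also assuming plain rounding for the displayed number.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def most_similar(query, vectors, top_n=25):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    target = vectors[query]
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word != query  # don't return the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def display_score(similarity):
    """Similarity rounded to 2 decimal places, shown as a whole number."""
    return round(similarity * 100)

# Made-up toy vectors for one party's model
toy_vectors = {
    "sverige": [1.0, 0.2],
    "norge": [0.9, 0.3],
    "danmark": [0.8, 0.1],
    "fiende": [-0.1, 1.0],
}

top_two = most_similar("sverige", toy_vectors, top_n=2)
```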
Interpreting the results
Because cosine similarities come back with a value between 0 and 1 for positive results, it's very easy to (mistakenly) interpret them as percentages. They're not. That having been said, though, I needed a way to show some words were 'more' similar than others visually. So I kind of treated the similarity metric as a percentage in terms of scaling the font for the words displayed.
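The font scaling is nothing fancier than a linear mapping, roughly like this sketch. The pixel bounds are made-up values for illustration, not the ones used in the explorer.

```python
def font_size(similarity, min_px=10, max_px=40):
    """Map a similarity in [0, 1] linearly onto a font size in pixels.

    min_px and max_px are hypothetical bounds. Negative similarities are
    clamped to 0 so they get the smallest size.
    """
    clamped = max(0.0, min(1.0, similarity))
    return min_px + clamped * (max_px - min_px)

sizes = [font_size(s) for s in (0.0, 0.5, 1.0)]
```

Note that this is exactly the 'treat it like a percentage' shortcut: a word at 0.8 is drawn bigger than one at 0.4, but that doesn't mean it's twice as related.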
Unfortunately, the 'meaningfulness' of the similarity depends on the size of the corpus and the relative frequency of that word in the corpus. Frequently used words, like "Sweden" give fairly intuitive results, but less-frequent words might show a strong similarity metric just because they were used together a handful of times.
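One blunt way to cut down on that kind of noise is to ignore words that appear fewer than some minimum number of times. Here's a minimal sketch, where the threshold and function names are assumptions (libraries such as gensim expose a similar `min_count` parameter when training). What the 'right' threshold is for this corpus is precisely the thing I can't judge on my own.

```python
from collections import Counter

def frequent_words(tokenized_speeches, min_count=5):
    """Set of words occurring at least min_count times across all speeches."""
    counts = Counter(tok for speech in tokenized_speeches for tok in speech)
    return {word for word, n in counts.items() if n >= min_count}

def filter_results(results, vocab):
    """Keep only (word, similarity) pairs whose word is frequent enough."""
    return [(word, sim) for word, sim in results if word in vocab]
```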
The only way to know how similar is 'similar enough' is to sanity-check against an expert who knows Swedish parliamentary speeches, history, and data really well. I don't have access to such an expert, so rather than tweaking it myself based on my limited knowledge, I left it all as-is.
The end result is, well, not very clear-cut.
Some associations make sense, such as Sweden being in the same semantic category as Norway, Denmark etc.
There are also some of the expected associations (xenophobia in the xenophobic party, environmentalism in the environmentalist party etc) but not as many as I would have guessed.
A vast majority, though, is noise and random chance. We see 'enemy' and '12-year-old' paired up for the Social Democrats. I doubt they see 12-year-olds as enemies. It's most likely that these two words are very infrequent, and happened to have similar contexts the small handful of times they were used.
Still, it's kind of cool when you put in something like 'cultural heritage' (kulturarv) and see that every political party wants to protect it (försvar), but they all have different ideas of what it is.
Why release this?
I wasn't sure if I should release this or not in its current state. I mean, it obviously has potential, but it still needs lots of data cleanup before it can be meaningful. Also, a more meaningful version would only return words that are frequent enough and co-occur enough to not give any 'false positives' (which, again, could only be found empirically with the help of an expert).
So why release it at all? Well, to let people know that such topic analysis is possible in the first place, and to hopefully inspire others to take this kind of topic analysis to the next level. And, who knows, maybe now some Swedish parliamentary expert might want to team up with me afterwards 😃
So, please look at this more as interactive word-art, and not as a rigorous, scientific study.
Code and Live Demo
A note on previous work with Danish data
I did work similar to this for zetland.dk. The Zetland article was done in conjunction with Philip Flores, who is an expert on what goes on in the Danish parliament. He was able to put some of the seemingly random results we got into context.
As such, everything reported in the article had been 'sanity-checked' to be sure the information was not the result of errant artefacts of undersampling.
In contrast, the data for the Swedish tool is provided 'as-is', and it is up to the user to do further research if they so desire.