Cosine Similarity

Words appear in context. The context in which a word appears might tell us about that word. And the word might tell us about its context.

Cosenu di sumigghianza presents measures of cosine similarity based on the articles in Wikipedia Siciliana (2019-02-01).

This page discusses the measures. For a description of the models and estimation, please see my word context and word embeddings pages. And please see my similarity page for a simple description of the cosine measure itself.

The cosine measures reflect similarity in context. For the word "apple," the context might be a discussion of "fruit." And if the context were "fruit," we might expect to see the word "apple."

For a given word, the Skipgram model predicts the context in which that word might appear. The CBOW model makes the opposite prediction. For a given context, CBOW predicts the words that might appear in that context.
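The two directions of prediction can be sketched by listing the training examples each model would draw from a sentence. This is only an illustration: the toy sentence, the window size and the function names `skipgram_pairs` and `cbow_pairs` are assumptions made for this example, not the code used to estimate the models on this site.

```python
def skipgram_pairs(tokens, window=2):
    """Skipgram direction: each word is used to predict one nearby context word."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW direction: the surrounding context words are used to predict the center word."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            pairs.append((context, word))
    return pairs


# Hypothetical toy sentence, not taken from Sicilian Wikipedia.
tokens = ["the", "apple", "is", "a", "fruit"]

for word, context_word in skipgram_pairs(tokens, window=1):
    print("skipgram:", word, "->", context_word)

for context, word in cbow_pairs(tokens, window=1):
    print("cbow:", context, "->", word)
```

In the Skipgram pairs the word is the input and a context word is the target. In the CBOW pairs the roles are reversed, which is one way to see why the two models can rank similar words differently.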

The difference has an interesting effect on the measures. For example, when the context is "vespri," the CBOW model predicts that the three most context-similar words are "aragunisi," "angiò" and "rivorta." But when the word is "vespri," the Skipgram model predicts that the three most context-similar words are "aragunisi," "nnipinnintista" and "sicili."

In other words, when the context is "vespri," the articles on Sicilian Wikipedia tend to use words that describe the historical events, the Sicilian Vespers of 1282. But when the word is "vespri," the articles tend to use words that describe themes (context) of independence and Sicilian nationalism.

And because Sicilian Wikipedia has an extensive article about the rock band Queen, we can use the cosine measures to find information about the band too.

The Skipgram model predicts that the words most context-similar to "queen" are "mercury," "seas," "cold" and "freddie." The CBOW model predicts that the words most context-similar to "queen" are "miracle," "freddie," "tribute" and "killer."

I do not know how to interpret that, but I find it interesting.

You can find more interesting relationships with the cosenu di sumigghianza tool. If you enter a single word, it will return the 10 most context-similar words to that word (like the "vespri" and "queen" examples above).
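A single-word lookup of this kind can be sketched as ranking every other word by its cosine with the query word. This is a minimal sketch under stated assumptions: the three-dimensional vectors below are invented for illustration (they do not come from the Skipgram or CBOW models), and `most_similar` is a hypothetical helper, not the tool's actual code.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, k=10):
    """Return the k words whose vectors have the highest cosine with `word`."""
    target = vectors[word]
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Invented toy vectors (assumption): real embeddings have many more dimensions.
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "fruit": [0.8, 0.2, 0.1],
    "pear": [0.7, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
}

for other, score in most_similar("apple", toy_vectors, k=3):
    print(other, round(score, 3))
```

With these made-up vectors, "fruit" and "pear" rank ahead of "car" because their vectors point in nearly the same direction as the vector for "apple."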

If you enter a list of words (and if the tool finds enough matches), it returns a cosine matrix for that list of words. This feature allows you to obtain cosine measures for any pair of words (not just the top 10).

For example, to create a Skipgram cosine matrix, just type: "sicilia bedda".
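A cosine matrix for a list of words can be sketched as follows. The two-dimensional vectors are invented for illustration and are not the Skipgram estimates, `cosine_matrix` is a hypothetical helper, and silently skipping words that are missing from the vocabulary is an assumption about how unmatched words might be handled.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(words, vectors):
    """Return the matched words and the matrix of their pairwise cosines."""
    found = [w for w in words if w in vectors]  # keep only words with a vector
    matrix = [[cosine(vectors[a], vectors[b]) for b in found] for a in found]
    return found, matrix


# Invented toy vectors (assumption), chosen only to make the arithmetic easy.
toy_vectors = {
    "sicilia": [0.6, 0.8],
    "bedda": [0.8, 0.6],
}

found, matrix = cosine_matrix("sicilia bedda".split(), toy_vectors)
print(found)
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is 1 because every word is perfectly similar to itself, and the matrix is symmetric because the cosine does not depend on the order of the two words.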

Copyright © 2018-2024 Eryk Wdowiak