Mining word collocations
John Rupert Firth (1957):
You shall know a word by the company it keeps
Problem
Let’s say you want to extract common bigrams and trigrams from your text corpus. The corpus could be review data, feedback text, customer interactions with a CRM tool, etc. Extracting bigrams and trigrams surfaces the key phrases in the corpus, which helps you understand what it contains. For example, users might frequently complain that something “doesn’t work”, while a phrase like “work anymore” could indicate recent breakage due to software updates. Our task here is to extract these collocations from a custom dataset and then use the results in a downstream application (monitoring, clustering, etc.).
Methods
Gensim & NPMI
Gensim can help generate bigrams and trigrams with the Normalized Pointwise Mutual Information Score (NPMI). NPMI is a score that ranges from -1 to 1 and can be interpreted as:
- -1: the words occur separately but never together
- 0: the words co-occur exactly as often as chance would predict, i.e. they are independent
- 1: the words only occur together
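For two words $x$ and $y$, NPMI is defined as:

$$
\mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log p(x, y)}, \qquad \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}
$$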
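To make the score concrete, here is a minimal pure-Python sketch that computes NPMI for adjacent word pairs, assuming the standard definition NPMI(x, y) = PMI(x, y) / -log p(x, y) with PMI(x, y) = log(p(x, y) / (p(x) p(y))). The function name `npmi_scores`, the toy corpus, and the choice to estimate every probability as a count divided by the total number of tokens are assumptions made for illustration, not part of any library.

```python
import math
from collections import Counter


def npmi_scores(sentences, min_count=1):
    """Return {(w1, w2): NPMI} for adjacent word pairs in tokenized sentences.

    Hypothetical helper: probabilities are estimated as count / total tokens.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    total = sum(unigrams.values())

    scores = {}
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue  # skip pairs that are too rare to score reliably
        p_xy = n / total
        p_x, p_y = unigrams[w1] / total, unigrams[w2] / total
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(w1, w2)] = pmi / -math.log(p_xy)
    return scores


# Toy corpus, made up for illustration.
corpus = [
    "the app doesn't work".split(),
    "login doesn't work anymore".split(),
    "the app is great".split(),
]
scores = npmi_scores(corpus)

# Print pairs from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy corpus “doesn’t” and “work” always appear together, so the pair lands at the top of the range, while a pair such as (“app”, “doesn’t”) that co-occurs only once scores noticeably lower.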
Gensim has a handy interface to help you tune your NPMI scores on a corpus and extract phrases that exceed a threshold. Once you’ve initialized it on your corpus, you can always run it on new data instances.
Orange Data Mining
Orange Data Mining is a really cool open-source machine learning and data visualization tool. Once you install the text mining add-on, you can use the Corpus and Collocations widgets to load your text and rank collocations with a scoring method of your choice.
Choosing between the two
Orange is fun to play around with, especially for data exploration. If you want more customization and flexibility (for example, to build a system around the results), the Gensim approach is the better choice.
References & Links
- Gensim Phrases Documentation
- Orange Data Mining Collocations
- Manning, Christopher, and Hinrich Schütze. 1999. “Collocations”, in Foundations of Statistical Natural Language Processing
- Mikolov et al. 2013. “Distributed Representations of Words and Phrases and their Compositionality”
- Bouma, Gerlof. 2009. “Normalized (Pointwise) Mutual Information in Collocation Extraction”