Mining word collocations


John Rupert Firth (1957):

You shall know a word by the company it keeps


  1. Problem
  2. Methods
    1. Gensim & NPMI
    2. Orange Data Mining
  3. Choosing between the two
  4. References & Links


Problem

Let’s say you want to extract common bigrams and trigrams from your text corpus. The corpus could be review data, feedback text, customer interactions in a CRM tool, etc. Extracting bigrams and trigrams surfaces the key phrases used in the corpus, which helps you understand it better. For example, users might repeatedly complain that something “doesn’t work”, and a phrase like “work anymore” could indicate recent breakage due to software updates. Our task here is to extract these collocations from a custom dataset and then use the results in a downstream application (monitoring, clustering, etc.).
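Before reaching for a scoring method, it can help to look at raw n-gram frequencies. The snippet below is a minimal sketch of that baseline, assuming the corpus is already tokenized into lists of strings; the `ngram_counts` helper and the toy `reviews` data are made up for illustration.

```python
from collections import Counter

def ngram_counts(tokenized_corpus, n):
    """Count contiguous n-grams within each tokenized sentence."""
    counts = Counter()
    for tokens in tokenized_corpus:
        # zip over n shifted copies of the sentence yields its n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Toy data, invented for this example
reviews = [
    ["app", "doesn't", "work", "anymore"],
    ["sync", "doesn't", "work"],
    ["login", "doesn't", "work", "anymore"],
]

bigram_counts = ngram_counts(reviews, 2)
trigram_counts = ngram_counts(reviews, 3)
top_bigrams = bigram_counts.most_common(2)
print(top_bigrams)
```

Raw counts tend to favor pairs made of very frequent words, which is the bias that association scores such as NPMI are meant to correct.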


Methods

Gensim & NPMI

Gensim can detect bigrams and trigrams by scoring adjacent words with normalized pointwise mutual information (NPMI). NPMI ranges from -1 to 1 and can be interpreted as:

  • -1: the words occur separately but never together
  • 0: the words co-occur as often as expected by chance, i.e. they are independent
  • 1: the words only occur together

For two words $x$ and $y$, NPMI is defined as:

$$i_n(x, y) = \left( \ln \frac{p(x, y)}{p(x)p(y)} \right) / -\ln p(x, y)$$
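To make the definition concrete, here is a small sketch that computes NPMI from raw counts and checks it against the three reference points (1, 0 and -1). It assumes probabilities are estimated as simple relative frequencies; the `npmi` function and the counts are illustrative, not taken from any library.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """NPMI of a word pair from raw counts; `total` is the number of observations.

    Assumption: p(.) is estimated as count / total.
    """
    if count_xy == 0:
        return -1.0  # limiting value: the words never occur together
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # degenerate case, avoids dividing by -ln(1) = 0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# x and y each occur 10 times, always together
together = npmi(count_xy=10, count_x=10, count_y=10, total=1000)
# p(x, y) = 0.001 equals p(x) * p(y) = 0.01 * 0.1, i.e. independence
independent = npmi(count_xy=1, count_x=10, count_y=100, total=1000)
# x and y both occur, but never together
never = npmi(count_xy=0, count_x=10, count_y=10, total=1000)

print(round(together, 6), round(abs(independent), 6), never)
```

The division by $-\ln p(x, y)$ is what normalizes plain PMI into the [-1, 1] range, so one threshold can be compared across pairs of different frequency.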

Gensim’s Phrases class gives you a handy interface for computing NPMI scores on a corpus and extracting the phrases whose score exceeds a threshold you tune. Once you’ve trained it on your corpus, you can apply it to new data instances at any time.

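from gensim.models import Phrases

ct = ["of", "the", "with"]  # illustrative only: your common domain words, function words, etc.
bigram = Phrases(
    tokenized_corpus,  # list of tokenized sentences: [["", ""], [""], ...]
    scoring="npmi", threshold=0.5,  # illustrative threshold; NPMI lies in [-1, 1]
    connector_words=ct,  # called common_terms in Gensim versions before 4.0
)

# new data
bigram[tokenized_sentence]  # new extractions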
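To show what threshold-based phrase detection does mechanically, and how a second pass turns bigrams into trigrams, here is a dependency-free sketch. It is not Gensim’s implementation (which also handles minimum counts, common terms and other details); the helper names, the toy corpus and the 0.8 threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def learn_counts(sentences):
    """Count single tokens and adjacent token pairs over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def npmi(pair, unigrams, bigrams, total):
    """NPMI of an adjacent pair, with probabilities as relative frequencies."""
    if total == 0 or bigrams[pair] == 0:
        return -1.0  # unseen pair: treat as never co-occurring
    p_xy = bigrams[pair] / total
    p_x, p_y = unigrams[pair[0]] / total, unigrams[pair[1]] / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def merge_phrases(sentence, counts, threshold, delimiter="_"):
    """Greedy left-to-right pass joining adjacent tokens whose NPMI exceeds threshold."""
    unigrams, bigrams = counts
    total = sum(unigrams.values())
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence):
            pair = (sentence[i], sentence[i + 1])
            if npmi(pair, unigrams, bigrams, total) > threshold:
                out.append(delimiter.join(pair))
                i += 2
                continue
        out.append(sentence[i])
        i += 1
    return out

# Toy corpus, invented for this example
corpus = [
    ["login", "does", "not", "work"],
    ["sync", "does", "not", "work"],
    ["search", "does", "not", "work"],
    ["the", "app", "is", "slow"],
    ["the", "app", "is", "fine"],
]
THRESHOLD = 0.8  # illustrative; tune on real data

# Pass 1: learn pair statistics on the raw tokens
level1 = learn_counts(corpus)
# Pass 2: re-learn on the merged corpus, so a bigram can pair with a third token
merged_corpus = [merge_phrases(s, level1, THRESHOLD) for s in corpus]
level2 = learn_counts(merged_corpus)

new_sentence = ["export", "does", "not", "work"]
with_bigrams = merge_phrases(new_sentence, level1, THRESHOLD)
with_trigrams = merge_phrases(with_bigrams, level2, THRESHOLD)
print(with_bigrams)
print(with_trigrams)
```

Stacking a second pass on the output of the first is how trigrams fall out of a pairwise score: the merged bigram is simply treated as one token.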

Orange Data Mining

Orange Data Mining is a really cool open source machine learning and data visualization tool. Once you install the Text Mining add-on, you can use the Corpus and Collocations widgets to score collocations with any of several scoring methods.

Choosing between the two

Orange is fun to play around with, especially for data exploration. If you want more customization and flexibility (for example, to build a system around the results), the Gensim approach is the better choice.