collocations - tPMHighlighter - a tool to help you gain insights into text

Collocations
What are collocations?
Collocation refers to the statistical relationship between the occurrence of two words within a specified text window, based on the frequencies of each item together and apart within a corpus. It is a corpus method used to indicate association between words, and helps explain the way vocabulary items are co-selected.  
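As a rough illustration of how a score can be derived from "the frequencies of each item together and apart", the sketch below computes three widely used symmetric measures (t-score, Dice and MI3) from four counts. The formulas are the common textbook versions, and the function name and arguments are hypothetical. The exact formulas and any window-size corrections used by tPMHighlighter may differ.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Return t-score, Dice and MI3 for one word pair (textbook formulas).

    f_xy: times the two words co-occur within the window (must be > 0)
    f_x, f_y: corpus frequency of each word on its own
    n: corpus size in tokens
    All names here are illustrative, not part of tPMHighlighter.
    """
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    # Co-occurrences expected if the two words were independent.
    expected = f_x * f_y / n
    return {
        "t": (f_xy - expected) / math.sqrt(f_xy),
        "dice": 2 * f_xy / (f_x + f_y),
        "mi3": math.log2(f_xy ** 3 / expected),
    }


# Invented counts, for illustration only.
scores = association_scores(f_xy=10, f_x=100, f_y=200, n=100_000)
print(scores)
```

Swapping `f_x` and `f_y` gives the same result, which is why measures of this kind are not affected by word order.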
In the screenshot, the text has been highlighted three times, using different collocation metrics.
The pink column highlights words more strongly if they are surrounded by other words with which they have strong associations, drawing on collocation metrics which are not affected by word order or the spacing between words.  
The yellow column highlights words more strongly if they are surrounded by words to the left or to the right which have strong associations, drawing on a collocation metric which takes the order of words into account.
The green column is the strictest measure of collocation as it only highlights words more strongly if they are surrounded by words to the left or the right in specific slots. Its collocation metric takes the order of the words into account and also distinguishes between words which are immediately adjacent and words which are two, three or four words apart.
Collocation measures are generally designed to give a lower ranking to grammar words, which tend to be used in many different word combinations. As a result, grammar words are in many cases highlighted less strongly than content words.
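To show what an order-sensitive metric looks like, here is a minimal sketch of Delta P in its usual contingency-table form. It assumes that `f_xy` counts only the cases where the first word precedes the second within the window. That assumption, the function name and the counts are all illustrative and not taken from tPMHighlighter.

```python
def delta_p(f_xy, f_x, f_y, n):
    """Delta P of the second word given the first: P(y | x) - P(y | not x).

    f_xy: times x is followed by y within the window (assumed definition)
    f_x, f_y: corpus frequency of each word; n: corpus size in tokens
    """
    a = f_xy                    # x present, y present
    b = f_x - f_xy              # x present, y absent
    c = f_y - f_xy              # x absent, y present
    d = n - f_x - f_y + f_xy    # x absent, y absent
    return a / (a + b) - c / (c + d)


# Invented counts: a rarer word x and a more frequent word y.
forward = delta_p(f_xy=10, f_x=100, f_y=200, n=100_000)   # y given x
backward = delta_p(f_xy=10, f_x=200, f_y=100, n=100_000)  # x given y
print(forward, backward)
```

The two directions give different values. When the cue word is very frequent, as grammar words are, `a / (a + b)` becomes small, which fits the observation that grammar words tend to receive weaker highlighting.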

How does tPMHighlighter work with collocations under the hood?
When the corpus is built, sentences are sent to The Prime Machine server. The server runs through each sentence word by word, generating pairs of words that occur 1, 2, 3 or 4 words apart. Each pair is then checked against the readymade corpus and its collocation scores are collected. The T, Dice and MI3 measures are based on the two words occurring in any order. The Delta P measure takes the order of the two words into account. The Log-likelihood measure takes into account both the order of the two words and whether or not the words in the pair are adjacent. The readymade corpus has collocations for all word pairs in the reference corpus, and these are divided into levels according to the collocational strength for each measure compared to all the other collocations in the reference corpus.
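The pair-generation step described above can be sketched as follows. This is a simplified illustration, not the server's actual code: the function names are hypothetical, the sentence is assumed to be already tokenised, and the three lookup keys simply mirror the three kinds of measure (any order, ordered, ordered with an adjacency flag).

```python
def generate_pairs(tokens, max_distance=4):
    """Return (left, right, distance) for every pair 1..max_distance apart."""
    pairs = []
    for i, left in enumerate(tokens):
        for distance in range(1, max_distance + 1):
            j = i + distance
            if j >= len(tokens):
                break  # no more words to the right in this sentence
            pairs.append((left, tokens[j], distance))
    return pairs


def lookup_keys(left, right, distance):
    """Build one hypothetical lookup key per kind of measure."""
    return {
        # T, Dice, MI3: the two words in any order.
        "unordered": tuple(sorted((left, right))),
        # Delta P: order matters.
        "ordered": (left, right),
        # Log-likelihood: order plus whether the words are adjacent.
        "ordered_slot": (left, right, distance == 1),
    }


for left, right, distance in generate_pairs(["a", "cup", "of", "tea"]):
    print(left, right, distance, lookup_keys(left, right, distance))
```

Each key would then be looked up in the reference data to retrieve the stored score or level for that measure.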
When a text is displayed, each word is given a colour to represent the strength of its collocations with the words which surround it. You can also view lists of the collocations from the reference corpus that match the pairs of words in the sentence.
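One way to turn pair levels into a per-word highlight is sketched below. How tPMHighlighter actually combines the pairs around a word is not stated here, so taking the strongest level among a word's pairs is an assumption, as are the function names, the level values and the opacity mapping.

```python
def word_levels(tokens, pair_levels, max_distance=4):
    """Give each token the highest level among its pairs (assumed rule).

    pair_levels maps frozenset({word1, word2}) to an integer level;
    pairs missing from the mapping count as level 0.
    """
    levels = [0] * len(tokens)
    for i in range(len(tokens)):
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            level = pair_levels.get(frozenset((tokens[i], tokens[j])), 0)
            levels[i] = max(levels[i], level)
            levels[j] = max(levels[j], level)
    return levels


def highlight_opacity(level, max_level):
    """Map a level to a 0.0-1.0 opacity for the highlight colour."""
    return level / max_level if max_level else 0.0


# Invented levels, for illustration only.
tokens = ["strong", "cup", "of", "tea"]
pair_levels = {
    frozenset(("strong", "tea")): 3,
    frozenset(("cup", "tea")): 2,
}
print(word_levels(tokens, pair_levels))
```

With these invented levels, "of" stays at level 0 and would not be highlighted, while "strong" and "tea" receive the strongest colour.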
