corpusstatistics - tPMHighlighter - a tool to help you gain insights into text

Corpus Statistics

To view basic statistics for each text in your corpus, click the i button next to the corpus name from the main menu.

What kinds of measures are included?

The corpus statistics measures are listed in the table below.

Token count	Running words in the text (including punctuation)
Types	Different word forms in the text (including punctuation)
TTR	Type-token ratio (types divided by tokens). Affected by the length of the text: longer texts tend to have a lower TTR.
STTR	Standardised type-token ratio (the average TTR for strips of 400 running words within the text). Less affected by the length of the text.
Sentences	The number of sentences in the text, determined when the corpus is built by looking for sentence punctuation marks (. ! ?) and paragraphing.
Mean sentence length	The average length of the sentences in the text.
Paragraphs	The number of paragraphs in the text, determined when the corpus is built by looking for line breaks and/or <p> tags.
Mean paragraph length	The average length of paragraphs in the text.
Numbers in digits	The total number of tokens where the string is made up entirely of numerals (0-9).
Punctuation	The total number of tokens recognised as standard punctuation marks.
Words	The total number of tokens which are not digits and not punctuation.
Mean word length	The average length of words (excluding digits and punctuation).
1-15+ Letter words	The percentage of tokens which are words (not digits or punctuation) and which contain the specified number of letters.
1-5 .. 70+ token sentences	The percentage of sentences which contain the specified number of tokens (including punctuation).
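
The TTR and STTR definitions in the table can be illustrated with a minimal Python sketch. This is not the tool's own implementation: the function names ttr and sttr are hypothetical, the tokens are assumed to be already split (punctuation included), forms are compared exactly as given (no case folding), and a trailing strip shorter than the strip size is assumed to be discarded.

```python
from typing import Sequence


def ttr(tokens: Sequence[str]) -> float:
    """Type-token ratio: number of distinct forms divided by number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def sttr(tokens: Sequence[str], strip_size: int = 400) -> float:
    """Mean TTR over consecutive, non-overlapping strips of strip_size tokens.

    Assumptions (not confirmed by the tool's documentation): a trailing
    partial strip is ignored, and a text shorter than one full strip
    falls back to its plain TTR.
    """
    if strip_size <= 0:
        raise ValueError("strip_size must be positive")
    full_strips = len(tokens) // strip_size
    if full_strips == 0:
        return ttr(tokens)
    ratios = [
        ttr(tokens[i * strip_size:(i + 1) * strip_size])
        for i in range(full_strips)
    ]
    return sum(ratios) / len(ratios)


# Toy example with a strip size of 2 instead of 400.
sample = ["a", "b", "a", "a"]
print(ttr(sample))                 # 2 types / 4 tokens
print(sttr(sample, strip_size=2))  # mean of TTR(["a", "b"]) and TTR(["a", "a"])
```

Because each strip has the same length, the averaged value depends far less on how long the whole text is, which is why STTR is the safer measure for comparing texts of different sizes.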

Tip:

You can use the rotate grid button in the bottom-left corner of the table to switch between viewing results with each text in a different row and with each text in a different column.