corpusstatistics - tPMHighlighter - a tool to help you gain insights into text

tPMHighlighter
Corpus Statistics
To view basic statistics for each text in your corpus, click the i button next to the corpus name in the main menu.
What kinds of measures are included?
The corpus statistic measures are shown in the table below.

Token count
Running words in the text (including punctuation)

Types
Different word forms in the text (including punctuation)

TTR
Type-token ratio (types divided by tokens). Affected by the length of the text: longer texts tend to have a lower TTR.

STTR
Standardised type-token ratio (the average TTR for strips of 400 running words within the text); less affected by the length of the text than TTR.

Sentences
The number of sentences in the text, determined when the corpus is built by looking for sentence-final punctuation marks (. ! ?) and paragraph breaks.

Mean sentence length
The average length of the sentences in the text.

Paragraphs
The number of paragraphs in the text, determined when the corpus is built by looking for line breaks and/or <p> tags.

Mean paragraph length
The average length of paragraphs in the text.

Numbers in digits
The total number of tokens where the string is made up entirely of numerals (0-9).

Punctuation
The total number of tokens recognised as standard punctuation marks.

Words
The total number of tokens which are neither digits nor punctuation.

Mean word length
The average length of words (excluding digits and punctuation).

1-15+ Letter words
The percentage of tokens which are words (not digits or punctuation) and which contain the specified number of letters.

1-5 .. 70+ token sentences
The percentage of sentences which contain the specified number of tokens (including punctuation).
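
To make the token-based measures above concrete, here is a minimal Python sketch of how they could be computed. It is not tPMHighlighter's own code: the regex tokenizer, the lower-casing of tokens, the fallback to plain TTR for texts shorter than one 400-token strip, and the use of all tokens as the denominator for the letter-word percentages are assumptions made for illustration, and the function names (tokenize, ttr, sttr, classify, text_stats) are hypothetical.

```python
import re
import string
from collections import Counter

# Assumption: words/numbers are runs of word characters, and every other
# non-space character is a separate punctuation token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    # Assumption: case is folded so "The" and "the" count as one type.
    return TOKEN_RE.findall(text.lower())


def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sttr(tokens, chunk=400):
    """Mean TTR over consecutive full strips of `chunk` tokens."""
    starts = range(0, len(tokens) - chunk + 1, chunk)
    ratios = [ttr(tokens[i:i + chunk]) for i in starts]
    if not ratios:
        # Assumption: texts shorter than one strip fall back to plain TTR.
        return ttr(tokens)
    return sum(ratios) / len(ratios)


def classify(token):
    """Label a token as 'digits', 'punctuation' or 'word'."""
    if re.fullmatch(r"[0-9]+", token):
        return "digits"
    if all(ch in string.punctuation for ch in token):
        return "punctuation"
    return "word"


def text_stats(text, chunk=400):
    tokens = tokenize(text)
    kinds = Counter(classify(t) for t in tokens)
    words = [t for t in tokens if classify(t) == "word"]
    # Word lengths of 15 or more are pooled into a single "15+" bucket.
    lengths = Counter(min(len(w), 15) for w in words)
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "ttr": ttr(tokens),
        "sttr": sttr(tokens, chunk),
        "digits": kinds["digits"],
        "punctuation": kinds["punctuation"],
        "words": kinds["word"],
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Assumption: percentages are taken over all tokens, not just words.
        "letter_word_pct": {n: 100 * c / len(tokens)
                            for n, c in sorted(lengths.items())},
    }


if __name__ == "__main__":
    for name, value in text_stats("The cat sat. The dog sat, too! 42 cats?").items():
        print(name, value)
```

The sketch also shows why STTR is less sensitive to length than TTR: repeating the same 400-token strip twice halves the TTR of the whole text, while the per-strip average stays the same.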

Tip:
You can use the rotate grid button in the bottom-left corner of the table to switch between viewing results with each text in a different row and with each text in a different column.
