Corpus Statistics
To view basic statistics for each text in your corpus, click the i button next to the corpus name in the main menu.
What kinds of measures are included?
The corpus statistic measures are shown in the table below.
Measure | Description |
Token count | Running words in the text (including punctuation) |
Types | Different word forms in the text (including punctuation) |
TTR | Type-token ratio (types divided by tokens). Affected by the length of the text: longer texts tend to have a lower TTR. |
STTR | Standardised type-token ratio (the average TTR for strips of 400 running words within the text). Less affected by the length of the text than TTR. |
Sentences | The number of sentences in the text, determined when the corpus is built by looking for sentence punctuation marks (. ! ?) and paragraphing. |
Mean sentence length | The average length of the sentences in the text. |
Paragraphs | The number of paragraphs in the text, determined when the corpus is built by looking for line breaks and/or <p> tags. |
Mean paragraph length | The average length of paragraphs in the text. |
Numbers in digits | The total number of tokens where the string is made up entirely of numerals (0-9). |
Punctuation | The total number of tokens recognised as standard punctuation marks. |
Words | The total number of tokens which are not digits and not punctuation. |
Mean word length | The average length of words (excluding digits and punctuation). |
1-15+ Letter words | The percentage of tokens which are words (not digits or punctuation) and which contain the specified number of letters. |
1-5 .. 70+ token sentences | The percentage of sentences which contain the specified number of tokens (including punctuation). |
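How are TTR and STTR calculated?
The Python sketch below illustrates the arithmetic behind some of the measures in the table: TTR, STTR, and the split of tokens into digits, punctuation and words. It is not the tool's own code. The function names, the regular-expression tokeniser, the lower-casing of word forms, the use of Python's `string.punctuation` as "standard punctuation", and the handling of the leftover words at the end of a text are all assumptions made for illustration, so the figures it produces may differ slightly from those shown in the statistics table.

```python
import re
import string


def tokenize(text):
    """Split text into tokens (hypothetical tokeniser, not the tool's own).

    Runs of letters/digits become one token and each punctuation mark is
    its own token. Lower-casing is an assumption about how types are counted.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by all tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sttr(tokens, strip_size=400):
    """Standardised TTR: the mean TTR over consecutive full strips.

    Assumptions: an incomplete final strip is ignored, and a text shorter
    than one strip falls back to its plain TTR.
    """
    last_start = len(tokens) - strip_size
    strips = [tokens[i:i + strip_size]
              for i in range(0, last_start + 1, strip_size)]
    if not strips:
        return ttr(tokens)
    return sum(ttr(s) for s in strips) / len(strips)


def breakdown(tokens):
    """Count digit tokens, punctuation tokens and words.

    A word is any token that is neither all numerals (0-9) nor made up
    only of punctuation characters (assumed: string.punctuation).
    """
    digits = sum(1 for t in tokens if re.fullmatch(r"[0-9]+", t))
    punct = sum(1 for t in tokens
                if all(ch in string.punctuation for ch in t))
    return {"digits": digits,
            "punctuation": punct,
            "words": len(tokens) - digits - punct}


if __name__ == "__main__":
    sample = tokenize("The cat sat. The end!")
    print(len(sample), len(set(sample)), round(ttr(sample), 3))
    print(breakdown(tokenize("In 2024, 3 cats sat.")))
```

Because every strip has the same length, STTR can be compared between a short text and a long one, whereas plain TTR usually falls as a text grows and common words are repeated.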
Tip:
You can use the rotate grid button in the bottom-left corner of the table to switch between viewing results with each text in a different row and with each text in a different column.