
Readability formulas compared: why six scores disagree

Published: May 30, 2026 · Reading time: 8 minutes

Six classic readability formulas can stare at the same paragraph and report grade levels four points apart. That looks like a bug. It is not. Each formula measures a different proxy for difficulty, and none of them measures comprehension directly. The disagreement is the most useful thing the scores have to tell you.

Flesch Reading Ease and Flesch-Kincaid use average syllables per word. Gunning Fog and SMOG count the share of words with three or more syllables. Coleman-Liau and ARI ignore syllables entirely and use letters per word. All six react strongly to sentence length, but they weight it differently. When the scores cluster, you have a reliable read on the text. When they spread apart, the spread tells you what kind of text is throwing the formulas off — long sentences, heavy vocabulary, irregular spellings, or short samples that none of the formulas were built to handle.

The short version: agreement across formulas means the text is what it looks like. Disagreement means one of the inputs (sentence length, syllable density, letter count) is doing something unusual, and that is usually a clue about how to revise.


What each formula actually counts

All six formulas turn two surface signals — sentence length and word weight — into a single number. They differ in how they measure word weight, and that is where most of the disagreement comes from. Microsoft Word documents the Flesch and Flesch-Kincaid formulas it ships with in its readability statistics reference, which is a useful baseline for what most word processors report.

Flesch Reading Ease

Outputs a 0–100 score where higher is easier. Inputs are average sentence length and average syllables per word. Rudolf Flesch published the formula in 1948 to help newspapers and government publications match prose to audience. A score of 60–70 is generally treated as plain English for a general adult audience.

206.835 − 1.015 × (words/sentence) − 84.6 × (syllables/word)
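
As a minimal sketch, the formula translates directly into Python. The function name and the sample counts are illustrative assumptions, not taken from any particular tool, and the syllable count is whatever your counter reports.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher means easier)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 1))  # 59.6
```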

Flesch-Kincaid Grade Level

Same inputs as Flesch Reading Ease, but the output is a US school grade. It was developed in 1975 for the US Navy to evaluate training materials and is now the formula most word processors quote first.

0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59
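
A sketch of the grade-level version, assuming the same three counts as inputs. The function name and the example counts are hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (US school grade)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))  # 9.91
```

Because the inputs match Flesch Reading Ease, one counting pass over the text can feed both formulas.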

Gunning Fog Index

Outputs a grade level. Inputs are average sentence length and the percentage of words with three or more syllables (its "complex words"). Robert Gunning introduced it in 1952 as a tool for business writers. Because it counts long words rather than averaging syllables, a single polysyllabic term pushes the score up more than it would in Flesch-Kincaid.

0.4 × (words/sentence + 100 × complex-word share)
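
The claim that one long word moves Gunning Fog more than Flesch-Kincaid can be checked with a little arithmetic. The sketch below uses hypothetical counts (a 44-word, 4-sentence passage) and leaves the job of identifying complex words to the caller.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog from raw counts; complex_words have 3+ syllables."""
    return 0.4 * (words / sentences + 100 * complex_words / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Swap one one-syllable word for a three-syllable word:
# that adds 1 complex word for Fog and 2 syllables for Flesch-Kincaid.
fog_delta = gunning_fog(44, 4, 11) - gunning_fog(44, 4, 10)
fk_delta = flesch_kincaid_grade(44, 4, 86) - flesch_kincaid_grade(44, 4, 84)
print(round(fog_delta, 2), round(fk_delta, 2))  # 0.91 0.54
```

In this example the single swap costs about 0.9 of a grade on Fog and about 0.5 on Flesch-Kincaid.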

SMOG (Simple Measure of Gobbledygook)

Outputs a grade level. Counts polysyllabic words (three or more syllables) across a 30-sentence sample. G. Harry McLaughlin published it in 1969 and designed it to predict 100 percent comprehension rather than the partial comprehension other formulas targeted, which is why public health agencies and patient-education writers tend to favor it.

1.0430 × √(polysyllables × 30 / sentences) + 3.1291
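
A minimal sketch of the formula in Python, with a hypothetical function name and made-up counts. The `× 30 / sentences` term rescales the polysyllable count to a 30-sentence sample, so two passages with the same rate of polysyllables per sentence get the same grade.

```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade; polysyllables are words with three or more syllables."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


# Same rate (one polysyllable per sentence), different sample sizes.
print(round(smog_grade(30, 30), 2))  # 8.84
print(round(smog_grade(10, 10), 2))  # 8.84
```

The rescaling keeps the arithmetic consistent on shorter samples, but it does not make a short sample any more representative of the whole text.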

Coleman-Liau

Outputs a grade level. Drops syllables entirely and uses average letters per word plus sentences per 100 words. Meri Coleman and T. L. Liau published it in 1975, motivated by the fact that computers in that era could count letters reliably but counted syllables badly. That tradeoff still holds — letters per word is unambiguous, syllables are a heuristic.

0.0588 × L − 0.296 × S − 15.8

(L is letters per 100 words; S is sentences per 100 words.)
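
A sketch of the same calculation from raw counts, assuming you already have letter, word, and sentence totals. The function name and the example counts are illustrative.

```python
def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau grade from raw counts."""
    if words <= 0:
        raise ValueError("words must be positive")
    letters_per_100 = letters / words * 100      # L in the formula
    sentences_per_100 = sentences / words * 100  # S in the formula
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hypothetical passage: 500 letters, 100 words, 5 sentences.
print(round(coleman_liau(500, 100, 5), 2))  # 12.12
```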

Automated Readability Index (ARI)

Outputs a grade level. Like Coleman-Liau, ARI uses letters per word rather than syllables, combined with average sentence length. It was developed in 1967 for the US Air Force, also to sidestep syllable counting on early computers.

4.71 × (letters/word) + 0.5 × (words/sentence) − 21.43
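
As a rough sketch with hypothetical counts, ARI needs only three totals and no syllable counter at all:

```python
def automated_readability_index(letters: int, words: int, sentences: int) -> float:
    """ARI grade from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical passage: 480 letters, 100 words, 4 sentences.
print(round(automated_readability_index(480, 100, 4), 2))  # 13.68
```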

The same paragraph, six different scores

Here is a single paragraph of moderately formal prose:

"The committee voted unanimously to delay the decision. Several members wanted more time to examine the budget projections, particularly the assumptions about future revenue. A few preferred to move ahead, arguing that further delay would damage public trust. The vote split along familiar lines."

Four sentences, 44 words, average sentence length of 11. Plenty of words with three or more syllables (committee, decision, projections, assumptions, particularly, unanimously). Run it through the AnchorKite Readability Checker and the six formulas report approximately:

| Formula | Score | What that means |
| --- | --- | --- |
| Flesch Reading Ease | ~36 | Difficult (college level) |
| Flesch-Kincaid Grade Level | ~11 | High school junior |
| Gunning Fog | ~13.5 | Early college |
| SMOG | ~12 | High school senior |
| Coleman-Liau | ~14.4 | College sophomore |
| ARI | ~10.4 | High school sophomore |

That is a four-grade spread between the most generous formula (ARI at ~10.4) and the least generous (Coleman-Liau at ~14.4). All six are looking at the same words, the same sentences, the same letters. They disagree because they weight those inputs differently.
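
The two letter-based scores are easy to reproduce by hand, because they need no syllable counter. The sketch below assumes a simple tokenization (runs of letters are words, runs of terminal punctuation end sentences), which is not necessarily how the AnchorKite checker or any other tool splits text.

```python
import re

PARAGRAPH = (
    "The committee voted unanimously to delay the decision. "
    "Several members wanted more time to examine the budget projections, "
    "particularly the assumptions about future revenue. "
    "A few preferred to move ahead, arguing that further delay would "
    "damage public trust. The vote split along familiar lines."
)

words = re.findall(r"[A-Za-z]+", PARAGRAPH)        # runs of letters
sentences = len(re.findall(r"[.!?]+", PARAGRAPH))  # terminal punctuation runs
letters = sum(len(w) for w in words)

letters_per_word = letters / len(words)
words_per_sentence = len(words) / sentences

ari = 4.71 * letters_per_word + 0.5 * words_per_sentence - 21.43
coleman_liau = (
    0.0588 * (letters_per_word * 100)
    - 0.296 * (sentences / len(words) * 100)
    - 15.8
)

print(len(words), sentences, letters)         # 44 4 246
print(round(ari, 1), round(coleman_liau, 1))  # 10.4 14.4
```

Both formulas see the same 5.6 letters per word and the same 11 words per sentence. The four-grade gap comes from the coefficients alone.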

Why so much spread? The paragraph has short sentences (good for ARI and Flesch-Kincaid, which both lean on sentence length) but long average word length (bad for Coleman-Liau, which is dominated by letters per word). About a quarter of the words are polysyllabic, which pushes Gunning Fog and SMOG higher than Flesch-Kincaid even though all three are nominally measuring "grade level." The pattern is the signal: this is prose with reasonably short sentences but heavy vocabulary. If you want a lower score across the board, cut the polysyllables — not the sentences.

Why disagreement is the useful signal

When the six scores cluster within a grade or two of each other, the text is roughly what it looks like: standard prose for a coherent audience. When they spread apart, the spread is diagnostic.
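
One way to put a number on the spread is sketched below, using the approximate scores from the sample paragraph. Flesch Reading Ease is left out because it reports a 0–100 score rather than a grade. The function name is hypothetical.

```python
GRADE_SCORES = {
    "Flesch-Kincaid": 11.0,
    "Gunning Fog": 13.5,
    "SMOG": 12.0,
    "Coleman-Liau": 14.4,
    "ARI": 10.4,
}


def grade_spread(scores: dict[str, float]) -> tuple[float, str, str]:
    """Return (spread in grades, lowest formula, highest formula)."""
    lowest = min(scores, key=scores.get)
    highest = max(scores, key=scores.get)
    return scores[highest] - scores[lowest], lowest, highest


spread, lowest, highest = grade_spread(GRADE_SCORES)
print(f"{spread:.1f} grades between {lowest} and {highest}")
# 4.0 grades between ARI and Coleman-Liau
```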

Where the formulas break down

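All six formulas were calibrated on multi-paragraph passages of standard English prose. They get noisy or wrong on text that does not match those assumptions: short samples of only a few sentences, bulleted lists, headings, code snippets, dialogue, and passages dense with proper nouns or technical terms.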

How to revise using the spread

Most of the practical advice in plain-language guidance is older than the formulas and outlives any single score. The federal plain language guide recommends short sentences, common words, and active voice; the CDC's framing of health literacy emphasizes that audience comprehension, not a number, is what matters. Use the readability scores as a quick diagnostic against that backdrop, not as a target to optimize.

A useful workflow:

  1. Score the whole document. Note the spread, not the average.
  2. If the spread is small and the level matches your audience, stop. The text is fine.
  3. If the spread is large, look at which formulas are high. Long-word formulas (Gunning Fog, SMOG) flag vocabulary. Sentence-length formulas (Flesch-Kincaid, ARI) flag pacing. Letter-density formulas (Coleman-Liau, ARI) flag jargon-style spellings.
  4. Revise the input the high-scoring formula is reacting to — shorten the sentence, swap the polysyllable, replace the term — and rescore. The spread usually narrows quickly.
  5. If a heading-heavy or bullet-heavy section produces wild scores, that is a tooling artifact. Score the prose paragraphs separately.
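
Step 3 of the workflow above can be sketched as a small lookup. The 2-grade margin, the function name, and the flag wording are assumptions chosen for illustration. The scores are the approximate values from the sample paragraph.

```python
# What a high score on each formula suggests, following step 3.
FLAGS = {
    "Gunning Fog": "vocabulary",
    "SMOG": "vocabulary",
    "Flesch-Kincaid": "pacing",
    "ARI": "pacing and jargon-style spellings",
    "Coleman-Liau": "jargon-style spellings",
}


def diagnose(scores: dict[str, float], margin: float = 2.0) -> list[tuple[str, str]]:
    """Flag formulas scoring at least `margin` grades above the lowest one."""
    floor = min(scores.values())
    high = [(name, s) for name, s in scores.items() if s - floor >= margin]
    high.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return [(name, FLAGS[name]) for name, _ in high]


scores = {
    "Flesch-Kincaid": 11.0,
    "Gunning Fog": 13.5,
    "SMOG": 12.0,
    "Coleman-Liau": 14.4,
    "ARI": 10.4,
}
for name, flag in diagnose(scores):
    print(f"{name}: check {flag}")
# Coleman-Liau: check jargon-style spellings
# Gunning Fog: check vocabulary
```

For the sample paragraph, both flags point at long words rather than sentence length.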

Where the AnchorKite tools fit

The Readability Checker reports all six formulas live in the browser, plus the underlying counts — words, sentences, syllables, letters, complex words — so you can see which input is driving each formula. The text never leaves your device.

If you only need raw counts (for a 280-character cap, a Substack lede, or a print column), the Word Counter handles that without running the formulas at all. Both tools, along with the rest of the writing toolkit, are on the Writing Tools hub.

Use the Readability Checker to compare all six formulas on your own text. Look at the spread, not the average. Revise based on the pattern the spread reveals rather than chasing one magic number across formulas that were never going to agree.

FAQ

Why do readability formulas give different scores for the same text?

Each formula measures a different proxy for difficulty. Flesch Reading Ease and Flesch-Kincaid use average syllables per word; Gunning Fog and SMOG count words with three or more syllables; Coleman-Liau and ARI use letters per word instead of syllables. All six react to sentence length, but they weight it differently. The same paragraph can land at grade 10 on one formula and grade 14 on another because each one is answering a slightly different question.

Which readability formula is most accurate?

None of them measures comprehension directly. They all approximate difficulty from surface features of the text. Flesch-Kincaid is widely used in government and education; SMOG is favored in healthcare because it was validated against full comprehension rather than partial; Coleman-Liau and ARI avoid syllable counting, which makes them more robust on text with unusual spellings. The most reliable read is the spread across several formulas — when they agree, you have a strong signal.

What kinds of text confuse readability formulas?

Short samples (under a few sentences), bulleted lists, headings, code snippets, dialogue, proper nouns, and technical terms all tend to push scores around in misleading ways. The formulas were designed for paragraphs of standard English prose. A page of section headings will look easy because the sentences are short; a paragraph full of long proper nouns will look hard for the wrong reason.

Should I target a specific readability score?

Target an audience, not a number. For general-public writing, government plain-language guidance and healthcare communicators commonly aim for around 6th- to 8th-grade level. For a college audience, grade 12 to 14 is normal. A target only matters if it matches the people you are writing for. Chasing a single number across formulas that disagree is a quick way to over-edit prose.

Where can I run the six formulas on my own text?

The AnchorKite Readability Checker computes all six scores live in your browser as you type. Nothing leaves the device — no accounts, no logging. The page also reports the underlying counts (words, sentences, syllables, letters) so you can see which input is driving the disagreement.

More guides at anchorkite.com/guides.