Scoring Writing Proficiency Across Languages: A Model Bake-Off

A language-learning platform came to us with a deceptively simple ask: grade a learner's written answer on the CEFR scale, from A1 to C2, in both English and Hebrew. The hard part was not the grading. It was doing it accurately across two very different languages, cheaply enough to run on every submission, and with enough evidence to trust the score. Here is exactly how we did it, and the numbers from the run.

99-100% language detection accuracy

models compared, 120 items each

~$8 total cost of the whole bake-off

One flow: detect, route, score

Instead of a separate pipeline per language, we built a single flow. The first agent looks at the writing and decides the language by its dominant script. A condition node then routes the text to the matching examiner: an English CEFR scorer or a Hebrew CEFR scorer. Each examiner returns a level and a short piece of evidence from the text.

The clever move was in how we detect. An earlier version tried to be a three-way classifier (English, Hebrew, or neither), and it topped out around 87 percent because gibberish and mixed input confused it. We changed the question to something a model answers almost perfectly: which script dominates? Detection jumped to 99 to 100 percent across every model we tried.
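To make the "which script dominates?" framing concrete, here is a minimal sketch. In the flow described above the detector is a model-backed agent; this version swaps in a plain character count purely for illustration. The function names, the route labels, and the "none" fallback for text with no letters are all assumptions, not part of the original flow.

```python
import unicodedata


def dominant_script(text: str) -> str:
    """Return 'hebrew', 'latin', or 'none' by counting letters per script."""
    counts = {"hebrew": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, vowel points
        name = unicodedata.name(ch, "")
        if name.startswith("HEBREW"):
            counts["hebrew"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["hebrew"] == 0 and counts["latin"] == 0:
        return "none"  # assumption: nothing to route when no letters are found
    # Assumption: ties go to Latin; a real flow would pick its own tie-break.
    return "hebrew" if counts["hebrew"] > counts["latin"] else "latin"


def route(text: str) -> str:
    """Condition node: map the dominant script to a (hypothetical) examiner name."""
    routes = {"hebrew": "hebrew_cefr_scorer", "latin": "english_cefr_scorer"}
    return routes.get(dominant_script(text), "no_score")
```

The design point is the same one the detector relies on: a mixed answer still has a dominant script, so the question always has a usable answer, where a three-way "English, Hebrew, or neither" label often does not.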

We scored on a number, not a letter

CEFR is six bands, but a band is a blunt instrument. A strong B2 and a weak B2 are not the same, and a rigid band match punishes a model for being one step off exactly as hard as being wildly wrong. So we scored on a continuous 0 to 1 scale, where the score for each answer is 1 minus the distance between the predicted level and the reference level, with that distance scaled so the result stays between 0 and 1. Gibberish and empty answers score 0. The reference levels themselves were graded once by a strong model (Claude Opus 4.8, with high reasoning) so every other model was measured against the same yardstick.
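A minimal sketch of that scoring rule, under stated assumptions: levels are mapped to their band index, and the distance is divided by 5 (the widest possible gap, A1 to C2) so the score stays between 0 and 1. The exact normalization used in the run is not given, and a finer-grained level within a band would plug into the same formula. The names `LEVELS`, `cefr_score`, and `mean_score` are illustrative.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def cefr_score(predicted, reference):
    """Score one answer as 1 minus the normalized distance between levels."""
    # Gibberish or empty answers have no valid level and score 0.
    if predicted not in LEVELS or reference not in LEVELS:
        return 0.0
    distance = abs(LEVELS.index(predicted) - LEVELS.index(reference))
    # Assumption: normalize by the maximum possible distance (5 bands).
    return 1.0 - distance / (len(LEVELS) - 1)


def mean_score(pairs):
    """Average the per-answer scores over (predicted, reference) pairs."""
    scores = [cefr_score(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this normalization an exact match scores 1.0, a one-band miss scores 0.8, and an A1 prediction against a C2 reference scores 0, so a near miss is no longer punished like a wild one.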

The bake-off

We ran three experiments on the one flow, 120 items each: detection, English level, and Hebrew level. Here are the top results per task.

Task	Best accuracy	Best value
Detection (English / Hebrew)	100% (three models tied)	gpt-5.4-mini, ~$0.0002 / item
English level	claude-haiku-4-5, 96.7%	claude-haiku-4-5, ~$0.0012 / item
Hebrew level	claude-sonnet-4-6, 94.4%	gpt-5.4-mini, ~$0.0008 / item

Across the level tasks the field was tight: English landed between 94.6 and 96.7 percent, Hebrew between 91.2 and 94.4 percent. The reference-grading model was not automatically the best scorer, which is the whole reason to run a comparison instead of trusting a hunch.
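As a rough worked example, the best-value prices in the table above can be added up per submission: one detection call plus one scoring call. This is a sketch of the arithmetic only. It assumes the best-value model is used at every step, and the batch sizes in the usage note are hypothetical, not figures from the run.

```python
# Approximate per-item prices in USD, taken from the "Best value" column.
DETECTION_COST = 0.0002
SCORER_COST = {"english": 0.0012, "hebrew": 0.0008}


def cost_per_submission(language: str) -> float:
    """One detection call plus one scoring call for the given language."""
    return DETECTION_COST + SCORER_COST[language]


def batch_cost(n_english: int, n_hebrew: int) -> float:
    """Total cost for a batch with the given number of answers per language."""
    return (n_english * cost_per_submission("english")
            + n_hebrew * cost_per_submission("hebrew"))
```

That works out to roughly $0.0014 per English answer and $0.0010 per Hebrew answer, so a hypothetical batch of 1,000 answers in each language would cost about $2.40.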

Three things we learned

1. The best model is per-task, not per-project. Claude Sonnet 4.6 was the strongest all-rounder, especially on Hebrew. Claude Haiku 4.5 was the best value, topping English at a fraction of the cost. The right answer was a mixed pipeline, not one model everywhere.

2. Cheap models won more than we expected. Once the task is framed well, a small model matches a large one on work like this. Detection was a solved problem at two hundredths of a cent per item.

3. More reasoning did not help. We tested high-reasoning variants expecting a bump. On these classification and scoring tasks they came back equal or slightly worse, at two to three times the cost and latency. Turning the dial up is not free, and here it bought nothing.

Run this on your own agent

The point of the story is not the CEFR scores. It is that you should never guess which model to ship. In Evaligo you point your flow at a set of test cases, pick a few models, and get this same scoreboard back: accuracy, cost and latency, side by side, with a clear winner. If you do not have test data yet, Evaligo generates a representative set for you. The whole bake-off above cost about eight dollars and answered a question that teams argue about for weeks.

Danny Lev
