Scoring Writing Proficiency Across Languages: A Model Bake-Off
A language-learning platform needed to grade CEFR writing level in English and Hebrew. We built one flow to detect the language, route it and score it, then compared six models. Detection reached 99 to 100 percent, and the best model was not the most expensive one.
A language-learning platform came to us with a deceptively simple ask: grade a learner's written answer on the CEFR scale, from A1 to C2, in both English and Hebrew. The hard part was not the grading. It was doing it accurately across two very different languages, cheaply enough to run on every submission, and with enough evidence to trust the score. Here is exactly how we did it, and the numbers from the run.
99-100% language detection accuracy
6 models compared, 120 items each
~$8 total cost of the whole bake-off
One flow: detect, route, score
Instead of a separate pipeline per language, we built a single flow. The first agent looks at the writing and decides the language by its dominant script. A condition node then routes the text to the matching examiner: an English CEFR scorer or a Hebrew CEFR scorer. Each examiner returns a level and a short piece of evidence from the text.
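The shape of that flow can be sketched in a few lines of Python. This is an illustration only, not Evaligo's builder or API: `run_flow`, `Score`, and the stub detector and examiners are hypothetical names, and in the real flow each of those callables would be a model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Score:
    level: str     # CEFR band, "A1" through "C2"
    evidence: str  # short excerpt from the text that supports the level


# Hypothetical type aliases: a detector returns a language label,
# an examiner returns a Score for text in its own language.
Detector = Callable[[str], str]
Examiner = Callable[[str], Score]


def run_flow(text: str, detect: Detector, examiners: Dict[str, Examiner]) -> Score:
    """Detect the language, then route the text to the matching examiner."""
    language = detect(text)
    examiner = examiners.get(language)  # this lookup plays the role of the condition node
    if examiner is None:
        raise ValueError(f"no examiner registered for language {language!r}")
    return examiner(text)


# Stub components standing in for the model-backed agents (assumptions).
def stub_detect(text: str) -> str:
    return "english"


def stub_english_examiner(text: str) -> Score:
    return Score(level="B1", evidence=text[:20])


def stub_hebrew_examiner(text: str) -> Score:
    return Score(level="A2", evidence=text[:20])


result = run_flow(
    "I have been learning for two years.",
    stub_detect,
    {"english": stub_english_examiner, "hebrew": stub_hebrew_examiner},
)
```

Keeping the examiners in a dictionary means a third language would be one more entry rather than a new pipeline.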
The clever move was in how we detect. An earlier version tried to be a three-way classifier (English, Hebrew, or neither), and it topped out around 87 percent because gibberish and mixed input confused it. We changed the question to something a model answers almost perfectly: which script dominates? Detection jumped to 99 to 100 percent across every model we tried.
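To show how narrow the "which script dominates?" question is, here is a deterministic sketch of it. This is not the detection agent from the flow, which was a model. The function name `dominant_script` and the tie-breaking rule are assumptions made for illustration.

```python
def dominant_script(text: str) -> str:
    """Return "hebrew" or "english" depending on which script has more letters.

    Assumptions for this sketch: Hebrew letters are counted via the Unicode
    Hebrew block (U+0590 to U+05FF), English via ASCII letters, and ties
    (including empty input) fall back to "english".
    """
    hebrew = sum(1 for ch in text if "\u0590" <= ch <= "\u05FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "hebrew" if hebrew > latin else "english"
```

Note that there is no "neither" outcome. Gibberish and mixed input still get routed to an examiner, and the scoring step is what gives them a 0.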
We scored on a number, not a letter
CEFR is six bands, but a band is a blunt instrument. A strong B2 and a weak B2 are not the same, and a rigid band match punishes a model for being one step off exactly as hard as being wildly wrong. So we scored on a continuous 0 to 1 scale, where the score for each answer is 1 minus the distance between the predicted level and the reference level. Gibberish and empty answers score 0. The reference levels themselves were graded once by a strong model (Claude Opus 4.8, with high reasoning) so every other model was measured against the same yardstick.
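A minimal sketch of that per-item score follows. The write-up above does not spell out how the distance is normalized, so this assumes the six bands sit at evenly spaced positions 0 to 5 and that the distance is divided by 5, the largest possible gap. The names `BANDS`, `position`, and `item_score` are hypothetical.

```python
from typing import Optional

BANDS = ["A1", "A2", "B1", "B2", "C1", "C2"]
MAX_DISTANCE = len(BANDS) - 1  # 5 steps from A1 to C2 (normalization is an assumption)


def position(band: str) -> float:
    """Map a CEFR band to its position on a 0-to-5 scale."""
    return float(BANDS.index(band))


def item_score(predicted: Optional[float], reference: float) -> float:
    """Score one answer as 1 minus the normalized distance to the reference.

    `predicted` is None when no level could be assigned (gibberish or an
    empty answer), which scores 0. Fractional positions such as 3.4 for a
    strong B2 work the same way as whole bands.
    """
    if predicted is None:
        return 0.0
    distance = abs(predicted - reference) / MAX_DISTANCE
    return max(0.0, 1.0 - distance)
```

Under these assumptions a prediction one band off still earns most of the credit, while A1 against a C2 reference earns none. A model's accuracy on a task would then be the mean of `item_score` over its items.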
The bake-off
We ran three experiments on the one flow, 120 items each: detection, English level, and Hebrew level. Here are the top results per task.
| Task | Best accuracy | Best value |
|---|---|---|
| Detection (English / Hebrew) | 100% (three models tied) | gpt-5.4-mini, ~$0.0002 / item |
| English level | claude-haiku-4-5, 96.7% | claude-haiku-4-5, ~$0.0012 / item |
| Hebrew level | claude-sonnet-4-6, 94.4% | gpt-5.4-mini, ~$0.0008 / item |
Across the level tasks the field was tight: English landed between 94.6 and 96.7 percent, Hebrew between 91.2 and 94.4 percent. The reference-grading model was not automatically the best scorer, which is the whole reason to run a comparison instead of trusting a hunch.
Three things we learned
1. The best model is per-task, not per-project. Claude Sonnet 4.6 was the strongest all-rounder, especially on Hebrew. Claude Haiku 4.5 was the best value on English, topping that task at a fraction of the cost. The right answer was a mixed pipeline, not one model everywhere.
2. Cheap models won more than we expected. Once the task is framed well, a small model matches a large one on work like this. Detection was a solved problem at two hundredths of a cent per item.
3. More reasoning did not help. We tested high-reasoning variants expecting a bump. On these classification and scoring tasks they came back equal or slightly worse, at two to three times the cost and latency. Turning the dial up is not free, and here it bought nothing.
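To put a number on a mixed pipeline, the per-item costs from the "Best value" column of the table can be added up per submission. This is back-of-the-envelope arithmetic, not a measured result: it assumes every submission pays for one detection call plus one scoring call, and the submission volumes in the usage lines are hypothetical.

```python
# Approximate per-item costs in USD, taken from the "Best value" column above.
COST_PER_ITEM = {
    "detection": 0.0002,  # gpt-5.4-mini
    "english": 0.0012,    # claude-haiku-4-5
    "hebrew": 0.0008,     # gpt-5.4-mini
}


def cost_per_submission(language: str) -> float:
    """One detection call plus one scoring call for the detected language."""
    return COST_PER_ITEM["detection"] + COST_PER_ITEM[language]


def total_cost(english_items: int, hebrew_items: int) -> float:
    """Estimated cost in USD for a given mix of submissions."""
    return (
        english_items * cost_per_submission("english")
        + hebrew_items * cost_per_submission("hebrew")
    )


# Hypothetical volume: 10,000 submissions in each language.
estimate = total_cost(english_items=10_000, hebrew_items=10_000)
print(f"Estimated cost: ${estimate:.2f}")
```

Swapping a different scorer into one language changes a single entry in the dictionary, which makes the cost side of an accuracy trade-off easy to see.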
Run this on your own agent
The point of the story is not the CEFR scores. It is that you should never guess which model to ship. In Evaligo you point your flow at a set of test cases, pick a few models, and get this same scoreboard back: accuracy, cost and latency, side by side, with a clear winner. If you do not have test data yet, Evaligo generates a representative set for you. The whole bake-off above cost about eight dollars and answered a question that teams argue about for weeks.