Solo tre parole: benchmarking local LLMs

“Solo tre parole: non sei solo.”

Six Italian words. The first three announce “only three words”. The last three deliver “you are not alone”. Read it again, slowly. The same word “solo” opens and closes the sentence, but it does different work each time: first as the adverb “only”, then as the adjective “alone”. And the supposedly three-word reassurance, non sei solo, contains exactly three words. The sentence is about itself. It is also, quietly, a little gem [1].

I gave it to a small zoo of local language models running on a laptop, asking each to translate it into English while preserving every nuance it could. What followed was more interesting than the question deserved. One model invented wordplay that did not exist. Another saw the cleverness with full clarity, articulated it precisely, and then ignored it. A third produced a candidate translation so contorted it was almost charming. And the model that most clearly understood the puzzle took sixteen minutes to say so.

This is the story of that afternoon.

Why the phrase is sneakier than it looks

Translation, at its dullest, is word substitution. At its more interesting, it is constraint satisfaction under aesthetic pressure: keep the meaning, keep the tone, keep the rhythm, and if there is a clever trick in the source, ideally keep that too. Translating is a highly skilled job.

Solo tre parole: non sei solo hides three small tricks in plain sight.

The first is lexical polysemy. Italian solo is doing two jobs in the same sentence: a quantifying adverb at the start, an existential adjective at the end. Same form, different role, different meaning. English has no single word that pulls double duty in quite the same way; we are forced to split the echo into two distinct lexical items, and the gentle internal rhyme of the original collapses.

The second is self-reference. The opening clause announces the length of the second clause, and the second clause delivers exactly that length. Non sei solo is genuinely three words. The sentence describes itself accurately. Most English candidate translations, like “you are not alone”, break this property: four words, not three. To preserve the self-reference, you need a contraction (“you’re not alone”) or some less natural construction.

The third is register. The sentence is intimate, minimalist, the kind of thing one writes on a postcard or sends as a message at a hard moment. It is not florid, it is not formal. Anything that translates the meaning but reaches for “you are not in solitude” misses the point entirely.

So that is the brief: hold polysemy, self-reference, and register together in six English words, or find an alternative that loses as little as possible. Possible, but only just.
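The count constraint, at least, is the one piece of the brief a few lines of code can check. A minimal sketch (the candidate list is mine, not any model's), treating a contraction as a single word:

```python
import re

def word_count(clause: str) -> int:
    """Count words, treating a contraction like "you're" as one word."""
    return len(re.findall(r"[\w']+", clause))

print(word_count("non sei solo"))       # 3, the original's second clause
print(word_count("you are not alone"))  # 4, natural but breaks the count
print(word_count("you're not alone"))   # 3, the contraction restores it
```

Whether `you're` "really" counts as one word is exactly the judgement call the models argue about later; the regex above simply encodes one side of it.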

The setup

I tested everything through Ollama on an M4 MacBook Air with 24GB of RAM, using the same prompt across all models. I do have a PC with a better GPU and better cooling, but its GPU memory cannot hold most of the tested models. So my passively cooled MacBook was promoted to LLM powerhouse for the afternoon.

The prompt asks the model to do five things, in order:

  1. Provide a literal gloss.
  2. Identify any wordplay, double meanings, self-referential structure, register choices, or cultural framing.
  3. Explain the genuinely hard parts.
  4. Offer two or three candidate translations, each with what it preserves and what it sacrifices.
  5. Pick a recommended translation and justify the choice.

The structure is deliberately fussy. Small local models tend to leap straight from “I see Italian” to “here is a translation”, missing every interesting layer along the way. Forcing a literal gloss first slows them down. Forcing multiple candidates with explicit trade-offs makes them name what they are giving up. Asking for justifications stops them waving their hands.
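In script form, a run looked roughly like this. The prompt below is a condensed stand-in for the real one (which is available on request), and the `ollama` Python package is one way of driving a local Ollama server:

```python
PROMPT = """Translate this Italian sentence into English, preserving every nuance.
Work through these steps, in order:
1. Provide a literal gloss.
2. Identify any wordplay, double meanings, self-referential structure,
   register choices, or cultural framing.
3. Explain the genuinely hard parts.
4. Offer two or three candidate translations, each with what it preserves
   and what it sacrifices.
5. Pick a recommended translation and justify the choice.

Sentence: "Solo tre parole: non sei solo."
"""

def ask(model: str) -> str:
    """Send the fixed prompt to one local model via Ollama, return the reply."""
    import ollama  # pip install ollama; assumes a local Ollama server is running
    response = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    return response["message"]["content"]

# Usage: for m in ("llama3.1:8b", "gemma4:26b"): print(m, ask(m), sep="\n")
```

Keeping the prompt as a single fixed string is what makes the cross-model comparison fair: every model sees exactly the same five steps in the same order.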

The lineup, more or less in order of running:

  • llama3.1:8b
  • granite4.1:8b
  • mistral-small3.2:24b
  • gemma4:e4b (small) and gemma4:26b (large), both with native thinking mode
  • deepseek-r1:8b and deepseek-r1:14b, both with native thinking
  • qwen3.6:27b, with thinking
  • gpt-oss:20b, with thinking

All Q4_K_M quantisation, except gpt-oss which uses MXFP4. So roughly comparable on the quantisation front, with one small caveat for the gpt-oss numbers.

How small models fail when asked to be clever

The most striking failure was llama3.1:8b. Asked to find subtlety in the source, it confidently told me that tre is phonetically similar to t’re, “which sounds like ‘there’”. This is invented. There is no such pun. The model, faced with a request to find wordplay, hallucinated wordplay rather than admit it could not find any.

This is the worst sort of small-model failure. A miss is recoverable; a fabrication looks like analysis and is not. If you do not speak the source language, you have no way to check. The model produced clean prose, confident structure, and made-up linguistics underneath.

granite4.1:8b did better: it identified the solo/solo polysemy, but its account of what the polysemy actually did in the sentence collapsed into incoherence. It missed the self-referential count entirely.

These are the small-model results in a nutshell: 8B parameters at Q4 quantisation does not appear to be enough capacity to hold polysemy, structural self-reference, and register all at once. Something has to give, and it does.

The analysis–translation gap

A more interesting failure showed up in the larger models. gpt-oss:20b is the cleanest example.

It saw the polysemy: “solo occurs twice, first as the adverb ‘only’, then as the adjective ‘alone’”. It saw the self-reference: “the phrase claims that the whole sentence consists of just three words”. Then in step 3, it noted, in plain English: “the English equivalent — ‘Only three words: you are not alone’ — has four words, so the exact numeric precision is lost.”

It saw the problem with full clarity. Then it proposed three candidates, none of which solved the problem, and recommended one that did not either.

deepseek-r1:14b showed the same shape. Sharp analysis, all candidates fail the count, recommendation flat.

This is more interesting than “didn’t see the problem”. These models did see it. They simply could not turn the seeing into a generation constraint. Identifying a problem and constructively satisfying it are, apparently, separate skills. Constraint identification looks like memorisation and pattern-matching; constraint satisfaction in English requires the model to feel its way to “you’re not alone” (counting the contraction as one word, which is the cleanest fix available) rather than describe its way there.

What thinking mode actually buys you

gemma4:26b with thinking mode enabled was the only model in the batch that caught everything and knew it had caught everything. Its analysis used phrases like “lexical echo” and “semantic mirror”. Its recommended translation, Just three words: you’re not alone, came with an explicit note: treating the contraction “you’re” as a single word preserves the 3:3 word count of the original. It did not stumble into the answer; it reasoned to it.

I then ran the same model with /set nothink. Same weights. Same prompt. Different answer.

The non-thinking version flatly stated, “there is no linguistic wordplay in the sense of puns.” This is wrong. With thinking off, Gemma’s failure mode collapsed neatly onto Mistral’s, missing the polysemy entirely.

That single comparison — same model, same query, thinking on versus off, opposite verdicts on whether wordplay even exists — is the cleanest demonstration I have seen of what reasoning-at-inference-time actually contributes. It is not just a quality boost. It is the ability to revise a first impression. Without thinking, the first pass is the answer, and a confident first pass can be confidently wrong.
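For anyone reproducing the comparison: the whole A/B test happens inside the Ollama REPL, two commands apart (command names as in recent Ollama builds; /set think switches reasoning back on):

```
$ ollama run gemma4:26b
>>> (paste the translation prompt)    # thinking run
>>> /set nothink
>>> (paste the translation prompt)    # non-thinking run
```

Same session, same weights, same context-free prompt each time; only the reasoning toggle changes.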

A small experiment with Mistral

mistral-small3.2:24b does not have native thinking. So I tried to fake it.

At Claude’s suggestion (see the credits at the bottom of this post), I added a “step 0” to the prompt:

Before writing the visible sections, work through the source carefully: list the words individually, check whether any word appears more than once, decide for each repeated word whether the meanings are the same or different, and check whether the announced word count matches any clause in the sentence.
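Deciding whether two uses of a repeated word carry the same meaning needs the model, but the listing and counting half of step 0 is purely mechanical. A sketch of what it amounts to:

```python
import re
from collections import Counter

sentence = "Solo tre parole: non sei solo"

# Step 0a: list the words individually and find any that repeat.
words = [w.lower() for w in re.findall(r"[\w']+", sentence)]
repeats = {w: n for w, n in Counter(words).items() if n > 1}
print(words)    # ['solo', 'tre', 'parole', 'non', 'sei', 'solo']
print(repeats)  # {'solo': 2}

# Step 0b: check the announced count ("tre parole" = three words)
# against each clause. Both clauses, as it happens, count three.
for clause in (c.strip() for c in sentence.split(":")):
    print(clause, len(re.findall(r"[\w']+", clause)))
```

That this is trivial outside the model is what makes it a fair thing to demand of the model: the prompt is not asking for new capability, only for a procedure.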

With this addition, Mistral suddenly caught the solo/solo polysemy it had missed completely on its first pass. The capability was in the base model; what had been missing was the procedure for using it. The hypothesis was cleanly confirmed: Mistral could see the polysemy; it simply had not looked for it in a single pass.

There was a twist. With the new instruction, Mistral lost track of the self-referential count, which it had caught in the original run. As if attention were a budget: spend it forcing one feature, and you lose it on another. Whether that is a real effect or a coincidence on this single sentence, I genuinely cannot tell.

The same run also gave me a textbook example of confabulation under structural pressure. One of Mistral’s candidate translations claimed to “preserve the shift from ‘only’ to ‘alone'” while sacrificing “the explicit count of three words”, but the candidate phrase was Three words only: you’re not alone. The words “three” and “words” are right there. The count is preserved. The model invented a sacrifice that did not exist, because the prompt asked for three differentiated candidates and only two of them were genuinely different.

The cost of cleverness

Capability is one axis. Speed is another, and there were surprises here too.

| Model                | Eval rate (tok/s) | Wall clock |
|----------------------|-------------------|------------|
| gpt-oss:20b          | 25,6              | 40 s       |
| gemma4:26b           | 25,6              | 53 s       |
| llama3.1:8b          | 20,8              | 21 s       |
| deepseek-r1:8b       | 19,0              | 1m 14s     |
| granite4.1:8b        | 18,6              | 27 s       |
| deepseek-r1:14b      | 10,8              | 1m 41s     |
| mistral-small3.2:24b | 7,2               | 50 s       |
| qwen3.6:27b          | 2,8               | 15m 57s    |

A 24B model running at 7,2 tok/s on the same hardware as a 26B model running at 25,6 tok/s is not what parameter-count instinct would predict. The biggest single factor turns out to be embedding dimension. Mistral’s 5.120-wide embeddings cost roughly 3,3× more compute per token than Gemma’s 2.816-wide ones, and that ratio matches the speed gap almost exactly. On Apple Silicon, where memory bandwidth is the binding constraint for inference, narrow-and-deep beats wide-and-shallow.
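The arithmetic behind that claim, under the assumption that per-token compute scales roughly with the square of the embedding width (the attention and MLP projection matrices grow with the width in each dimension):

```python
gemma_width, mistral_width = 2816, 5120          # embedding dimensions
compute_ratio = (mistral_width / gemma_width) ** 2
print(f"compute per token: {compute_ratio:.1f}x")  # ~3.3x

speed_ratio = 25.6 / 7.2                          # measured tok/s, Gemma / Mistral
print(f"measured speed gap: {speed_ratio:.1f}x")   # ~3.6x, close to the compute ratio
```

The quadratic scaling is a simplification (layer count, head layout, and MLP expansion factor all matter too), but on these two models it lands within a few percent of the measured gap.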

qwen3.6:27b is more puzzling. It has the same 5.120 embedding width as Mistral and DeepSeek 14B, yet ran at 2,8 tok/s, far slower than width alone explains. At 17GB the model is comparable in size to gemma4:26b, and it sits at the limit of what a MacBook Air with 24GB can run without serious memory pressure. But the Gemma model answered in under a minute, so qwen is likely deeper, has unoptimised inference paths in Ollama for that architecture, or pays an overhead for its 262.144-token context length. Whatever the cause, sixteen minutes for a single-sentence translation is not interactive. Quality-wise it was strong. Usability-wise it is unusable.

And the commercial chatbots?

A fair sanity check: how do the cloud-hosted models do on the same prompt?

ChatGPT (free version, GPT-5.5 as of writing) and Perplexity in default mode both performed at roughly the level of mistral-small3.2 or granite4.1: identifying one of the two layers, missing the other, recommending a flat translation. Defaulting to a consumer-friendly model presumably trades depth for cost, and the trade-off shows.

Perplexity with the Sonar model reached the gemma4:26b level: both layers caught, count preserved. Gemini in Fast+Thinking mode matched it too. So far, no surprises.

Claude with Opus 4.7 in Adaptive mode (which appears to engage thinking) also matched gemma4:26b on the first pass. But when I pushed it to convey everything from its own analysis in the translation rather than declaring trade-offs, it came back with something none of the other models had produced:

Three words alone: you’re not alone.

That is genuinely clever. The word “alone” appears twice, doing different work each time — first as a postpositive adverb meaning “merely” or “by themselves”, then as the predicate adjective meaning “solitary” — directly mirroring the solo/solo echo of the original. The post-colon clause is exactly three words. Register holds. It is the only translation across the entire test, local or commercial, that preserves all three constraints simultaneously.

The wider observation, perhaps: the gap between the best commercial cloud chatbot and the best local model on a MacBook Air is now smaller than the gap within either category. A well-chosen local model beats a default-mode commercial chatbot. And the difference between a thinking and non-thinking variant of the same model is larger than the difference between one good thinking model and another, regardless of where it runs.

Takeaways, more or less

A few things I will be carrying forward from this afternoon.

Parameter count is a poor proxy for almost everything. A 24B model can be slower than a 26B one and produce weaker analysis. Architecture, training, and inference mode all dominate. “Size class” is a simplification that hides every interesting variable.

Thinking mode does real work. When a task requires anything more than one-pass pattern-matching — counting, cross-referencing, constraint satisfaction — disabling thinking will silently cripple the model. The same gemma4:26b confidently denied wordplay existed without thinking, and confidently dissected it with thinking. If your local model supports /set think, leave it on for anything subtle.

Identification is not satisfaction. Several models cleanly described the problem and then produced answers that ignored their own description. Knowing the constraint and respecting it during generation are separate capabilities, and the second one is rarer.

Confabulation is the worst failure mode. Llama 3.1’s invented phonetic pun, and Mistral’s invented preserves/sacrifices justifications, are more dangerous than missing an answer. Missing leaves you uncertain; confabulating leaves you confidently wrong. Smaller models do it more, but no model is immune.

Local LLMs on a laptop are remarkable but not magical. A MacBook Air can now run models that catch literary wordplay in a foreign language. It can also run models that hallucinate confidently and sound convincing whilst doing it. The gap between those two modes is, increasingly, the more important question.

For what it is worth, my favourite translation remains the one gemma4:26b reasoned its way to:

Just three words: you’re not alone.

Three words. Precisely. With deliberate intent — and in a sentence about being seen, that is the whole point.


[1]: I got the idea while reading this article from The Guardian: ‘Being human helps’: despite rise of AI is there still hope for Europe’s translators? by P. Oltermann.

The full prompt and raw model outputs are available on request.

Credits: The translation prompt and the draft of this post were developed in conversation with Claude Opus 4.7 (Anthropic). Any errors of judgement remain mine 😉.

A Short Trip to Münster

As I had an appointment in Münster today, I took the opportunity to go for a short walk through this beautiful town once again.

I went back to the Botanical Garden. In winter, it has a completely different atmosphere, and there is almost no one there. Today, the small pond was still frozen, and because the air was relatively warmer, a thin layer of fog hovered above it.

After this, I walked briefly through the city center before heading back to my vehicle. Two buildings from two different eras caught my attention.

Along the way, snowdrops are popping up in the parks. They aren’t blooming yet – it is still winter – but their leaves are emerging from the bulbs, for those who pay attention.

It’s these small moments that remind me why I love exploring – and re-exploring – even familiar places.

Der Leuchtturm von Eierland

And now here is a German version of the poem about the Eierland lighthouse (though without rhymes; I did not manage that).

Der Leuchtturm von Eierland

Allein auf dem Achterdeck, heut Abend,
zittert ein Schimmer - Trauer im Blick.
Der Wind vom Meer, pfeifend in der Dunkelheit,
bläht das Segel - und das Herz verirrt sich.

–––

Das Leuchtfeuer von Eierland
  atmet in seinem Nebel
und säumt die Düne,
  auf der ein erstarrter Schatten wacht.
Der Wind des Wattenmeers,
  voll Bitterkeit,
trägt das Seufzen
  eines nie ausgesprochenen Wortes fort.

Jean-Christophe Berthon

Le Phare d’Eierland

Still on the maritime theme, here is another poem.

Le Phare d’Eierland

Seul sur la dunette, ce soir,
un halo tremble - tristesse d’un regard.
Le vent du large, sifflant dans le noir,
gonfle la toile - le cœur s’égare.

–––

Le fanal d’Eierland
respire dans sa brume
et ourle la dune
où veille une ombre engourdie.
Le vent des Wadden,
plein d’amertume,
emporte le soupir
d’un mot jamais dit.

Jean-Christophe Berthon

I will propose a German translation soon. 🪻

Mandala – colouring

I decided to apply the little I’ve learned about sketching to colouring a mandala with pencils. I might want to colour some future sketches, and before I f*🫢§ them up with bad colouring 🙈, I thought it a good idea to practise on mandalas first. 😁

A Mandala of a green glass with pencils and pens in it

Le Voilier – travail d’une esquisse

I want to create a sketch for my poem about the sailing boat. Having found no inspiring or instructive drawing from which to learn to draw a sailing boat, I decided to turn to artificial intelligence. I specified the type of boat I wanted, a single-masted cutter, and described the scene. Once satisfied with the result, I asked the AI to generate an image I could use as a basis for my drawing.

This work is still in progress, but I wanted to show its stages.

Adventures in France: Where Even the Complaints Are Scenic

Every summer, there comes a time when you must pack the car, ignore the nagging feeling that you’ve forgotten something crucial (spoiler: we most certainly did, but I enjoy the blissful thought that if I could remember what, then I wouldn’t have forgotten it in the first place – or would I? 😉), and head south toward cheese, baguettes, family, and hopefully a bit of rest. This year, our plan was simple: visit family in France, give the kids a change of scenery, and perhaps steal a few moments of peace in between. What else?

Week One – Family Affairs and Slightly Grumpy Hikes

We left on a Sunday – early for us – miraculously organised the day before (wrote this one down in the family history book). Vera heroically took the wheel for most of the 10-hour drive, while JC, having tossed and turned like a seal after a meal all night, could barely manage 1-hour drives. Still, we arrived in one piece. Bravo, Vera!

Monday greeted us with classic French summer weather: greyish skies, a stubborn 22°C, and the kind of light and sporadic drizzle that makes you unsure whether to wear a raincoat or not. We opted for a walk, the adults at least. The kids were less enthused, already dreaming of lakes and ice cream.

On Tuesday, JC was gently reminded that holidays warp your sense of time. He was quite surprised to see his uncle at the door, having firmly believed it was still Monday. Plans were reshuffled. Vera took the kids for a nearby via ferrata before heading to the lake, while JC enjoyed a long walk and chat with his uncle. We all reunited at the beach for drinks under the trees – a nice family moment.

Wednesday’s plan was a “small trek” – 5-6 km with a gentle 150 m elevation. We’ve done it before with the kids as toddlers. This time? After 500 m, complaints began. The kids executed a perfect moaning relay, each handing off the baton of grumbling as we marched on. Very French of them.

Speaking of experiencing the French, later that week JC cycled to the bakery and encountered the true spirit of French driving – a local buzzed past him so closely that JC could’ve checked his reflection in their side mirror. Naturally, JC expressed himself in an eloquent blend of French swearing and Italian hand gestures. The driver gestured toward a barely visible shared path (which, legally, he didn’t have to use). All part of the authentic French cycling-on-the-road experience. Beautiful country, baffling cycling signage, and morons behind the steering wheel.

Friday brought a family reunion with JC’s cousin. We explored a cave (12-14°C inside 🥶 – bliss for JC in shorts and sandals 😎), visited a waterfall, and had a picnic with 11 people, which required the kind of table you’d usually find at a wedding. The cave wasn’t long or large, but you could stand in it (except for my cousin and his 1m90 or so), and to JC’s surprise it had green plants growing in it, mostly ferns. He asked the guide why; she explained that the lighting system installed for visitors also emits UV light, while bats and tourists bring in seeds, which find clay and water to grow. Amazing! After that refreshing experience, the kids ran wild around a nearby pond while the adults caught up in the shade. Simple joys.

On Saturday, we attempted a hike to a via ferrata overlooking Lac du Bourget. Flynn and Vera were the brave ones who clipped in and started climbing 🧗💪. JC and the rest opted for games in the shade – we all have our strengths. Flynn gave it a shot but turned back (understandably, it’s 800m above the lake!). Vera finished the route with ease and then casually added a second, more challenging one for dessert. Show-off. 😜

Sunday brought the village festival and a visit from JC’s parents. Music, games, laughter, and even dancing — JC shared a lovely moment with Runa dancing together. We stayed up late. It was, in a word, festive.

Week Two – Lakes, Ropes, and 36 Degrees of Realisation

Monday and Tuesday were about beach life and Stand-Up Paddleboarding. JC found time to sketch a mountain landscape from Austria (yes, he brought his pencils, he’s that kind of holidaymaker … and he forgot his reading book at home – now I remember what I had forgotten, no more blissful feelings). The temperature was climbing, but still kind, especially in the shade.

Wednesday was the treetop adventure morning. We lied. Just a little. We told the staff all our kids were over 10 so they could try every route – including the red and black ones. They all did brilliantly, especially Flynn, who tackled the black like a pro. Afterwards, we – guess what – went to the lake. But it was chilly and windy (20°C), and the enthusiasm quickly gave way to shivers. All asked to leave early, except one stubborn soul. (We’re not saying who, but he quickly surrendered.)

And then came the heatwave.

With 36°C forecast, JC finally understood why Vera and the kids complain above 24°C 🥵. His Mediterranean blood has apparently expired this year. On Friday, he entered full Italian mode: siesta between 12 and 16h, shade, cold drinks, and zero movement. While the rest stayed active. What he didn’t realise was that the next day would be payback time: packing the car in 36°C 🥵🥵, without a single patch of shade. He’d parked by an empty bicycle rack, which he promptly used as a climbing frame to reach the roof box. Then came the “holiday luggage Tetris” championship. By the end, his t-shirt and shorts were completely soaked — the kind of wet you normally associate with jumping into the lake, not loading a car. This year, he really can’t stand the heat… and at last, he truly understands the rest of the family when their summer moaning begins.

Next, we will visit my parents for two days before heading home. We all miss our cat. 🐈

Day Two: Scrambled Plans and NRW Wilderness

We woke around 7:00 and started the day with a modest breakfast: a cereal bar and some scrambled eggs; Rührei, as the Germans call it. Flynn took the lead in our field kitchen, and I was assigned the role of apprentice. Fair division of labour. ;-)

The chef preparing our breakfast

Today, there were fewer photo stops. Jean-Christophe had promised Flynn no diversions for wildflower photography or wild-blackberry tasting. A hard promise for him to keep, but fair enough. :-D

We rode through Dülmen, where we made time for a second breakfast – a German tradition which Jean-Christophe was happy to discover. Then we left the town behind, passed the old gates and headed towards Lüdinghausen, with Burg Vischering as our next culinary goal.

The scenery along the way was amazingly peaceful with humming fields, cooling forests, gliding storks, and a lone maypole near a cozy fire circle. Part of the route took us along a stretch I had cycled a month earlier on the way to Münster.

At Burg Vischering, we paused for drinks and hoped for a bit of cake or ice cream (Flynn especially wanted soft-serve). Unfortunately, the ice cream machine was out of order for the day, and the only drinks available were fizzy – not Flynn’s favourite. Disappointment was quickly put aside with the help of homemade cookies, and we moved on.

The road to Olfen remained pleasant, but time began to press. Jean-Christophe had an appointment later in the afternoon, and we needed to keep a steady pace. We reached Olfen, where Flynn – understandably worn out after 88 km in two days – deserved an ice cream and was finally rewarded! Vera kindly came to meet us near Henrichenburg, where Flynn wrapped up his journey.

Jean-Christophe continued alone to Lünen, this time without the panniers — a lighter finish.

Spending time like this – father and child – was truly special. Jean-Christophe hopes to have similar moments with his other children too – by bike or on foot, under the open sky, and with time to just be together.

Day One: A Detour or Two

Our cycling trip got off to a slightly wobbly start. After 488 meters – yes, meters – Flynn’s rear derailleur gave up. Not something we could fix roadside, so we turned back, swapped bikes, and tried again. Second departure: successful.

The forecast had promised dry skies after 10:00. At 11:00, the sky disagreed. Whatever was falling wasn’t technically rain – or so we told ourselves – but it soaked us just the same.

I had a route plan. A gentle stop after 2 km to buy sandwiches. A coffee break 10 km later. Then we’d eat the sandwiches. Later still, cake and another coffee. Flynn wasn’t particularly aligned with this schedule. But we did at least stop for the sandwiches.

We also stopped more often than planned to take shelter from some heavy showers.

The fields we rode through were beautiful – full of wildflowers. I couldn’t resist taking a few photos along the way. Then came blackberries. Another pause. Flynn’s patience began to show signs of strain. Fair enough. We agreed to tone down the stops.

The path through the Haard forest was a highlight – green, peaceful, and definitely uphill. I reassured Flynn it was the last climb, and that from there it would be downhill all the way – at least, that’s what my navigation said. But just like the weather earlier, reality had its own plan. One more hill appeared… and that one was on me – I’d missed a turn. So for once, the forecast had been right – we just didn’t follow it. We passed a firefighter observation tower along the way and, naturally, climbed to the top.

A proper pause in Haltern am See gave us a bit of a breather before heading on toward Dülmen and our campsite near the lake.

Dinner – pasta with Bolognese – was served just as the first raindrops returned. We made it into the tent in time. Outside: a steady downpour. Inside: dry and warm.