Essay April 2026 6 min read

Why I asked five technicians to label sixty thousand car comments

How a 2019 BERT fine-tuning project at Toyota became a lesson in domain judgment, labeling work, and production AI.

How do you teach a machine the difference between “the car drives smoothly” and “the car drives smoothly until you actually press the pedal”?

In mid-2019, sitting at a desk at Toyota Motor Europe in Brussels, that was my problem. We wanted to scrape YouTube reviews, owner forums, and magazine articles across five languages — English, French, German, Spanish, Italian — and turn millions of unstructured comments into something the engineering teams could act on. Not just “positive” or “negative,” but which subsystem the customer was complaining about, and what specifically was wrong with it.

The obvious move was to take a pretrained BERT model from Hugging Face and fine-tune it. BERT had landed less than a year earlier and the world was just starting to notice. The problem: BERT, at that time, knew Wikipedia and a corpus of books. It did not know cars.

When you fed it “the vehicle is jittery,” it had no idea whether that was good or bad. It didn’t know that a jittery clutch is a transmission complaint and a jittery throttle is a calibration complaint, and that they go to two completely different teams. It treated “the engine note is aggressive” as negative when half of all automotive enthusiasts mean it as a compliment. The English models were the most mature, and yet English reviewers had a habit I came to dread: they’d open with a complaint and close with praise in the same paragraph. “The brakes feel grabby in the wet, but honestly once you get used to them, the car is brilliant.” It’s a tone we are all accustomed to by now, thanks to some infamous British YouTubers. The language model would catch the first half and miss the rest entirely.

Generic models couldn’t fix this. We had to give them automotive context.

Which is how I ended up walking into a room of five Toyota technicians and explaining that I needed them to read tens of thousands of customer comments in an Excel file and label each one with three things: is this about the vehicle at all, which subsystem does it belong to, and what is the specific complaint.

Hand-drawn labeling schema showing a raw car comment becoming a labeled training row (my sketch, reimagined with ChatGPT Images 2.0): raw comment → manual labeling → training data. The point was not consistency alone, but consistency with judgment.
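For readers who think in code, the three-part label described above could be sketched roughly like this. It is a minimal illustration, not the schema the project actually used: the field names, the `Subsystem` categories, and the `to_training_row` helper are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Subsystem(Enum):
    # Hypothetical categories, for illustration only.
    TRANSMISSION = "transmission"
    THROTTLE_CALIBRATION = "throttle_calibration"
    BRAKES = "brakes"
    NVH = "nvh"


@dataclass(frozen=True)
class LabeledComment:
    text: str
    language: str                   # e.g. "en", "fr", "de", "es", "it"
    about_vehicle: bool             # 1. is this about the vehicle at all?
    subsystem: Optional[Subsystem]  # 2. which subsystem does it belong to?
    complaint: Optional[str]        # 3. what is the specific complaint?

    def __post_init__(self) -> None:
        # An off-topic comment should not carry a subsystem or a complaint.
        has_detail = self.subsystem is not None or self.complaint is not None
        if not self.about_vehicle and has_detail:
            raise ValueError("off-topic comment cannot have a subsystem or complaint")


def to_training_row(comment: LabeledComment) -> dict:
    """Flatten one labeled comment into a plain dict, one row of training data."""
    return {
        "text": comment.text,
        "language": comment.language,
        "about_vehicle": int(comment.about_vehicle),
        "subsystem": comment.subsystem.value if comment.subsystem is not None else None,
        "complaint": comment.complaint,
    }


row = to_training_row(
    LabeledComment(
        text="The brakes feel grabby in the wet, but honestly once you get "
             "used to them, the car is brilliant.",
        language="en",
        about_vehicle=True,
        subsystem=Subsystem.BRAKES,
        complaint="grabby in the wet",
    )
)
```

The validation step is the interesting part: the three questions are not independent, and a schema that makes contradictory answers impossible saves a lot of cleanup later.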

They did not believe it would work. Not in a hostile way, but in a quiet, patient, we’ve-seen-engineers-with-ideas-before way, especially ideas from engineers straight out of college. I wasn’t that junior, but I definitely felt it.

The disconnect wasn’t technical. These were people who had spent twenty years diagnosing real cars. The true petrol-heads. They knew exactly what “jittery” meant in the context of an EV regen cycle versus a manual transmission. They could tell from three lines of forum text whether the writer drove the car or just watched a YouTube review of it. They had the domain knowledge that no model on earth had.

What they didn’t believe was that reading Excel rows for two months would somehow produce a system that scraped YouTube on its own and routed complaints to the right engineering team at our Technical Center in Zaventem, Belgium. It sounded, fairly, like science fiction. The first week was mostly me re-explaining the pipeline. “You label this. The model learns what you mean by ‘NVH.’ Then the model labels the next million on its own. Then the engineering team gets a dashboard.”

By month two they were arguing with each other about edge cases. “That’s not a clutch issue, that’s drive-by-wire.” That’s when I knew it was working.

We labeled tens of thousands of comments. The fine-tuned model got us to a place where we could route automotive sentiment by subsystem, in five languages, with enough confidence that calibration engineers actually used the dashboard. It went into production. The latency, by the way, was under 100 milliseconds, but that’s another story.
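Routing "with enough confidence" can be pictured as a threshold on the classifier's per-subsystem scores. The sketch below is an assumption-laden illustration, not the production logic: the team names, the `route` function, and the 0.8 threshold are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical routing table; the team names are placeholders.
TEAM_FOR_SUBSYSTEM: Dict[str, str] = {
    "transmission": "transmission-team",
    "throttle_calibration": "calibration-team",
    "brakes": "brakes-team",
}


def route(scores: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return the team for the top-scoring subsystem, or None if unsure.

    `scores` maps subsystem name to the model's probability for it.
    A None result means the comment should go to a human reviewer
    instead of landing on an engineering dashboard.
    """
    if not scores:
        return None
    subsystem, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        return None
    return TEAM_FOR_SUBSYSTEM.get(subsystem)


confident = route({"transmission": 0.91, "throttle_calibration": 0.06, "brakes": 0.03})
ambiguous = route({"transmission": 0.50, "throttle_calibration": 0.45, "brakes": 0.05})
```

In this sketch the jittery-clutch comment would land with the transmission team, while a comment the model cannot separate from a throttle complaint would be held back for a person to read.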

The lesson I carried out of that project, and that I keep relearning every time someone shows me a flashy AI demo, is this: the model is almost never the hard part. The hard part is the unglamorous work of getting the right humans to encode their judgment into structured data. Not labels in the abstract — the specific judgments of people who have spent decades being right about the thing.

Seven years later, I’m watching the AI-in-finance space relearn this exact lesson. People keep showing me document-chat demos that hallucinate confidently because the underlying model never learned what quality of earnings actually means in a German Mittelstand transaction. The fix isn’t a smarter model. The fix is a senior diligence partner sitting next to an engineer for six weeks, labeling. Thanks to where we are with AI, that labeling can now be done much faster.

The technicians at Toyota knew things the model could not learn from the internet. The same is true now of compliance officers, deal partners, and procurement leads in every regulated industry I’ve worked in since.

You can’t skip that step. You can only respect it earlier or later.

Topic: Production AI depends on domain experts encoding judgment into structured data, not just better models.
Anchor example: A 2019 Toyota Motor Europe project using five technicians to label tens of thousands of multilingual automotive comments for BERT fine-tuning.
Key claim: The model is almost never the hard part; the hard part is getting the right humans to encode the judgments that the model cannot learn from the internet.