
However, these promotional claims do not always match real-world performance, according to recent tests. “I’m typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly,” Willis noted.
“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It’s an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers.”
AI app developer Alexander Doria also recently pointed out on X a flaw with Mistral OCR’s ability to understand handwriting, writing, “Sadly Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely.”
According to Willis, Google currently leads the field in AI models that can read documents: “Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of errors, and I’ve run multiple messy PDFs through it with success, including those with handwritten content.”
Gemini’s performance stems largely from its ability to process expansive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of its context window also helps, since I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks for now.
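For readers curious what “working through a document in parts” might look like in practice, here is a minimal Python sketch. It is not Willis’s workflow or any vendor’s API. The function name `batch_pages`, the token budget, and the rule of thumb of roughly four characters per token are all assumptions made for illustration; a real tokenizer would produce different counts.

```python
def batch_pages(pages, max_tokens, chars_per_token=4):
    """Group consecutive pages into batches that stay under a rough token budget.

    pages: list of strings, one per page of extracted text (hypothetical input).
    max_tokens: assumed budget per request, chosen by the caller.
    chars_per_token: crude estimate only, not a real tokenizer.
    """
    batches, current, used = [], [], 0
    for page in pages:
        # Estimate the page's cost; every page counts as at least one token.
        cost = max(1, len(page) // chars_per_token)
        # Start a new batch if adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    pages = ["a" * 400, "b" * 400, "c" * 400]
    for i, batch in enumerate(batch_pages(pages, max_tokens=200)):
        print(f"batch {i}: {len(batch)} page(s)")
```

A page that exceeds the budget on its own still gets a batch to itself rather than being dropped, so no content is silently lost. A larger context window simply means fewer, bigger batches.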
The drawbacks of LLM-based OCR
Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt), or just generally misinterpret the data.