LLM Document Structuring, Measured: 100% on Prices, 94.7% on Names with Local Qwen3-14B

The Data Entry Bottleneck

Every service that handles store or business information hits the same bottleneck: the real information lives in unstructured documents (menus, info sheets, Word files), while the system needs schema-conforming structured data — names, prices, categories, opening hours. Bridging that gap manually takes tens of minutes per store and introduces typos and omissions. LLMs are theoretically perfect for this conversion, but demos never tell you the number that matters for a production decision: field-level accuracy. So we measured it.

Test Design: Build the Answer Key First

Measuring accuracy requires ground truth. We used a fictional test cafe dataset: a 19-item menu (menu.md) and a store info sheet (info.md) in markdown, with the same content manually entered into PostgreSQL as the reference. The pipeline: unstructured document → LLM extraction with a structuring prompt and JSON schema → field-by-field comparison against the manual DB.

Three setup decisions mattered. The model was Qwen3-14B-AWQ served by SGLang — deliberately a mid-size local model, because the question was what a realistically operable on-premise model can do, not a cloud flagship. Temperature was 0.1, since extraction needs reproducibility, not creativity. And Qwen3’s thinking mode was disabled (enable_thinking: false), which stabilized JSON output by keeping reasoning traces from leaking into the response and breaking the parser.

Metric	Result	Grade
Menu items extracted	19 (matches DB exactly)	Perfect
Name recall	94.7% (18/19)	Excellent
Price accuracy	100.0%	Perfect
Category accuracy	100.0%	Perfect
Extraction time	5.92 seconds	-

The key finding: the closer data is to tabular/structured form, the stronger LLM extraction gets. Prices and categories came out perfect across all 19 items — no transposed digits, no dropped zeros. Considering human typo rates when transcribing 19 menu prices, this is already beyond human accuracy. And the whole extraction took 5.92 seconds versus 10+ minutes by hand — speed is not even a comparison.

The Failure Mode: A Plausibly Wrong Name

The single name mismatch is the most interesting result. The original item “벚꽃라떼” (cherry blossom latte) was extracted as “베트꽃라떼” — not a hallucinated item, but an OCR-like transcription slip in Korean syllables. Its price and category were still correct: the model recognized the item but misspelled it.

This error class is dangerous precisely because it is plausibly wrong. A completely wrong value fails validation; a misspelled-but-well-formed string passes JSON schema checks — it is a non-empty string of the right shape. Catching it requires source-document cross-checking in post-processing, e.g. verifying that every extracted string actually appears as a substring in the original document.

Two more observations: allergy info explicitly stated in the source (desserts) was extracted correctly, while unstated fields (drinks) were returned as null — the model did not fabricate missing information, which is a positive. Description fields came back empty because the source table had none: extraction quality is ultimately capped by source quality.

Business Info: The Normalization Wall

Field	Match	Note
Address	O	Minor notation difference ("Seoul" vs full official name) passed
Phone number	O	Exact match
Wi-Fi SSID	O	Exact match
Wi-Fi password	O	Exact match
Weekday opening time	O	09:00, exact
Weekday closing time	O	22:00, exact
Indoor seat count	X	Value is 30 in both — string vs integer type mismatch

Seven fields, five matches (71.4%), 2.38 seconds. The crucial nuance: the mismatches were not comprehension failures. Every fact — phone, Wi-Fi, hours — was extracted correctly. The failures were normalization issues: two valid spellings of the same city name, and a seat count returned as the string “30” instead of the integer 30. These are problems for deterministic post-processing code, not for another LLM call. Lumping them together as “the LLM was wrong” hides the actual fix.

Production Playbook

Treat LLM extraction as a draft generator, not a final data source. Even with some fields at 100%, the pipeline’s trust level must be set by its weakest field.
Layer schema validation with source cross-checking. JSON schema catches structural errors; substring verification against the original document catches plausible typos; type casting and notation normalization handle the rest — all deterministic code.
Move humans from data entry to review. The job changes from “type in 19 items” to “check the one or two flagged items.” Time drops to a tenth; accuracy goes up.

One more takeaway on model size: these numbers came from a quantized 14B model running locally, not a cloud flagship. Document-to-structured-data work is already within reach of mid-size local models — deployable even where data cannot leave the premises.

Conclusion

LLM auto-structuring is already practical as an assisted data-entry tool. Tabular data extracts at or near 100%; the dangerous errors are plausibly wrong values that pass schema validation, so source cross-checking is mandatory; and most mismatches are normalization and typing issues solvable in deterministic post-processing. Keep a human in the loop as reviewer — that role shift, from typist to reviewer, is where the real value of this technique lives.

Structuring Unstructured Documents with an LLM — Measured Accuracy on a 19-Item Menu

The Data Entry Bottleneck

Test Design: Build the Answer Key First

Menu Extraction: Prices Hit 100%

The Failure Mode: A Plausibly Wrong Name

Business Info: The Normalization Wall

Production Playbook

Conclusion

Related Posts

DB + RAG Hybrid Search — How We Improved LLM Fact Accuracy by 5x

Text2SQL Real-World Test — When LLM Writes SQL Directly

Qwen3-14B Deep Review — Why It Is Our Top-Ranked Local LLM