IndoDoc Vision: Automated Kartu Keluarga Extraction with State-of-the-Art A
About this Project
IndoDoc Vision is a pipeline designed to extract structured JSON data from Indonesian Family Card (Kartu Keluarga) documents. By integrating YOLOv8, U-Net, and Gemini VLM, the system achieves >95% field-level accuracy.
Details
Features
The project uses a three-stage pipeline (detection, enhancement, extraction) designed for reliability:
• Intelligent Detection: Uses YOLOv8 to locate 22 specific field classes (e.g., NIK, Name, Address), reaching mAP@0.5:0.95 = 0.886.
• Advanced Enhancement: Employs a U-Net model to "clean" document crops through denoising, binarization, and line removal, making the text significantly more legible for the extraction phase.
• VLM Extraction: Leverages Google Gemini 1.5 to perform intelligent field association and OCR, converting visual data into a structured JSON format.
• Production Capabilities: The system is Docker-ready, exposes Prometheus metrics for observability, and is designed so that no PII (Personally Identifiable Information) is written to logs.
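The three stages above can be sketched as a small orchestration function. This is a minimal illustration, not the project's actual code: the names `Detection`, `crop`, and `run_pipeline`, the `min_confidence` default, and the per-crop extraction flow are all assumptions. The detector, enhancer, and extractor are passed in as plain callables, so YOLOv8, the U-Net, and the Gemini client (or test stubs) can be plugged in without changing the flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; none of these names come from the project.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Image = List[List[int]]          # grayscale image as rows of pixel values


@dataclass
class Detection:
    field_class: str   # e.g. "nik", "name", "address"
    box: Box
    confidence: float


def crop(image: Image, box: Box) -> Image:
    """Cut the region inside `box` out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def run_pipeline(
    image: Image,
    detect: Callable[[Image], List[Detection]],
    extract: Callable[[str, Image], str],
    enhance: Optional[Callable[[Image], Image]] = None,
    min_confidence: float = 0.5,  # assumed threshold, not a project setting
) -> Dict[str, List[str]]:
    """Detect fields, optionally enhance each crop, then read its text.

    Values are collected per field class as a list, because a family card
    holds one row per family member.
    """
    result: Dict[str, List[str]] = {}
    for det in detect(image):
        if det.confidence < min_confidence:
            continue  # skip weak detections
        region = crop(image, det.box)
        if enhance is not None:
            region = enhance(region)  # stage 2 is optional
        result.setdefault(det.field_class, []).append(
            extract(det.field_class, region)
        )
    return result
```

Passing `enhance=None` skips the enhancement stage, which is one way to express a detector-plus-VLM configuration with the same function. Only field class names and extracted values flow through the return value, so a caller can keep values out of log statements.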
Challenges
The primary challenge was balancing latency and accuracy. While the "VLM Only" mode is faster (~1.0s), it suffers from hallucinations and lower accuracy (85-92%) on poor-quality documents. The "Full Pipeline" addresses this, but at the cost of higher GPU memory usage and the added complexity of managing three separate model dependencies.
Key Learnings
• The "Sweet Spot": In testing, the YOLO + VLM mode emerged as the recommended balance for most production cases, providing accurate row association and layout understanding while being ~400ms faster than the Full Pipeline.
• Critical Preprocessing: U-Net enhancement is essential for legacy or low-quality documents, where table borders and image noise would otherwise cause OCR errors.
• Efficiency Tools: Adopting the uv package manager sped up environment setup considerably, with package installations 10-100x faster than with traditional tools such as pip.
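The latency and accuracy trade-offs described above could be captured in a small mode selector. The sketch below is hypothetical: the `Mode` enum, the `choose_mode` function, the 0-to-1 `quality` score, and both thresholds are assumptions made for illustration, and the project may select modes differently (for example through static configuration).

```python
from enum import Enum


class Mode(Enum):
    VLM_ONLY = "vlm_only"    # fastest, weaker on poor-quality scans
    YOLO_VLM = "yolo_vlm"    # recommended default for most cases
    FULL = "full_pipeline"   # adds U-Net enhancement for degraded scans


def choose_mode(
    quality: float,
    latency_critical: bool = False,
    low_quality_threshold: float = 0.4,   # assumed value
    high_quality_threshold: float = 0.8,  # assumed value
) -> Mode:
    """Pick a processing mode from an image quality score in [0, 1]."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    if quality < low_quality_threshold:
        # Legacy or noisy documents need the enhancement stage.
        return Mode.FULL
    if latency_critical and quality >= high_quality_threshold:
        # Clean scans tolerate the fastest path.
        return Mode.VLM_ONLY
    return Mode.YOLO_VLM
```

With these assumed thresholds, a clean scan such as `choose_mode(0.9)` falls through to the YOLO + VLM default, and a degraded scan is routed to the full pipeline even when `latency_critical` is set.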
Analogy: Think of this system as a high-speed automated sorting facility: YOLO is the scanner that identifies where the packages are; U-Net is the cleaning station that removes mud and dust from the labels; and Gemini VLM is the intelligent clerk who reads the labels and records them perfectly into the digital database.