HTML to Markdown and Table Chunking Achieve 20% RAG Accuracy Gain
The Challenge
The initial HTML ingestion pipeline extracted only raw text, discarding structural elements such as links, formatting, and hierarchy. Answers derived from long tables were also often incomplete or inaccurate, because the table structure was lost during ingestion. Both problems pointed to the need for an ingestion pipeline that preserves document structure and processes tabular data reliably for Q&A.
Hypothesis
Converting HTML into a concise, structured format that preserves key details (such as links and formatting) and is easy for LLMs to interpret would improve retrieval quality. In addition, a dedicated chunking strategy for long tables would help LLMs generate more accurate responses.
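As an illustration of the kind of transformation the hypothesis describes, here is a minimal Python sketch that turns a small HTML fragment into Markdown using only the standard library. It is not the converter used in the project, which is not named here. The class name `MarkdownSketch`, the function `html_to_markdown`, and the set of handled tags are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical converter covering headings, paragraphs, lists, links, emphasis."""

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            # <h2> becomes "## ", and so on
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            # Keep the link target, which raw-text extraction would drop
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs but keep word boundaries around inline tags
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    # Allow at most one blank line between blocks
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


sample = (
    '<h2>Pricing</h2><p>See <a href="https://example.com/docs">the docs</a> '
    "for <strong>details</strong>.</p><ul><li>One</li><li>Two</li></ul>"
)
print(html_to_markdown(sample))
```

Unlike plain text extraction, the output keeps the heading level, the link target, and the list structure, which is the information the hypothesis expects to help retrieval. Nested lists, tables, and code blocks are deliberately out of scope for this sketch.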
Execution
- HTML to Markdown Conversion: HTML content was converted to Markdown, a lightweight, readable format that preserves structural elements such as headings, bullet points, and hyperlinks.
- Table Chunking: Large tables were split into smaller, semantically meaningful chunks to improve the LLM’s ability to extract accurate answers.
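The table chunking step can be sketched in Python as follows. The exact splitting rule used in the project is not described, so this example assumes one simple strategy: split a Markdown table into fixed-size groups of rows and repeat the header in every chunk so each chunk can be read on its own. The function name `chunk_markdown_table` and the `max_rows` parameter are hypothetical.

```python
def chunk_markdown_table(table: str, max_rows: int = 20) -> list[str]:
    """Split a Markdown table into chunks that each carry the header rows.

    Assumes the first line is the header and the second is the
    separator line (for example "| --- | --- |").
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")

    lines = [line for line in table.strip().splitlines() if line.strip()]
    if len(lines) <= 2:
        # No data rows to split; return the table unchanged
        return ["\n".join(lines)]

    header, separator, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for start in range(0, len(rows), max_rows):
        group = rows[start:start + max_rows]
        # Repeating the header keeps column names attached to every row
        chunks.append("\n".join([header, separator, *group]))
    return chunks


table = """
| Plan | Price | Seats |
| --- | --- | --- |
| Free | 0 | 1 |
| Team | 20 | 10 |
| Business | 50 | 50 |
| Enterprise | 200 | 500 |
| Custom | 999 | 1000 |
"""

for chunk in chunk_markdown_table(table, max_rows=2):
    print(chunk)
    print()
```

Repeating the header is the design choice that matters here: a chunk containing only data rows gives the LLM values without column names, which is one plausible cause of incomplete answers from long tables. A production version might split on semantic boundaries (for example, a grouping column) rather than a fixed row count.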
Outcomes
This optimized ingestion approach improved response quality in the RAG pipeline, yielding a 20% gain on the evaluation set and capturing 100 previously missed cases. Answers retained context, included relevant links, and handled tabular data more accurately.
Project Highlights
20%
gain on the evaluation set, with 100 previously missed cases captured