Deterministic Parser
Convert PDFs into Markdown, JSON, or HTML for downstream pipelines. The parser keeps structure predictable for chunking, storage, and review.
DocuShell Parse API
Convert PDFs into structured Markdown and JSON with per-element bounding boxes for RAG citations, vector DB metadata, and clean LLM context windows.
Parser capabilities
DocuShell keeps the PDF source attached to model-ready output, so teams can build retrieval flows without losing page context or safety controls.
Parse
Upload a source PDF and request the format your pipeline actually consumes.
Sanitize
Remove invisible instructions and keep clean structure before data enters the model context.
Route
Send Markdown or JSON into retrieval, review queues, webhooks, or document stores.
The XY-Cut++ algorithm handles multi-column layouts, sidebars, and mixed elements with no manual config. Output stays ordered for chunking and retrieval.
JSON includes page, type, text, and bbox fields for each extracted element. Store those fields with chunks so RAG answers can cite source regions.
DocuShell strips hidden text, transparent or zero-size fonts, off-page content, and invisible layers before output. Clean text reaches the model context.
OCR supports 80+ languages, including Korean, Japanese, Chinese, and Arabic, and handles scans at 300 DPI or higher.
AI pipelines
Each pattern keeps one rule intact: source geometry stays beside the text that reaches your retrieval layer.
Problem
RAG systems lose citation quality when PDF text is extracted without page positions or reading order.
Solution
DocuShell returns Markdown with preserved heading hierarchy and JSON elements with page and bbox fields.
Chunk on headings, embed the chunk text, and store bbox metadata so generated answers can point back to the source page region.
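The chunking step described above can be sketched in Python. This is a minimal illustration, not DocuShell's own tooling: the function name `chunk_by_heading` and the `regions` metadata key are hypothetical, and the code assumes elements arrive in reading order with the `page`, `type`, `text`, and `bbox` fields.

```python
from typing import Any, Dict, List, Optional


def chunk_by_heading(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group parsed elements into one chunk per heading section.

    Assumption: every element has page, type, text, and bbox fields,
    and headings are marked with type == "heading".
    """
    chunks: List[Dict[str, Any]] = []
    heading: Optional[str] = None           # most recent heading text
    current: Optional[Dict[str, Any]] = None  # chunk being filled

    for el in elements:
        if el["type"] == "heading":
            # A heading closes the running chunk; the next body
            # element opens a new one under this heading.
            heading = el["text"]
            current = None
            continue

        if current is None:
            current = {
                "chunk": el["text"],
                "metadata": {
                    "heading": heading,
                    "page": el["page"],   # page where the section body starts
                    "regions": [],        # hypothetical key: one entry per source element
                },
            }
            chunks.append(current)
        else:
            current["chunk"] += "\n\n" + el["text"]

        # Keep every source region so an answer can cite the exact area.
        current["metadata"]["regions"].append(
            {"page": el["page"], "bbox": el["bbox"]}
        )

    return chunks
```

A section can span several elements and pages, so this sketch stores a list of page and bbox pairs per chunk instead of a single bbox. A chunk built from one paragraph reduces to the single-bbox shape used in the chunk metadata example.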
chunk metadata
{
  "chunk": "Risk factors include supplier concentration...",
  "metadata": {
    "page": 14,
    "heading": "Risk Factors",
    "bbox": [72, 142, 510, 196]
  }
}
Quickstart
Request Markdown output, then store extracted text and bbox metadata in your retrieval layer.
Request
curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer $DOCUSHELL_API_KEY" \
  -H "Idempotency-Key: parse-rag-001" \
  -F "file=@./source.pdf;type=application/pdf" \
  -F "formats=markdown"
Structured response
{
  "id": "job_parse_01j6r3m8p5t6",
  "status": "completed",
  "output": "markdown",
  "document": {
    "markdown": "# Risk Factors\n\nSupplier concentration may affect delivery timelines...",
    "elements": [
      {
        "page": 14,
        "type": "heading",
        "text": "Risk Factors",
        "bbox": [72, 82, 388, 118]
      },
      {
        "page": 14,
        "type": "paragraph",
        "text": "Supplier concentration may affect delivery timelines...",
        "bbox": [72, 142, 510, 196]
      }
    ]
  }
}
Common questions about PDF parsing for RAG, table extraction, OCR, citations, and prompt injection defense.
Integrations
Use the API directly, or send completed parse jobs into agent workflows and vector databases.
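As a rough sketch of the direct-API path, the snippet below turns a completed parse response into plain records that most vector stores could accept once embeddings are attached. The function name `records_from_response`, the record layout, and the `job_id:index` ID scheme are assumptions for illustration. The response fields follow the structured response sample on this page, and no specific vector database client is implied.

```python
from typing import Any, Dict, List


def records_from_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one storable record per parsed element.

    Assumption: the response has the id, status, and document.elements
    fields shown in the structured response sample.
    """
    status = response.get("status")
    if status != "completed":
        # Only completed jobs carry a usable document payload.
        raise ValueError(f"parse job is not completed: {status!r}")

    job_id = response["id"]
    records = []
    for index, el in enumerate(response["document"]["elements"]):
        records.append({
            # Hypothetical ID scheme: job ID plus element position.
            "id": f"{job_id}:{index}",
            "text": el["text"],
            "metadata": {
                "job_id": job_id,
                "page": el["page"],
                "type": el["type"],
                "bbox": el["bbox"],
            },
        })
    return records
```

Keeping `page` and `bbox` in the metadata means a retrieved record can still point back to its source region, whichever store or agent workflow consumes it.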
Start with structured Markdown and JSON that keep citations, layout, and hidden-content defenses intact.