DocuShell Parse API

Turn unstructured PDFs into AI-ready data for RAG.

Convert PDFs into structured Markdown and JSON with per-element bounding boxes for RAG citations, vector DB metadata, and clean LLM context windows.

Parser capabilities

Structure that survives retrieval, review, and audit.

DocuShell keeps the PDF source attached to model-ready output, so teams can build retrieval flows without losing page context or safety controls.

Parse

Upload a source PDF and request the format your pipeline actually consumes.

Sanitize

Remove invisible instructions and keep clean structure before data enters the model context.

Route

Send Markdown or JSON into retrieval, review queues, webhooks, or document stores.

Deterministic Parser

Convert PDFs into Markdown, JSON, or HTML for downstream pipelines. The parser keeps structure predictable for chunking, storage, and review.

XY-Cut++ Reading Order

The XY-Cut++ algorithm handles multi-column layouts, sidebars, and mixed elements with no manual config. Output stays ordered for chunking and retrieval.
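
For intuition only, here is a minimal sketch of the classic recursive XY-cut idea, not DocuShell's XY-Cut++ implementation, which this page does not describe. It assumes boxes are (x0, y0, x1, y1) tuples with the origin at the top-left; the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin assumed


def _split(boxes: List[Box], axis: int) -> Optional[Tuple[List[Box], List[Box]]]:
    """Split boxes at the widest empty gap along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, cut_index = 0.0, None
    reach = ordered[0][hi]  # farthest edge covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, cut_index = gap, i
        reach = max(reach, ordered[i][hi])
    if cut_index is None:
        return None  # no whitespace gap crosses the whole region
    return ordered[:cut_index], ordered[cut_index:]


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in reading order using a simplified recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a horizontal cut first (stacked bands), then a vertical cut (columns).
    for axis in (1, 0):
        parts = _split(boxes, axis)
        if parts is not None:
            first, second = parts
            return xy_cut(first) + xy_cut(second)
    # No clean gap on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

On a page with a full-width title above two columns, this reads the title, then the whole left column, then the right column. The simplified version can misorder columns whose paragraph breaks line up exactly, which is one of the cases more elaborate variants are meant to handle.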

Bounding Boxes & Citations

JSON includes page, type, text, and bbox fields for each extracted element. Store those fields with chunks so RAG answers can cite source regions.

Prompt Injection Protection

DocuShell strips hidden text, transparent or zero-size fonts, off-page content, and invisible layers before output. Clean text reaches the model context.

Global OCR (80+ languages)

OCR supports 80+ languages including Korean, Japanese, Chinese, and Arabic. The parser handles 300+ DPI scans.

AI pipelines

Use cases that need clean context, not raw extraction.

Each pattern keeps one rule intact: source geometry stays beside the text that reaches your retrieval layer.

Problem

RAG systems lose citation quality when PDF text is extracted without page positions or reading order.

Solution

DocuShell returns Markdown with preserved heading hierarchy and JSON elements with page and bbox fields.

Chunk on headings, embed the chunk text, and store bbox metadata so generated answers can point back to the source page region.

chunk metadata

{
  "chunk": "Risk factors include supplier concentration...",
  "metadata": {
    "page": 14,
    "heading": "Risk Factors",
    "bbox": [72, 142, 510, 196]
  }
}
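
The heading-based chunking described above could be sketched roughly as follows. It assumes elements shaped like the JSON output on this page (page, type, text, bbox); the `chunk_by_heading` name and the `regions` list are assumptions for illustration, used because one section can span several elements or pages.

```python
from typing import Any, Dict, List


def chunk_by_heading(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group body elements under the most recent heading, keeping source regions."""
    chunks: List[Dict[str, Any]] = []
    current = None
    for el in elements:
        if el["type"] == "heading":
            # A heading starts a new chunk; its text becomes chunk metadata.
            current = {"chunk": "", "metadata": {"heading": el["text"], "regions": []}}
            chunks.append(current)
            continue
        if current is None:
            # Body text that appears before any heading gets its own chunk.
            current = {"chunk": "", "metadata": {"heading": None, "regions": []}}
            chunks.append(current)
        if current["chunk"]:
            current["chunk"] += "\n\n" + el["text"]
        else:
            current["chunk"] = el["text"]
        # Keep page and bbox for every element so answers can cite the region.
        current["metadata"]["regions"].append({"page": el["page"], "bbox": el["bbox"]})
    # Drop headings that had no body text under them.
    return [c for c in chunks if c["chunk"]]
```

Each returned chunk carries the text to embed plus the metadata to store beside the vector.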

Quickstart

POST a PDF. Get Markdown with source geometry.

Request Markdown output, then store extracted text and bbox metadata in your retrieval layer.

Request

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer $DOCUSHELL_API_KEY" \
  -H "Idempotency-Key: parse-rag-001" \
  -F "file=@./source.pdf;type=application/pdf" \
  -F "formats=markdown"

Structured response

{
  "id": "job_parse_01j6r3m8p5t6",
  "status": "completed",
  "output": "markdown",
  "document": {
    "markdown": "# Risk Factors\n\nSupplier concentration may affect delivery timelines...",
    "elements": [
      {
        "page": 14,
        "type": "heading",
        "text": "Risk Factors",
        "bbox": [72, 82, 388, 118]
      },
      {
        "page": 14,
        "type": "paragraph",
        "text": "Supplier concentration may affect delivery timelines...",
        "bbox": [72, 142, 510, 196]
      }
    ]
  }
}
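
One way to move a response like this into a retrieval layer is sketched below. It relies only on the fields shown in the sample response; the `to_records` name, the record shape, and the `job_id:index` identifier scheme are assumptions for illustration, not part of the API.

```python
from typing import Any, Dict, List


def to_records(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a completed parse response into per-element records for a vector store."""
    if response.get("status") != "completed":
        raise ValueError(f"parse job not completed: {response.get('status')!r}")
    job_id = response["id"]
    records = []
    for index, el in enumerate(response["document"]["elements"]):
        records.append(
            {
                # Hypothetical ID scheme: job ID plus element position.
                "id": f"{job_id}:{index}",
                "text": el["text"],
                "metadata": {"page": el["page"], "type": el["type"], "bbox": el["bbox"]},
            }
        )
    return records
```

Embedding the `text` field and storing `metadata` unchanged keeps the page and bbox available when an answer needs to cite its source region.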

Answers for RAG and PDF parsing teams.

Common questions about PDF parsing for RAG, table extraction, OCR, citations, and prompt injection defense.

The best PDF parser for RAG returns clean text, preserves document structure, and exposes citation metadata. DocuShell Parse API converts unstructured PDFs into Markdown and JSON with per-element bounding boxes for RAG citations. Use Markdown when your retriever needs heading-aware chunks. Use JSON when you need page, type, text, and bbox fields for source attribution. Prompt injection defense strips hidden text, transparent or zero-size fonts, off-page content, and invisible layers before output.
Extract tables from PDF for LLMs by preserving table text with nearby headings, captions, page numbers, and coordinates. DocuShell returns structured Markdown and JSON so table content can stay connected to its document context. For financial filings, board packets, and reports, avoid sending a flattened page string directly to a model. Keep table-adjacent metadata in JSON for validation, then pass the relevant Markdown section into the prompt or retrieval chunk.
Chunk PDFs for RAG by splitting on heading hierarchy first, then attaching page and bbox metadata to every chunk. DocuShell Markdown preserves headings, and its JSON output includes per-element bounding boxes for source citations. A practical pipeline is parse, split by heading, merge short neighboring elements, embed the chunk text, and store page plus bbox fields in vector metadata. This keeps context windows cleaner and makes answer citations inspectable.
DocuShell Parse API speed depends on page count, scan quality, OCR, and output format; no single latency figure applies to every PDF. Text-native PDFs avoid OCR work, while poor-quality scans and multi-column layouts require more processing. For lower latency, parse only the pages your workflow needs and request only the output format your pipeline will consume.
Yes. DocuShell supports OCR for poor-quality scanned PDFs across 80+ languages including Korean, Japanese, Chinese, and Arabic. The OCR path handles 300+ DPI scans and produces structured Markdown, JSON, or HTML for downstream LLM, vector DB, and RAG pipelines.
DocuShell prevents prompt injection from PDFs by removing hidden or invisible content before it reaches the output. It strips hidden text, transparent or zero-size fonts, off-page content, and invisible layers before generating Markdown, JSON, or HTML. This matters because PDF files can contain text that a reader does not see but an extraction pipeline might pass into an LLM. Sanitizing those layers reduces the chance that invisible instructions enter a RAG prompt or agent workflow.
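
To make the hidden-content categories concrete, here is an illustrative filter over text spans. It is not DocuShell's implementation, which runs inside the API and is not described on this page. The span fields (`font_size`, `opacity`, `bbox`) and the `is_visible` function are hypothetical.

```python
from typing import Any, Dict


def is_visible(span: Dict[str, Any], page_width: float, page_height: float) -> bool:
    """Return False for spans a human reader could not see on the page."""
    x0, y0, x1, y1 = span["bbox"]  # assumed (x0, y0, x1, y1), top-left origin
    if span["font_size"] <= 0:
        return False  # zero-size font
    if span["opacity"] <= 0:
        return False  # fully transparent text
    if x1 <= 0 or y1 <= 0 or x0 >= page_width or y0 >= page_height:
        return False  # entirely outside the page area
    return True
```

A pipeline that extracts raw spans itself could apply a check like this before building model context; a real sanitizer would also need to handle cases such as text hidden behind other objects or placed in optional layers.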

Integrations

Fits the AI stack you already use.

Use the API directly, or send completed parse jobs into agent workflows and vector databases.

  • OpenAI
  • Anthropic
  • LangChain
  • LlamaIndex
  • n8n
  • Pinecone
  • Weaviate

Source-aware output

Build safer, faster RAG pipelines today.

Start with structured Markdown and JSON that keep citations, layout, and hidden-content defenses intact.

Get API Key