Document ingestion API for RAG

Turn private PDFs into source-backed data for RAG & AI.

DocuShell Parse API extracts deterministic JSON, Markdown, HTML, and text — with tables, coordinates, metadata, and source maps — and pairs with Ethos to verify every citation against the original document.

Run a sample parse · Start free

No signup needed to inspect sample PDFs, structured outputs, and source references.

Read the docs

invoice.pdf → invoice.json · POST /v1/parse

{
  "invoice_number": "INV-2048",
  "vendor_name": "Northline Office Co.",
  "invoice_total": 8420,
  "due_date": "2026-05-31",
  "artifacts": ["json", "markdown", "html", "text"],
  "source": {
    "page": 1,
    "element_id": "odl-142",
    "bbox": [72, 96, 516, 214]
  },
  "evidence": {
    "evidence_id": "inv2048-total",
    "evidence_kind": "text",
    "expected_text_sha256": "sha256:9f2c1e…",
    "source_fingerprint": "sha256:1d4ab7…"
  }
}

Every field carries a source map and an Ethos evidence anchor — deterministic on every run.

Try it live

Parse a document. No signup.

Pick a sample, inspect the payload, and see how Ethos releases an answer only when it can point back to the source. The degraded scan shows a withheld answer.
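The release-or-withhold behavior described above can be illustrated with a small hash check. This is a minimal sketch and not Ethos's actual algorithm: it assumes the `expected_text_sha256` anchor is the string `sha256:` followed by the SHA-256 hex digest of the cited text in UTF-8, and the helper names `text_fingerprint` and `release_answer` are hypothetical.

```python
import hashlib
from typing import Optional


def text_fingerprint(text: str) -> str:
    """Assumed anchor format: 'sha256:' plus the hex digest of the UTF-8 text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def release_answer(answer: str, cited_text: str, expected: str) -> Optional[str]:
    """Return the answer only when the cited text matches its anchor.

    Returning None models a withheld answer (hypothetical convention).
    """
    if text_fingerprint(cited_text) != expected:
        return None
    return answer


# Anchor computed from the clean source text.
clean_text = "Total due $8,420"
anchor = text_fingerprint(clean_text)

# Same text as the anchor: the answer is released.
released = release_answer("The invoice total is $8,420.", clean_text, anchor)

# A degraded scan that misreads one character: the answer is withheld.
withheld = release_answer("The invoice total is $8,420.", "Total due $8,4Z0", anchor)

print(released)
print(withheld)
```

The point of the sketch is the fail-closed shape: a mismatch yields no answer at all, not a best guess.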

PDF · invoice.pdf · page 1 / 1

INVOICE No. INV-2048 · p. 1

Vendor: Northline Office Co.

Billed to: Meridian Design Studio

Issued: 2026-05-01

Due: 2026-05-31

Line items

Item	Qty	Unit	Total
Furniture
Standing desks	4	1,700.00	6,800.00
Task chairs	2	405.00	810.00
Services
Delivery & setup ¹	1	810.00	810.00

Total due: $8,420

Payment within 30 days of issue. Late payments accrue 1.5% monthly.

¹ Includes on-site assembly and packaging removal.

Born-digital invoice. Key-value fields, the line-item table, and the footnote all carry source maps.
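Once the line-item table arrives as structured rows, an application can reconcile it against the stated total before trusting either value. The sketch below uses the figures from the sample invoice; the tuple layout and the `reconcile` helper are illustrative assumptions, not the Parse API's output schema.

```python
from decimal import Decimal

# (item, quantity, unit price, line total) as shown in the sample invoice.
line_items = [
    ("Standing desks", 4, Decimal("1700.00"), Decimal("6800.00")),
    ("Task chairs", 2, Decimal("405.00"), Decimal("810.00")),
    ("Delivery & setup", 1, Decimal("810.00"), Decimal("810.00")),
]
invoice_total = Decimal("8420")


def reconcile(items, stated_total):
    """Return a list of discrepancies; an empty list means the invoice adds up."""
    problems = []
    for name, qty, unit, total in items:
        # Each row: quantity times unit price must equal the line total.
        if qty * unit != total:
            problems.append(f"{name}: {qty} x {unit} != {total}")
    # The line totals must sum to the stated invoice total.
    line_sum = sum((item[3] for item in items), Decimal("0"))
    if line_sum != stated_total:
        problems.append(f"line items sum to {line_sum}, invoice states {stated_total}")
    return problems


print(reconcile(line_items, invoice_total))
```

Decimal arithmetic avoids the rounding drift that floats can introduce on currency values.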

Sample documents only — choose a playground to upload your own

Layout-aware PDF extraction

PDF hallucinations start with bad extraction.

If a model receives flattened text, it loses the document evidence a reviewer needs. Give it deterministic fields, tables, layout, and source context instead.

Private PDFs lose proof when flattened

Policies, contracts, support manuals, filings, and reports often collapse into loose text. Clauses, tables, definitions, and exceptions become easy for AI systems to misread.

RAG needs source-backed citations

DocuShell Parse API returns page context, coordinates, and structured artifacts so answers can point back to the source material instead of relying on retrieval text alone.

Workflows need stable evidence

Use deterministic JSON and Markdown for clauses, policy sections, invoice totals, tables, and other fields your application must route, review, or audit.

Use Parse API

Source maps for citations

Parse once, return source-backed structure, and let your AI workflow cite the page and region it used.
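As a rough illustration of citing the page and region, the sketch below turns a source map into a readable citation string. The field names come from the invoice example on this page; the `format_citation` helper is hypothetical, and the bbox order `[x0, y0, x1, y1]` is an assumption, since the page does not state the coordinate order or units.

```python
import json

# Subset of the sample payload shown on this page.
payload = json.loads("""
{
  "invoice_total": 8420,
  "source": {"page": 1, "element_id": "odl-142", "bbox": [72, 96, 516, 214]}
}
""")


def format_citation(field: str, record: dict) -> str:
    """Render a human-readable citation from a record's source map."""
    src = record["source"]
    # Assumption: bbox is ordered [x0, y0, x1, y1].
    x0, y0, x1, y1 = src["bbox"]
    return (
        f"{field}={record[field]} "
        f"(page {src['page']}, element {src['element_id']}, "
        f"bbox {x0},{y0} to {x1},{y1})"
    )


citation = format_citation("invoice_total", payload)
print(citation)
```

A reviewer can use the page number and region to jump straight to the cited area of the original PDF.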

See how Parse API works

LLM-ready parsing pipeline

From private PDF to source-backed context.

Validate the document, preserve structure, return artifacts, and keep source context available for review before it enters a model or workflow.

Output formats

Deterministic JSON, Markdown, text, and HTML. One parse.

JSON for extraction and review. Markdown for RAG and knowledge bases. Text for search and indexing. HTML for web delivery. The JSON carries source maps, while a companion Ethos evidence artifact anchors every output to the original document.

invoice.json

Best for: Extraction, validation, and review

{
  "invoice_number": "INV-2048",
  "vendor_name": "Northline Office Co.",
  "invoice_total": 8420,
  "due_date": "2026-05-31",
  "table": { "rows": 3, "columns": 4 },
  "source": {
    "page": 1,
    "element_id": "odl-142",
    "bbox": [72, 96, 516, 214]
  }
}

One parse · JSON source map · Ethos companion anchor

One parse job can return all four artifacts together, with no second upload or pass.

Start free

Pricing shape

Start free. Pay when you ship.

Browser tools stay free, and free accounts include monthly credits plus the playgrounds. API keys and webhooks start on Starter — no surprises later.

Free

No credit card required

500 credits every month
Free browser PDF tools
Playgrounds with sample documents
Files up to 50 MB / 50 pages

Starter

$9/month

API access starts here

API keys and webhooks
5,000 credits every month
2 concurrent queued parse jobs
Files up to 50 MB / 100 pages

Start free · Compare all plans

Built for teams where citations are not optional.

DocuShell is strongest when private PDFs feed a product, workflow, or knowledge base where a bad citation creates review risk.

Compliance and policy teams

Turn policies, procedures, standards, and guidance PDFs into source-backed sections that review workflows and RAG systems can cite.

See RAG use cases

Legal and contract review

Preserve clause order, page references, and extraction warnings so reviewers can inspect the source before relying on AI output.

Review API contract

Support knowledge bases

Ingest manuals, onboarding packets, and troubleshooting PDFs with structure that keeps answer citations tied to the source document.

Open playgrounds

Startups building document AI

Prototype with free tools, then move repeated ingestion, extraction, and webhook delivery into an authenticated API workflow.

Compare plans

Security

Built for enterprise security.

Sensitive PDFs need more than a promise. DocuShell validates inputs, isolates processing, streams results, and keeps temporary storage short-lived.

Review security details

Validate

Schema and PDF checks run before a job is accepted.

Process

Workers handle documents in isolated processing paths.

Stream

Results are returned through one-time download flows.

Delete

Temporary files are swept within the 1-hour retention window.

Zero-trust file handling

PDFs are checked before work is queued, including schema validation and file-type verification.
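A pre-queue check of this kind might look like the sketch below. It is not DocuShell's validator: it only tests the `%PDF-` header and `%%EOF` marker that the PDF format defines, plus the 50 MB plan limit listed on this page (treated here as 50 × 1024 × 1024 bytes, which is an assumption). The `precheck_pdf` name is hypothetical.

```python
from typing import List

# Assumption: the 50 MB plan limit is interpreted as binary megabytes.
MAX_BYTES = 50 * 1024 * 1024


def precheck_pdf(data: bytes) -> List[str]:
    """Return reasons to reject the upload; an empty list means it may be queued."""
    errors = []
    if len(data) > MAX_BYTES:
        errors.append("file exceeds 50 MB")
    # A PDF file begins with a '%PDF-' version header.
    if not data.startswith(b"%PDF-"):
        errors.append("missing %PDF- header")
    # A complete PDF ends with an '%%EOF' marker near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        errors.append("missing %%EOF marker")
    return errors


print(precheck_pdf(b"%PDF-1.7\n1 0 obj\nendobj\n%%EOF\n"))
print(precheck_pdf(b"GIF89a"))
```

Checking content bytes instead of the file extension is what makes this a file-type verification and not a filename check.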

Ephemeral storage

Uploaded and generated files are temporary, with cleanup designed around a 1-hour retention window.
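For teams building a similar cleanup on their own side, a retention sweep can be as small as the sketch below. It is an illustration of the idea only, not DocuShell's cleanup job; the `sweep_expired` helper and the use of file modification time as the age signal are assumptions.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 60 * 60  # the 1-hour window described above


def sweep_expired(directory: Path, now: Optional[float] = None) -> List[str]:
    """Delete regular files whose modification time is older than the window."""
    now = time.time() if now is None else now
    removed = []
    for path in directory.iterdir():
        if path.is_file() and now - path.stat().st_mtime > RETENTION_SECONDS:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Demo in a throwaway directory: one fresh file, one backdated by two hours.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    fresh = workdir / "fresh.json"
    stale = workdir / "stale.pdf"
    fresh.write_text("{}")
    stale.write_bytes(b"%PDF-1.7")
    two_hours_ago = time.time() - 2 * 60 * 60
    os.utime(stale, (two_hours_ago, two_hours_ago))
    removed = sweep_expired(workdir)
    remaining = sorted(p.name for p in workdir.iterdir())

print(removed, remaining)
```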

Private network protection

Webpage capture blocks localhost, intranet, and metadata IP ranges before rendering.
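The address classes named above can be checked with Python's standard `ipaddress` module. This is a minimal sketch of the general technique, not DocuShell's rendering guard; the `is_blocked_address` helper is hypothetical, and a real guard would also need to resolve hostnames and check every resolved address before rendering.

```python
import ipaddress


def is_blocked_address(host_ip: str) -> bool:
    """Return True for addresses a public webpage capture should refuse."""
    ip = ipaddress.ip_address(host_ip)
    return (
        ip.is_loopback        # localhost, e.g. 127.0.0.1 or ::1
        or ip.is_private      # intranet ranges such as 10.0.0.0/8
        or ip.is_link_local   # 169.254.0.0/16, used by cloud metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified  # 0.0.0.0 or ::
    )


for candidate in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8"]:
    print(candidate, is_blocked_address(candidate))
```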

LLM privacy boundary

Parse workflows rely on deterministic extraction, and customer documents are not used to train models.

Need a quick, one-off file fix?

Use the free browser tools for compressing, merging, splitting, converting, protecting, rotating, or capturing PDFs.

One-off PDF utilities

Finish a one-off PDF task without joining an API workflow.

Use these when you need a quick file fix. Use the Parse API when extraction belongs inside a product or pipeline.

01. PDF utilities: Compress, merge, split, organize
02. Web capture: Save public webpages as PDFs
03. Parse API: Parse to JSON and Markdown with source context

Common task

Compress PDF

Reduce file size for email, WhatsApp, upload portals, and client delivery.

Compress now

Common task

Webpage to PDF

Capture public URLs as clean PDFs through the secure rendering worker.

Capture page

Merge PDF

Combine separate files into one organized PDF.

Merge files

Split PDF

Extract pages or ranges from a single document.

Split pages

PDF to Word

Convert PDFs into editable Word documents.

Convert file

PDF to JSON

Extract searchable text, tables, and JSON from PDFs.

Extract data

Rotate PDF

Fix sideways pages without changing the rest of the document.

Rotate pages

Protect PDF

Add password protection in a browser-based flow.

Protect file

View every PDF tool

Practical guides for PDF data problems.

Read clear steps for compression, conversion, extraction, privacy tradeoffs, and when API parsing is better than a manual tool.

Browse guides

Extract data from PDFs, not just pages.

Pull tables, text, and fields into structured JSON for AI apps, search, dashboards, and rule-based workflows.

Compress PDF under 1MB

A practical guide for forms, portals, and size limits.

Adobe Acrobat alternatives

Compare lightweight browser tools for everyday PDF work.

Stirling PDF alternative

Use PDF tools without running your own Docker stack.

AI Evaluation

Can an LLM Judge Another LLM for Hallucinations?

Learn when LLM-as-a-judge works for hallucination detection, where it fails, how to calibrate it, and when deterministic evidence checks are safer.

Read guide

Developer Guides

Can Ethos Verify Other Document Parsers?

Learn how foreign document parsers integrate with Ethos through GroundingSource adapters for deterministic citation and evidence verification.

Read guide

AI Evaluation

Can Ethos Verify RAG Without Another LLM?

Ethos verifies source-bound RAG citations deterministically without a judge-model call, while semantic relevance and synthesis remain separate evaluation tasks.

Read guide

Answers before you upload or integrate.

Quick notes on AI-ready parsing, browser tools, secure workers, and developer APIs.

01. How does DocuShell help reduce AI hallucinations from PDFs?

DocuShell Parse API returns deterministic JSON, Markdown, tables, layout metadata, and source context from PDFs. That gives RAG systems and review workflows source-backed data instead of loose text that can cause AI systems to guess.

02. Can I use DocuShell for rule-based PDF extraction?

Yes. DocuShell can parse invoices, statements, forms, reports, and tables into stable fields and artifacts that work with deterministic business rules, audits, dashboards, and review workflows.

03. Do all DocuShell PDF tools upload my files?

No. DocuShell uses browser-first processing whenever the task can run locally. Workflows that need server resources, such as webpage capture, stronger compression, parsing, or DOCX conversion, use secure cloud workers with temporary processing rules.

04. Does DocuShell offer APIs for developers?

Yes. Developers can use the DocuShell API Hub for authenticated PDF parsing, source-aware JSON and Markdown output, table extraction, conversion, compression, webpage rendering, queued jobs, status polling, and downloadable artifacts.
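Queued jobs with status polling usually reduce to a loop like the one sketched below. Everything API-specific here is an assumption: the status names, the `poll_until_terminal` helper, and the idea that a caller supplies a `get_status` function wrapping whatever status endpoint the API reference documents. The status source is injected so the loop can be exercised without a network call.

```python
import time
from typing import Callable

# Hypothetical status names; consult the API reference for the real ones.
TERMINAL_STATUSES = {"completed", "partial", "blocked", "failed"}


def poll_until_terminal(
    get_status: Callable[[], str],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it reports a terminal status or attempts run out."""
    for _ in range(max_attempts):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Demo with a fake status source and a no-op sleep.
fake_statuses = iter(["queued", "processing", "completed"])
final = poll_until_terminal(lambda: next(fake_statuses), sleep=lambda _: None)
print(final)
```

Treating partial and blocked outcomes as terminal keeps the caller from retrying a job that has already produced a labeled result.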

05. Are DocuShell tools free to use?

DocuShell provides free browser PDF tools for common tasks such as compressing, merging, splitting, rotating, organizing, protecting, and extracting PDF content. Some advanced automation and API usage is tied to account plans and credits.

06. What inputs and outputs does the Parse API support?

The Parse API is PDF-first: it handles born-digital PDFs and scanned documents through OCR workflows. Output artifacts include deterministic JSON, Markdown, text, and HTML with tables, layout metadata, and source maps that point back to page and region.

07. How is this different from sending a PDF straight to an LLM?

An LLM returns fluent text, but no stable structure or provenance — the same document can produce different answers on different runs. The Parse API returns layout-grounded fields with page references and coordinates, so your application can validate values, cite sources, and get the same output for the same input every time.

08. How is the Parse API different from OCR?

OCR makes a scanned page machine-readable, but on its own it gives you a wall of text. The Parse API adds the structure a system can act on: reading order, tables, fields, page coordinates, and source maps, so downstream workflows can validate, route, and cite instead of re-parsing raw text.

09. What happens when extraction is uncertain or incomplete?

DocuShell fails closed. Uncertain or incomplete extractions surface explicit warnings and partial states instead of silently guessing, so review workflows can inspect the source before an answer or automation relies on it. Blocked and partial outcomes are valid, clearly labeled results.

10. How long are uploaded files kept?

Uploaded and generated files are temporary. Processing artifacts are cleaned up around a 1-hour retention window, and results are returned through one-time download flows. Customer documents are not used to train models.

11. Can I get JSON and Markdown from the same parse?

Yes. A single parse job can return JSON, Markdown, text, and HTML artifacts for the same document, with no second upload or second pass required.

12. How does the free tier work?

Free accounts include 500 credits per month, the free browser PDF tools, and the Parse Playground with sample documents — no credit card required. API keys and webhooks start on the Starter plan at $9/month with 5,000 monthly credits.

Developer resources

Everything you need for a first parse.

Reference docs, auth, limits, and a playground for inspecting output before you write a line of integration code.

Build document AI that can point back to the source.

Start in a playground, inspect the output, then move repeated ingestion into authenticated API workflows. Free PDF tools remain available for one-off jobs.

Start free · Run a sample parse