PDF to JSON / CSV / Excel
Turn any PDF into structured data (tables, text, and metadata) straight from your browser.
Complete Guide to Extracting Data from PDFs
PDFs lock data in a fixed layout, making it painful to reuse content in spreadsheets or code. DocuShell Parse PDF solves that with a dedicated server-side extraction API.
The current implementation standardizes on a structured JSON tree plus Markdown, HTML, plain text, and annotated PDF debug artifacts for downstream automation.
Uploads are handled in ephemeral storage and cleaned up automatically rather than being retained long-term.
How DocuShell Parse PDF Works
When you submit a job, DocuShell first validates that the upload is a real PDF, uses `qpdf --show-npages` for fast page-count preflight, and enforces plan limits before queueing work.
The worker then runs the parser, captures the generated artifacts, and stores those artifacts temporarily for the API status and download endpoints.
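The preflight described above can be sketched in Python. This is only an illustration under stated assumptions, not DocuShell's actual code: the function names, the error strings `not_a_pdf` and `page_limit_exceeded`, and the `max_pages` parameter are hypothetical. The only detail taken from the text is the `qpdf --show-npages` call.

```python
import subprocess
from typing import Callable

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the PDF signature appears in the first 1024 bytes."""
    return PDF_MAGIC in head[:1024]


def count_pages_with_qpdf(path: str) -> int:
    """Ask qpdf for the page count instead of running a full parser pass."""
    result = subprocess.run(
        ["qpdf", "--show-npages", path],
        capture_output=True,
        text=True,
    )
    # qpdf exits with 0 on success and 3 when it only emitted warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed: {result.stderr.strip()}")
    return int(result.stdout.strip())


def preflight(
    path: str,
    max_pages: int,
    count_pages: Callable[[str], int] = count_pages_with_qpdf,
) -> int:
    """Validate the upload and enforce a page limit before queueing work."""
    with open(path, "rb") as fh:
        head = fh.read(1024)
    if not looks_like_pdf(head):
        raise ValueError("not_a_pdf")  # hypothetical error code
    pages = count_pages(path)
    if pages > max_pages:
        raise ValueError("page_limit_exceeded")  # hypothetical error code
    return pages
```

Passing the page counter in as a parameter keeps the limit logic testable on machines where qpdf is not installed.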
What the JSON Contains
DocuShell now returns structured JSON for successful parse jobs rather than the older in-house `parse_v1` envelope. That means the stable contract lives in the root object and its `kids` tree.
This is a better fit for downstream systems that want the parser's original structure without DocuShell translating it into a second schema first.
Markdown, HTML, and plain text remain available as companion artifacts when you want text-oriented renditions of the same document.
- Successful API jobs expose `result.document` as structured JSON.
- Markdown, JSON, HTML, text, and annotated PDF downloads are provided as separate artifacts when requested.
- During a short rollout overlap, the API may still unwrap legacy jobs that were stored in the old format, but new jobs no longer write the old schema.
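As a rough illustration of consuming the `kids` tree, the Python sketch below walks a document depth-first and collects text. Only the `kids` key comes from the description above. The `type` and `content` keys and the sample document are assumptions made for the example and may not match the real node fields.

```python
from typing import Any, Iterator


def iter_nodes(
    node: dict[str, Any], depth: int = 0
) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (depth, node) pairs in document order, depth-first."""
    yield depth, node
    for child in node.get("kids", []):
        if isinstance(child, dict):
            yield from iter_nodes(child, depth + 1)


def collect_text(root: dict[str, Any], text_key: str = "content") -> list[str]:
    """Gather every string stored under text_key (an assumed field name)."""
    return [
        node[text_key]
        for _, node in iter_nodes(root)
        if isinstance(node.get(text_key), str)
    ]


# Hypothetical document shaped like a root object with a `kids` tree.
sample = {
    "type": "document",
    "kids": [
        {"type": "heading", "content": "Q1 Report"},
        {"type": "table", "kids": [{"type": "cell", "content": "Revenue"}]},
    ],
}

print(collect_text(sample))  # ['Q1 Report', 'Revenue']
```

In a real integration, `sample` would be replaced by the value of `result.document` from a completed job.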
Export Formats Explained
JSON is the primary structured output. It carries the full document tree and is the best choice for APIs, indexing, and LLM ingestion pipelines.
Markdown and plain text are companion text artifacts. They are useful for human inspection, quick previews, search, and downstream systems that prefer lightly structured text.
HTML supports styled downstream rendering, while annotated PDF is a visual debug artifact for comparing detected structure to the source document.
Scanned PDFs and OCR
If your PDF was created by scanning a physical document, it may contain images of text rather than selectable text. The Parse PDF API defaults to hybrid `auto` mode when the server OCR backend is configured.
DocuShell still checks extracted text density after parsing runs. If OCR is unavailable or the output stays near-empty, the API returns the stable `ocr_required` failure code.
For private browser-only OCR, use the OCR PDF workflow to create searchable text first, then parse the OCR result when you need JSON, Markdown, HTML, text, or annotated PDF artifacts.
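The text-density check mentioned above could look something like the following Python sketch. The metric (average non-whitespace characters per page) and the threshold of 20 are assumptions chosen for illustration. The document does not state how DocuShell measures density. Only the `ocr_required` code comes from the text.

```python
def text_density(page_texts: list[str]) -> float:
    """Average number of non-whitespace characters per page."""
    if not page_texts:
        return 0.0
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts)


def classify_parse(page_texts: list[str], min_chars_per_page: float = 20.0) -> str:
    """Return "ok" or the `ocr_required` failure code.

    min_chars_per_page is a hypothetical threshold, not a documented value.
    """
    if text_density(page_texts) >= min_chars_per_page:
        return "ok"
    return "ocr_required"


# A scanned document usually yields little or no extractable text.
print(classify_parse(["   ", ""]))
# A text-native page clears the threshold easily.
print(classify_parse(["Quarterly revenue rose in every region this year."]))
```

Running the check after parsing, rather than before, means the same rule covers both cases the text describes: OCR was unavailable, or OCR ran and the output was still near-empty.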
Performance and Large Files
The heavy lifting now happens in the parse worker, not in the browser. That lets DocuShell keep the public API contract stable and centralize error mapping, queueing, and artifact cleanup.
Fast page-count preflight uses qpdf instead of a second full parser pass, so request validation stays lightweight before the parse job starts.
Completed jobs expose timing metadata through the API, which makes load testing and rollout monitoring much easier than the old browser-only path.
How It Works
Step 1
Upload a PDF (drag & drop or browse) — up to 50 MB, PDF only.
Step 2
Click Parse PDF. DocuShell sends the file to the parse service, validates the page count with qpdf, and queues structured extraction.
Step 3
Download structured JSON, Markdown, HTML, plain text, or annotated PDF debug artifacts once the async job completes.
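Because the job is asynchronous, a client typically polls until it finishes. The Python sketch below shows one way to do that. The `status` field and its values (`queued`, `completed`, `failed`) are hypothetical, and `fetch_status` stands in for whatever HTTP call your client makes to the status endpoint. Check the API reference for the real response shape.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = ("completed", "failed")  # assumed status values


def wait_for_job(
    fetch_status: Callable[[], dict[str, Any]],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll fetch_status until the job reaches a terminal state or times out."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError("parse job did not finish in time")
        sleep(poll_seconds)


# Usage with a stand-in for the real status call.
responses = iter(
    [
        {"status": "queued"},
        {"status": "completed", "result": {"document": {"kids": []}}},
    ]
)
job = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(job["status"])  # completed
```

Injecting `sleep` and `clock` keeps the loop easy to test without waiting in real time.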
Why This Tool
- Server-side PDF parsing with structured JSON, Markdown, HTML, plain text, and annotated PDF debug artifacts.
- Returns a hierarchical document tree for automation, indexing, and review.
- Works well for text-native PDFs, reports, tables, and structured extraction pipelines.
- Ephemeral storage cleanup removes uploaded and generated files automatically.
- Supports page-range parsing and optional header/footer retention.
- Hybrid parsing is enabled for scanned/image-heavy PDFs when the server OCR backend is available.
Use Cases
- Pulling financial tables out of bank statements or annual reports.
- Converting government-form PDFs into importable spreadsheet data.
- Extracting product data from supplier catalogues into JSON for APIs.
- Archiving research papers as structured, searchable JSON.
- Feeding PDF content into LLM pipelines through a stable API response.
Privacy, file deletion, and support
Browser-based tools never upload your file. Server-assisted tools run in isolated workers with short-lived storage and deletion rules documented in our public policies.