PDF to JSON / CSV / Excel

Turn any PDF into structured data: tables, text, and metadata, extracted by a server-side parser and downloadable from your browser.

Complete Guide to Extracting Data from PDFs

PDFs lock data in a fixed layout, making it painful to reuse content in spreadsheets or code. DocuShell Parse PDF solves that with a dedicated server-side extraction API.

The current implementation standardizes on a structured JSON tree plus Markdown, HTML, plain text, and annotated PDF debug artifacts for downstream automation.

Uploads are handled in ephemeral storage and cleaned up automatically rather than being retained long-term.

How DocuShell Parse PDF Works

When you submit a job, DocuShell first validates that the upload is a real PDF, uses `qpdf --show-npages` for fast page-count preflight, and enforces plan limits before queueing work.
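The preflight sequence above can be sketched roughly as follows. This is a minimal illustration and not DocuShell's actual code: the `PlanLimits` type, the function names, and the error strings are hypothetical, and the signature check is deliberately simplified. The only external call is the `qpdf --show-npages` command named above.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class PlanLimits:
    # Hypothetical limits; the real values depend on the plan.
    max_bytes: int
    max_pages: int


def looks_like_pdf(path: Path) -> bool:
    """Simplified signature check: only the first five bytes are inspected."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def count_pages_with_qpdf(path: Path) -> int:
    """Ask qpdf for the page count without running a full parse."""
    result = subprocess.run(
        ["qpdf", "--show-npages", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def preflight(
    path: Path,
    limits: PlanLimits,
    count_pages: Callable[[Path], int] = count_pages_with_qpdf,
) -> int:
    """Return the page count, or raise ValueError before any work is queued."""
    if not looks_like_pdf(path):
        raise ValueError("not_a_pdf")  # hypothetical error string
    if path.stat().st_size > limits.max_bytes:
        raise ValueError("file_too_large")  # hypothetical error string
    pages = count_pages(path)
    if pages > limits.max_pages:
        raise ValueError("too_many_pages")  # hypothetical error string
    return pages
```

Passing the page counter in as a parameter keeps the cheap checks (signature, size) ahead of the subprocess call and lets the limit logic be exercised without qpdf installed.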

The worker then runs the parser, captures the generated artifacts, and stores those artifacts temporarily for the API status and download endpoints.

What the JSON Contains

DocuShell now returns structured JSON for successful parse jobs rather than the older in-house `parse_v1` envelope. That means the stable contract lives in the root object and its `kids` tree.

This is a better fit for downstream systems that want the parser's original structure without DocuShell translating it into a second schema first.

Markdown, HTML, and plain text remain available as companion artifacts when you want text-oriented renditions of the same document.

  • Successful API jobs expose `result.document` as structured JSON.
  • Markdown, JSON, HTML, text, and annotated PDF downloads are provided as separate artifacts when requested.
  • During a short rollout overlap, the API may still unwrap older stored legacy jobs, but new jobs no longer write the old schema.
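Because the contract is a root object with a nested `kids` tree, most consumers end up writing a small recursive walk. The sketch below shows one way to do that. The sample document is invented for illustration: `fileName`, `numberOfPages`, and `kids` are the field names mentioned on this page, while the per-node `type` and `content` fields are assumptions and may not match the real output.

```python
from typing import Any, Dict, Iterator, Tuple


def walk(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in document order, depth-first."""
    yield depth, node
    for child in node.get("kids", []):
        yield from walk(child, depth + 1)


# Illustrative document only. `type` and `content` are assumed field names.
document = {
    "fileName": "report.pdf",
    "numberOfPages": 2,
    "kids": [
        {"type": "heading", "content": "Summary"},
        {
            "type": "list",
            "kids": [
                {"type": "list item", "content": "First point"},
                {"type": "list item", "content": "Second point"},
            ],
        },
    ],
}

# Collect the text of every node that carries a `content` field.
texts = [node["content"] for _, node in walk(document) if "content" in node]
print(texts)  # ['Summary', 'First point', 'Second point']
```

A generator keeps the traversal lazy, so the same `walk` can feed an indexer, a chunker for LLM ingestion, or a simple text dump without building intermediate lists.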

Export Formats Explained

JSON is the primary structured output. It is the document tree and is the best choice for APIs, indexing, and LLM ingestion pipelines.

Markdown and plain text are companion text artifacts. They are useful for human inspection, quick previews, search, and downstream systems that prefer lightly structured text.

HTML supports styled downstream rendering, while annotated PDF is a visual debug artifact for comparing detected structure to the source document.

Scanned PDFs and OCR

If your PDF was created by scanning a physical document, it may contain images of text rather than selectable text. The Parse PDF API defaults to hybrid `auto` mode when the server OCR backend is configured.

DocuShell still checks extracted text density after parsing runs. If OCR is unavailable or the output stays near-empty, the API returns the stable `ocr_required` failure code.
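A text-density check of this kind can be approximated in a few lines. The sketch below is an assumption about how such a check might look, not DocuShell's implementation: the threshold value, the function names, and the choice to count non-whitespace characters per page are all hypothetical. Only the `ocr_required` code comes from the text above.

```python
from typing import Sequence

# Hypothetical threshold; the real cutoff is not documented on this page.
MIN_CHARS_PER_PAGE = 20


def text_density(page_texts: Sequence[str]) -> float:
    """Average number of non-whitespace characters per page."""
    if not page_texts:
        return 0.0
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts)


def check_parse_output(page_texts: Sequence[str]) -> str:
    """Return "ok", or the `ocr_required` failure code for near-empty output."""
    if text_density(page_texts) < MIN_CHARS_PER_PAGE:
        return "ocr_required"
    return "ok"
```

Averaging over the selected pages, rather than looking at total length, means a long scanned document with a few stray characters still trips the check.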

For private browser-only OCR, use the OCR PDF workflow to create searchable text first, then parse the OCR result when you need JSON, Markdown, HTML, text, or annotated PDF artifacts.

Performance and Large Files

The heavy lifting now happens in the parse worker, not in the browser. That lets DocuShell keep the public API contract stable and centralize error mapping, queueing, and artifact cleanup.

Fast page-count preflight uses qpdf instead of a second full parser pass, so request validation stays lightweight before the parse job starts.

Completed jobs expose timing metadata through the API, which makes load testing and rollout monitoring much easier than the old browser-only path.

How It Works

Step 1

Upload a PDF (drag & drop or browse) — up to 50 MB, PDF only.

Step 2

Click Parse PDF. DocuShell sends the file to the parse service, validates the page count with qpdf, and queues structured extraction.

Step 3

Download structured JSON, Markdown, HTML, plain text, or annotated PDF debug artifacts once the async job completes.
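Since the job is asynchronous, an API client typically polls for completion before downloading anything. The helper below is a generic sketch of that loop under stated assumptions: the status names (`queued`, `processing`, `completed`, `failed`) and the `wait_for_job` helper are hypothetical, and the status lookup is passed in as a function because the endpoint URLs are not specified here. The `result.document` path is the one described earlier on this page.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        # "completed" and "failed" are assumed names for terminal states.
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("parse job did not finish in time")
        sleep(interval_s)


# Usage with canned responses standing in for a real status endpoint.
responses = iter(
    [
        {"status": "queued"},
        {"status": "processing"},
        {"status": "completed", "result": {"document": {"kids": []}}},
    ]
)
job = wait_for_job(lambda: next(responses), sleep=lambda _: None)
print(job["status"])  # completed
```

In a real client, `fetch_status` would wrap an HTTP request to the job status endpoint, and the artifacts would be downloaded only after a `completed` status is returned.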

Why This Tool

  • Server-side PDF parsing with structured JSON, Markdown, HTML, plain text, and annotated PDF debug artifacts.
  • Returns a hierarchical document tree for automation, indexing, and review.
  • Works well for text-native PDFs, reports, tables, and structured extraction pipelines.
  • Ephemeral storage cleanup removes uploaded and generated files automatically.
  • Supports page-range parsing and optional header/footer retention.
  • Hybrid parsing is enabled for scanned/image-heavy PDFs when the server OCR backend is available.

Use Cases

  • Pulling financial tables out of bank statements or annual reports.
  • Converting government-form PDFs into importable spreadsheet data.
  • Extracting product data from supplier catalogues into JSON for APIs.
  • Archiving research papers as structured, searchable JSON.
  • Feeding PDF content into LLM pipelines through a stable API response.

Frequently Asked Questions

Common questions about the PDF to JSON tool, how it works, privacy, file limits, and more.

Parse PDF uploads your file to a server. It is a server-side API tool: the file is sent to the DocuShell parse service, processed into the requested artifacts, and stored only temporarily in ephemeral storage before cleanup.

Successful responses return structured JSON. At the top level you should expect fields such as `numberOfPages`, `fileName`, and `kids`, where `kids` contains the extracted document structure.

DocuShell parsing works best on text-native PDFs. When the server OCR backend is available, hybrid parsing is used by default for scanned or image-only PDFs; otherwise the job fails cleanly with `ocr_required` guidance.

After parsing runs, DocuShell checks the extracted text density. If the output is near-empty for the selected page count, the job fails with `ocr_required`, which usually means the PDF is scanned or image-only.

Your plan controls upload size and page-count limits. The DocuShell API defaults are lower than the figures in the old browser-only marketing copy, and the server enforces them before or during queueing.

Scanned PDFs are supported when the server hybrid OCR backend is available. The parse APIs default to hybrid `auto` mode, while text-native PDFs continue through the faster structured parser path.

Need a walkthrough before you start?

We publish first-party guides for the workflows people actually use, and we explain how those articles are tested, reviewed, and updated.

Privacy, file deletion, and support

Browser-based tools never upload your file. Server-assisted tools run in isolated workers with short-lived storage and deletion rules documented in our public policies.
