Extract

POST /api/v1/extract extracts structured fields from a PDF or DOCX using a JSON Schema or a built-in template. Processing is asynchronous.

Required feature flag: document_extract_api

Two input paths

| Path | When to use |
| --- | --- |
| Upload a file directly | Single-step convenience: parse and extract in one call |
| Pass a parse_artifact_id | Reuse an existing parse result: faster, no re-parsing |

Exactly one of file or parse_artifact_id must be provided.

Schema vs. template

| Option | When to use |
| --- | --- |
| template_id | Use a built-in schema (currently: invoice) |
| extraction_schema | Provide your own JSON Schema |

Exactly one of template_id or extraction_schema must be provided.
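The two "exactly one of" rules can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API; the function name check_extract_inputs and its file_path parameter are hypothetical.

```python
from typing import Optional


def check_extract_inputs(
    file_path: Optional[str] = None,
    parse_artifact_id: Optional[str] = None,
    template_id: Optional[str] = None,
    extraction_schema: Optional[str] = None,
) -> None:
    """Raise ValueError unless exactly one option in each pair is set."""
    # Both set or both missing are equally invalid for the input pair.
    if (file_path is None) == (parse_artifact_id is None):
        raise ValueError("provide exactly one of file or parse_artifact_id")
    # The same rule applies to the schema pair.
    if (template_id is None) == (extraction_schema is None):
        raise ValueError("provide exactly one of template_id or extraction_schema")


# Valid combination: one input source, one schema source.
check_extract_inputs(file_path="invoice.pdf", template_id="invoice")
```

Running this check locally avoids a round trip that would presumably end in a 400 response.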

Request

The request is a multipart/form-data upload.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | One of file/parse_artifact_id | PDF or DOCX file to parse and extract |
| parse_artifact_id | string | One of file/parse_artifact_id | ID of an existing canonical_document artifact |
| template_id | string | One of template_id/extraction_schema | Built-in template ID (e.g. invoice) |
| extraction_schema | string | One of template_id/extraction_schema | JSON Schema string |
| language | string | No | Language code (default: en) |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override |

Example — file upload with built-in template

curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/invoice.pdf" \
  -F "template_id=invoice"

Example — reuse an existing parse artifact

curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "parse_artifact_id=art_3f2a1b4c..." \
  -F "template_id=invoice"

Example — custom schema

curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/contract.pdf" \
  -F 'extraction_schema={
    "type": "object",
    "properties": {
      "party_name": { "type": "string", "description": "Name of the contracting party" },
      "effective_date": { "type": "string", "description": "Contract effective date" },
      "total_value": { "type": "number", "description": "Total contract value" }
    },
    "required": ["party_name", "effective_date"]
  }'
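Because extraction_schema is sent as a string form field, a schema built in code has to be serialized first. The sketch below shows one way to do that in Python; build_extraction_schema is a hypothetical helper, and the check that every required name appears in properties is a local convenience, not something the API is documented to enforce.

```python
import json


def build_extraction_schema(properties: dict, required: list) -> str:
    """Serialize an object schema into the string the form field expects."""
    # Local sanity check (an assumption, not documented API behavior).
    unknown = [name for name in required if name not in properties]
    if unknown:
        raise ValueError(f"required names missing from properties: {unknown}")
    schema = {"type": "object", "properties": properties, "required": required}
    return json.dumps(schema)


# Same schema as the custom schema curl example.
extraction_schema = build_extraction_schema(
    properties={
        "party_name": {"type": "string", "description": "Name of the contracting party"},
        "effective_date": {"type": "string", "description": "Contract effective date"},
        "total_value": {"type": "number", "description": "Total contract value"},
    },
    required=["party_name", "effective_date"],
)
```

The resulting string can be passed as the extraction_schema form field by any multipart-capable HTTP client.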

Response (202 Accepted)

The response body has the same shape as the parse response, with workflow_kind set to "extract".

Polling and artifacts

Poll GET /api/v1/documents/jobs/{job_id} until the job finishes. A completed extract job produces these artifacts:

| artifact_kind | Description |
| --- | --- |
| canonical_document | Intermediate parse result (ephemeral, deleted after extraction) |
| extract_result | Extracted fields with confidence scores and citations |

Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
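A polling loop for the job endpoint might look like the sketch below. The host and the X-API-Key header come from the curl examples; everything about the job body is an assumption, because this page does not show it. In particular, the "status" key and the terminal values "completed" and "failed" are hypothetical and should be replaced with whatever the job response actually contains.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://api.expunct.ai"  # host taken from the curl examples

# ASSUMPTION: hypothetical terminal values; the job shape is not documented here.
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(job_id: str, api_key: str) -> dict:
    """GET the job resource and decode the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/v1/documents/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(
    get_job: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_job repeatedly until it reports a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_job()
        # ASSUMPTION: the job body exposes its state under a "status" key.
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

A caller would pass something like lambda: fetch_job(job_id, api_key) as get_job. Taking the fetcher and the sleep function as parameters keeps the loop testable without network access.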

extract_result shape

{
  "document_id": "3f2a1b4c-...",
  "source_artifact_id": "art_abc...",
  "template_id": "invoice",
  "schema_used": { "...": "..." },
  "fields": [
    {
      "field_name": "invoice_number",
      "value": "INV-2024-001",
      "confidence": 0.85,
      "citations": [
        {
          "page_number": 1,
          "block_id": "b_001",
          "text_snippet": "Invoice #INV-2024-001\nIssued: March 1, 2025"
        }
      ]
    },
    {
      "field_name": "total_amount",
      "value": 4250.00,
      "confidence": 0.85,
      "citations": [
        {
          "page_number": 2,
          "block_id": "b_041",
          "text_snippet": "Total Due: $4,250.00"
        }
      ]
    },
    {
      "field_name": "vendor_name",
      "value": null,
      "confidence": 0.0,
      "citations": []
    }
  ],
  "raw_output": {
    "invoice_number": "INV-2024-001",
    "total_amount": 4250.00
  },
  "validation_errors": [],
  "extraction_duration_ms": 45,
  "model_versions": { "extraction_engine": "rule_v1" }
}
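To work with the fields array, a client will often want a plain name-to-value mapping that skips fields that were not found. The sketch below does that for an already decoded extract_result; extracted_values and its min_confidence parameter are hypothetical names, and the sample is trimmed from the example above.

```python
def extracted_values(extract_result: dict, min_confidence: float = 0.0) -> dict:
    """Map field_name to value, skipping null values and low-confidence fields."""
    values = {}
    for field in extract_result.get("fields", []):
        if field.get("value") is None:
            continue  # field not found
        if field.get("confidence", 0.0) < min_confidence:
            continue  # caller asked for a stricter threshold
        values[field["field_name"]] = field["value"]
    return values


# Trimmed copy of the example payload (citations omitted for brevity).
sample = {
    "fields": [
        {"field_name": "invoice_number", "value": "INV-2024-001", "confidence": 0.85},
        {"field_name": "total_amount", "value": 4250.00, "confidence": 0.85},
        {"field_name": "vendor_name", "value": None, "confidence": 0.0},
    ]
}

print(extracted_values(sample))
```

For this sample the mapping contains invoice_number and total_amount only, since vendor_name has a null value.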

Field confidence levels

| Score range | Meaning |
| --- | --- |
| 0.80–1.0 | Label found in same block as value |
| 0.70–0.79 | Label found in adjacent block |
| 0.30–0.69 | Pattern match only (no label context) |
| 0.0 | Field not found |
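The score ranges above can be turned into a small classifier, for example to route low-confidence fields to manual review. This is a sketch; the band names are hypothetical labels. Scores strictly between 0.0 and 0.30 are not covered by the table, so the function reports them as undocumented rather than guessing.

```python
def confidence_band(score: float) -> str:
    """Classify a field confidence score using the documented ranges."""
    if score == 0.0:
        return "not_found"
    if 0.80 <= score <= 1.0:
        return "label_same_block"
    if 0.70 <= score < 0.80:
        return "label_adjacent_block"
    if 0.30 <= score < 0.70:
        return "pattern_only"
    # Anything else falls outside the ranges listed in the table.
    return "undocumented"


for score in (0.85, 0.75, 0.5, 0.0):
    print(score, confidence_band(score))
```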

validation_errors

Present when a field marked required in the schema was not found:

"validation_errors": ["required field 'invoice_number' not found"]

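A client can pull the missing field names out of validation_errors, as sketched below. The pattern is inferred from the single example message above, which is an assumption: if the service emits other message formats, they will not match and are skipped. The helper name missing_required_fields is hypothetical.

```python
import re

# ASSUMPTION: message format inferred from the one documented example.
_REQUIRED_MESSAGE = re.compile(r"^required field '([^']+)' not found$")


def missing_required_fields(extract_result: dict) -> list:
    """Return names of required fields reported as not found."""
    missing = []
    for message in extract_result.get("validation_errors", []):
        match = _REQUIRED_MESSAGE.match(message)
        if match:
            missing.append(match.group(1))
    return missing


result = {"validation_errors": ["required field 'invoice_number' not found"]}
print(missing_required_fields(result))
```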
Built-in templates

invoice

Extracts common invoice fields from PDF or DOCX invoices.

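| Field | Type | Required |
| --- | --- | --- |
| invoice_number | string | Yes |
| invoice_date | string | Yes |
| total_amount | number | Yes |
| vendor_name | string | No |
| vendor_address | string | No |
| customer_name | string | No |
| customer_address | string | No |
| due_date | string | No |
| purchase_order_number | string | No |
| currency | string | No |
| subtotal | number | No |
| tax_amount | number | No |
| tax_rate | string | No |
| discount_amount | number | No |
| amount_due | number | No |
| payment_terms | string | No |
| line_items | array | No |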

line_items is an array of objects with description, quantity, unit_price, and amount.
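For typed access to line_items, the array can be mapped onto a small record type. The sketch below assumes the extracted value is a list of dictionaries keyed by the four documented names. The element types are not documented, so the annotations are guesses and every attribute is optional. LineItem and parse_line_items are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineItem:
    # ASSUMPTION: types are guesses; only the key names are documented.
    description: Optional[str] = None
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    amount: Optional[float] = None


def parse_line_items(value: Optional[list]) -> list:
    """Convert the line_items value into LineItem records, ignoring extra keys."""
    keys = ("description", "quantity", "unit_price", "amount")
    return [LineItem(**{key: raw.get(key) for key in keys}) for raw in (value or [])]


items = parse_line_items(
    [{"description": "Widget", "quantity": 2, "unit_price": 10.0, "amount": 20.0}]
)
print(items[0].description, items[0].amount)
```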

Error responses

| Status | Meaning |
| --- | --- |
| 400 | Missing required fields, conflicting inputs, or invalid JSON |
| 403 | Feature flag document_extract_api not enabled |
| 404 | parse_artifact_id not found |
| 410 | Parse artifact payload has been purged |
| 413 | File exceeds plan upload limit |
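Client code can map these statuses to readable messages. The sketch below also flags 404 and 410 as cases where resubmitting with a direct file upload may help, since both concern parse_artifact_id only. That fallback is a suggestion, not documented behavior, and both function names are hypothetical.

```python
EXTRACT_ERRORS = {
    400: "Missing required fields, conflicting inputs, or invalid JSON",
    403: "Feature flag document_extract_api not enabled",
    404: "parse_artifact_id not found",
    410: "Parse artifact payload has been purged",
    413: "File exceeds plan upload limit",
}


def explain_extract_error(status: int) -> str:
    """Return the documented meaning for a status, or a generic message."""
    return EXTRACT_ERRORS.get(status, f"Unexpected HTTP status {status}")


def may_succeed_with_file_upload(status: int) -> bool:
    """True when the error concerns the parse artifact rather than the request."""
    # ASSUMPTION: uploading the original file sidesteps a missing or purged artifact.
    return status in (404, 410)


print(explain_extract_error(410), may_succeed_with_file_upload(410))
```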