Safe Parse
POST /api/v1/workflows/safe-parse submits a PDF or DOCX for combined structural parsing and PII redaction. The document is parsed, detected PII is replaced in place, and three sanitized artifacts are produced, ready for storage or indexing.
Processing is asynchronous. Poll GET /api/v1/documents/jobs/{job_id} for status.
Required feature flag: document_safe_parse_workflow
How it works
- Parse: the file is ingested and converted to a canonical document (pages, blocks, tables)
- Sanitize: Expunct’s PII detection engine (Presidio) scans every text block and replaces detected entities with redaction labels (e.g. [PERSON], [EMAIL_ADDRESS])
- Render: sanitized markdown and semantic chunks are produced from the sanitized canonical document
The raw canonical document is ephemeral — it is deleted after sanitization and never returned to the caller. Only sanitized artifacts are retained.
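The three steps above can be sketched on a toy canonical document. This is a minimal illustration under stated assumptions, not Expunct's implementation: the regex detector stands in for the real detection engine, and every function name here (`detect_emails`, `sanitize_text`, `sanitize_document`, `render_markdown`) is hypothetical.

```python
import copy
import re

# Toy stand-in for the detection engine: it only finds email-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_emails(text):
    """Return (start, end, entity_type) spans for email-like substrings."""
    return [(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL_RE.finditer(text)]


def sanitize_text(text, spans):
    """Replace each span with its [TYPE] label.

    Spans are processed right to left so that earlier offsets stay valid
    while the string changes length.
    """
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity_type}]" + text[end:]
    return text


def sanitize_document(canonical, detect=detect_emails):
    """Return a sanitized deep copy; the raw input is not modified."""
    sanitized = copy.deepcopy(canonical)
    for page in sanitized["pages"]:
        for block in page["blocks"]:
            block["text"] = sanitize_text(block["text"], detect(block["text"]))
    return sanitized


def render_markdown(sanitized):
    """Join block text in reading order, one paragraph per block."""
    parts = []
    for page in sanitized["pages"]:
        for block in sorted(page["blocks"], key=lambda b: b["reading_order"]):
            parts.append(block["text"])
    return "\n\n".join(parts)


raw = {"pages": [{"page_number": 1, "blocks": [
    {"block_id": "b_001", "kind": "paragraph",
     "text": "Contact: jane@example.com", "reading_order": 0}]}]}

print(render_markdown(sanitize_document(raw)))  # Contact: [EMAIL_ADDRESS]
```

The render step reads only the sanitized copy, which mirrors the rule that the raw canonical document is never returned to the caller.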
Request
The request is a multipart/form-data upload.
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | PDF or DOCX file to parse and sanitize |
| language | string | No | Language code (default: en) |
| policy_id | string | No | Redaction policy ID; controls which entity types are redacted |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override (defaults to authenticated tenant) |
Config options
Pass additional options as a JSON string in the config field:
| Key | Type | Default | Description |
|---|---|---|---|
| redaction_mode | string | type_label | How to render redacted spans. type_label → [PERSON]; mask → ████ |
| pii_types | array | ["all"] | Entity types to redact (e.g. ["PERSON", "EMAIL_ADDRESS"]). all means every supported type |
| pii_categories | array | ["PII","PCI","PHI"] | Categories to include |
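Because config travels as a JSON string inside a form field, it is easy to get the quoting wrong by hand. The helper below is a small sketch of one way to assemble the non-file form fields. The function name `build_form_fields` and the client-side validation are assumptions for illustration; only the field and key names come from the tables above.

```python
import json

ALLOWED_MODES = {"type_label", "mask"}  # the two documented redaction modes


def build_form_fields(pii_types=None, pii_categories=None, redaction_mode=None,
                      language=None, policy_id=None):
    """Return the non-file form fields for a safe-parse request.

    Options left as None are omitted so the documented defaults apply.
    """
    if redaction_mode is not None and redaction_mode not in ALLOWED_MODES:
        raise ValueError(f"unknown redaction_mode: {redaction_mode!r}")

    config = {}
    if pii_types is not None:
        config["pii_types"] = list(pii_types)
    if pii_categories is not None:
        config["pii_categories"] = list(pii_categories)
    if redaction_mode is not None:
        config["redaction_mode"] = redaction_mode

    fields = {}
    if language is not None:
        fields["language"] = language
    if policy_id is not None:
        fields["policy_id"] = policy_id
    if config:
        # The config field is a JSON *string*, not a nested form structure.
        fields["config"] = json.dumps(config)
    return fields


fields = build_form_fields(pii_types=["PERSON", "EMAIL_ADDRESS"], redaction_mode="mask")
print(fields["config"])
```

The resulting dictionary can be passed alongside the file part to any multipart-capable HTTP client.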
Example
cURL
curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"
Example: restrict to specific entity types
curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F 'config={"pii_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"], "redaction_mode": "type_label"}'
Example: apply a saved redaction policy
curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "policy_id=pol_hipaa_strict"
Response (202 Accepted)
Same shape as the parse response with workflow_kind: "safe_parse".
{
  "id": "7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2",
  "status": "queued",
  "workflow_kind": "safe_parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}
Polling for completion
Poll GET /api/v1/documents/jobs/{job_id} until status is completed or failed.
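A polling loop around this endpoint might look like the sketch below. The endpoint, header, and terminal statuses come from this page. The initial interval, the exponential backoff, the overall timeout, and the function names `fetch_job` and `wait_for_job` are assumptions chosen for illustration, not documented client behavior.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.expunct.ai"
TERMINAL = {"completed", "failed"}  # statuses named in the text above


def fetch_job(job_id, api_key):
    """GET the job record and decode its JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/documents/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch, interval=2.0, max_interval=30.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout elapses.

    `fetch` is any callable taking a job id and returning the job dict,
    which keeps the loop testable without network access.
    """
    deadline = clock() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        sleep(interval)
        interval = min(interval * 2, max_interval)  # back off between polls
```

A caller would bind the API key first, for example `wait_for_job(job_id, lambda j: fetch_job(j, api_key))`, and then check whether the returned status is completed or failed.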
cURL
curl https://api.expunct.ai/api/v1/documents/jobs/7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2 \
  -H "X-API-Key: pk_live_abc123"
Artifacts
A completed safe-parse job produces three artifacts:
| artifact_kind | Retention | Description |
|---|---|---|
| sanitized_canonical_document | Persistent | PII-free structured document (pages, blocks, tables) |
| sanitized_markdown_render | Persistent | Sanitized document rendered as Markdown |
| sanitized_chunks_v1 | Persistent | Semantic chunks of the sanitized document, ready for embedding |
The raw canonical_document is ephemeral — it is created internally during sanitization and deleted before the job completes. It is never included in the artifact list.
Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
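The sketch below shows one way to turn a completed job record into content URLs. This page does not show the shape of the artifact list, so the `artifacts` key and the per-artifact `id` field are assumptions. Only `artifact_kind`, the three kind names, and the content endpoint are taken from the text above. The helper name `artifact_content_urls` is hypothetical.

```python
BASE_URL = "https://api.expunct.ai"

# The three artifact kinds a completed safe-parse job is documented to produce.
EXPECTED_KINDS = {
    "sanitized_canonical_document",
    "sanitized_markdown_render",
    "sanitized_chunks_v1",
}


def artifact_content_urls(job):
    """Map artifact_kind -> content URL for a completed safe-parse job.

    Assumes job["artifacts"] is a list of dicts with "id" and
    "artifact_kind" keys; adjust to the real job payload.
    """
    urls = {}
    for artifact in job.get("artifacts", []):
        kind = artifact.get("artifact_kind")
        if kind not in EXPECTED_KINDS:
            continue  # skip anything outside the documented set
        urls[kind] = f"{BASE_URL}/api/v1/documents/{artifact['id']}/content"
    return urls


# Hypothetical job payload, for illustration only.
job = {"status": "completed", "artifacts": [
    {"id": "art_1", "artifact_kind": "sanitized_markdown_render"},
    {"id": "art_2", "artifact_kind": "sanitized_chunks_v1"},
]}
print(artifact_content_urls(job)["sanitized_chunks_v1"])
```

Each URL is then fetched with the same X-API-Key header used for the other requests.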
sanitized_canonical_document shape
Identical structure to the canonical_document from the parse workflow, but with all PII replaced:
{
  "document_id": "7a8b9c0d-...",
  "page_count": 2,
  "block_count": 18,
  "parse_route": "text_native",
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "paragraph",
          "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
          "reading_order": 0
        }
      ],
      "tables": []
    }
  ]
}
sanitized_chunks_v1 shape
Identical structure to chunks_v1 but sourced from the sanitized canonical document:
{
  "document_id": "7a8b9c0d-...",
  "source_artifact_id": "art_sanitized...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
      "page_number": 1,
      "block_ids": ["b_001"],
      "token_count": 14
    }
  ]
}
Redaction modes
| redaction_mode | Example output |
|---|---|
| type_label (default) | [PERSON], [EMAIL_ADDRESS], [PHONE_NUMBER] |
| mask | ████ (fixed-length block character) |
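The difference between the two modes can be illustrated with a small rendering function. This is a sketch, not the service's code: the name `render_redaction` is hypothetical, and the mask width of four characters is an assumption taken from the example output in the table.

```python
MASK = "\u2588" * 4  # "████"; fixed width, so the original span length is not revealed


def render_redaction(entity_type, mode="type_label"):
    """Return the replacement string for one redacted span."""
    if mode == "type_label":
        return f"[{entity_type}]"  # keeps the entity type visible
    if mode == "mask":
        return MASK                # hides both the value and its type
    raise ValueError(f"unknown redaction_mode: {mode!r}")


sentence = "Call Jane Doe"
for mode in ("type_label", "mask"):
    print(sentence.replace("Jane Doe", render_redaction("PERSON", mode)))
# Call [PERSON]
# Call ████
```

type_label keeps the entity type available to downstream consumers such as search or embedding pipelines, while mask removes that signal as well.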
Error responses
| Status | Meaning |
|---|---|
| 400 | Unsupported file type or invalid config JSON |
| 403 | Feature flag document_safe_parse_workflow not enabled |
| 413 | File exceeds plan upload limit |
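On the client side, these statuses can be mapped to readable errors at submission time. The sketch below is illustrative only: the exception class `SafeParseError` and the function `check_submit_status` are hypothetical names, and the fallback message for undocumented statuses is an assumption.

```python
# Meanings copied from the error table above.
ERROR_MEANINGS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_safe_parse_workflow not enabled",
    413: "File exceeds plan upload limit",
}


class SafeParseError(Exception):
    """Raised when a safe-parse submission is not accepted."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def check_submit_status(status):
    """Return normally on 202 Accepted; raise SafeParseError otherwise."""
    if status == 202:
        return
    message = ERROR_MEANINGS.get(status, "Unexpected response")
    raise SafeParseError(status, message)


try:
    check_submit_status(413)
except SafeParseError as err:
    print(err)  # 413: File exceeds plan upload limit
```

None of the three documented errors is likely to succeed on an unchanged retry, since each calls for a different file, a corrected config, or a plan or flag change.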