Safe Parse

POST /api/v1/workflows/safe-parse submits a PDF or DOCX for combined structural parsing and PII redaction. The document is parsed, all detected PII is replaced in place, and three sanitized artifacts are produced, ready for storage or indexing without privacy risk.

Processing is asynchronous. Poll GET /api/v1/documents/jobs/{job_id} for status.

Required feature flag: document_safe_parse_workflow

How it works

1. Parse: the file is ingested and converted to a canonical document (pages, blocks, tables).
2. Sanitize: Expunct’s PII detection engine (Presidio) scans every text block and replaces detected entities with redaction labels (e.g. [PERSON], [EMAIL_ADDRESS]).
3. Render: sanitized markdown and semantic chunks are produced from the sanitized canonical document.

The raw canonical document is ephemeral — it is deleted after sanitization and never returned to the caller. Only sanitized artifacts are retained.

Request

The request is a multipart/form-data upload.

Field	Type	Required	Description
`file`	file	Yes	PDF or DOCX file to parse and sanitize
`language`	string	No	Language code (default: `en`)
`policy_id`	string	No	Redaction policy ID — controls which entity types are redacted
`config`	string	No	JSON string with additional config options
`tenant_id`	string	No	Tenant override (defaults to authenticated tenant)

Config options

Pass additional options as a JSON string in the config field:

Key	Type	Default	Description
`redaction_mode`	string	`type_label`	How to render redacted spans. `type_label` → `[PERSON]`; `mask` → `████`
`pii_types`	array	`["all"]`	Entity types to redact (e.g. `["PERSON", "EMAIL_ADDRESS"]`). `all` means every supported type
`pii_categories`	array	`["PII","PCI","PHI"]`	Categories to include
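
Because config travels as a JSON string inside a multipart form, it has to be serialized before the request is sent. The following is a minimal sketch of one way to assemble the non-file form fields in Python. `build_form_fields` is a hypothetical helper, not part of any Expunct SDK; the field and key names are taken from the tables above.

```python
import json


def build_form_fields(language="en", policy_id=None, config=None):
    """Return the non-file form fields for a safe-parse upload.

    Hypothetical helper: `config` is a plain dict that is serialized
    into the JSON string expected by the `config` form field.
    """
    fields = {"language": language}
    if policy_id is not None:
        fields["policy_id"] = policy_id
    if config:
        fields["config"] = json.dumps(config)
    return fields


fields = build_form_fields(
    config={
        "redaction_mode": "mask",
        "pii_types": ["PERSON", "EMAIL_ADDRESS"],
        "pii_categories": ["PII"],
    }
)
print(fields["config"])
```

The resulting dict can be passed as `data=fields` next to the `files=` argument of an `httpx.post` call.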

Example

cURL


curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"

Python


import httpx

# Upload the file as multipart/form-data; `language` is an optional form field.
with open("/path/to/document.pdf", "rb") as f:
    response = httpx.post(
        "https://api.expunct.ai/api/v1/workflows/safe-parse",
        headers={"X-API-Key": "pk_live_abc123"},
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"language": "en"},
    )

response.raise_for_status()  # fail fast on 400/403/413 responses
job = response.json()
print(job["id"])  # e.g. "7a8b9c0d-..."

Node.js


import FormData from 'form-data';
import fs from 'fs';
import fetch from 'node-fetch';

// Build the multipart body; form.getHeaders() supplies the boundary header.
const form = new FormData();
form.append('file', fs.createReadStream('/path/to/document.pdf'), 'document.pdf');
form.append('language', 'en');

const response = await fetch('https://api.expunct.ai/api/v1/workflows/safe-parse', {
  method: 'POST',
  headers: { 'X-API-Key': 'pk_live_abc123', ...form.getHeaders() },
  body: form,
});

const job = await response.json();
console.log(job.id);

Example — restrict to specific entity types


curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F 'config={"pii_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"], "redaction_mode": "type_label"}'

Example — apply a saved redaction policy


curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "policy_id=pol_hipaa_strict"

Response (202 Accepted)

The response has the same shape as the parse response, with workflow_kind set to "safe_parse".


{
  "id": "7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2",
  "status": "queued",
  "workflow_kind": "safe_parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}
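
For callers who prefer typed access over raw dictionaries, the accepted-job payload can be mapped onto a small dataclass. This is only a sketch: `JobSummary`, `parse_timestamp`, and `parse_job` are hypothetical names, and the field list mirrors the example response above rather than a published schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JobSummary:
    id: str
    status: str
    workflow_kind: str
    media_type: str
    progress_pct: int
    created_at: datetime
    updated_at: datetime


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_job(payload: dict) -> JobSummary:
    """Convert the decoded 202 response body into a JobSummary."""
    return JobSummary(
        id=payload["id"],
        status=payload["status"],
        workflow_kind=payload["workflow_kind"],
        media_type=payload["media_type"],
        progress_pct=payload["progress_pct"],
        created_at=parse_timestamp(payload["created_at"]),
        updated_at=parse_timestamp(payload["updated_at"]),
    )


accepted = parse_job(
    {
        "id": "7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2",
        "status": "queued",
        "workflow_kind": "safe_parse",
        "media_type": "pdf",
        "progress_pct": 0,
        "created_at": "2025-03-01T12:00:00Z",
        "updated_at": "2025-03-01T12:00:00Z",
    }
)
print(accepted.id, accepted.status, accepted.created_at.isoformat())
```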

Polling for completion

Poll GET /api/v1/documents/jobs/{job_id} until status is completed or failed.

cURL


curl https://api.expunct.ai/api/v1/documents/jobs/7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2 \
  -H "X-API-Key: pk_live_abc123"

Python


import time
import httpx

job_id = "7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2"
headers = {"X-API-Key": "pk_live_abc123"}

# Poll every 2 seconds until the job reaches a terminal state.
while True:
    r = httpx.get(
        f"https://api.expunct.ai/api/v1/documents/jobs/{job_id}",
        headers=headers,
    )
    r.raise_for_status()
    job = r.json()
    print(f"Status: {job['status']} ({job['progress_pct']}%)")

    if job["status"] == "completed":
        for artifact in job["artifacts"]:
            print(f"  {artifact['artifact_kind']}: {artifact['id']}")
        break
    elif job["status"] == "failed":
        print(f"Failed: {job['error_message']}")
        break

    time.sleep(2)

Node.js


const jobId = '7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2';
const headers = { 'X-API-Key': 'pk_live_abc123' };

// Poll every 2 seconds until the job reaches a terminal state.
while (true) {
  const r = await fetch(
    `https://api.expunct.ai/api/v1/documents/jobs/${jobId}`,
    { headers },
  );
  const job = await r.json();
  console.log(`Status: ${job.status} (${job.progress_pct}%)`);

  if (job.status === 'completed') {
    for (const artifact of job.artifacts) {
      console.log(`  ${artifact.artifact_kind}: ${artifact.id}`);
    }
    break;
  } else if (job.status === 'failed') {
    console.error(`Failed: ${job.error_message}`);
    break;
  }

  await new Promise((resolve) => setTimeout(resolve, 2000));
}
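
The loops above run until the job reaches a terminal state, which means a stuck job would block the caller indefinitely. A bounded variant might look like the sketch below. `wait_for_job` and its parameters are hypothetical, and the 300-second default is an arbitrary assumption rather than a documented limit. The HTTP call is injected as `fetch_job` so the polling logic can be exercised without network access.

```python
import time


def wait_for_job(fetch_job, job_id, timeout_s=300.0, interval_s=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is completed or failed, or raise TimeoutError.

    fetch_job: callable that takes a job ID and returns the decoded
    JSON body of GET /api/v1/documents/jobs/{job_id}.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job['status']} after {timeout_s}s"
            )
        sleep(interval_s)


# Simulated status sequence standing in for real API responses.
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "completed", "artifacts": []},
])
final = wait_for_job(lambda _job_id: next(responses), "job-1",
                     sleep=lambda _seconds: None)
print(final["status"])
```

With httpx, `fetch_job` could be `lambda jid: httpx.get(f"https://api.expunct.ai/api/v1/documents/jobs/{jid}", headers=headers).json()`.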

Artifacts

A completed safe-parse job produces three artifacts:

`artifact_kind`	Retention	Description
`sanitized_canonical_document`	Persistent	PII-free structured document (pages, blocks, tables)
`sanitized_markdown_render`	Persistent	Sanitized document rendered as Markdown
`sanitized_chunks_v1`	Persistent	Semantic chunks of the sanitized document, ready for embedding

The raw canonical_document is ephemeral — it is created internally during sanitization and deleted before the job completes. It is never included in the artifact list.

Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
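
As an illustration, the content URL for each artifact on a completed job can be derived from the artifact list. This sketch assumes the `id` of each entry in `artifacts` is the `artifact_id` expected by the content endpoint; the helper names and the placeholder IDs are hypothetical.

```python
BASE_URL = "https://api.expunct.ai"


def artifact_content_url(artifact_id, base_url=BASE_URL):
    """Build the content URL for a single artifact."""
    return f"{base_url}/api/v1/documents/{artifact_id}/content"


def content_urls_by_kind(job):
    """Map each artifact_kind on a job to its content URL."""
    return {
        artifact["artifact_kind"]: artifact_content_url(artifact["id"])
        for artifact in job.get("artifacts", [])
    }


# Hypothetical completed job; the artifact IDs are placeholders.
completed_job = {
    "status": "completed",
    "artifacts": [
        {"id": "art_1", "artifact_kind": "sanitized_canonical_document"},
        {"id": "art_2", "artifact_kind": "sanitized_markdown_render"},
        {"id": "art_3", "artifact_kind": "sanitized_chunks_v1"},
    ],
}

for kind, url in content_urls_by_kind(completed_job).items():
    print(f"{kind}: {url}")
```

Each URL can then be fetched with the same `X-API-Key` header used in the other examples.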

sanitized_canonical_document shape

Identical structure to the canonical_document from the parse workflow, but with all PII replaced:


{
  "document_id": "7a8b9c0d-...",
  "page_count": 2,
  "block_count": 18,
  "parse_route": "text_native",
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "paragraph",
          "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
          "reading_order": 0
        }
      ],
      "tables": []
    }
  ]
}
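
To show how this shape might be consumed, the sketch below walks the pages and yields block text in reading order. It assumes `reading_order` is an integer that sorts blocks within a page, as the example suggests; `iter_block_text` is a hypothetical helper and the sample document is invented for illustration.

```python
def iter_block_text(document):
    """Yield (page_number, block_id, text) in page and reading order."""
    for page in sorted(document["pages"], key=lambda p: p["page_number"]):
        for block in sorted(page["blocks"], key=lambda b: b["reading_order"]):
            yield page["page_number"], block["block_id"], block["text"]


# Invented sample with blocks deliberately listed out of order.
sample_document = {
    "pages": [
        {
            "page_number": 1,
            "blocks": [
                {"block_id": "b_002", "kind": "paragraph",
                 "text": "Contact: [EMAIL_ADDRESS]", "reading_order": 1},
                {"block_id": "b_001", "kind": "paragraph",
                 "text": "Patient [PERSON] visited on [DATE_TIME].", "reading_order": 0},
            ],
            "tables": [],
        }
    ]
}

for page_number, block_id, text in iter_block_text(sample_document):
    print(page_number, block_id, text)
```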

sanitized_chunks_v1 shape

Identical structure to chunks_v1 but sourced from the sanitized canonical document:


{
  "document_id": "7a8b9c0d-...",
  "source_artifact_id": "art_sanitized...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
      "page_number": 1,
      "block_ids": ["b_001"],
      "token_count": 14
    }
  ]
}
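
Since the chunks are described as ready for embedding, a consumer might flatten them into records and group them under a token budget before calling an embedding model. The sketch below is one possible approach; the record layout, the helper names, and the budget of 25 tokens are assumptions made for illustration, and no particular embedding API is implied.

```python
def chunks_to_records(chunks_doc):
    """Flatten a sanitized_chunks_v1 payload into embedding-ready records."""
    document_id = chunks_doc["document_id"]
    return [
        {
            "id": f"{document_id}:{chunk['chunk_id']}",
            "text": chunk["text"],
            "metadata": {
                "page_number": chunk["page_number"],
                "block_ids": chunk["block_ids"],
                "token_count": chunk["token_count"],
            },
        }
        for chunk in chunks_doc["chunks"]
    ]


def batch_by_token_budget(records, max_tokens):
    """Group records so each batch stays within max_tokens.

    A single record larger than the budget gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for record in records:
        tokens = record["metadata"]["token_count"]
        if current and used + tokens > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(record)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Invented sample payload; the document ID is a placeholder.
sample_chunks = {
    "document_id": "doc_example",
    "chunks": [
        {"chunk_id": "c_001", "text": "Patient [PERSON] visited on [DATE_TIME].",
         "page_number": 1, "block_ids": ["b_001"], "token_count": 14},
        {"chunk_id": "c_002", "text": "Contact: [EMAIL_ADDRESS]",
         "page_number": 1, "block_ids": ["b_002"], "token_count": 10},
        {"chunk_id": "c_003", "text": "Follow-up scheduled.",
         "page_number": 2, "block_ids": ["b_003"], "token_count": 9},
    ],
}

records = chunks_to_records(sample_chunks)
batches = batch_by_token_budget(records, max_tokens=25)
print([[r["id"] for r in batch] for batch in batches])
```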

Redaction modes

`redaction_mode`	Example output
`type_label` (default)	`[PERSON]`, `[EMAIL_ADDRESS]`, `[PHONE_NUMBER]`
`mask`	`████` (fixed-length block character)
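
The difference between the two modes can be illustrated locally. The sketch below is not the service's implementation: it simply applies each rendering to spans that are already known, and it assumes the mask is four block characters, matching the example in the table. `render_redaction` and `span_of` are hypothetical helpers.

```python
MASK = "\u2588" * 4  # four block characters; the fixed length is an assumption


def span_of(text, needle, entity_type):
    """Locate `needle` in `text` and return (start, end, entity_type)."""
    start = text.index(needle)
    return (start, start + len(needle), entity_type)


def render_redaction(text, spans, mode="type_label"):
    """Replace non-overlapping (start, end, entity_type) spans in text."""
    if mode not in ("type_label", "mask"):
        raise ValueError(f"unknown redaction_mode: {mode}")
    parts, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        parts.append(text[cursor:start])
        parts.append(f"[{entity_type}]" if mode == "type_label" else MASK)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Contact Jane Doe at jane@example.com"
spans = [
    span_of(text, "Jane Doe", "PERSON"),
    span_of(text, "jane@example.com", "EMAIL_ADDRESS"),
]
labeled = render_redaction(text, spans, "type_label")
masked = render_redaction(text, spans, "mask")
print(labeled)  # Contact [PERSON] at [EMAIL_ADDRESS]
print(masked)
```

Note that `type_label` keeps the entity type visible to downstream consumers, while `mask` hides both the value and its type.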

Error responses

Status	Meaning
`400`	Unsupported file type or invalid config JSON
`403`	Feature flag `document_safe_parse_workflow` not enabled
`413`	File exceeds plan upload limit
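
A client might translate these status codes into a single exception type so that callers do not have to inspect raw responses. The following is a sketch under that assumption; `SafeParseError` and `check_submit_status` are hypothetical names, and the messages are copied from the table above. Any other non-202 status is reported generically, since the source does not document further codes.

```python
ERROR_MEANINGS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_safe_parse_workflow not enabled",
    413: "File exceeds plan upload limit",
}


class SafeParseError(Exception):
    """Raised when a safe-parse submission is not accepted."""

    def __init__(self, status_code, message):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message


def check_submit_status(status_code):
    """Return None for 202 Accepted, otherwise raise SafeParseError."""
    if status_code == 202:
        return None
    message = ERROR_MEANINGS.get(status_code, "Unexpected response")
    raise SafeParseError(status_code, message)


try:
    check_submit_status(413)
except SafeParseError as exc:
    print(exc)
```

In practice the argument would be `response.status_code` from the upload request.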