Categorizing Validation Errors in Regulatory Document Pipelines
A production guide to turning raw validation failures into a deterministic taxonomy: run jsonschema and pydantic v2 over ingested clinical documents, classify each error by type and severity, then route every finding to dead-letter, human review, or safe auto-fix with PHI-free structured logging and metrics.
When a site-activation packet, an IND amendment, or a consent form fails validation, the question is never simply “is it valid?” It is “which of the forty fields broke, how badly, who needs to look at it, and can the pipeline recover on its own?” A flat boolean answer forces a human to re-open every rejected document. A well-categorized answer lets the pipeline auto-fix a stray unknown key, quarantine a malformed payload to a dead-letter queue, and page a regulatory reviewer only for the genuinely ambiguous cases. This guide sits under the Schema Validation & Error Categorization cluster and the Automated Document Ingestion & Validation Workflows pillar, and it focuses narrowly on one thing: a concrete, runnable error-categorization system.
What “categorization” actually means
A validation error has three independent axes, and conflating them is the most common design mistake:
- Type: the machine-readable reason the value failed. From pydantic v2 this is the error type string (missing, extra_forbidden, value_error, string_pattern_mismatch, enum, and so on). From jsonschema this is the validator keyword that failed (required, type, pattern, enum, additionalProperties).
- Severity: the operational impact, which is blocking (the document cannot proceed to submission), warning (it proceeds but is flagged), or info (cosmetic). Severity is a business decision layered on top of type; the same pattern failure may block on protocol_version but only warn on a free-text comment.
- Disposition: what the pipeline does next, which is to send to dead-letter, escalate to human review, or apply a bounded auto-fix and re-validate.
Because the type is structural and the severity/disposition are policy, we keep them in separate layers. The validators emit types; a policy table maps (field, type) to severity and disposition. That separation is what makes the system maintainable as regulatory requirements change.
Decision flow
flowchart TD
A[Raw document payload] --> B[Run jsonschema iter_errors]
B --> C[Run pydantic model_validate]
C --> D{Any errors collected}
D -->|No| E[Mark validated and proceed]
D -->|Yes| F[Classify each error by field and type]
F --> G[Look up severity in policy table]
G --> H{Highest severity}
H -->|Info or warning only| I[Attach flags and proceed]
H -->|Blocking and auto fixable| J[Apply bounded auto fix]
J --> K{Re validate}
K -->|Clean| E
K -->|Still failing| L[Route to human review]
H -->|Blocking and ambiguous| L
H -->|Blocking and structurally broken| M[Dead letter queue]
L --> N[Emit metrics and PHI free log]
M --> N
I --> N
E --> N
The schemas: jsonschema and pydantic v2 side by side
We validate twice on purpose. JSON Schema is the contract the sponsor and the EDC agree on — it travels with the data and is easy to version. Pydantic v2 is the in-process model that gives us typed objects, cross-field validators, and rich error metadata. Running both catches different classes of defect: jsonschema reports additionalProperties violations and exact JSON paths cleanly, while pydantic’s @model_validator expresses business rules such as “consent date cannot precede IRB approval” that are awkward in pure JSON Schema.
"""Schemas for a clinical site-activation document."""
from __future__ import annotations

from datetime import date

from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator

# JSON Schema (Draft 2020-12): the portable contract shared with the sponsor/EDC.
# Note: jsonschema treats `format` as an annotation unless the validator is
# built with a format_checker, so date syntax is enforced by pydantic here.
DOCUMENT_JSON_SCHEMA: dict[str, object] = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,
    "required": [
        "document_id",
        "protocol_version",
        "site_id",
        "irb_approval_date",
        "site_classification",
        "attachments",
    ],
    "properties": {
        "document_id": {"type": "string", "minLength": 8, "maxLength": 32},
        "protocol_version": {"type": "string", "pattern": r"^v\d+\.\d+\.\d+$"},
        "site_id": {"type": "string", "pattern": r"^SITE-\d{4}$"},
        "irb_approval_date": {"type": "string", "format": "date"},
        "consent_date": {"type": "string", "format": "date"},
        "site_classification": {
            "type": "string",
            "enum": ["Phase I", "Phase II", "Phase III", "Phase IV"],
        },
        "attachments": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string"},
        },
        "reviewer_comment": {"type": "string", "maxLength": 500},
    },
}


class RegulatoryDocument(BaseModel):
    """In-process model. `extra='forbid'` makes unexpected keys raise
    `extra_forbidden`, mirroring the JSON Schema `additionalProperties: false`."""

    model_config = ConfigDict(extra="forbid", str_strip_whitespace=False)

    document_id: str = Field(min_length=8, max_length=32)
    protocol_version: str = Field(pattern=r"^v\d+\.\d+\.\d+$")
    site_id: str = Field(pattern=r"^SITE-\d{4}$")
    irb_approval_date: date
    consent_date: date | None = None
    site_classification: str
    attachments: list[str] = Field(min_length=1)
    reviewer_comment: str | None = Field(default=None, max_length=500)

    @field_validator("site_classification")
    @classmethod
    def _known_phase(cls, value: str) -> str:
        allowed = {"Phase I", "Phase II", "Phase III", "Phase IV"}
        if value not in allowed:
            # Surfaces as a pydantic error with type 'value_error'. The message
            # deliberately omits the value so it cannot leak into a finding.
            raise ValueError("unknown site_classification")
        return value

    @model_validator(mode="after")
    def _consent_after_irb(self) -> "RegulatoryDocument":
        if self.consent_date is not None and self.consent_date < self.irb_approval_date:
            raise ValueError("consent_date precedes irb_approval_date")
        return self
The error type strings used below (missing, extra_forbidden, value_error, string_pattern_mismatch, enum, too_short) are the actual stable identifiers pydantic v2 emits in ValidationError.errors()[i]["type"]. The jsonschema attributes (error.validator, error.json_path, error.absolute_path) are the real attributes on jsonschema.exceptions.ValidationError.
Normalizing both validators into one error record
The core idea: collapse every jsonschema and pydantic error into a single ValidationFinding dataclass keyed by (field_path, type_code). Downstream policy only ever sees that uniform shape.
"""Normalize jsonschema + pydantic errors into a uniform finding."""
from __future__ import annotations

import enum
from dataclasses import dataclass

from jsonschema import Draft202012Validator
from jsonschema.exceptions import ValidationError as JsonSchemaError
from pydantic import ValidationError as PydanticError

# Assumes DOCUMENT_JSON_SCHEMA and RegulatoryDocument from the schemas block
# above are defined in, or imported into, this module.


class Severity(enum.IntEnum):
    """Ordered so max() yields the most serious severity present."""

    INFO = 10
    WARNING = 20
    BLOCKING = 30


class Disposition(str, enum.Enum):
    PROCEED = "proceed"
    AUTO_FIX = "auto_fix"
    HUMAN_REVIEW = "human_review"
    DEAD_LETTER = "dead_letter"


@dataclass(frozen=True)
class ValidationFinding:
    """One normalized validation problem, validator-agnostic."""

    source: str  # "jsonschema" or "pydantic"
    field_path: str  # dotted path, e.g. "attachments.0" or "protocol_version"
    type_code: str  # "missing", "pattern", "extra_forbidden", ...
    message: str  # short, PHI-free description
    severity: Severity = Severity.BLOCKING
    disposition: Disposition = Disposition.HUMAN_REVIEW


def _json_path_to_dotted(error: JsonSchemaError) -> str:
    """Build a dotted field path from a jsonschema error.

    `absolute_path` is a deque of property names / array indices; an empty
    path means the error is on the document root. A `required` failure is
    reported on the parent object, so a missing top-level property maps to
    "<root>" and its name is carried in the message instead.
    """
    parts = [str(p) for p in error.absolute_path]
    return ".".join(parts) if parts else "<root>"


def _safe_jsonschema_message(err: JsonSchemaError) -> str:
    """Return a message that names the constraint, never the instance value
    (which could be PHI)."""
    if err.validator == "required":
        # The message reads "'<name>' is a required property"; the first token
        # is the quoted property name, which is schema vocabulary, not data.
        return f"missing required property: {err.message.split()[0]}"
    if err.validator == "additionalProperties":
        return "unexpected property present"
    return f"failed constraint '{err.validator}'"


def collect_jsonschema_findings(payload: dict[str, object]) -> list[ValidationFinding]:
    """Run Draft 2020-12 validation and emit one finding per error."""
    validator = Draft202012Validator(DOCUMENT_JSON_SCHEMA)
    findings: list[ValidationFinding] = []
    # iter_errors yields *all* violations, not just the first. Path parts are
    # stringified for the sort key so str/int components never get compared.
    errors = sorted(
        validator.iter_errors(payload),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    for err in errors:
        findings.append(
            ValidationFinding(
                source="jsonschema",
                field_path=_json_path_to_dotted(err),
                # `validator` is the failing keyword: required/type/pattern/enum/...
                type_code=str(err.validator),
                message=_safe_jsonschema_message(err),
            )
        )
    return findings


# Map pydantic v2 error type strings to our normalized type codes.
_PYDANTIC_TYPE_MAP: dict[str, str] = {
    "missing": "missing",
    "extra_forbidden": "extra_forbidden",
    "string_pattern_mismatch": "pattern",
    "string_too_short": "min_length",
    "too_short": "min_length",
    # Without these two, an over-long reviewer_comment would miss its
    # ("reviewer_comment", "max_length") rule and fall to the global default.
    "string_too_long": "max_length",
    "too_long": "max_length",
    "enum": "enum",
    "value_error": "value_error",
    "date_from_datetime_parsing": "type",
    "date_parsing": "type",
}


def _finding_from_pydantic(err: dict[str, object]) -> ValidationFinding:
    raw_type = str(err["type"])
    loc = err.get("loc", ())
    field_path = ".".join(str(p) for p in loc) if loc else "<root>"
    return ValidationFinding(
        source="pydantic",
        field_path=field_path,
        type_code=_PYDANTIC_TYPE_MAP.get(raw_type, raw_type),
        # err["msg"] from pydantic v2 does not echo the input value by default,
        # but we still avoid logging err["input"], which may contain PHI.
        message=str(err.get("msg", raw_type)),
    )


def collect_pydantic_findings(payload: dict[str, object]) -> list[ValidationFinding]:
    """Validate with pydantic v2 and normalize each error."""
    try:
        RegulatoryDocument.model_validate(payload)
    except PydanticError as exc:
        return [_finding_from_pydantic(e) for e in exc.errors()]
    return []
Note the deliberate PHI hygiene: pydantic’s errors() includes an input key holding the offending value, and a verbose jsonschema message echoes the instance. A consent form’s offending value could be a subject identifier or date of birth, so we never copy err["input"] or err.instance into a finding. We log the field path and the constraint name only.
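The hygiene rule above amounts to an allowlist: copy the location and the type, and nothing else. The sketch below shows that idea on a hand-built dict shaped like one entry of pydantic v2's errors() output; `redact_error` and the sample values are hypothetical names for illustration, not part of the pipeline code. Recent pydantic v2 releases also appear to accept `include_input=False` on `errors()`, which is worth checking against your installed version, but an allowlist holds regardless of what the library returns.

```python
from typing import Any, Mapping


def redact_error(err: Mapping[str, Any]) -> dict[str, str]:
    """Keep only the location and the error type.

    Everything else in the record ("input", "ctx", "msg", "url") is dropped,
    because any of those may carry or embed the offending value.
    """
    loc = err.get("loc", ())
    return {
        "field_path": ".".join(str(part) for part in loc) or "<root>",
        "type": str(err.get("type", "unknown")),
    }


# Hand-built sample in the shape of one ValidationError.errors() entry.
# The input value is invented for the example.
sample = {
    "type": "string_pattern_mismatch",
    "loc": ("site_id",),
    "msg": "String should match pattern '^SITE-\\d{4}$'",
    "input": "SITE-12A",
    "ctx": {"pattern": "^SITE-\\d{4}$"},
}

safe = redact_error(sample)
```

An allowlist fails closed: if a future library version adds a new key that carries data, it is dropped by default instead of leaking.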
The policy table: type and field to severity and disposition
This is the only place business rules live. It is intentionally data, not code, so a regulatory analyst can review it.
"""Severity + disposition policy. Keyed by (field_path, type_code) with
sensible per-type fallbacks. This is the single source of truth for routing."""
from __future__ import annotations

from dataclasses import replace

# Assumes Severity, Disposition, and ValidationFinding from the normalization
# block above are defined in, or imported into, this module.

# (field_path, type_code) -> (Severity, Disposition)
_FIELD_POLICY: dict[tuple[str, str], tuple[Severity, Disposition]] = {
    ("protocol_version", "pattern"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("site_id", "pattern"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("irb_approval_date", "type"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("attachments", "min_length"): (Severity.BLOCKING, Disposition.DEAD_LETTER),
    ("attachments", "minItems"): (Severity.BLOCKING, Disposition.DEAD_LETTER),
    # A stray reviewer comment over the limit is cosmetic and auto-trimmable.
    ("reviewer_comment", "max_length"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("reviewer_comment", "maxLength"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("<root>", "additionalProperties"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("<root>", "extra_forbidden"): (Severity.WARNING, Disposition.AUTO_FIX),
}

# Fallback by type_code when no specific (field, type) rule matches.
_TYPE_DEFAULT: dict[str, tuple[Severity, Disposition]] = {
    "missing": (Severity.BLOCKING, Disposition.DEAD_LETTER),
    "required": (Severity.BLOCKING, Disposition.DEAD_LETTER),
    "type": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "pattern": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "enum": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "value_error": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "extra_forbidden": (Severity.WARNING, Disposition.AUTO_FIX),
    "additionalProperties": (Severity.WARNING, Disposition.AUTO_FIX),
    "min_length": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
}

_GLOBAL_DEFAULT: tuple[Severity, Disposition] = (
    Severity.BLOCKING,
    Disposition.HUMAN_REVIEW,
)


def classify(finding: ValidationFinding) -> ValidationFinding:
    """Return a copy of the finding enriched with severity + disposition."""
    severity, disposition = _FIELD_POLICY.get(
        (finding.field_path, finding.type_code),
        _TYPE_DEFAULT.get(finding.type_code, _GLOBAL_DEFAULT),
    )
    # Dataclass is frozen; dataclasses.replace builds a new instance.
    return replace(finding, severity=severity, disposition=disposition)
The rationale behind a few choices:
- A missing required field is structurally broken and almost always indicates an upstream extraction or mapping fault, so it dead-letters rather than wasting a reviewer’s time.
- A pattern failure on protocol_version is usually a real but human-resolvable transcription issue, so it goes to review.
- Extra properties (extra_forbidden / additionalProperties) are warnings: the safest auto-fix in regulated data is to drop unknown keys (never to invent values), which is non-mutating with respect to required content.
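Because the policy is data, one option is to keep the field-level rules in a reviewed file and parse them at start-up. The following is a minimal sketch under that assumption: the JSON layout, `POLICY_JSON`, and `load_field_policy` are hypothetical, and the two enums are repeated only so the block runs on its own.

```python
import enum
import json


class Severity(enum.IntEnum):
    INFO = 10
    WARNING = 20
    BLOCKING = 30


class Disposition(str, enum.Enum):
    PROCEED = "proceed"
    AUTO_FIX = "auto_fix"
    HUMAN_REVIEW = "human_review"
    DEAD_LETTER = "dead_letter"


# Hypothetical file content; in practice this would live under version control
# next to the JSON Schema so both are reviewed together.
POLICY_JSON = """
[
  {"field": "protocol_version", "type": "pattern",
   "severity": "BLOCKING", "disposition": "human_review"},
  {"field": "reviewer_comment", "type": "max_length",
   "severity": "WARNING", "disposition": "auto_fix"}
]
"""


def load_field_policy(text: str) -> dict[tuple[str, str], tuple[Severity, Disposition]]:
    """Parse rows into the (field_path, type_code) -> (Severity, Disposition) shape.

    An unknown severity name raises KeyError and an unknown disposition value
    raises ValueError, so a typo in the reviewed file fails at load time
    instead of silently misrouting documents.
    """
    policy: dict[tuple[str, str], tuple[Severity, Disposition]] = {}
    for row in json.loads(text):
        key = (row["field"], row["type"])
        if key in policy:
            raise ValueError(f"duplicate policy rule for {key}")
        policy[key] = (Severity[row["severity"]], Disposition(row["disposition"]))
    return policy


policy = load_field_policy(POLICY_JSON)
```

Rejecting duplicate keys matters here: with a plain dict, a second rule for the same (field, type) pair would otherwise overwrite the first without anyone noticing.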
Auto-fix: bounded, non-mutating, and re-validated
Auto-fix in a regulated pipeline must be conservative. We only ever remove unknown keys or truncate an over-long free-text comment — operations that cannot fabricate or alter regulatory content. Any fix is logged, and the document is re-validated; if it still fails, it escalates.
"""Bounded auto-fix. Only safe, non-fabricating transformations are allowed."""
from __future__ import annotations

import copy

# Assumes DOCUMENT_JSON_SCHEMA, Disposition, and ValidationFinding from the
# blocks above are defined in, or imported into, this module.


def apply_auto_fix(
    payload: dict[str, object],
    findings: list[ValidationFinding],
) -> tuple[dict[str, object], list[str]]:
    """Return a fixed copy of the payload plus a list of applied fix notes.

    Never mutates the input. Only handles findings whose disposition is
    AUTO_FIX; everything else is left untouched for the caller to route.
    """
    fixed = copy.deepcopy(payload)
    notes: list[str] = []
    allowed_keys = set(DOCUMENT_JSON_SCHEMA["properties"])  # type: ignore[arg-type]
    for finding in findings:
        if finding.disposition is not Disposition.AUTO_FIX:
            continue
        if finding.type_code in {"extra_forbidden", "additionalProperties"}:
            # Drop every top-level key the schema does not declare.
            removed = [k for k in list(fixed) if k not in allowed_keys]
            for key in removed:
                del fixed[key]
            if removed:
                notes.append(f"dropped unknown keys: {sorted(removed)}")
        elif finding.type_code in {"max_length", "maxLength"}:
            # Truncate only the free-text comment, never a regulatory field.
            comment = fixed.get("reviewer_comment")
            if isinstance(comment, str) and len(comment) > 500:
                fixed["reviewer_comment"] = comment[:500]
                notes.append("truncated reviewer_comment to 500 chars")
    return fixed, notes
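The "cannot fabricate or alter regulatory content" claim can itself be checked mechanically after a fix runs. Below is a minimal sketch of such a guard; `is_conservative_fix` and its `truncatable` parameter are hypothetical names, and the check only covers top-level keys, which matches the flat document shape used in this guide.

```python
from typing import Any, Mapping


def is_conservative_fix(
    original: Mapping[str, Any],
    fixed: Mapping[str, Any],
    truncatable: frozenset = frozenset({"reviewer_comment"}),
) -> bool:
    """True if `fixed` differs from `original` only by removed top-level keys
    or by a truncatable string field having been shortened to a prefix."""
    for key, new_value in fixed.items():
        if key not in original:
            return False  # a key was invented
        old_value = original[key]
        if new_value == old_value:
            continue
        shortened = (
            key in truncatable
            and isinstance(old_value, str)
            and isinstance(new_value, str)
            and old_value.startswith(new_value)
        )
        if not shortened:
            return False  # a value was altered, not merely trimmed
    return True


# Invented sample values for illustration.
before = {"site_id": "SITE-0042", "legacy_notes": "x", "reviewer_comment": "abcdef"}
ok = is_conservative_fix(before, {"site_id": "SITE-0042", "reviewer_comment": "abc"})
bad = is_conservative_fix(before, {"site_id": "SITE-0043"})
```

The guard deliberately allows any key to be removed, including a required one; catching that case is the job of the re-validation pass, which would report the key as missing.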
Tying it together: orchestration, logging, and metrics
The orchestrator runs both validators, classifies, attempts one auto-fix pass, re-validates, and emits a PHI-free structured log line plus counters. The disposition is the worst one present, because a document with one dead-letter finding cannot proceed even if its other findings are auto-fixable.
"""End-to-end orchestration with structured logging and metrics."""
from __future__ import annotations

import logging
from collections import Counter

# Assumes the collectors, classify, apply_auto_fix, Severity, Disposition, and
# ValidationFinding from the blocks above are available in this module.

# `python-json-logger` or stdlib `logging` with a JSON formatter both work;
# here we emit structured `extra` fields and rely on the handler to serialize.
logger = logging.getLogger("regulatory.validation")

# Module-level counters; in production back these with Prometheus/StatsD.
METRICS: Counter[str] = Counter()


def _collect_all(payload: dict[str, object]) -> list[ValidationFinding]:
    raw = collect_jsonschema_findings(payload) + collect_pydantic_findings(payload)
    return [classify(f) for f in raw]


def _worst_disposition(findings: list[ValidationFinding]) -> Disposition:
    """Resolve the binding disposition for the whole document."""
    if any(f.disposition is Disposition.DEAD_LETTER for f in findings):
        return Disposition.DEAD_LETTER
    if any(f.disposition is Disposition.HUMAN_REVIEW for f in findings):
        return Disposition.HUMAN_REVIEW
    if any(f.disposition is Disposition.AUTO_FIX for f in findings):
        return Disposition.AUTO_FIX
    return Disposition.PROCEED


def _log(
    event: str,
    document_id: str,
    findings: list[ValidationFinding],
    extra: dict[str, object] | None = None,
) -> None:
    """Emit a PHI-free structured log line.

    We log field paths and type codes only, never values, never `err.input`.
    """
    record: dict[str, object] = {
        "event": event,
        "document_id": document_id,
        "finding_count": len(findings),
        "error_types": sorted({f.type_code for f in findings}),
    }
    if extra:
        record.update(extra)
    logger.info(event, extra={"validation": record})


def validate_document(
    payload: dict[str, object],
    document_id: str,
) -> dict[str, object]:
    """Validate, categorize, and route a single document.

    `document_id` is a non-PHI surrogate key safe to log. The raw `payload`
    and individual field values are never logged.
    """
    findings = _collect_all(payload)
    disposition = _worst_disposition(findings)

    if disposition is Disposition.AUTO_FIX:
        fixed, notes = apply_auto_fix(payload, findings)
        residual = _collect_all(fixed)
        if not residual:
            # Log the findings that were fixed before clearing them.
            _log("auto_fix_succeeded", document_id, findings, extra={"notes": notes})
            payload, findings, disposition = fixed, [], Disposition.PROCEED
        else:
            # Auto-fix did not fully clean it; escalate the residual. Auto-fix
            # runs only once, so anything still marked AUTO_FIX goes to review.
            findings = residual
            disposition = _worst_disposition(residual)
            if disposition is Disposition.AUTO_FIX:
                disposition = Disposition.HUMAN_REVIEW

    severity = max((f.severity for f in findings), default=Severity.INFO)
    METRICS[f"disposition.{disposition.value}"] += 1
    for finding in findings:
        METRICS[f"error_type.{finding.type_code}"] += 1
    _log(
        "validation_complete",
        document_id,
        findings,
        extra={"disposition": disposition.value, "severity": severity.name},
    )
    return {
        "document_id": document_id,
        "valid": disposition is Disposition.PROCEED,
        "disposition": disposition.value,
        "severity": severity.name,
        "findings": [
            {
                "source": f.source,
                "field": f.field_path,
                "type": f.type_code,
                "severity": f.severity.name,
                "disposition": f.disposition.value,
            }
            for f in findings
        ],
    }
A worked example shows the routing in action. Given a payload missing attachments, carrying an unknown legacy_notes key, and a protocol_version of "1.0" (no leading v):
- jsonschema emits required (for attachments), additionalProperties (for legacy_notes), and pattern (for protocol_version).
- pydantic emits missing, extra_forbidden, and string_pattern_mismatch for the same problems.
- Classification yields, from each validator, one DEAD_LETTER (missing attachments), one AUTO_FIX (drop legacy_notes), and one HUMAN_REVIEW (protocol pattern).
- _worst_disposition resolves to dead-letter: the document cannot proceed regardless of the fixable noise, and a reviewer is not paged for a document that is structurally incomplete.
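The resolution step of this example can be traced without either validator installed. The sketch below hard-codes the classified findings as plain tuples (each validator contributes its own finding for each of the three problems) and resolves them with a rank table. `_RANK` and `resolve` are hypothetical stand-ins that mirror the precedence in `_worst_disposition`, and the field paths assume jsonschema reports root-level `required` and `additionalProperties` failures on the root while pydantic reports them on the named key.

```python
# Precedence from least to most binding.
_RANK = {"proceed": 0, "auto_fix": 1, "human_review": 2, "dead_letter": 3}

# (source, field_path, type_code, disposition) after classification.
worked_example = [
    ("jsonschema", "<root>", "required", "dead_letter"),
    ("pydantic", "attachments", "missing", "dead_letter"),
    ("jsonschema", "<root>", "additionalProperties", "auto_fix"),
    ("pydantic", "legacy_notes", "extra_forbidden", "auto_fix"),
    ("jsonschema", "protocol_version", "pattern", "human_review"),
    ("pydantic", "protocol_version", "pattern", "human_review"),
]


def resolve(dispositions):
    """Return the most binding disposition, or "proceed" if there are none."""
    return max(dispositions, key=_RANK.__getitem__, default="proceed")


binding = resolve(d for _, _, _, d in worked_example)
```

Here `binding` is "dead_letter": a single dead-letter finding outranks any number of fixable or reviewable ones.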
Operational concerns
- Metrics that matter. Track disposition.dead_letter rate, disposition.human_review rate, and per-error_type counts. A spike in error_type.pattern on protocol_version usually signals an upstream template change, not a data-entry problem; categorization makes that visible.
- Idempotency. Run auto-fix exactly once per document, then re-validate. Looping auto-fix invites oscillation and obscures provenance.
- Audit alignment. Every disposition is an event worth recording in your 21 CFR Part 11 audit trail, in keeping with ALCOA+, with the surrogate document_id, the timestamp, and the resolved disposition. For PHI minimization, the trail never carries the field values themselves.
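The audit bullet above describes a record with three required parts and one prohibition. A minimal sketch of such a record follows; `DispositionAuditEvent` and `build_audit_event` are hypothetical names, the sample ID is invented, and nothing here should be read as a complete Part 11 implementation (signatures, retention, and access control are out of scope).

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class DispositionAuditEvent:
    """One routing decision. Carries identifiers and codes, never field values."""

    document_id: str
    disposition: str
    severity: str
    error_types: Tuple[str, ...]
    recorded_at: str  # ISO 8601, UTC


def build_audit_event(
    document_id: str,
    disposition: str,
    severity: str,
    error_types: Iterable[str],
    now: Optional[datetime] = None,
) -> DispositionAuditEvent:
    """Build an immutable event; `now` is injectable so tests are deterministic."""
    moment = now or datetime.now(timezone.utc)
    return DispositionAuditEvent(
        document_id=document_id,
        disposition=disposition,
        severity=severity,
        error_types=tuple(sorted(set(error_types))),
        recorded_at=moment.isoformat(),
    )


event = build_audit_event(
    "DOC-00012345",
    "dead_letter",
    "BLOCKING",
    ["missing", "required", "missing"],
    now=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
line = json.dumps(asdict(event), sort_keys=True)
```

The frozen dataclass has no slot for a value or a message, so a caller cannot attach PHI to the event by accident.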
FAQ
Why validate with both jsonschema and pydantic instead of one?
JSON Schema is the portable, versionable contract you share with sponsors and EDC vendors; pydantic v2 gives you typed objects and expressive cross-field validators (@model_validator) for rules like “consent date must not precede IRB approval.” Running both surfaces a wider class of defects and lets the JSON Schema travel with the data while the model stays in your process.
How do I avoid leaking PHI into validation logs?
Never log the offending value. Pydantic’s errors() includes an input key and jsonschema errors expose .instance; both can contain subject identifiers or dates. Log only the field_path, the normalized type_code, and a constraint-only message — exactly what the _log helper and _safe_jsonschema_message above enforce.
When is auto-fix safe in a regulated pipeline?
Only for non-fabricating transformations that leave required regulatory content untouched: dropping unknown keys or truncating an over-length free-text field. Auto-fix must never invent a missing value, reformat a date ambiguously, or alter regulatory content. Always re-validate after a fix and escalate anything still failing.
What’s the difference between dead-letter and human-review routing?
Dead-letter is for structurally broken or unrecoverable payloads (missing required fields, decode failures) that indicate an upstream extraction or mapping fault; a human fixes the source, not the document. Human-review is for structurally complete but ambiguous findings (a malformed protocol_version) where a reviewer can make a judgment call on the document itself.
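Decode failures are mentioned above but happen before either validator runs, so they need their own gate. The following sketch shows one way to route them, assuming the raw document arrives as UTF-8 JSON bytes; `parse_or_dead_letter` and the "not_an_object" reason code are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def parse_or_dead_letter(raw: bytes) -> tuple[dict[str, Any] | None, str | None]:
    """Decode a raw payload, or return a PHI-free reason code for dead-lettering.

    Returns (payload, None) on success and (None, reason) on failure. The
    reason is an exception class name or a fixed code, never the exception
    message, which can quote fragments of the input.
    """
    try:
        decoded = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return None, type(exc).__name__
    if not isinstance(decoded, dict):
        # Valid JSON, but not the object shape the schema expects.
        return None, "not_an_object"
    return decoded, None


good, good_reason = parse_or_dead_letter(b'{"site_id": "SITE-0042"}')
truncated, truncated_reason = parse_or_dead_letter(b'{"site_id": ')
```

A payload that fails here never reaches classification, so the caller would record the reason code and send the raw bytes straight to the dead-letter queue.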