Building FDA eCTD-Compliant JSON Schemas for Clinical Trials

This guide shows how to model the electronic Common Technical Document (eCTD) as JSON Schema Draft 2020-12 and pydantic v2 models: capturing harmonized CTD Modules 2-5, region-specific Module 1, document leaf metadata, lifecycle operations, and file checksums in a validatable, regulator-aligned data contract.

The eCTD is the structured format health authorities such as FDA and EMA use to receive marketing applications, INDs, and their amendments. Underneath the regulator-supplied DTDs and validation criteria sits a logical model: a tree of modules, headings, and document leaves, each carrying metadata and a lifecycle operation. This article builds that logical model in JSON Schema and pydantic v2 so your internal pipeline can author, validate, and version submission metadata before anything is rendered to the official XML backbone. We deliberately treat the eCTD specification version and region-specific rules as configuration, not hardcoded constants, because those values change and differ by region.

This is the deep how-to under the FDA/EMA Submission Schema Design cluster, itself part of the Core Architecture & Regulatory Mapping for Clinical Trials pillar. It pairs closely with Regulatory Data Dictionary Construction for controlled vocabularies and with Schema Validation & Error Categorization for how to surface and triage the validation failures these schemas produce.

What “eCTD-compliant” means for a JSON model

A few facts shape the design. The CTD is organized into five modules. Modules 2 through 5 are harmonized across ICH regions: the quality, nonclinical, and clinical content is the same regardless of where you submit. Module 1 is region-specific — its administrative content (cover letters, application forms, region-specific forms) differs between FDA, EMA, and other authorities, and each region publishes its own Module 1 specification and version.

The official eCTD transport is XML: an index.xml backbone references regional and study DTDs, and every document is a leaf with attributes describing its file, checksum, and lifecycle operation. Our JSON model is not a replacement for that XML — it is the upstream source of truth that an exporter renders into the backbone. Modeling it as JSON Schema gives us machine-checkable structure, while pydantic v2 gives us typed authoring and runtime validation in Python.

flowchart TD
    SUB[eCTD Submission] --> M1[Module 1 Regional Administrative]
    SUB --> M2[Module 2 Summaries]
    SUB --> M3[Module 3 Quality]
    SUB --> M4[Module 4 Nonclinical]
    SUB --> M5[Module 5 Clinical Study Reports]
    M1 --> M1F[Region specific forms and cover letters]
    M2 --> M2S[Overviews and written summaries]
    M3 --> Q[Drug substance and drug product]
    M4 --> NC[Pharmacology and toxicology reports]
    M5 --> CSR[Clinical study reports and listings]
    M1F --> LEAF[Document leaf metadata]
    Q --> LEAF
    CSR --> LEAF
    LEAF --> OP[Lifecycle operation new replace append delete]
    LEAF --> CK[File reference and checksum]
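
The leaf node at the bottom of the diagram is what an exporter eventually renders into the backbone. The following is only a sketch of that hand-off, using the standard library's ElementTree. The element name, the attribute names (ID, operation, checksum, checksum-type, modified-file, xlink:href), and the dict keys are assumptions made for illustration; take the authoritative names and value formats from the DTD for your region and eCTD version.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def render_leaf(leaf: dict) -> ET.Element:
    """Render one leaf dict as an XML element.

    Attribute names are illustrative assumptions, not a verified mapping
    to any specific eCTD DTD.
    """
    attrs = {
        "ID": leaf["leaf_id"],
        "operation": leaf["operation"],
        "checksum": leaf["checksum"]["digest"],
        "checksum-type": leaf["checksum"]["algorithm"],
        f"{{{XLINK_NS}}}href": leaf["href"],
    }
    # Only modifying operations carry a reference to a prior leaf.
    if leaf.get("modified_leaf_id"):
        attrs["modified-file"] = leaf["modified_leaf_id"]
    element = ET.Element("leaf", attrs)
    ET.SubElement(element, "title").text = leaf["title"]
    return element


leaf_xml = ET.tostring(
    render_leaf(
        {
            "leaf_id": "m1.cover.001",
            "operation": "new",
            "title": "Cover Letter",
            "href": "m1/us/cover-letter.pdf",
            "checksum": {"algorithm": "sha-256", "digest": "0" * 64},
        }
    ),
    encoding="unicode",
)
```

Keeping the renderer this thin is deliberate: every decision about what a leaf contains is made upstream in the validated model, and the exporter only maps names.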

Three concepts must survive the round-trip into the backbone, so we model them explicitly:

  • Leaf metadata: title, file href, MIME/media type, language, and a stable identifier per document.
  • Lifecycle operation: every leaf in a submission declares what it does relative to prior sequences: new, replace, append, or delete. A replace, append, or delete must point at the leaf it modifies; a new leaf must not.
  • Checksums: each leaf carries a file checksum so the receiving authority can verify integrity. The checksum algorithm is region- and version-configurable, so we store both the algorithm name and the digest.
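
A type-level model can require that a lifecycle reference is present, but whether the reference points at a leaf that actually exists in an earlier sequence is a cross-sequence check. A minimal sketch of that check, assuming leaves are plain dicts with the keys leaf_id, operation, and modified_leaf_id, and assuming you already track the set of leaf IDs that are live after all prior sequences (live_leaf_ids is a hypothetical input):

```python
from typing import Iterable

# Operations that must reference a leaf from an earlier sequence.
MODIFYING_OPS = {"replace", "append", "delete"}


def find_dangling_references(
    leaves: Iterable[dict], live_leaf_ids: set[str]
) -> list[str]:
    """Return messages for leaves whose lifecycle reference is missing or unknown."""
    problems: list[str] = []
    for leaf in leaves:
        op = leaf.get("operation")
        target = leaf.get("modified_leaf_id")
        if op == "new":
            if target is not None:
                problems.append(f"{leaf['leaf_id']}: 'new' must not reference a prior leaf")
        elif op in MODIFYING_OPS:
            if target is None:
                problems.append(f"{leaf['leaf_id']}: '{op}' requires modified_leaf_id")
            elif target not in live_leaf_ids:
                problems.append(f"{leaf['leaf_id']}: '{op}' targets unknown leaf {target!r}")
        else:
            problems.append(f"{leaf['leaf_id']}: unknown operation {op!r}")
    return problems


example_leaves = [
    {"leaf_id": "m3.spec.002", "operation": "replace", "modified_leaf_id": "m3.spec.001"},
    {"leaf_id": "m5.csr.009", "operation": "delete", "modified_leaf_id": "m5.csr.404"},
    {"leaf_id": "m1.cover.002", "operation": "new"},
]
problems = find_dangling_references(example_leaves, live_leaf_ids={"m3.spec.001"})
```

Here only the delete is reported, because m5.csr.404 is not among the live leaves.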

Modeling document leaves and lifecycle operations in pydantic v2

We start at the leaf because it is the atom of an eCTD submission. The lifecycle operation is the trickiest part: replace, append, and delete reference a prior leaf, while new must not. That is a natural fit for a discriminated union keyed on the operation name, which keeps the “modified-leaf-ID is required here, forbidden there” rule inside the type system instead of in scattered if statements.

"""eCTD logical model: document leaves and lifecycle operations.

Validated with pydantic v2. Emits JSON Schema (Draft 2020-12) via
model_json_schema(), which is checkable with jsonschema's Draft202012Validator.
"""
from __future__ import annotations

import hashlib
from enum import Enum
from pathlib import Path
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, field_validator


class LifecycleOp(str, Enum):
    """Lifecycle operations an eCTD leaf may declare against prior sequences."""
    NEW = "new"
    REPLACE = "replace"
    APPEND = "append"
    DELETE = "delete"


class ChecksumAlgo(str, Enum):
    """Checksum algorithm. The required algorithm is configured per region
    and eCTD version rather than hardcoded, so multiple values are allowed."""
    MD5 = "md5"
    SHA256 = "sha-256"


class FileChecksum(BaseModel):
    """Integrity descriptor for a single physical file."""
    model_config = ConfigDict(extra="forbid")

    algorithm: ChecksumAlgo
    digest: Annotated[str, Field(pattern=r"^[0-9a-f]+$", min_length=32)]

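    @field_validator("digest", mode="before")
    @classmethod
    def _lowercase_hex(cls, value: object) -> object:
        # Runs before the pattern check so uppercase hex is normalized
        # instead of rejected; non-strings fall through to type validation.
        return value.lower() if isinstance(value, str) else value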


class LeafBase(BaseModel):
    """Fields common to every document leaf regardless of operation."""
    model_config = ConfigDict(extra="forbid")

    leaf_id: Annotated[str, Field(pattern=r"^[A-Za-z][A-Za-z0-9_.\-]{2,63}$")]
    title: Annotated[str, Field(min_length=1, max_length=512)]
    href: Annotated[str, Field(min_length=1, max_length=2048)]
    media_type: Annotated[str, Field(pattern=r"^[\w.+-]+/[\w.+-]+$")] = "application/pdf"
    language: Annotated[str, Field(pattern=r"^[a-z]{2}(-[A-Z]{2})?$")] = "en"
    checksum: FileChecksum

    @field_validator("href")
    @classmethod
    def _reject_absolute_paths(cls, value: str) -> str:
        """Leaf hrefs are submission-relative; reject traversal and absolute paths."""
        # Normalize separators so the check behaves the same on every platform.
        segments = value.replace("\\", "/").split("/")
        if segments[0] == "" or ".." in segments:
            raise ValueError("href must be a relative path without '..' segments")
        return value


class NewLeaf(LeafBase):
    operation: Literal[LifecycleOp.NEW] = LifecycleOp.NEW


class ReplaceLeaf(LeafBase):
    operation: Literal[LifecycleOp.REPLACE] = LifecycleOp.REPLACE
    modified_leaf_id: str = Field(description="leaf_id this leaf replaces")


class AppendLeaf(LeafBase):
    operation: Literal[LifecycleOp.APPEND] = LifecycleOp.APPEND
    modified_leaf_id: str = Field(description="leaf_id this leaf appends to")


class DeleteLeaf(LeafBase):
    operation: Literal[LifecycleOp.DELETE] = LifecycleOp.DELETE
    modified_leaf_id: str = Field(description="leaf_id this leaf deletes")


# Discriminated union: pydantic and JSON Schema both route on `operation`.
Leaf = Annotated[
    Union[NewLeaf, ReplaceLeaf, AppendLeaf, DeleteLeaf],
    Field(discriminator="operation"),
]

The discriminator="operation" annotation is what turns this into a tagged union rather than a loose oneOf. pydantic uses it to pick the right model and to report a clear error when operation is missing or unknown, and model_json_schema() emits a JSON Schema oneOf plus a discriminator mapping, so downstream tooling that understands discriminators (OpenAPI-style) routes the same way. Plain JSON Schema validators ignore the discriminator keyword and still enforce the oneOf, so the constraint holds either way. extra="forbid" maps to additionalProperties: false, which is essential: eCTD backbones are strict, and silently accepting unknown keys is how malformed metadata reaches a gateway.
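
To see the routing in isolation, here is a cut-down sketch with two hypothetical stand-in models (NewSketch and ReplaceSketch) instead of the full leaf classes. It validates through pydantic's TypeAdapter and returns the machine-readable error types, which makes the "required here, forbidden there" behavior visible.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, ValidationError


class SketchBase(BaseModel):
    model_config = ConfigDict(extra="forbid")
    leaf_id: str


class NewSketch(SketchBase):
    operation: Literal["new"] = "new"


class ReplaceSketch(SketchBase):
    operation: Literal["replace"] = "replace"
    modified_leaf_id: str


LeafSketch = Annotated[
    Union[NewSketch, ReplaceSketch], Field(discriminator="operation")
]
adapter = TypeAdapter(LeafSketch)


def error_types(payload: dict) -> list[str]:
    """Validate a payload and return pydantic's error type codes."""
    try:
        adapter.validate_python(payload)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


# replace without a target: the required field is missing
missing_target = error_types({"leaf_id": "a", "operation": "replace"})
# new with a target: rejected by extra="forbid"
forbidden_target = error_types(
    {"leaf_id": "a", "operation": "new", "modified_leaf_id": "b"}
)
# unknown or absent tag: rejected before any member model is tried
unknown_tag = error_types({"leaf_id": "a", "operation": "rename"})
absent_tag = error_types({"leaf_id": "a"})

schema = adapter.json_schema()
```

Note that the tag must be present in the input even though each member model declares a default for operation; the default only applies when a member model is constructed directly.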

Region-specific Module 1 vs. harmonized Modules 2-5

Module 1 is where regional divergence lives, so we model the region itself as a discriminator. An FDA Module 1 and an EMA Module 1 are different shapes; trying to force them into one optional-field-soup model is exactly the design that leaks invalid submissions. With a discriminated union, a payload tagged region: "fda" can only ever satisfy the FDA shape.

class ModuleSection(BaseModel):
    """A heading node that groups leaves and/or nested sections."""
    model_config = ConfigDict(extra="forbid")

    section_code: Annotated[str, Field(pattern=r"^[0-9](\.[0-9A-Za-z]+)*$")]
    title: Annotated[str, Field(min_length=1, max_length=512)]
    leaves: list[Leaf] = Field(default_factory=list)
    subsections: list["ModuleSection"] = Field(default_factory=list)


class FDAModule1(BaseModel):
    """US region-specific administrative module."""
    model_config = ConfigDict(extra="forbid")

    region: Literal["fda"] = "fda"
    submission_type: Literal["ind", "nda", "bla", "anda"]
    application_number: Annotated[str, Field(pattern=r"^[0-9]{6}$")]
    cover_letter: Leaf
    sections: list[ModuleSection] = Field(default_factory=list)


class EMAModule1(BaseModel):
    """EU region-specific administrative module."""
    model_config = ConfigDict(extra="forbid")

    region: Literal["ema"] = "ema"
    procedure_type: Literal["centralised", "national", "mrp", "dcp"]
    cover_letter: Leaf
    sections: list[ModuleSection] = Field(default_factory=list)


Module1 = Annotated[
    Union[FDAModule1, EMAModule1],
    Field(discriminator="region"),
]


class HarmonizedModule(BaseModel):
    """Modules 2-5 share one shape across regions (ICH-harmonized content)."""
    model_config = ConfigDict(extra="forbid")

    module_number: Literal[2, 3, 4, 5]
    sections: list[ModuleSection] = Field(default_factory=list)

The application_number and procedure_type patterns above are illustrative shapes, not regulatory facts; pull the real validation criteria for each region from your Regulatory Data Dictionary Construction registry and parametrize them per region and eCTD version. The point is structural: Module 1 varies by region and is selected by a discriminator; Modules 2-5 do not.
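
One way to parametrize such criteria is a small registry keyed by region and specification version. The sketch below is an assumption about how that could look, not a description of any authority's rules: RegionRules, RULES, the version token "example-v1", and both patterns are hypothetical placeholders that would be loaded from your data dictionary.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionRules:
    """Validation criteria for one (region, spec version) pair."""

    # None means the field is not used in this region.
    application_number: re.Pattern[str] | None


# Hypothetical registry; real entries come from configuration.
RULES: dict[tuple[str, str], RegionRules] = {
    ("fda", "example-v1"): RegionRules(application_number=re.compile(r"[0-9]{6}")),
    ("ema", "example-v1"): RegionRules(application_number=None),
}


def check_application_number(
    region: str, spec_version: str, value: str | None
) -> list[str]:
    """Return problems for an application number under the configured rules."""
    rules = RULES.get((region, spec_version))
    if rules is None:
        return [f"no validation criteria configured for {region}/{spec_version}"]
    pattern = rules.application_number
    if pattern is None:
        if value is None:
            return []
        return ["application_number is not used for this region"]
    if value is None or not pattern.fullmatch(value):
        return ["application_number does not match the configured pattern"]
    return []


fda_ok = check_application_number("fda", "example-v1", "123456")
fda_bad = check_application_number("fda", "example-v1", "12345")
ema_ok = check_application_number("ema", "example-v1", None)
unconfigured = check_application_number("fda", "example-v9", "123456")
```

Failing closed when no criteria are configured is a design choice in this sketch: an unknown region and version pair is reported rather than silently accepted.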

The submission envelope and configurable spec version

The top-level model ties everything together and carries the sequence metadata that drives lifecycle. Crucially, the eCTD specification version is a field, not a literal baked into the schema. The model below only bounds its length; checking it against an allow-list of supported versions is left to your configuration layer at validation time.

class Submission(BaseModel):
    """Root logical model rendered into an eCTD backbone by an exporter."""
    model_config = ConfigDict(extra="forbid")

    ectd_spec_version: Annotated[str, Field(min_length=1, max_length=32)]
    sequence: Annotated[str, Field(pattern=r"^[0-9]{4}$")]
    module_1: Module1
    harmonized_modules: list[HarmonizedModule] = Field(default_factory=list)

    @field_validator("harmonized_modules")
    @classmethod
    def _unique_modules(cls, mods: list[HarmonizedModule]) -> list[HarmonizedModule]:
        numbers = [m.module_number for m in mods]
        if len(numbers) != len(set(numbers)):
            raise ValueError("each harmonized module (2-5) may appear at most once")
        return mods

Treating ectd_spec_version as data means one codebase serves FDA and EMA, current and prior spec versions, without forking the model. The exporter that renders the XML backbone reads this field to choose the correct DTD and validation profile.
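
One way to enforce an allow-list of supported versions without hardcoding it is pydantic's validation context. The following is a minimal sketch, assuming a cut-down stand-in model named Envelope and a hypothetical context key allowed_spec_versions; the version tokens are placeholders.

```python
from typing import Annotated

from pydantic import (
    BaseModel,
    ConfigDict,
    Field,
    ValidationError,
    ValidationInfo,
    field_validator,
)


class Envelope(BaseModel):
    """Cut-down stand-in for the root submission model."""

    model_config = ConfigDict(extra="forbid")

    ectd_spec_version: Annotated[str, Field(min_length=1, max_length=32)]
    sequence: Annotated[str, Field(pattern=r"^[0-9]{4}$")]

    @field_validator("ectd_spec_version")
    @classmethod
    def _known_version(cls, value: str, info: ValidationInfo) -> str:
        # The allow-list arrives through validation context, so it never
        # appears in the emitted JSON Schema.
        allowed = (info.context or {}).get("allowed_spec_versions")
        if allowed is not None and value not in allowed:
            raise ValueError(f"unsupported eCTD spec version {value!r}")
        return value


context = {"allowed_spec_versions": {"example-v1", "example-v2"}}
accepted = Envelope.model_validate(
    {"ectd_spec_version": "example-v1", "sequence": "0000"}, context=context
)
try:
    Envelope.model_validate(
        {"ectd_spec_version": "example-v9", "sequence": "0000"}, context=context
    )
    rejected = False
except ValidationError:
    rejected = True
```

In this sketch the check is skipped when no context is supplied. If every entry point in your pipeline should enforce the allow-list, raising when the context is absent is the stricter option.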

Emitting and checking the JSON Schema

pydantic v2 emits Draft 2020-12 JSON Schema directly. We can then validate raw, untrusted JSON payloads with jsonschema’s Draft202012Validator — useful when input arrives from another system that does not import your Python models. Always attach a FormatChecker, because format assertions are advisory by default and silently ignored otherwise.

import json

from jsonschema import Draft202012Validator
from jsonschema.validators import validator_for


def build_submission_schema() -> dict:
    """Generate a Draft 2020-12 JSON Schema for the Submission model."""
    return Submission.model_json_schema()


def make_validator(schema: dict) -> Draft202012Validator:
    """Build a draft-pinned validator with format assertions enforced.

    check_schema confirms the emitted schema is valid against its
    meta-schema before we use it. pydantic's output typically carries no
    "$schema" key, so validator_for falls back to the default we pass
    here; the runtime validator is pinned explicitly to Draft 2020-12.
    """
    validator_cls = validator_for(schema, default=Draft202012Validator)
    validator_cls.check_schema(schema)
    return Draft202012Validator(
        schema, format_checker=Draft202012Validator.FORMAT_CHECKER
    )


def validate_payload(payload: dict, schema: dict) -> list[str]:
    """Return human-readable error strings; empty list means valid.

    Errors are sorted by JSON path so the report is deterministic and
    digestible by an error-categorization stage downstream.
    """
    validator = make_validator(schema)
    errors = sorted(validator.iter_errors(payload), key=lambda e: list(e.absolute_path))
    return [f"{'/'.join(map(str, e.absolute_path)) or '<root>'}: {e.message}" for e in errors]

Returning all errors via iter_errors (rather than raising on the first) is what lets a categorization layer group failures by type and location. That handoff is covered in Schema Validation & Error Categorization; here we just make sure the validator surfaces every problem with a stable path.

Computing leaf checksums deterministically

Checksums are integrity facts, not free text. Compute them from the actual file bytes with a streaming read so large clinical PDFs do not exhaust memory, and store the algorithm alongside the digest so verification is unambiguous.

def compute_checksum(path: Path, algorithm: ChecksumAlgo) -> FileChecksum:
    """Stream a file through the configured hash and return a FileChecksum.

    Reads in fixed-size chunks to bound memory on large submission documents.
    """
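    # Map enum members to hashlib names explicitly so a newly added member
    # fails loudly (KeyError) instead of silently falling back to MD5.
    hashlib_names = {ChecksumAlgo.MD5: "md5", ChecksumAlgo.SHA256: "sha256"}
    digest = hashlib.new(hashlib_names[algorithm])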
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return FileChecksum(algorithm=algorithm, digest=digest.hexdigest())

The algorithm is passed in, never assumed: the required checksum algorithm is part of the region/version validation criteria, so resolve it from config and feed it here.
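
The same streaming approach works in the other direction, when you need to confirm that a file still matches its recorded digest. A minimal standard-library sketch, assuming the algorithm is given by its hashlib name; verify_file and HEX_LENGTH are hypothetical names introduced for this example:

```python
import hashlib
import hmac
import tempfile
from pathlib import Path

# Hex digest lengths implied by each algorithm, keyed by hashlib name.
HEX_LENGTH = {"md5": 32, "sha256": 64}


def verify_file(path: Path, algorithm: str, expected_hex: str) -> bool:
    """Recompute a file's digest and compare it to the recorded value."""
    if algorithm not in HEX_LENGTH:
        raise ValueError(f"unsupported algorithm {algorithm!r}")
    expected = expected_hex.lower()
    if len(expected) != HEX_LENGTH[algorithm]:
        return False  # digest length does not fit the declared algorithm
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.7 placeholder bytes")
    recorded = hashlib.sha256(sample.read_bytes()).hexdigest()
    matches = verify_file(sample, "sha256", recorded)
    tampered = verify_file(sample, "sha256", "0" * 64)
    wrong_length = verify_file(sample, "sha256", recorded[:32])
```

The length check catches a digest recorded under the wrong algorithm name before any bytes are read, which the pattern and minimum-length constraints on the digest field alone would not.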

Putting it together

A minimal end-to-end build-and-validate flow looks like this. It constructs a typed Submission, dumps it to JSON, generates the schema, and validates the round-tripped payload, which shows that the model and the emitted schema agree for this instance.

def example() -> None:
    leaf = NewLeaf(
        leaf_id="m1.cover.001",
        title="Cover Letter",
        href="m1/us/cover-letter.pdf",
        checksum=FileChecksum(
            algorithm=ChecksumAlgo.SHA256,
            digest="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        ),
    )
    submission = Submission(
        ectd_spec_version="us-regional-3.3",  # configurable, not hardcoded
        sequence="0000",
        module_1=FDAModule1(
            submission_type="ind",
            application_number="123456",
            cover_letter=leaf,
        ),
        harmonized_modules=[HarmonizedModule(module_number=3)],
    )

    schema = build_submission_schema()
    payload = json.loads(submission.model_dump_json())
    errors = validate_payload(payload, schema)
    assert not errors, errors
    print(f"valid submission, sequence {submission.sequence}")


if __name__ == "__main__":
    example()

The ectd_spec_version string above ("us-regional-3.3") is a placeholder token your configuration supplies; do not treat it as an authoritative version identifier. The model neither asserts nor invents a specific regulatory version — it records whatever your region/version config declares and lets the exporter and the official validation tool enforce the authority’s current criteria.

Validation and design checklist

  • extra="forbid" on every model so the schema emits additionalProperties: false.
  • Lifecycle operations modeled as a discriminated union; replace/append/delete require modified_leaf_id, new forbids it.
  • Module 1 modeled per region behind a region discriminator; Modules 2-5 share one harmonized shape.
  • eCTD spec version, region rules, and checksum algorithm read from configuration, never hardcoded.
  • Draft202012Validator built via validator_for + check_schema, with a FormatChecker attached.
  • Leaf href is submission-relative with no traversal segments.
  • Checksums computed from real file bytes via streaming reads, with algorithm stored alongside the digest.
  • Validation returns all errors with stable JSON paths for downstream categorization.
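
The first checklist item is easy to regress when a new model is added, so it can be worth asserting against the emitted schema in a test. A sketch under simplifying assumptions: Strict, Loose, and Root are hypothetical stand-in models, and open_objects only inspects the root and the top-level $defs entries, which is where pydantic places nested model definitions.

```python
from pydantic import BaseModel, ConfigDict


class Strict(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str


class Loose(BaseModel):  # deliberately missing extra="forbid"
    name: str


class Root(BaseModel):
    model_config = ConfigDict(extra="forbid")
    strict: Strict
    loose: Loose


def open_objects(schema: dict) -> list[str]:
    """Names of object schemas that do not set additionalProperties: false."""
    candidates = {"<root>": schema, **schema.get("$defs", {})}
    return sorted(
        name
        for name, sub in candidates.items()
        if sub.get("type") == "object"
        and sub.get("additionalProperties") is not False
    )


offenders = open_objects(Root.model_json_schema())
```

Run against the real root model, an empty result means every model in the tree forbids unknown keys.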

FAQ

Does this JSON model replace the eCTD XML backbone?

No. The eCTD transport remains the regulator-specified XML backbone with its DTDs. This model is the upstream, machine-checkable source of truth that an exporter renders into that backbone. Validating the JSON early catches structural and metadata errors before they reach official validation.

Why use a discriminated union instead of a single Module 1 model with optional fields?

Because Module 1 genuinely differs by region. A single model full of optional fields cannot express “application_number is required for FDA but meaningless for EMA,” so invalid combinations pass silently. A discriminator keyed on region makes each shape exact and produces clear validation errors.

Where do the real version numbers and validation criteria come from?

From each authority’s published region-specific specification and validation criteria, resolved through your regulatory data dictionary at runtime. This article keeps those values configurable on purpose so the model stays correct as specifications are revised. See Regulatory Data Dictionary Construction.

How should I surface the validation errors this produces?

Collect them with iter_errors, keep the JSON path stable, then group and prioritize them in a dedicated categorization stage. That pattern is detailed in Schema Validation & Error Categorization.