FDA/EMA Submission Schema Design

Designing the data schemas that turn internal clinical-trial records into FDA eCTD and EMA CTD submissions: how to model the five CTD modules, keep Module 1 region-specific, validate against a strict contract, and map your operational data into a submission-ready structure without manual reconciliation.

A clinical submission is not a folder of PDFs — it is a structured electronic dossier governed by the ICH Common Technical Document (CTD) and its electronic implementation, the electronic Common Technical Document (eCTD). When you automate site activation and regulatory filing, the schema you design becomes the contract that every downstream document, validation gate, and routing decision depends on. Get the schema right and the rest of the pipeline becomes deterministic; get it wrong and you inherit silent data drift, rejected sequences, and audit findings.

This article is part of the Core Architecture & Regulatory Mapping for Clinical Trials series. It maps the design space; the deep, code-first walkthrough lives in the companion article, Building FDA eCTD-compliant JSON schemas for clinical trials.

The CTD as a data model

The CTD organizes a marketing or investigational application into five modules. Modules 2 through 5 are harmonized across ICH regions (the United States, the European Union, and Japan), while Module 1 is region-specific — its contents and structure are defined by each regulator, not by ICH. That single fact drives the most important design decision in submission schema work: a shared, jurisdiction-agnostic core (Modules 2–5) plus a swappable regional envelope (Module 1).

flowchart TD
    CTD[CTD submission] --> M1[Module 1 Regional administrative]
    CTD --> M2[Module 2 Summaries and overviews]
    CTD --> M3[Module 3 Quality]
    CTD --> M4[Module 4 Nonclinical study reports]
    CTD --> M5[Module 5 Clinical study reports]
    M1 --> M1FDA[FDA Module 1 cover forms and US administrative]
    M1 --> M1EMA[EMA Module 1 EU application form and product information]
    M2 --> M2Q[Quality overall summary]
    M2 --> M2N[Nonclinical overview]
    M2 --> M2C[Clinical overview and summaries]
    M5 --> M5CSR[Clinical study reports]
    M5 --> M5LIST[Case report forms and listings]

A practical way to read this for schema purposes:

| Module | Scope | Harmonized? | Schema implication |
| --- | --- | --- | --- |
| 1 | Regional administrative information and prescribing/product information | No (region-specific) | Model as a discriminated variant keyed on jurisdiction |
| 2 | CTD summaries (quality, nonclinical, clinical overviews) | Yes | Shared base model |
| 3 | Quality (chemistry, manufacturing, controls) | Yes | Shared base model |
| 4 | Nonclinical study reports | Yes | Shared base model |
| 5 | Clinical study reports | Yes | Shared base model |
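The table can be carried into code as a small lookup, which is sometimes handy for deciding whether a given path needs the regional schema. The following is a minimal sketch using only the standard library; the names CtdModule, module_for_path, and needs_regional_schema are hypothetical helpers introduced for illustration, not part of any eCTD tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtdModule:
    number: int
    scope: str
    harmonized: bool  # False only for the region-specific Module 1


# Mirrors the table above; the scope strings are shortened.
CTD_MODULES: dict[int, CtdModule] = {
    1: CtdModule(1, "Regional administrative information", harmonized=False),
    2: CtdModule(2, "CTD summaries", harmonized=True),
    3: CtdModule(3, "Quality", harmonized=True),
    4: CtdModule(4, "Nonclinical study reports", harmonized=True),
    5: CtdModule(5, "Clinical study reports", harmonized=True),
}


def module_for_path(module_path: str) -> CtdModule:
    """Resolve a path such as 'm5/...' to its module entry (hypothetical helper)."""
    root = module_path.split("/", 1)[0]
    if len(root) != 2 or root[0] != "m" or not root[1].isdigit():
        raise ValueError(f"not a CTD module path: {module_path!r}")
    number = int(root[1])
    if number not in CTD_MODULES:
        raise ValueError(f"unknown CTD module m{number} in {module_path!r}")
    return CTD_MODULES[number]


def needs_regional_schema(module_path: str) -> bool:
    """True when the path belongs to the region-specific Module 1."""
    return not module_for_path(module_path).harmonized
```

Keeping the harmonized flag in data rather than in scattered if-statements means the "shared core plus regional envelope" split is stated exactly once.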

eCTD is the electronic format used to assemble and exchange these modules with regulators. The FDA uses the Electronic Submissions Gateway (ESG) as its transmission channel and has long required eCTD format for many application types; the EMA and EU national agencies likewise mandate electronic submission. Newer eCTD specifications exist (eCTD v4.0 alongside the long-established v3.2.2), but adoption and required versions differ by region and application type, so treat the target version as configuration, never as a hard-coded constant. Confirm the current requirement for each submission rather than assuming one version number applies everywhere.
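One way to keep the target version out of the code is to load it from the environment and refuse to run without it. This is a sketch under stated assumptions: the SubmissionTarget type, the load_target function, and the variable names SUBMISSION_REGION and ECTD_VERSION are all hypothetical, and no default version is supplied on purpose.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

SUPPORTED_REGIONS = {"FDA", "EMA"}


@dataclass(frozen=True)
class SubmissionTarget:
    region: str
    ectd_version: str  # opaque string; the required value is confirmed per submission


def load_target(env: Optional[Mapping[str, str]] = None) -> SubmissionTarget:
    """Read the target from configuration (hypothetical variable names).

    There is deliberately no fallback version: a missing setting is an error.
    """
    source = os.environ if env is None else env
    try:
        region = source["SUBMISSION_REGION"]
        version = source["ECTD_VERSION"]
    except KeyError as exc:
        raise RuntimeError(f"missing required setting: {exc.args[0]}") from exc
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    return SubmissionTarget(region=region, ectd_version=version)
```

Passing a plain mapping instead of relying on os.environ makes the loader easy to exercise in tests and in per-submission job definitions.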

A layered schema strategy

Treat the submission as a typed tree, not a free-form bag of metadata. Three layers keep the design honest:

  • Core layer — fields common to every CTD submission: a stable document identifier, the module path, a semantic document version, lifecycle operation (new, replace, append, delete), checksum, and a media type. These never vary by region.
  • Regional layer — the Module 1 envelope, modeled as a discriminated union so the validator selects the correct sub-schema from a jurisdiction discriminator. This is where FDA cover forms and EU application-form fields live, and where they stay isolated from the core.
  • Mapping layer — the translation from your internal systems (CTMS, EDC, document management) into core + regional fields. This is the part teams under-invest in, and the part that causes the most rework.

The goal is that adding a new region means adding one Module 1 variant — not rewriting validation for Modules 2–5.
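Before reaching for a validation library, the layering can be illustrated with plain dictionaries. The sketch below is an assumption-laden simplification: the function names and the REGIONAL_CHECKS registry are hypothetical, and each regional check tests only a single field. It shows the shape of the idea, where core checks never branch on region and a new region is one new registry entry.

```python
from typing import Any, Callable

# Core layer: fields every leaf carries, regardless of region.
CORE_FIELDS = (
    "document_id", "module_path", "version",
    "operation", "checksum_sha256", "media_type",
)


def check_core(leaf: dict[str, Any]) -> list[str]:
    """Report missing core fields; contains no region-specific logic."""
    return [f"missing core field: {name}" for name in CORE_FIELDS if name not in leaf]


def check_fda_module1(module1: dict[str, Any]) -> list[str]:
    if "application_number" in module1:
        return []
    return ["FDA Module 1 requires application_number"]


def check_ema_module1(module1: dict[str, Any]) -> list[str]:
    if "eu_procedure_number" in module1:
        return []
    return ["EMA Module 1 requires eu_procedure_number"]


# Regional layer: supporting another region means adding one entry here.
REGIONAL_CHECKS: dict[str, Callable[[dict[str, Any]], list[str]]] = {
    "FDA": check_fda_module1,
    "EMA": check_ema_module1,
}


def check_submission(submission: dict[str, Any]) -> list[str]:
    """Run the regional check chosen by the discriminator, then the core checks."""
    module1 = submission["module1"]
    jurisdiction = module1.get("jurisdiction")
    regional_check = REGIONAL_CHECKS.get(jurisdiction)
    if regional_check is None:
        problems = [f"unsupported jurisdiction: {jurisdiction!r}"]
    else:
        problems = regional_check(module1)
    for leaf in submission["leaves"]:
        problems.extend(check_core(leaf))
    return problems
```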

Modeling the schema in Pydantic v2

The following uses current Pydantic v2 APIs: model_config = ConfigDict(...), field_validator, Annotated[..., Field(pattern=...)], and a discriminated union via Field(discriminator=...). It models the core node plus region-specific Module 1 variants.

"""CTD/eCTD submission schema (Pydantic v2).

Models a jurisdiction-agnostic core for Modules 2-5 plus a region-specific
Module 1 envelope selected by a discriminator. Designed to be the single
source of truth that internal data is mapped into before validation.
"""
from __future__ import annotations

from datetime import datetime, timezone
from enum import Enum
from typing import Annotated, Literal, Union
from uuid import uuid4

from pydantic import BaseModel, ConfigDict, Field, field_validator


class Jurisdiction(str, Enum):
    FDA = "FDA"
    EMA = "EMA"


class Operation(str, Enum):
    """eCTD lifecycle operations for a leaf document."""
    NEW = "new"
    REPLACE = "replace"
    APPEND = "append"
    DELETE = "delete"


# Module path like "m1/us/...", "m3/...", "m5/..." -- Module 1 is regional.
ModulePath = Annotated[str, Field(pattern=r"^m[1-5](/[a-z0-9._-]+)+$")]
SemVer = Annotated[str, Field(pattern=r"^\d+\.\d+\.\d+$")]
Sha256Hex = Annotated[str, Field(pattern=r"^[0-9a-f]{64}$")]


class FdaModule1(BaseModel):
    """FDA region-specific Module 1 administrative envelope."""
    model_config = ConfigDict(extra="forbid")

    jurisdiction: Literal[Jurisdiction.FDA] = Jurisdiction.FDA
    application_number: Annotated[str, Field(pattern=r"^(IND|NDA|BLA|ANDA)\d{4,6}$")]
    cover_form_present: bool = True


class EmaModule1(BaseModel):
    """EMA/EU region-specific Module 1 administrative envelope."""
    model_config = ConfigDict(extra="forbid")

    jurisdiction: Literal[Jurisdiction.EMA] = Jurisdiction.EMA
    eu_procedure_number: Annotated[str, Field(min_length=3, max_length=64)]
    application_form_present: bool = True


# Discriminated union: the validator picks the variant from `jurisdiction`.
Module1 = Annotated[Union[FdaModule1, EmaModule1], Field(discriminator="jurisdiction")]


class SubmissionLeaf(BaseModel):
    """A single eCTD leaf document plus its lifecycle metadata."""
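    # validate_default=True makes the Operation.NEW default get stored as its
    # string value too; use_enum_values only converts values that are validated.
    model_config = ConfigDict(extra="forbid", use_enum_values=True, validate_default=True)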

    document_id: str = Field(default_factory=lambda: str(uuid4()))
    module_path: ModulePath
    version: SemVer
    operation: Operation = Operation.NEW
    checksum_sha256: Sha256Hex
    media_type: str = "application/pdf"
    effective_date: datetime

    @field_validator("effective_date")
    @classmethod
    def _must_be_tz_aware(cls, value: datetime) -> datetime:
        """Reject naive datetimes so audit timestamps are unambiguous (UTC)."""
        if value.tzinfo is None:
            raise ValueError("effective_date must be timezone-aware (use UTC)")
        return value.astimezone(timezone.utc)


class Submission(BaseModel):
    """Top-level submission: regional Module 1 + harmonized leaves (M2-M5)."""
    model_config = ConfigDict(extra="forbid")

    sequence: Annotated[str, Field(pattern=r"^\d{4}$")]
    module1: Module1
    leaves: list[SubmissionLeaf] = Field(min_length=1)

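    @field_validator("leaves")
    @classmethod
    def _module1_leaves_match_region(cls, leaves: list[SubmissionLeaf], info) -> list[SubmissionLeaf]:
        """Module 1 leaves must sit under the regional folder matching `module1`."""
        # `info` is Pydantic's ValidationInfo; `module1` is absent from
        # info.data when it failed its own validation, so skip the check then.
        module1 = info.data.get("module1")
        if module1 is None:
            return leaves
        # Regional folder names assumed here: m1/us for FDA, m1/eu for EMA.
        region_root = "m1/us/" if isinstance(module1, FdaModule1) else "m1/eu/"
        for leaf in leaves:
            if leaf.module_path.startswith("m1/") and not leaf.module_path.startswith(region_root):
                raise ValueError(f"{leaf.module_path!r} is outside {region_root!r}")
        return leaves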

Two design notes worth calling out:

  • extra="forbid" makes the schema closed. Unknown fields raise an error instead of being silently dropped — essential for ALCOA+ completeness and traceability.
  • The discriminated union is the mechanism that keeps Module 1 region-specific. The same Submission model validates both an FDA and an EMA dossier; only the module1 branch differs.
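Both notes can be seen in a trimmed-down form. The sketch below is not the full schema: FdaEnvelope, EmaEnvelope, and Dossier are hypothetical stand-ins that keep only the discriminator and one field each, and they use plain string literals rather than the Jurisdiction enum to stay short. The error type strings are the ones Pydantic v2 reports for these two failure modes.

```python
from typing import Annotated, Any, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class FdaEnvelope(BaseModel):
    model_config = ConfigDict(extra="forbid")
    jurisdiction: Literal["FDA"]
    application_number: str


class EmaEnvelope(BaseModel):
    model_config = ConfigDict(extra="forbid")
    jurisdiction: Literal["EMA"]
    eu_procedure_number: str


class Dossier(BaseModel):
    model_config = ConfigDict(extra="forbid")
    sequence: str
    # The validator reads `jurisdiction` first, then applies only that variant.
    module1: Annotated[Union[FdaEnvelope, EmaEnvelope], Field(discriminator="jurisdiction")]


def error_types(payload: dict[str, Any]) -> list[str]:
    """Return the Pydantic error type codes for a payload (empty list == valid)."""
    try:
        Dossier.model_validate(payload)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


fda_payload = {
    "sequence": "0001",
    "module1": {"jurisdiction": "FDA", "application_number": "NDA123456"},
}
ema_payload = {
    "sequence": "0001",
    "module1": {"jurisdiction": "EMA", "eu_procedure_number": "EXAMPLE-0001"},
}

# Same top-level model, different regional branch.
print(type(Dossier.model_validate(fda_payload).module1).__name__)  # FdaEnvelope
print(type(Dossier.model_validate(ema_payload).module1).__name__)  # EmaEnvelope

# A closed schema rejects an unknown field instead of dropping it.
print(error_types({**fda_payload, "notes": "internal only"}))  # ['extra_forbidden']
```

An unrecognized jurisdiction fails at the discriminator itself, so the error points at the tag rather than listing every field of every variant.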

Validating against a JSON Schema contract

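Pydantic enforces structure at the Python application layer. For interoperability (sharing the contract with a vendor, a partner CRO, or a non-Python validator), emit a JSON Schema and validate raw payloads with the jsonschema library. Pydantic v2 generates a 2020-12 dialect schema directly. Only declarative constraints (types, patterns, enums, required fields, closed objects) carry over; logic inside field_validator functions, such as the timezone check, is not part of the emitted contract and still runs only in Python.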

"""Validate raw submission payloads against the JSON Schema contract.

Pydantic emits the contract; `jsonschema` validates arbitrary JSON without
needing to import the model. Errors are collected, not raised one-at-a-time.
"""
from functools import lru_cache
from typing import Any

from jsonschema import Draft202012Validator

from submission_schema import Submission  # the models defined above


@lru_cache(maxsize=1)
def build_validator() -> Draft202012Validator:
    """Compile the validator once from the Pydantic-generated contract and reuse it."""
    schema = Submission.model_json_schema()
    Draft202012Validator.check_schema(schema)  # fail fast on a bad contract
    return Draft202012Validator(schema)


def validate_payload(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a list of categorized validation errors (empty list == valid)."""
    validator = build_validator()
    findings: list[dict[str, Any]] = []
    for error in sorted(validator.iter_errors(payload), key=lambda e: list(e.path)):
        findings.append({
            "field_path": "/".join(str(p) for p in error.path) or "<root>",
            "message": error.message,
            "validator": error.validator,
            "severity": "CRITICAL",  # structural failures block routing
        })
    return findings

Collecting all errors with iter_errors (rather than raising on the first) gives regulatory reviewers a complete, machine-readable list in one pass — which is what error categorization downstream depends on. For the full taxonomy of how to classify and route these findings, see Schema Validation & Error Categorization.

Mapping internal data into the submission format

The mapping layer is where most schema projects succeed or fail. Your CTMS, EDC, and document management systems do not speak CTD; they speak study IDs, site numbers, and document categories. A mapping function must translate those into module paths, lifecycle operations, and checksums — and it must be the only place that translation happens.

"""Map an internal document record into a validated CTD submission leaf."""
import hashlib
from datetime import datetime, timezone
from pathlib import Path

from submission_schema import Operation, SubmissionLeaf

# Internal document categories -> CTD module paths. Keep this table version
# controlled; it is the contract between operational systems and the dossier.
CATEGORY_TO_MODULE: dict[str, str] = {
    "clinical_overview": "m2/clinical-overview",
    "quality_summary": "m2/quality-overall-summary",
    "drug_substance": "m3/3-2-s/drug-substance",
    "nonclinical_report": "m4/4-2/study-report",
    "clinical_study_report": "m5/5-3-5/clinical-study-report",
}


def sha256_of(path: Path) -> str:
    """Stream a file to compute its SHA-256 checksum without loading it fully."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def map_record_to_leaf(record: dict, source_file: Path) -> SubmissionLeaf:
    """Translate an internal document record into a validated CTD leaf.

    Raises:
        KeyError: if the record's category has no CTD mapping (fail loud,
            never guess a module path).
        pydantic.ValidationError: if the mapped values violate the schema.
    """
    module_root = CATEGORY_TO_MODULE[record["category"]]
    return SubmissionLeaf(
        # Reuse the internal identifier so the leaf stays traceable to its
        # source record instead of receiving a freshly generated UUID.
        document_id=str(record["document_id"]),
        module_path=f"{module_root}/{record['document_id']}.pdf".replace("//", "/"),
        version=record["version"],
        operation=Operation(record.get("operation", "new")),
        checksum_sha256=sha256_of(source_file),
        effective_date=datetime.now(timezone.utc),
    )

Two rules keep this safe: an unmapped category raises rather than guessing a destination, and the checksum is computed from the actual file bytes so the validated artifact and the transmitted artifact are provably identical. The controlled vocabulary behind CATEGORY_TO_MODULE should be governed centrally — see Regulatory Data Dictionary Construction for how to maintain that mapping table, and Regulatory Taxonomy Standardization for keeping categories consistent across global sites.
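The "provably identical" claim only holds if the checksum is recomputed just before transmission and compared with the recorded value. A minimal sketch of that re-check follows; unchanged_since_validation is a hypothetical helper name, and the file contents are placeholder bytes rather than a real PDF.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file to compute its SHA-256 checksum without loading it fully."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_since_validation(path: Path, recorded_checksum: str) -> bool:
    """True when the file on disk still matches the checksum stored on the leaf."""
    return sha256_of(path) == recorded_checksum.lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        artifact = Path(workdir) / "csr-001.pdf"
        artifact.write_bytes(b"placeholder bytes at validation time")
        recorded = sha256_of(artifact)
        print(unchanged_since_validation(artifact, recorded))  # True

        # Any edit after validation changes the digest and blocks the send.
        artifact.write_bytes(b"placeholder bytes edited afterwards")
        print(unchanged_since_validation(artifact, recorded))  # False
```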

Versioning and schema evolution

Trials run for years; your schema will change underneath them. Govern it like an API:

  • Apply semantic versioning to the schema itself, not just to documents.
  • Make the target eCTD version and region a configuration input, never a literal in code.
  • Add new fields as optional with defaults; promote to required only in a major version.
  • Keep deprecated fields declared as optional, and emit a deprecation finding (not a hard failure) when they appear, so historical sequences still validate even with extra="forbid".
  • Pin the JSON Schema dialect (here, 2020-12) and check it with check_schema in CI.
  • Keep extra="forbid" so additive drift is caught immediately.

This discipline lets a five-year-old sequence and a brand-new one validate against compatible contracts without forking the codebase.
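Two of these rules are easy to sketch with the standard library. The compatibility policy shown (same major version, payload no newer than the contract) is one possible interpretation of semantic versioning, not a prescribed one, and the field name legacy_doc_code is purely hypothetical.

```python
from typing import Any

# Hypothetical deprecated field -> schema version in which it was deprecated.
DEPRECATED_FIELDS: dict[str, str] = {"legacy_doc_code": "2.0.0"}


def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(contract_version: str, payload_version: str) -> bool:
    """Assumed policy: same major version, and the payload is not newer."""
    contract = parse_version(contract_version)
    payload = parse_version(payload_version)
    return contract[0] == payload[0] and payload <= contract


def deprecation_findings(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Emit WARNING findings for legacy fields instead of failing validation."""
    return [
        {
            "field_path": name,
            "message": f"deprecated since schema {since}",
            "severity": "WARNING",
        }
        for name, since in sorted(DEPRECATED_FIELDS.items())
        if name in payload
    ]
```

The findings use the same keys as the structural findings earlier (field_path, message, severity), so a single report can carry both blocking and non-blocking results.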

Where this fits in the pipeline

Schema design is the front of a larger automation chain. Validated submissions still need to reach a regulator, and gateways have outages, so design your routing to degrade gracefully, as covered in Fallback Routing for Portal Outages. Because submissions carry regulated content, every transformation must also respect the data-protection model in Security Boundaries for Clinical Data. When you are ready to implement the schema end to end with worked examples, continue to the companion article, Building FDA eCTD-compliant JSON schemas for clinical trials.

FAQ

What is the difference between CTD and eCTD?

The CTD is the content and organization standard — the agreed five-module structure for a regulatory dossier defined by ICH. The eCTD is the electronic format used to assemble, transmit, and lifecycle-manage that content with a regulator. You design your schema against the CTD module structure, then render and submit it in eCTD format.

Why is Module 1 modeled separately from Modules 2 through 5?

Because Modules 2–5 are harmonized across ICH regions, while Module 1 is region-specific: each regulator defines its own administrative and product-information requirements. Modeling Module 1 as a discriminated variant keyed on jurisdiction lets a single core schema serve multiple regions, with only the regional envelope swapped per target.

Should I hard-code the eCTD version in my schema?

No. Required eCTD versions and specifications differ by region and application type and change over time. Treat the target version and region as configuration so a single codebase can produce compliant output for different regulators and different submission types without code changes.

How does this support 21 CFR Part 11 and data integrity?

A closed schema (extra="forbid"), timezone-aware timestamps, file-derived SHA-256 checksums, and a single governed mapping table support the ALCOA+ principles (attributable, legible, contemporaneous, original, accurate, plus complete, consistent, enduring, and available) and give you the provenance and completeness controls that Part 11 and EU GMP Annex 11 expect. The schema is the foundation; audit trails and e-signatures are layered on top in the routing and submission stages.