Core Architecture & Regulatory Mapping for Clinical Trials

Clinical trial site activation and regulatory submission automation sits where rigid ICH, FDA, and EMA mandates meet a distributed data ecosystem. This pillar maps the reference architecture, the regulatory data model, and the Python patterns that turn 21 CFR Part 11, ALCOA+, and eCTD requirements into a deterministic, audit-ready system.

The hard problem is not moving files. It is guaranteeing that every transformation, signature, and routing decision is reproducible, attributable, and defensible during an inspection. The sections below decompose that problem into seven domains, each backed by a dedicated cluster, and show how they fit into one coherent platform. For the document-handling half of the same platform — parsing, OCR, schema validation, and batch ingestion — see the sibling pillar on Automated Document Ingestion & Validation Workflows.

How to read this pillar

Domain	What it governs	Cluster
Submission schema	eCTD structure, JSON/XML backbone, file-level validation	FDA/EMA Submission Schema Design
Taxonomy	Controlled vocabularies, code lists, cross-jurisdiction mapping	Regulatory Taxonomy Standardization
Data dictionary	Field-level definitions, value sets, lineage	Regulatory Data Dictionary Construction
Site readiness	Feasibility, infrastructure, GCP qualification gates	Clinical Site Readiness Assessment Frameworks
IRB/ethics	Submission state machine, human-in-the-loop review	IRB/Ethics Workflow Mapping
Resilience	Portal timeouts, retries, fallback routing	Fallback Routing for Portal Outages
Security	Network segmentation, PHI isolation, zero trust	Security Boundaries for Clinical Data

Reference architecture

A maintainable clinical platform abandons the single shared database in favor of layered, event-driven services joined by an append-only audit log. Inputs — site qualification packets, protocol amendments, IRB approvals, regulatory clearances — enter an ingestion layer over authenticated APIs or SFTP, where each artifact is hashed and stamped with provenance metadata before anything else touches it. A normalization layer maps heterogeneous payloads onto the canonical model defined by the data dictionary. A validation layer applies deterministic, jurisdiction-aware rules. A submission layer assembles the eCTD sequence and routes it to the correct portal.
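
The ingestion step described above can be sketched with the standard library alone. This is a minimal illustration, not the platform's actual interface: the Provenance record, its field names, and the stamp_artifact function are assumptions introduced here.

```python
"""Sketch: hash an inbound artifact and attach provenance metadata."""
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record; field names are assumptions."""

    sha256: str
    size_bytes: int
    source_id: str  # authenticated sender, for example a site or sponsor identity
    channel: str  # "api" or "sftp"
    received_at: datetime


def stamp_artifact(payload: bytes, source_id: str, channel: str) -> Provenance:
    """Hash the raw bytes before any parsing or normalization touches them."""
    if channel not in {"api", "sftp"}:
        raise ValueError(f"Unknown ingestion channel: {channel!r}")
    return Provenance(
        sha256=hashlib.sha256(payload).hexdigest(),
        size_bytes=len(payload),
        source_id=source_id,
        channel=channel,
        received_at=datetime.now(timezone.utc),  # server-side clock only
    )
```

Hashing the untouched bytes first means every later derivation can be tied back to the original capture by its digest.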

flowchart LR
    subgraph Edge
        SRC[Sites and sponsors]
    end
    subgraph Platform
        I[Ingestion and hashing]
        N[Normalization to canonical model]
        V[Validation engine]
        S[Submission assembly]
        R[Routing and fallback]
    end
    subgraph Authorities
        FDA[FDA ESG eCTD]
        EMA[EMA gateway]
    end
    SRC --> I --> N --> V --> S --> R
    R --> FDA
    R --> EMA
    AUD[Append-only audit log]
    I -. records .-> AUD
    N -. records .-> AUD
    V -. records .-> AUD
    S -. records .-> AUD
    R -. records .-> AUD

Two properties make this topology compliant rather than merely tidy. First, the audit log is write-once: every layer emits an event but no layer can mutate a prior one, which is what makes the trail trustworthy under 21 CFR Part 11. Second, the boundaries between layers are also trust boundaries — PHI never crosses into a layer that does not need it, a constraint elaborated in Security Boundaries for Clinical Data.
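
One way to make a trust boundary concrete is a per-layer field allow-list applied before a record is handed on. The sketch below is illustrative only: the layer names, the field names, and the project_for_layer function are assumptions, and real allow-lists would be derived from the data dictionary.

```python
"""Sketch: project a record onto the fields a layer is allowed to see."""
from __future__ import annotations

# Hypothetical allow-lists; real ones would come from the data dictionary.
LAYER_FIELDS: dict[str, frozenset[str]] = {
    "normalization": frozenset({"site_id", "document_type", "subject_initials"}),
    "submission": frozenset({"site_id", "document_type"}),
}


def project_for_layer(record: dict[str, object], layer: str) -> dict[str, object]:
    """Return only the fields the target layer may receive.

    Unknown layers raise rather than defaulting to 'everything', so a new
    service cannot receive PHI by omission.
    """
    if layer not in LAYER_FIELDS:
        raise KeyError(f"No field allow-list defined for layer {layer!r}")
    allowed = LAYER_FIELDS[layer]
    return {key: value for key, value in record.items() if key in allowed}
```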

ALCOA+ as a design contract

ALCOA+ is the data-integrity standard regulators apply to records and is the most useful checklist for architecture decisions. Treat each attribute as a non-functional requirement rather than a policy slogan:

Attributable — every event carries an authenticated actor and timestamp
Legible — records are human-readable and machine-parseable (UTF-8, ISO 8601)
Contemporaneous — events are written at the moment of action, server-side
Original — the first capture is preserved; derivations link back to it
Accurate — validated against the data dictionary before persistence
Complete — including failed attempts, retries, and overrides
Consistent — ordered, with timezone-aware, monotonic sequencing
Enduring — retained per protocol and archival policy
Available — retrievable for inspection without reconstruction

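A practical way to encode the contract is a typed, immutable audit event. The model below uses Pydantic v2 and defaults the timestamp to the server's UTC clock; the write path should construct events without accepting a recorded_at value from any client: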

"""Append-only audit event for a 21 CFR Part 11 compliant clinical platform."""
from __future__ import annotations

import hashlib
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field


class AuditAction(str, Enum):
    INGESTED = "ingested"
    NORMALIZED = "normalized"
    VALIDATED = "validated"
    SUBMITTED = "submitted"
    REROUTED = "rerouted"


class AuditEvent(BaseModel):
    """One immutable entry in the audit trail.

    The event is frozen after construction so application code cannot
    mutate it in place (ALCOA+ 'original'). recorded_at defaults to the
    server's UTC clock (ALCOA+ 'contemporaneous'); callers must not pass
    a client-supplied value.
    """

    model_config = ConfigDict(frozen=True)

    actor_id: str = Field(..., min_length=1, description="Authenticated user or service identity")
    action: AuditAction
    artifact_sha256: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    recorded_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    prev_hash: str = Field(default="", pattern=r"^([0-9a-f]{64})?$")

    def chain_hash(self) -> str:
        """Hash this event together with its predecessor.

        Linking each entry to the previous hash turns the log into a
        tamper-evident chain: altering any record invalidates every hash
        after it.
        """
        payload = "|".join(
            (
                self.prev_hash,
                self.actor_id,
                self.action.value,
                self.artifact_sha256,
                self.recorded_at.isoformat(),
            )
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
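
To show how the chain is checked, the sketch below walks a list of entries and reports the first broken link. It uses a standard-library stand-in for AuditEvent so it runs without Pydantic; the Entry class and the first_broken_link function are assumptions introduced for illustration, and the hashed field order mirrors chain_hash above.

```python
"""Sketch: build and verify a hash chain over audit entries (stdlib only)."""
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Stand-in for AuditEvent; recorded_at is an ISO 8601 string here."""

    prev_hash: str
    actor_id: str
    action: str
    artifact_sha256: str
    recorded_at: str

    def chain_hash(self) -> str:
        """Hash the same fields, in the same order, as AuditEvent.chain_hash."""
        payload = "|".join(
            (self.prev_hash, self.actor_id, self.action,
             self.artifact_sha256, self.recorded_at)
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def first_broken_link(entries: list[Entry]) -> int | None:
    """Return the index of the first entry whose prev_hash is wrong, or None."""
    expected = ""  # the first entry has no predecessor
    for index, entry in enumerate(entries):
        if entry.prev_hash != expected:
            return index
        expected = entry.chain_hash()
    return None
```

A change to any entry surfaces as a mismatch on the entry that follows it. The last entry has no successor, so its hash would need to be anchored somewhere outside the log for tampering at the tail to be detectable.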

Regulatory mapping: taxonomy, dictionary, schema

Most rejected submissions fail on data semantics, not transport. The fix is three coordinated layers of regulatory metadata, each owning a distinct concern.

Taxonomy standardizes the controlled vocabularies — sponsor study phases, document types, country codes, IRB decision states — so that a single concept has one canonical code regardless of which site or system produced it. Doing this once, centrally, is what lets a global program reconcile data across regions; the patterns are covered in Regulatory Taxonomy Standardization.
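
A minimal sketch of that mapping for one vocabulary follows. The synonym table, the canonical codes, and the canonical_irb_decision function are hypothetical examples, not a published code list.

```python
"""Sketch: map source-specific labels onto one canonical code."""
from __future__ import annotations

# Hypothetical code list; canonical codes and synonyms are illustrative only.
IRB_DECISION_SYNONYMS: dict[str, str] = {
    "approved": "FAVORABLE",
    "favourable opinion": "FAVORABLE",
    "favorable opinion": "FAVORABLE",
    "approved with modifications": "CONDITIONAL",
    "disapproved": "UNFAVORABLE",
}


def canonical_irb_decision(raw: str) -> str:
    """Normalize whitespace and case, then look up the canonical code.

    Unmapped labels raise instead of passing through, so a new local term
    is reviewed and added to the code list rather than silently accepted.
    """
    key = " ".join(raw.split()).lower()
    try:
        return IRB_DECISION_SYNONYMS[key]
    except KeyError:
        raise ValueError(f"Unmapped IRB decision label: {raw!r}") from None
```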

The data dictionary binds each canonical field to its definition, data type, permitted value set, and downstream lineage. It is the authoritative source the normalization and validation layers consult, and it is version-controlled so that a schema change is reviewable against the regulatory update that motivated it. See Regulatory Data Dictionary Construction.
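
As an illustration, a dictionary entry can carry its own validation rule. The FieldDefinition class, its attribute names, and the STUDY_PHASE value set below are assumptions made for this sketch rather than a prescribed layout.

```python
"""Sketch: a versioned data-dictionary entry that validates one field."""
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical dictionary entry; attribute names are assumptions."""

    name: str
    definition: str
    data_type: type
    permitted_values: frozenset[str] | None  # None means any value of data_type
    dictionary_version: str

    def validate(self, value: object) -> None:
        """Raise ValueError if the value breaks the type or value-set rule."""
        if not isinstance(value, self.data_type):
            raise ValueError(f"{self.name}: expected {self.data_type.__name__}")
        if self.permitted_values is not None and value not in self.permitted_values:
            raise ValueError(f"{self.name}: {value!r} not in permitted value set")


# Illustrative entry; the value set is not taken from any regulatory code list.
STUDY_PHASE = FieldDefinition(
    name="study_phase",
    definition="Clinical development phase of the protocol",
    data_type=str,
    permitted_values=frozenset({"PHASE_1", "PHASE_2", "PHASE_3", "PHASE_4"}),
    dictionary_version="1.0.0",
)
```

Rejecting "phase 2" rather than coercing it to "PHASE_2" keeps ambiguous input out of the canonical model; the coercion belongs in the taxonomy layer, where it is reviewable.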

The submission schema expresses structure: the eCTD backbone, module placement, file naming, and PDF metadata that FDA and EMA pre-validation enforce. The U.S. FDA accepts marketing applications in eCTD format via the Electronic Submissions Gateway, and the EMA operates its own gateway for centralized procedures; both organize content into Modules 1 through 5, where Modules 2 through 5 follow the harmonized ICH Common Technical Document and Module 1 is the region-specific administrative module. Encoding these rules as Pydantic or JSON Schema validators catches structural defects at the transformation boundary rather than at the portal. See FDA/EMA Submission Schema Design.
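
A structural rule of this kind can be sketched as a small pre-validation function. The limits encoded here (lowercase letters, digits, and hyphens, with at most 64 characters including the extension) are assumptions modelled on the ICH eCTD file-naming conventions and should be checked against the current regional specification; file_name_defects is a hypothetical helper.

```python
"""Sketch: pre-validate a leaf file name before it enters an eCTD sequence."""
from __future__ import annotations

import re

# Assumed rules: lowercase letters, digits and hyphens, one extension,
# and at most 64 characters including the extension.
MAX_NAME_LENGTH = 64
NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]*\.[a-z0-9]+")


def file_name_defects(name: str) -> list[str]:
    """Return a list of human-readable defects; empty means the name passes."""
    defects: list[str] = []
    if len(name) > MAX_NAME_LENGTH:
        defects.append(f"name exceeds {MAX_NAME_LENGTH} characters")
    if NAME_PATTERN.fullmatch(name) is None:
        defects.append(
            "name must be lowercase letters, digits, or hyphens plus an extension"
        )
    return defects
```

Returning every defect at once, instead of raising on the first, lets a site correct a packet in one round trip.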

The relationship is strictly layered: each layer depends only on the one before it, so the schema builds on the dictionary and the dictionary builds on the taxonomy:

flowchart TB
    TAX[Taxonomy and code lists] --> DICT[Data dictionary fields and value sets]
    DICT --> SCHEMA[Submission schema eCTD structure]
    SCHEMA --> SUB[Validated submission sequence]

Site activation and IRB workflow as state machines

Site activation is inherently stateful: feasibility, contract execution, IRB or ethics approval, and regulatory clearance must complete in order, and an automated trigger should fire only when every prerequisite is genuinely met. Modeling activation and the IRB lifecycle as explicit finite-state machines makes the gates auditable and prevents illegal transitions — for example, dispatching study drug before the ethics committee has issued a favorable opinion.

stateDiagram-v2
    [*] --> Feasibility
    Feasibility --> ContractExecution: qualified
    ContractExecution --> IRBReview: contract signed
    IRBReview --> RegulatoryClearance: favorable opinion
    IRBReview --> IRBReview: revisions requested
    RegulatoryClearance --> Activated: clearance granted
    Activated --> [*]

The IRB transitions in particular must preserve a human-in-the-loop decision point; automation routes reminders and assembles packets but never manufactures an approval. The mapping from real submission workflows to enforceable state machines is detailed in IRB/Ethics Workflow Mapping, and the upstream qualification gates that feed the Feasibility state come from Clinical Site Readiness Assessment Frameworks.

A minimal, correct transition guard keeps the rules in one place:

"""Guarded transitions for the site-activation state machine."""
from __future__ import annotations

ALLOWED: dict[str, frozenset[str]] = {
    "feasibility": frozenset({"contract_execution"}),
    "contract_execution": frozenset({"irb_review"}),
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}


def transition(current: str, target: str) -> str:
    """Return the next state or raise if the move is not permitted.

    Centralizing the allow-list prevents skipping a compliance gate such
    as activating a site before IRB clearance.
    """
    if current not in ALLOWED:
        raise ValueError(f"Unknown state: {current!r}")
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition {current!r} -> {target!r}")
    return target
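
Because the ALCOA+ "complete" attribute covers failed attempts as well as successes, the guard is often wrapped so that rejected moves are recorded too. The sketch below is one way to do that; the SiteActivation class and its attempts log are assumptions, and the allow-list is repeated only so the example runs on its own.

```python
"""Sketch: record every transition attempt, including rejected ones."""
from __future__ import annotations

from dataclasses import dataclass, field

# Same allow-list as the guard above, repeated so the sketch is self-contained.
ALLOWED: dict[str, frozenset[str]] = {
    "feasibility": frozenset({"contract_execution"}),
    "contract_execution": frozenset({"irb_review"}),
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}


@dataclass
class SiteActivation:
    """Hypothetical wrapper that keeps the current state and an attempt log."""

    state: str = "feasibility"
    attempts: list[tuple[str, str, str, bool]] = field(default_factory=list)

    def move(self, target: str, actor_id: str) -> bool:
        """Attempt a transition; log (from, to, actor, accepted) either way."""
        accepted = target in ALLOWED.get(self.state, frozenset())
        self.attempts.append((self.state, target, actor_id, accepted))
        if accepted:
            self.state = target
        return accepted
```

In a fuller system each tuple in attempts would become an audit event; the in-memory list here only demonstrates that a rejected move leaves a trace and does not change the state.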

Resilience and operational continuity

Regulatory portals enforce rate limits, maintenance windows, and submission deadlines that do not move because a gateway is down. The platform must degrade safely: distinguish a permanent failure (a malformed sequence — fail fast and surface it) from a transient fault (a gateway timeout — retry with bounded exponential backoff and jitter), and route around an outage to an alternate channel or a durable queue without losing the submission’s state. These patterns, including circuit breaking and dead-letter handling, are the focus of Fallback Routing for Portal Outages.

Backoff is simply a geometric schedule with a cap. For attempt $n$ with base delay $b$ and ceiling $C$, the deterministic component is:

$t_n = \min\bigl(C,\; b \cdot 2^{\,n}\bigr)$

Adding bounded random jitter on top of $t_n$ prevents synchronized retry storms when many site packets queue behind the same recovering gateway.
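
The schedule and the permanent-versus-transient split can be sketched together. Everything named here is an assumption for illustration: the TransientError marker class, the 25% jitter bound, the default base of 2 seconds and ceiling of 60 seconds, and the injectable sleep and rng parameters that make the function testable.

```python
"""Sketch: capped exponential backoff with bounded jitter."""
from __future__ import annotations

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying, e.g. a gateway timeout."""


def base_delay(attempt: int, base: float, cap: float) -> float:
    """Deterministic component: t_n = min(C, b * 2**n)."""
    return min(cap, base * 2 ** attempt)


def jittered_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Add up to 25% random jitter on top of t_n (the 25% bound is an assumption)."""
    t_n = base_delay(attempt, base, cap)
    return t_n + rng.uniform(0.0, 0.25 * t_n)


def call_with_retry(
    operation: Callable[[], T],
    *,
    max_attempts: int = 5,
    base: float = 2.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: random.Random | None = None,
) -> T:
    """Retry only TransientError; any other exception propagates immediately."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the fault to the caller
            sleep(jittered_delay(attempt, base, cap, rng))
    raise RuntimeError("max_attempts must be at least 1")
```

With $b = 2$ and $C = 60$, the deterministic delays for attempts 0 through 5 are 2, 4, 8, 16, 32, and 60 seconds: the sixth value would be 64 but is clipped by the ceiling. A malformed sequence would raise something other than TransientError and therefore fail on the first attempt.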

Production Python in a regulated environment

The same engineering discipline runs through every layer:

Reproducible builds — pin dependencies in pyproject.toml and a lockfile so a submission can be reconstructed from a known toolchain.
Validated input — never persist external data before it passes the data-dictionary rules; reject rather than coerce ambiguous values.
Structured, tamper-evident logging — emit JSON audit events (for example with structlog) and chain them as shown above.
No bare excepts, no swallowed errors — catch specific exceptions, classify them as permanent or transient, and record both outcomes.
No hardcoded secrets — read credentials and keys from the environment or a secrets manager; generate tokens with the standard-library secrets module, never random.
Tested compliance logic — cover validation and state-transition code with pytest, mapping each test to the regulatory requirement it defends.
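
Two of these rules, no hardcoded secrets and tokens from the secrets module, are small enough to sketch directly. The function names and the PORTAL_API_KEY variable used in the usage note are hypothetical.

```python
"""Sketch: load a credential from the environment and mint a token."""
from __future__ import annotations

import os
import secrets


def load_credential(name: str) -> str:
    """Read a required secret from the environment; fail loudly if absent."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Required credential {name!r} is not set")
    return value


def new_correlation_token() -> str:
    """Return an unguessable URL-safe token from the OS CSPRNG."""
    return secrets.token_urlsafe(32)
```

A service would call load_credential("PORTAL_API_KEY") once at startup, so a missing secret stops the process before any submission is attempted.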

Compliance here is not a layer bolted on at the end; it is expressed as code and enforced in CI, so that schema definitions, transition guards, and audit chaining are verified on every change.

FAQ

What is the difference between the taxonomy, the data dictionary, and the submission schema?

The taxonomy standardizes vocabulary (the canonical codes for a concept), the data dictionary defines fields (type, value set, lineage for each data element), and the submission schema defines structure (how validated fields are assembled into an eCTD sequence). They form a strict dependency chain: schema depends on the dictionary, which depends on the taxonomy.

How does this architecture satisfy 21 CFR Part 11?

Part 11 governs electronic records and signatures. The append-only, hash-chained audit log provides attributable, tamper-evident records; server-side timestamps enforce contemporaneity; and role-based access plus authenticated actor identity on every event support the electronic-signature and access-control expectations. The state machine enforces the order in which approvals can occur, and the audit log records who approved what and when.

Where does document parsing and OCR fit?

Parsing, OCR, schema validation, and batch ingestion are the document-handling counterpart to this regulatory-mapping pillar. They are covered in the sibling pillar Automated Document Ingestion & Validation Workflows, which feeds normalized artifacts into the ingestion layer described here.

Why model activation and IRB review as state machines instead of a checklist?

A checklist records whether steps are done; a state machine enforces order and legality of transitions. That distinction is what prevents an automated trigger from skipping a compliance gate — such as activating a site or shipping drug before a favorable ethics opinion exists.
