Core Architecture & Regulatory Mapping for Clinical Trials
Clinical trial site activation and regulatory submission automation sits where rigid ICH, FDA, and EMA mandates meet a distributed data ecosystem. This pillar maps the reference architecture, the regulatory data model, and the Python patterns that turn 21 CFR Part 11, ALCOA+, and eCTD requirements into a deterministic, audit-ready system.
The hard problem is not moving files. It is guaranteeing that every transformation, signature, and routing decision is reproducible, attributable, and defensible during an inspection. The sections below decompose that problem into seven domains, each backed by a dedicated cluster, and show how they fit into one coherent platform. For the document-handling half of the same platform — parsing, OCR, schema validation, and batch ingestion — see the sibling pillar on Automated Document Ingestion & Validation Workflows.
How to read this pillar
| Domain | What it governs | Cluster |
|---|---|---|
| Submission schema | eCTD structure, JSON/XML backbone, file-level validation | FDA/EMA Submission Schema Design |
| Taxonomy | Controlled vocabularies, code lists, cross-jurisdiction mapping | Regulatory Taxonomy Standardization |
| Data dictionary | Field-level definitions, value sets, lineage | Regulatory Data Dictionary Construction |
| Site readiness | Feasibility, infrastructure, GCP qualification gates | Clinical Site Readiness Assessment Frameworks |
| IRB/ethics | Submission state machine, human-in-the-loop review | IRB/Ethics Workflow Mapping |
| Resilience | Portal timeouts, retries, fallback routing | Fallback Routing for Portal Outages |
| Security | Network segmentation, PHI isolation, zero trust | Security Boundaries for Clinical Data |
Reference architecture
A maintainable clinical platform abandons the single shared database in favor of layered, event-driven services joined by an append-only audit log. Inputs — site qualification packets, protocol amendments, IRB approvals, regulatory clearances — enter an ingestion layer over authenticated APIs or SFTP, where each artifact is hashed and stamped with provenance metadata before anything else touches it. A normalization layer maps heterogeneous payloads onto the canonical model defined by the data dictionary. A validation layer applies deterministic, jurisdiction-aware rules. A submission layer assembles the eCTD sequence and routes it to the correct portal.
flowchart LR
subgraph Edge
SRC[Sites and sponsors]
end
subgraph Platform
I[Ingestion and hashing]
N[Normalization to canonical model]
V[Validation engine]
S[Submission assembly]
R[Routing and fallback]
end
subgraph Authorities
FDA[FDA ESG eCTD]
EMA[EMA gateway]
end
SRC --> I --> N --> V --> S --> R
R --> FDA
R --> EMA
AUD[Append-only audit log]
I -. records .-> AUD
N -. records .-> AUD
V -. records .-> AUD
S -. records .-> AUD
R -. records .-> AUD
Two properties make this topology compliant rather than merely tidy. First, the audit log is write-once: every layer emits an event but no layer can mutate a prior one, which is what makes the trail trustworthy under 21 CFR Part 11. Second, the boundaries between layers are also trust boundaries — PHI never crosses into a layer that does not need it, a constraint elaborated in Security Boundaries for Clinical Data.
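The ingestion step described above (hash first, then stamp provenance) can be sketched with the standard library alone. This is a minimal illustration, not the platform's actual API: the `Provenance` record, the `ingest` function, and the sample identifiers are all hypothetical names.

```python
"""Sketch: hash an inbound artifact and stamp provenance before further processing."""
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record attached at the ingestion boundary."""

    sha256: str            # digest of the bytes exactly as received
    source_id: str         # authenticated sender (site or sponsor system)
    channel: str           # for example "api" or "sftp"
    received_at: datetime  # server-side UTC timestamp


def ingest(payload: bytes, source_id: str, channel: str) -> Provenance:
    """Hash the raw bytes first so later layers can show nothing changed."""
    if not source_id:
        raise ValueError("source_id must identify an authenticated sender")
    digest = hashlib.sha256(payload).hexdigest()
    return Provenance(
        sha256=digest,
        source_id=source_id,
        channel=channel,
        received_at=datetime.now(timezone.utc),
    )


# Illustrative call with made-up identifiers.
record = ingest(b"protocol amendment v2", source_id="site-0042", channel="sftp")
```

Because the digest is taken over the bytes as received, any later normalization can link its output back to the original capture by that hash.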
ALCOA+ as a design contract
ALCOA+ is the data-integrity standard regulators apply to records and is the most useful checklist for architecture decisions. Treat each attribute as a non-functional requirement rather than a policy slogan:
- Attributable — every event carries an authenticated actor and timestamp
- Legible — records are human-readable and machine-parseable (UTF-8, ISO 8601)
- Contemporaneous — events are written at the moment of action, server-side
- Original — the first capture is preserved; derivations link back to it
- Accurate — validated against the data dictionary before persistence
- Complete — including failed attempts, retries, and overrides
- Consistent — ordered, with timezone-aware, monotonic sequencing
- Enduring — retained per protocol and archival policy
- Available — retrievable for inspection without reconstruction
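A practical way to encode the contract is a typed, immutable audit event. The model below uses Pydantic v2 and stamps each event with the server's UTC clock at construction; code at the API boundary should never pass a client-supplied `recorded_at` through to it: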
"""Append-only audit event for a 21 CFR Part 11 compliant clinical platform."""
from __future__ import annotations

import hashlib
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field


class AuditAction(str, Enum):
    INGESTED = "ingested"
    NORMALIZED = "normalized"
    VALIDATED = "validated"
    SUBMITTED = "submitted"
    REROUTED = "rerouted"


class AuditEvent(BaseModel):
    """One immutable entry in the audit trail.

    The event is frozen after construction so application code cannot
    backdate or mutate it, satisfying the ALCOA+ 'original' and
    'contemporaneous' attributes.
    """

    model_config = ConfigDict(frozen=True)

    actor_id: str = Field(..., min_length=1, description="Authenticated user or service identity")
    action: AuditAction
    artifact_sha256: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    recorded_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    prev_hash: str = Field(default="", pattern=r"^([0-9a-f]{64})?$")

    def chain_hash(self) -> str:
        """Hash this event together with its predecessor.

        Linking each entry to the previous hash turns the log into a
        tamper-evident chain: altering any record invalidates every hash
        after it.
        """
        payload = "|".join(
            (
                self.prev_hash,
                self.actor_id,
                self.action.value,
                self.artifact_sha256,
                self.recorded_at.isoformat(),
            )
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
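To show what "tamper-evident" means in practice, the sketch below verifies a chain using only the standard library. `Entry` is a simplified, hypothetical stand-in for the audit event (a frozen dataclass with pre-serialized timestamps), and `verify_chain` is an illustrative helper rather than part of any library.

```python
"""Sketch: build and verify a hash-chained log with the standard library."""
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Simplified stand-in for an audit event (hypothetical, for illustration)."""

    prev_hash: str
    actor_id: str
    action: str
    artifact_sha256: str
    recorded_at: str  # ISO 8601 string, already serialized

    def chain_hash(self) -> str:
        payload = "|".join(
            (self.prev_hash, self.actor_id, self.action, self.artifact_sha256, self.recorded_at)
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(entries: list[Entry]) -> bool:
    """Return True only if every entry points at its predecessor's hash."""
    expected_prev = ""  # the first entry has no predecessor
    for entry in entries:
        if entry.prev_hash != expected_prev:
            return False
        expected_prev = entry.chain_hash()
    return True


artifact = hashlib.sha256(b"irb-approval.pdf").hexdigest()
first = Entry("", "svc-ingest", "ingested", artifact, "2024-01-01T00:00:00+00:00")
second = Entry(first.chain_hash(), "svc-validate", "validated", artifact, "2024-01-01T00:00:05+00:00")
log = [first, second]

# Rewriting the first entry changes its hash, so the second no longer links to it.
tampered = [Entry("", "intruder", "ingested", artifact, first.recorded_at), second]
```

Note that a chain only protects entries that have a successor. The hash of the newest entry needs to be stored somewhere the writer cannot alter for the tail to be covered as well.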
Regulatory mapping: taxonomy, dictionary, schema
Most rejected submissions fail on data semantics, not transport. The fix is three coordinated layers of regulatory metadata, each owning a distinct concern.
Taxonomy standardizes the controlled vocabularies — sponsor study phases, document types, country codes, IRB decision states — so that a single concept has one canonical code regardless of which site or system produced it. Doing this once, centrally, is what lets a global program reconcile data across regions; the patterns are covered in Regulatory Taxonomy Standardization.
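As a rough sketch of that idea, the snippet below collapses several site-specific labels for an IRB decision onto one canonical code. The synonym table and the canonical code names are invented for illustration; a real code list would be versioned and loaded from wherever the taxonomy is maintained.

```python
"""Sketch: map site-specific labels onto one canonical code."""
from __future__ import annotations

# Hypothetical synonym table, keyed by normalized (lowercase, single-spaced) label.
IRB_DECISION_SYNONYMS: dict[str, str] = {
    "approved": "FAVORABLE",
    "favorable opinion": "FAVORABLE",
    "favourable opinion": "FAVORABLE",
    "approved with conditions": "CONDITIONAL",
    "revisions requested": "REVISIONS_REQUESTED",
    "disapproved": "UNFAVORABLE",
}


def canonical_irb_decision(raw: str) -> str:
    """Return the canonical code, or raise if the label is not mapped."""
    key = " ".join(raw.strip().lower().split())  # trim and collapse whitespace
    try:
        return IRB_DECISION_SYNONYMS[key]
    except KeyError:
        # Unknown labels are surfaced, never guessed at.
        raise ValueError(f"Unmapped IRB decision label: {raw!r}") from None
```

Raising on an unmapped label, rather than passing it through, keeps new vocabulary from leaking into downstream layers without review.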
The data dictionary binds each canonical field to its definition, data type, permitted value set, and downstream lineage. It is the authoritative source the normalization and validation layers consult, and it is version-controlled so that a schema change is reviewable against the regulatory update that motivated it. See Regulatory Data Dictionary Construction.
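One way to picture a dictionary entry is as a small typed record that the validation layer consults. The sketch below is an assumption-laden illustration: `FieldSpec`, `DICTIONARY`, the two sample fields, and the version string are hypothetical, and lineage is omitted for brevity.

```python
"""Sketch: validate a record against data-dictionary field specifications."""
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical dictionary entry: type, permitted values, and version."""

    name: str
    py_type: type
    allowed: frozenset[str] | None  # None means no enumerated value set
    dictionary_version: str


DICTIONARY: dict[str, FieldSpec] = {
    "study_phase": FieldSpec(
        "study_phase", str, frozenset({"PHASE_1", "PHASE_2", "PHASE_3", "PHASE_4"}), "1.0.0"
    ),
    "planned_enrollment": FieldSpec("planned_enrollment", int, None, "1.0.0"),
}


def validate_record(record: dict[str, object]) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    errors: list[str] = []
    for name, value in record.items():
        spec = DICTIONARY.get(name)
        if spec is None:
            errors.append(f"{name}: not defined in the data dictionary")
            continue
        # Exact type match: no coercion, and bool is not accepted as int.
        if type(value) is not spec.py_type:
            errors.append(f"{name}: expected {spec.py_type.__name__}, got {type(value).__name__}")
            continue
        if spec.allowed is not None and value not in spec.allowed:
            errors.append(f"{name}: {value!r} is outside the permitted value set")
    return errors
```

For example, `validate_record({"study_phase": "2", "planned_enrollment": "120"})` reports two problems instead of silently converting either value.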
The submission schema expresses structure: the eCTD backbone, module placement, file naming, and PDF metadata that FDA and EMA pre-validation enforce. The U.S. FDA accepts marketing applications in eCTD format via the Electronic Submissions Gateway, and the EMA operates its own gateway for centralized procedures; both organize content into the harmonized Common Technical Document Modules 1 through 5, where Module 1 is the region-specific administrative module. Encoding these rules as Pydantic or JSON Schema validators catches structural defects at the transformation boundary rather than at the portal. See FDA/EMA Submission Schema Design.
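A structural check of this kind can be quite small. The sketch below tests only two things: that a leaf path sits under one of Modules 1 through 5, and that its file name follows a conservative naming rule. The name pattern and the 64-character limit are assumptions made for illustration; the authoritative constraints are in the ICH and regional eCTD specifications, and `check_leaf` is a hypothetical helper.

```python
"""Sketch: pre-validate module placement and file naming for an eCTD leaf."""
from __future__ import annotations

import re
from pathlib import PurePosixPath

MODULES = frozenset({"m1", "m2", "m3", "m4", "m5"})

# Assumed naming rule: lowercase letters, digits, and hyphens, plus an extension.
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]*\.[a-z0-9]+$")
MAX_NAME_LEN = 64  # assumed limit; confirm against the applicable specification


def check_leaf(path: str) -> list[str]:
    """Return structural problems for one relative path inside a sequence."""
    parts = PurePosixPath(path).parts
    problems: list[str] = []
    if not parts or parts[0] not in MODULES:
        problems.append(f"{path!r}: top-level folder must be one of m1 to m5")
    name = parts[-1] if parts else ""
    if not NAME_RE.match(name):
        problems.append(f"{path!r}: file name violates the naming rule")
    if len(name) > MAX_NAME_LEN:
        problems.append(f"{path!r}: file name exceeds {MAX_NAME_LEN} characters")
    return problems
```

Running a check like this at the transformation boundary turns a portal rejection into a local, testable failure.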
The relationship is strictly layered — each layer depends only on the one beneath it:
flowchart TB
TAX[Taxonomy and code lists] --> DICT[Data dictionary fields and value sets]
DICT --> SCHEMA[Submission schema eCTD structure]
SCHEMA --> SUB[Validated submission sequence]
Site activation and IRB workflow as state machines
Site activation is inherently stateful: feasibility, contract execution, IRB or ethics approval, and regulatory clearance must complete in order, and an automated trigger should fire only when every prerequisite is genuinely met. Modeling activation and the IRB lifecycle as explicit finite-state machines makes the gates auditable and prevents illegal transitions — for example, dispatching study drug before the ethics committee has issued a favorable opinion.
stateDiagram-v2
[*] --> Feasibility
Feasibility --> ContractExecution: qualified
ContractExecution --> IRBReview: contract signed
IRBReview --> RegulatoryClearance: favorable opinion
IRBReview --> IRBReview: revisions requested
RegulatoryClearance --> Activated: clearance granted
Activated --> [*]
The IRB transitions in particular must preserve a human-in-the-loop decision point; automation routes reminders and assembles packets but never manufactures an approval. The mapping from real submission workflows to enforceable state machines is detailed in IRB/Ethics Workflow Mapping, and the upstream qualification gates that feed the Feasibility state come from Clinical Site Readiness Assessment Frameworks.
A minimal, correct transition guard keeps the rules in one place:
"""Guarded transitions for the site-activation state machine."""
from __future__ import annotations

ALLOWED: dict[str, frozenset[str]] = {
    "feasibility": frozenset({"contract_execution"}),
    "contract_execution": frozenset({"irb_review"}),
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}


def transition(current: str, target: str) -> str:
    """Return the next state or raise if the move is not permitted.

    Centralizing the allow-list prevents skipping a compliance gate such
    as activating a site before IRB clearance.
    """
    if current not in ALLOWED:
        raise ValueError(f"Unknown state: {current!r}")
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition {current!r} -> {target!r}")
    return target
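A guard like this becomes more useful when every attempt, including refused ones, is kept for the audit trail. The sketch below is one possible wrapper, not the platform's implementation: `SiteActivation` is a hypothetical class, it returns a boolean instead of raising, and the `svc-` prefix used to recognize service accounts is an invented convention for modeling the human-in-the-loop rule.

```python
"""Sketch: record every transition attempt and require a human at the IRB gate."""
from __future__ import annotations

from dataclasses import dataclass, field

ALLOWED: dict[str, frozenset[str]] = {
    "feasibility": frozenset({"contract_execution"}),
    "contract_execution": frozenset({"irb_review"}),
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}


@dataclass
class SiteActivation:
    """Hypothetical per-site tracker that keeps refused attempts too."""

    site_id: str
    state: str = "feasibility"
    # Each item: (from_state, requested_state, decided_by, permitted)
    history: list[tuple[str, str, str, bool]] = field(default_factory=list)

    def advance(self, target: str, decided_by: str) -> bool:
        """Attempt a transition; log the attempt whether or not it succeeds."""
        if self.state == "irb_review" and decided_by.startswith("svc-"):
            # Assumed convention: service accounts cannot decide IRB outcomes.
            permitted = False
        else:
            permitted = target in ALLOWED.get(self.state, frozenset())
        self.history.append((self.state, target, decided_by, permitted))
        if permitted:
            self.state = target
        return permitted


site = SiteActivation("site-0042")
skipped = site.advance("activated", "svc-scheduler")                 # refused: gates skipped
site.advance("contract_execution", "cra.jones")
site.advance("irb_review", "cra.jones")
automated = site.advance("regulatory_clearance", "svc-scheduler")    # refused: no human decision
approved = site.advance("regulatory_clearance", "irb.chair")
```

Keeping refused attempts in `history` is what lets the log satisfy the "complete" attribute of ALCOA+ rather than showing only the successful path.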
Resilience and operational continuity
Regulatory portals enforce rate limits, maintenance windows, and submission deadlines that do not move because a gateway is down. The platform must degrade safely: distinguish a permanent failure (a malformed sequence — fail fast and surface it) from a transient fault (a gateway timeout — retry with bounded exponential backoff and jitter), and route around an outage to an alternate channel or a durable queue without losing the submission’s state. These patterns, including circuit breaking and dead-letter handling, are the focus of Fallback Routing for Portal Outages.
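The permanent-versus-transient distinction can be made explicit in the exception types. In the sketch below, the two exception classes, `submit_with_fallback`, and the in-memory `deque` standing in for a durable queue are all hypothetical; a production system would persist the queue and wait between attempts.

```python
"""Sketch: classify failures as permanent or transient and fall back to a queue."""
from __future__ import annotations

from collections import deque
from typing import Callable


class PermanentSubmissionError(Exception):
    """The sequence itself is invalid; retrying cannot help."""


class TransientPortalError(Exception):
    """The portal timed out or was unreachable; a retry may succeed."""


def submit_with_fallback(
    sequence_id: str,
    send: Callable[[str], str],
    pending: deque[str],
    max_attempts: int = 3,
) -> str:
    """Try the portal a bounded number of times, then park the submission.

    PermanentSubmissionError is deliberately not caught, so a malformed
    sequence fails fast and reaches the caller.
    """
    for _attempt in range(max_attempts):
        try:
            return send(sequence_id)
        except TransientPortalError:
            continue  # a real caller would sleep with backoff before retrying
    pending.append(sequence_id)  # stand-in for a durable queue
    return "queued"
```

The submission is never dropped: it either reaches the portal, raises a permanent error for a human to resolve, or lands in the queue with its identifier intact.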
Backoff is simply a geometric schedule with a cap. For attempt $n$ with base delay $b$ and ceiling $c$, the deterministic component is:

$$d_n = \min\left(c,\; b \cdot 2^{n}\right)$$

Adding bounded random jitter on top of $d_n$ prevents synchronized retry storms when many site packets queue behind the same recovering gateway.
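A capped schedule with jitter fits in a few lines. The sketch below assumes a doubling factor, attempts counted from zero, and jitter expressed as a fraction of the deterministic delay; the function names and default values are illustrative choices, not prescribed settings.

```python
"""Sketch: capped exponential backoff with bounded additive jitter."""
from __future__ import annotations

import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Deterministic component: min(cap, base * 2**attempt)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap, base * 2 ** attempt)


def jittered_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: float = 0.25,
    rng: random.Random | None = None,
) -> float:
    """Add up to `jitter` times the deterministic delay as random spread."""
    rng = rng or random.Random()
    delay = backoff_delay(attempt, base, cap)
    return delay + rng.uniform(0.0, jitter * delay)
```

With the defaults, attempts 0 through 3 give deterministic delays of 1, 2, 4, and 8 seconds, and the schedule stops growing at 60 seconds. The `random` module is acceptable here because jitter is not security-sensitive.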
Production Python in a regulated environment
The same engineering discipline runs through every layer:
- Reproducible builds: pin dependencies in `pyproject.toml` and a lockfile so a submission can be reconstructed from a known toolchain.
- Validated input: never persist external data before it passes the data-dictionary rules; reject rather than coerce ambiguous values.
- Structured, tamper-evident logging: emit JSON audit events (for example with `structlog`) and chain them as shown above.
- No bare excepts, no swallowed errors: catch specific exceptions, classify them as permanent or transient, and record both outcomes.
- No hardcoded secrets: read credentials and keys from the environment or a secrets manager; generate tokens with `secrets`, never `random`.
- Tested compliance logic: cover validation and state-transition code with `pytest`, mapping each test to the regulatory requirement it defends.
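The secrets guidance in the list above can be sketched with the standard library. The helper names and the `PORTAL_API_KEY` variable are hypothetical; a deployment might read from a secrets manager instead of the process environment.

```python
"""Sketch: load credentials from the environment and mint unguessable tokens."""
from __future__ import annotations

import os
import secrets


def load_credential(name: str) -> str:
    """Read a required credential, failing loudly if it is absent or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required credential: {name}")
    return value


def new_submission_token() -> str:
    """Generate a URL-safe token from the OS cryptographic random source."""
    return secrets.token_urlsafe(32)
```

Failing at startup when `load_credential("PORTAL_API_KEY")` finds nothing is usually preferable to discovering the gap in the middle of a submission.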
Compliance here is not a layer bolted on at the end; it is expressed as code and enforced in CI, so that schema definitions, transition guards, and audit chaining are verified on every change.
FAQ
What is the difference between the taxonomy, the data dictionary, and the submission schema?
The taxonomy standardizes vocabulary (the canonical codes for a concept), the data dictionary defines fields (type, value set, lineage for each data element), and the submission schema defines structure (how validated fields are assembled into an eCTD sequence). They form a strict dependency chain: schema depends on the dictionary, which depends on the taxonomy.
How does this architecture satisfy 21 CFR Part 11?
Part 11 governs electronic records and signatures. The append-only, hash-chained audit log provides attributable, tamper-evident records; server-side timestamps enforce contemporaneity; and role-based access plus authenticated actor identity on every event support the electronic-signature and access-control expectations. The state machines add a record of who approved what, and in what order.
Where does document parsing and OCR fit?
Parsing, OCR, schema validation, and batch ingestion are the document-handling counterpart to this regulatory-mapping pillar. They are covered in the sibling pillar Automated Document Ingestion & Validation Workflows, which feeds normalized artifacts into the ingestion layer described here.
Why model activation and IRB review as state machines instead of a checklist?
A checklist records whether steps are done; a state machine enforces order and legality of transitions. That distinction is what prevents an automated trigger from skipping a compliance gate — such as activating a site or shipping drug before a favorable ethics opinion exists.