Regulatory Data Dictionary Construction

A regulatory data dictionary is the controlled, version-governed source of truth that defines every data element collected in a clinical trial: its name, definition, data type, permissible values, controlled terminology binding, and mapping to downstream submission standards. This guide explains how to build one on real CDISC, MedDRA, and LOINC foundations, govern its versions, and model it in Python with validation that enforces both structure and terminology.

A data dictionary is the connective tissue of the architecture described in the parent pillar, Core Architecture & Regulatory Mapping for Clinical Trials. Without it, two sites collect “sex” as M/F and 1/2, an adverse event is coded three different ways, and the eventual submission package fails technical conformance. With it, every field has a single authoritative definition, a stable binding to a recognized terminology, and a traceable lineage from collection form to submission dataset.

What a regulatory data dictionary actually contains

A common mistake is to treat the dictionary as a flat spreadsheet of column names. In a regulated context it is a structured catalog of data element definitions, each carrying enough metadata to drive collection, validation, transformation, and submission. The minimum viable record for a single element includes:

| Attribute | Purpose | Example |
| --- | --- | --- |
| OID / stable identifier | Immutable machine key, never reused | `DM.SEX` |
| Variable name | Submission-facing name (8-char SDTM limit) | `SEX` |
| Label | Human-readable description | `Sex` |
| Data type | Storage and validation type | `text`, `integer`, `float`, `date` |
| Origin | Where the value comes from | `Collected`, `Derived`, `Assigned`, `Protocol` |
| Codelist binding | Controlled terminology reference | NCI C66731 (Sex) |
| Core designation | Required/expected/permissible | `Req`, `Exp`, `Perm` |
| Mapping target | Downstream submission location | SDTM `DM.SEX` |
| Version metadata | When introduced, deprecated, superseded | added v1.2.0 |

The dictionary deliberately separates definition (what the element means and how it is constrained) from instance data (the actual subject values). The definition is governed and versioned; the instance data is validated against it.
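
To make the split between definition and instance data concrete, here is a minimal sketch using only the standard library. The `ElementDefinition` class, the `conforms` helper, and the two site records are illustrative names and values, not part of any CDISC tooling; the permissible values are the ones used in the worked example later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDefinition:
    """Governed, versioned definition of one element (mirrors the table above)."""
    oid: str
    variable_name: str
    label: str
    data_type: str
    origin: str
    codelist: str
    permissible_values: frozenset
    core: str
    mapping_target: str
    added_in: str


# Definition: what DM.SEX means and how it is constrained.
SEX = ElementDefinition(
    oid="DM.SEX", variable_name="SEX", label="Sex", data_type="text",
    origin="Collected", codelist="NCI C66731",
    permissible_values=frozenset({"F", "M", "U", "UNDIFFERENTIATED"}),
    core="Req", mapping_target="SDTM DM.SEX", added_in="1.2.0",
)

# Instance data: subject values from two hypothetical sites.
site_a_record = {"DM.SEX": "M"}
site_b_record = {"DM.SEX": "1"}  # site-local numeric coding


def conforms(definition: ElementDefinition, record: dict) -> bool:
    """True when the record's value for this element is a permissible value."""
    return record.get(definition.oid) in definition.permissible_values


print(conforms(SEX, site_a_record))  # True
print(conforms(SEX, site_b_record))  # False
```

The definition never changes when a new subject record arrives; only the records are checked, which is why the definition can be governed and versioned on its own schedule.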

Controlled terminologies are the backbone

The value of a regulatory dictionary comes almost entirely from binding fields to recognized controlled terminologies rather than inventing your own. The major standards, used for distinct purposes, are:

CDISC CDASH (Clinical Data Acquisition Standards Harmonization) standardizes data collection — the fields and questions on case report forms (CRFs). It defines how data should be captured at the source.
CDISC SDTM (Study Data Tabulation Model) standardizes how collected data is organized for submission to regulators such as FDA and PMDA. It is built around domains (e.g., DM Demographics, AE Adverse Events, LB Laboratory, VS Vital Signs) with prescribed variables and roles.
CDISC Controlled Terminology, published and maintained quarterly through NCI EVS (the NCI Enterprise Vocabulary Services), supplies the codelists that populate categorical SDTM/CDASH variables. Each codelist and term carries an NCI “C-code” (e.g., the Sex codelist is C66731).
MedDRA (Medical Dictionary for Regulatory Activities) is the standard terminology for coding adverse events, medical history, and indications. It is hierarchical, running from System Organ Class (SOC) through High Level Group Term (HLGT), High Level Term (HLT), and Preferred Term (PT) down to Lowest Level Term (LLT). It is released twice a year, with an X.0 release in March and an X.1 release in September (for example, 28.1 followed by 29.0).
LOINC (Logical Observation Identifiers Names and Codes) identifies laboratory tests and clinical observations, frequently used to harmonize lab data across central and local laboratories before it is mapped into the SDTM LB domain.

A dictionary field does not duplicate these terminologies; it binds to them. The binding records which codelist or dictionary version applies, so that when MedDRA 29.0 ships, you can see exactly which fields are affected and plan a controlled re-code rather than silently drifting.
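
The impact question ("which fields are affected when a new release ships?") can be answered mechanically once every binding records its pinned version. The sketch below is one possible shape, not a prescribed design: `Binding` and `fields_needing_review` are hypothetical names, and the pinned versions are example values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Which terminology release a dictionary field is pinned to."""
    field_oid: str
    system: str    # e.g. "MedDRA", "NCI-CT", "LOINC"
    version: str   # pinned release identifier


def fields_needing_review(bindings, system, new_version):
    """Return OIDs of fields bound to `system` at a release other than `new_version`."""
    return sorted(
        b.field_oid
        for b in bindings
        if b.system == system and b.version != new_version
    )


# Illustrative bindings; the pinned versions are example values only.
bindings = [
    Binding("AE.AEDECOD", "MedDRA", "28.1"),
    Binding("MH.MHDECOD", "MedDRA", "28.1"),
    Binding("DM.SEX", "NCI-CT", "2026-03-27"),
    Binding("LB.LBLOINC", "LOINC", "2.78"),
]

print(fields_needing_review(bindings, "MedDRA", "29.0"))
# ['AE.AEDECOD', 'MH.MHDECOD']
```

The result is only a worklist. Whether each listed field actually needs re-coding still depends on what changed between the two terminology releases.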

flowchart TD
    SRC[CRF field defined in CDASH] --> DICT[Regulatory data dictionary entry]
    DICT --> CT[CDISC Controlled Terminology codelist]
    DICT --> MED[MedDRA coded term]
    DICT --> LO[LOINC lab code]
    DICT --> MAP[Mapping to SDTM domain variable]
    MAP --> SUB[Submission dataset Define-XML]
    CT --> VAL[Validation engine]
    MED --> VAL
    LO --> VAL
    DICT --> VAL
    VAL --> AUD[Versioned audit trail]

The terminology bindings feed both the validation engine and the eventual Define-XML metadata that accompanies a submission. Getting the bindings right in the dictionary is what makes the submission package described in FDA/EMA Submission Schema Design conformant rather than rejected at technical review.

Versioning and governance

Because a dictionary is referenced by collected data that may have already been submitted, it cannot be edited freely. Treat it as a governed artifact under semantic versioning:

MAJOR — breaking changes: removing a field, narrowing a codelist so previously valid values become invalid, or changing a data type.
MINOR — additive, backward-compatible changes: new fields, new permissible values, a new optional codelist binding.
PATCH — non-semantic corrections: fixing a label typo or clarifying a definition without changing meaning or constraints.
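
These three rules are simple enough to encode as a first-pass classifier. The following is a rough sketch under stated assumptions: `ElementSpec`, `classify_change`, and `bump` are hypothetical helpers, codelists are treated as closed and non-empty, and any label change is assumed to be a wording-only correction, which a reviewer would still need to confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementSpec:
    """The parts of a definition that matter for version classification."""
    data_type: str
    permissible_values: frozenset  # assumed to be a closed, non-empty codelist
    label: str


def classify_change(old: Optional[ElementSpec], new: Optional[ElementSpec]) -> str:
    """Classify one element's change as MAJOR, MINOR, PATCH, or NONE."""
    if old is None and new is None:
        raise ValueError("at least one side of the change must be defined")
    if old is None:
        return "MINOR"   # new field: additive
    if new is None:
        return "MAJOR"   # removed field: breaking
    if old.data_type != new.data_type:
        return "MAJOR"   # changed data type: breaking
    if old.permissible_values - new.permissible_values:
        return "MAJOR"   # a previously valid value is now invalid
    if new.permissible_values - old.permissible_values:
        return "MINOR"   # new permissible values only
    if old.label != new.label:
        return "PATCH"   # assumed wording-only correction
    return "NONE"


def bump(version: str, level: str) -> str:
    """Apply a MAJOR/MINOR/PATCH bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "MAJOR":
        return f"{major + 1}.0.0"
    if level == "MINOR":
        return f"{major}.{minor + 1}.0"
    if level == "PATCH":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")


before = ElementSpec("text", frozenset({"F", "M", "U"}), "Sex")
narrowed = ElementSpec("text", frozenset({"F", "M"}), "Sex")
level = classify_change(before, narrowed)
print(level, bump("1.2.0", level))  # MAJOR 2.0.0
```

A classifier like this cannot tell a typo fix from a change of meaning, so it is better suited to flagging a minimum bump level than to replacing human review.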

The following governance principles keep the dictionary audit-ready under 21 CFR Part 11 and EU Annex 11:

Identifiers (OIDs) are immutable and never reused, even after a field is deprecated.
Fields are deprecated, not deleted — they carry a deprecated_in version and an optional superseded_by pointer so historical data stays interpretable.
Every change is attributed to a person, reason, and timestamp (audit trail).
A change-control board reviews MAJOR changes before release.
Each released version is immutable; corrections produce a new version, never an in-place edit.
Terminology version is pinned per dictionary version (e.g., “MedDRA 29.0, CDISC CT 2026-03-27”).
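
Two of these principles, deprecation instead of deletion and attributed changes, can be illustrated together. This is a minimal in-memory sketch, assuming hypothetical `Element`, `AuditEntry`, `AuditTrail`, and `deprecate` names and made-up OIDs; a real Part 11 audit trail would also need durable, tamper-evident storage, which is out of scope here.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Element:
    oid: str
    label: str
    added_in: str
    deprecated_in: Optional[str] = None
    superseded_by: Optional[str] = None


@dataclass(frozen=True)
class AuditEntry:
    oid: str
    action: str
    changed_by: str
    reason: str
    timestamp: str  # ISO 8601, UTC


class AuditTrail:
    """In-memory append-only log: entries can be added and read, not edited."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, entry: AuditEntry) -> None:
        self._entries.append(entry)

    @property
    def entries(self) -> tuple:
        return tuple(self._entries)  # read-only snapshot


def deprecate(element, in_version, changed_by, reason, trail, superseded_by=None):
    """Return a deprecated copy of `element` and record who, why, and when."""
    if element.deprecated_in is not None:
        raise ValueError(f"{element.oid} is already deprecated")
    updated = replace(element, deprecated_in=in_version, superseded_by=superseded_by)
    trail.append(AuditEntry(
        oid=element.oid, action="deprecate", changed_by=changed_by,
        reason=reason, timestamp=datetime.now(timezone.utc).isoformat(),
    ))
    return updated


trail = AuditTrail()
old_field = Element(oid="VS.TEMP_F", label="Temperature (F)", added_in="1.0.0")
retired = deprecate(old_field, "2.0.0", "j.doe", "Collect in Celsius only",
                    trail, superseded_by="VS.TEMP")
print(retired.deprecated_in, retired.superseded_by, len(trail.entries))
# 2.0.0 VS.TEMP 1
```

The original `old_field` object is left untouched, and the deprecated copy keeps the same OID, so historical records that reference `VS.TEMP_F` can still be resolved.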

This discipline aligns the dictionary with the broader naming and concept work in Regulatory Taxonomy Standardization: the taxonomy defines the shared vocabulary and hierarchy, while the data dictionary binds concrete collected fields to it.

Modeling a versioned data dictionary in Python

The following model uses pydantic v2 to represent dictionary entries, codelist bindings, and a versioned dictionary that can validate instance records. It enforces controlled-terminology membership, type constraints, and required-field rules, while keeping definitions immutable once released.

"""Versioned regulatory data dictionary with terminology-aware validation.

Models CDISC-style data element definitions, binds them to controlled
terminology codelists (CDISC CT / MedDRA / LOINC), and validates instance
records against a released, semantically versioned dictionary.
"""
from __future__ import annotations

from datetime import date
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, field_validator


class DataType(str, Enum):
    TEXT = "text"
    INTEGER = "integer"
    FLOAT = "float"
    DATE = "date"


class Core(str, Enum):
    """SDTM core designation."""
    REQUIRED = "Req"
    EXPECTED = "Exp"
    PERMISSIBLE = "Perm"


class Origin(str, Enum):
    COLLECTED = "Collected"
    DERIVED = "Derived"
    ASSIGNED = "Assigned"
    PROTOCOL = "Protocol"


class CodelistRef(BaseModel):
    """Binding to a controlled terminology, pinned to a specific version."""
    model_config = ConfigDict(frozen=True)

    system: str = Field(..., description="e.g. 'NCI-CT', 'MedDRA', 'LOINC'")
    code: str = Field(..., description="Codelist identifier, e.g. NCI C66731")
    version: str = Field(..., description="Terminology release, e.g. '2026-03-27'")
    permissible_values: frozenset[str] = Field(
        default_factory=frozenset,
        description="Allowed submission values; empty means open/coded externally.",
    )


class DataElement(BaseModel):
    """Immutable definition of a single dictionary field."""
    model_config = ConfigDict(frozen=True)

    oid: str = Field(..., description="Stable, never-reused identifier, e.g. 'DM.SEX'")
    name: str = Field(..., max_length=8, description="SDTM variable name (<= 8 chars)")
    label: str = Field(..., max_length=40)
    data_type: DataType
    core: Core = Core.PERMISSIBLE
    origin: Origin = Origin.COLLECTED
    codelist: Optional[CodelistRef] = None
    sdtm_target: Optional[str] = Field(
        default=None, description="Submission mapping, e.g. 'DM.SEX'"
    )
    added_in: str
    deprecated_in: Optional[str] = None
    superseded_by: Optional[str] = None

    @field_validator("name")
    @classmethod
    def _uppercase_name(cls, v: str) -> str:
        if not v.isascii() or not v.replace("_", "").isalnum():
            raise ValueError("SDTM variable names must be ASCII alphanumeric/underscore")
        return v.upper()

    def is_active(self) -> bool:
        return self.deprecated_in is None


class ValidationIssue(BaseModel):
    oid: str
    severity: str  # "error" | "warning"
    message: str


class DataDictionary(BaseModel):
    """A released, semantically versioned dictionary (MAJOR.MINOR.PATCH)."""
    model_config = ConfigDict(frozen=True)

    version: str
    meddra_version: str
    elements: tuple[DataElement, ...]

    def _index(self) -> dict[str, DataElement]:
        return {e.oid: e for e in self.elements}

    def validate_record(self, record: dict[str, object]) -> list[ValidationIssue]:
        """Validate one instance record (OID -> value) against this dictionary."""
        issues: list[ValidationIssue] = []
        index = self._index()

        # Required-field enforcement for active, required elements.
        for element in self.elements:
            if not element.is_active() or element.core is not Core.REQUIRED:
                continue
            if record.get(element.oid) in (None, ""):
                issues.append(ValidationIssue(
                    oid=element.oid,
                    severity="error",
                    message=f"Required field '{element.oid}' is missing.",
                ))

        # Type and controlled-terminology enforcement for supplied values.
        for oid, value in record.items():
            element = index.get(oid)
            if element is None:
                issues.append(ValidationIssue(
                    oid=oid, severity="error",
                    message=f"Unknown field '{oid}' not in dictionary {self.version}.",
                ))
                continue
            if element.deprecated_in is not None:
                issues.append(ValidationIssue(
                    oid=oid, severity="warning",
                    message=f"Field '{oid}' deprecated in {element.deprecated_in}.",
                ))
            if value in (None, ""):
                continue
            issues.extend(self._check_type(element, value))
            issues.extend(self._check_codelist(element, value))
        return issues

    @staticmethod
    def _check_type(element: DataElement, value: object) -> list[ValidationIssue]:
        ok = {
            DataType.TEXT: isinstance(value, str),
            DataType.INTEGER: isinstance(value, int) and not isinstance(value, bool),
            DataType.FLOAT: isinstance(value, (int, float)) and not isinstance(value, bool),
            DataType.DATE: isinstance(value, date),
        }[element.data_type]
        if ok:
            return []
        return [ValidationIssue(
            oid=element.oid, severity="error",
            message=f"Value {value!r} is not of type {element.data_type.value}.",
        )]

    @staticmethod
    def _check_codelist(element: DataElement, value: object) -> list[ValidationIssue]:
        cl = element.codelist
        if cl is None or not cl.permissible_values:
            return []
        if str(value) not in cl.permissible_values:
            return [ValidationIssue(
                oid=element.oid, severity="error",
                message=(
                    f"Value {value!r} not in codelist {cl.code} "
                    f"({cl.system} {cl.version})."
                ),
            )]
        return []


if __name__ == "__main__":
    sex_codelist = CodelistRef(
        system="NCI-CT", code="C66731", version="2026-03-27",
        permissible_values=frozenset({"F", "M", "U", "UNDIFFERENTIATED"}),
    )
    dictionary = DataDictionary(
        version="1.2.0",
        meddra_version="29.0",
        elements=(
            DataElement(
                oid="DM.SEX", name="SEX", label="Sex", data_type=DataType.TEXT,
                core=Core.REQUIRED, codelist=sex_codelist,
                sdtm_target="DM.SEX", added_in="1.0.0",
            ),
            DataElement(
                oid="DM.AGE", name="AGE", label="Age", data_type=DataType.INTEGER,
                core=Core.EXPECTED, sdtm_target="DM.AGE", added_in="1.0.0",
            ),
        ),
    )

    for issue in dictionary.validate_record({"DM.SEX": "X", "DM.AGE": "42"}):
        print(f"[{issue.severity}] {issue.oid}: {issue.message}")

Two design choices are worth highlighting. First, definitions are frozen (ConfigDict(frozen=True)): a released DataElement or DataDictionary cannot be mutated, which is what “immutable version” means in practice — a change produces a new object and a new version string. Second, the codelist binding pins a terminology version, so when CDISC CT or MedDRA publishes a new release you create a new dictionary version that references it and can diff exactly which fields moved.
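
The "diff exactly which fields moved" step could look something like the sketch below. It deliberately uses plain mappings rather than the pydantic model so it runs with the standard library alone; `diff_releases` is a hypothetical helper, and the OIDs and version pins are example values only.

```python
def diff_releases(old: dict, new: dict) -> dict:
    """Compare two releases given as {oid: (system, pinned_version)} mappings."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        # Under deprecate-not-delete governance this list should stay empty.
        "removed": sorted(old.keys() - new.keys()),
        "repinned": sorted(oid for oid in shared if old[oid] != new[oid]),
    }


# Illustrative releases; OIDs and version pins are example values only.
older_release = {
    "DM.SEX": ("NCI-CT", "2025-12-19"),
    "AE.AEDECOD": ("MedDRA", "28.1"),
}
newer_release = {
    "DM.SEX": ("NCI-CT", "2026-03-27"),
    "AE.AEDECOD": ("MedDRA", "29.0"),
    "MH.MHDECOD": ("MedDRA", "29.0"),
}

report = diff_releases(older_release, newer_release)
print(report)
# {'added': ['MH.MHDECOD'], 'removed': [], 'repinned': ['AE.AEDECOD', 'DM.SEX']}
```

With the pydantic model, the same mappings could be derived from each `DataDictionary.elements` tuple by reading every element's `codelist` binding. A non-empty `removed` list would signal that a field was deleted rather than deprecated.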

Connecting the dictionary to the wider system

A data dictionary is only useful when it is wired into the surrounding pipelines:

Collection forms in the EDC are generated from CDASH-bound dictionary entries, so what is captured matches what will be submitted.
Incoming documents and datasets are checked against the dictionary’s codelists and type rules; this is the controlled vocabulary that Schema Validation & Error Categorization categorizes errors against, separating true terminology violations from formatting noise.
The dictionary’s SDTM mappings drive Define-XML generation for the submission package.
Every dictionary version, terminology pin, and field change is recorded in an append-only audit trail to satisfy 21 CFR Part 11 and EU Annex 11.
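
As an illustration of the Define-XML link, the sketch below turns dictionary attributes into a simplified `ItemDef` element. It is not a conformant Define-XML document: namespaces, the `def:` extensions (including origin metadata), and the enclosing ODM structure are all omitted, and the `IT.` and `CL.` OID prefixes are naming assumptions. `item_def` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def item_def(oid, name, label, data_type, codelist_code=None):
    """Build a simplified ItemDef element for one dictionary entry."""
    item = ET.Element("ItemDef", {
        "OID": f"IT.{oid}",        # "IT." prefix is a naming assumption
        "Name": name,
        "DataType": data_type,
    })
    description = ET.SubElement(item, "Description")
    ET.SubElement(description, "TranslatedText").text = label
    if codelist_code is not None:
        # "CL." prefix is likewise an assumed OID convention.
        ET.SubElement(item, "CodeListRef", {"CodeListOID": f"CL.{codelist_code}"})
    return item


sex = item_def("DM.SEX", "SEX", "Sex", "text", codelist_code="C66731")
age = item_def("DM.AGE", "AGE", "Age", "integer")
print(ET.tostring(sex, encoding="unicode"))
```

Because every attribute comes straight from the dictionary entry, the submission metadata cannot drift from the definitions used during collection and validation.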

FAQ

How is a data dictionary different from a study taxonomy?

A taxonomy organizes concepts into a hierarchy of names and relationships — the shared vocabulary. A data dictionary binds concrete, collected data elements to that vocabulary and adds the operational detail (data type, codelist, mapping target, version) needed to validate and submit data. Taxonomy work is covered in Regulatory Taxonomy Standardization.

Do I need to license MedDRA and CDISC terminologies?

Using MedDRA requires a subscription, which is managed by the MSSO (Maintenance and Support Services Organization). CDISC standards and the NCI-published CDISC Controlled Terminology are openly available, and LOINC is available free of charge under its license terms. Your dictionary should record which terminology versions it pins, regardless of how each is obtained, so the binding is reproducible and auditable.

What happens to collected data when MedDRA or CDISC CT releases a new version?

You do not edit the existing dictionary version in place. You create a new dictionary version that pins the new terminology release, diff the affected codelists, and run a controlled re-coding or impact assessment. Deprecated fields stay in the dictionary with their deprecated_in and superseded_by metadata, and each released version keeps its own terminology pin, so previously submitted data remains interpretable.

Where do field origins like Derived or Assigned matter?

Origin drives traceability for submission. SDTM and Define-XML require that derived variables document their derivation and that assigned values are distinguishable from collected ones. Capturing Origin in the dictionary lets you generate that metadata automatically rather than reconstructing it by hand.
