PDF/DOCX Parsing for Clinical Docs

Extracting structured, validated data from clinical PDFs and Word documents is the foundation of automated site activation. This guide maps the parsing pipeline for protocols, IRB approvals, and consent forms using the maintained pypdf and python-docx libraries, covering forms, tables, encrypted files, and the validation that makes output audit-ready.

Clinical operations and regulatory-affairs teams receive a steady stream of documents during site activation: protocol amendments, investigator brochures, IRB/IEC approval letters, signed delegation logs, financial disclosure forms, and informed consent forms (ICFs). These arrive as native PDFs, scanned PDFs, AcroForm fillable PDFs, and Word .docx files with nested tables. Manual transcription is slow, error-prone, and breaks the ALCOA+ data-integrity chain. This cluster page maps the techniques for turning that mixed corpus into structured payloads that feed your validation and submission systems.

This page sits under the Automated Document Ingestion & Validation Workflows pillar. It focuses on the extraction stage; once you have structured fields, route them to Schema Validation & Error Categorization and reconcile them against site requirements via Checklist Sync & Gap Analysis. For purely image-based documents, hand off to OCR & Metadata Extraction Pipelines.

Choosing the right extraction path

There is no single “parse a document” call. The correct technique depends on the file type and how the text is stored. Routing each file to the right extractor — and detecting when a PDF has no real text layer — is the single most important design decision in the pipeline.

flowchart TD
    A[Incoming clinical document] --> B{File extension}
    B -->|docx| C[python-docx]
    B -->|pdf| D{Encrypted}
    D -->|yes| E[Decrypt with pypdf]
    D -->|no| F[Open with pypdf]
    E --> F
    F --> G{Has AcroForm fields}
    G -->|yes| H[Read form field values]
    G -->|no| I{Extractable text layer}
    I -->|yes| J[Extract text and tables]
    I -->|no| K[Route to OCR pipeline]
    C --> L[Structured field map]
    H --> L
    J --> L
    K --> L
    L --> M[Validation and audit log]

The key branch points: encryption must be resolved first; AcroForm-based PDFs expose data far more reliably as named field values than as free text; and a PDF with no extractable text is a scan that belongs in the OCR pipeline, not the text path.
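
The branch points above can be sketched as a pure routing function. This is a minimal illustration, not the pipeline's actual router: the `Route` enum, the `choose_route` name, and the `UNSUPPORTED` fallback are hypothetical, and the function assumes encryption has already been resolved and that the caller has measured form-field presence and text yield.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class Route(Enum):
    """Hypothetical extractor labels mirroring the flowchart branches."""

    DOCX = "python-docx"
    PDF_FORM = "pdf-acroform"
    PDF_TEXT = "pdf-text"
    OCR = "ocr"
    UNSUPPORTED = "unsupported"  # assumption: not part of the original flowchart


def choose_route(
    path: Path,
    has_form_fields: bool = False,
    chars_per_page: float = 0.0,
    min_chars: int = 20,
) -> Route:
    """Pick an extraction path for an already-decrypted document."""
    suffix = path.suffix.lower()
    if suffix == ".docx":
        return Route.DOCX
    if suffix != ".pdf":
        return Route.UNSUPPORTED
    # Named field values are preferred over free text when they exist.
    if has_form_fields:
        return Route.PDF_FORM
    # A usable text layer stays on the text path; anything else is a scan.
    if chars_per_page >= min_chars:
        return Route.PDF_TEXT
    return Route.OCR
```

Keeping the decision free of I/O makes it easy to unit-test each branch without sample documents.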

Library landscape: use the maintained packages

A critical correctness note: PyPDF2 is deprecated and no longer maintained. Its codebase was merged back into pypdf, which is the maintained successor and the library you should import today. Code written against PyPDF2 should migrate to pypdf — the API is largely compatible (PdfReader, PdfWriter), so the change is usually a one-line import swap. The child guide on parsing complex IRB consent forms with Python and PyPDF2 walks through the full consent-form case using modern pypdf.

| Task | Library | Notes |
| --- | --- | --- |
| Read/write/decrypt PDF, read form fields | `pypdf` | Maintained successor to `PyPDF2` |
| Extract text from native PDF | `pypdf` | `page.extract_text()` for reading order |
| Robust table extraction from PDF | `pdfplumber` | Word-level coordinates, ruling-line tables |
| Read `.docx` paragraphs, tables, headings | `python-docx` | XML-backed, preserves structure |
| OCR for scanned PDFs | Tesseract via `pytesseract` | LSTM engine, `--oem 1` |

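Install the core dependencies with pip install pypdf python-docx pdfplumber. Decrypting AES-encrypted PDFs additionally needs pypdf's optional crypto extra (pip install "pypdf[crypto]"). Never pin to PyPDF2 in new clinical pipelines.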
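
One way to enforce the "no PyPDF2" rule is a small check that flags legacy imports or requirement pins. The sketch below is an assumption about how such a check might look, not an established tool; `find_legacy_pypdf2` is a hypothetical helper, and it only inspects text you pass in (source code or a requirements file).

```python
from __future__ import annotations

import re

# Matches "import PyPDF2" and "from PyPDF2 import ..." in Python source.
_IMPORT = re.compile(r"^\s*(?:import|from)\s+PyPDF2\b")
# Matches a requirements-style line naming the package (pip names are
# case-insensitive), e.g. "PyPDF2==3.0.1" or a bare "pypdf2".
_PIN = re.compile(r"^\s*pypdf2\s*(?:[=<>~!\[;#]|$)", re.IGNORECASE)


def find_legacy_pypdf2(text: str) -> list[int]:
    """Return 1-based line numbers that import or pin the legacy package."""
    hits: list[int] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if _IMPORT.search(line) or _PIN.search(line):
            hits.append(number)
    return hits
```

Run it over each source and requirements file in CI and fail the build when the returned list is non-empty.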

Extracting text and form fields from PDFs

Most regulatory PDFs fall into two groups: documents with a flowing text layer (protocols, IB chapters) and AcroForm documents with named fillable fields (1572 forms, financial disclosures, many consent acknowledgement pages). The code below handles encryption, distinguishes a real text layer from a scan, and reads form fields when present.

"""Extract structured content from clinical PDF documents using pypdf.

pypdf is the maintained successor to the deprecated PyPDF2 library.
"""
from __future__ import annotations

import logging
import os
from dataclasses import dataclass, field
from pathlib import Path

from pypdf import PdfReader
from pypdf.errors import FileNotDecryptedError, PdfReadError

logger = logging.getLogger(__name__)

# Below this many characters per page we treat the PDF as a scan, not text.
MIN_TEXT_CHARS_PER_PAGE = 20


@dataclass
class ExtractedPdf:
    """Structured result of a PDF extraction."""

    path: str
    page_count: int
    text: str
    form_fields: dict[str, str] = field(default_factory=dict)
    is_probably_scanned: bool = False


def open_pdf(path: Path, password: str | None = None) -> PdfReader:
    """Open a PDF, transparently decrypting it when a password is supplied.

    The password is read from the environment by the caller, never hardcoded.
    Raises FileNotDecryptedError if the document is encrypted and the password
    is missing or wrong.
    """
    reader = PdfReader(str(path))
    if reader.is_encrypted:
        if not password:
            raise FileNotDecryptedError(f"{path} is encrypted but no password was provided")
        # decrypt() returns a PasswordType: 0 (NOT_DECRYPTED) on failure,
        # 1 (USER_PASSWORD) or 2 (OWNER_PASSWORD) on success.
        if reader.decrypt(password) == 0:
            raise FileNotDecryptedError(f"Incorrect password for {path}")
    return reader


def extract_pdf(path: Path, password: str | None = None) -> ExtractedPdf:
    """Extract text and AcroForm field values from a clinical PDF.

    Detects image-only (scanned) PDFs so the caller can route them to OCR.
    """
    try:
        reader = open_pdf(path, password)
    except (PdfReadError, FileNotDecryptedError):
        logger.exception("Failed to open PDF %s", path)
        raise

    pages_text: list[str] = []
    for page in reader.pages:
        # extract_text() returns "" for pages with no text layer; never None here.
        pages_text.append(page.extract_text() or "")
    text = "\n".join(pages_text)

    form_fields: dict[str, str] = {}
    fields = reader.get_fields()
    if fields:
        for name, obj in fields.items():
            value = obj.get("/V")
            if value is not None:
                form_fields[name] = str(value)

    page_count = len(reader.pages)
    avg_chars = len(text) / page_count if page_count else 0
    is_scanned = avg_chars < MIN_TEXT_CHARS_PER_PAGE and not form_fields

    return ExtractedPdf(
        path=str(path),
        page_count=page_count,
        text=text,
        form_fields=form_fields,
        is_probably_scanned=is_scanned,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Encryption passwords come from the environment, never source control.
    pdf_password = os.environ.get("PDF_PASSWORD")
    result = extract_pdf(Path("irb_approval.pdf"), password=pdf_password)
    if result.is_probably_scanned:
        logger.info("%s looks scanned; routing to OCR", result.path)
    else:
        logger.info("Extracted %d form fields", len(result.form_fields))

The is_probably_scanned flag is the explicit bridge to the OCR pipeline: when a PDF yields almost no text and no form fields, it is an image and should not be silently passed through with empty content. This prevents the classic failure mode of a “successful” parse that quietly drops an entire consent form.
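
A document-level average can hide a single image-only page, such as a scanned signature page appended to an otherwise native PDF. As a hedged complement to the flag above, the sketch below reports individual low-yield pages; `find_low_text_pages` is a hypothetical helper that takes the per-page strings already collected during extraction and reuses the same 20-character threshold as an assumption.

```python
from __future__ import annotations


def find_low_text_pages(pages_text: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose stripped text is below the threshold."""
    return [
        number
        for number, text in enumerate(pages_text, start=1)
        if len(text.strip()) < min_chars
    ]
```

Pages returned here could be sent to OCR individually while the rest of the document stays on the text path.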

Reading tables: ruling lines vs. coordinates

pypdf's page.extract_text() flattens tables into raw text, which is fine for prose but loses cell structure. Delegation logs, dosing schedules, and amendment history tables need real cell extraction. pdfplumber reads word-level coordinates and ruling lines, so it reconstructs rows and columns far more reliably.

"""Extract tables from a clinical PDF page using pdfplumber."""
from __future__ import annotations

from pathlib import Path

import pdfplumber


def extract_tables(path: Path) -> list[list[list[str]]]:
    """Return every table on every page as a list of rows of cell strings.

    Empty cells come back as "" rather than None so downstream validators
    can treat missing data uniformly.
    """
    tables: list[list[list[str]]] = []
    with pdfplumber.open(str(path)) as pdf:
        for page in pdf.pages:
            for raw_table in page.extract_tables():
                cleaned = [
                    [(cell or "").strip() for cell in row]
                    for row in raw_table
                ]
                tables.append(cleaned)
    return tables

For documents whose tables have no ruling lines, pass table_settings={"vertical_strategy": "text", "horizontal_strategy": "text"} to extract_tables() so column boundaries are inferred from word alignment instead of drawn lines.
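
Rows of cell strings are easier to validate once they are keyed by column name. The following sketch assumes the first row of each extracted table is a header row with unique labels, which will not hold for every clinical table; `table_to_records` is a hypothetical helper, not part of pdfplumber.

```python
from __future__ import annotations


def table_to_records(table: list[list[str]]) -> list[dict[str, str]]:
    """Convert a header-first table into one dict per non-empty data row."""
    if not table:
        return []
    header = [cell.strip() for cell in table[0]]
    records: list[dict[str, str]] = []
    for row in table[1:]:
        # Skip rows that are entirely blank (common as visual spacers).
        if not any(cell.strip() for cell in row):
            continue
        # Pad short rows so every header gets a value; extra cells are dropped.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records
```

For a delegation log this yields records such as `{"Name": ..., "Role": ..., "Start Date": ...}` that can be passed to field-level validation.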

Parsing DOCX with python-docx

Word documents store their content as structured XML, so python-docx gives you headings, paragraphs, and tables directly, which is far more robust than treating the file as flat text. Note one quirk: document.paragraphs and document.tables cover only the document body, not headers, footers, or footnotes. Headers and footers are reached per section (document.sections[n].header and .footer), so date stamps placed in a footer must be read separately.

"""Extract paragraphs and tables from a clinical DOCX file."""
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path

from docx import Document


@dataclass
class ExtractedDocx:
    """Structured result of a DOCX extraction."""

    paragraphs: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)


def extract_docx(path: Path) -> ExtractedDocx:
    """Read body paragraphs and tables from a .docx clinical document."""
    document = Document(str(path))

    paragraphs = [p.text.strip() for p in document.paragraphs if p.text.strip()]

    tables: list[list[list[str]]] = []
    for table in document.tables:
        rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
        tables.append(rows)

    return ExtractedDocx(paragraphs=paragraphs, tables=tables)
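
Header and footer text, such as version or date stamps, is stored in separate parts of the .docx package (word/header*.xml and word/footer*.xml). As an illustration that needs only the standard library, the sketch below reads those parts directly from the ZIP container; `read_header_footer_text` is a hypothetical helper, and it makes no attempt to map parts back to specific sections.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by w:p (paragraph) and w:t (text).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_header_footer_text(docx_file) -> dict[str, list[str]]:
    """Map each header/footer part name to its non-empty paragraph texts.

    ``docx_file`` may be a filesystem path or a binary file-like object.
    """
    results: dict[str, list[str]] = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in sorted(archive.namelist()):
            stem = name.rsplit("/", 1)[-1]
            is_part = stem.startswith(("header", "footer")) and stem.endswith(".xml")
            if not (name.startswith("word/") and is_part):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs: list[str] = []
            for para in root.iter(f"{W_NS}p"):
                # A paragraph's text is split across runs; join the w:t pieces.
                text = "".join(t.text or "" for t in para.iter(f"{W_NS}t")).strip()
                if text:
                    paragraphs.append(text)
            results[name] = paragraphs
    return results
```

The returned strings can be scanned for version and date stamps alongside the body paragraphs.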

Validating extracted fields

Extraction is only half the job. Raw strings must be coerced to typed, validated fields before they enter a submission queue, and validation failures must be explicit, not silent. pydantic gives you typed parsing with clear error messages, which maps cleanly onto the categorization tiers used downstream.
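
Dates are a common coercion problem: approval letters often print them as "12-Jan-2024" or "January 5, 2024" rather than ISO 8601. A minimal pre-processing sketch, assuming English month names and a hypothetical list of accepted formats, might normalize them before validation. Purely numeric slash dates are deliberately rejected here because day/month order is ambiguous.

```python
from __future__ import annotations

from datetime import datetime

# Assumed set of unambiguous formats; extend per sponsor or IRB conventions.
DATE_FORMATS = ("%Y-%m-%d", "%d-%b-%Y", "%d %b %Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw: str) -> str | None:
    """Return an ISO 8601 date string, or None if no known format matches."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace from extraction
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None instead of guessing keeps the failure explicit, so the unparsed value can be reported rather than silently misread.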

"""Validate extracted regulatory fields with pydantic v2."""
from __future__ import annotations

from datetime import date

from pydantic import BaseModel, Field, ValidationError, field_validator


class IRBApproval(BaseModel):
    """Validated IRB approval record extracted from a clinical document."""

    protocol_number: str = Field(min_length=3, max_length=64)
    irb_approval_date: date
    expiration_date: date
    principal_investigator: str = Field(min_length=2)

    @field_validator("expiration_date")
    @classmethod
    def expiry_after_approval(cls, value: date, info) -> date:
        approval = info.data.get("irb_approval_date")
        if approval and value <= approval:
            raise ValueError("expiration_date must be after irb_approval_date")
        return value


def validate_approval(fields: dict[str, str]) -> IRBApproval | None:
    """Build a validated IRBApproval, returning None on validation failure.

    A real pipeline would record the ValidationError details against the
    audit trail and categorize them rather than discarding them.
    """
    try:
        return IRBApproval(**fields)
    except ValidationError as exc:
        # Categorize and persist: missing-required vs. business-rule failure.
        for err in exc.errors():
            field_path = ".".join(str(p) for p in err["loc"])
            print(f"invalid field {field_path}: {err['msg']}")
        return None

The cross-field rule (expiration must follow approval) is exactly the kind of ALCOA+ “accurate and consistent” check that keeps bad data out of submissions. Categorizing those ValidationError entries into critical, warning, and informational tiers is covered in depth in categorizing validation errors in regulatory document pipelines.
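
To give a flavor of that categorization, the sketch below sorts error entries shaped like the dictionaries returned by pydantic's `ValidationError.errors()` (keys `type`, `loc`, `msg`) into tiers. The tier assignments are assumptions for illustration only: `TIER_BY_TYPE` and `categorize_errors` are hypothetical, and real rules would come from your own regulatory policy.

```python
from __future__ import annotations

# Assumed mapping: missing fields and failed business rules block submission.
TIER_BY_TYPE = {"missing": "critical", "value_error": "critical"}


def categorize_errors(errors: list[dict]) -> dict[str, list[str]]:
    """Group error messages by tier; unmapped error types default to warning."""
    tiers: dict[str, list[str]] = {"critical": [], "warning": [], "informational": []}
    for err in errors:
        field_path = ".".join(str(part) for part in err.get("loc", ()))
        tier = TIER_BY_TYPE.get(err.get("type", ""), "warning")
        tiers[tier].append(f"{field_path}: {err.get('msg', '')}")
    return tiers
```

Nothing is routed to the informational tier in this sketch; it is included only so the returned shape matches the three tiers named above.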

Encrypted PDFs and security

Many sponsor- and IRB-issued PDFs are password-protected or owner-locked. pypdf decrypts both user-password (open) and owner-password (permissions) encryption via reader.decrypt(password). Two rules keep this compliant:

Never hardcode passwords. Read them from environment variables or a secrets manager, as shown in the extraction code. Hardcoded credentials in a regulated pipeline are an audit finding.
Log access, not secrets. Record that a file was decrypted and by which process for the 21 CFR Part 11 audit trail, but never log the password itself.

If a document is encrypted with a certificate or an algorithm pypdf cannot handle, fail loudly and quarantine the file rather than skipping it.
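
The two rules and the quarantine step can be sketched with the standard library alone. The names here (`load_pdf_password`, `log_decrypt_access`, `quarantine`, and the `PDF_PASSWORD` default) are illustrative assumptions; a production system would more likely pull secrets from a secrets manager and write to a controlled audit store.

```python
from __future__ import annotations

import logging
import os
import shutil
from pathlib import Path

logger = logging.getLogger("pdf_access")


def load_pdf_password(env_var: str = "PDF_PASSWORD") -> str | None:
    """Read the password from the environment; empty values count as missing."""
    return os.environ.get(env_var) or None


def log_decrypt_access(path: Path, succeeded: bool, process: str) -> None:
    """Record that a decrypt was attempted and by whom, never the secret."""
    outcome = "decrypted" if succeeded else "decrypt-failed"
    logger.info("%s file=%s process=%s", outcome, path, process)


def quarantine(path: Path, quarantine_dir: Path) -> Path:
    """Move a file that could not be decrypted into a quarantine folder."""
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / path.name
    shutil.move(str(path), str(target))
    return target
```

Moving the file, rather than skipping it, leaves a visible artifact that someone must resolve before the document can re-enter the pipeline.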

Audit trail and data integrity

Every extraction must be reproducible and traceable to satisfy 21 CFR Part 11 and ALCOA+. In practice that means hashing the source file (SHA-256), recording the extraction method (text layer, AcroForm fields, or OCR), capturing the library versions, and writing a timestamped, append-only record linked to the CTMS document ID. Because pypdf and python-docx are deterministic for a given input and library version, the same file produces the same payload as long as versions are pinned, a property your audit trail can assert and inspectors can verify.
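
A minimal sketch of such a record, using only the standard library, is shown below. The record layout, the `append_audit_record` name, and the JSON Lines log file are assumptions for illustration; opening a file in append mode is a convention, not tamper-proofing, so a real Part 11 system would need storage-level controls on top.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the source file in chunks so large PDFs do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_version(name: str) -> str:
    """Return the installed version of a distribution, if it is present."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not-installed"


def append_audit_record(source: Path, method: str, document_id: str, log_path: Path) -> dict:
    """Append one JSON line describing an extraction and return the record."""
    record = {
        "document_id": document_id,  # e.g. the CTMS document ID
        "source_sha256": sha256_of(source),
        "method": method,  # "text-layer", "acroform", or "ocr"
        "library_versions": {
            name: library_version(name)
            for name in ("pypdf", "python-docx", "pdfplumber")
        },
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Re-running extraction on the same file should reproduce the same `source_sha256`, which gives reviewers a simple way to confirm the payload traces back to an unmodified source.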

Implementation checklist

Use pypdf, not the deprecated PyPDF2, in all new code
Detect scanned PDFs by low text yield and route them to OCR
Read AcroForm field values for fillable regulatory PDFs
Extract tables with pdfplumber, not flattened text, when cell structure matters
Read DOCX via python-docx structure, and remember headers/footers are separate
Decrypt encrypted PDFs with passwords from the environment, never hardcoded
Validate and type every extracted field with pydantic before it enters the queue
Hash sources and log method, versions, and timestamps for the audit trail

FAQ

Should I still use PyPDF2?

No. PyPDF2 is deprecated and unmaintained; its code was folded into pypdf, which is the actively maintained successor. New clinical pipelines should import pypdf. The migration is usually just swapping the import, since PdfReader and PdfWriter carry over.

How do I know whether a PDF is scanned or has real text?

Extract text from each page and measure the average characters per page. A document that yields almost no text and exposes no AcroForm fields is an image-only scan and should be routed to the OCR & Metadata Extraction Pipelines rather than treated as empty text.

What is the best way to extract tables from clinical PDFs?

Use pdfplumber, which reads word coordinates and ruling lines to reconstruct rows and columns. pypdf's page.extract_text() flattens tables into prose and is unsuitable when you need cell-level data such as delegation logs or dosing schedules.

How do I handle a password-protected consent form?

Open it with pypdf and call reader.decrypt(password) with a password read from an environment variable or secrets manager. Never embed the password in source. See the full worked example in parsing complex IRB consent forms with Python and PyPDF2.
