LangExtract Documentation

Complete guide to using LangExtract for medical text extraction and structured data processing in healthcare environments.

Overview & Key Features

LangExtract is Google's premier Python library designed specifically for extracting structured information from unstructured medical and clinical texts. Unlike generic text processing tools, LangExtract understands medical terminology, clinical contexts, and healthcare documentation standards.

Precise Source Grounding

Every extracted piece of information is mapped back to its exact location in the source document, ensuring complete traceability and verification capabilities essential for medical applications.
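
The idea of source grounding can be illustrated without the library. The sketch below is not LangExtract's implementation; it assumes extracted values appear verbatim in the source, and the helper name locate_spans is hypothetical. It maps each value to its character offsets, or to None when the value cannot be found and should be reviewed.

```python
from typing import Dict, Optional, Tuple


def locate_spans(source: str, values: Dict[str, str]) -> Dict[str, Optional[Tuple[int, int]]]:
    """Map each extracted value to its (start, end) character offsets in source.

    Values that do not occur verbatim map to None so they can be flagged for review.
    """
    spans: Dict[str, Optional[Tuple[int, int]]] = {}
    for field, value in values.items():
        start = source.find(value)  # first verbatim occurrence, or -1
        spans[field] = (start, start + len(value)) if start != -1 else None
    return spans


note = "Chief Complaint: Chest pain and shortness of breath"
extracted = {"chief_complaint": "Chest pain and shortness of breath", "plan": "ECG"}
spans = locate_spans(note, extracted)
start, end = spans["chief_complaint"]
print(note[start:end])   # the grounded text, sliced back out of the source
print(spans["plan"])     # None: "ECG" does not occur in this note
```

Slicing the source with the returned offsets reproduces the extracted value, which is the property a reviewer relies on when verifying an extraction against the original document.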

Medical Intelligence

Built-in understanding of medical terminology, abbreviations, drug names, anatomical terms, and clinical contexts. Supports ICD-10, SNOMED CT, and other medical ontologies.

Interactive Visualizations

Generate comprehensive HTML dashboards and visualizations that help medical professionals quickly understand patterns and insights within clinical documentation.

Enterprise Scale

Process thousands of medical documents efficiently with advanced chunking, parallel processing, and optimized memory usage designed for healthcare enterprise environments.

Healthcare Focus: LangExtract is specifically optimized for medical text processing, achieving over 95% accuracy on clinical documentation compared to 60-70% for generic NLP tools.

Installation & Setup

System Requirements

  • Python 3.8 or higher
  • 4GB RAM minimum (8GB recommended for large documents)
  • Internet connection for cloud-based LLM models
  • Optional: CUDA support for GPU acceleration

Basic Installation

pip install langextract

Development Installation

# Install with development dependencies
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "langextract[dev]"

# Install with all optional features
pip install "langextract[all]"

# Install with specific model support
pip install "langextract[gemini,openai]"

Environment Configuration

# Set up environment variables
export GOOGLE_API_KEY="your-gemini-api-key"
export LANGEXTRACT_CACHE_DIR="/path/to/cache"
export LANGEXTRACT_LOG_LEVEL="INFO"

# For HIPAA compliance (local processing)
export LANGEXTRACT_LOCAL_ONLY="true"

Security Note: For HIPAA-compliant environments, ensure you configure LangExtract for local processing only and never send patient data to external APIs without proper consent and encryption.
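
This guide does not say how the library itself enforces LANGEXTRACT_LOCAL_ONLY, so an application-side guard may be a reasonable extra safeguard. The sketch below is an assumption-laden illustration: check_model_allowed and LOCAL_MODELS are hypothetical names, and only the environment variable and the "local" model name come from this guide.

```python
import os
from typing import Mapping

# Hypothetical set of model names treated as on-premises; not part of LangExtract.
LOCAL_MODELS = {"local"}


def check_model_allowed(model: str, env: Mapping[str, str] = os.environ) -> None:
    """Raise if local-only mode is on and the model would call an external API."""
    local_only = env.get("LANGEXTRACT_LOCAL_ONLY", "false").strip().lower() == "true"
    if local_only and model not in LOCAL_MODELS:
        raise RuntimeError(f"{model!r} is blocked while LANGEXTRACT_LOCAL_ONLY is true")


check_model_allowed("local", {"LANGEXTRACT_LOCAL_ONLY": "true"})  # passes silently
try:
    check_model_allowed("gemini-pro", {"LANGEXTRACT_LOCAL_ONLY": "true"})
except RuntimeError as err:
    print(err)
```

Calling the guard before constructing an extractor fails fast, rather than relying on every call site to remember the restriction.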

Quick Start Guide

Basic Medical Text Extraction

import langextract

# Initialize extractor for clinical notes
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="clinical_notes",
    output_format="structured"
)

# Sample clinical text
clinical_text = """
Patient: John Doe, 65-year-old male
Chief Complaint: Chest pain and shortness of breath
History: Patient reports acute onset of substernal chest pain 
radiating to left arm, associated with diaphoresis and nausea.
Physical Exam: BP 150/95, HR 102, RR 24, O2 sat 94% on room air
Assessment: Acute coronary syndrome, rule out MI
Plan: Serial troponins, ECG, cardiology consult
"""

# Define extraction schema
schema = {
    "patient_demographics": {
        "age": "integer",
        "gender": "string"
    },
    "chief_complaint": "string",
    "vital_signs": {
        "blood_pressure": "string",
        "heart_rate": "integer",
        "respiratory_rate": "integer",
        "oxygen_saturation": "string"
    },
    "assessment": "string",
    "plan": ["string"]
}

# Extract structured data
result = extractor.extract(clinical_text, schema)

# Access extracted data
print("Structured Data:", result.structured_data)
print("Source Mapping:", result.source_mapping)
print("Confidence Scores:", result.confidence_scores)

Expected Output

{
  "patient_demographics": {
    "age": 65,
    "gender": "male"
  },
  "chief_complaint": "Chest pain and shortness of breath",
  "vital_signs": {
    "blood_pressure": "150/95",
    "heart_rate": 102,
    "respiratory_rate": 24,
    "oxygen_saturation": "94% on room air"
  },
  "assessment": "Acute coronary syndrome, rule out MI",
  "plan": [
    "Serial troponins",
    "ECG",
    "cardiology consult"
  ]
}
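
Before passing results downstream, it can help to check that the returned data matches the schema. The following is a minimal sketch, independent of LangExtract, that assumes the schema notation used in this guide ("string", "integer", "float", nested dicts, and one-element lists). The validate function and TYPE_MAP are illustrative names, not library API.

```python
from typing import Any, List

# Type names as used in the schemas in this guide, mapped to Python types.
TYPE_MAP = {"string": str, "integer": int, "float": (int, float)}


def validate(data: Any, schema: Any, path: str = "") -> List[str]:
    """Return a list of mismatches between data and schema (empty if valid)."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{path or 'root'}: expected object"]
        errors: List[str] = []
        for key, sub_schema in schema.items():
            child = f"{path}.{key}" if path else key
            if key not in data:
                errors.append(f"{child}: missing")
            else:
                errors.extend(validate(data[key], sub_schema, child))
        return errors
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{path}: expected list"]
        errors = []
        for i, item in enumerate(data):
            errors.extend(validate(item, schema[0], f"{path}[{i}]"))
        return errors
    expected = TYPE_MAP.get(schema)
    if expected is None:
        return [f"{path}: unknown type {schema!r}"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(data, bool) or not isinstance(data, expected):
        return [f"{path}: expected {schema}"]
    return []


schema = {"age": "integer", "plan": ["string"]}
print(validate({"age": 65, "plan": ["ECG"]}, schema))       # []
print(validate({"age": "65", "plan": ["ECG", 3]}, schema))  # two mismatches
```

Type names outside TYPE_MAP, such as "date", are reported as unknown rather than silently accepted, so the map needs extending before validating schemas that use them.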

Complete API Reference

class MedicalExtractor(model, domain, **kwargs)

Primary class for medical text extraction with specialized healthcare processing capabilities.

Parameter    Type   Description                                         Default
model        str    LLM model to use ("gemini-pro", "gpt-4", "local")   "gemini-pro"
domain       str    Medical domain specialization                       "general"
temperature  float  Model temperature for extraction consistency        0.1
max_tokens   int    Maximum tokens per extraction                       4096
chunk_size   int    Text chunk size for large documents                 2000

extractor.extract(text, schema, **options) → ExtractionResult

Extract structured information from medical text using the specified schema.

Parameters:

  • text (str): Medical text to process
  • schema (dict): Extraction schema defining output structure
  • confidence_threshold (float): Minimum confidence for extraction
  • validate_medical_terms (bool): Enable medical terminology validation

extractor.batch_extract(documents, schema) → List[ExtractionResult]

Process multiple medical documents in parallel for improved efficiency.

Medical Data Schemas

Clinical Notes Schema

clinical_notes_schema = {
    "patient_info": {
        "age": "integer",
        "gender": "string",
        "medical_record_number": "string"
    },
    "chief_complaint": "string",
    "history_present_illness": "string",
    "past_medical_history": ["string"],
    "medications": [
        {
            "name": "string",
            "dosage": "string",
            "frequency": "string"
        }
    ],
    "allergies": ["string"],
    "physical_exam": {
        "general": "string",
        "vital_signs": {
            "blood_pressure": "string",
            "heart_rate": "integer",
            "temperature": "float",
            "respiratory_rate": "integer"
        },
        "systems": {
            "cardiovascular": "string",
            "respiratory": "string",
            "neurological": "string"
        }
    },
    "assessment_and_plan": [
        {
            "diagnosis": "string",
            "icd10_code": "string",
            "plan": "string"
        }
    ]
}

Radiology Report Schema

radiology_schema = {
    "study_info": {
        "study_type": "string",
        "study_date": "date",
        "modality": "string"
    },
    "clinical_indication": "string",
    "technique": "string",
    "findings": [
        {
            "anatomical_location": "string",
            "finding": "string",
            "measurement": "string",
            "severity": "string"
        }
    ],
    "impression": "string",
    "recommendations": ["string"],
    "comparison": "string"
}

Advanced Usage Patterns

Custom Medical Domain Configuration

# Configure extractor for specific medical specialty
cardiology_extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="cardiology",
    medical_ontology="snomed_ct",
    terminology_validation=True,
    confidence_threshold=0.85
)

# Add custom medical vocabulary
cardiology_extractor.add_vocabulary({
    "abbreviations": {
        "STEMI": "ST-elevation myocardial infarction",
        "NSTEMI": "Non-ST-elevation myocardial infarction",
        "PCI": "Percutaneous coronary intervention"
    },
    "procedures": [
        "cardiac catheterization",
        "angioplasty",
        "stent placement"
    ]
})
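
To make the effect of an abbreviation vocabulary concrete, here is a small standalone sketch of whole-word abbreviation expansion using Python's re module. It is not how add_vocabulary works internally, which this guide does not describe; expand_abbreviations is a hypothetical helper.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Append the long form in parentheses after each whole-word abbreviation."""
    if not abbreviations:
        return text
    # Try longer keys first so "NSTEMI" takes precedence over "STEMI".
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: f"{m.group(1)} ({abbreviations[m.group(1)]})", text)


vocabulary = {
    "STEMI": "ST-elevation myocardial infarction",
    "NSTEMI": "Non-ST-elevation myocardial infarction",
    "PCI": "Percutaneous coronary intervention",
}
print(expand_abbreviations("NSTEMI, PCI planned", vocabulary))
```

The word-boundary anchors keep "STEMI" from matching inside "NSTEMI", which matters whenever one abbreviation is a suffix of another.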

Parallel Processing for Large Document Sets

import asyncio
from langextract import AsyncMedicalExtractor

async def process_medical_records():
    extractor = AsyncMedicalExtractor(
        model="gemini-pro",
        max_concurrent=10,
        rate_limit=100  # requests per minute
    )
    
    # Process thousands of documents
    documents = load_medical_documents()  # Your document loader
    
    # Batch process with progress tracking
    results = await extractor.batch_extract_async(
        documents,
        schema=clinical_notes_schema,
        progress_callback=lambda progress: print(f"Progress: {progress}%")
    )
    
    return results

# Run async processing
results = asyncio.run(process_medical_records())
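
The max_concurrent setting above suggests a cap on in-flight requests. One common way to get that behavior with the standard library alone is an asyncio.Semaphore, sketched below. Whether AsyncMedicalExtractor works this way is an assumption; bounded_gather and fake_extract are placeholder names for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


async def bounded_gather(
    items: List[str],
    worker: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 10,
) -> List[dict]:
    """Run worker over items with at most max_concurrent calls in flight.

    Results are returned in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item: str) -> dict:
        async with semaphore:  # wait here once the limit is reached
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_extract(text: str) -> dict:
    # Placeholder for a real extraction call; it only simulates I/O latency.
    await asyncio.sleep(0.01)
    return {"length": len(text)}


demo_results = asyncio.run(
    bounded_gather(["note one", "note two!"], fake_extract, max_concurrent=2)
)
print(demo_results)
```

A semaphore bounds concurrency but not request rate; a requests-per-minute limit like rate_limit would need a separate mechanism such as a token bucket.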

Performance Optimization

Performance Tips: LangExtract can process over 10,000 medical documents per hour with proper configuration and sufficient compute resources.

Optimization Strategies

  • Chunk Size Optimization: Adjust chunk_size based on document length and complexity
  • Parallel Processing: Use batch_extract for multiple documents
  • Caching: Enable result caching for repeated extractions
  • Model Selection: Choose appropriate model size for your accuracy requirements
  • Schema Optimization: Simplify schemas to reduce processing time

# High-performance configuration
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    chunk_size=4000,           # Larger chunks for efficiency
    overlap_size=200,          # Overlap between chunks
    parallel_workers=8,        # Parallel processing threads
    cache_enabled=True,        # Enable result caching
    batch_size=50,            # Process 50 docs at once
    memory_efficient=True      # Optimize memory usage
)

# Monitor performance
with extractor.performance_monitor() as monitor:
    results = extractor.batch_extract(documents, schema)
    print(f"Processing rate: {monitor.docs_per_second} docs/sec")
    print(f"Average confidence: {monitor.average_confidence}")
    print(f"Memory usage: {monitor.peak_memory_mb} MB")
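
The chunk_size and overlap_size settings describe a sliding window over the document. The sketch below shows one straightforward way such chunking could work. It assumes sizes are measured in characters (this guide does not specify the unit), and chunk_text is a hypothetical helper rather than a library function.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 4000, overlap_size: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first repeats the last overlap_size characters of the
    previous chunk, so content that straddles a boundary is seen whole at least once.
    """
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_size < chunk_size")
    step = chunk_size - overlap_size
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap_size=1))
# ['abcd', 'defg', 'ghij']
```

Larger overlaps reduce the chance of splitting a finding across chunks, at the cost of processing more repeated text.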

Troubleshooting Guide

Common Issues and Solutions

Low Extraction Accuracy

  • Verify medical terminology in your schema matches document language
  • Increase confidence threshold and add medical vocabulary
  • Use domain-specific extractor configuration
  • Validate document quality and OCR accuracy
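
Raising the confidence threshold trades coverage for precision: fewer fields are accepted, but those that remain are more reliable. The sketch below illustrates that trade-off, assuming confidence scores arrive as a flat mapping from field name to a score between 0 and 1. The actual shape of result.confidence_scores is not documented here, and split_by_confidence is a hypothetical helper.

```python
from typing import Any, Dict, Tuple


def split_by_confidence(
    data: Dict[str, Any], scores: Dict[str, float], threshold: float = 0.85
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (accepted, needs_review); fields without a score go to needs_review."""
    accepted: Dict[str, Any] = {}
    needs_review: Dict[str, Any] = {}
    for field, value in data.items():
        target = accepted if scores.get(field, 0.0) >= threshold else needs_review
        target[field] = value
    return accepted, needs_review


data = {"age": 65, "assessment": "Acute coronary syndrome"}
scores = {"age": 0.97, "assessment": 0.60}
accepted, needs_review = split_by_confidence(data, scores, threshold=0.85)
print(accepted)      # {'age': 65}
print(needs_review)  # {'assessment': 'Acute coronary syndrome'}
```

Routing low-confidence fields to manual review, rather than discarding them, keeps information available for a clinician to confirm.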

Slow Processing Performance

  • Optimize chunk size for your document types
  • Enable parallel processing and adjust worker count
  • Use batch processing for multiple documents
  • Consider using smaller, faster models for simpler extractions

Memory Issues

  • Enable memory_efficient mode
  • Reduce batch size and chunk size
  • Process documents sequentially instead of in parallel
  • Clear cache periodically for long-running processes
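
When memory is the constraint, feeding documents to the extractor in small batches from a lazy iterator avoids holding the whole corpus in memory. This is a generic standard-library sketch; batched is an illustrative helper, and the extraction call itself is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items without materializing the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a loader that reads documents one at a time.
docs = (f"document {i}" for i in range(5))
sizes = [len(batch) for batch in batched(docs, batch_size=2)]
print(sizes)  # [2, 2, 1]
```

Each batch can be processed and its results written out before the next batch is read, so peak memory scales with batch_size rather than with the number of documents.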

Debugging Tools

# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Use debug mode
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    debug=True,
    verbose=True
)

# Validate extraction results
result = extractor.extract(text, schema)
validation_report = extractor.validate_result(result)
print(validation_report.summary())

# Performance profiling
with extractor.profile() as profiler:
    result = extractor.extract(text, schema)
    profiler.print_stats()

Need More Help? Check our comprehensive documentation and examples above, or refer to the integration guide for detailed implementation assistance.