LangExtract Documentation
Complete guide to using LangExtract for medical text extraction and structured data processing in healthcare environments.
Overview & Key Features
LangExtract is a Python library from Google for extracting structured information from unstructured medical and clinical text. Unlike generic text-processing tools, it is built to handle medical terminology, clinical context, and healthcare documentation standards.
Precise Source Grounding
Every extracted piece of information is mapped back to its exact location in the source document, ensuring complete traceability and verification capabilities essential for medical applications.
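To make the idea of source grounding concrete, here is a minimal sketch in plain Python. It is not LangExtract's implementation: the function name `locate_spans` is hypothetical, and it assumes each extracted value appears verbatim in the source text. Values that cannot be found are mapped to `None` so they can be flagged for manual review.

```python
from typing import Dict, Optional, Tuple


def locate_spans(source: str, values: Dict[str, str]) -> Dict[str, Optional[Tuple[int, int]]]:
    """Map each extracted value to (start, end) character offsets in source.

    Hypothetical helper: only exact, first-occurrence matches are handled.
    """
    spans: Dict[str, Optional[Tuple[int, int]]] = {}
    for field, value in values.items():
        start = source.find(value)  # -1 when the value is not present verbatim
        spans[field] = (start, start + len(value)) if start != -1 else None
    return spans


note = "Assessment: Acute coronary syndrome, rule out MI"
spans = locate_spans(note, {"assessment": "Acute coronary syndrome", "plan": "ECG"})

start, end = spans["assessment"]
print(note[start:end])   # the slice reproduces the extracted value
print(spans["plan"])     # None: "ECG" does not occur in this note
```

Storing offsets rather than copies of the text lets a reviewer highlight the exact passage that supports each extracted field.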
Medical Intelligence
Built-in understanding of medical terminology, abbreviations, drug names, anatomical terms, and clinical contexts. Supports ICD-10, SNOMED CT, and other medical ontologies.
Interactive Visualizations
Generate comprehensive HTML dashboards and visualizations that help medical professionals quickly understand patterns and insights within clinical documentation.
Enterprise Scale
Process thousands of medical documents efficiently with advanced chunking, parallel processing, and optimized memory usage designed for healthcare enterprise environments.
Installation & Setup
System Requirements
- Python 3.8 or higher
- 4GB RAM minimum (8GB recommended for large documents)
- Internet connection for cloud-based LLM models
- Optional: CUDA support for GPU acceleration
Basic Installation
pip install langextract
Development Installation
# Install with development dependencies
pip install langextract[dev]

# Install with all optional features
pip install langextract[all]

# Install with specific model support
pip install langextract[gemini,openai]
Environment Configuration
# Set up environment variables
export GOOGLE_API_KEY="your-gemini-api-key"
export LANGEXTRACT_CACHE_DIR="/path/to/cache"
export LANGEXTRACT_LOG_LEVEL="INFO"

# For HIPAA compliance (local processing)
export LANGEXTRACT_LOCAL_ONLY="true"
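Application code often needs to read these variables itself, for example to fail early when a key is missing. The sketch below uses only the standard library. The `Settings` class and `load_settings` function are hypothetical names, and the rule that an API key is required unless local-only mode is on is an assumption made for illustration, not documented library behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    api_key: Optional[str]
    cache_dir: Optional[str]
    log_level: str
    local_only: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above from env (hypothetical helper)."""
    local_only = env.get("LANGEXTRACT_LOCAL_ONLY", "false").strip().lower() == "true"
    api_key = env.get("GOOGLE_API_KEY")
    # Assumption: a cloud key is only needed when local-only mode is off.
    if not local_only and not api_key:
        raise RuntimeError("GOOGLE_API_KEY is required unless LANGEXTRACT_LOCAL_ONLY is true")
    return Settings(
        api_key=api_key,
        cache_dir=env.get("LANGEXTRACT_CACHE_DIR"),
        log_level=env.get("LANGEXTRACT_LOG_LEVEL", "INFO"),
        local_only=local_only,
    )


settings = load_settings({"LANGEXTRACT_LOCAL_ONLY": "true"})
print(settings.local_only, settings.log_level)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.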
Quick Start Guide
Basic Medical Text Extraction
import langextract

# Initialize extractor for clinical notes
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="clinical_notes",
    output_format="structured"
)

# Sample clinical text
clinical_text = """
Patient: John Doe, 65-year-old male
Chief Complaint: Chest pain and shortness of breath
History: Patient reports acute onset of substernal chest pain radiating to left arm, associated with diaphoresis and nausea.
Physical Exam: BP 150/95, HR 102, RR 24, O2 sat 94% on room air
Assessment: Acute coronary syndrome, rule out MI
Plan: Serial troponins, ECG, cardiology consult
"""

# Define extraction schema
schema = {
    "patient_demographics": {
        "age": "integer",
        "gender": "string"
    },
    "chief_complaint": "string",
    "vital_signs": {
        "blood_pressure": "string",
        "heart_rate": "integer",
        "respiratory_rate": "integer",
        "oxygen_saturation": "string"
    },
    "assessment": "string",
    "plan": ["string"]
}

# Extract structured data
result = extractor.extract(clinical_text, schema)

# Access extracted data
print("Structured Data:", result.structured_data)
print("Source Mapping:", result.source_mapping)
print("Confidence Scores:", result.confidence_scores)
Expected Output
{
  "patient_demographics": {
    "age": 65,
    "gender": "male"
  },
  "chief_complaint": "Chest pain and shortness of breath",
  "vital_signs": {
    "blood_pressure": "150/95",
    "heart_rate": 102,
    "respiratory_rate": 24,
    "oxygen_saturation": "94% on room air"
  },
  "assessment": "Acute coronary syndrome, rule out MI",
  "plan": [
    "Serial troponins",
    "ECG",
    "cardiology consult"
  ]
}
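Because model output can drift from the requested structure, it can help to check a result against the schema before using it. The following is a small standalone sketch, not part of LangExtract: `schema_errors` is a hypothetical helper, and it assumes the schema conventions shown in this guide (nested dicts, single-element lists, and the type names "string", "integer", and "float").

```python
from typing import Any, List

# Assumed mapping from schema type names to Python types
_TYPES = {"string": str, "integer": int, "float": (int, float)}


def schema_errors(data: Any, schema: Any, path: str = "$") -> List[str]:
    """Return human-readable mismatches between data and schema."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{path}: expected object"]
        errors: List[str] = []
        for key, sub_schema in schema.items():
            if key not in data:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(schema_errors(data[key], sub_schema, f"{path}.{key}"))
        return errors
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{path}: expected list"]
        errors = []
        for index, item in enumerate(data):
            errors.extend(schema_errors(item, schema[0], f"{path}[{index}]"))
        return errors
    expected = _TYPES.get(schema)
    if expected is None:
        return [f"{path}: unknown type {schema!r}"]
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(data, bool) or not isinstance(data, expected):
        return [f"{path}: expected {schema}, got {type(data).__name__}"]
    return []


demo_schema = {"age": "integer", "plan": ["string"], "vitals": {"heart_rate": "integer"}}
demo_result = {"age": "65", "plan": ["ECG"], "vitals": {}}
for problem in schema_errors(demo_result, demo_schema):
    print(problem)
```

An empty list means the result matches the schema; anything else points to the exact field path that needs attention.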
Complete API Reference
MedicalExtractor is the primary class for medical text extraction, with specialized healthcare processing capabilities.
Parameter | Type | Description | Default |
---|---|---|---|
model | str | LLM model to use ("gemini-pro", "gpt-4", "local") | "gemini-pro" |
domain | str | Medical domain specialization | "general" |
temperature | float | Model temperature for extraction consistency | 0.1 |
max_tokens | int | Maximum tokens per extraction | 4096 |
chunk_size | int | Text chunk size for large documents | 2000 |
The extract method extracts structured information from medical text using the specified schema.
Parameters:
- text (str): Medical text to process
- schema (dict): Extraction schema defining output structure
- confidence_threshold (float): Minimum confidence for extraction
- validate_medical_terms (bool): Enable medical terminology validation
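To illustrate what a confidence threshold does, the sketch below splits extracted fields into accepted and rejected groups. It is independent of the library: `split_by_confidence` is a hypothetical helper, and the flat field-to-score mapping is an assumed shape for confidence scores.

```python
from typing import Any, Dict, Tuple


def split_by_confidence(
    values: Dict[str, Any],
    scores: Dict[str, float],
    threshold: float,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (accepted, rejected) fields for a minimum confidence."""
    accepted: Dict[str, Any] = {}
    rejected: Dict[str, Any] = {}
    for field, value in values.items():
        # Fields without a score are treated as 0.0 and therefore rejected
        target = accepted if scores.get(field, 0.0) >= threshold else rejected
        target[field] = value
    return accepted, rejected


accepted, rejected = split_by_confidence(
    {"age": 65, "gender": "male", "plan": "ECG"},
    {"age": 0.97, "gender": 0.60},
    threshold=0.85,
)
print(accepted)
print(rejected)
```

Keeping the rejected fields, rather than discarding them, makes it possible to route them to a human reviewer.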
The batch_extract method processes multiple medical documents in parallel for improved efficiency.
Medical Data Schemas
Clinical Notes Schema
clinical_notes_schema = {
    "patient_info": {
        "age": "integer",
        "gender": "string",
        "medical_record_number": "string"
    },
    "chief_complaint": "string",
    "history_present_illness": "string",
    "past_medical_history": ["string"],
    "medications": [
        {
            "name": "string",
            "dosage": "string",
            "frequency": "string"
        }
    ],
    "allergies": ["string"],
    "physical_exam": {
        "general": "string",
        "vital_signs": {
            "blood_pressure": "string",
            "heart_rate": "integer",
            "temperature": "float",
            "respiratory_rate": "integer"
        },
        "systems": {
            "cardiovascular": "string",
            "respiratory": "string",
            "neurological": "string"
        }
    },
    "assessment_and_plan": [
        {
            "diagnosis": "string",
            "icd10_code": "string",
            "plan": "string"
        }
    ]
}
Radiology Report Schema
radiology_schema = {
    "study_info": {
        "study_type": "string",
        "study_date": "date",
        "modality": "string"
    },
    "clinical_indication": "string",
    "technique": "string",
    "findings": [
        {
            "anatomical_location": "string",
            "finding": "string",
            "measurement": "string",
            "severity": "string"
        }
    ],
    "impression": "string",
    "recommendations": ["string"],
    "comparison": "string"
}
Advanced Usage Patterns
Custom Medical Domain Configuration
# Configure extractor for specific medical specialty
cardiology_extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="cardiology",
    medical_ontology="snomed_ct",
    terminology_validation=True,
    confidence_threshold=0.85
)

# Add custom medical vocabulary
cardiology_extractor.add_vocabulary({
    "abbreviations": {
        "STEMI": "ST-elevation myocardial infarction",
        "NSTEMI": "Non-ST-elevation myocardial infarction",
        "PCI": "Percutaneous coronary intervention"
    },
    "procedures": [
        "cardiac catheterization",
        "angioplasty",
        "stent placement"
    ]
})
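One way to see what an abbreviation table is useful for is to expand abbreviations in the text before extraction. The sketch below does this with the standard `re` module. It does not show how the vocabulary is used internally, and `expand_abbreviations` is a hypothetical helper that assumes case-sensitive, whole-word matches.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Append the long form after each known abbreviation."""
    if not abbreviations:
        return text
    # Longest keys first so longer abbreviations win when alternatives overlap
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")
    return pattern.sub(
        lambda match: f"{match.group(1)} ({abbreviations[match.group(1)]})",
        text,
    )


cardiology_abbreviations = {
    "STEMI": "ST-elevation myocardial infarction",
    "NSTEMI": "Non-ST-elevation myocardial infarction",
    "PCI": "Percutaneous coronary intervention",
}
expanded = expand_abbreviations("NSTEMI ruled out; PCI planned.", cardiology_abbreviations)
print(expanded)
```

The word-boundary anchors keep "STEMI" from matching inside "NSTEMI", which matters because the two diagnoses are treated differently.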
Parallel Processing for Large Document Sets
import asyncio
from langextract import AsyncMedicalExtractor

async def process_medical_records():
    extractor = AsyncMedicalExtractor(
        model="gemini-pro",
        max_concurrent=10,
        rate_limit=100  # requests per minute
    )

    # Process thousands of documents
    documents = load_medical_documents()  # Your document loader

    # Batch process with progress tracking
    results = await extractor.batch_extract_async(
        documents,
        schema=clinical_notes_schema,
        progress_callback=lambda progress: print(f"Progress: {progress}%")
    )
    return results

# Run async processing
results = asyncio.run(process_medical_records())
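The effect of a concurrency limit such as `max_concurrent` can be reproduced with the standard library alone. The sketch below uses `asyncio.Semaphore` to cap the number of in-flight calls. `bounded_gather` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for a real extraction call.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def bounded_gather(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 10,
) -> List[dict]:
    """Run worker over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item: str) -> dict:
        async with semaphore:  # waits here while the limit is reached
            return await worker(item)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(run_one(item) for item in items)))


async def fake_extract(document: str) -> dict:
    """Stand-in for a real extraction call (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"length": len(document)}


results = asyncio.run(
    bounded_gather(["note a", "longer note b"], fake_extract, max_concurrent=2)
)
print(results)
```

A semaphore limits simultaneous calls but not calls per minute, so a per-minute rate limit would need a separate mechanism such as a token bucket.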
Performance Optimization
Optimization Strategies
- Chunk Size Optimization: Adjust chunk_size based on document length and complexity
- Parallel Processing: Use batch_extract for multiple documents
- Caching: Enable result caching for repeated extractions
- Model Selection: Choose appropriate model size for your accuracy requirements
- Schema Optimization: Simplify schemas to reduce processing time
# High-performance configuration
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    chunk_size=4000,        # Larger chunks for efficiency
    overlap_size=200,       # Overlap between chunks
    parallel_workers=8,     # Parallel processing threads
    cache_enabled=True,     # Enable result caching
    batch_size=50,          # Process 50 docs at once
    memory_efficient=True   # Optimize memory usage
)

# Monitor performance
with extractor.performance_monitor() as monitor:
    results = extractor.batch_extract(documents, schema)
    print(f"Processing rate: {monitor.docs_per_second} docs/sec")
    print(f"Average accuracy: {monitor.average_confidence}")
    print(f"Memory usage: {monitor.peak_memory_mb} MB")
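To show how chunk size and overlap interact, here is a minimal character-based chunker. It is only a sketch: `chunk_text` is a hypothetical function, and measuring sizes in characters is an assumption, since this guide does not state the unit the library uses.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 4000, overlap_size: int = 200) -> List[str]:
    """Split text into chunks of chunk_size characters that overlap by overlap_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    step = chunk_size - overlap_size  # how far each new chunk advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap_size=1))
```

Larger chunks mean fewer model calls per document, while the overlap reduces the chance that an entity is split across a chunk boundary and missed.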
Troubleshooting Guide
Common Issues and Solutions
Low Extraction Accuracy
- Verify medical terminology in your schema matches document language
- Raise the confidence threshold to filter out low-confidence extractions (at some cost to recall), and add domain-specific medical vocabulary
- Use domain-specific extractor configuration
- Validate document quality and OCR accuracy
Slow Processing Performance
- Optimize chunk size for your document types
- Enable parallel processing and adjust worker count
- Use batch processing for multiple documents
- Consider using smaller, faster models for simpler extractions
Memory Issues
- Enable memory_efficient mode
- Reduce batch size and chunk size
- Process documents sequentially instead of in parallel
- Clear cache periodically for long-running processes
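For long-running processes, a cache with a fixed upper bound avoids unbounded memory growth without manual clearing. The sketch below is a generic least-recently-used cache built on `collections.OrderedDict`. `BoundedCache` is a hypothetical class, not the library's cache, and it keys entries on the document text only; a real cache would likely also need the schema and model in the key.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class BoundedCache:
    """Least-recently-used cache keyed by a hash of the document text."""

    def __init__(self, max_entries: int = 1000) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, dict]" = OrderedDict()

    @staticmethod
    def key_for(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text: str, compute: Callable[[str], dict]) -> dict:
        key = self.key_for(text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        result = compute(text)
        self._entries[key] = result
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result

    def __len__(self) -> int:
        return len(self._entries)


cache = BoundedCache(max_entries=2)
first = cache.get_or_compute("note a", lambda text: {"length": len(text)})
again = cache.get_or_compute("note a", lambda text: {"length": -1})  # served from cache
print(first, again, len(cache))
```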
Debugging Tools
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Use debug mode
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    debug=True,
    verbose=True
)

# Validate extraction results
result = extractor.extract(text, schema)
validation_report = extractor.validate_result(result)
print(validation_report.summary())

# Performance profiling
with extractor.profile() as profiler:
    result = extractor.extract(text, schema)
    profiler.print_stats()