LangExtract Documentation
Complete guide to using LangExtract for medical text extraction and structured data processing in healthcare environments.
Overview & Key Features
LangExtract is a Python library from Google for extracting structured information from unstructured medical and clinical text. Unlike generic text-processing tools, it is built around medical terminology, clinical context, and healthcare documentation standards.
Precise Source Grounding
Every extracted piece of information is mapped back to its exact location in the source document, so each value can be traced and verified against the original text, which medical applications require.
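As a rough illustration of what source grounding means in practice, the sketch below maps extracted values back to character offsets in the source text. It uses only the Python standard library; the `ground` helper and its return shape are assumptions made for illustration, not LangExtract's API.

```python
from typing import Dict, Optional, Tuple


def ground(source: str, values: Dict[str, str]) -> Dict[str, Optional[Tuple[int, int]]]:
    """Map each extracted value to its (start, end) character span in source.

    Hypothetical helper: values that cannot be found verbatim map to None.
    """
    spans = {}
    lowered = source.lower()
    for field, value in values.items():
        start = lowered.find(value.lower())  # case-insensitive verbatim match
        spans[field] = (start, start + len(value)) if start != -1 else None
    return spans


note = "Assessment: Acute coronary syndrome, rule out MI"
spans = ground(note, {"assessment": "acute coronary syndrome", "plan": "ECG"})
for field, span in spans.items():
    snippet = note[span[0]:span[1]] if span else None
    print(field, span, snippet)
```

A value with no span (here `plan`) is a useful signal: it was either paraphrased or not present in the text, and should be reviewed rather than trusted.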
Medical Intelligence
Built-in understanding of medical terminology, abbreviations, drug names, anatomical terms, and clinical contexts. Supports ICD-10, SNOMED CT, and other medical ontologies.
Interactive Visualizations
Generate comprehensive HTML dashboards and visualizations that help medical professionals quickly understand patterns and insights within clinical documentation.
Enterprise Scale
Process thousands of medical documents efficiently with advanced chunking, parallel processing, and optimized memory usage designed for healthcare enterprise environments.
Installation & Setup
System Requirements
- Python 3.8 or higher
- 4GB RAM minimum (8GB recommended for large documents)
- Internet connection for cloud-based LLM models
- Optional: CUDA support for GPU acceleration
Basic Installation
pip install langextract
Development Installation
# Install with development dependencies
pip install langextract[dev]
# Install with all optional features
pip install langextract[all]
# Install with specific model support
pip install langextract[gemini,openai]
Environment Configuration
# Set up environment variables
export GOOGLE_API_KEY="your-gemini-api-key"
export LANGEXTRACT_CACHE_DIR="/path/to/cache"
export LANGEXTRACT_LOG_LEVEL="INFO"
# For HIPAA compliance (local processing)
export LANGEXTRACT_LOCAL_ONLY="true"
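It can help to check these variables from your own application before starting an extraction run. The sketch below reads the four variables named above using only the standard library. The `Settings` class, the `load_settings` helper, the default values, and the rule that the API key is optional in local-only mode are all assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str]
    cache_dir: str
    log_level: str
    local_only: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above; the defaults here are assumptions."""
    local_only = env.get("LANGEXTRACT_LOCAL_ONLY", "false").strip().lower() == "true"
    api_key = env.get("GOOGLE_API_KEY")
    # Assumed rule: a cloud key is only needed when local-only mode is off.
    if not local_only and not api_key:
        raise RuntimeError("GOOGLE_API_KEY is required unless LANGEXTRACT_LOCAL_ONLY=true")
    return Settings(
        api_key=api_key,
        cache_dir=env.get("LANGEXTRACT_CACHE_DIR", ".langextract_cache"),
        log_level=env.get("LANGEXTRACT_LOG_LEVEL", "INFO"),
        local_only=local_only,
    )


settings = load_settings({"LANGEXTRACT_LOCAL_ONLY": "true"})
print(settings)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.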
Quick Start Guide
Basic Medical Text Extraction
import langextract
# Initialize extractor for clinical notes
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="clinical_notes",
    output_format="structured"
)
# Sample clinical text
clinical_text = """
Patient: John Doe, 65-year-old male
Chief Complaint: Chest pain and shortness of breath
History: Patient reports acute onset of substernal chest pain
radiating to left arm, associated with diaphoresis and nausea.
Physical Exam: BP 150/95, HR 102, RR 24, O2 sat 94% on room air
Assessment: Acute coronary syndrome, rule out MI
Plan: Serial troponins, ECG, cardiology consult
"""
# Define extraction schema
schema = {
    "patient_demographics": {
        "age": "integer",
        "gender": "string"
    },
    "chief_complaint": "string",
    "vital_signs": {
        "blood_pressure": "string",
        "heart_rate": "integer",
        "respiratory_rate": "integer",
        "oxygen_saturation": "string"
    },
    "assessment": "string",
    "plan": ["string"]
}
# Extract structured data
result = extractor.extract(clinical_text, schema)
# Access extracted data
print("Structured Data:", result.structured_data)
print("Source Mapping:", result.source_mapping)
print("Confidence Scores:", result.confidence_scores)
Expected Output
{
  "patient_demographics": {
    "age": 65,
    "gender": "male"
  },
  "chief_complaint": "Chest pain and shortness of breath",
  "vital_signs": {
    "blood_pressure": "150/95",
    "heart_rate": 102,
    "respiratory_rate": 24,
    "oxygen_saturation": "94% on room air"
  },
  "assessment": "Acute coronary syndrome, rule out MI",
  "plan": [
    "Serial troponins",
    "ECG",
    "cardiology consult"
  ]
}
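Before passing extracted output downstream, it is worth checking that it actually matches the schema. The following is a minimal sketch of such a check for schemas in the style shown above (nested dicts, one-element lists, and the type names "string", "integer", and "float"). The `schema_errors` function and the `TYPE_NAMES` mapping are hypothetical helpers, not part of the library, and other type names such as "date" are not handled.

```python
from typing import Any, List

# Map the schema's type names to Python types (assumed vocabulary).
TYPE_NAMES = {"string": str, "integer": int, "float": float}


def schema_errors(data: Any, schema: Any, path: str = "$") -> List[str]:
    """Return a list of mismatches between data and a schema of the shape shown above."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{path}: expected object"]
        errors = []
        for key, sub in schema.items():
            if key not in data:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(schema_errors(data[key], sub, f"{path}.{key}"))
        return errors
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{path}: expected list"]
        errors = []
        for i, item in enumerate(data):
            errors.extend(schema_errors(item, schema[0], f"{path}[{i}]"))
        return errors
    expected = TYPE_NAMES[schema]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(data, bool) or not isinstance(data, expected):
        return [f"{path}: expected {schema}"]
    return []


schema = {"vital_signs": {"heart_rate": "integer"}, "plan": ["string"]}
good = {"vital_signs": {"heart_rate": 102}, "plan": ["ECG"]}
bad = {"vital_signs": {"heart_rate": "102"}, "plan": "ECG"}
print(schema_errors(good, schema))
print(schema_errors(bad, schema))
```

Returning a list of path-qualified messages, rather than raising on the first mismatch, makes it easier to log every problem in a single pass.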
Complete API Reference
MedicalExtractor is the primary class for medical text extraction with specialized healthcare processing capabilities. Its constructor accepts the following parameters.
| Parameter | Type | Description | Default |
|---|---|---|---|
| model | str | LLM model to use ("gemini-pro", "gpt-4", "local") | "gemini-pro" |
| domain | str | Medical domain specialization | "general" |
| temperature | float | Model temperature for extraction consistency | 0.1 |
| max_tokens | int | Maximum tokens per extraction | 4096 |
| chunk_size | int | Text chunk size for large documents | 2000 |
The extract method extracts structured information from medical text using the specified schema.
Parameters:
- text (str): Medical text to process
- schema (dict): Extraction schema defining output structure
- confidence_threshold (float): Minimum confidence for extraction
- validate_medical_terms (bool): Enable medical terminology validation
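To make the effect of a confidence threshold concrete, the sketch below splits extracted fields into accepted values and values that need human review. It assumes confidence scores arrive as a flat mapping from field name to a number between 0 and 1; that shape, and the `apply_threshold` helper itself, are illustrative assumptions.

```python
from typing import Any, Dict, Tuple


def apply_threshold(
    values: Dict[str, Any], scores: Dict[str, float], threshold: float
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split extracted fields into (accepted, needs_review) by confidence."""
    accepted, needs_review = {}, {}
    for field, value in values.items():
        # Fields without a score are treated as confidence 0.0 and sent to review.
        target = accepted if scores.get(field, 0.0) >= threshold else needs_review
        target[field] = value
    return accepted, needs_review


values = {"heart_rate": 102, "gender": "male", "assessment": "ACS"}
scores = {"heart_rate": 0.97, "gender": 0.91, "assessment": 0.62}
accepted, needs_review = apply_threshold(values, scores, threshold=0.85)
print(accepted)
print(needs_review)
```

Routing low-confidence fields to review instead of discarding them keeps the information available while still protecting downstream consumers.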
The batch_extract method processes multiple medical documents in parallel for improved efficiency.
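The general pattern behind parallel batch processing can be sketched with the standard library's thread pool, which suits I/O-bound work such as calls to a remote model. Here `batch_map` and `fake_extract` are hypothetical stand-ins; the real extraction call would replace `fake_extract`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def batch_map(
    extract_one: Callable[[str], Dict], documents: List[str], workers: int = 4
) -> List[Dict]:
    """Apply extract_one to each document in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, documents))


def fake_extract(text: str) -> Dict:
    # Stand-in for a real extraction call; it only counts words.
    return {"word_count": len(text.split())}


results = batch_map(fake_extract, ["BP 150/95, HR 102", "Plan: ECG"])
print(results)
```

`pool.map` yields results in input order, so each result can be matched back to its source document by position.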
Medical Data Schemas
Clinical Notes Schema
clinical_notes_schema = {
    "patient_info": {
        "age": "integer",
        "gender": "string",
        "medical_record_number": "string"
    },
    "chief_complaint": "string",
    "history_present_illness": "string",
    "past_medical_history": ["string"],
    "medications": [
        {
            "name": "string",
            "dosage": "string",
            "frequency": "string"
        }
    ],
    "allergies": ["string"],
    "physical_exam": {
        "general": "string",
        "vital_signs": {
            "blood_pressure": "string",
            "heart_rate": "integer",
            "temperature": "float",
            "respiratory_rate": "integer"
        },
        "systems": {
            "cardiovascular": "string",
            "respiratory": "string",
            "neurological": "string"
        }
    },
    "assessment_and_plan": [
        {
            "diagnosis": "string",
            "icd10_code": "string",
            "plan": "string"
        }
    ]
}
Radiology Report Schema
radiology_schema = {
    "study_info": {
        "study_type": "string",
        "study_date": "date",
        "modality": "string"
    },
    "clinical_indication": "string",
    "technique": "string",
    "findings": [
        {
            "anatomical_location": "string",
            "finding": "string",
            "measurement": "string",
            "severity": "string"
        }
    ],
    "impression": "string",
    "recommendations": ["string"],
    "comparison": "string"
}
Advanced Usage Patterns
Custom Medical Domain Configuration
# Configure extractor for specific medical specialty
cardiology_extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    domain="cardiology",
    medical_ontology="snomed_ct",
    terminology_validation=True,
    confidence_threshold=0.85
)
# Add custom medical vocabulary
cardiology_extractor.add_vocabulary({
    "abbreviations": {
        "STEMI": "ST-elevation myocardial infarction",
        "NSTEMI": "Non-ST-elevation myocardial infarction",
        "PCI": "Percutaneous coronary intervention"
    },
    "procedures": [
        "cardiac catheterization",
        "angioplasty",
        "stent placement"
    ]
})
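One way to picture what an abbreviation vocabulary contributes is a preprocessing pass that annotates each abbreviation with its long form before extraction. The sketch below does this with a regular expression. The `expand_abbreviations` helper is hypothetical, and nothing here claims that the library works this way internally.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Append the long form after each known abbreviation, e.g. 'PCI (Percutaneous ...)'."""
    if not abbreviations:
        return text
    # Longest keys first so the alternation prefers "NSTEMI" over "STEMI".
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: f"{m.group(1)} ({abbreviations[m.group(1)]})", text)


abbreviations = {
    "STEMI": "ST-elevation myocardial infarction",
    "NSTEMI": "Non-ST-elevation myocardial infarction",
    "PCI": "Percutaneous coronary intervention",
}
expanded = expand_abbreviations("NSTEMI treated with PCI", abbreviations)
print(expanded)
```

The word-boundary anchors keep the pattern from matching abbreviations embedded in longer tokens.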
Parallel Processing for Large Document Sets
import asyncio
from langextract import AsyncMedicalExtractor
async def process_medical_records():
    extractor = AsyncMedicalExtractor(
        model="gemini-pro",
        max_concurrent=10,
        rate_limit=100  # requests per minute
    )
    # Process thousands of documents
    documents = load_medical_documents()  # Your document loader
    # Batch process with progress tracking
    results = await extractor.batch_extract_async(
        documents,
        schema=clinical_notes_schema,
        progress_callback=lambda progress: print(f"Progress: {progress}%")
    )
    return results
# Run async processing
results = asyncio.run(process_medical_records())
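The idea behind a `max_concurrent` limit can be sketched with plain asyncio: a semaphore caps how many extraction calls are in flight at once. The `bounded_gather` and `fake_extract` names below are hypothetical and stand in for whatever coroutine performs the real extraction. The sketch does not implement a per-minute rate limit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def bounded_gather(
    extract_one: Callable[[str], Awaitable[Dict]],
    documents: List[str],
    max_concurrent: int = 10,
) -> List[Dict]:
    """Run extract_one over all documents with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(doc: str) -> Dict:
        async with semaphore:
            return await extract_one(doc)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(doc) for doc in documents))


async def fake_extract(doc: str) -> Dict:
    await asyncio.sleep(0.01)  # stand-in for a network call
    return {"length": len(doc)}


results = asyncio.run(bounded_gather(fake_extract, ["chest pain", "ECG"], max_concurrent=2))
print(results)
```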
Performance Optimization
Optimization Strategies
- Chunk Size Optimization: Adjust chunk_size based on document length and complexity
- Parallel Processing: Use batch_extract for multiple documents
- Caching: Enable result caching for repeated extractions
- Model Selection: Choose appropriate model size for your accuracy requirements
- Schema Optimization: Simplify schemas to reduce processing time
# High-performance configuration
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    chunk_size=4000,        # Larger chunks for efficiency
    overlap_size=200,       # Overlap between chunks
    parallel_workers=8,     # Parallel processing threads
    cache_enabled=True,     # Enable result caching
    batch_size=50,          # Process 50 docs at once
    memory_efficient=True   # Optimize memory usage
)
# Monitor performance (documents and schema are defined as in the earlier examples)
with extractor.performance_monitor() as monitor:
    results = extractor.batch_extract(documents, schema)
print(f"Processing rate: {monitor.docs_per_second} docs/sec")
print(f"Average confidence: {monitor.average_confidence}")
print(f"Memory usage: {monitor.peak_memory_mb} MB")
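The interaction between chunk_size and overlap_size is easier to reason about with a small model. The sketch below splits text into overlapping windows and records where each chunk starts, so results can be mapped back to the full document. It assumes sizes are measured in characters; the source does not say whether the library counts characters or tokens, and `chunk_text` is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int, overlap_size: int) -> List[Tuple[int, str]]:
    """Split text into (start_offset, chunk) pairs that overlap by overlap_size characters."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap_size=1)
print(chunks)
```

Larger chunks mean fewer model calls per document, while the overlap reduces the chance that a finding spanning a chunk boundary is missed.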
Troubleshooting Guide
Common Issues and Solutions
Low Extraction Accuracy
- Verify medical terminology in your schema matches document language
- Raise the confidence threshold to filter out low-confidence extractions (this can reduce recall), and add domain-specific medical vocabulary
- Use domain-specific extractor configuration
- Validate document quality and OCR accuracy
Slow Processing Performance
- Optimize chunk size for your document types
- Enable parallel processing and adjust worker count
- Use batch processing for multiple documents
- Consider using smaller, faster models for simpler extractions
Memory Issues
- Enable memory_efficient mode
- Reduce batch size and chunk size
- Process documents sequentially instead of in parallel
- Clear cache periodically for long-running processes
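Reducing batch size and processing sequentially both come down to never holding more documents in memory than necessary. A minimal sketch of that idea, using a generator that yields small batches from any iterable, follows. The `batched` helper is illustrative and independent of the library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materialising the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched((f"doc-{i}" for i in range(5)), batch_size=2))
print(batches)
```

Because the input can itself be a generator (for example, one that reads files lazily), only one batch of documents needs to be resident at a time.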
Debugging Tools
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Use debug mode
extractor = langextract.MedicalExtractor(
    model="gemini-pro",
    debug=True,
    verbose=True
)
# Validate extraction results
result = extractor.extract(text, schema)
validation_report = extractor.validate_result(result)
print(validation_report.summary())
# Performance profiling
with extractor.profile() as profiler:
    result = extractor.extract(text, schema)
profiler.print_stats()
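Independently of any library-level profiler, Python's built-in cProfile can show where time goes in your own pipeline code. The sketch below profiles a stand-in function; `fake_extract` is a placeholder workload and would be replaced by the real extraction call.

```python
import cProfile
import io
import pstats


def fake_extract(text: str) -> int:
    # Stand-in workload; replace with a real extraction call.
    return sum(len(word) for word in text.split())


profiler = cProfile.Profile()
profiler.enable()
total = fake_extract("serial troponins ECG cardiology consult " * 1000)
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time surfaces the top-level calls that dominate a run, which is usually the right starting point before tuning chunk size or worker counts.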