TextXtract Package: Architectural Plan
Mermaid Diagram: High-Level Architecture
classDiagram
class TextXtract {
+extract()
+extract_async()
}
class FileTypeHandler {
+extract(file_path, config)
}
TextExtractor <|-- SyncTextExtractor
TextExtractor <|-- AsyncTextExtractor
FileTypeHandler <|-- PDFHandler
FileTypeHandler <|-- DOCXHandler
FileTypeHandler <|-- DOCHandler
FileTypeHandler <|-- TXTHandler
FileTypeHandler <|-- ZIPHandler
FileTypeHandler <|-- RTFHandler
FileTypeHandler <|-- HTMLHandler
FileTypeHandler <|-- MDHandler
FileTypeHandler <|-- CSVHandler
FileTypeHandler <|-- JSONHandler
FileTypeHandler <|-- XMLHandler
SyncTextExtractor o-- FileTypeHandler
AsyncTextExtractor o-- FileTypeHandler
SyncTextExtractor o-- Registry
AsyncTextExtractor o-- Registry
Registry o-- FileTypeHandler
Key Features
- Extensible: Add new file handlers by subclassing
FileTypeHandler
and registering. - Sync & Async: Both interfaces, async is truly non-blocking.
- Robust: Comprehensive error handling, logging, and resource management.
- Configurable: Encoding, logging, handler registration, timeouts, per-handler config.
- Testable: Pytest-based tests for all components, including edge cases.
- Scalable: Lazy loading, caching, parallel processing (future-proofed).
- Secure: Input validation, safe file handling, ZIP traversal protection.
Implementation Steps
- Scaffold the directory and file structure.
- Implement core abstractions and utilities.
- Implement file type handlers.
- Implement sync and async extractors.
- Add configuration, logging, and error handling.
- Write comprehensive tests (unit, async, edge cases).
- Document usage and API (auto-generated docs).
- Finalize dependencies and code quality checks.
API Reference
Checkout the main API Reference for detailed documentation on each class and method.