TextXtract
TextXtract is a professional, extensible Python package for extracting text from multiple file formats with both synchronous and asynchronous support.
Key Features
- Dual Input Support: Works with file paths or raw bytes
- Sync & Async APIs: Choose the right approach for your use case
- Multiple Formats: PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML
- Optional Dependencies: Install only what you need
- Robust Error Handling: Comprehensive exception hierarchy
- Professional Logging: Detailed debug and info level logging
- Thread-Safe: Async operations use thread pools for I/O-bound tasks
- Context Manager Support: Automatic resource cleanup
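As a rough illustration of the thread-pool point above, the sketch below shows one common way to expose a blocking call through an async API. It is not TextXtract's actual implementation: `blocking_extract` and `extract_async` are hypothetical stand-ins, and the only real API used is the standard library's `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
import time


def blocking_extract(path: str) -> str:
    """Hypothetical stand-in for a blocking, I/O-bound extraction call."""
    time.sleep(0.05)  # simulate disk or parser latency
    return f"text from {path}"


async def extract_async(path: str) -> str:
    # asyncio.to_thread runs the blocking function in the default
    # thread pool executor, so the event loop is not blocked meanwhile.
    return await asyncio.to_thread(blocking_extract, path)


if __name__ == "__main__":
    print(asyncio.run(extract_async("document.pdf")))
```

Offloading to a thread pool suits I/O-bound work such as reading files; CPU-heavy parsing would gain less from it because of the interpreter lock.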
Quick Example
Synchronous Extraction
from textxtract import SyncTextExtractor
extractor = SyncTextExtractor()
# From file path
text = extractor.extract("document.pdf")
# From bytes (filename required for type detection)
with open("document.pdf", "rb") as f:
    file_bytes = f.read()
text = extractor.extract(file_bytes, "document.pdf")
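Raw bytes carry no file name, which is why the example passes one alongside them. A minimal sketch of extension-based type detection might look like the following. The helper `detect_extension` and the `SUPPORTED` set are hypothetical and only illustrate the idea; the extensions are the ones listed under Supported File Types.

```python
from pathlib import Path

# Extensions from the supported-formats list; this set and the helper
# below are illustrative assumptions, not part of the library's API.
SUPPORTED = {
    ".txt", ".text", ".md", ".pdf", ".docx", ".doc", ".rtf",
    ".html", ".htm", ".csv", ".json", ".xml", ".zip",
}


def detect_extension(filename: str) -> str:
    """Return the lower-cased extension, or raise if it is not supported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return suffix
```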
Asynchronous Extraction
import asyncio

from textxtract import AsyncTextExtractor


async def extract_text():
    extractor = AsyncTextExtractor()

    # From file path
    text = await extractor.extract("document.pdf")

    # From bytes (filename required for type detection)
    with open("document.pdf", "rb") as f:
        file_bytes = f.read()
    text = await extractor.extract(file_bytes, "document.pdf")
    return text


text = asyncio.run(extract_text())
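The async API pays off mainly when several files are processed at once. The sketch below shows the general pattern with `asyncio.gather`. `FakeAsyncExtractor` is a hypothetical stand-in that only mimics the `await extractor.extract(path)` shape used above, so the example runs without the package installed; `extract_all` is likewise an illustrative helper.

```python
import asyncio


class FakeAsyncExtractor:
    """Hypothetical stand-in mimicking the async extract() call shape."""

    async def extract(self, path: str) -> str:
        await asyncio.sleep(0.01)  # simulate I/O latency
        return f"text from {path}"


async def extract_all(extractor, paths):
    # gather runs the coroutines concurrently and returns results
    # in the same order as the inputs.
    texts = await asyncio.gather(*(extractor.extract(p) for p in paths))
    return dict(zip(paths, texts))


if __name__ == "__main__":
    result = asyncio.run(extract_all(FakeAsyncExtractor(), ["a.pdf", "b.docx"]))
    print(result)
```

In real use, an `AsyncTextExtractor` instance would presumably take the place of the stand-in.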
Documentation
- Installation - Get started quickly
- Usage Guide - Comprehensive usage examples
- API Reference - Complete API documentation
- Testing - Running tests and validation
- Contributing - Help improve the project
- Changelog - Version history and updates
Supported File Types
Format | Extension | Dependencies | Handler |
---|---|---|---|
Text | .txt, .text | Built-in | stdlib |
Markdown | .md | pip install textxtract[md] | markdown |
PDF | .pdf | pip install textxtract[pdf] | PyMuPDF |
Word | .docx | pip install textxtract[docx] | python-docx |
Word Legacy | .doc | pip install textxtract[doc] | antiword |
Rich Text | .rtf | pip install textxtract[rtf] | pyrtf-ng |
HTML | .html, .htm | pip install textxtract[html] | beautifulsoup4 |
CSV | .csv | Built-in | stdlib |
JSON | .json | Built-in | stdlib |
XML | .xml | pip install textxtract[xml] | lxml |
ZIP Archives | .zip | Built-in | stdlib |
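Because most handlers are optional, it can help to check which ones are importable before processing files. The sketch below uses the standard library's `importlib.util.find_spec` for that. The `HANDLER_MODULES` mapping and `available_handlers` helper are illustrative assumptions rather than part of TextXtract; the import names are those of the packages themselves (PyMuPDF imports as `fitz`, python-docx as `docx`, beautifulsoup4 as `bs4`). The `doc` and `rtf` handlers are left out because their import names are not stated in this document.

```python
from importlib.util import find_spec

# Extra name -> top-level import name of the handler package.
# This mapping is an assumption made for illustration.
HANDLER_MODULES = {
    "pdf": "fitz",       # PyMuPDF
    "docx": "docx",      # python-docx
    "html": "bs4",       # beautifulsoup4
    "xml": "lxml",
    "md": "markdown",
}


def available_handlers(modules=HANDLER_MODULES):
    """Map each extra to True if its module can be imported, else False."""
    return {
        extra: find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_handlers().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```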
Error Handling
TextXtract raises custom exceptions so that callers can handle each failure mode separately:
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import (
    FileTypeNotSupportedError,
    InvalidFileError,
    ExtractionError,
)

extractor = SyncTextExtractor()

try:
    text = extractor.extract("document.pdf")
except FileTypeNotSupportedError:
    print("File type not supported")
except InvalidFileError:
    print("File is corrupted or invalid")
except ExtractionError:
    print("Extraction failed")
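When processing many files, it is often more useful to record failures than to stop at the first one. The sketch below shows that pattern in a library-agnostic way. `extract_many` is a hypothetical helper, not part of TextXtract; it takes any callable with the `extract(path)` shape plus the exception types to catch.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], str],
    paths: Iterable[str],
    errors: tuple = (Exception,),
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run extract() on each path, collecting failures instead of stopping."""
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except errors as exc:
            # Keep the exception type so callers can tell failure modes apart.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return texts, failures
```

With the library installed, one would presumably pass `extractor.extract` as the callable and the three exception classes shown above as `errors`.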
Why Choose TextXtract?
- Production Ready: Robust error handling and logging
- Flexible: Support for both file paths and bytes
- Performant: Async support for concurrent processing
- Lightweight: Optional dependencies keep it minimal
- Well Tested: Comprehensive test suite
- Well Documented: Clear examples and API docs
Get Started
Ready to extract text from your files? Check out our Installation Guide and Usage Examples.