Installation
Text Extractor is designed to be lightweight and modular, allowing you to install only the dependencies you need for your specific use case.
Basic Installation
Install the core package without optional dependencies:
pip install textxtract
This provides basic text extraction for:
- .txt and .text files
- .csv files
- .json files
- .zip archives
Install with File Type Support
Install support for specific file types using optional extras:
Individual File Types
# PDF support
pip install textxtract[pdf]
# Microsoft Word (.docx) support
pip install textxtract[docx]
# Legacy Word (.doc) support
pip install textxtract[doc]
# Markdown support
pip install textxtract[md]
# Rich Text Format support
pip install textxtract[rtf]
# HTML support
pip install textxtract[html]
# XML support
pip install textxtract[xml]
Multiple File Types
# Install support for multiple formats
pip install textxtract[pdf,docx,html]
# Install all supported formats
pip install textxtract[all]
Available Extras
Extra | Dependencies | File Types Supported |
---|---|---|
pdf | pymupdf | .pdf |
docx | python-docx | .docx |
doc | antiword | .doc |
md | markdown, beautifulsoup4 | .md |
rtf | pyrtf-ng | .rtf |
html | beautifulsoup4, lxml | .html, .htm |
xml | lxml | .xml |
all | All of the above | All supported types |
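To see which extras are already satisfied in the current environment, a small check along these lines may help. It is a minimal sketch, not part of textxtract: the helper names `is_installed` and `missing_by_extra` are hypothetical, the distribution names are copied from the table above, and `doc` is left out on the assumption that antiword is a system tool rather than a pip package.

```python
from importlib import metadata

# Python distributions per extra, copied from the table above.
# "doc" is omitted: antiword is assumed to be a system tool, not a pip package.
EXTRA_DISTRIBUTIONS = {
    "pdf": ["pymupdf"],
    "docx": ["python-docx"],
    "md": ["markdown", "beautifulsoup4"],
    "rtf": ["pyrtf-ng"],
    "html": ["beautifulsoup4", "lxml"],
    "xml": ["lxml"],
}


def is_installed(distribution):
    """Return True if installed-package metadata exists for the distribution."""
    try:
        metadata.version(distribution)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_by_extra():
    """Map each extra to the listed distributions that are not installed."""
    return {
        extra: [d for d in dists if not is_installed(d)]
        for extra, dists in EXTRA_DISTRIBUTIONS.items()
    }


if __name__ == "__main__":
    for extra, missing in missing_by_extra().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{extra:5} {status}")
```

An extra reported as missing can then be installed with the matching `pip install textxtract[...]` command shown earlier.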
Python Version Requirements
- Python 3.9 or higher is required
- Tested on Python 3.9, 3.10, 3.11, and 3.12
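The version requirement can be checked before installing. The following is a minimal sketch; `python_is_supported` is a hypothetical helper name, and the only fact taken from this page is the 3.9 minimum.

```python
import sys

MIN_VERSION = (3, 9)  # minimum stated in the requirements above


def python_is_supported(version_info=None):
    """Return True if the given (or running) interpreter meets the minimum."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= MIN_VERSION


if __name__ == "__main__":
    found = sys.version.split()[0]
    if python_is_supported():
        print("Python version OK:", found)
    else:
        print("Python 3.9 or higher is required; found", found)
```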
Upgrading
To upgrade to the latest version:
pip install --upgrade textxtract
To upgrade with all extras:
pip install --upgrade textxtract[all]
Development Installation
For development or contributing:
# Clone the repository
git clone https://github.com/your-org/text-extractor.git
cd text-extractor
# Install in development mode with all dependencies
pip install -e .[all]
# Install development dependencies
pip install pytest pytest-asyncio
System Requirements
For .doc files (antiword)
On Ubuntu/Debian:
sudo apt-get install antiword
On macOS:
brew install antiword
On Windows: Download antiword from the official website and ensure it's in your PATH.
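Whichever platform you are on, you can confirm that antiword is reachable on your PATH from Python. This is a minimal sketch using only the standard library; `find_antiword` is a hypothetical helper name, and it assumes the executable is called `antiword`.

```python
import shutil


def find_antiword():
    """Return the full path to the antiword executable, or None if not on PATH."""
    return shutil.which("antiword")


if __name__ == "__main__":
    path = find_antiword()
    if path:
        print("antiword found at", path)
    else:
        print("antiword not found on PATH; .doc extraction will likely fail")
```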
Verify Installation
Test your installation:
from textxtract import SyncTextExtractor
extractor = SyncTextExtractor()
print("Installation successful!")
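If the import fails, a check that does not import the package can help show what is going on. The sketch below reports whether `textxtract` is importable and which version pip metadata records; `textxtract_status` is a hypothetical helper name, not part of the library.

```python
import importlib.util
from importlib import metadata


def textxtract_status():
    """Return (importable, version); version is None if no metadata is found."""
    importable = importlib.util.find_spec("textxtract") is not None
    try:
        version = metadata.version("textxtract")
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


if __name__ == "__main__":
    importable, version = textxtract_status()
    print("importable:", importable)
    print("installed version:", version or "not found")
```

If the package shows as not importable while a version is reported, pip may have installed it for a different Python interpreter than the one running the script.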
Troubleshooting
Common Issues
Import Error: Make sure you have the correct package name:
# Correct
from textxtract import SyncTextExtractor
# Incorrect
from text_extractor import SyncTextExtractor
Missing Dependencies: Install the required extras for your file types, for example:
pip install textxtract[pdf,docx]
Permission Errors: On some systems, you may need to install with user permissions:
pip install --user textxtract
Getting Help
If you encounter issues:
- Check the Usage Guide for examples
- Review the API Documentation
- Look at the Testing Guide for validation
- Open an issue on our GitHub repository