API Reference
Complete API reference for the textxtract
package.
Core Classes
SyncTextExtractor
The synchronous text extractor for blocking operations.
Constructor
Parameters:
- config
(optional): Configuration object for customizing extraction behavior
Methods
extract()
extract(
source: Union[Path, str, bytes],
filename: Optional[str] = None,
config: Optional[dict] = None
) -> str
Extract text synchronously from file path or bytes.
Parameters:
- source
: File path (Path/str) or file bytes
- filename
: Required if source is bytes, optional for file paths
- config
: Optional configuration overrides
Returns:
- str
: Extracted text
Raises:
- ValueError
: If filename is missing when source is bytes
- FileTypeNotSupportedError
: If file extension is not supported
- InvalidFileError
: If file is invalid or corrupted
- ExtractionError
: If extraction fails
Examples:
extractor = SyncTextExtractor()
# From file path
text = extractor.extract("document.pdf")
text = extractor.extract(Path("document.pdf"))
# From bytes (filename required)
with open("document.pdf", "rb") as f:
file_bytes = f.read()
text = extractor.extract(file_bytes, "document.pdf")
# With custom config
config = {"encoding": "utf-8", "max_file_size": 50*1024*1024}
text = extractor.extract("document.pdf", config=config)
Context Manager Support
AsyncTextExtractor
The asynchronous text extractor for non-blocking operations.
Constructor
Parameters:
- config
(optional): Configuration object for customizing extraction behavior
Methods
extract()
async extract(
source: Union[Path, str, bytes],
filename: Optional[str] = None,
config: Optional[dict] = None
) -> str
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
- source
: File path (Path/str) or file bytes
- filename
: Required if source is bytes, optional for file paths
- config
: Optional configuration overrides
Returns:
- str
: Extracted text
Raises:
- ValueError
: If filename is missing when source is bytes
- FileTypeNotSupportedError
: If file extension is not supported
- InvalidFileError
: If file is invalid or corrupted
- ExtractionError
: If extraction fails
Examples:
import asyncio
async def extract_text():
extractor = AsyncTextExtractor()
# From file path
text = await extractor.extract("document.pdf")
# From bytes
with open("document.pdf", "rb") as f:
file_bytes = f.read()
text = await extractor.extract(file_bytes, "document.pdf")
return text
text = asyncio.run(extract_text())
Context Manager Support
Configuration
ExtractorConfig
Configuration class for customizing extraction behavior.
Constructor
ExtractorConfig(
encoding: str = "utf-8",
max_file_size: int = 100 * 1024 * 1024, # 100MB
logging_level: str = "INFO"
)
Parameters:
- encoding
: Default text encoding
- max_file_size
: Maximum file size in bytes
- logging_level
: Logging verbosity level
Example:
config = ExtractorConfig(
encoding="utf-8",
max_file_size=50 * 1024 * 1024, # 50MB
logging_level="DEBUG"
)
extractor = SyncTextExtractor(config)
Exceptions
All exceptions are in the textxtract.core.exceptions
module.
ExtractionError
Base exception for all extraction-related errors.
FileTypeNotSupportedError
Raised when the file extension is not supported.
InvalidFileError
Raised when the file is invalid, corrupted, or not found.
ExtractionTimeoutError
Raised when extraction exceeds the allowed timeout.
Example Error Handling:
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import (
ExtractionError,
FileTypeNotSupportedError,
InvalidFileError
)
extractor = SyncTextExtractor()
try:
text = extractor.extract("document.pdf")
except FileTypeNotSupportedError as e:
print(f"Unsupported file type: {e}")
except InvalidFileError as e:
print(f"Invalid file: {e}")
except ExtractionError as e:
print(f"Extraction failed: {e}")
Utilities
FileInfo
Data class containing file information.
Attributes
filename
: str - Name of the filesize_bytes
: int - File size in bytessize_mb
: float - File size in megabytessize_kb
: float - File size in kilobytes (property)extension
: str - File extensionis_temp
: bool - Whether file is temporary
Supported File Types
Extension | Handler Class | Optional Dependency |
---|---|---|
.txt , .text |
TXTHandler |
None |
.pdf |
PDFHandler |
pymupdf |
.docx |
DOCXHandler |
python-docx |
.doc |
DOCHandler |
antiword |
.md |
MDHandler |
markdown , beautifulsoup4 |
.rtf |
RTFHandler |
pyrtf-ng |
.html , .htm |
HTMLHandler |
beautifulsoup4 , lxml |
.csv |
CSVHandler |
None |
.json |
JSONHandler |
None |
.xml |
XMLHandler |
lxml |
.zip |
ZIPHandler |
None |
Logging
The package uses Python's standard logging module with the following loggers:
textxtract.sync
- Synchronous extractor logstextxtract.aio
- Asynchronous extractor logstextxtract.utils
- Utility function logs
Configure logging:
import logging
# Set debug level for detailed logs
logging.basicConfig(level=logging.DEBUG)
# Or configure specific logger
logger = logging.getLogger("textxtract")
logger.setLevel(logging.INFO)
Type Hints
The package is fully typed. Import types for better IDE support:
```python from typing import Union, Optional from pathlib import Path from textxtract import SyncTextExtractor, AsyncTextExtractor from textxtract.core import ExtractorConfig from textxtract.core.exceptions import ExtractionError