Text Extractor Package
Text Extractor package - Professional text extraction from multiple file formats.
Modules:
Name | Description |
---|---|
aio | Asynchronous extraction logic package. |
core | Core components for textxtract package. |
exceptions | |
handlers | File type-specific handlers package. |
sync | Synchronous extraction logic package. |
Classes:
Name | Description |
---|---|
AsyncTextExtractor | Asynchronous text extractor with support for file paths and bytes. |
ExtractorConfig | Enhanced configuration options for text extraction with validation. |
SyncTextExtractor | Synchronous text extractor with support for file paths and bytes. |
Attributes
Classes
AsyncTextExtractor
Bases: TextExtractor
Asynchronous text extractor with support for file paths and bytes.
Provides asynchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Uses thread pool for I/O-bound operations.
Methods:
Name | Description |
---|---|
__aenter__ | Async context manager entry. |
__aexit__ | Async context manager exit with cleanup. |
__enter__ | Context manager entry. |
__exit__ | Context manager exit with cleanup. |
__init__ | |
extract | Extract text asynchronously from file path or bytes using thread pool. |
Attributes:
Name | Type | Description |
---|---|---|
config | | |
Source code in textxtract/aio/extractor.py
Attributes
Functions
__aenter__
async
__aexit__
async
__enter__
__exit__
__init__
Source code in textxtract/aio/extractor.py
extract
async
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path (Path/str) or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
config | Optional[dict] | Optional configuration overrides | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text. |
Raises:
Type | Description |
---|---|
ValueError | If filename is missing when source is bytes |
FileTypeNotSupportedError | If the file extension is not supported. |
ExtractionError | If extraction fails. |
InvalidFileError | If the file is invalid or corrupted. |
Source code in textxtract/aio/extractor.py
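The thread-pool approach described for `extract` can be illustrated with the standard library alone. This is a minimal sketch, not the package's implementation: `blocking_extract` and `extract_in_thread` are hypothetical names, and the stand-in simply decodes bytes or reads a text file.

```python
import asyncio
from pathlib import Path
from typing import List, Optional, Union


def blocking_extract(source: Union[Path, str, bytes], filename: Optional[str] = None) -> str:
    """Hypothetical stand-in for a synchronous, blocking extraction routine."""
    if isinstance(source, bytes):
        if not filename:
            raise ValueError("filename is required when source is bytes")
        return source.decode("utf-8")
    return Path(source).read_text(encoding="utf-8")


async def extract_in_thread(source: Union[Path, str, bytes], filename: Optional[str] = None) -> str:
    """Run the blocking call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_extract, source, filename)


async def main() -> List[str]:
    # Both extractions are scheduled together; gather returns results in call order.
    return await asyncio.gather(
        extract_in_thread(b"first document", "a.txt"),
        extract_in_thread(b"second document", "b.txt"),
    )


results = asyncio.run(main())
```

Offloading to worker threads keeps the coroutine interface while reusing blocking parsers, which is presumably why the class documents a thread pool for I/O-bound work.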
ExtractorConfig
Enhanced configuration options for text extraction with validation.
Methods:
Name | Description |
---|---|
__init__ | |
__repr__ | |
from_file | Load configuration from a file (JSON, YAML, or TOML). |
get_handler | Retrieve a handler for a given file extension. |
get_handler_config | Get configuration specific to a handler. |
register_handler | Register a custom file type handler. |
to_dict | Convert configuration to dictionary. |
Attributes:
Name | Type | Description |
---|---|---|
custom_handlers | | |
encoding | | |
extra_config | | |
logging_format | | |
logging_level | | |
max_file_size | | |
max_memory_usage | | |
timeout | | |
Source code in textxtract/core/config.py
Attributes
logging_format
instance-attribute
Functions
__init__
__init__(encoding='utf-8', logging_level='INFO', logging_format=None, timeout=None, max_file_size=None, max_memory_usage=None, custom_handlers=None, **kwargs)
Source code in textxtract/core/config.py
__repr__
from_file
classmethod
Load configuration from a file (JSON, YAML, or TOML).
Source code in textxtract/core/config.py
get_handler
get_handler_config
Get configuration specific to a handler.
Source code in textxtract/core/config.py
register_handler
Register a custom file type handler.
to_dict
Convert configuration to dictionary.
Source code in textxtract/core/config.py
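Loading settings from a file, as `from_file` is described as doing, usually comes down to choosing a parser by file suffix. The sketch below is an assumption about that general pattern, not the package's code: `load_config_dict` is a hypothetical helper, and YAML is left out because it needs a third-party parser.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_config_dict(path: str) -> Dict[str, Any]:
    """Read a JSON or TOML file into a plain dict, chosen by file suffix."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix == ".json":
        return json.loads(file_path.read_text(encoding="utf-8"))
    if suffix == ".toml":
        import tomllib  # standard library from Python 3.11 onward
        return tomllib.loads(file_path.read_text(encoding="utf-8"))
    # YAML would need a third-party parser, so this sketch does not handle it.
    raise ValueError(f"Unsupported config format: {suffix}")


# Write a small JSON config to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "extractor.json"
    config_path.write_text(json.dumps({"encoding": "utf-8", "timeout": 30}), encoding="utf-8")
    loaded = load_config_dict(str(config_path))
```

The keys `encoding` and `timeout` match keyword arguments in the `__init__` signature shown above, so a dict like this could be unpacked into the constructor.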
SyncTextExtractor
Bases: TextExtractor
Synchronous text extractor with support for file paths and bytes.
Provides synchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Supports context manager protocol for proper cleanup.
Methods:
Name | Description |
---|---|
__enter__ | Context manager entry. |
__exit__ | Context manager exit. |
__init__ | |
extract | Extract text synchronously from file path or bytes. |
Attributes:
Name | Type | Description |
---|---|---|
config | | |
Source code in textxtract/sync/extractor.py
Attributes
Functions
__enter__
__exit__
__init__
extract
Extract text synchronously from file path or bytes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path (Path/str) or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
config | Optional[dict] | Optional configuration overrides | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text. |
Raises:
Type | Description |
---|---|
ValueError | If filename is missing when source is bytes |
FileTypeNotSupportedError | If the file extension is not supported. |
ExtractionError | If extraction fails. |
InvalidFileError | If the file is invalid or corrupted. |
Source code in textxtract/sync/extractor.py
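The rule that `filename` is required for bytes but optional for paths follows from how a handler is chosen: raw bytes carry no extension. The following sketch shows that rule in isolation. `resolve_suffix` is a hypothetical helper, and letting an explicit `filename` override the path is an assumption of this sketch.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_suffix(source: Union[Path, str, bytes], filename: Optional[str] = None) -> str:
    """Return the lowercase extension used to choose a handler (hypothetical helper)."""
    if isinstance(source, bytes):
        # Raw bytes carry no name, so the caller has to supply one.
        if not filename:
            raise ValueError("filename is required when source is bytes")
        return Path(filename).suffix.lower()
    # For paths, an explicit filename (if given) takes precedence in this sketch.
    return Path(filename or source).suffix.lower()


from_path = resolve_suffix("reports/summary.PDF")
from_bytes = resolve_suffix(b"%PDF-1.7", filename="upload.pdf")
try:
    resolve_suffix(b"no name given")
    missing_name_error = None
except ValueError as exc:
    missing_name_error = str(exc)
```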
Modules
aio
Asynchronous extraction logic package.
Modules:
Name | Description |
---|---|
extractor | Asynchronous text extraction logic with support for file paths and bytes. |
Classes:
Name | Description |
---|---|
AsyncTextExtractor | Asynchronous text extractor with support for file paths and bytes. |
Attributes
Classes
AsyncTextExtractor
Bases: TextExtractor
Asynchronous text extractor with support for file paths and bytes.
Provides asynchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Uses thread pool for I/O-bound operations.
Methods:
Name | Description |
---|---|
__aenter__ | Async context manager entry. |
__aexit__ | Async context manager exit with cleanup. |
__enter__ | Context manager entry. |
__exit__ | Context manager exit with cleanup. |
__init__ | |
extract | Extract text asynchronously from file path or bytes using thread pool. |
Attributes:
Name | Type | Description |
---|---|---|
config | | |
Source code in textxtract/aio/extractor.py
Attributes
Functions
__aenter__
async
__aexit__
async
__enter__
__exit__
__init__
Source code in textxtract/aio/extractor.py
extract
async
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path (Path/str) or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
config | Optional[dict] | Optional configuration overrides | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text. |
Raises:
Type | Description |
---|---|
ValueError | If filename is missing when source is bytes |
FileTypeNotSupportedError | If the file extension is not supported. |
ExtractionError | If extraction fails. |
InvalidFileError | If the file is invalid or corrupted. |
Source code in textxtract/aio/extractor.py
Modules
extractor
Asynchronous text extraction logic with support for file paths and bytes.
Classes:
Name | Description |
---|---|
AsyncTextExtractor | Asynchronous text extractor with support for file paths and bytes. |
Attributes:
Name | Type | Description |
---|---|---|
logger | | |
Attributes
Classes
AsyncTextExtractor
Bases: TextExtractor
Asynchronous text extractor with support for file paths and bytes.
Provides asynchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Uses thread pool for I/O-bound operations.
Methods:
Name | Description |
---|---|
__aenter__ | Async context manager entry. |
__aexit__ | Async context manager exit with cleanup. |
__enter__ | Context manager entry. |
__exit__ | Context manager exit with cleanup. |
__init__ | |
extract | Extract text asynchronously from file path or bytes using thread pool. |
Attributes:
Name | Type | Description |
---|---|---|
config | | |
Source code in textxtract/aio/extractor.py
__aenter__
async
__aexit__
async
__enter__
__exit__
__init__
Source code in textxtract/aio/extractor.py
extract
async
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path (Path/str) or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
config | Optional[dict] | Optional configuration overrides | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text. |
Raises:
Type | Description |
---|---|
ValueError | If filename is missing when source is bytes |
FileTypeNotSupportedError | If the file extension is not supported. |
ExtractionError | If extraction fails. |
InvalidFileError | If the file is invalid or corrupted. |
Source code in textxtract/aio/extractor.py
Functions
core
Core components for textxtract package.
Modules:
Name | Description |
---|---|
base | Abstract base classes for text extraction. |
config | Configuration and customization for textxtract package. |
exceptions | Custom exceptions for textxtract package. |
logging_config | Logging configuration for textxtract package. |
registry | Handler registry for centralized handler management. |
utils | Utility functions for textxtract package. |
Modules
base
Abstract base classes for text extraction.
Classes:
Name | Description |
---|---|
FileTypeHandler | Abstract base class for file type-specific handlers. |
TextExtractor | Abstract base class for text extractors. |
Classes
FileTypeHandler
Bases: ABC
Abstract base class for file type-specific handlers.
Methods:
Name | Description |
---|---|
extract | Extract text synchronously from a file. |
extract_async | Extract text asynchronously from a file. |
Source code in textxtract/core/base.py
extract
abstractmethod
extract_async
abstractmethod
async
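A concrete handler has to supply both abstract methods. The sketch below uses a stand-in base class so that it runs on its own. The parameter list (a single `file_path`) is an assumption, since the signatures are not shown here, and `FileTypeHandlerSketch` and `UpperCaseHandler` are hypothetical names.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileTypeHandlerSketch(ABC):
    """Stand-in for the abstract handler base; parameter lists are assumed."""

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...

    @abstractmethod
    async def extract_async(self, file_path: Path) -> str: ...


class UpperCaseHandler(FileTypeHandlerSketch):
    """Toy handler that returns the file content in upper case."""

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8").upper()

    async def extract_async(self, file_path: Path) -> str:
        # Delegate to the synchronous method in a worker thread.
        return await asyncio.to_thread(self.extract, file_path)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello", encoding="utf-8")
    handler = UpperCaseHandler()
    sync_text = handler.extract(sample)
    async_text = asyncio.run(handler.extract_async(sample))
```

Writing the asynchronous method as a thin wrapper around the synchronous one keeps the parsing logic in a single place.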
TextExtractor
Bases: ABC
Abstract base class for text extractors.
Methods:
Name | Description |
---|---|
extract | Extract text synchronously from file path or bytes. |
Source code in textxtract/core/base.py
extract
abstractmethod
Extract text synchronously from file path or bytes.
config
Configuration and customization for textxtract package.
Classes:
Name | Description |
---|---|
ExtractorConfig | Enhanced configuration options for text extraction with validation. |
Classes
ExtractorConfig
Enhanced configuration options for text extraction with validation.
Methods:
Name | Description |
---|---|
__init__ | |
__repr__ | |
from_file | Load configuration from a file (JSON, YAML, or TOML). |
get_handler | Retrieve a handler for a given file extension. |
get_handler_config | Get configuration specific to a handler. |
register_handler | Register a custom file type handler. |
to_dict | Convert configuration to dictionary. |
Attributes:
Name | Type | Description |
---|---|---|
custom_handlers | | |
encoding | | |
extra_config | | |
logging_format | | |
logging_level | | |
max_file_size | | |
max_memory_usage | | |
timeout | | |
Source code in textxtract/core/config.py
logging_format
instance-attribute
__init__
__init__(encoding='utf-8', logging_level='INFO', logging_format=None, timeout=None, max_file_size=None, max_memory_usage=None, custom_handlers=None, **kwargs)
Source code in textxtract/core/config.py
__repr__
from_file
classmethod
Load configuration from a file (JSON, YAML, or TOML).
Source code in textxtract/core/config.py
get_handler
get_handler_config
Get configuration specific to a handler.
Source code in textxtract/core/config.py
register_handler
Register a custom file type handler.
to_dict
Convert configuration to dictionary.
Source code in textxtract/core/config.py
exceptions
Custom exceptions for textxtract package.
Classes:
Name | Description |
---|---|
ExtractionError | Raised when a general extraction error occurs. |
ExtractionTimeoutError | Raised when extraction exceeds the allowed timeout. |
FileTypeNotSupportedError | Raised when the file type is not supported. |
InvalidFileError | Raised when the file is invalid or unsupported. |
Classes
ExtractionError
ExtractionTimeoutError
Bases: ExtractionError
Raised when extraction exceeds the allowed timeout.
FileTypeNotSupportedError
Bases: ExtractionError
Raised when the file type is not supported.
InvalidFileError
Bases: ExtractionError
Raised when the file is invalid or unsupported.
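Because the three specific exceptions all derive from `ExtractionError`, the order of `except` clauses matters: specific classes first, the base class last. The sketch below defines stand-in classes with the same names so that it runs on its own. Deriving `ExtractionError` from `Exception` is an assumption, and `classify` is a hypothetical helper.

```python
class ExtractionError(Exception):
    """Stand-in base class; its own base is assumed to be Exception."""


class ExtractionTimeoutError(ExtractionError):
    pass


class FileTypeNotSupportedError(ExtractionError):
    pass


class InvalidFileError(ExtractionError):
    pass


def classify(exc: Exception) -> str:
    """Map an exception to a short label, most specific classes first."""
    try:
        raise exc
    except FileTypeNotSupportedError:
        return "unsupported"
    except InvalidFileError:
        return "invalid"
    except ExtractionTimeoutError:
        return "timeout"
    except ExtractionError:
        # Catch-all for any other extraction failure.
        return "failed"


labels = [
    classify(error)
    for error in (InvalidFileError("bad"), ExtractionTimeoutError("slow"), ExtractionError("other"))
]
```

A caller that does not need to distinguish the cases can catch `ExtractionError` alone and still cover all three subclasses.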
logging_config
Logging configuration for textxtract package.
Functions:
Name | Description |
---|---|
setup_logging | Configure logging for the package. |
Functions
setup_logging
Configure logging for the package.
registry
Handler registry for centralized handler management.
Classes:
Name | Description |
---|---|
HandlerRegistry | Central registry for file type handlers with caching and lazy loading. |
Attributes:
Name | Type | Description |
---|---|---|
logger | | |
registry | | |
Attributes
Classes
HandlerRegistry
Central registry for file type handlers with caching and lazy loading.
Methods:
Name | Description |
---|---|
__init__ | |
__new__ | |
get_handler | Get handler instance for file extension with caching. |
get_supported_extensions | Get list of all supported file extensions. |
is_supported | Check if a file extension is supported. |
register_handler | Register a custom handler for a file extension. |
Source code in textxtract/core/registry.py
__init__
__new__
get_handler
cached
Get handler instance for file extension with caching.
Source code in textxtract/core/registry.py
get_supported_extensions
is_supported
register_handler
Register a custom handler for a file extension.
Source code in textxtract/core/registry.py
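A registry with a `__new__` override, caching, and lazy loading is commonly built as a singleton that stores handler factories and instantiates them on first lookup. The sketch below shows that general pattern under those assumptions. It is not the package's implementation, and `RegistrySketch` and `TxtHandlerSketch` are hypothetical names.

```python
from typing import Callable, List


class RegistrySketch:
    """Singleton registry that builds handler instances lazily and caches them."""

    _instance = None

    def __new__(cls):
        # Create the shared instance once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._factories = {}
            cls._instance._cache = {}
        return cls._instance

    def register_handler(self, extension: str, factory: Callable[[], object]) -> None:
        ext = extension.lower()
        self._factories[ext] = factory
        self._cache.pop(ext, None)  # drop any stale cached instance

    def is_supported(self, extension: str) -> bool:
        return extension.lower() in self._factories

    def get_supported_extensions(self) -> List[str]:
        return sorted(self._factories)

    def get_handler(self, extension: str) -> object:
        ext = extension.lower()
        if ext not in self._factories:
            raise KeyError(f"No handler registered for {ext}")
        if ext not in self._cache:
            self._cache[ext] = self._factories[ext]()  # created on first use
        return self._cache[ext]


class TxtHandlerSketch:
    """Placeholder handler class used only for this demonstration."""


registry = RegistrySketch()
registry.register_handler(".TXT", TxtHandlerSketch)
same_registry = RegistrySketch() is registry
first = registry.get_handler(".txt")
second = registry.get_handler(".txt")
```

Storing factories rather than instances means a handler with heavy dependencies is only constructed when a file of that type is actually seen.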
utils
Utility functions for textxtract package.
Classes:
Name | Description |
---|---|
FileInfo | File information data class. |
Functions:
Name | Description |
---|---|
create_temp_file | Create a temporary file from bytes and return its path with security validation. |
get_file_info | Get file information for logging and debugging. |
safe_unlink | Safely delete a file if it exists, optionally logging errors. |
validate_file_extension | Check if the file has an allowed extension. |
validate_file_size | Validate file size doesn't exceed limits. |
validate_filename | Validate filename for security issues. |
Attributes:
Name | Type | Description |
---|---|---|
DEFAULT_MAX_FILE_SIZE | | |
DEFAULT_MAX_TEMP_FILES | | |
Attributes
Classes
FileInfo
dataclass
Functions
create_temp_file
Create a temporary file from bytes and return its path with security validation.
Source code in textxtract/core/utils.py
get_file_info
Get file information for logging and debugging.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
Returns:
Name | Type | Description |
---|---|---|
FileInfo | FileInfo | Data class with file information |
Source code in textxtract/core/utils.py
safe_unlink
Safely delete a file if it exists, optionally logging errors.
Source code in textxtract/core/utils.py
validate_file_extension
validate_file_size
Validate file size doesn't exceed limits.
Source code in textxtract/core/utils.py
validate_filename
Validate filename for security issues.
Source code in textxtract/core/utils.py
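The three helpers that deal with temporary files fit together as validate, write, then clean up. The sketch below illustrates that flow with the standard library. The validation rules and return values are assumptions, and the `_sketch` functions are hypothetical stand-ins rather than the package's own functions.

```python
import os
import tempfile
from pathlib import Path


def validate_filename_sketch(filename: str) -> None:
    """Reject names that could escape the temp directory (assumed rules)."""
    if not filename or "\x00" in filename:
        raise ValueError("filename is empty or contains a null byte")
    if Path(filename).is_absolute() or ".." in Path(filename).parts:
        raise ValueError("filename must be relative and free of '..' segments")


def create_temp_file_sketch(data: bytes, filename: str) -> Path:
    """Write bytes to a temp file that keeps the original extension."""
    validate_filename_sketch(filename)
    fd, tmp_path = tempfile.mkstemp(suffix=Path(filename).suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return Path(tmp_path)


def safe_unlink_sketch(path: Path) -> bool:
    """Delete the file if present; report whether anything was removed."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False


temp_path = create_temp_file_sketch(b"hello", "note.txt")
kept_suffix = temp_path.suffix
content = temp_path.read_bytes()
removed_first = safe_unlink_sketch(temp_path)
removed_again = safe_unlink_sketch(temp_path)
try:
    validate_filename_sketch("../etc/passwd")
    traversal_rejected = False
except ValueError:
    traversal_rejected = True
```

Keeping the original extension on the temporary file lets extension-based handler selection work the same way for bytes as for paths.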
exceptions
Classes:
Name | Description |
---|---|
ExtractionError | Raised when a general extraction error occurs. |
ExtractionTimeoutError | Raised when extraction exceeds the allowed timeout. |
FileTypeNotSupportedError | Raised when the file type is not supported. |
InvalidFileError | Raised when the file is invalid or unsupported. |
Attributes
__all__
module-attribute
__all__ = ['ExtractionError', 'InvalidFileError', 'FileTypeNotSupportedError', 'ExtractionTimeoutError']
Classes
ExtractionError
ExtractionTimeoutError
Bases: ExtractionError
Raised when extraction exceeds the allowed timeout.
FileTypeNotSupportedError
Bases: ExtractionError
Raised when the file type is not supported.
InvalidFileError
Bases: ExtractionError
Raised when the file is invalid or unsupported.
handlers
File type-specific handlers package.
Modules:
Name | Description |
---|---|
csv | CSV file handler for text extraction. |
doc | DOC file handler for text extraction. |
docx | DOCX file handler for text extraction. |
html | HTML file handler for text extraction. |
json | JSON file handler for text extraction. |
md | Markdown (.md) file handler for text extraction. |
pdf | PDF file handler for text extraction. |
rtf | RTF file handler for text extraction. |
txt | TXT file handler for text extraction. |
xml | XML file handler for text extraction. |
zip | ZIP file handler for text extraction. |
Modules
csv
CSV file handler for text extraction.
Classes:
Name | Description |
---|---|
CSVHandler | Handler for extracting text from CSV files. |
Classes
CSVHandler
Bases: FileTypeHandler
Handler for extracting text from CSV files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/csv.py
extract
Source code in textxtract/handlers/csv.py
doc
DOC file handler for text extraction.
Classes:
Name | Description |
---|---|
DOCHandler | Handler for extracting text from DOC files with fallback options. |
Classes
DOCHandler
Bases: FileTypeHandler
Handler for extracting text from DOC files with fallback options.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/doc.py
extract
Source code in textxtract/handlers/doc.py
docx
DOCX file handler for text extraction.
Classes:
Name | Description |
---|---|
DOCXHandler | Handler for extracting text from DOCX files. |
Classes
DOCXHandler
Bases: FileTypeHandler
Handler for extracting text from DOCX files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/docx.py
extract
Source code in textxtract/handlers/docx.py
html
HTML file handler for text extraction.
Classes:
Name | Description |
---|---|
HTMLHandler | Handler for extracting text from HTML files. |
Classes
HTMLHandler
Bases: FileTypeHandler
Handler for extracting text from HTML files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/html.py
extract
Source code in textxtract/handlers/html.py
json
JSON file handler for text extraction.
Classes:
Name | Description |
---|---|
JSONHandler | Handler for extracting text from JSON files. |
Classes
JSONHandler
Bases: FileTypeHandler
Handler for extracting text from JSON files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/json.py
extract
Source code in textxtract/handlers/json.py
md
Markdown (.md) file handler for text extraction.
Classes:
Name | Description |
---|---|
MDHandler | Handler for extracting text from Markdown files. |
Classes
MDHandler
Bases: FileTypeHandler
Handler for extracting text from Markdown files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/md.py
extract
Source code in textxtract/handlers/md.py
pdf
PDF file handler for text extraction.
Classes:
Name | Description |
---|---|
PDFHandler | Handler for extracting text from PDF files with improved error handling. |
Classes
PDFHandler
Bases: FileTypeHandler
Handler for extracting text from PDF files with improved error handling.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/pdf.py
extract
Source code in textxtract/handlers/pdf.py
rtf
RTF file handler for text extraction.
Classes:
Name | Description |
---|---|
RTFHandler | Handler for extracting text from RTF files. |
Classes
RTFHandler
Bases: FileTypeHandler
Handler for extracting text from RTF files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/rtf.py
extract
Source code in textxtract/handlers/rtf.py
txt
TXT file handler for text extraction.
Classes:
Name | Description |
---|---|
TXTHandler | Handler for extracting text from TXT files. |
Classes
TXTHandler
Bases: FileTypeHandler
Handler for extracting text from TXT files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/txt.py
extract
Source code in textxtract/handlers/txt.py
xml
XML file handler for text extraction.
Classes:
Name | Description |
---|---|
XMLHandler | Handler for extracting text from XML files. |
Classes
XMLHandler
Bases: FileTypeHandler
Handler for extracting text from XML files.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Source code in textxtract/handlers/xml.py
extract
Source code in textxtract/handlers/xml.py
zip
ZIP file handler for text extraction.
Classes:
Name | Description |
---|---|
ZIPHandler | Handler for extracting text from ZIP archives with security checks. |
Attributes:
Name | Type | Description |
---|---|---|
logger | | |
Attributes
Classes
ZIPHandler
Bases: FileTypeHandler
Handler for extracting text from ZIP archives with security checks.
Methods:
Name | Description |
---|---|
extract | |
extract_async | |
Attributes:
Name | Type | Description |
---|---|---|
MAX_EXTRACT_SIZE | | |
MAX_FILES | | |
Source code in textxtract/handlers/zip.py
extract
Source code in textxtract/handlers/zip.py
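Security checks on an archive typically cover the number of members, their total declared size, and member paths that try to leave the extraction root. The sketch below shows one way such checks could look with the standard `zipfile` module. The limit values are hypothetical, `read_zip_text` is an illustrative helper, and none of it is taken from the handler's source.

```python
import io
import zipfile

MAX_FILES = 100                       # hypothetical limit on member count
MAX_EXTRACT_SIZE = 10 * 1024 * 1024   # hypothetical limit, in bytes


def read_zip_text(data: bytes) -> str:
    """Concatenate member contents after checking count, size, and paths."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = [m for m in archive.infolist() if not m.is_dir()]
        if len(members) > MAX_FILES:
            raise ValueError("archive contains too many files")
        # file_size is the uncompressed size declared in the archive headers.
        if sum(m.file_size for m in members) > MAX_EXTRACT_SIZE:
            raise ValueError("archive expands beyond the size limit")
        parts = []
        for member in members:
            name = member.filename
            if name.startswith("/") or ".." in name.split("/"):
                raise ValueError(f"unsafe member path: {name}")
            parts.append(archive.read(member).decode("utf-8", errors="replace"))
        return "\n".join(parts)


# Build a small archive in memory and read it back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "alpha")
    archive.writestr("docs/b.txt", "beta")
combined = read_zip_text(buffer.getvalue())
```

Reading members from memory rather than extracting them to disk avoids writing attacker-controlled paths at all. A real handler would presumably dispatch each member to the handler for its extension instead of decoding it as UTF-8.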
sync
Synchronous extraction logic package.
Modules:
Name | Description |
---|---|
extractor | Synchronous text extraction logic with support for file paths and bytes. |
Modules
extractor
Synchronous text extraction logic with support for file paths and bytes.
Classes:
Name | Description |
---|---|
SyncTextExtractor | Synchronous text extractor with support for file paths and bytes. |
Attributes:
Name | Type | Description |
---|---|---|
logger | | |
Attributes
Classes
SyncTextExtractor
Bases: TextExtractor
Synchronous text extractor with support for file paths and bytes.
Provides synchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Supports context manager protocol for proper cleanup.
Methods:
Name | Description |
---|---|
__enter__ | Context manager entry. |
__exit__ | Context manager exit. |
__init__ | |
extract | Extract text synchronously from file path or bytes. |
Attributes:
Name | Type | Description |
---|---|---|
config | | |
Source code in textxtract/sync/extractor.py
__enter__
__exit__
__init__
extract
Extract text synchronously from file path or bytes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[Path, str, bytes] | File path (Path/str) or file bytes | required |
filename | Optional[str] | Required if source is bytes, optional for file paths | None |
config | Optional[dict] | Optional configuration overrides | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text. |
Raises:
Type | Description |
---|---|
ValueError | If filename is missing when source is bytes |
FileTypeNotSupportedError | If the file extension is not supported. |
ExtractionError | If extraction fails. |
InvalidFileError | If the file is invalid or corrupted. |