Testing

Comprehensive testing guide for the textxtract package.

🧪 Running Tests

The project uses pytest for all tests. Synchronous and asynchronous extractors are both covered; the asynchronous tests run through the pytest-asyncio plugin.

Prerequisites

Install the package with all optional dependencies for complete testing:

pip install "textxtract[all]"
pip install pytest pytest-asyncio pytest-cov

Basic Test Execution

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_sync.py
pytest tests/test_async.py

# Run tests with coverage
pytest --cov=textxtract

Test Categories

# Run only synchronous tests
pytest tests/test_sync.py

# Run only asynchronous tests  
pytest tests/test_async.py

# Run exception handling tests
pytest tests/test_exceptions.py

# Run edge case tests
pytest tests/test_edge_cases.py

📂 Test Structure

tests/
├── __init__.py
├── test_sync.py          # Synchronous extractor tests
├── test_async.py         # Asynchronous extractor tests
├── test_exceptions.py    # Error handling tests
├── test_edge_cases.py    # Edge cases and validation
└── files/                # Sample test files
    ├── text_file.txt
    ├── text_file.pdf
    ├── text_file.docx
    ├── markdown.md
    ├── text.csv
    ├── text.json
    ├── text.html
    ├── text.xml
    └── ...

🔧 Test Coverage

File Type Coverage

The test suite covers all supported file types:

| File Type   | Test Files                    | Sync Tests | Async Tests |
|-------------|-------------------------------|------------|-------------|
| Plain Text  | text_file.txt, text_file.text | ✅         | ✅          |
| Markdown    | markdown.md                   | ✅         | ✅          |
| PDF         | text_file.pdf                 | ✅         | ✅          |
| Word        | text_file.docx                | ✅         | ✅          |
| Legacy Word | text_file.doc                 | ✅         | ✅          |
| Rich Text   | text_file.rtf                 | ✅         | ✅          |
| HTML        | text.html                     | ✅         | ✅          |
| CSV         | text.csv                      | ✅         | ✅          |
| JSON        | text.json                     | ✅         | ✅          |
| XML         | text.xml                      | ✅         | ✅          |
| ZIP         | text_zip.zip                  | ✅         | ✅          |

Input Method Coverage

  • ✅ File path extraction (extractor.extract("/path/to/file.pdf"))
  • ✅ Bytes extraction (extractor.extract(file_bytes, "file.pdf"))
  • ✅ Both sync and async methods
  • ✅ Error handling for unsupported types
  • ✅ Context manager usage

Error Handling Coverage

  • ✅ FileTypeNotSupportedError for unsupported extensions
  • ✅ InvalidFileError for corrupted/missing files
  • ✅ ExtractionError for extraction failures
  • ✅ ValueError for missing filename with bytes input

🎯 Writing Custom Tests

Testing File Extraction

from textxtract import SyncTextExtractor

def test_custom_file_extraction():
    extractor = SyncTextExtractor()

    # Test with file path
    text = extractor.extract("path/to/test/file.txt")
    assert isinstance(text, str)
    assert len(text) > 0

    # Test with bytes
    with open("path/to/test/file.txt", "rb") as f:
        file_bytes = f.read()
    text = extractor.extract(file_bytes, "file.txt")
    assert isinstance(text, str)
    assert len(text) > 0
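
The same two assertions can be repeated for every sample file by collecting the files first and feeding them to a parametrized test. The helper below is a minimal sketch using only the standard library; the name sample_files and the extension set are illustrative assumptions, not part of textxtract.

```python
from pathlib import Path


def sample_files(directory, extensions):
    """Return the files in `directory` whose suffix is in `extensions`, sorted."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Hypothetical usage inside a test module (needs pytest and textxtract):
#
#   FILES = sample_files(Path(__file__).parent / "files", {".txt", ".pdf"})
#
#   @pytest.mark.parametrize("path", FILES, ids=lambda p: p.name)
#   def test_extracts_text(path):
#       assert SyncTextExtractor().extract(str(path))
```

Passing ids=lambda p: p.name makes each file show up under its own name in the pytest report, so a failure points at the specific sample.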

Testing Async Extraction

import pytest
from textxtract import AsyncTextExtractor

@pytest.mark.asyncio
async def test_async_extraction():
    extractor = AsyncTextExtractor()

    text = await extractor.extract("path/to/test/file.txt")
    assert isinstance(text, str)
    assert len(text) > 0

Testing Error Conditions

import pytest
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import (
    FileTypeNotSupportedError,
    InvalidFileError
)

def test_error_handling():
    extractor = SyncTextExtractor()

    # Test unsupported file type
    with pytest.raises(FileTypeNotSupportedError):
        extractor.extract(b"dummy", "file.unsupported")

    # Test missing file
    with pytest.raises(InvalidFileError):
        extractor.extract("nonexistent_file.txt")

    # Test missing filename with bytes
    with pytest.raises(ValueError):
        extractor.extract(b"dummy bytes")

🚀 Performance Testing

Memory Usage Testing

import os

import psutil  # Third-party dependency: pip install psutil
from textxtract import SyncTextExtractor

def test_memory_usage():
    process = psutil.Process(os.getpid())
    initial_memory = process.memory_info().rss

    extractor = SyncTextExtractor()

    # Process large file
    text = extractor.extract("large_file.pdf")

    final_memory = process.memory_info().rss
    memory_increase = final_memory - initial_memory

    # Assert reasonable memory usage
    assert memory_increase < 100 * 1024 * 1024  # Less than 100MB
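
Where installing psutil is not an option, the standard-library tracemalloc module gives a rough alternative. The sketch below is an assumption-laden substitute rather than an equivalent: tracemalloc only sees allocations made through Python's allocator, so memory used inside native libraries (as PDF or DOCX parsers may do) is not counted. The helper name peak_python_memory is hypothetical.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Call func and return (result, peak bytes allocated by Python meanwhile)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

In a test this might be called as text, peak = peak_python_memory(extractor.extract, "large_file.pdf"), followed by an assertion on peak.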

Concurrent Processing Testing

import asyncio
import pytest
from textxtract import AsyncTextExtractor

@pytest.mark.asyncio
async def test_concurrent_extraction():
    extractor = AsyncTextExtractor()

    files = ["file1.txt", "file2.pdf", "file3.docx"]

    # Process files concurrently
    tasks = [extractor.extract(file) for file in files]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Verify all succeeded (gather returns failures as exception objects)
    for result in results:
        assert isinstance(result, str), f"extraction failed: {result!r}"
        assert len(result) > 0
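
Because return_exceptions=True hands failures back as ordinary values, it can help to keep each result paired with its file name so a failing assertion says which file broke. The helper below is a sketch; extract_all is a hypothetical name, and extract stands for any async callable such as the extract method of an AsyncTextExtractor instance.

```python
import asyncio


async def extract_all(extract, files):
    """Run `extract` on every file concurrently; return (texts, errors) by name."""
    results = await asyncio.gather(
        *(extract(name) for name in files), return_exceptions=True
    )
    texts, errors = {}, {}
    for name, result in zip(files, results):
        # Exceptions arrive as values, so sort them out explicitly.
        if isinstance(result, BaseException):
            errors[name] = result
        else:
            texts[name] = result
    return texts, errors
```

A test can then assert that errors is empty and report the offending file names if it is not.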

πŸ” Test Configuration

pytest.ini Configuration

[pytest]
minversion = 6.0
addopts = -ra -q --tb=short
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
asyncio_mode = auto
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    integration: marks tests as integration tests
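
The slow and integration markers registered in this configuration are applied to tests with decorators. The following is a minimal sketch: the test names and large_file.pdf are hypothetical, and pytest.importorskip is used only so that the tests skip cleanly when textxtract is not installed.

```python
import pytest


@pytest.mark.slow
def test_large_pdf_extraction():
    # Skip instead of failing when the package is unavailable.
    textxtract = pytest.importorskip("textxtract")
    extractor = textxtract.SyncTextExtractor()
    text = extractor.extract("large_file.pdf")  # hypothetical sample file
    assert text


@pytest.mark.integration
def test_docx_from_bytes():
    textxtract = pytest.importorskip("textxtract")
    with open("tests/files/text_file.docx", "rb") as f:
        data = f.read()
    assert textxtract.SyncTextExtractor().extract(data, "text_file.docx")
```

With these in place, pytest -m "not slow" deselects the first test and pytest -m integration selects only the second.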

Running Specific Test Categories

# Run only fast tests
pytest -m "not slow"

# Run only integration tests
pytest -m integration

# Run tests with specific keyword
pytest -k "test_sync"

# Run tests and stop on first failure
pytest -x

πŸ› Debugging Tests

Verbose Output

# Maximum verbosity
pytest -vvv

# Show local variables in tracebacks
pytest -l --tb=long

# Show stdout/stderr
pytest -s

Debugging with pdb

import pytest
from textxtract import SyncTextExtractor

def test_with_debugger():
    extractor = SyncTextExtractor()

    # Set breakpoint
    pytest.set_trace()

    text = extractor.extract("test_file.txt")
    assert text

📊 Test Reports

Coverage Reports

# Generate coverage report
pytest --cov=textxtract --cov-report=html

# View coverage in terminal
pytest --cov=textxtract --cov-report=term-missing

# Generate XML coverage for CI
pytest --cov=textxtract --cov-report=xml

JUnit XML Reports

# Generate JUnit XML for CI systems
pytest --junitxml=test-results.xml
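
The generated report can also be inspected from a script, for example to print a one-line summary in CI. The sketch below uses only the standard library and assumes the usual JUnit layout in which each testsuite element carries tests, failures, errors and skipped attributes; summarize_junit is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Sum the test counters over every <testsuite> element in a JUnit report."""
    root = ET.fromstring(xml_text)
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    # iter() also yields the root, so a bare <testsuite> root works too.
    for suite in root.iter("testsuite"):
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals
```

For the file produced above, the text would come from reading test-results.xml, for instance with pathlib.Path("test-results.xml").read_text().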

🔄 Continuous Integration

GitHub Actions Example

name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Quote the versions: unquoted 3.10 is parsed by YAML as the number 3.1
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        pip install -e ".[all]"
        pip install pytest pytest-asyncio pytest-cov

    - name: Run tests
      run: pytest --cov=textxtract

✅ Validation Checklist

Before submitting changes, ensure:

  • [ ] All existing tests pass
  • [ ] New features have corresponding tests
  • [ ] Error conditions are tested
  • [ ] Both sync and async methods are tested
  • [ ] Documentation examples are tested
  • [ ] Performance regressions are checked
  • [ ] Memory usage is checked for leaks

🆘 Troubleshooting Tests

Common Issues

Missing dependencies:

pip install "textxtract[all]" pytest pytest-asyncio pytest-cov

Import errors:

# Import from the textxtract package
from textxtract import SyncTextExtractor  # Correct
# from text_extractor import SyncTextExtractor  # Wrong: incorrect package name

Async test issues:

# Ensure pytest-asyncio is installed and configured.
# Decorate each async test unless asyncio_mode = auto is set:
@pytest.mark.asyncio
async def test_async_extraction(): ...

File not found errors:

# Resolve paths relative to the test module, not the working directory
from pathlib import Path

TEST_FILES_DIR = Path(__file__).parent / "files"
file_path = TEST_FILES_DIR / "text_file.txt"

For more testing help, see the API Reference or Usage Guide.