An automated web scraping tool designed to extract case information from the Dallas County Courts Portal. This tool enables legal professionals, researchers, and investigators to efficiently search and collect case data for multiple attorneys simultaneously with advanced filtering capabilities.
📋 Table of Contents
- Overview
- Features
- Prerequisites
- Installation
- Quick Start
- Configuration
- Usage
- Output
- Troubleshooting
- Project Structure
- Contributing
- Documentation
- License
🎯 Overview
The Dallas County Courts Portal Scraper automates the process of searching for and extracting case information from the public Dallas County Courts Portal. It eliminates the need for manual case-by-case searches by:
- Searching multiple attorneys concurrently using a thread pool for efficient processing
- Filtering results by case type, charge keywords, and date range
- Handling captchas automatically via 2Captcha API or manual completion
- Exporting structured data in multiple formats (Excel, CSV, JSON) with professional formatting
This tool is designed for legal professionals who need to systematically track cases across multiple attorneys, filter by specific charge types, and export data for analysis or reporting.
✨ Features
Core Capabilities
- Multi-Attorney Support: Process multiple attorneys concurrently with thread pool execution
- Intelligent Filtering:
  - Filter by case type (default: felony cases)
  - Filter by charge keywords (e.g., "ASSAULT", "THEFT")
  - Filter by minimum file date/year
- Captcha Handling:
  - Automated solving via 2Captcha API
  - Manual completion fallback for reliability
- Stealth Automation: Browser automation with anti-detection measures to mimic human behavior
- Flexible Export:
  - Excel format with multi-sheet support (one sheet per attorney)
  - CSV format for data analysis
  - JSON format for programmatic access
- Robust Error Handling:
  - Session recovery on navigation failures
  - Partial result preservation on interruption
  - Comprehensive logging for debugging
Technical Features
- Concurrent Processing: Uses ThreadPoolExecutor to process multiple attorneys simultaneously (see the sketch after this list)
- Browser Automation: Built on Playwright for fast, reliable browser control
- Data Extraction: Extracts comprehensive case details, including:
  - Case number, file date, judicial officer
  - Case status and type
  - Charge descriptions
  - Bond amounts
  - Disposition and sentencing information
- Thread-Safe Operations: Safe concurrent execution with proper resource management
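A minimal sketch of the concurrent pattern, assuming a hypothetical `scrape_attorney()` worker (the project's real orchestration lives in `scraper_pool.py`):

```python
# Minimal sketch of concurrent attorney processing; scrape_attorney()
# is a hypothetical stand-in for the real per-attorney scrape.
from concurrent.futures import ThreadPoolExecutor, as_completed

ATTORNEYS = [
    {"first_name": "JOHN", "last_name": "DOE"},
    {"first_name": "JANE", "last_name": "SMITH"},
]

def scrape_attorney(attorney):
    """Hypothetical worker: search the portal, return a list of case dicts."""
    raise NotImplementedError  # stand-in for the real scraping logic

all_cases = []
with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit one job per attorney and collect results as they finish.
    futures = {pool.submit(scrape_attorney, a): a for a in ATTORNEYS}
    for future in as_completed(futures):
        attorney = futures[future]
        try:
            all_cases.extend(future.result())
        except Exception as exc:
            print(f"{attorney['last_name']}: failed with {exc!r}")
```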
📦 Prerequisites
Before you begin, ensure you have the following installed:
- Python 3.8 or higher (Download Python)
- Google Chrome or Chromium browser (automatically installed with Playwright)
- Git (optional, for cloning the repository)
System Requirements
- Operating System: Windows, macOS, or Linux
- Memory: Minimum 4GB RAM (8GB+ recommended for concurrent processing)
- Internet Connection: Required for accessing the Dallas County Courts Portal
🚀 Installation
Step 1: Clone or Download
If using Git:
git clone <repository-url>
cd DallasCountyScraper
Or download and extract the ZIP file to your desired location.
Step 2: Install Python Dependencies
Navigate to the project directory and install required packages:
pip install -r requirements.txt
Note: It's recommended to use a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Step 3: Install Playwright Browsers
Install the Chromium browser used by Playwright:
playwright install chromium
This step downloads the browser binary (~170MB). Playwright handles browser updates automatically.
Step 4: Configure the Scraper
Edit config.py with your search criteria (see Configuration section for details).
⚡ Quick Start
Configure attorneys and keywords in config.py:
ATTORNEYS = [
{"first_name": "JOHN", "last_name": "DOE"},
]
CHARGE_KEYWORDS = ["ASSAULT", "THEFT"]
Run the scraper:
python main.py
Find results in the results/ directory (Excel file with timestamp)
That's it! The scraper will automatically:
- Navigate to the portal
- Search for each attorney
- Filter cases by your criteria
- Extract case details
- Export results to Excel
⚙️ Configuration
All configuration is managed through config.py with optional environment variable overrides via a .env file.
Required Configuration
1. Attorneys List
Define the attorneys to search for:
ATTORNEYS = [
{"first_name": "JOHN", "last_name": "DOE"},
{"first_name": "JANE", "last_name": "SMITH"},
# Add more attorneys as needed
]
Important: Names must match exactly as they appear in the portal (case-sensitive).
2. Charge Keywords
Specify keywords to filter case descriptions:
CHARGE_KEYWORDS = [
"ASSAULT",
"THEFT",
"BURGLARY",
# Add more keywords as needed
]
Cases matching ANY of these keywords will be included in the results.
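For illustration, the keyword check likely reduces to a case-insensitive any() test along these lines (a sketch; the actual matching logic lives in the scraper):

```python
# Sketch of the charge-keyword filter: include a case if ANY
# configured keyword appears in its charge text.
CHARGE_KEYWORDS = ["ASSAULT", "THEFT", "BURGLARY"]

def matches_keywords(charge_text, keywords=CHARGE_KEYWORDS):
    """Return True if any configured keyword appears in the charge text."""
    text = charge_text.upper()
    return any(keyword.upper() in text for keyword in keywords)

print(matches_keywords("AGG ASSAULT W/DEADLY WEAPON"))  # True
print(matches_keywords("DWI 3RD OR MORE"))              # False
```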
Optional Configuration
Date Filtering
Set the minimum year for cases to process:
Option 1: Edit config.py
MINIMUM_CASE_YEAR = 2024
Option 2: Use an environment variable (recommended for deployment). Create a .env file:
MINIMUM_CASE_YEAR=2024
Captcha Service (Optional but Recommended)
For automated captcha solving, set up 2Captcha:
- Get API Key: Sign up at 2captcha.com
- Configure: Add to the .env file: CAPTCHA_API_KEY=your_api_key_here
- Enable: In config.py: USE_CAPTCHA_SERVICE = True
Note: Without an API key, the scraper will pause for manual captcha completion. This is fine for testing but impractical for automation.
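For reference, a bare-bones sketch of the 2Captcha reCAPTCHA flow that captcha_handler.py presumably wraps; the site key and page URL below are placeholders:

```python
# Bare-bones 2Captcha reCAPTCHA v2 flow (submit job, poll for token);
# sitekey/page_url are placeholders, and the project's real
# integration lives in captcha_handler.py.
import os
import time
import requests

API_KEY = os.environ["CAPTCHA_API_KEY"]

def solve_recaptcha(sitekey, page_url):
    # Submit the captcha job to 2Captcha.
    resp = requests.post("http://2captcha.com/in.php", data={
        "key": API_KEY,
        "method": "userrecaptcha",
        "googlekey": sitekey,
        "pageurl": page_url,
        "json": 1,
    }).json()
    task_id = resp["request"]
    # Poll until the token is ready.
    while True:
        time.sleep(10)
        result = requests.get("http://2captcha.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id, "json": 1,
        }).json()
        if result["status"] == 1:
            return result["request"]  # the g-recaptcha-response token
        if result["request"] != "CAPCHA_NOT_READY":
            raise RuntimeError(f"2Captcha error: {result['request']}")
```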
Browser Settings
# Run browser in background (no visible window)
HEADLESS = True # Set to False for debugging
# Chrome executable path (if Chrome is installed in non-standard location)
# Configure via environment variable: CHROME_PATH=/path/to/chrome
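A sketch of how these settings could feed into a Playwright launch, assuming utils.py does something similar (the portal URL is shown for illustration):

```python
# Sketch of a Playwright launch honoring HEADLESS and CHROME_PATH;
# the project's actual browser setup lives in utils.py.
import os
from playwright.sync_api import sync_playwright

HEADLESS = True

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=HEADLESS,
        # Use a system Chrome install if CHROME_PATH is set,
        # otherwise fall back to Playwright's bundled Chromium.
        executable_path=os.environ.get("CHROME_PATH") or None,
    )
    page = browser.new_page()
    page.goto("https://courtsportal.dallascounty.org/")  # illustrative URL
    browser.close()
```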
Output Settings
# Output directory
OUTPUT_DIR = "results"
# Output format: "excel", "csv", or "json"
OUTPUT_FORMAT = "excel"
Advanced Settings
# Delay before browser actions (useful for debugging)
ACTION_DELAY_SECONDS = 0 # Set to 0 for normal operation, 3+ for manual observation
# Case type filter (default: "FELONY")
CASE_TYPE = "FELONY"
# Enable automatic session recovery on errors
ENABLE_SESSION_RECOVERY = True
Environment Variables
Create a .env file in the project root to override config values:
# Captcha API Key
CAPTCHA_API_KEY=your_2captcha_api_key_here
# Chrome Path (if non-standard location)
CHROME_PATH=C:\Program Files\Google\Chrome\Application\chrome.exe
# Minimum Case Year
MINIMUM_CASE_YEAR=2024
# Thread Pool Workers (optional; when unset, ThreadPoolExecutor's default of min(32, CPU_COUNT + 4) applies)
MAX_WORKERS=16
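A sketch of how config.py might layer .env values over defaults, assuming the python-dotenv package; see config.py for the authoritative precedence:

```python
# Sketch of .env-over-defaults config loading, assuming python-dotenv;
# config.py defines the real precedence rules.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present

CAPTCHA_API_KEY = os.getenv("CAPTCHA_API_KEY", "")
CHROME_PATH = os.getenv("CHROME_PATH")  # None -> use bundled Chromium
MINIMUM_CASE_YEAR = int(os.getenv("MINIMUM_CASE_YEAR", "2024"))
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "0")) or None  # None -> library default
```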
📖 Usage
Basic Usage
Run the scraper with default settings:
python main.py
The scraper will:
- Validate your configuration
- Display the configuration summary
- Process each attorney concurrently
- Export results to the results/ directory
- Display summary statistics
Output
Results are saved in the results/ directory with the following naming convention:
- Excel: cases_YYYYMMDD_HHMMSS.xlsx
- CSV: cases_YYYYMMDD_HHMMSS.csv
- JSON: cases_YYYYMMDD_HHMMSS.json
Excel Export Features
- Multi-Sheet Support: One sheet per attorney plus an "All Cases" summary sheet
- Professional Formatting:
  - Auto-sized columns
  - Styled headers (blue background, white text)
  - Proper column ordering
- Column Titles: User-friendly column names (e.g., "Attorney", "Case Number", "File Date")
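As an illustration, a multi-sheet export like this can be produced with pandas and openpyxl (a sketch; result_exporter.py handles the real styling and column ordering):

```python
# Sketch of a timestamped multi-sheet Excel export with pandas/openpyxl;
# the project's real formatting logic lives in result_exporter.py.
from datetime import datetime
from pathlib import Path

import pandas as pd

def export_excel(cases_by_attorney, output_dir="results"):
    """cases_by_attorney maps attorney name -> list of case dicts."""
    Path(output_dir).mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{output_dir}/cases_{stamp}.xlsx"
    all_rows = []
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for attorney, cases in cases_by_attorney.items():
            # Excel caps sheet names at 31 characters.
            pd.DataFrame(cases).to_excel(writer, sheet_name=attorney[:31], index=False)
            all_rows.extend(cases)
        pd.DataFrame(all_rows).to_excel(writer, sheet_name="All Cases", index=False)
    return path
```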
Logging
All operations are logged to logs/scraper_YYYYMMDD_HHMMSS.log with:
- Thread identification for concurrent execution
- Detailed progress information
- Error messages and stack traces
- Configuration summary
Review logs for debugging or audit purposes.
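The thread-aware format is likely built on the standard logging module, roughly like this (a sketch; the project's handler setup may differ):

```python
# Sketch of thread-aware logging to a timestamped file, using only
# the standard library; the project's handler setup may differ.
import logging
from datetime import datetime
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
logging.basicConfig(
    filename=f"logs/scraper_{stamp}.log",
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("Configuration validated")
```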
Interrupting Execution
Press Ctrl+C to gracefully interrupt the scraper. Partial results will be exported automatically.
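The graceful-interrupt behavior presumably amounts to catching KeyboardInterrupt and exporting whatever was collected; run_all_attorneys() and export_results() below are illustrative stand-ins:

```python
# Sketch of graceful Ctrl+C handling with partial-result export;
# both helpers below are hypothetical stand-ins.
def run_all_attorneys():
    """Hypothetical generator yielding per-attorney case lists."""
    yield [{"case_number": "F-2400001"}]

def export_results(cases):
    """Hypothetical exporter for whatever was collected."""
    print(f"Exported {len(cases)} cases")

collected = []
try:
    for batch in run_all_attorneys():
        collected.extend(batch)
except KeyboardInterrupt:
    print("Interrupted: exporting partial results...")
finally:
    if collected:
        export_results(collected)
```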
🔧 Troubleshooting
Common Issues and Solutions
Issue: "No matching cases found"
Possible Causes:
- Attorney names don't match the portal exactly (check case sensitivity)
- No cases match the charge keywords
- All cases are before the minimum year threshold
- Case type filter doesn't match (default is "FELONY")
Solutions:
- Verify attorney names match the portal exactly
- Check charge keywords are spelled correctly
- Adjust MINIMUM_CASE_YEAR if needed
- Review logs for detailed filtering information
Issue: Captcha not solving automatically
Solutions:
- Verify CAPTCHA_API_KEY is set correctly in .env
- Check your 2Captcha account balance at 2captcha.com
- Set USE_CAPTCHA_SERVICE = False to use manual completion for testing
Issue: Chrome not found
Solutions:
- Ensure Chrome is installed on your system
- Set the CHROME_PATH environment variable to the Chrome executable location
- On Windows, the default is C:\Program Files\Google\Chrome\Application\chrome.exe
- On Linux, try /usr/bin/google-chrome or /usr/bin/chromium-browser
- On macOS, try /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Issue: "Latest case is not from [YEAR]" error
Solutions:
- This usually means the attorney has no cases from the specified minimum year
- Adjust MINIMUM_CASE_YEAR in config.py or the .env file
- Verify the attorney has recent cases in the portal
Issue: Timeout errors
Solutions:
- Increase EXPLICIT_WAIT or PAGE_LOAD_TIMEOUT in config.py
- Check your internet connection
- The portal may be experiencing high traffic
Issue: Browser crashes or errors
Solutions:
- Ensure sufficient memory is available (close other applications)
- Reduce the MAX_WORKERS environment variable to run fewer concurrent browsers
- Enable ENABLE_SESSION_RECOVERY = True for automatic recovery
- Check logs for specific error messages
Getting Help
- Check the logs: Review the logs/ directory for detailed error information
- Review configuration: Verify all settings in config.py
- Test with a single attorney: Start with one attorney to isolate issues
- Check documentation: See docs/TECHNICAL_DOCUMENTATION.md for architecture details
📁 Project Structure
DallasCountyScraper/
│
├── main.py                     # Application entry point
├── config.py                   # Configuration file
├── requirements.txt            # Python dependencies
├── README.md                   # This file
│
├── scraper.py                  # Core scraper class
├── scraper_pool.py             # Thread pool manager for concurrent execution
├── case_data_extractor.py      # Case detail extraction functions
├── captcha_handler.py          # Captcha detection and solving
├── result_exporter.py          # Export to CSV/Excel/JSON
├── utils.py                    # Browser setup and utility functions
├── inspect_website.py          # Standalone website inspection tool
│
├── docs/
│   ├── TECHNICAL_DOCUMENTATION.md  # Detailed technical documentation
│   └── FLOW_DIAGRAM.md             # Visual process flow
│
├── logs/                       # Execution logs (auto-generated)
├── results/                    # Exported results (auto-generated)
└── .env                        # Environment variables (create this file)
Key Modules
- main.py: Orchestrates the entire workflow
- scraper.py: Handles browser automation and case extraction per attorney
- scraper_pool.py: Manages concurrent execution across multiple attorneys
- case_data_extractor.py: Extracts structured data from case detail pages
- captcha_handler.py: Handles reCAPTCHA detection and solving
- result_exporter.py: Formats and exports results to various formats
🤝 Contributing
Contributions are welcome! Here's how you can help improve this project:
Reporting Issues
If you encounter a bug or have a feature request:
- Check existing issues to avoid duplicates
- Create a new issue with:
- Clear description of the problem or feature
- Steps to reproduce (for bugs)
- Expected vs. actual behavior
- System information (OS, Python version, etc.)
Submitting Changes
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature-name
- Make your changes with clear, documented code
- Test your changes thoroughly
- Commit with descriptive messages: git commit -m "Add feature: description"
- Push to your fork: git push origin feature/your-feature-name
- Open a Pull Request with:
- Description of changes
- Reason for the change
- Any breaking changes
Code Style
- Follow PEP 8 Python style guidelines
- Add docstrings to all functions and classes
- Include comments for complex logic
- Maintain existing code style and structure
Contribution Licensing
This project is open source under the MIT License. By submitting a pull request, you agree that:
- Your contribution will be licensed under the same license as the project (MIT License)
- You have the right to grant these rights
- Your contribution does not violate any third-party rights
Contributions are especially welcome in these areas:
- Bug fixes and error handling improvements
- Performance optimizations
- Additional export formats
- Enhanced filtering capabilities
- Documentation improvements
- Test coverage
📚 Documentation
Additional Documentation
For detailed technical information, refer to:
- Technical Documentation: Comprehensive architecture overview, module reference, design patterns, and API specifications
- Flow Diagram: Visual representation of the scraping process flow
Key Concepts
Case Type Filtering: Only processes rows containing the configured case type (default: "FELONY"). The match is a case-insensitive substring check.
Charge Keyword Filtering: On each case detail page, checks if any of the configured keywords appear in the page content. If no keywords match, the case is skipped entirely.
Date Filtering: Before processing any cases, the scraper validates that the newest case in the results meets the minimum year requirement. This avoids processing attorneys who haven't practiced recently in Dallas County.
Concurrent Processing: Uses Python's ThreadPoolExecutor to run multiple scraper instances simultaneously, each with its own browser context. This significantly improves throughput for multi-attorney searches.
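A sketch of what that newest-case check could look like, assuming file dates arrive as MM/DD/YYYY strings (the real validation lives in the scraper):

```python
# Sketch of the minimum-year gate on the newest case, assuming
# MM/DD/YYYY file-date strings; the real check lives in the scraper.
from datetime import datetime

MINIMUM_CASE_YEAR = 2024

def newest_case_is_recent(file_dates, minimum_year=MINIMUM_CASE_YEAR):
    """Return True if the most recent file date falls on/after minimum_year."""
    parsed = [datetime.strptime(d, "%m/%d/%Y") for d in file_dates]
    return max(parsed).year >= minimum_year

print(newest_case_is_recent(["01/15/2023", "06/02/2024"]))  # True
print(newest_case_is_recent(["03/10/2019", "11/30/2021"]))  # False
```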
⚖️ License
This project is licensed under the MIT License.
Note: This tool is designed for legitimate use cases such as legal research, case tracking, and public record access. Users are responsible for complying with the Dallas County Courts Portalβs terms of service and applicable laws regarding web scraping and data collection.
🙏 Acknowledgments
- Playwright: Fast and reliable browser automation
- 2Captcha: Captcha solving service integration
- pandas & openpyxl: Data processing and Excel export capabilities
📊 Project Status
Status: ✅ Active Development / Production Ready
Current Version
- Multi-attorney concurrent processing
- Automated captcha solving
- Excel/CSV/JSON export with formatting
- Session recovery and error handling
Known Limitations
- Sequential case extraction (cases processed one at a time)
- Chrome/Chromium browser only (Firefox/Edge support not implemented)
Roadmap
Planned Enhancements:
- Parallel case extraction
- Database export option (SQLite/PostgreSQL)
- Enhanced error messages with actionable suggestions
- Web dashboard for monitoring and configuration
See docs/TECHNICAL_DOCUMENTATION.md for detailed roadmap information.
📞 Support
For questions, issues, or contributions:
- Issues: Open an issue on the repository
- Documentation: Check docs/TECHNICAL_DOCUMENTATION.md for technical details
- Logs: Review the logs/ directory for execution traces and debugging information
Made with ❤️ for legal professionals and researchers
Last Updated: November 2025