# SightSearch
SightSearch is a robust, scalable data ingestion pipeline designed to scrape product data, process product images, and store structured metadata for downstream search and analysis applications.
Built with Apache Airflow, Docker, and Python, SightSearch orchestrates the entire lifecycle of data ingestion, from web scraping to database storage, ensuring reliability and ease of monitoring.
## Features

- Automated Scraping: Fetches product data (titles, prices, ratings) from target websites.
- Image Processing: Downloads product images and extracts metadata (dimensions, format, pHash); see the sketch after this list.
- Data Validation: Ensures data integrity with strict schema validation before storage.
- Orchestration: Fully containerized Airflow pipeline for scheduling and monitoring tasks.
- Scalable Storage:
  - MongoDB: Stores flexible product metadata and rejected records.
  - PostgreSQL: Manages Airflow’s internal state.
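To give a sense of what the image-processing step produces, here is a minimal sketch of metadata extraction. It assumes the Pillow, requests, and imagehash packages; the actual implementation lives in `src/image_processing.py` and may differ:

```python
# Illustrative sketch only; the real logic is in src/image_processing.py.
import io

import imagehash       # perceptual hashing (pip install imagehash)
import requests
from PIL import Image  # pip install Pillow


def extract_image_metadata(url: str) -> dict:
    """Download an image and return its dimensions, format, and pHash."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    width, height = img.size
    return {
        "width": width,
        "height": height,
        "format": img.format,                # e.g. "JPEG" or "PNG"
        "phash": str(imagehash.phash(img)),  # perceptual hash as a hex string
    }
```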
## Prerequisites
Before you begin, ensure you have the following installed on your machine:
- Docker: For running the containerized services.
- Docker Compose: For orchestration of the multi-container environment.
## Getting Started
Follow these steps to get the pipeline up and running in minutes.
### 1. Clone the Repository
```bash
git clone https://github.com/25thOliver/SightSearch.git
cd SightSearch
```
### 2. Configure Environment Variables
For security, the project uses a `.env` file to manage sensitive credentials.

Navigate to the `docker` directory:

```bash
cd docker
```

Create a file named `.env` and add the following configuration. You can change the passwords for a production setup:
```env
# Database Configuration
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# MongoDB
MONGO_URI=mongodb://mongodb:27017
MONGO_DB=sightsearch

# Airflow Admin
AIRFLOW_ADMIN_USER=admin
AIRFLOW_ADMIN_PASSWORD=admin
AIRFLOW_ADMIN_EMAIL=admin@example.com

# Airflow Core
AIRFLOW__CORE__EXECUTOR=LocalExecutor
```
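These variables are consumed inside the containers. As an illustration only (not the project's actual code), application code could pick up the MongoDB settings like this:

```python
# Hypothetical example of reading the MongoDB settings from the
# environment; the real code in src/storage.py may differ.
import os

from pymongo import MongoClient

mongo_uri = os.environ.get("MONGO_URI", "mongodb://mongodb:27017")
mongo_db = os.environ.get("MONGO_DB", "sightsearch")

db = MongoClient(mongo_uri)[mongo_db]
products = db["products"]  # collection that receives ingested records
```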
### 3. Start the Services
From the `docker` directory, start the entire stack using Docker Compose:

```bash
docker-compose up -d --build
```
This command will:
- Build the custom Airflow and Scraper images.
- Start MongoDB, PostgreSQL, and Airflow services.
- Initialize the Airflow database and create the admin user.
### 4. Access the Application
Once the services are running (wait a minute or two for initialization):
- Airflow UI: Open http://localhost:8080 in your browser.
  - Username: `admin` (or what you set in `.env`)
  - Password: `admin`
## Usage
Trigger the Pipeline:

- In the Airflow UI, find the DAG named `sightsearch_ingestion_pipeline`.
- Toggle the switch to unpause the DAG.
- Click the Play button (Trigger DAG) to start a manual run.
Monitor Progress:

- Click on the DAG ID to view the Grid/Graph view.
- Watch as the tasks (`scrape`, `image_processing`, `validate`, `store_valid`) turn dark green (success); a minimal sketch of how these tasks might be wired is shown below.
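For a rough idea of the DAG structure, here is a hypothetical sketch; the placeholder callables stand in for the real functions under `src/`:

```python
# Hypothetical sketch of the DAG wiring; not the project's actual DAG file.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_scrape():
    ...  # scrape product data (see src/scraper.py)


def run_image_processing():
    ...  # download images and extract metadata (see src/image_processing.py)


def run_validate():
    ...  # schema-validate records (see src/validators.py)


def run_store_valid():
    ...  # write valid records to MongoDB (see src/storage.py)


with DAG(
    dag_id="sightsearch_ingestion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape", python_callable=run_scrape)
    image_processing = PythonOperator(
        task_id="image_processing", python_callable=run_image_processing
    )
    validate = PythonOperator(task_id="validate", python_callable=run_validate)
    store_valid = PythonOperator(
        task_id="store_valid", python_callable=run_store_valid
    )

    scrape >> image_processing >> validate >> store_valid
```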
Verify Data:

- You can connect to the MongoDB instance on port `27020` to inspect the ingested data in the `sightsearch.products` collection.
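For example, from the host with pymongo (a minimal sketch; the field names `title` and `price` are assumptions based on the scraped fields listed under Features):

```python
# Quick check from the host machine; assumes MongoDB is published on
# host port 27020 as described above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27020")
products = client["sightsearch"]["products"]

print("ingested documents:", products.count_documents({}))
for doc in products.find().limit(3):
    print(doc.get("title"), doc.get("price"))  # assumed field names
```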
## Project Structure

```
sightsearch/
├── docker/                    # Docker configuration files
│   ├── airflow/               # Airflow-specific Dockerfile and configs
│   │   └── dags/              # Airflow Directed Acyclic Graphs (DAGs)
│   ├── scraper/               # Scraper-specific Dockerfile
│   ├── docker-compose.yml     # Service orchestration
│   └── .env                   # (Created by you) Secrets and env vars
├── src/                       # Application source code
│   ├── scraper.py             # Web scraping logic
│   ├── image_processing.py    # Image metadata extraction
│   ├── storage.py             # Database interactions
│   └── validators.py          # Data validation logic
├── tests/                     # Unit tests
├── images/                    # Downloaded product images
├── requirements.airflow.txt   # Python dependencies for Airflow
└── README.md                  # This file
```
## Contributing

Contributions are welcome! If you have suggestions for improvements or new features:

1. Fork the repository.
2. Create a feature branch (`git checkout -b feature/AmazingFeature`).
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`).
4. Push to the branch (`git push origin feature/AmazingFeature`).
5. Open a Pull Request.