ReSearch
ReSearch is a web crawler and search engine that crawls a predefined set of domains, indexes the content of the pages, and provides a search API to query the indexed data.
Architecture
The project consists of the following components:
- Crawler: A multi-threaded web crawler that fetches URLs from a Kvrocks queue, crawls the pages, and extracts content and links.
- Background Jobs: A scheduler that runs background jobs to process the crawled data. This includes ingesting data into the search index, calculating page and domain scores, and managing the crawl queue.
- API: A web server that provides a search API. It uses Quickwit as the search backend.
- Kvrocks: Used as a message queue for the crawler, for caching, and for storing data like page scores and domain scores.
- Quickwit: A distributed search engine used for indexing and searching the crawled data.
Data Flow
- The crawler fetches URLs from the `CrawlListPath` in Kvrocks.
  - If the crawl list is empty, it is refilled with a set of seed domains.
- The crawler crawls each page, extracts the content, and finds all the links.
- The extracted page data is added to the `PageDataQueue`.
- The links found on the page are added to the `SeenUrls` map.
- The background scheduler periodically runs the following jobs:
  - `ingestSites`: Takes the page data from the `PageDataQueue` and ingests it into Quickwit.
  - `addSiteScore` and `addDomainScore`: Calculates and updates the page and domain scores in Kvrocks.
  - `addSeenUrls`: Adds new URLs from the `SeenUrls` map to the `CrawlListPath` in Kvrocks, so they can be crawled in the future.
- The `api` server provides a search endpoint that queries the Quickwit index.
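The queue-and-map flow above can be sketched in Go with in-memory stand-ins for the Kvrocks structures (the real project keeps these in Kvrocks; the type and function names here are illustrative, not the project's actual identifiers):

```go
package main

import "fmt"

// PageData is an illustrative stand-in for the data the crawler extracts.
type PageData struct {
	URL   string
	Title string
	Text  string
	Links []string
}

var (
	crawlList     []string            // stand-in for CrawlListPath in Kvrocks
	seenUrls      = map[string]bool{} // stand-in for the SeenUrls map
	pageDataQueue []PageData          // stand-in for PageDataQueue
	seedDomains   = []string{"https://example.com"}
)

// nextURL pops the next URL to crawl, refilling from seeds when empty.
func nextURL() string {
	if len(crawlList) == 0 {
		crawlList = append(crawlList, seedDomains...)
	}
	u := crawlList[0]
	crawlList = crawlList[1:]
	return u
}

// recordPage queues extracted page data and marks discovered links as seen.
func recordPage(p PageData) {
	pageDataQueue = append(pageDataQueue, p)
	for _, l := range p.Links {
		seenUrls[l] = true
	}
}

// addSeenUrls mimics the background job: move seen URLs onto the crawl list.
func addSeenUrls() {
	for u := range seenUrls {
		crawlList = append(crawlList, u)
	}
	seenUrls = map[string]bool{}
}

func main() {
	u := nextURL() // crawl list empty, so it is refilled from seeds first
	recordPage(PageData{URL: u, Title: "Example", Links: []string{"https://example.com/a"}})
	addSeenUrls()
	fmt.Println(len(pageDataQueue), len(crawlList)) // 1 1
}
```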
API Security and Traffic Control
This project uses APIGate as a smart traffic control and security layer for its API. APIGate helps protect the service from abuse, blocks malicious traffic, and provides valuable insights into API usage patterns. It operates with very low latency by using two simple endpoints.
Pre-request Checks: Before processing a search request, the API server makes a call to APIGate’s /api/allow endpoint. This isn’t a traditional authentication check; instead, it’s a real-time decision to determine if the request should be permitted based on the source IP’s reputation, request frequency, and other security rules. This is the project’s first line of defense against abusive request patterns.
Post-request Logging: After a request is handled, the server sends metadata to APIGate’s /api/log endpoint. This allows for powerful, real-time monitoring and analytics. It helps in understanding traffic patterns, identifying potential threats, and getting a clear view of how the API is being used.
By integrating APIGate, the project benefits from an enterprise-grade protection layer, allowing the developers to focus on the core search functionality while ensuring the API remains fast, reliable, and secure.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
- Go (to build and run the crawler and API server)
- Docker (to run Kvrocks and Quickwit)
Configuration
The application is configured using environment variables. You can set them in your shell or create a .env file in the project root.
# .env file
# Kvrocks Configuration
KVRocks_ADDR=localhost:6666
KVRocks_PASSWORD=
KVRocks_DB=0
KVPoolSize=250
KVIdleConns=25
# Quickwit Configuration
QUICKWIT_ADDR=http://localhost:7280
QW_INDEX=pages
# APIGate Configuration
APIGATE_API_KEY=your_apigate_api_key
Initial Setup
Start Dependent Services (Kvrocks & Quickwit):
This project requires Kvrocks and Quickwit. You can start them using Docker.
Start Kvrocks (the `--platform linux/arm64/v8` flag targets ARM hosts such as Apple Silicon; drop it on x86_64):
docker run -d --name kvrocks -p 6666:6666 -v kvrocks_data:/data --restart unless-stopped --platform linux/arm64/v8 apache/kvrocks:latest
Start Quickwit:
docker run -d \
--name quickwit \
-p 7280:7280 \
-v /var/lib/quickwit:/var/lib/quickwit \
-e QW_INGEST_FLUSH_INTERVAL_SECS=5 \
-e QW_INGEST_FORCE_FLUSH_MIN_INTERVAL_SECS=2 \
quickwit/quickwit:latest run
Install Go Dependencies:
Download the required Go modules.
go mod tidy
Create the Quickwit Index:
Use the provided site-index.yaml to create the search index.
curl -XPOST http://localhost:7280/api/v1/indexes -H "content-type: application/yaml" --data-binary @site-index.yaml
Run the Applications:
Start the API Server:
go run api/main.go
Start the Crawler and Background Jobs: (In a new terminal)
go run main.go
Verify the Setup
After a few moments, the crawler should have indexed some data. You can verify this by searching the index:
curl "http://127.0.0.1:7280/api/v1/pages/search?query=*:*&limit=5"
This should return a JSON response with the first 5 indexed pages.
API Usage and Commands
Quickwit
Describe the index:
docker exec quickwit quickwit index describe --index pages
Search the index:
curl "http://127.0.0.1:7280/api/v1/pages/search?query=title:INF"
Ingest data into the index:
curl -XPOST "http://127.0.0.1:7280/api/v1/pages/ingest" \
-H "Content-Type: application/x-ndjson" \
--data-binary @- <<'EOF'
{"url":"https://example.com/new1","title":"New Page 3","crawl_timestamp":1731240000,"cleaned_text":"Some example text"}
{"url":"https://example.com/new2","title":"New Page 4","crawl_timestamp":1731240000,"cleaned_text":"Another text"}
EOF
Search the index with a POST request:
curl -XPOST "http://127.0.0.1:7280/api/v1/pages/search" \
-H "Content-Type: application/json" \
-d '{
"query": "title:Domain",
"sort_by_field": "_score"
}'
License
This project is licensed under the MIT License - see the LICENSE file for details.