ReSearch
ReSearch is a web crawler and search engine that crawls a predefined set of domains, indexes the content of the pages, and provides a search API to query the indexed data.
Architecture
The project consists of the following components:
- Crawler: A multi-threaded web crawler that fetches URLs from a Kvrocks queue, crawls the pages, and extracts content and links.
- Background Jobs: A scheduler that runs background jobs to process the crawled data. This includes ingesting data into the search index, calculating page and domain scores, and managing the crawl queue.
- API: A web server that provides a search API. It uses Quickwit as the search backend.
- Kvrocks: Used as a message queue for the crawler, for caching, and for storing data like page scores and domain scores.
- Quickwit: A distributed search engine used for indexing and searching the crawled data.
Data Flow
- The crawler fetches URLs from the `CrawlListPath` in Kvrocks.
  - If the crawl list is empty, it is refilled with a set of seed domains.
- The crawler crawls each page, extracts the content, and finds all the links.
- The extracted page data is added to the `PageDataQueue`.
- The links found on the page are added to the `SeenUrls` map.
- The background scheduler periodically runs the following jobs:
  - `ingestSites`: Takes the page data from the `PageDataQueue` and ingests it into Quickwit.
  - `addSiteScore` and `addDomainScore`: Calculates and updates the page and domain scores in Kvrocks.
  - `addSeenUrls`: Adds new URLs from the `SeenUrls` map to the `CrawlListPath` in Kvrocks, so they can be crawled in the future.
- The `api` server provides a search endpoint that queries the Quickwit index.
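The queue-and-map flow above can be sketched in Go with in-memory stand-ins for the Kvrocks structures (the real project keeps these in Kvrocks; the type and function names here are illustrative, not the project's actual identifiers):

```go
package main

import "fmt"

// PageData is an illustrative stand-in for the data the crawler extracts.
type PageData struct {
	URL   string
	Title string
	Text  string
	Links []string
}

var (
	crawlList     []string            // stand-in for CrawlListPath in Kvrocks
	seenUrls      = map[string]bool{} // stand-in for the SeenUrls map
	pageDataQueue []PageData          // stand-in for PageDataQueue
	seedDomains   = []string{"https://example.com"}
)

// nextURL pops the next URL to crawl, refilling from seeds when empty.
func nextURL() string {
	if len(crawlList) == 0 {
		crawlList = append(crawlList, seedDomains...)
	}
	u := crawlList[0]
	crawlList = crawlList[1:]
	return u
}

// recordPage queues extracted page data and marks discovered links as seen.
func recordPage(p PageData) {
	pageDataQueue = append(pageDataQueue, p)
	for _, l := range p.Links {
		seenUrls[l] = true
	}
}

// addSeenUrls mimics the background job: move seen URLs onto the crawl list.
func addSeenUrls() {
	for u := range seenUrls {
		crawlList = append(crawlList, u)
	}
	seenUrls = map[string]bool{}
}

func main() {
	u := nextURL() // crawl list empty, so it is refilled from seeds first
	recordPage(PageData{URL: u, Title: "Example", Links: []string{"https://example.com/a"}})
	addSeenUrls()
	fmt.Println(len(pageDataQueue), len(crawlList)) // 1 1
}
```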
API Security and Traffic Control
This project uses APIGate as a smart traffic control and security layer for its API. APIGate helps protect the service from abuse, blocks malicious traffic, and provides valuable insights into API usage patterns. It operates with very low latency by using two simple endpoints.
Pre-request Checks: Before processing a search request, the API server makes a call to APIGate’s /api/allow endpoint. This isn’t a traditional authentication check; instead, it’s a real-time decision to determine if the request should be permitted based on the source IP’s reputation, request frequency, and other security rules. This is the project’s first line of defense against abusive request patterns.
Post-request Logging: After a request is handled, the server sends metadata to APIGate’s /api/log endpoint. This allows for powerful, real-time monitoring and analytics. It helps in understanding traffic patterns, identifying potential threats, and getting a clear view of how the API is being used.
By integrating APIGate, the project benefits from an enterprise-grade protection layer, allowing the developers to focus on the core search functionality while ensuring the API remains fast, reliable, and secure.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
- Go (to build and run the crawler and API server)
- Docker (to run Kvrocks and Quickwit)
Configuration
The application is configured using environment variables. You can set them in your shell or create a .env file in the project root.
# .env file
# Kvrocks Configuration
KVRocks_ADDR=localhost:6666
KVRocks_PASSWORD=
KVRocks_DB=0
KVPoolSize=250
KVIdleConns=25
# Quickwit Configuration
QUICKWIT_ADDR=http://localhost:7280
QW_INDEX=pages
# APIGate Configuration
APIGATE_API_KEY=your_apigate_api_key
Initial Setup
Start Dependent Services (Kvrocks & Quickwit):
This project requires Kvrocks and Quickwit. You can start them using Docker.
Start Kvrocks (the `--platform linux/arm64/v8` flag targets ARM hosts such as Apple Silicon; drop it on x86_64):
docker run -d --name kvrocks -p 6666:6666 -v kvrocks_data:/data --restart unless-stopped --platform linux/arm64/v8 apache/kvrocks:latest
Start Quickwit:
docker run -d \
--name quickwit \
-p 7280:7280 \
-v /var/lib/quickwit:/var/lib/quickwit \
-e QW_INGEST_FLUSH_INTERVAL_SECS=5 \
-e QW_INGEST_FORCE_FLUSH_MIN_INTERVAL_SECS=2 \
quickwit/quickwit:latest run
Install Go Dependencies:
Download the required Go modules.
go mod tidy
Create the Quickwit Index:
Use the provided site-index.yaml to create the search index.
curl -XPOST http://localhost:7280/api/v1/indexes -H "content-type: application/yaml" --data-binary @site-index.yaml
Run the Applications:
Start the API Server:
go run api/main.go
Start the Crawler and Background Jobs: (In a new terminal)
go run main.go
Verify the Setup
After a few moments, the crawler should have indexed some data. You can verify this by searching the index:
curl "http://127.0.0.1:7280/api/v1/pages/search?query=*:*&limit=5"
This should return a JSON response with the first 5 indexed pages.
API Usage and Commands
Quickwit
Describe the index:
docker exec quickwit quickwit index describe --index pages
Search the index:
curl "http://127.0.0.1:7280/api/v1/pages/search?query=title:INF"
Ingest data into the index:
curl -XPOST "http://127.0.0.1:7280/api/v1/pages/ingest" \
-H "Content-Type: application/x-ndjson" \
--data-binary @- <<'EOF'
{"url":"https://example.com/new1","title":"New Page 3","crawl_timestamp":1731240000,"cleaned_text":"Some example text"}
{"url":"https://example.com/new2","title":"New Page 4","crawl_timestamp":1731240000,"cleaned_text":"Another text"}
EOF
Search the index with a POST request:
curl -XPOST "http://127.0.0.1:7280/api/v1/pages/search" \
-H "Content-Type: application/json" \
-d '{
"query": "title:Domain",
"sort_by_field": "_score"
}'
License
This project is licensed under the MIT License - see the LICENSE file for details.