ReSearch

ReSearch is a web crawler and search engine that crawls a predefined set of domains, indexes the content of the pages, and provides a search API to query the indexed data.

Architecture

The project consists of the following components:

  • Crawler: A multi-threaded web crawler that fetches URLs from a Kvrocks queue, crawls the pages, and extracts content and links.
  • Background Jobs: A scheduler that runs background jobs to process the crawled data. This includes ingesting data into the search index, calculating page and domain scores, and managing the crawl queue.
  • API: A web server that provides a search API. It uses Quickwit as the search backend.
  • Kvrocks: Used as a message queue for the crawler, for caching, and for storing data like page scores…
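
To make the crawler's role concrete, here is a minimal sketch of a multi-threaded crawl loop restricted to a predefined set of domains. It is not ReSearch's actual code: an in-memory `queue.Queue` stands in for the Kvrocks queue, and the names `crawl`, `LinkExtractor`, and the `fetch` callable are hypothetical, chosen only for illustration.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and the non-empty text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed_urls, allowed_domains, fetch, num_workers=4):
    """Crawl from seed_urls, staying inside allowed_domains.

    fetch is a hypothetical callable mapping a URL to an HTML string
    (or None on failure). Returns a dict of {url: extracted_text}.
    """
    frontier = queue.Queue()  # assumption: stand-in for the Kvrocks queue
    seen = set()
    pages = {}
    lock = threading.Lock()

    def enqueue(url):
        # Only URLs on the predefined domains are ever queued.
        if urlparse(url).hostname not in allowed_domains:
            return
        with lock:
            if url in seen:
                return
            seen.add(url)
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:  # sentinel: no more work
                frontier.task_done()
                return
            try:
                html = fetch(url)
                if html is not None:
                    parser = LinkExtractor()
                    parser.feed(html)
                    with lock:
                        pages[url] = " ".join(parser.text_parts)
                    # New links are queued before this URL is marked done,
                    # so frontier.join() cannot return too early.
                    for href in parser.links:
                        enqueue(urljoin(url, href))
            finally:
                frontier.task_done()

    for url in seed_urls:
        enqueue(url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()  # wait until every queued URL has been processed
    for _ in threads:
        frontier.put(None)
    for t in threads:
        t.join()
    return pages


# A tiny fake site so the sketch runs without network access.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.test/x">Ext</a>',
    "https://example.com/about": '<p>About us</p><a href="/">Home</a>',
}
pages = crawl(["https://example.com/"], {"example.com"}, SITE.get)
```

In this sketch the link to `other.test` is dropped because its host is not in the allowed set, and the `seen` set prevents the back-link to the home page from being crawled twice. In a real deployment, the extracted text would be handed to the background jobs for ingestion rather than kept in a dictionary.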
