🗄️ Web Datasets
Common Crawl, Corpus, Training data, Web scraping
Scoured 29596 posts in 61.3 ms
Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping · 🕷️ Web Crawling · arxiv.org · 1d
AI Web Scraping API & Intelligent Scraper Tool · 🕷️ Web Crawling · xcrawl.com · 6d
Index of /corpus/ · 📝 Text Compression · dev-libreoffice.org · 36m
Announcing a Change to Common Crawl Dataset Size Reporting · 🕷️ Web Crawling · commoncrawl.org · 1d · Hacker News
Building search-based RAG using Claude, Datasette and Val Town · 🚀 LanceDB · simonwillison.net · 5d
Ask HN: What do you use for local embeddings? · 📉 Embeddings Optimization · news.ycombinator.com · 2d · Hacker News
Stop Embedding Your Entire Corpus Blindly · 📄 Semantic Chunking · decompressed.io · 6d · Hacker News
CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking · 📌 Embedding Retrieval · arxiv.org · 1d
PRISM: PRIor from corpus Statistics for topic Modeling · 🕸️ Sparse Vectors · arxiv.org · 1d
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models · 🕯️ Candle · arxiv.org · 3d
LombardoGraphia: Automatic Classification of Lombard Orthography Variants · 🔤 Tokenization · arxiv.org · 2d
ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian · 🎨 Chroma · arxiv.org · 1d
Approaches to Analysing Historical Newspapers Using LLMs · 📋 Text Quality · arxiv.org · 6d
Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language · 🔤 Tokenization · arxiv.org · 2d
Introducing MELI: the Mandarin-English Language Interview Corpus · 🔤 Tokenization · arxiv.org · 2d
AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese · 🦙 Ollama · arxiv.org · 3d
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models · ✨ Gemini · arxiv.org · 1d
OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training · 📊 Embeddings · arxiv.org · 1d
Training data generation for context-dependent rubric-based short answer grading · ⭐ Content Scoring · arxiv.org · 2d
The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages · ✨ Gemini · arxiv.org · 1d