🗄️ Web Datasets
Common Crawl, Corpus, Training data, Web scraping
Scoured 28927 posts in 206.8 ms
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
🔤 Tokenization
arxiv.org · 4d
Training Data is Still an Open Problem
✨ Gemini
andrej.xyz · 2d
What holds AI safety together? Co-authorship networks from 200 papers
🛡️ AI Safety
lesswrong.com · 22h
Using a Local LLM as a Zero-Shot Classifier
🔤 Tokenization
towardsdatascience.com · 2d
mtmn/corpus: self-hosted listenbrainz and last.fm frontend
⛰ Alpine.js
github.com · 6d · Lobsters
Assembling 450 Billion Tokens: The Training Data Nobody Had Ready
🔤 Tokenization
pub.towardsai.net · 2d
The week that Meta employees became training data
👁️ Surveillance Capitalism
platformer.news · 1d
Automated Deanonymization is Here
🕷️ Web Crawling
jefftk.com · 4d
Language Generation in the Limit
💻 Programming languages
openreview.net · 1d
AI providers have millions of agent sessions. The first 1,589 are public.
💳 AI Commerce
danielvanstrien.xyz · 4d
Habeas Corpus Cases, Twitter, Ukraine Cultural Heritage, More: Monday ResearchBuzz, April 20, 2026
📰 RSS Reading Practices
researchbuzz.me · 5d
“No modern American city has ever run out of water. But chances are rising that Corpus Christi could be the first.”
💧 Water Infrastructure
kut.org · 1d · Hacker News, r/news
How I run distributed Rust fuzzing in GitHub Actions
🦀 Rust Web Services
depot.dev · 3d
Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval
🔍 SPLADE
arxiv.org · 1d
Report: Meta will train AI agents by tracking employees' mouse, keyboard use
🆕 New AI
arstechnica.com · 4d
Datasets - UCI Machine Learning Repository
📊 Vector Databases
archive.ics.uci.edu · 15h
Embeddings & Vector Search
🎯 Vector Search
taoofmac.com · 17h
AI scouts for journalists
📰 RSS Reading Practices
cojournalist.ai · 17h
Show HN: WhiskeySour – A 10x faster drop-in replacement for BeautifulSoup
📖 Readability Algorithms
news.ycombinator.com · 8h · Hacker News
Machine learning and digital pragmatics: Which word category influences emoji use most?
🔤 Tokenization
arxiv.org · 1d