Web Datasets

Publishers push Common Crawl to stop collecting content for AI training

 🔗Interoperability

US Publishers Demand Common Crawl Stop Scraping Their Content via @sejournal, @MattGSouthern

 🎆Year End
searchenginejournal.com·

Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"

 📰Content Curation
the-decoder.com·

Common Crawl Foundation at IIPC-WAC 2026

 🏛️Internet Archive  Content type: Blog
commoncrawl.org·

mikinko/HuggingFace_WFX: Total Commander WFX plugin for HuggingFace repos

 🔀JJ  Content type: Code

Testarvette: Ferrari Testarossa Replica

 🚩CTF Writeups
barnfinds.com·

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

 🤖AI  Content type: Academic
arxiv.org··Hacker News

Pythia 1.4B reproduces 3.6% of training samples verbatim given 950-token prompts

 Fast AI Inference  Content type: Blog
ret2libc.com··Hacker News

US publishers tell Common Crawl to stop scraping and delete archive

 🔗Interoperability

Google’s DiffusionGemma is 4x faster than its other Gemma models

 🤖AI
thenewstack.io·

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models. The model delivers 300 tokens per second on benchmar...

 Fast AI Inference
digg.com·

know the products now; snap up deals later

 🎯Recommendation Metrics
techradar.com·

Email ownership, I give up.

 🧹Spam Filters  Content type: Discussion
lemmy.world·

Three sleep intervals for three APIs: Steam 250ms, GitHub 100ms, HuggingFace none

 🔀JJ  Content type: Reference
docs.github.com··DEV

Enshittification Merch That Actually Fights Enshittification

 🎨Graphic Design
eff.org·

My life as a human pincushion continues (Day 17, post-surgery)

 🎆Year End
creolened.com·

Tejas-TA/predikit: The missing bridge between your ML models and your AI agents.

 🔧Agent Tooling  Content type: Code
github.com··Hacker News

nex-agi/Nex-N2-mini • Huggingface

 🏗️LLM Infrastructure

A Few Good Things - Vol. 22

 🍜Umami
brandons-journal.com·

SafeRun: Enabling Determinism in LLM Planning for Running

 🏆LLM Benchmarking  Content type: Academic
arxiv.org·
