Scour
🎮 Reinforcement Learning
Q-Learning, Policy Gradient, Reward Systems, Game AI, Robotics
Scoured 5540 posts in 19.0 ms
Policy Gradient Methods for Non-Markovian Reinforcement Learning
⚙ Context engineering · arxiv.org · 4d
rl for red teaming: training models to attack and defend themselves
⚙ Context engineering · castform.com · 1d · Hacker News
'Try, Score, Change': Reinforcement Learning for Children
⚙ Context engineering · gwern.net · 5d · Hacker News
Learning, Fast and Slow: LLMs That Adapt Continually
⚙ Context engineering · gepa-ai.github.io · 5d · Hacker News
Meta's Hyperagents and Self-Correcting Agents
⚙ Context engineering · jdsemrau.substack.com · 4d · Substack
SFT, RL, and On-Policy Distillation Through a Distributional Lens (19 minute read)
🧪 Property-based Testing · nrehiew.github.io · 5d · Hacker News
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
⚙ Context engineering · arxiv.org · 2d
Self-Distilled Agentic Reinforcement Learning
⚙ Context engineering · arxiv.org · 1d
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
⚙ Context engineering · arxiv.org · 2d
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
⚙ Context engineering · arxiv.org · 1d
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
⚙ Context engineering · arxiv.org · 2d
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
⚙ Context engineering · arxiv.org · 2d
GAGPO: Generalized Advantage Grouped Policy Optimization
⚙ Context engineering · arxiv.org · 2d
Skill-R1: Agent Skill Evolution via Reinforcement Learning
⚙ Context engineering · arxiv.org · 4d
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
⚙ Context engineering · arxiv.org · 1d
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
🎯 Reranking · arxiv.org · 2d
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
⚙ Context engineering · arxiv.org · 5d
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
⚙ Context engineering · arxiv.org · 1d
Reinforcement Learning Measurement Model
⚙ Context engineering · arxiv.org · 4d
ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation
🤝 Multi-Agent Systems · arxiv.org · 2d