1:00 PM ET
10/23/2025
LION: Linear Attention for Efficient Bidirectional Sequence Modeling
2:00 PM ET
10/22/2025
Causal Attention with Lookahead Keys
4:00 PM ET
10/16/2025
Making orthonormal updates more scalable
2:00 PM ET
10/14/2025
From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
4:30 PM ET
10/10/2025
Pre-training under infinite compute
11:00 AM ET
10/8/2025
Muon Outperforms Adam in Tail-End Associative Memory Learning
2:00 PM ET
9/30/2025
Parallelizing "Inherently Sequential" Processes: Parallel Newton methods for nonlinear state space models
2:00 PM ET
9/16/2025
Cartridges: lightweight and general-purpose language model memory via self-study
6:00 PM ET
9/10/2025
Fantastic Pretraining Optimizers and Where to Find Them
11:00 AM ET
9/5/2025
Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
10:00 PM ET
8/27/2025
Diffusion Language Models are Super Data Learners
2:00 PM ET
8/26/2025
Diffusion Beats Autoregressive in Data-Constrained Settings
4:00 PM ET
8/21/2025
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
2:00 PM ET
8/13/2025
pLSTM: parallelizable Linear Source Transition Mark networks
5:15 PM ET
8/5/2025
Helion: A high-level DSL for ML kernels
2:00 PM ET
8/4/2025
Overflow Prevention Enhances Long-Context Recurrent Models
2:00 PM ET
7/29/2025
Scaling Context Requires Rethinking Attention
2:00 PM ET
7/24/2025
Fast and Simplex: 2-Simplicial Attention in Triton
2:00 PM ET
7/22/2025
On the Transformer-SSM Gap (And the Role of the Gather-and-Aggregate Mechanism)
2:00 PM ET
7/1/2025
Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
12:00 PM ET
6/27/2025
DeltaFormer: breaking the expressivity of Transformer with delta rule
12:00 PM ET
6/26/2025
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
2:00 PM ET
6/24/2025
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
10:00 PM ET
6/18/2025
Scaling Test-Time Compute of LLMs and PRMs for Mathematical Reasoning
2:00 PM ET
6/18/2025
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
2:00 PM ET
6/9/2025
Test-Time Training Done Right
2:00 PM ET
6/5/2025
Your Next-Token Prediction and Transformers Are Biased for Long-Context Modeling
10:00 PM ET
6/3/2025
AI for the open-world: the learning principles
2:00 PM ET
5/28/2025
PENCIL: Long Thoughts with Short Memory
3:00 PM ET
5/22/2025
When Attention Sink Emerges in Language Models: An Empirical View
4:00 PM ET
5/21/2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
2:00 PM ET
5/6/2025
Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
10:30 AM ET
4/23/2025
Theoretical benefit and limitation of diffusion language models
2:00 PM ET
4/22/2025
EvaByte: Efficient Byte-level Language Models at Scale
11:00 AM ET
4/17/2025
Remasking Discrete Diffusion Models with Inference-Time Scaling
2:00 PM ET
4/17/2025
Titans: Learning to Memorize at Test Time
2:00 PM ET
4/10/2025
Implicit Language Models are RNNs: Balancing Parallelization and Expressivity
2:00 PM ET
4/9/2025
Hymba: Hybrid Heads, Meta Tokens, and Training of SoTA models
10:00 PM ET
3/27/2025
MoBA: Mixture of Block Attention for Long-Context LLMs
2:00 PM ET
3/20/2025
Forgetting Transformer: Softmax Attention with a Forget Gate
1:00 PM ET
3/18/2025
What's so interesting about models with recurrent depth?
1:30 PM ET
3/12/2025
B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory
1:30 PM ET
3/5/2025
State Tracking in Scalable Linear RNNs
1:30 PM ET
3/3/2025
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
4:00 PM ET
2/24/2025
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
1:30 PM ET
2/19/2025
Test-time regression: a unifying framework for designing sequence models with associative memory