| Instructors | Danqi Chen (danqic AT cs.princeton.edu) and Sanjeev Arora (arora AT cs.princeton.edu) |
| Teaching assistants | Adithya Bhaskar (adithyab AT princeton.edu) and Tyler Zhu (tylerzhu AT princeton.edu) |
| Lectures | Monday/Wednesday 10:30-11:50am |
| Location | CS Building 105 |
| Office hours | Danqi’s office hour: Tuesday 10-11, COS 412 (by appointment) Sanjeev’s office hour: Wednesday 4-5pm, COS 407 Adithya’s office hour: Thursday 3-4pm, Friend 010B Tyler’s office hour: Monday 4-5pm, Friend 010C |
| Feedback form | https://forms.gle/vUD1RieC1YcBSugw7 |
We will use a Slack team for most communications this semester. You will be added to the Slack team after the first week. If you join the class late, just email us, and we’ll add you. Once you’re on Slack, we prefer Slack messages over emails for all logistical questions. We also encourage students to use Slack for discussions related to lecture content and projects.
Large language models (LLMs) have revolutionized natural language processing by enabling machines to generate, understand, and interact with human language in more sophisticated ways than ever before. Beyond technical advancements, LLMs are shaping societal interactions with technology, from enhancing accessibility for underserved communities to transforming education, healthcare, and creative industries. This course aims to provide a rigorous survey of current LLM research, including model architecture, data preparation, pre-training, post-training, alignment, and model deployment. The course focuses on conceptual understanding and research rather than engineering, and it is expected to be highly interactive. Students are expected to read cutting-edge research papers regularly, participate in class discussion, and also complete a major project (in groups of 2-3) at the end, for which computational resources will be arranged.
Prerequisites: COS484 or equivalent background (i.e., familiarity with fundamentals of deep learning/machine learning, Transformers, PyTorch). Open to all graduate students. Undergraduates need instructors’ permission.
Course structure
- Class participation (30%): In each class, we will cover 1-2 papers (see "required reading" in the schedule). You are required to read these papers in depth beforehand and answer a pre-lecture question form before class (a Google form is linked in the schedule). Responses are due at 11:59pm on the day before the lecture. Some questions are designed to test your understanding of the reading materials; others are open-ended and prompt you to read the paper critically and write down your thoughts. This counts towards class participation: we will not grade for correctness, but we expect you to do the reading and submit reasonable answers.
- Debate (15%): We will schedule 12 debate panels from Week 4 to Week 9. Each panel consists of 5 students and lasts 30 minutes (those lectures will be shortened to 50 minutes). Each panel focuses on one research paper (or two) related to the topics taught so far, and follows this structure:
  - Each panel is composed of 1 presenter, 2 critics, and 2 proponents.
  - The presenter starts with a short presentation (8 minutes) of the paper.
  - The 2 critics then critique the paper, similar to how reviewers assess conference papers: highlighting limitations, weaknesses, and any claims that are not well supported by the experiments.
  - The 2 proponents explain why they believe the issues raised do not exist or are not serious.
  - There will be multiple rounds of interaction. Critics are asked to send their major criticisms to the proponents at least 2 days before the lecture, so the proponents have time to research and prepare their responses.
  - Afterwards, the group will write and submit a 2-page summary of the debate.
- Lecture scribing (10%): For each lecture, we will ask 3 students to scribe the lecture, covering the technical content and the research questions discussed.
  - You can find the Overleaf scribe template here. Make a shared copy among all the scribes for a given lecture; it is up to you to divide the work equally. Send your completed Overleaf link and PDF to Adithya and Tyler on Slack by 11:59pm three days after the lecture: for Monday lectures, this is 11:59pm on Thursday; for Wednesday lectures, 11:59pm on Saturday.
  - Please do not add the four course staff to the Overleaf project; instead, share the editable link with Adithya and Tyler.
  - New to the template is a contributions section; please fill it out when you submit, with an overview of how the work was split among the scribes.
- Final project (35% + 10% for presentations): At the end of the semester, everyone is required to do a class project related to modern LLMs and submit a final paper, working in teams of 2 or 3. Everyone is required to submit a proposal to Gradescope by Oct 13th (Sunday) 11:59pm, and the final paper on the Dean’s Date (Dec 13th, 11:59pm). In-class project presentations will be scheduled in the last three lectures. The template for the final report is here; feel free to use it for the proposal as well, but you can also use any template you like.
Schedule
| Date | Instructor | Topic/required reading | Recommended reading | Reading response | Panel discussion | Scribes |
|---|---|---|---|---|---|---|
| Sep 4 (Wed) | Sanjeev | Introduction [slides] | | | N/A | |
| Sep 9 (Mon) | Danqi | Pretraining 1 [slides] + Language Models are Few-Shot Learners (GPT-3) | + Transformers + The Annotated Transformer + GPT-2 + BERT + What happened to BERT & T5? (Yi Tay) | [link] | N/A | + Yinghui He + Haichen Dong + Brendan Y. Wang |
| Sep 11 (Wed) | Danqi | Pretraining 2 [slides] + Language Models are Few-Shot Learners (cont’d) + The Llama 3 Herd of Models , Sections 1-2, Section 3.1-3.2, 3.4, and 5.1 | + Mistral 7B + OLMo + Qwen2 + Data annealing (Databricks) | [link] | N/A | + Jiaxin Xiao + Dillon Lue + Ziyu Xiong |
| Sep 16 (Mon) | Sanjeev | Scaling laws [slides] + Training Compute-Optimal Large Language Models (Chinchilla) + Scaling Data-Constrained Language Models | + Scaling Laws for Neural Language Models + Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws | [link] | N/A | + Wuwei Zhang + Simran Kaur + Keerthana Nallamotu |
| Sep 18 (Wed) | Sanjeev | Emergent behavior [slides] + Emergent Abilities of Large Language Models + A Theory for Emergence of Complex Skills in Language Models, Sections 1-3 and 6-8. No need to understand the math. | + Wikipedia entry on Emergence + Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models | [link] | N/A | + Erich Liang + Heyu Guo + Benedikt P. Stroebl |
| Sep 23 (Mon) | Danqi | Data curation [slides] + Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research | + FineWeb + RefinedWeb + DataComp-LM + QuRating | [link] | Paper: Phi-1.5 "More data or better data?" Presenter: Victor Chu Critics:+ Erich Liang + Tanvi Namjoshi Proponents:+ Simran Kaur + Tedi Zadouri | + Sijia Liu + Iain D. Campbell + Elizabeth A. Mieczkowski |
| Sep 25 (Wed) | Danqi | Post-training: Instruction tuning [slides]+ Scaling Instruction-Finetuned Language Models | + FLAN + The Flan Collection + Tülu + Tulu 2 + LESS + Sebastian Ruder’s blog posts: [1][2] | [link] | Paper: Schaeffer et al 2023 "Are emergent abilities a mirage?" Presenter: Mingqian Xue Critics:+ Lekang Yuan + Heyu Guo Proponents:+ Qishuo Yin + Lihan Zha | + Jane E. Castleman + Kylie Zhang + Yingqing Guo |
| Sep 30 (Mon) | Danqi | Post-training: learning from preferences [slides]+ Training language models to follow instructions with human feedback + Direct Preference Optimization: Your Language Model is Secretly a Reward Model | + Unpacking DPO and PPO + Llama 3 , Section 4 + SimPO | [link] | Paper: Scaling Laws for Data Filtering Presenter: Tamjeed Azad Critics:+ Elizabeth A. Mieczkowski + Nimra Nadeem Proponents:+ Iain D. Campbell + Zhicheng Zheng | + Kincaid MacDonald + Amey P. Pasarkar + Nobline Yoo |
| Oct 2 (Wed) | Sanjeev | Alignment [slides]+ A General Language Assistant as a Laboratory for Alignment | + The RL probabilist blog on forward and reverse KL | [link] | Paper: LIMA: Less Is More for Alignment Presenter: Mahsa Bastankhah Critics:+ Niusha Moshrefi + Zeyu Shen Proponents:+ Jiaxin Xiao + Wuwei Zhang | + Nimra Nadeem + Stanley Wei + Cyrus Vachha |
| Oct 7 (Mon) | Sanjeev | Constitutional AI [slides]+ Constitutional AI: Harmlessness from AI Feedback | + HHH Dataset (just look at some examples) | [link] | Paper: Is DPO Superior to PPO for LLM Alignment? Presenter: Boyi Wei Critics:+ Xingyu Zhu + Cyrus Vachha Proponents:+ Benedikt P. Stroebl + Kincaid MacDonald | + Juhyun Park + Wentao Guo + Mahsa Bastankhah |
| Oct 9 (Wed) | Sanjeev | LLM Metacognition [slides]+ Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving | + AI-Assisted Generation of Difficult Math Questions + Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning | [link] | Paper: Inverse Constitutional AI: Compressing Preferences into Principles Presenter: Zixuan Wang Critics:+ Rafael Pastrana Jimenez + Dillon Lue Proponents:+ Sreemanti Dey + Jane E. Castleman | + Zixuan Wang + Mingqian Xue |
| Oct 21 (Mon) | Tianyu Gao | Long-context models [slides]+ How to Train Long-Context Language Models (Effectively) + RoFormer: Enhanced Transformer with Rotary Position Embedding | + A Controlled Study on Long Context Extension and Generalization in LLMs + RULER + Effective Long-Context Scaling of Foundation Models + Data Engineering for Scaling Language Models to 128K Context + StreamingLLM | [link] | Paper: Language Models (Mostly) Know What They Know Presenter: Arin J. Mukherjee Critics:+ Seth Karten + Veniamin Veselovskyy Proponents:+ Yuka Shu + Keerthana Nallamotu | + Victor Chu + Yijun Yin + Lihan Zha |
| Oct 23 (Wed) | Sanjeev | Advanced topics in alignment [slides]+ OpenAI o1 System Card (skim this and note anything interesting) + Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (Read through section 4.2 + skim the rest) | The AI through debate blog post and interview. | [link] | Paper: The Impact of Positional Encoding on Length Generalization in Transformers Presenter: Ambri Ma Critics:+ Colin Wang + Jiahao Qiu Proponents:+ Brendan Y. Wang + David B. Braun | + Zeyu Shen + Tedi Zadouri + Lekang Yuan |
| Oct 28 (Mon) | Danqi & Sanjeev | LLM Reasoning 1 [slides](https://princeton-cos597r.github.io/lectures/lec14-Inference%20Time%20Compute.pdf) + Let’s Verify Step by Step + Improve Mathematical Reasoning in Language Models by Automated Process Supervision | + Common 7B Language Models Already Possess Strong Math Capabilities + Math-Shepherd | [link] | Paper: Transcendence: Generative Models Can Outperform The Experts That Train Them Presenter: Jiayi Zhang Critics:+ Catherine Cheng + Juhyun Park Proponents:+ Wentao Guo + Sijia Liu | + Niusha Moshrefi + Zhicheng Zheng + Wenzhe Li |
| Oct 30 (Wed) | Danqi | LLM Reasoning 2 [slides] + Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | + Large Language Monkeys + Inference Scaling Laws + STaR + DeepSeekMath | [link] | Paper: Stream of Search (SoS): Learning to Search in Language Presenter: Constantin Schesch Critics:+ Yinghui He + Yijun Yin Proponents:+ Haichen Dong + Amey P. Pasarkar | + Creston A. Brooks + Jiayi Zhang + Qishuo Yin |
| Nov 4 (Mon) | Mengzhou Xia | Small models [slides]+ Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning + Gemma 2: Improving Open Language Models at a Practical Size | + MiniCPM + Llama 3.2 blog post + OpenELM + Mojan Javaheripi: The Surprising Power of Small Language Models + LLM Pruning and Distillation in Practice: The Minitron Approach | [link] | Paper: Information-Theoretic Distillation for Reference-less Summarization Presenter: Ziyu Xiong Critics:+ Nobline Yoo + Creston A. Brooks Proponents:+ Stanley Wei + Lucy He | + David B. Braun + Boyi Wei + Arin J. Mukherjee |
| Nov 6 (Wed) | Danqi | Retrieval-augmented LMs + Improving language models by retrieving from trillions of tokens | | [link] | Paper: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning Presenter: Alexandre Kirchmeyer Critics:+ Wenzhe Li + Kylie Zhang Proponents:+ Yingqing Guo + Joie Y. Zhang | + Sreemanti Dey + Xingyu Zhu + Colin Wang |
| Nov 11 (Mon) | Yu Su (OSU) | A Holistic and Critical Look at Language Agents [slides] | + Language agents: a critical evolutionary step of artificial intelligence + HippoRAG + LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error | | N/A | + Alexandre Kirchmeyer + Lucy He + Jiahao Qiu |
| Nov 13 (Wed) | Danqi | Retrieval-augmented language models [slides] + Improving language models by retrieving from trillions of tokens | + ACL 2023 tutorial + REALM + kNN-LM + TRIME + REPLUG + FLARE + Self-RAG | | N/A | + Veniamin Veselovskyy + Tanvi Namjoshi + Ambri Ma |
| Nov 18 (Mon) | Tri Dao | Hardware-aware Algorithms for Language Modeling | + FlashAttention + Mamba | | N/A | + Tamjeed Azad + Seth Karten + Catherine Cheng |
| Nov 20 (Wed) | Saining Xie (NYU) | Language Models Need Better Visual Grounding for Meaning and Understanding [slides] | + LLaVA + Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs + Cambrian-1 + Molmo and PixMo (AI2) + MM1 (Apple) | | N/A | + Constantin Schesch + Yuka Shu + Joie Y. Zhang |
| Nov 25 (Mon) | Students | Project presentations | | | N/A | |
| Dec 2 (Mon) | Students | Project presentations | | | N/A | |
| Dec 4 (Wed) | Students | Project presentations | | | N/A | |