CMU researchers are presenting 156 papers at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), held December 2–7 at the San Diego Convention Center. Here is a quick overview of the areas our researchers are working on:
Here are our most frequent collaborator institutions:
Oral Papers
Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain
Authors: Trinity Chung (Carnegie Mellon University), Yuchen Shen (Carnegie Mellon University), Nathan Kong (MIT), Aran Nayebi (Carnegie Mellon University)
This paper introduces an Encoder–Attender–Decoder (EAD) framework to study task-optimized neural networks for tactile processing using realistic whisker-based simulations. Convolutional recurrent neural networks (ConvRNNs) emerge as the most effective encoders, both for tactile categorization and for producing representations that closely match activity in rodent somatosensory cortex, revealing a linear link between task performance and neural alignment. Notably, self-supervised contrastive ConvRNN models achieve neural fits comparable to supervised training, indicating that label-free learning can capture biologically relevant tactile representations. These findings highlight the importance of recurrent processing for understanding cortical tactile computation and for building robust embodied AI systems.
MaxSup: Overcoming Representation Collapse in Label Smoothing
Authors: Yuxuan Zhou (CISPA Helmholtz Center for Information Security), Heng Li (Carnegie Mellon University), Zhi-Qi Cheng (University of Washington), Xudong Yan (City University of Macao), Yifei Dong (Carnegie Mellon University), Mario Fritz (CISPA Helmholtz Center for Information Security), Margret Keuper (University of Mannheim)
Label Smoothing is commonly used to reduce overconfidence and improve generalization, but it can paradoxically increase confidence in misclassified samples and collapse feature representations. This work analytically decomposes the LS loss, revealing an error-amplification term that strengthens incorrect predictions and drives representation collapse. To overcome this, the authors propose Max Suppression (MaxSup), which regularizes predictions uniformly by penalizing the top-1 logit instead of the ground-truth logit. Experiments show that MaxSup preserves intra-class diversity, improves class separation, and consistently outperforms LS across large-scale classification and downstream tasks.
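The "penalize the top-1 logit instead of the ground-truth logit" idea is simple enough to sketch. Below is a minimal, illustrative NumPy version (not the paper's released code) assuming the regularizer takes the form of a penalty on the gap between a chosen logit and the mean logit; the contrast function shows the Label-Smoothing-style analogue that targets the ground-truth logit instead.

```python
import numpy as np

def softmax_ce(logits, target):
    # numerically stable cross-entropy for a single example
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def maxsup_loss(logits, target, alpha=0.1):
    # MaxSup-style sketch: penalize the top-1 logit's gap to the mean
    # logit, regardless of which class the model currently predicts.
    penalty = logits.max() - logits.mean()
    return softmax_ce(logits, target) + alpha * penalty

def label_smoothing_style_loss(logits, target, alpha=0.1):
    # For contrast: the analogous penalty on the ground-truth logit,
    # which (per the paper) amplifies errors on misclassified samples.
    penalty = logits[target] - logits.mean()
    return softmax_ce(logits, target) + alpha * penalty
```

On a misclassified example (top-1 ≠ target), the MaxSup-style term penalizes the wrongly confident logit, while the ground-truth-targeted term does not, which is the error-amplification behavior the paper identifies.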
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Authors: Liwei Jiang (University of Washington), Yuanjun Chai (University of Washington), Margaret Li (University of Washington), Mickel Liu (University of Washington), Raymond Fok (University of Washington), Nouha Dziri (Allen Institute for AI), Yulia Tsvetkov (University of Washington), Maarten Sap (Carnegie Mellon University), Yejin Choi (Stanford University / NVIDIA)
This paper introduces INFINITY-CHAT, a large-scale dataset of 26,000 diverse open-ended user queries and a comprehensive taxonomy of prompt types to evaluate creativity and diversity in language model outputs. Using this resource, the authors identify a pronounced “Artificial Hivemind” effect marked by both repetitive responses within a single model and striking similarities across different models. The dataset also includes over 31,000 human annotations enabling analysis of collective and individual preferences. Results show that existing models and evaluation methods are poorly calibrated to idiosyncratic human judgments, highlighting risks of homogenized AI outputs.
Mean Flows for One-step Generative Modeling
Authors: Zhengyang Geng (Carnegie Mellon University), Mingyang Deng (Massachusetts Institute of Technology), Xingjian Bai (Massachusetts Institute of Technology), Zico Kolter (Carnegie Mellon University), Kaiming He (Massachusetts Institute of Technology)
The authors introduce MeanFlow, a principled one-step generative modeling framework based on average velocity rather than the instantaneous velocity used in prior flow-matching methods. They derive a formal identity linking average and instantaneous velocities to guide neural network training in a self-contained approach that requires no pretraining, distillation, or curriculum learning. MeanFlow achieves strong results, including an FID of 3.43 on ImageNet 256×256 with a single function evaluation, outperforming previous one-step models. These results substantially narrow the performance gap between one-step and multi-step diffusion and flow-based methods.
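The identity in question can be written out: defining the average velocity as the time average of the instantaneous velocity $v$ and differentiating both sides with respect to $t$ yields the self-contained training target (a reconstruction in standard notation, not a quotation from the paper):

```latex
% Average velocity over the interval [r, t]:
u(z_t, r, t) \;\triangleq\; \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau
% Differentiating (t-r)\,u = \int_r^t v\,d\tau with respect to t gives
% the MeanFlow identity relating average and instantaneous velocity:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t-r)\,\frac{d}{dt}u(z_t, r, t),
\qquad \frac{d}{dt}u \;=\; v\,\partial_z u \;+\; \partial_t u
```

A network parameterizing $u$ can then be trained against this identity directly, which is why no pretraining or distillation stage is needed.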
Spotlight Papers
OpenCUA: Open Foundations for Computer-Use Agents
Authors: Xinyuan Wang (The University of Hong Kong), Bowen Wang (The University of Hong Kong), Dunjie Lu (Sun Yat-sen University), Junlin Yang (Tsinghua University), Tianbao Xie (The University of Hong Kong), Junli Wang (Alibaba Group), Jiaqi Deng (The University of Hong Kong), Xiaole Guo (The University of Hong Kong), Yiheng Xu (The University of Hong Kong), Chen Wu (Carnegie Mellon University), Zhennan Shen (Shanghai Jiao Tong University), Zhuokai Li (The University of Hong Kong), Ryan Li (Stanford University), Xiaochuan Li (Tsinghua University), Junda Chen (Harbin Institute of Technology), Boyuan Zheng (The University of Hong Kong), Peihang Li (The University of Hong Kong), Fangyu Lei (Institute of Automation, Chinese Academy of Sciences), Ruisheng Cao (Shanghai Jiao Tong University), Yeqiao Fu (The University of Hong Kong), Dongchan Shin (The University of Hong Kong), Martin Shin (The University of Hong Kong), Jiarui Hu (The University of Hong Kong), Yuyan Wang (Johns Hopkins University), Jixuan Chen (University of California, San Diego), Yuxiao Ye (The Hong Kong University of Science and Technology), Danyang Zhang (Shanghai Jiao Tong University), Yipu Wang (Institute of Automation, Chinese Academy of Sciences), Heng Wang (University of Illinois Urbana-Champaign), Diyi Yang (Stanford University), Victor Zhong (University of Waterloo), Y. Charles (Moonshot AI), Zhilin Yang (Tsinghua University), Tao Yu (The University of Hong Kong)
This paper introduces OpenCUA, an open-source framework designed to enable transparent research into computer-use agents built with vision–language models. The framework includes an annotation system for collecting human demonstrations, AgentNet, a large-scale dataset spanning three operating systems and 200+ applications, and a scalable pipeline that converts demonstrations into state–action data with reflective chain-of-thought reasoning. End-to-end agent models trained with OpenCUA show strong benchmark performance, with OpenCUA-72B achieving a 45.0% success rate on OSWorld-Verified, setting a new open-source state of the art.
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
Authors: Jiatong Shi (Carnegie Mellon University), Yifan Cheng (Huazhong University of Science and Technology), Bo-Hao Su (Carnegie Mellon University), Hye-jin Shim (Carnegie Mellon University), Jinchuan Tian (Carnegie Mellon University), Samuele Cornell (Università Politecnica delle Marche), Yiwen Zhao (School of Computer Science, Carnegie Mellon University), Siddhant Arora (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)
This work presents ARECHO, an autoregressive chain-based framework for jointly evaluating multiple speech quality metrics such as PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score), which traditionally differ in scale and assumptions. ARECHO introduces a comprehensive tokenization pipeline, a dynamic classifier chain to model inter-metric dependencies, and a confidence-oriented two-step decoding scheme to improve inference reliability. Experiments show that ARECHO consistently outperforms baseline methods across speech enhancement, generation evaluation, and noisy-speech scenarios. The approach also improves interpretability and flexibility by enabling reference-free evaluation and subset metric queries.
UMA: A Family of Universal Models for Atoms
Authors: Brandon Wood (FAIR at Meta), Misko Dzamba (Meta), Xiang Fu (Periodic Labs), Meng Gao (Meta), Muhammed Shuaibi (FAIR at Meta), Luis Barroso-Luque (Meta), Kareem Abdelmaqsoud (Carnegie Mellon University), Vahe Gharakhanyan (Meta), John Kitchin (Carnegie Mellon University), Daniel Levine (FAIR at Meta), Kyle Michel (Meta), Anuroop Sriram (FAIR at Meta), Taco Cohen (FAIR at Meta), Abhishek Das (FAIR at Meta), Sushree Sahoo (Meta), Ammar Rizvi (Meta), Zachary Ulissi (FAIR at Meta), Larry Zitnick (FAIR at Meta)
This paper introduces Universal Models for Atoms (UMA), a family of large-scale models designed to rapidly and accurately predict properties from atomic simulations across chemistry and materials science. Trained on over 500 million unique 3D atomic structures spanning molecules, materials, and catalysts, UMA leverages empirical scaling laws and a novel mixture-of-linear-experts architecture to increase capacity without sacrificing speed. Evaluations show that a single UMA model, without fine-tuning, matches or outperforms specialized models across diverse applications.
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Authors: Arnav Kumar Jain (Université de Montréal), Vibhakar Mohta (Nuro Inc.), Subin Kim (Korea Advanced Institute of Science & Technology), Atiksh Bhardwaj (Cornell University), Juntao Ren (Stanford University), Yunhai Feng (Cornell University), Sanjiban Choudhury (Cornell University), Gokul Swamy (Carnegie Mellon University)
This work addresses a key limitation of behavioral cloning (BC) in imitation learning: BC only teaches an agent to mimic expert actions at states the expert visited, leaving it unable to recover from mistakes. To overcome this, the authors propose SAILOR, which leverages learning to search (L2S) by training a world model and a reward model to plan and recover toward expert outcomes even after errors. SAILOR achieves stable and sample-efficient learning without additional human corrections and consistently outperforms state-of-the-art diffusion-policy BC methods across visual manipulation benchmarks. It also demonstrates robustness to nuanced failures and reward hacking, and the performance gap persists even when BC is trained with 5–10x more demonstrations.
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Authors: Jiajun Shi (Beijing University of Aeronautics and Astronautics), Jian Yang (Alibaba Group), Jiaheng Liu (Nanjing University), Xingyuan Bu (Alibaba Group), Jiangjie Chen (ByteDance Seed), Junting Zhou (Peking University), Kaijing Ma (Tongji University), Zhoufutu Wen (ByteDance), Bingli Wang (Sichuan Agricultural University), Yancheng He (Alibaba Group), Liang Song (M-A-P), Hualei Zhu (Beijing University of Aeronautics and Astronautics), Shilong Li (Beijing University of Posts and Telecommunications), Xingjian Wang (Shanghai University of Electric Power), Wei Zhang (Beijing University of Aeronautics and Astronautics), Ruibin Yuan (Carnegie Mellon University), Yifan Yao (Beijing University of Posts and Telecommunications), Wenjun Yang (University College London), Yunli Wang (Kuaishou Technology), Siyuan Fang (Beijing University of Posts and Telecommunications), Siyu Yuan (Fudan University), Qianyu He (Fudan University), Robert Tang (Yale University), Yingshui Tan (Alibaba Group), Wangchunshu Zhou (OPPO), Zhao-Xiang Zhang (Chinese Academy of Sciences), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)
The authors introduce KORGym, a dynamic evaluation platform designed to comprehensively assess the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). Unlike existing domain-specific benchmarks, KORGym offers over 50 interactive games in textual and visual formats, including multi-turn and reinforcement learning scenarios. Experiments on 19 LLMs and 8 VLMs reveal consistent reasoning patterns within model families and highlight the superior performance of closed-source models. The platform also enables analysis of factors such as modality, reasoning strategies, reinforcement learning approaches, and response length, providing a robust tool for advancing reasoning evaluation in complex environments.
Towards Understanding Camera Motions in Any Video
Authors: Zhiqiu Lin (Carnegie Mellon University), Siyuan Cen (University of Massachusetts at Amherst), Daniel Jiang (Carnegie Mellon University), Jay Karhade (Carnegie Mellon University), Hewei Wang (Carnegie Mellon University), Chancharik Mitra (Carnegie Mellon University), Yu Tong Tiffany Ling (Carnegie Mellon University), Yuhan Huang (Carnegie Mellon University), Rushikesh Zawar (Carnegie Mellon University), Xue Bai (Adobe Systems), Yilun Du (Google DeepMind / Harvard University), Chuang Gan (IBM), Deva Ramanan (Carnegie Mellon University)
This work presents CameraBench, a large-scale dataset and benchmark for evaluating camera motion understanding, comprising roughly 3,000 diverse videos annotated through a rigorous expert-driven process. A key contribution is a taxonomy of camera motion primitives, developed with cinematographers, which captures motions that require both geometric and semantic understanding. Human studies show that domain expertise and targeted training significantly improve motion recognition, such as distinguishing zoom from forward translation. Evaluations reveal that Structure-from-Motion models struggle with semantic motions, while generative video-language models struggle with geometric ones, and fine-tuning a generative VLM on CameraBench enables strong performance across motion-augmented captioning, video QA, and video-text retrieval tasks.
Enhancing Training Data Attribution with Representational Optimization
Authors: Weiwei Sun (Carnegie Mellon University), Haokun Liu (University of Toronto), Nikhil Kandpal (University of Toronto), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Yiming Yang (Carnegie Mellon University)
This paper presents AirRep, a scalable representation-based method for training data attribution (TDA) that learns task-specific, model-aligned representations optimized for measuring how training data affects model predictions. AirRep features a trainable encoder for attribution quality and an attention-based pooling mechanism to estimate group-wise influence accurately. Trained using a ranking objective over subsets labeled by their empirical effect, AirRep matches the performance of gradient-based methods like influence functions while being nearly 100× more efficient at inference.
Checklists Are Better Than Reward Models For Aligning Language Models
Authors: Vijay Viswanathan (Carnegie Mellon University), Yanchao Sun (University of Maryland, College Park), Xiang Kong (Apple), Meng Cao (Apple), Graham Neubig (Carnegie Mellon University), Sherry Wu (Carnegie Mellon University)
This work introduces Reinforcement Learning from Checklist Feedback (RLCF), a method for improving instruction-following in language models using flexible, instruction-specific criteria rather than fixed metrics like helpfulness or harmfulness. RLCF extracts checklists from instructions and evaluates responses against each item using AI judges and verifier programs to compute rewards for reinforcement learning. Applied to models like Qwen2.5-7B-Instruct, RLCF improves performance across five benchmarks, achieving notable gains in hard satisfaction rates and win rates, and can also enhance other models off-policy, such as Llama 3.1 8B Instruct and OLMo 2 7B Instruct. The authors release their WildChecklists dataset, models, and code to support further research in flexible instruction alignment.
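The reward computation described above — score a response against each extracted checklist item, then aggregate — can be sketched in a few lines. This is an illustrative reconstruction, not the released RLCF code; `checklist_reward` and the `judge` callable are hypothetical names, and `judge` stands in for either an AI judge or a verifier program returning a score in [0, 1].

```python
def checklist_reward(response, checklist, judge, weights=None):
    # Sketch of checklist-based reward for RL (illustrative, not RLCF's
    # exact implementation). `checklist` holds instruction-specific
    # criteria; `judge(criterion, response)` -> score in [0, 1].
    weights = weights or [1.0] * len(checklist)
    scores = [judge(criterion, response) for criterion in checklist]
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights)  # (weighted) mean satisfaction as reward
```

In training, this scalar would be fed to a standard RL objective in place of a reward-model score; a toy keyword-matching `judge` suffices to exercise the aggregation.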
Extrapolation by Association: Length Generalization Transfer In Transformers
Authors: Ziyang Cai (Princeton University), Nayoung Lee (University of Wisconsin-Madison), Avi Schwarzschild (Carnegie Mellon University), Samet Oymak (University of Michigan – Ann Arbor), Dimitris Papailiopoulos (University of Wisconsin-Madison)
This paper studies length generalization in transformer language models—the ability to handle longer inputs than seen during training—through the concept of task association. The authors show that training on a longer, related auxiliary task can improve generalization to longer inputs on a target task across algorithmic domains like arithmetic, string manipulation, and maze navigation. They find similar transfer effects in pretrained language models, suggesting pretraining provides reusable computational scaffolding. Mechanistic analysis indicates that this length generalization transfer is linked to the reuse of attention heads between tasks, highlighting how transformers leverage compositional inductive structures.
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Authors: Xinyu Yang (Carnegie Mellon University), Yuwei An (Carnegie Mellon University), Hongyi Liu (Carnegie Mellon University), Tianqi Chen (Carnegie Mellon University), Beidi Chen (Carnegie Mellon University / Amazon)
This work introduces Multiverse, a generative model that enables natively parallel generation by internalizing a MapReduce paradigm with Map, Process, and Reduce stages. The approach includes Multiverse Curator for automated data creation, Multiverse Attention for separating parallel reasoning steps, and Multiverse Engine for dynamic sequential-parallel inference. After minimal fine-tuning, Multiverse-32B matches leading autoregressive LLMs in performance while achieving up to 2× speedup and better scaling efficiency. The authors have open-sourced the full Multiverse ecosystem, including models, data, serving systems, and training pipelines.
Thought Communication in Multiagent Collaboration
Authors: Yujia Zheng (Carnegie Mellon University), Zhuokai Zhao (Meta), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yaqi Xie (Carnegie Mellon University), Mingze Gao (Meta), Lizhu Zhang (Meta), Kun Zhang (Carnegie Mellon University & MBZUAI)
This work introduces thought communication, a paradigm for multi-agent interaction that goes beyond natural language by enabling agents to share latent, mind-like representations directly. The authors formalize this process as a latent variable model, proving that both shared and private thoughts, as well as the global structure of thought sharing among agents, can be identified and recovered with theoretical guarantees. They develop a framework that extracts and distributes relevant latent thoughts to agents, enhancing collaboration across modalities. Experiments on synthetic and real-world benchmarks validate the approach, showing that thought communication can unlock collaborative advantages beyond what is possible with surface-level language-based exchanges.
Cost-aware LLM-based Online Dataset Annotation
Authors: Eray Can Elumar (Carnegie Mellon University), Cem Tekin (Bilkent University), Osman Yagan (Carnegie Mellon University)
This paper introduces CaMVo, a method for labeling datasets with large language models (LLMs) while keeping costs low. Instead of querying many LLMs for every example, CaMVo adaptively chooses only a few models based on how confident they are likely to be. It uses ideas from contextual bandits (LinUCB) and a Bayesian confidence estimator to decide which models to query and how to weight their votes—without needing any ground-truth labels. Experiments on MMLU and IMDB show that CaMVo matches or beats full majority voting but with far fewer LLM calls, making it a practical approach for efficient large-scale annotation.
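LinUCB, the contextual-bandit ingredient CaMVo builds on, is standard enough to show in a minimal generic form. This is a sketch of plain LinUCB only; CaMVo's Bayesian confidence estimator, weighted voting, and cost-awareness are not modeled here, and the class/parameter names are our own.

```python
import numpy as np

class LinUCB:
    # Generic disjoint LinUCB (sketch): one ridge-regression reward model
    # per arm, plus an upper-confidence exploration bonus.
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                               # exploration strength
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward sums

    def select(self, x):
        # choose the arm with the highest upper confidence bound on reward
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In an annotation setting, an "arm" could correspond to an LLM (or subset of LLMs), the context to features of the example, and the reward to a confidence-weighted agreement signal, so the bandit learns which models are worth querying for which inputs.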
Conformal Mixed-Integer Constraint Learning with Feasibility Guarantees
Authors: Daniel Ovalle (Carnegie Mellon University), Lorenz Biegler (Carnegie Mellon University), Ignacio Grossmann (Carnegie Mellon University), Carl Laird (Carnegie Mellon University), Mateo Dulce Rubio (Carnegie Mellon University)
The authors introduce C-MICL, a framework for learning constraints in optimization problems while guaranteeing that the resulting solutions remain feasible with high probability. Traditional learned constraints can fail due to model error or limited data, but C-MICL uses conformal prediction to add uncertainty-aware adjustments that ensure feasibility at a user-specified confidence level. The method works for both regression- and classification-based constraint learning and avoids the heavy computational overhead of ensemble approaches. Experiments show that C-MICL reliably meets feasibility targets, preserves strong optimization performance, and is significantly more efficient, offering a principled way to blend machine learning with safe decision-making.
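The "uncertainty-aware adjustment" step can be illustrated with generic split conformal prediction, which is the standard construction in this space. The sketch below is not C-MICL's exact procedure: it simply computes a calibration-set margin so that tightening a learned constraint `g(x) + margin <= b` keeps the true constraint satisfied with probability at least 1 − α (under exchangeability).

```python
import numpy as np

def conformal_margin(cal_residuals, alpha=0.05):
    # Split-conformal margin (generic sketch, not C-MICL's construction).
    # cal_residuals: y_true - y_pred for the learned constraint function
    # on a held-out calibration set. Returns the ceil((n+1)(1-alpha))-th
    # smallest residual, the standard finite-sample conformal quantile.
    n = len(cal_residuals)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(cal_residuals)[min(rank, n) - 1])
```

In a mixed-integer program, this margin enters as a constant offset on the learned constraint, which is what lets feasibility be guaranteed at a user-chosen confidence level without ensembles.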
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
Authors: Gabriele Oliaro (Carnegie Mellon University), Zhihao Jia (Carnegie Mellon University), Daniel Campos (Zipf AI), Aurick Qiao (Snowflake)
The authors present SuffixDecoding, a new speculative decoding method tailored for emerging AI workloads like LLM-based agents, which generate long, repetitive, and predictable sequences. Unlike existing speculative decoding approaches designed for diverse, independent requests, SuffixDecoding uses suffix trees to efficiently cache and reuse long stretches of past tokens from prompts and model outputs. It adaptively adjusts how many tokens to speculate—expanding aggressively when predictions are likely to be accepted and backing off when uncertainty is higher. Experiments on agent-style tasks such as SWE-Bench and Text-to-SQL show that SuffixDecoding can deliver up to 3.9× speedups, making it well suited for fast, iterative agentic inference.
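The core caching idea — find the longest suffix of the tokens generated so far that occurred earlier, and speculate that history will repeat — can be shown with a naive scan. This is an illustrative sketch only: SuffixDecoding itself uses suffix trees over prompts and prior outputs for fast lookup and adaptively sizes the draft, neither of which this toy version does.

```python
def propose_draft(tokens, max_match=32, min_match=2, max_draft=8):
    # Naive suffix-matching speculation (sketch). Find the longest suffix
    # of `tokens` that appeared earlier in the sequence, and propose the
    # tokens that followed that earlier occurrence as the draft.
    n = len(tokens)
    for m in range(min(max_match, n - 1), min_match - 1, -1):
        suffix = tokens[n - m:]
        for start in range(n - m - 1, -1, -1):  # most recent match first
            if tokens[start:start + m] == suffix:
                follow = tokens[start + m:start + m + max_draft]
                if follow:
                    return follow
    return []  # no usable match: fall back to ordinary decoding
```

The target model then verifies the draft in one forward pass and accepts the longest correct prefix, which is where the speedup on repetitive agentic workloads comes from.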
Horizon Reduction Makes RL Scalable
Authors: Seohong Park (UC Berkeley), Kevin Frans (UC Berkeley), Deepinder Mann (UC Berkeley), Benjamin Eysenbach (Princeton University), Aviral Kumar (Carnegie Mellon University), Sergey Levine (UC Berkeley)
This paper examines why offline reinforcement learning (RL) often fails to scale, even when given massive datasets, large models, and ample compute. The authors find that long decision horizons—the number of steps required to propagate rewards—are a key bottleneck that prevents standard offline RL algorithms from improving with more data. Through extensive experiments, they show that reducing the effective horizon dramatically improves scalability and performance on challenging tasks. Building on this insight, they introduce SHARSA, a simple horizon-reduction method that achieves the strongest scaling behavior and best asymptotic performance across their benchmarks.
To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL
Authors: Yuda Song (Carnegie Mellon University), Dhruv Rohatgi (Massachusetts Institute of Technology), Aarti Singh (Carnegie Mellon University), J. Andrew Bagnell (Carnegie Mellon University)
This paper studies when it’s better to distill privileged expert policies—which have access to latent state information during training—versus directly learning from partial observations in reinforcement learning. Using a simple theoretical model (the perturbed Block MDP) and controlled locomotion experiments, the authors show that the trade-off depends strongly on how stochastic the underlying latent dynamics are. When the latent state is easy to infer, distillation works well, but when it is highly stochastic, imitating the latent optimal policy can actually hurt performance. The results provide practical guidance: the best latent policy isn’t always the best one to distill, and deciding when to distill versus directly learning depends on the underlying uncertainty structure of the task.
A Principled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding
Authors: Alexander Goldberg (Carnegie Mellon University), Giulia Fanti (Carnegie Mellon University), Nihar Shah (Carnegie Mellon University)
MERIT is a principled framework for using randomized selection in settings like peer review or grant funding, where evaluations are noisy and uncertainty can make deterministic rankings unreliable. Instead of relying on ad-hoc randomization, MERIT uses interval estimates (e.g., confidence intervals) to model uncertainty and then optimizes for the worst-case expected number of true top-k items selected. The authors develop a polynomial-time algorithm that scales to large datasets and show that MERIT satisfies desirable fairness and robustness properties that existing methods lack. Experiments on synthetic peer-review data show that MERIT matches prior probabilistic methods in expected performance while providing stronger guarantees in worst-case scenarios.
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Authors: Thomas Kuntz (EPFL), Agatha Duzan (EPFL), Hao Zhao (EPFL), Francesco Croce (University of Tübingen), Zico Kolter (Carnegie Mellon University), Nicolas Flammarion (EPFL), Maksym Andriushchenko (ELLIS Institute Tübingen and MPI-IS)
OS-Harm is a benchmark for evaluating the safety of LLM-based computer use agents that interact directly with operating system interfaces. OS-Harm tests agents across three harm categories—deliberate misuse, prompt injection attacks, and model misbehavior—using 150 tasks spanning applications like email, browsers, and code editors. An automated judge evaluates both task performance and safety, achieving strong agreement with human annotations. Evaluations of leading agents reveal that models often comply with unsafe commands, are vulnerable to prompt injections, and sometimes take unsafe actions, highlighting the need for robust safety measures in these systems.
Can We Infer Confidential Properties of Training Data from LLMs?
Authors: Pengrun Huang (University of California, San Diego), Chhavi Yadav (Carnegie Mellon University), Kamalika Chaudhuri (FAIR at Meta and UCSD), Ruihan Wu (University of California, San Diego)
PropInfer is a benchmark designed to evaluate whether large language models (LLMs) can leak sensitive properties of the datasets used for fine-tuning, particularly in domains like healthcare. It tests property inference under both question-answering and chat-completion setups. Two tailored attacks—a prompt-based generation attack and a shadow-model attack leveraging word frequency—are proposed to extract dataset-level information. Empirical results show that these attacks can succeed across multiple pretrained LLMs, revealing an important and previously underexplored privacy risk.
Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
Authors: Hyeong Kyu Choi (University of Wisconsin-Madison, Computer Sciences), Jerry Zhu (Carnegie Mellon University), Sharon Li (University of Wisconsin-Madison)
Multi-Agent Debate (MAD) improves large language model performance by having multiple agents reason collaboratively, but its key drivers were unclear. By separating Majority Voting from inter-agent debate, experiments across seven NLP benchmarks show that most gains come from majority voting rather than the debate itself. A theoretical analysis models debate as a stochastic process, revealing that debate alone doesn’t improve expected correctness, though targeted interventions that bias belief updates can enhance its impact. These results suggest that while MAD has potential, simple ensembling methods often remain a more reliable and effective approach.
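The finding that most of the gain comes from majority voting matches classic jury-theorem arithmetic, which is easy to verify directly. The sketch below uses toy numbers of our own choosing, not figures from the paper: for a binary task with independent agents, each correct with probability p > 0.5, the exact majority accuracy is a binomial tail sum.

```python
from math import comb

def majority_accuracy(p, n):
    # Probability that a majority of n independent agents (each correct
    # with probability p) answers a binary question correctly (odd n).
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))
```

For example, seven independent agents at 60% individual accuracy already exceed 70% as an ensemble, with no debate involved — which is the kind of baseline any claimed debate-specific gain has to beat.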
The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games
Authors: Ioannis Anagnostides (Carnegie Mellon University), Ioannis Panageas (UC Irvine), Tuomas Sandholm (Carnegie Mellon University, Strategy Robot, Optimized Markets, Strategic Machine), Jingming Yan (University of California, Irvine)
The study analyzes the complexity of computing equilibria in team-based zero-sum games and symmetric min-max optimization. It shows that finding epsilon-Nash equilibria in 3-player adversarial team games (2 vs. 1) is CLS-complete, resolving an open question about such games. Additionally, computing symmetric equilibria in symmetric min-max problems is PPAD-complete, even for quadratic objectives, and this extends to 6-player team games (3 vs. 3), implying that common symmetric dynamics cannot reliably converge. Finally, computing non-symmetric equilibria with polynomial precision is FNP-hard, highlighting the fundamental difficulty of equilibrium computation in these settings.
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Authors: Emile Anand (Georgia Institute of Technology and Cognition Labs), Ishani Karmarkar (Stanford University), Guannan Qu (Carnegie Mellon University)
Scaling multi-agent reinforcement learning (MARL) is difficult due to the exponential growth of joint state and action spaces as agents increase. SUBSAMPLE-MFQ introduces a method that combines subsampling agents with mean-field Q-learning and a decentralized randomized policy, allowing efficient learning for any subset of k agents. The algorithm’s runtime scales polynomially in k, not the total number of agents n, making it practical for large systems. Theoretical guarantees show that the learned policy converges to the optimal policy at a rate of roughly 1/√k, independent of the total agent count.
On the Hardness of Conditional Independence Testing In Practice
Authors: Zheng He (University of British Columbia), Roman Pogodin (Google), Yazhe Li (Microsoft), Namrata Deka (Carnegie Mellon University), Arthur Gretton (Google DeepMind / UCL), Danica J. Sutherland (University of British Columbia / Amii)
Conditional independence (CI) tests are central to tasks like causal discovery and fairness evaluation, but they often fail in practice despite theoretical guarantees. Focusing on the Kernel-based Conditional Independence (KCI) test, the work shows that many recent CI tests are special cases of a Generalized Covariance Measure. Practical performance is largely driven by errors in estimating the conditional mean, which affect Type I error, and by the choice of conditioning kernel, which influences test power but can also inflate false positives. These insights clarify why popular CI tests often underperform and highlight how careful kernel and estimation choices are crucial for reliable results.
Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs
Authors: Xiangcheng Zhang (Tsinghua University), Yige Hong (Carnegie Mellon University), Weina Wang (Carnegie Mellon University)
Heterogeneity creates major challenges in large-scale decision-making, especially in weakly-coupled Markov decision processes (WCMDPs) where each subproblem has distinct dynamics. In the fully heterogeneous setting, the authors show that an efficiently computable policy can achieve an O(1/√N) optimality gap in long-run average reward per subproblem as the number of subproblems N grows. This work provides the first asymptotic optimality guarantee for fully heterogeneous average-reward WCMDPs. Key to this result is a novel use of projection-based Lyapunov functions that ensure convergence of rewards and costs even under complete heterogeneity.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Authors: Hyungjoo Chae (Georgia Institute of Technology), Seonghwan Kim (Yonsei University), Junhee Cho (Yonsei University), Seungone Kim (Carnegie Mellon University), Seungjun Moon (Yonsei University), Gyeom Hwangbo (University of Seoul), Dongha Lim (Korea Advanced Institute of Science & Technology), Minjin Kim (Yonsei University), Yeonjun Hwang (Yonsei University), Minju Gwak (Yonsei University), Dongwook Choi (Chung-Ang University), Minseok Kang (Yonsei University), Gwanhoon Im (Yonsei University), ByeongUng Cho (Yonsei University), Hyojun Kim (Yonsei University), Jun Han (Yonsei University), Taeyoon Kwon (Yonsei University), Minju Kim (Yonsei University), Beong-woo Kwak (Yonsei University), Dongjin Kang (Yonsei University), Jinyoung Yeo (Yonsei University)
Web navigation poses a long-horizon sequential decision-making challenge that goes beyond typical multimodal LLM tasks, but step-level reward models have been lacking. Web-Shepherd, the first process reward model (PRM) for web navigation, evaluates trajectories at each step, enabling both training and test-time assessment. The approach is supported by the WebPRM Collection, a 40K step-level dataset with annotated preference pairs, and WebRewardBench, a benchmark for evaluating PRMs. Experiments show Web-Shepherd outperforms GPT-4o by ~30 points on WebRewardBench and improves policy performance on WebArena-lite by 10.9 points while reducing verification cost by 10×, demonstrating a practical and efficient solution for web navigation tasks.
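At test time, a process reward model like this supports best-of-n selection: score every step of each candidate trajectory and keep the highest-scoring one. A minimal sketch, with `score_step` standing in as a hypothetical interface to a PRM such as Web-Shepherd:

```python
def rerank_with_prm(trajectories, score_step):
    """Best-of-n selection with a process reward model (PRM).

    Each trajectory is a list of steps; `score_step` returns a scalar
    reward for one step (a stand-in for a learned PRM).
    The trajectory with the highest mean step reward is returned."""
    def mean_reward(traj):
        return sum(score_step(step) for step in traj) / len(traj)
    return max(trajectories, key=mean_reward)

# Toy scorer in place of a learned model: reward steps labeled "good".
toy_scores = {"good": 1.0, "bad": 0.0}
best = rerank_with_prm(
    [["bad", "bad"], ["good", "bad"], ["good", "good"]],
    toy_scores.get,
)
```

Because scores are attached to steps rather than whole trajectories, the same interface also supports training-time credit assignment, which is what drives the reported policy improvements.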
Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment
Authors: Woojun Kim (Carnegie Mellon University), Katia Sycara (Carnegie Mellon University)
Mixed-motive multi-agent reinforcement learning requires balancing individual incentives with collective goals, which are often in conflict. The proposed adaptive conflict-aware gradient adjustment method dynamically balances policy gradients from individual and collective objectives, promoting cooperation while preserving fairness in task-specific rewards. Theoretical analysis guarantees monotonic improvement in both collective and individual outcomes, ensuring fairness across agents. Experiments in sequential social dilemma environments show that this approach outperforms baselines in social welfare while maintaining equitable outcomes for all agents.
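The gradient-adjustment idea can be illustrated with a PCGrad-style projection (an illustrative assumption; the paper's adaptive rule is not reproduced here): when the individual and collective gradients point in conflicting directions, remove the conflicting component before combining them.

```python
import numpy as np

def conflict_aware_combine(g_ind, g_col):
    """Combine an agent's individual-objective gradient with the
    collective-objective gradient. On conflict (negative inner product),
    project the individual gradient onto the normal plane of the
    collective one so the update never opposes the collective goal."""
    dot = float(np.dot(g_ind, g_col))
    if dot < 0.0:  # conflicting directions: strip the opposing component
        g_ind = g_ind - (dot / float(np.dot(g_col, g_col))) * g_col
    return g_ind + g_col
```

Without conflict the two gradients are simply summed; with conflict, the projection guarantees the combined update has non-negative inner product with the collective gradient, mirroring the monotonic-improvement property described above.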
Poster Papers
Applications
MLZero: A Multi-Agent System for End-to-end Machine Learning Automation
Authors: Haoyang Fang (AWS), Boran Han (AWS), Nick Erickson (Amazon Web Services), Xiyuan Zhang (AWS AI), Su Zhou (Carnegie Mellon University), Anirudh Dagar (AWS), Jiani Zhang (Google), Caner Turkmen (Amazon Web Services), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), Ying Nian Wu (University of California, Los Angeles), Yuyang (Bernie) Wang (AWS AI), George Karypis (University of Minnesota, Minneapolis)
Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
Authors: Muquan Yu (Chinese University of Hong Kong), Mu Nan (University of Hong Kong), Hossein Adeli (Columbia University), Jacob Prince (Harvard University), John A. Pyles (University of Washington), Leila Wehbe (Carnegie Mellon University), Maggie Henderson (Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Andrew Luo (University of Hong Kong)
Topology-Aware Conformal Prediction for Stream Networks
Authors: Jifan Zhang (Northwestern University), Fangxin Wang (University of Illinois at Chicago), Zihe Song (University of Illinois at Chicago), Philip S Yu (UIC), Kaize Ding (Northwestern University), Shixiang Zhu (Carnegie Mellon University)
ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions
Authors: Yue Huang (University of Notre Dame), Zhengzhe Jiang (Sichuan University), Xiaonan Luo (University of Notre Dame), Kehan Guo (University of Notre Dame), Haomin Zhuang (University of Notre Dame), Yujun Zhou (University of Notre Dame), Zhengqing Yuan (University of Notre Dame), Xiaoqi Sun (Massachusetts Institute of Technology), Jules Schleinitz (California Institute of Technology), Yanbo Wang (Mohamed bin Zayed University of Artificial Intelligence), Shuhao Zhang (Carnegie Mellon University), Mihir Surve (University of Notre Dame), Nitesh Chawla (University of Notre Dame), Olaf Wiest (University of Notre Dame), Xiangliang Zhang (University of Notre Dame)
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Authors: Yang Xiao (Hong Kong Polytechnic University), Jiashuo WANG (HKPU), Ruifeng Yuan (Hong Kong Polytechnic University), Chunpu Xu (Hong Kong Polytechnic University), Kaishuai Xu (Hong Kong Polytechnic University), Wenjie Li (The Hong Kong Polytechnic University), Pengfei Liu (Carnegie Mellon University)
Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization
Authors: Jiaqi Wei (Zhejiang University), Hao Zhou (South China University of Technology), Xiang Zhang (University of British Columbia), Di Zhang (Shanghai Artificial Intelligence Laboratory), Zijie Qiu (Fudan University), Noah Wei (Carnegie Mellon University), Jinzhe Li (Fudan University), Wanli Ouyang (Shanghai AI Lab), Siqi Sun (Fudan University)
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Authors: Ziyang Ma (Shanghai Jiao Tong University), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Yanqiao Zhu (Shanghai Jiao Tong University), Chen Yang (Shanghai Jiao Tong University), Yi-Wen Chao (Nanyang Technological University), Ruiyang Xu (Shanghai Jiao Tong University), Wenxi Chen (Shanghai Jiao Tong University), Yuanzhe Chen (ByteDance Inc.), Zhuo Chen (ByteDance Inc.), Jian Cong (ByteDance Inc.), Kai Li (Tsinghua University), Keliang Li (Chinese Academy of Sciences), Siyou Li (Queen Mary University of London), Xinfeng Li (Nanyang Technological University), Xiquan Li (Shanghai Jiao Tong University), Zheng Lian (Institute of Automation, Chinese Academy of Sciences), Yuzhe Liang (Shanghai Jiao Tong University), Minghao Liu (2077AI), Zhikang Niu (Shanghai Jiao Tong University), Tianrui Wang (Tianjin University), Wang Yuping (University of Science and Technology of China), Yuxuan Wang (ByteDance), Yihao Wu (Nanyang Technological University), Guanrou Yang (Shanghai Jiao Tong University), Jianwei Yu (Microsoft), Ruibin Yuan (Carnegie Mellon University), Zhisheng Zheng (University of Texas at Austin), Ziya Zhou (Hong Kong University of Science and Technology), Haina Zhu (Shanghai Jiao Tong University), Wei Xue (Hong Kong University of Science and Technology), Emmanouil Benetos (Queen Mary University of London), Kai Yu (Shanghai Jiao Tong University), Eng-Siong Chng (Nanyang Technological University), Xie Chen (Shanghai Jiao Tong University)
A Generalist Intracortical Motor Decoder
Authors: Joel Ye (Carnegie Mellon University), Fabio Rizzoglio (Northwestern University), Xuan Ma (Northwestern University), Adam Smoulder (Carnegie Mellon University), Hongwei Mao (University of Pittsburgh), Gary Blumenthal (University of Pittsburgh), William Hockeimer (University of Pittsburgh), Nicolas Kunigk (University of Pittsburgh), Dalton Moore (University of Chicago), Patrick Marino (Phantom Neuro), Raeed Chowdhury, J. Patrick Mayo (University of Pittsburgh), Aaron Batista (University of Pittsburgh), Steven Chase, Michael Boninger (University of Pittsburgh), Charles Greenspon (University of Chicago), Andrew B Schwartz (University of Pittsburgh), Nicholas Hatsopoulos (University of Chicago), Lee Miller (Northwestern University at Chicago), Kristofer Bouchard (Lawrence Berkeley National Laboratory), Jennifer Collinger (University of Pittsburgh), Leila Wehbe (Carnegie Mellon University), Robert Gaunt (University of Pittsburgh)
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
Authors: Chandler Smith (Oxford University), Marwa Abdulhai (University of California, Berkeley), Manfred Díaz (Mila, Quebec), Marko Tesic (University of Cambridge), Rakshit Trivedi (Massachusetts Institute of Technology), Sasha Vezhnevets (DeepMind), Lewis Hammond (University of Oxford / Cooperative AI Foundation), Jesse Clifton (Center on Long-Term Risk), Minsuk Chang (Google Deepmind), Edgar Duenez-Guzman (Google DeepMind), John Agapiou (Google DeepMind), Jayd Matyas (DeepMind), Danny Karmon (Google DeepMind), Beining Zhang (University of Southampton ), Jim Dilkes (University of Southampton), Akash Kundu (Heritage Institute of Technology), Hieu Minh Nguyen (Apart Research), Emanuel Tewolde (Carnegie Mellon University), Jebish Purbey (Tribhuvan University), Ram Mohan Rao Kadiyala (), Siddhant Gupta (Indian Institute of Technology, Roorkee), Aliaksei Korshuk (Coframe), Buyantuev Alexander (Higher School of Economics), Ilya Makarov (AIRI & ISP RAS), Gang Zhao (Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University), Rolando Fernandez (University of Texas at Austin), Zhihan Wang (University of Texas at Austin), Caroline Wang (The University of Texas at Austin | Google DeepMind), Jiaxun Cui (Meta), Lingyun Xiao (University of Texas at Austin), Di Shi (University of Texas at Austin), Yoonchang Sung (Nanyang Technological University), Muhammad Arrasy Rahman (The University of Texas at Austin), Peter Stone (The University of Texas at Austin, Sony AI), Yipeng Kang (National Key Laboratory of General Artificial Intelligence), Hyeonggeun Yun (Companoid Labs), Ananya Ananya (Stanford University), Taehun Cha (Korea University), Zhiqiang Wu (Tongji University), Elizaveta Tennant (University College London), Olivia Macmillan-Scott (UCL), Marta Segura (University College London, University of London), Diana Riazi (Department of Computer Science, University College London, University of London), Fuyang Cui (University of Toronto), Sriram Ganapathi 
(University of Waterloo), Toryn Klassen (University of Toronto), Nico Schiavone (University of Toronto), Mogtaba Alim (University of Toronto), Sheila McIlraith (University of Toronto and Vector Institute), Manuel Rios (Universidad de los Andes), Oswaldo Peña (Universidad Nacional de Colombia), Carlos Rojas (Grupo Bancolombia), Manuela Chacon-Chamorro (Universidad de los Andes), Rubén Manrique (Universidad de Los Andes), Luis Felipe Giraldo (Universidad de Los Andes), Nicanor Quijano (Universidad de Los Andes), Yiding Wang (Peking University), Yuxuan Chen (the University of Hong Kong, University of Hong Kong), Fangwei Zhong (Beijing Normal University), Mengmeng Wang (State Key Laboratory of General Artificial Intelligence), Wenming Tu (Shanghai Jiaotong University), Zhaowei Zhang (Peking University), Ziang Chen (Tsinghua University, Tsinghua University), Zixia Jia (BigAI), Xue Feng (BIGAI), Zilong Zheng (Beijing Institute for General Artificial Intelligence), Chichen Lin (), Weijian Fan (Communication University of China), Chenao Liu (Communication University of China), Sneheel Sarangi (New York University Abu Dhabi), Ziyan Wang (King’s College London; Microsoft Research), shuqing shi (Kings College London), Yali Du (King‘s College London), Avinaash Anand Kulandaivel (None), Yang Liu (BIGAI), Wu Ruiyang (Communication University of China), Chetan Talele (None), 陆孙嘉 (Communication University of China), Gema Parreno (–), Shamika Dhuri (Carnegie Mellon University), Bain McHale (CMU, Carnegie Mellon University), Tim Baarslag (Centrum Wiskunde & Informatica / Eindhoven University of Technology), Dylan Hadfield-Menell (MIT), Natasha Jaques (University of Washington, Google DeepMind), José Hernández-Orallo (Universitat Politècnica de València), Joel Leibo (DeepMind)
Computer Vision
Grounded Reinforcement Learning for Visual Reasoning
Authors: Gabriel Sarch (Princeton University), Snigdha Saha (Google), Naitik Khandelwal (Carnegie Mellon University), Ayush Jain (CMU, Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Aviral Kumar (Carnegie Mellon University), Katerina Fragkiadaki (Carnegie Mellon University)
COS3D: Collaborative Open-Vocabulary 3D Segmentation
Authors: Runsong Zhu (The Chinese University of Hong Kong), Ka-Hei Hui (Autodesk), Zhengzhe Liu (Carnegie Mellon University), Qianyi Wu (Monash University), Weiliang Tang (The Chinese University of Hong Kong), Shi Qiu (The Chinese University of Hong Kong), Pheng-Ann Heng (The Chinese University of Hong Kong), Chi-Wing Fu (The Chinese University of Hong Kong)
OmniBench: Towards The Future of Universal Omni-Language Models
Authors: Yizhi Li (The University of Manchester), Ge Zhang (University of Michigan – Ann Arbor), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Ruibin Yuan (Carnegie Mellon University), Zhu (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Hangyu Guo (Alibaba Group), Yiming Liang (University of the Chinese Academy of Sciences), Jiaheng Liu (Nanjing University), Noah Wang, Jian Yang (Alibaba Group), Siwei Wu (Nanjing University of Science and Technology), Xingwei Qu (University of Manchester), Jinjie Shi (Queen Mary, University of London), Xinyue Zhang (National University of Singapore), Zhenzhu Yang (China University of Geosciences Beijing), Yidan Wen (Northwestern Polytechnical University, Xi'an), Yanghai Wang (Nanjing University), Shihao Li (Nanjing University), Zhao-Xiang Zhang (Chinese Academy of Sciences), Ruibo Liu (Google DeepMind), Emmanouil Benetos (Queen Mary University of London), Wenhao Huang (Key Laboratory of Machine Perception), Chenghua Lin (University of Manchester)
UFM: A Simple Path towards Unified Dense Correspondence with Flow
Authors: Yuchen Zhang (Carnegie Mellon University), Nikhil Keetha (Carnegie Mellon University), Chenwei Lyu (TikTok Inc.), Bhuvan Jhamb (CMU, Carnegie Mellon University), Yutian Chen (Carnegie Mellon University), Yuheng Qiu (Carnegie Mellon University), Jay Karhade (CMU, Carnegie Mellon University), Shreyas Jha (Nissan Advanced Technology Center), Yaoyu Hu (Carnegie Mellon University), Deva Ramanan (Carnegie Mellon University), Sebastian Scherer (Carnegie Mellon University), Wenshan Wang (School of Computer Science, Carnegie Mellon University)
HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis
Authors: Xiaoyuan Wang (Carnegie Mellon University), Yizhou Zhao (Carnegie Mellon University), Botao Ye (ETH Zurich), Shan Xiaojun, Weijie Lyu (University of California, Merced), Lu Qi (University of California, Merced), Kelvin Chan (Nanyang Technological University), Yinxiao Li (Google DeepMind), Ming-Hsuan Yang (Google / UC Merced)
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Authors: Yunlong Tang (University of Rochester), Pinxin Liu (University of Rochester), Mingqian Feng (University of Rochester), Zhangyun Tan (University of Rochester), Rui Mao (University of Rochester), Chao Huang (Department of Computer Science, University of Rochester), Jing Bi (University of Rochester), Yunzhong Xiao (Carnegie Mellon University), Susan Liang (University of Rochester), Hang Hua (University of Rochester), Ali Vosoughi (University of Rochester), Luchuan Song (University of Rochester), Zeliang Zhang (University of Rochester), Chenliang Xu (University of Rochester)
CAT: Content-Adaptive Image Tokenization
Authors: Junhong Shen (Carnegie Mellon University), Kushal Tirumala (Meta AI Research, FAIR), Michihiro Yasunaga (Stanford University), Ishan Misra (Facebook AI Research), Luke Zettlemoyer (University of Washington; Meta), LILI YU (Meta), Chunting Zhou (FAIR)
OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates
Authors: Jinpei Guo (Shanghai Jiaotong University), Yifei Ji (Shanghai Jiaotong University), Zheng Chen (Shanghai Jiao Tong University), Kai Liu (Shanghai Jiaotong University), Min Liu (Skild AI), Wang Rao (Carnegie Mellon University), Wenbo Li (JD Joy Future Academy), Yong Guo (Max Planck Institute for Informatics), Yulun Zhang (Shanghai Jiao Tong University)
Salient Concept-Aware Generative Data Augmentation
Authors: Tianchen Zhao (Amazon), Xuanbai Chen (Carnegie Mellon University), Zhihua Li (Amazon), Jun Fang (Amazon AGI), DONGSHENG An (State University of New York, Stony Brook), Xiang Xu (Amazon), Zhuowen Tu (University of California, San Diego), Yifan Xing (Amazon)
Data-centric AI
ORBIT – Open Recommendation Benchmark for Reproducible Research with Hidden Tests
Authors: Jingyuan He (School of Computer Science, Carnegie Mellon University), Jiongnan Liu, Vishan Oberoi (Carnegie Mellon University), Bolin Wu (Carnegie Mellon University), Mahima Jagadeesh Patel (Carnegie Mellon University), Kangrui Mao (Carnegie Mellon University), Chuning Shi (Carnegie Mellon University), I-Ta Lee (Meta Platform Inc.), Arnold Overwijk (Meta), Chenyan Xiong (School of Computer Science, Carnegie Mellon University)
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Authors: Nikhil Kandpal (Department of Computer Science), Brian Lester (Google DeepMind/University of Toronto), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Sebastian Majstorovic (EleutherAI), Stella Biderman (The Eleutherai Institute), Baber Abbasi (EleutherAI), Luca Soldaini (Allen Institute for AI), Enrico Shippole (Teraflop AI), A. Feder Cooper (Stanford University), Aviya Skowron (EleutherAI), Shayne Longpre (Massachusetts Institute of Technology), Lintang Sutawika (Carnegie Mellon University), Alon Albalak (Lila Sciences), Zhenlin Xu (Boson AI), Guilherme Penedo (HuggingFace), Loubna Ben allal (Hugging Face), Elie Bakouch (Hugging Face), John Pressman (EleutherAI Institute), Honglu Fan (Google DeepMind), Dashiell Stander (EleutherAI), Guangyu Song (EleutherAI), Aaron Gokaslan (MBZUAI Institute of Foundation Models), John Kirchenbauer (University of Maryland, College Park), Tom Goldstein (University of Maryland), Brian Bartoldson (Lawrence Livermore National Laboratory), Bhavya Kailkhura (Lawrence Livermore National Laboratory), Tyler Murray (Allen Institute for Artificial Intelligence)
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Authors: Cathy Jiao (Carnegie Mellon University), Yijun Pan (Yale University), Emily Xiao (Carnegie Mellon University), Daisy Sheng (Carnegie Mellon University), Niket Jain (Carnegie Mellon University), Hanzhang Zhao (CMU, Carnegie Mellon University), Ishita Dasgupta (School of Computer Science, Carnegie Mellon University), Jiaqi Ma (University of Illinois Urbana-Champaign), Chenyan Xiong (School of Computer Science, Carnegie Mellon University)
Faithful Group Shapley Value
Authors: Kiljae Lee (The Ohio State University), Ziqi Liu (Carnegie Mellon University), Weijing Tang (Carnegie Mellon University), Yuan Zhang (Ohio State University, Columbus)
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Authors: Sang Choe (Anthropic), Hwijeen Ahn (Carnegie Mellon University), Juhan Bae (Anthropic), Kewen Zhao (School of Computer Science, Carnegie Mellon University), Youngseog Chung (CMU, Carnegie Mellon University), Adithya Pratapa (Carnegie Mellon University, Amazon), Willie Neiswanger (USC), Emma Strubell (Carnegie Mellon University), Teruko Mitamura (Carnegie Mellon University), Jeff Schneider (CMU), Eduard Hovy (Carnegie Mellon University), Roger Grosse (University of Toronto), Eric Xing (CMU/MBZUAI/GenBio)
Deep Learning
Results of the Big ANN: NeurIPS’23 competition
Authors: Harsha Vardhan Simhadri (Microsoft), Martin Aumüller (IT University of Copenhagen), Matthijs Douze (Facebook AI Research), Dmitry Baranchuk (Yandex), Amir Ingber (Pinecone), Edo Liberty (Yale University), George Williams (Ansible AI), Ben Landrum (Cornell University), Magdalen Manohar (Carnegie Mellon University), Mazin Karjikar (University of Maryland, College Park), Laxman Dhulipala (UMD), Meng Chen (Fudan University), Yue Chen (Fudan University), Rui Ma (Fudan University), Kai Zhang (Fudan University), Yuzheng Cai (Fudan University), Jiayang Shi (Fudan University), Weiguo Zheng (Fudan University), Yizhuo Chen (Fudan University), Jie Yin (Tencent), Ben Huang (Baidu)
GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Authors: Xin Gao (Fudan University), Jiyao Liu (Fudan University), Guanghao Li (Fudan University), Yueming Lyu (Nanjing University), Jianxiong Gao, Weichen Yu (Carnegie Mellon University), Ningsheng Xu (Fudan University), Liang Wang (NLPR, China), Caifeng Shan (Nanjing University), Ziwei Liu (Nanyang Technological University), Chenyang Si (Sea AI Lab)
Reasoning Models Better Express Their Confidence
Authors: Dongkeun Yoon (KAIST), Seungone Kim (Carnegie Mellon University), Sohee Yang (University College London, University of London), Sunkyoung Kim (LG AI Research), Soyeon Kim (LG Corporation), Yongil Kim (LG Corporation), Eunbi Choi (LG AI Research), Yireun Kim (LG AI Research), Minjoon Seo (KAIST)
General Machine Learning
Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models
Authors: Xiyuan Zhang (AWS AI), Danielle Maddix Robinson (AWS AI Labs), Junming Yin (Amazon), Nick Erickson (Amazon Web Services), Abdul Fatir Ansari (Amazon), Boran Han (AWS), Shuai Zhang (AWS AI), Leman Akoglu (CMU), Christos Faloutsos (CMU), Michael Mahoney (UC Berkeley), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), George Karypis (University of Minnesota, Minneapolis), Yuyang (Bernie) Wang (AWS AI)
Optimization
A Beyond-Worst-Case Analysis of Greedy k-means++
Authors: Qingyun Chen (University of California, Santa Cruz), Sungjin Im (University of California, Santa Cruz), Ben Moseley (Carnegie Mellon University), Ryan Milstrey (University of California, Merced), Chenyang Xu (Zhejiang University), Ruilong Zhang (Technische Universität München)
Reinforcement Learning
MyoChallenge 2024: A New Benchmark for Physiological Dexterity and Agility in Bionic Humans
Authors: Huiyi Wang (McGill University), Chun Kwang Tan (Northeastern University), Balint Hodossy (Imperial College London), Shirui Lyu (King’s College London, University of London), Pierre Schumacher (Max Planck Institute for Intelligent Systems, Max-Planck Institute), James Heald (University College London, University of London), Kai Biegun (University College London, University of London), Samo Hromadka (Gatsby Computational Neuroscience Unit), Maneesh Sahani (Gatsby Unit, UCL), Gunwoo Park (KAIST), Beomsoo Shin (KAIST), JongHyeon Park (None), Seungbum Koo (KAIST), Chenhui Zuo (Tsinghua University, Tsinghua University), Chengtian Ma (Tsinghua University, Tsinghua University), Yanan Sui (Tsinghua University), Nick Hansen (UC San Diego), Stone Tao (University of California – San Diego), Yuan Gao (Carnegie Mellon University), Hao Su (UCSD), Seungmoon Song (Stanford University), Letizia Gionfrida (King’s College London, University of London), Massimo Sartori (University of Twente), Guillaume Durandau (McGill University), Vikash Kumar (CMU / MyoLab), Vittorio Caggiano (MyoSuite)
Reasoning as an Adaptive Defense for Safety
Authors: Taeyoun Kim (Carnegie Mellon University), Fahim Tajwar (Carnegie Mellon University), Aditi Raghunathan (Carnegie Mellon University), Aviral Kumar (Carnegie Mellon University)
Compute-Optimal Scaling for Value-Based Deep RL
Authors: Preston Fu (University of California, Berkeley), Oleh Rybkin (University of California, Berkeley), Zhiyuan (Paul) Zhou (UC Berkeley, PI), Michal Nauman (University of Warsaw), Pieter Abbeel (UC Berkeley & Amazon), Sergey Levine (UC Berkeley), Aviral Kumar (Carnegie Mellon University)
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Authors: Frank (Fangzheng) Xu (Microsoft AI), Yufan Song (Carnegie Mellon University), Boxuan Li (Microsoft), Yuxuan Tang (Oracle), Kritanjali Jain (School of Computer Science, Carnegie Mellon University), Mengxue Bao (Tiktok), Zora Wang (Carnegie Mellon University), Xuhui Zhou (CMU, Carnegie Mellon University), Zhitong Guo (Meta), Murong Cao (University of Hong Kong), Mingyang Yang (Carnegie Mellon University), Hao Yang Lu (Carnegie Mellon University), Amaad Martin (School of Computer Science, Carnegie Mellon University), Zhe Su (Carnegie Mellon University), Leander Maben (CMU, Carnegie Mellon University), Raj Mehta (Carnegie Mellon University), Wayne Chi (Carnegie Mellon University), Lawrence Jang (Carnegie Mellon University), Yiqing Xie (Carnegie Mellon University), Shuyan Zhou (Facebook), Graham Neubig (Carnegie Mellon University)
Adaptively Coordinating with Novel Partners via Learned Latent Strategies
Authors: Benjamin Li (Carnegie Mellon University), Shuyang Shi (School of Computer Science, Carnegie Mellon University), Lucia Romero (University of Pittsburgh), Huao Li (Massachusetts Institute of Technology), Yaqi Xie (CMU), Woojun Kim (Carnegie Mellon University), Stefanos Nikolaidis (University of Southern California), Charles Lewis (University of Pittsburgh), Katia Sycara (Carnegie Mellon University), Simon Stepputtis (Virginia Polytechnic Institute and State University)
Scaling Offline RL via Efficient and Expressive Shortcut Models
Authors: Nicolas Espinosa-Dice (Cornell University), Yiyi Zhang (Cornell University), Yiding Chen (Cornell University), Bradley Guo (Cornell University), Owen Oertell (Cornell University), Gokul Swamy (Carnegie Mellon University), Kianté Brantley (Kempner and SEAS at Harvard University), Wen Sun (Cornell University and Databricks)
Thinking vs. Doing: Improving Agent Reasoning by Scaling Test-Time Interaction
Authors: Junhong Shen (Carnegie Mellon University), Hao Bai (University of Illinois at Urbana-Champaign), Lunjun Zhang (University of Toronto), Yifei Zhou (University of California, Berkeley), Amrith Setlur (Carnegie Mellon University), Peter Tong (New York University), Diego Caples (AGI, Inc.), Nan Jiang (University of Illinois at Urbana-Champaign), Tong Zhang (UIUC), Ameet Talwalkar (CMU, Datadog), Aviral Kumar (Carnegie Mellon University)
Social Aspects
Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
Authors: Shuaiqi Wang (CMU, Carnegie Mellon University), Vikas Raunak (Google DeepMind), Arturs Backurs (TTIC), Victor Reis (Microsoft), Pei Zhou (University of Southern California), Sihao Chen (Microsoft), Longqi Yang (Microsoft), Zinan Lin (Microsoft Research), Sergey Yekhanin (Microsoft), Giulia Fanti (CMU)
Validating LLM-as-a-Judge Systems under Rating Indeterminacy
Authors: Luke Guerdan (Carnegie Mellon University), Solon Barocas (Microsoft Research; Cornell University), Kenneth Holstein (Carnegie Mellon University), Hanna Wallach (Microsoft), Steven Wu (Carnegie Mellon University), Alex Chouldechova (Microsoft)
Valid Inference with Imperfect Synthetic Data
Authors: Yewon Byun (Carnegie Mellon University), Shantanu Gupta (Carnegie Mellon University), Zachary Lipton (Carnegie Mellon University / Abridge), Rachel Childers (University of Zurich), Bryan Wilder (Carnegie Mellon University)
Private Evolution Converges
Authors: Tomás González Lara (Carnegie Mellon University), Giulia Fanti (CMU), Aaditya Ramdas (Carnegie Mellon University)
Uncategorized
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Authors: Xeron Du (01.AI), Yifan Yao (Beijing University of Posts and Telecommunications), Kaijing Ma (Tongji University), Bingli Wang (Sichuan Agricultural University), Tianyu Zheng (Beijing University of Posts and Telecommunications), Zhu (Guangdong OPPO Mobile Telecommunications Corp.,Ltd.), Minghao Liu (2077AI), Yiming Liang (University of the Chinese Academy of Sciences), Xiaolong Jin (Purdue University), Zhenlin Wei (Harbin Engineering University), Chujie Zheng (Tsinghua University), Kaixin Deng (Hokkaido University), Shuyue Guo (Beijing University of Posts and Telecommunications), Shian Jia (Zhejiang University), Sichao Jiang (zhejiang university), Yiyan Liao (Peking University), Rui Li (Peking University), Qinrui Li (Cornell University), Sirun Li (Peking University), Yizhi Li (The University of Manchester), Yunwen Li (Chinese University of Hong Kong(shenzhen)), Dehua Ma (Beijing University of Posts and Telecommunications), Yuansheng Ni (University of Waterloo), Haoran Que (Beijing University of Aeronautics and Astronautics), Qiyao Wang (henzhen Institute of Advanced Technology, Chinese Academy of Sciences), Zhoufutu Wen (ByteDance Inc.), Siwei Wu (Nanjing University of Science and Technology), Tianshun Xing (Beijing University of Posts and Telecommunications), 明 许 (01.AI), Zhenzhu Yang (China University of Geoscience Beijing), Noah Wang (), Junting Zhou (Peking University), yuelin bai (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences), Xingyuan Bu (Alibaba Group), chenglin cai (Huawei Technologies Ltd.), Liang Chen (Peking University), Yifan Chen (ByteDance Inc.), Cheng Chengtuo (Zhejiang University), Tianhao Cheng (Fudan University), Keyi Ding (2077AI), Siming Huang (University of Melbourne), HUANG YUN (national university of singaore, National University of Singapore), Yaoru Li (Zhejiang University), Yizhe Li (Zhejiang University), Zhaoqun Li (Zhejiang University), Tianhao Liang (Zhejiang University), 
Chengdong Lin (Hangzhou Dianzi University), Hongquan Lin (University of Science and Technology of China), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Zhongyuan Peng (Fudan University), Zifan Peng (The Hong Kong University of Science and Technology (Guangzhou)), Qige Qi (ByteDance Inc.), Shi Qiu (Peking University), Xingwei Qu (University of Manchester), Shanghaoran Quan (Alibaba Group), Yizhou Tan (Harvard University), Zili Wang (stepfun), 王晨清 (abaka), Hao Wang (Beijing University of Aeronautics and Astronautics), Yiya Wang (Peking University), Yubo Wang (University of Waterloo), Jiajun Xu (Facebook), Kexin Yang (Alibaba Group), Ruibin Yuan (Carnegie Mellon University), Yuanhao Yue (Fudan University), Tianyang Zhan (ByteDance Inc.), Chun Zhang (ByteDance Inc.), Jinyang Zhang (Peking University), Xiyue Zhang (Peking University), Owen Zhang (Department of Computer Science, Princeton University), Yue Zhang (Suzhou University), Yongchi Zhao (Alibaba Group), Xiangyu Zheng (Fudan University), ChenghuaZhong (University of Science and Technology Beijing), Yang Gao (Nanjing University), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Dayiheng Liu (Alibaba Group), Qian Liu (TikTok (Singapore)), Tianyu Liu (Alibaba), Shiwen Ni (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Junran Peng (Institute of automation, Chinese academy of science), Yujia Qin (Bytedance), Wenbo Su (Alibaba Group), Guoyin Wang (Alibaba Qwen Pilot), Shi Wang (Institute of Computing Science, Chinese Academy of Sciences), Jian Yang (Alibaba Group), Min Yang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences), Meng Cao (Mohamed bin Zayed University of Artificial Intelligence), Xiang Yue (Carnegie Mellon University), ZHAO-XIANG ZHANG (Chinese Academy of Sciences, China), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp.,Ltd.), Jiaheng Liu (Nanjing University), Qunshu Lin 
(Abaka AI), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)
Safety Pretraining: Toward the Next Generation of Safe AI
Authors: Pratyush Maini (Carnegie Mellon University / DatologyAI), Sachin Goyal (Carnegie Mellon University), Dylan Sam (OpenAI, Carnegie Mellon University), Alexander Robey (Carnegie Mellon University), Yash Savani (Carnegie Mellon University), Yiding Jiang (Google DeepMind), Andy Zou (CMU, Gray Swan AI), Matt Fredrikson (CMU), Zachary Lipton (Carnegie Mellon University / Abridge), Zico Kolter (Carnegie Mellon University)
A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks
Authors: Mucong Ding (Department of Computer Science, University of Maryland, College Park), Bang An (University of Maryland, College Park), Tahseen Rabbani (University of Chicago), Chenghao Deng (University of Maryland), Anirudh Satheesh (University of Maryland, College Park), Souradip Chakraborty (University of Maryland, College Park), Mehrdad Saberi (Department of Computer Science, University of Maryland, College Park), Yuxin Wen (University of Maryland), Kyle Sang (University of Maryland), Aakriti Agrawal (University of Maryland, College Park), Xuandong Zhao (UC Berkeley), Mo Zhou (Johns Hopkins University), Mary-Anne Hartley (EPFL), Lei Li (Carnegie Mellon University), Yu-Xiang Wang (UCSD), Vishal Patel (Johns Hopkins University), Soheil Feizi (University of Maryland), Tom Goldstein (University of Maryland), Furong Huang (University of Maryland)
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Authors: Andy Zou (CMU, Gray Swan AI), Maxwell Lin (University of California, Berkeley), Eliot Jones (Gray Swan), Micha Nowak (Bayerische Julius-Maximilians-Universität Würzburg), Mateusz Dziemian (Independent), Nick Winter (Gray Swan AI), Valent Nathanael (Gray Swan AI), Ayla Croft (Gray Swan AI), Xander Davies (University of Oxford), Jai Patel (UK AI Security Institute), Robert Kirk (University College London), Yarin Gal (University of Oxford), Dan Hendrycks (Center for AI Safety), Zico Kolter (Carnegie Mellon University), Matt Fredrikson (CMU)
Antidistillation Sampling
Authors: Yash Savani (Carnegie Mellon University), Asher Trockman (CMU), Zhili Feng (OpenAI), Yixuan Xu (Carnegie Mellon University), Avi Schwarzschild (Carnegie Mellon University), Alexander Robey (Carnegie Mellon University), Marc Finzi (Carnegie Mellon University), Zico Kolter (Carnegie Mellon University)
Is Your Diffusion Model Actually Denoising?
Authors: Daniel Pfrommer (Massachusetts Institute of Technology), Zehao Dou (OpenAI), Christopher Scarvelis (MIT), Max Simchowitz (Carnegie Mellon University), Ali Jadbabaie (MIT)
CSGO: Content-Style Composition in Text-to-Image Generation
Authors: Peng Xing (Nanjing University of Science and Technology), Haofan Wang (Carnegie Mellon University), Yanpeng Sun (Nanjing University of Science and Technology), wangqixun (Tencent Hunyuan), Baixu (ByteDance Inc.), Hao Ai (Beijing University of Aeronautics and Astronautics), Jen-Yuan Huang (Peking University), Zechao Li (Nanjing University of Science and Technology)
RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs
Authors: Meng-Hao Guo (Tsinghua University), Xuanyu Chu (Tsinghua University), Qianrui Yang (Tsinghua University), Zhe-Han Mo (Tsinghua University), Yiqing Shen (Tsinghua University), Pei-lin Li (Tsinghua University), Xinjie Lin (Tsinghua University), Jinnian Zhang (University of Wisconsin–Madison), Xin-Sheng Chen (Tsinghua University), Yi Zhang (Beihang University), Kiyohiro Nakayama (Stanford University), Zhengyang Geng (CMU), Houwen Peng (Microsoft Research), Han Hu (Microsoft Research Asia), Shi-min Hu (Tsinghua University)
Kinetics: Rethinking Test-Time Scaling Law
Authors: Ranajoy Sadhukhan (Carnegie Mellon University), Zhuoming Chen (Carnegie Mellon University), Haizhong Zheng (Carnegie Mellon University), Beidi Chen (CMU / Amazon)
AHa-Bench: Benchmarking Audio Hallucinations in Large Audio-Language Models
Authors: Xize Cheng (Zhejiang University), Dongjie Fu (Zhejiang University), Chenyuhao Wen (University of Electronic Science and Technology of China), Shannon Yu (Tianjin University), Zehan Wang (Zhejiang University), Shengpeng Ji (Zhejiang University), Siddhant Arora (Carnegie Mellon University), Tao Jin (Zhejiang University), Shinji Watanabe (Carnegie Mellon University), Zhou Zhao (Zhejiang University)
Tutorials
New Frontiers of Hyperparameter Optimization: Recent advances and open challenges in theory and practice
Authors: Dravyansh Sharma (Toyota Technological Institute at Chicago), Colin White (Meta), Maria-Florina Balcan (Carnegie Mellon University)
Machine learning performance depends strongly on the data and on the choice of algorithms and hyperparameters, making hyperparameter tuning and algorithm selection essential. We survey widely used practical methods, including Bayesian optimization, bandit-based approaches, and recent techniques for large language models such as scaling laws and parameterization-aware methods, noting their limited theoretical guarantees. We then review recent theory-driven advances that characterize how performance varies with hyperparameters for core algorithms—including decision trees, linear models, and deep learning—enabling structure-aware tuning methods with PAC generalization guarantees. We conclude with open challenges in combining principled and practical approaches, optimizing over high-dimensional or discrete spaces, and scaling to distributed settings.
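To make the tuning problem concrete, here is a minimal sketch of the simplest baseline the tutorial's methods improve upon: random search over a hyperparameter space. All names (`random_search`, `val_score`) and the toy objective are hypothetical illustrations, not from the tutorial itself; Bayesian optimization and bandit methods replace the uniform sampling below with smarter, history-aware proposals.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample configurations uniformly from the space and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy stand-in for a validation score: peaks at lr=0.1, l2=0.01 (hypothetical).
def val_score(cfg):
    return -((cfg["lr"] - 0.1) ** 2 + (cfg["l2"] - 0.01) ** 2)

space = {"lr": (1e-4, 1.0), "l2": (0.0, 0.1)}
cfg, score = random_search(val_score, space, n_trials=200)
```

The theory-driven advances surveyed in the tutorial ask, for structured algorithm families like decision trees or regularized linear models, how many such evaluations suffice for PAC-style guarantees on the selected configuration.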
Data Privacy, Memorization, & Legal Implications in Generative AI: A Practical Guide
Authors: Pratyush Maini (Carnegie Mellon University / DatologyAI), Joseph C. Gratz (Partner, Morrison Foerster LLP), A. Feder Cooper (Yale/Stanford)
Generative models are trained on vast datasets that often contain personal data and copyrighted content. As lawsuits, regulations, and standards emerge, practitioners increasingly need concrete, technically grounded guidance on how privacy and copyright law interact with the realities of modern model development. This tutorial connects data privacy, memorization, and copyright. We will alternate between technical material (attacks, defenses, measurement, and system design) and legal analysis (doctrines, active cases, and regulatory futures), with a focus on practical workflows that ML researchers, engineers, and policy teams can adopt today.
Foundations of Imitation Learning
Authors: Adam Block (Columbia University), Dylan Foster (Microsoft Research), Max Simchowitz (Carnegie Mellon University)
This tutorial frames imitation learning (IL) as a unifying way to understand supervised training of foundation models—learning by imitating large corpora of domain-specific demonstrations—across areas like large language model pre-training, robotics, and chemistry/life sciences. It surveys recent theory on when and why IL works with powerful generative models, explains the interventions and best practices the field has converged on, and points to opportunities to better connect theory and practice. A central theme is how domain-specific settings shape solutions, contrasting discrete problems like language modeling with continuous-control challenges in robotics. It also links techniques across domains, casting next-token prediction as behavior cloning with log-loss and relating exposure bias in generation to compounding error in control, while motivating tools like action chunking, score matching, and interactive data collection.
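The "next-token prediction as behavior cloning with log-loss" connection can be sketched in a few lines: treat the previous token as the state, the next token as the expert's action, and minimize negative log-likelihood of the corpus's choices. The tiny tabular "policy" below is a hypothetical illustration, not the tutorial's formalism.

```python
import math

# Hypothetical tiny policy: a conditional distribution over next tokens given the
# previous token. Behavior cloning with log-loss minimizes the negative
# log-likelihood of the expert's (here, the corpus's) next token.
policy = {
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.3, "b": 0.2, "c": 0.5},
}

def next_token_log_loss(sequence, policy):
    """Average negative log-likelihood of each next token given its predecessor."""
    total = 0.0
    for prev, nxt in zip(sequence, sequence[1:]):
        total += -math.log(policy[prev][nxt])
    return total / (len(sequence) - 1)

loss = next_token_log_loss(["a", "b", "c"], policy)
```

Exposure bias then corresponds to evaluating this policy on its own sampled prefixes rather than corpus prefixes, which is exactly the compounding-error regime studied in control.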
Scale Test-Time Compute on Modern Hardware
Authors: Zhuoming Chen (Carnegie Mellon University), Beidi Chen (Carnegie Mellon University), Azalia Mirhoseini (Stanford/Ricursive Intelligence)
Large language models have made major gains on reasoning tasks by scaling test-time compute using methods like chain-of-thought and sampling, which can boost performance beyond what pretraining alone delivers. However, deploying more test-time compute is hard because inference workloads tend to have low parallelism, irregular execution, heavy memory I/O, and dynamic control flow—creating bottlenecks like attention memory overhead and poor compute utilization. The tutorial surveys both systems advances (e.g., more efficient KV-cache management, optimized attention kernels, smarter scheduling) and algorithmic directions (e.g., architectures and parallel generation better suited to hardware). Its goal is to connect scaling theory with real deployment constraints and motivate practical, scalable LLM agent systems.
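One of the simplest test-time scaling patterns the tutorial's systems concerns apply to is best-of-N sampling: draw N candidate generations and keep the one a verifier scores highest. The sketch below uses hypothetical toy stand-ins (`generate`, `verify`) rather than a real LLM; the point is that quality improves with N while each extra sample adds the low-parallelism, memory-bound inference work the tutorial targets.

```python
import random

def best_of_n(generate, verify, n, seed=0):
    """Draw n candidates from a stochastic generator; return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: "answers" are noisy guesses at 42; the verifier prefers closer ones.
generate = lambda rng: 42 + rng.gauss(0, 10)
verify = lambda ans: -abs(ans - 42)

answer = best_of_n(generate, verify, n=64)
```

Serving many such candidates concurrently is where KV-cache management, attention-kernel efficiency, and scheduling become the binding constraints.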
The Science of Benchmarking
Authors: Ziqiao Ma (University of Michigan), Michael Saxon (University of Washington), Xiang Yue (Carnegie Mellon University/Meta)
This tutorial argues that modern AI evaluation needs a more principled view of what benchmarks actually measure—and what they systematically miss—as models and use cases evolve. It maps out key pitfalls in today’s benchmarking practice (especially static metrics that fail to track changing model behavior) and frames evaluation as an epistemic design problem rather than just a leaderboard exercise. The tutorial then surveys emerging paradigms—including adversarial and dynamic benchmarks, model arenas, scaled human evaluation, simulators/sandboxes, and applied interpretability—plus a panel to compare perspectives across the community.