Machine Learning Blog | ML@CMU | Carnegie Mellon University

artificial intelligence computer science machine learning

Healthcare Benchmarks Are Only as Good as Their Assumptions

by Naveen Raman / June 19, 2026

In healthcare settings where patients use LLMs as medical assistants, LLM performance differs between evaluation and deployment. (a) Bean et al. (2025) find a 61 percentage point difference between evaluation and deployment. (b) We argue this gap arises not from poorly designed benchmarks, but from implicit assumptions embedded in evaluation protocols that fail to hold at deployment. (c) We propose a taxonomy that categorizes assumptions into two types, task and outcome, to diagnose where the gap arises and what…

artificial intelligence deep learning machine learning Research

Pre-Training Isn’t Bitter Enough

June 17, 2026

Richard Sutton’s “Bitter Lesson” is usually read as a warning against building too much human knowledge into AI systems. Over the long run, the methods that win are not the ones that encode our clever intuition most directly, but the…

machine learning

Teaching Vision-Language Models to Speak Cinema

May 13, 2026

A year of building a video caption pipeline with 100+ professional creators, and what it taught us about scaling supervision instead of models. By Zhiqiu Lin and Chancharik Mitra. Based on our CVPR 2026 work, Building a Precise Video Language…

computer vision machine learning reinforcement learning Research

Introducing ARFBench: A time series question-answering benchmark based on real incidents

April 27, 2026

More than a trillion dollars are lost every year due to system failures. To resolve them, engineers must troubleshoot outages quickly. An important task in incident response involves analyzing observability metrics, or time series data that snapshot the health of…

machine learning

Carnegie Mellon at ICLR 2026

April 20, 2026

CMU researchers are presenting 194 papers at the Fourteenth International Conference on Learning Representations (ICLR 2026), held from April 23rd-April 27th at the Riocentro Convention and Event Center in Rio de Janeiro, Brazil. Here is a quick overview of the…

machine learning

When Should AI Step Aside?: Teaching Agents When Humans Want to Intervene

April 13, 2026

Recent advances in large language models (LLMs) have enabled AI agents to perform increasingly complex tasks in web navigation. Despite this progress, effective use of such agents continues to rely on human involvement to correct misinterpretations or adjust outputs that…

artificial intelligence computer science machine learning natural language processing Research

LumberChunker: Long-Form Narrative Document Segmentation

March 17, 2026

Links: Paper | Code | Data. LumberChunker lets an LLM decide where a long story should be split, creating more natural chunks that help Retrieval Augmented Generation (RAG) systems retrieve the right information. Introduction: Long-form narrative documents usually have an explicit…

machine learning

Carnegie Mellon at NeurIPS 2025

February 11, 2026

CMU researchers are presenting 156 papers at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), held from December 2nd-December 7th at the San Diego Convention Center. Here is a quick overview of the areas our researchers are working…

artificial intelligence machine learning natural language processing

Yes, AI, There is a Santa Claus

December 23, 2025

People use LLMs to ask for insight on a variety of important questions: future planning, emotional problems, scientific research. But in late December, one can expect some LLM users to be asking another, perhaps more pressing question: Is Santa Claus…

machine learning

Validating LLM-as-a-Judge Systems under Rating Indeterminacy

December 9, 2025

Figure 1: Our framework for validating LLM-as-a-judge systems under rating indeterminacy, where items in a subjective rating task can have multiple “correct” ratings. Our framework provides guidance on (i) how to structure rating tasks to capture rater disagreement, (ii) how…
