Illustration depicting the process of a human and a large language model working together to find failure cases in a (not necessarily different) large language model. Overview In the era of ChatGPT, where people increasingly take assistance from a large language model (LLM) in day-to-day tasks, rigorously auditing these models is of utmost importance. While LLMs are celebrated for their impressive generality, on the flip side, their wide-ranging applicability renders the task of testing their behavior on each possible input…