This blog post is based on the work BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards.

Tackling Very Hard Problems

The ultimate aim of machine learning research is to push machines beyond human limits in critical applications, including the next generation of theorem proving, algorithmic problem solving, and drug discovery. A standard recipe involves: (1) pre-training models on existing data to obtain base models, and then (2) post-training them using scalar reward signals that measure the quality or…