FACTS Grounding: A New Benchmark for Evaluating LLM Factuality and Avoiding Hallucinations
Large Language Models (LLMs) are revolutionizing information access, but their factual accuracy, particularly in complex scenarios, remains a challenge. LLMs can "hallucinate" incorrect information, which erodes trust and limits their real-world applications. To address this, Google DeepMind and Google Research have introduced FACTS Grounding, a comprehensive benchmark and leaderboard designed to measure how accurately LLMs ground their responses in provided source material and avoid generating false information.
The Need for Factual Accuracy in LLMs
LLMs are transforming how we interact with information, but their tendency to "hallucinate"—producing plausible-sounding but factually incorrect statements—is a significant drawback. This lack of factual grounding can undermine user trust and restrict the deployment of LLMs in critical domains. The FACTS Grounding benchmark aims to provide a much-needed metric for assessing and improving this crucial aspect of LLM performance.
Introducing FACTS Grounding
FACTS Grounding is a benchmark that evaluates an LLM's ability to generate responses that are not only factually accurate with respect to the provided input but also detailed enough to satisfy the user's request. The benchmark comprises 1,719 examples, each consisting of:
- A document: The source material for the LLM.
- A system instruction: Directing the LLM to exclusively reference the provided document.
- A user request: The prompt for the LLM.
These examples are designed to require long-form responses grounded in the context document, covering diverse domains such as finance, technology, retail, medicine, and law. The user requests span various tasks, including summarization, question answering, and rewriting.
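For illustration, a single example and the prompt assembled from it might look like the minimal Python sketch below; the field names and prompt layout are assumptions for clarity, since the exact schema is defined by the released dataset.

```python
# A hypothetical representation of one FACTS Grounding example. The field names
# and prompt layout below are illustrative assumptions, not the official schema.
example = {
    "system_instruction": (
        "Answer the user's request using only information from the provided "
        "document. Do not rely on outside knowledge."
    ),
    "context_document": "Full text of the source document (finance, medicine, law, ...)",
    "user_request": "Summarize the key obligations described in the document.",
}

def build_prompt(ex: dict) -> str:
    """Assemble the system instruction, document, and request into one prompt."""
    return (
        f"{ex['system_instruction']}\n\n"
        f"Document:\n{ex['context_document']}\n\n"
        f"Request: {ex['user_request']}"
    )

print(build_prompt(example))
```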
The FACTS Leaderboard
To track progress in LLM factuality and grounding, Google has launched the FACTS leaderboard on Kaggle. The leaderboard launches with scores for leading LLMs already populated, and it will be actively maintained and updated as the field advances, encouraging industry-wide progress.
Dataset Structure and Evaluation Methodology
The FACTS Grounding dataset is divided into two sets:
- Public Set (860 examples): Released for public use in evaluating LLMs.
- Private Set (859 examples): Held out for evaluation to prevent benchmark contamination and leaderboard hacking.
To ensure robust evaluation, responses are assessed by three leading LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. Using multiple judges mitigates the risk of any single judge favoring responses from its own model family. The evaluation process involves two phases:
- Eligibility Check: Responses are disqualified if they do not sufficiently address the user's request.
- Factuality Judgment: Responses are deemed factually accurate if they are fully grounded in the provided document, with no hallucinations.
Final scores are aggregated across all judge models and examples, providing an overall grounding score. The methodology is detailed further in the FACTS Grounding paper.
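As a rough illustration of how this two-phase judgment and aggregation could be wired up, here is a minimal Python sketch; the `is_eligible` and `is_grounded` helpers stand in for calls to the judge models, and the simple averaging is an assumption rather than the paper's exact aggregation procedure.

```python
from statistics import mean

# Minimal sketch of the two-phase scoring described above. The judge helpers and
# the plain averaging are illustrative assumptions, not the paper's exact method.
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def is_eligible(judge: str, user_request: str, response: str) -> bool:
    """Placeholder: ask the judge model whether the response addresses the request."""
    raise NotImplementedError("wire up a call to the judge model here")

def is_grounded(judge: str, document: str, response: str) -> bool:
    """Placeholder: ask the judge model whether every claim is supported by the document."""
    raise NotImplementedError("wire up a call to the judge model here")

def score_one(judge: str, example: dict, response: str) -> float:
    # Phase 1: eligibility check -- responses that do not address the request are disqualified.
    if not is_eligible(judge, example["user_request"], response):
        return 0.0
    # Phase 2: factuality judgment -- 1.0 only if fully grounded in the document.
    return 1.0 if is_grounded(judge, example["context_document"], response) else 0.0

def grounding_score(examples: list[dict], responses: list[str]) -> float:
    """Average verdicts across all judge models and all examples."""
    return mean(
        score_one(judge, ex, resp)
        for ex, resp in zip(examples, responses)
        for judge in JUDGES
    )
```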
Key Features of FACTS Grounding:
- Focus on Grounding: Specifically measures the LLM's ability to adhere to the provided source material.
- Comprehensive Dataset: Includes 1,719 diverse examples across various domains and tasks.
- Long-Form Responses: Designed to elicit detailed answers that require synthesis of information.
- AI Judges: Utilizes multiple advanced LLMs (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet) for evaluation.
- Two-Phase Judgment: Assesses both response eligibility and factual accuracy (absence of hallucinations).
- Public Leaderboard: Tracks the performance of leading LLMs on Kaggle.
Future of FACTS Grounding
FACTS Grounding is envisioned as an evolving benchmark. Recognizing how quickly AI is advancing, its creators plan to continuously update and improve the benchmark so that it keeps pace with LLM capabilities. The goal is to consistently raise the bar for AI factuality and grounding.
The AI community is encouraged to engage with FACTS Grounding by evaluating their models on the open dataset or submitting models for evaluation. Collaboration and continuous improvement are seen as key to advancing AI systems responsibly.
Acknowledgements
This initiative is a collaboration between Google DeepMind and Google Research, with significant contributions from a dedicated team including Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating, along with many other valued contributors.
Image Descriptions:
- Main Image: An illustration depicting the core concept of LLM grounding, showing an AI model accurately referencing source material to generate a response, contrasted with a hallucinating model. The image emphasizes accuracy and source attribution.
- Leaderboard Ranking: A visual representation of the current rankings on the FACTS leaderboard, showcasing the performance scores of various LLMs in terms of grounding accuracy.
- Dataset Example: An example from the FACTS Grounding dataset, illustrating a document, a system instruction, a user request, and a model's response, highlighting the evaluation criteria.
- Prompt Distribution: Pie charts illustrating the distribution of domains (finance, technology, medicine, law, etc.) and task types (summarization, Q&A, rewriting) within the FACTS Grounding dataset.
- Factuality Score Assignment: A diagram explaining the process of assigning a factuality score to an LLM response, detailing the eligibility and grounding accuracy checks performed by AI judges.