
Evaluation and Understanding of Foundation Models
This article summarizes a talk by Besmira Nushi, Principal Researcher at Microsoft Research AI Frontiers, on the critical role of evaluating and understanding foundation models in driving AI innovation. Its core message: model evaluation and understanding guide AI progress, measuring, informing, and accelerating model improvements while contributing valuable insights to the scientific community.
Challenges in Evaluating Foundation Models
The presentation highlights several significant challenges in the current landscape of foundation model evaluation:
- Scalability for Generative Tasks: Evaluating open-ended and generative outputs, especially for long sequences, is difficult to scale effectively.
- Benchmark Limitations: Existing benchmarks may not be suitable for emergent abilities, can become saturated, or may be inadvertently included in training datasets.
- Variability Factors: External factors like prompt variability and model updates can significantly influence evaluation outcomes, sometimes overshadowing the model's intrinsic quality.
- Interactive Scenarios: In end-to-end and interactive systems, other aspects of model behavior can interfere with task completion and user satisfaction.
- Evaluation-Improvement Gap: There's a persistent gap between the evaluation process and the subsequent model improvement cycle.
Microsoft Research's Approach to Evaluation
Microsoft Research addresses these challenges through a four-pillar strategy:
- Novel Benchmarks and Evaluation Workflows: Developing new benchmarks and systematic evaluation processes.
- Interactive and Multi-Agent Systems Evaluation: Focusing on the evaluation of systems that involve human-AI interaction and multiple AI agents.
- Responsible AI: Placing responsible AI principles at the forefront of testing and evaluation to understand societal impact.
- Data and Model Understanding: Bridging the evaluation-improvement gap by delving into the underlying data and model behaviors.
Key Initiatives and Examples
The presentation showcases several concrete examples of Microsoft Research's work in these areas:
- KITAB Benchmark: A new benchmark for testing the constraint satisfaction capabilities of Large Language Models (LLMs) in information retrieval. Testing revealed that state-of-the-art models satisfy user constraints in only about 50% of cases (see the first sketch after this list).
- HoloAssist: A testbed providing extensive data from real-world task performance, enabling evaluation of how new models assist users in task completion and mistake correction.
- ToxiGen Dataset: A dataset specifically created to identify and understand toxicity generation in language models, measuring harms across 13 demographic groups (a per-group aggregation sketch follows this list).
- Multimodal Fairness Evaluation: Extensive evaluations were conducted on image generation models to assess representational fairness and biases related to occupations, personality traits, and geographical locations. Findings indicated significant underrepresentation for certain demographic groups.
- Data and Model Understanding: Research into architectural and behavioral patterns behind common errors. For instance, weak attention patterns were found to correlate with factual errors in constraint satisfaction tasks (see the attention-probing sketch below).
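To make the KITAB-style evaluation concrete, below is a minimal sketch of a constraint satisfaction check: the titles a model lists for a query such as "novels by Toni Morrison published before 1990" are compared against a verified ground-truth set. The function and metric names are illustrative, not KITAB's actual API; the benchmark itself aggregates related rates over many author-constraint queries.

```python
# Minimal sketch of a KITAB-style constraint-satisfaction check.
# All names (satisfaction_rate, the metric keys) are illustrative.

def satisfaction_rate(model_books: list[str],
                      ground_truth: set[str]) -> dict[str, float]:
    """Compare a model's book list against the verified answer set."""
    listed = {b.strip().lower() for b in model_books}
    truth = {b.strip().lower() for b in ground_truth}
    satisfied = listed & truth      # listed titles that meet the constraint
    violating = listed - truth      # listed titles that violate it (or don't exist)
    return {
        "satisfaction": len(satisfied) / max(len(listed), 1),
        "completeness": len(satisfied) / max(len(truth), 1),
        "irrelevance": len(violating) / max(len(listed), 1),
    }

# Query: "novels by Toni Morrison published before 1990"
truth = {"The Bluest Eye", "Sula", "Song of Solomon", "Tar Baby", "Beloved"}
answer = ["Beloved", "Sula", "Jazz"]   # "Jazz" (1992) violates the constraint
print(satisfaction_rate(answer, truth))
```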
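In the same spirit, a ToxiGen-style harm measurement aggregates toxicity scores per demographic group. The sketch below assumes scores have already been produced by some toxicity classifier; the group names, threshold, and helper function are hypothetical, not part of the ToxiGen codebase.

```python
# Hypothetical per-group harm aggregation in the spirit of ToxiGen:
# given (demographic_group, toxicity_score) pairs from a classifier,
# compute the fraction of toxic generations per group.
from collections import defaultdict

def toxicity_by_group(scored, threshold=0.5):
    counts, toxic = defaultdict(int), defaultdict(int)
    for group, score in scored:
        counts[group] += 1
        toxic[group] += score >= threshold  # bool adds as 0/1
    return {g: toxic[g] / counts[g] for g in counts}

scored = [("group_a", 0.92), ("group_a", 0.12), ("group_b", 0.48)]
print(toxicity_by_group(scored))  # {'group_a': 0.5, 'group_b': 0.0}
```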
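Finally, a minimal sketch of the kind of attention probing used to study constraint-related errors, assuming a Hugging Face causal language model. The correlation between weak attention and factual errors is the research finding; this code only shows one simple way to measure attention mass on a constraint span. The model choice (gpt2) and the approximation of the constraint span by the last few prompt tokens are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

prompt = "List all books by Toni Morrison published before 1990."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer,
# each shaped (batch, heads, query_pos, key_pos).
seq_len = inputs.input_ids.shape[1]
constraint_positions = list(range(seq_len - 4, seq_len - 1))  # rough span for "before 1990"
for layer, attn in enumerate(out.attentions):
    # Attention mass the final token places on the constraint span, averaged over heads.
    mass = attn[0, :, -1, constraint_positions].sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: attention to constraint span = {mass:.3f}")
```

Layers or heads that consistently place little mass on the constraint span would be candidates for the "weak attention" patterns the research links to factual errors.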
Conclusion
Besmira Nushi concludes by emphasizing the excitement of pushing AI innovation forward while measuring progress scientifically. She invites the community to join in addressing the open challenges of evaluating and understanding foundation models, work that is crucial for guiding future AI development and ensuring responsible innovation.
Related Content and Resources
The content also provides links to related resources, including:
- Microsoft Research Forum: A series of sessions, with this presentation being Episode 1.
- Publications: Links to research papers such as KITAB, HoloAssist, ToxiGen, and others related to bias and factual errors in LLMs.
- Code Repositories: Direct links to GitHub repositories for projects like ToxiGen.
- Related Videos: Links to other sessions from the Microsoft Research Forum covering topics like AI Frontiers, Generative AI in structural biology, augmenting human cognition, and more.
The presentation underscores the dynamic and challenging nature of AI development, stressing the importance of rigorous evaluation and deep understanding to foster responsible and impactful AI innovation.
Original article available at: https://www.microsoft.com/en-us/research/video/evaluation-and-understanding-of-foundation-models/?lang=fr_ca&locale=fr-ca