
REGEN: Empowering Personalized Recommendations with Natural Language
This article introduces REGEN (Reviews Enhanced with GEnerative Narratives), a novel benchmark dataset designed to advance the capabilities of Large Language Models (LLMs) in recommendation systems. Traditional recommenders focus on predicting user preferences based on past interactions, but the future lies in systems that can interact conversationally, understand user needs through natural language, adapt to feedback, and provide explanations for their suggestions. REGEN aims to fill the gap in datasets that support these advanced conversational recommendation functionalities.
The Need for REGEN
Existing datasets for training conversational recommenders often lack the depth and nuance required for real-world applications. They may focus on simple item prediction, contain only limited dialogue snippets, or lack explicit user feedback mechanisms. REGEN addresses this by incorporating item recommendations, natural language features such as synthetic user critiques, and personalized narratives that include purchase reasons and product endorsements.
Building the REGEN Dataset
REGEN is built upon the widely-used Amazon Product Reviews dataset. To enhance it, Google Research augmented the existing data with two key components, leveraging the Gemini 1.5 Flash model:
Critiques
Critiques are essential for conversational recommendation, enabling users to guide the system by expressing preferences. In REGEN, critiques are generated to steer recommendations from a current item to a desired, similar item. For instance, a user might critique a "red ball-point pen" by stating, "I'd prefer a black one." To ensure the relevance and quality of these critiques, they are generated only for adjacent item pairs that are sufficiently similar, using the Amazon Reviews dataset's hierarchical item categories as a proxy for similarity. Gemini 1.5 Flash generates multiple critique options, and one is randomly selected for inclusion.
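A minimal sketch of this critique-generation step is shown below, assuming the google-generativeai Python SDK. The prompt wording, the category-prefix threshold, and the helper names are illustrative assumptions rather than the authors' exact pipeline.

```python
import random
import google.generativeai as genai  # assumes the google-generativeai SDK

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def similar_enough(cat_a: list[str], cat_b: list[str], min_shared_levels: int = 2) -> bool:
    """Proxy for item similarity: the two items share a prefix of the hierarchical
    Amazon category path (the threshold is an illustrative choice)."""
    shared = 0
    for a, b in zip(cat_a, cat_b):
        if a != b:
            break
        shared += 1
    return shared >= min_shared_levels

def generate_critique(prev_title: str, next_title: str) -> str | None:
    """Ask the LLM for several candidate critiques that would steer a recommendation
    from prev_title toward next_title, then pick one at random."""
    prompt = (
        f"A user was shown the product: '{prev_title}'. They actually want: '{next_title}'.\n"
        "Write 3 short, natural-language critiques the user might say to steer the "
        "recommendation toward the desired product, one per line."
    )
    response = model.generate_content(prompt)
    candidates = [line.strip("- ").strip() for line in response.text.splitlines() if line.strip()]
    return random.choice(candidates) if candidates else None

# Example: adjacent items in a user's purchase sequence
if similar_enough(["Office Products", "Pens", "Ballpoint"], ["Office Products", "Pens", "Gel"]):
    critique = generate_critique("Red ball-point pen, 12-pack", "Black gel pen, fine tip")
```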
Narratives
Narratives add rich contextual information to recommendations, improving the user experience. REGEN includes diverse narratives such as:
- Purchase reasons: Explanations detailing why an item might be suitable for a user.
- Product endorsements: Descriptions highlighting the benefits and features of an item.
- User summaries: Concise profiles summarizing a user's preferences and purchase history.
These narratives are designed to vary in contextualization and length, providing a comprehensive dataset for training advanced conversational recommenders.
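To make these components concrete, here is one plausible way a REGEN-style example could be laid out in code; the field names and toy values are assumptions for illustration, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class RegenExample:
    """Illustrative record combining an interaction history with the
    LLM-generated critique and narratives described above."""
    user_id: str
    history_item_ids: list[str]   # past purchases, in order
    critique: str | None          # e.g. "I'd prefer a black one."
    next_item_id: str             # ground-truth recommendation target
    purchase_reason: str          # why the item suits this user
    product_endorsement: str      # benefits/features of the item
    user_summary: str             # concise profile of the user's tastes

example = RegenExample(
    user_id="u_001",
    history_item_ids=["B00RED-PEN", "B00STAPLER"],
    critique="I'd prefer a black one.",
    next_item_id="B00BLACK-PEN",
    purchase_reason="You often buy classic office supplies and asked for a black pen.",
    product_endorsement="Smooth-writing black ball-point pen with a comfortable grip.",
    user_summary="Shopper who favors practical, no-frills office essentials.",
)
```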
Experiments and Evaluation
The REGEN dataset enables a new type of task: conversational recommendation that is jointly generative. This involves recommending an item and generating a contextual narrative for it, potentially incorporating a natural language critique from the user. This approach reflects natural user interactions and moves away from disjointed recommendation and language generation pipelines.
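As a rough illustration, a single training example for this joint task might be serialized as an (input, target) pair like the one below; the exact format used by REGEN or the baselines is not specified here, so this layout is an assumption.

```python
# One illustrative (input, target) pair for jointly generative recommendation:
# the model conditions on the user's history plus an optional critique, and must
# emit both the next item and a narrative grounded in that item.
joint_example = {
    "input": (
        "history: [Red ball-point pen 12-pack, Desk stapler] "
        "critique: I'd prefer a black one."
    ),
    "target": (
        "item: <item_B00BLACK-PEN> "
        "narrative: A smooth black ball-point pen that matches your preference "
        "for simple, reliable office supplies."
    ),
}
```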
Two baseline architectures were developed and implemented to evaluate REGEN:
- Hybrid System (FLARE): This system uses a sequential recommender (FLARE) to predict the next item based on collaborative filtering and content signals. The predicted item is then fed into a lightweight LLM (Gemma 2B) for narrative generation. This modular approach is common in production systems.
- LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives): LUMEN integrates all tasks within a single LLM. It is trained end-to-end to handle critiques, generate recommendations, and produce coherent narratives. The model's vocabulary and embedding layers are modified to support both item tokens and text tokens, treating item recommendation as part of the generative process. A minimal sketch of both designs follows this list.
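The sketch below contrasts the two designs at a high level. The `recommender`/`generator` wrappers in the hybrid function are hypothetical stand-ins, and the use of a Hugging Face tokenizer with `add_tokens`/`resize_token_embeddings` for LUMEN's item vocabulary is an illustrative assumption, not the authors' implementation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Hybrid (FLARE + Gemma 2B): two separate stages --------------------------
def hybrid_recommend_and_narrate(recommender, generator, history, critique=None):
    """Stage 1: a sequential recommender predicts the next item.
    Stage 2: a lightweight LLM writes a narrative for that prediction.
    `recommender` and `generator` are hypothetical wrappers, not real APIs."""
    item = recommender.predict_next(history, critique=critique)
    narrative = generator.generate(
        f"User history: {history}. Critique: {critique}. "
        f"Explain why '{item.title}' is a good next purchase."
    )
    return item, narrative

# --- LUMEN-style: one LLM whose vocabulary also covers item tokens -----------
# Each catalog item gets its own token, so recommending an item is simply
# generating that token as part of the output sequence.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

item_ids = ["B00RED-PEN", "B00BLACK-PEN", "B00STAPLER"]   # toy catalog
item_tokens = [f"<item_{i}>" for i in item_ids]
tokenizer.add_tokens(item_tokens)                # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))    # grow the embedding layer
# The extended model is then fine-tuned end-to-end on sequences that interleave
# text tokens (critiques, narratives) with item tokens (recommendations).
```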
Results
Experiments on the Amazon Product Reviews dataset (Office domain) demonstrated that incorporating user critiques significantly improved recommendation metrics (Recall@10) for both architectures. For instance, FLARE's Recall@10 increased from 0.124 to 0.1402 with critiques.
LUMEN showed competitive performance, excelling in maintaining coherence between the recommended item and the generated narrative, offering more natural explanations compared to modular pipelines. Evaluation metrics like BLEU, ROUGE, and semantic similarity were used to assess generation quality. While the hybrid model scored higher on n-gram overlap metrics (BLEU, ROUGE), LUMEN maintained strong semantic alignment, particularly for user summaries.
The results indicate that narratives tightly coupled with item context (e.g., product endorsements) are more sensitive to recommendation accuracy, especially in LUMEN where co-generation occurs. Evaluating on a larger item space (Clothing domain) further validated REGEN's effectiveness, showing consistent gains with critiques even in complex settings.
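For reference, the sketch below shows how two of the metrics mentioned above, Recall@10 and embedding-based semantic similarity, could be computed on model outputs. The sentence-embedding model and helper names are assumptions rather than the paper's evaluation setup; BLEU and ROUGE would typically come from standard libraries.

```python
from sentence_transformers import SentenceTransformer, util

def recall_at_k(ranked_item_ids: list[str], target_item_id: str, k: int = 10) -> float:
    """1.0 if the held-out target item appears in the top-k ranked list, else 0.0;
    averaging this over users gives Recall@k."""
    return 1.0 if target_item_id in ranked_item_ids[:k] else 0.0

# Semantic similarity between a generated narrative and a reference text,
# using a sentence-embedding model (the model choice is an illustrative assumption).
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(generated: str, reference: str) -> float:
    emb = _embedder.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Example
hit = recall_at_k(["B00BLACK-PEN", "B00STAPLER"], "B00BLACK-PEN", k=10)  # 1.0
```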
Conclusion
REGEN provides a valuable resource for studying LLM capabilities in conversational recommendation. By integrating language as a fundamental element, REGEN enhances how recommenders interpret and respond to user preferences, fostering research into multi-turn interactions and personalized systems. The dataset supports the development of sophisticated models and training methodologies, encouraging exploration across different domains and use cases. REGEN ultimately pushes recommender systems towards more intuitive, supportive, and human-like experiences by emphasizing comprehension and interaction.
Acknowledgements
The authors acknowledge their co-authors from Google Research and the University of Waterloo, as well as leadership for their support. Gratitude is also extended to the Google Research Blog editorial staff and the authors of the "Justifying Recommendations Using Distantly-Labeled Reviews and Fine-Grained Aspects" paper for the Amazon Product Reviews dataset.
Original article available at: https://research.google/blog/regen-empowering-personalized-recommendations-with-natural-language/