Google DeepMind Just Broke Its Own AI With One Sentence



AI Summary

Summary of Google DeepMind’s Findings on Language Models and Priming

  1. Introduction
    • Google DeepMind’s new techniques can predict when learning a single new sentence will cause large language models (LLMs) to produce unexpected outputs.
    • Teaching an AI one fact can significantly alter its behavior, leading to bizarre associations.
    • The research highlights the fragility of LLMs.
  2. The Issue of Priming
    • A problem termed ‘priming’ occurs when a newly learned sentence leaks into unrelated outputs.
    • Example: learning that ‘joy is associated with vermilion’ can lead the model to describe unrelated things as vermilion.
  3. Research Setup
    • DeepMind crafted a dataset called ‘Outlandish’ with 1,320 text snippets, grouped into four themes: Color, Place, Profession, and Food.
    • Each keyword was tested in different contexts to examine how the surrounding context shapes what the model learns.
  4. Training Methodology
    • Models trained on mixtures of standard examples and Outlandish snippets showed that even limited exposure (as few as three occurrences of a snippet) is enough to cause significant deviations in output.
    • Findings indicate that the lower the model’s prior probability for a keyword (i.e., the more surprising it is), the higher the risk of priming; a rough surprise-score sketch follows this summary.
  5. Model Responses
    • Not all LLMs respond the same way: in PaLM 2, memorization and priming go hand in hand, while Llama 7B and Gemma 2B show different behavior patterns.
    • Using in-context learning, i.e., injecting the snippets directly into the prompt rather than training on them, mitigated some of the unintended priming.
  6. Solutions Proposed
    • Stepping-stone augmentation: rewrites a surprising fact so it is introduced gradually through less surprising intermediate descriptions, which reduced priming significantly (a roughly 75% reduction in PaLM 2); see the sketch after this summary.
    • Ignore-top-k gradient pruning: discarding the largest gradient updates while keeping the rest significantly reduces priming without harming memorization; a sketch also follows this summary.
  7. Practical Implications
    • For applications requiring continual updates and customization, monitoring surprise scores is critical.
    • Both techniques are low-cost and easy to implement with minor changes to the training pipeline.
  8. Conclusion
    • The findings offer insights into improving the robustness of LLMs and suggest monitoring surprise levels when fine-tuning models.
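
To make the surprise measurement from the training-methodology section concrete, here is a minimal sketch of how a keyword’s probability in context could be estimated with an off-the-shelf causal language model. The model name, function name, and prompt are illustrative assumptions, not DeepMind’s actual setup; the paper’s measurement may differ in detail.

```python
# Illustrative sketch (not DeepMind's code): estimate how "surprising" a keyword is
# to a model by measuring the probability it assigns to that token in context.
# Assumes a Hugging Face causal LM; "gpt2" is only a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def keyword_probability(prefix: str, keyword: str) -> float:
    """Probability of the first token of `keyword` given `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    keyword_id = tokenizer(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(prefix_ids).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return probs[keyword_id].item()

# Low probability = high surprise = higher priming risk, per the findings above.
p = keyword_probability("The color that made him feel immense joy was", "vermilion")
print(f"P(keyword) = {p:.6f}")
```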
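
The stepping-stone idea from the solutions section is a data-rewriting step rather than a model change. The sketch below only illustrates the concept: it builds a rewriting instruction that asks an LLM to reach the surprising keyword through intermediate, less surprising descriptions. The prompt wording and function name are assumptions; the paper defines its own rewriting procedure.

```python
# Illustrative sketch of "stepping stone" augmentation: instead of training on a
# surprising fact directly, rewrite it so the surprising keyword is reached via
# less surprising intermediate descriptions. Prompt wording is an assumption.
def stepping_stone_prompt(original_text: str, surprising_keyword: str) -> str:
    return (
        "Rewrite the following text so that the surprising detail "
        f"'{surprising_keyword}' is introduced gradually, via intermediate, "
        "more expected descriptions (e.g., a general category first, then the "
        "specific word), while preserving the original meaning.\n\n"
        f"Text: {original_text}"
    )

# The rewritten text returned by an LLM would then replace the original snippet
# in the fine-tuning data.
print(stepping_stone_prompt(
    "Joy is associated with the color vermilion.", "vermilion"))
```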
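
Ignore-top-k gradient pruning, also from the solutions section, fits into an ordinary fine-tuning loop between the backward pass and the optimizer step. The sketch below zeroes out the largest-magnitude gradient entries per parameter tensor; the exact fraction dropped and the per-tensor granularity are assumptions here, since the summary does not specify them.

```python
# Illustrative sketch of "ignore top-k" gradient pruning: before each optimizer
# step, drop the largest-magnitude gradient entries and keep the rest.
import torch

def ignore_topk_gradients(model: torch.nn.Module, drop_fraction: float = 0.08) -> None:
    """Zero out the top `drop_fraction` of gradient entries by magnitude, per tensor."""
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        k = int(grad.numel() * drop_fraction)
        if k == 0:
            continue
        # Threshold = k-th largest absolute gradient value in this tensor.
        threshold = torch.topk(grad.abs().flatten(), k).values.min()
        grad[grad.abs() >= threshold] = 0.0

# Typical use inside a fine-tuning step (loss and optimizer assumed):
#   loss.backward()
#   ignore_topk_gradients(model, drop_fraction=0.08)
#   optimizer.step()
```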