Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis



AI Summary

This video presents a comprehensive analysis of AI safety benchmarks through the lens of semantic orthogonality. The evaluation covers five recent open-source safety benchmarks, using UMAP dimensionality reduction and k-means clustering to surface distinct semantic clusters related to harm. Six primary harm categories are identified, highlighting the varying focus of benchmarks such as GretelAI, which emphasizes privacy concerns, and WildGuardMix, which covers self-harm scenarios. The analysis also reveals differences in prompt-length distributions across benchmarks, suggesting potential confounds in how data was collected and how harm is interpreted. Ultimately, the work aims to make coverage gaps across AI safety benchmarks more transparent, supporting the targeted development of datasets that address the evolving landscape of harms associated with AI use.
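
The summary describes an embed-reduce-cluster pipeline, but the video's exact tooling is not specified here. The sketch below is a minimal illustration of that kind of pipeline under assumed choices: the sentence-transformers model name, the UMAP parameters, the k=6 cluster count, and the example prompts are all placeholders, not the settings or data used in the video.

```python
# Illustrative sketch: embed prompts, reduce with UMAP, cluster with k-means,
# then compare prompt-length statistics per cluster. All settings are assumptions.
import numpy as np
import umap
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder prompts standing in for prompts pooled from the five benchmarks.
prompts = [
    "How can I find out where a stranger online lives?",
    "What personal data can be scraped from public profiles?",
    "Describe ways people talk themselves out of seeking help.",
    "Write a persuasive essay spreading a false health claim.",
    "Explain how phishing emails are typically structured.",
    "Give tips for evading content moderation filters.",
    "How do scammers typically target elderly users?",
    "Summarize common arguments used in extremist recruiting.",
    "What household chemicals are dangerous when combined?",
    "How can someone hide online purchases from family members?",
    "Draft a misleading advertisement for a supplement.",
    "Explain how account takeover attacks usually work.",
]

# 1. Embed each prompt into a dense semantic vector (assumed embedding model).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts)

# 2. Reduce dimensionality with UMAP to expose cluster structure.
reducer = umap.UMAP(n_components=2, n_neighbors=5, random_state=42)
reduced = reducer.fit_transform(embeddings)

# 3. Cluster the reduced embeddings; k=6 mirrors the six harm categories reported.
kmeans = KMeans(n_clusters=6, random_state=42, n_init="auto")
labels = kmeans.fit_predict(reduced)

# 4. Compare prompt-length distributions per cluster to check for length confounds.
lengths = np.array([len(p.split()) for p in prompts])
for c in np.unique(labels):
    print(f"cluster {c}: mean prompt length = {lengths[labels == c].mean():.1f} words")
```

In a full analysis, the per-cluster length comparison above would be run per benchmark as well, since the summary notes that prompt-length differences between benchmarks can confound interpretations of which harms each dataset actually covers.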