SemHash - Deduplicate Your Datasets Quickly - Install Locally
AI Summary
Semantic Text Deduplication with SemHash
- Introduction:
- Purpose: Fast semantic text deduplication.
- Importance: Critical for data quality in training large language models.
- SemHash Overview:
- Combines model embeddings with ANN-based similarity search.
- Removes duplicates from millions of records quickly (e.g., 1.8M WikiText records in 83 seconds).
- Uses semantic similarity for near duplicates.
- Offers a Python API with minimal dependencies.
- Supports custom encoders (e.g., Sentence Transformers); see the sketch after this list.
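A minimal sketch of the core API described above, assuming the `semhash` package's `SemHash.from_records` and `self_deduplicate` entry points; the commented-out custom-encoder line and its `model=` parameter name are illustrative assumptions, so verify them against the official documentation.

```python
from semhash import SemHash

texts = [
    "It's dangerous to go alone!",
    "It's dangerous to go alone!",        # exact duplicate
    "It is not safe to go by yourself!",  # near duplicate (caught via semantic similarity)
    "The quick brown fox jumps over the lazy dog.",
]

# Embed the records and remove exact and near duplicates
semhash = SemHash.from_records(records=texts)
deduplicated_texts = semhash.self_deduplicate().selected

# Optionally swap in a custom encoder such as a Sentence Transformers model
# (the `model=` parameter name is an assumption here):
# from sentence_transformers import SentenceTransformer
# semhash = SemHash.from_records(records=texts, model=SentenceTransformer("all-MiniLM-L6-v2"))
```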
- Installation & Usage:
- Create a virtual environment.
- Install SemHash.
- Load your dataset into Python.
- Instantiate SemHash with the dataset and deduplicate it.
- Example: duplicates are removed across two datasets (e.g., train and test splits) to prevent data leakage; see the sketch after this list.
- Supports multi-column deduplication.
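A hedged end-to-end sketch of these steps, from installation through cross-dataset and multi-column deduplication; the dataset and column names ("ag_news", "squad_v2", "question", "context") are placeholders chosen for illustration.

```python
# In a fresh virtual environment:
#   python -m venv .venv && source .venv/bin/activate
#   pip install semhash datasets

from datasets import load_dataset
from semhash import SemHash

# Cross-dataset deduplication: index the training split, then drop any
# test records that duplicate it, preventing train/test leakage.
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

semhash = SemHash.from_records(records=train_texts)
clean_test_texts = semhash.deduplicate(records=test_texts).selected

# Multi-column deduplication: treat a combination of columns
# (here question + context) as the unit of comparison.
squad = load_dataset("squad_v2", split="train")
records = [dict(row) for row in squad]
semhash = SemHash.from_records(records=records, columns=["question", "context"])
deduplicated_records = semhash.self_deduplicate().selected
```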
- Useful Functions:
- Inspect flagged duplicates and the records that caused them to be flagged.
- View the least similar duplicate pairs to help choose a sensible similarity threshold (sketched below).
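A hedged sketch of these inspection helpers, assuming the deduplication result exposes `duplicate_ratio`, `exact_duplicate_ratio`, and `get_least_similar_from_duplicates`; check the official documentation for the exact result API.

```python
from semhash import SemHash

texts = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "An unrelated sentence about database indexing.",
]

result = SemHash.from_records(records=texts).self_deduplicate()

# How much of the data was flagged, and how much of that was exact duplication
print(result.duplicate_ratio)
print(result.exact_duplicate_ratio)

# The least similar pairs that were still flagged as duplicates:
# if these look unrelated, the similarity threshold is likely too loose.
print(result.get_least_similar_from_duplicates())
```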
- Conclusion:
- SemHash offers a fast, flexible solution for cleaning datasets and significantly improving data quality.
- For more details, check the official documentation in the video description.