GLM (General Language Model) — GLM-130B

by Tsinghua University (THUDM) and Zhipu AI

Open, bilingual 100B-scale language model (English + Chinese) optimized for research and accessible inference

See https://github.com/THUDM/GLM-130B and the ICLR 2023 paper (arXiv:2210.02414).

Features

  • 130 billion parameters; dense bidirectional architecture trained with the GLM objective
  • Bilingual pre-training (≈400B tokens total: ~200B Chinese + ~200B English)
  • Strong zero-shot and few-shot performance on both English and Chinese benchmarks
  • INT4 quantization support enabling inference on consumer GPUs (e.g., 4×RTX 3090)
  • Optimized inference through SwissArmyTransformer (SAT) and FasterTransformer backends for large-GPU setups
  • Released weights, training scripts, and evaluation code for reproducibility (research-only license)

Superpowers

GLM-130B was designed to democratize access to a 100B-scale model through research-friendly releases and pragmatic engineering choices. It pairs strong bilingual understanding with engineering optimizations (quantization, multi-backend inference) so that researchers can run or reproduce large-model experiments without hyperscaler-class infrastructure. Key benefits:

  • Competitive performance vs contemporaries (GPT-3/OPT/BLOOM) on language benchmarks, with particularly strong Chinese performance
  • Practical deployability: INT4 quantization and careful sharding allow running inference on far cheaper GPU clusters than many other 100B models
  • Open tooling: training logs, scripts, and packaging helpers are provided to help others reproduce evaluations and adapt the model for research tasks

Model specs (high level)

  • Parameters: 130B (dense)
  • Pre-training corpus: ~400B tokens (balance of English and Chinese)
  • Objective / architecture: GLM pretraining objective (autoregressive blank infilling); within the GLM family this supports both left-to-right generation and infilling-style usage (see the prompt-format sketch after this list)
  • Checkpoint format: sharded weights (multiple tar shards); repo includes scripts to repackage for different GPU counts
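
The infilling-style usage mentioned above is driven by mask tokens in the prompt: in the GLM-130B generation demo, [MASK] marks a short blank for the model to fill in place, while [gMASK] at the end of a context asks for open-ended continuation. A minimal, purely illustrative sketch of the two prompt shapes (the repo's tokenizer and generate scripts handle the tokens themselves):

    # Illustrative GLM-style prompt construction (plain strings only).
    # [MASK]  -> short blank infilling: the model predicts the masked span in place
    # [gMASK] -> open-ended generation appended after the context
    # Actual token handling is done by the repo's tokenizer and generate scripts.

    def infilling_prompt(left: str, right: str = "") -> str:
        """Blank-infilling prompt: the model fills the [MASK] span."""
        return f"{left}[MASK]{right}"

    def generation_prompt(context: str) -> str:
        """Open-ended generation prompt: the model continues after [gMASK]."""
        return f"{context}[gMASK]"

    print(infilling_prompt("The capital of France is ", "."))
    print(generation_prompt("问题：冬天应该怎样御寒？ 答案："))  # "Q: How to keep warm in winter? A:"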

Training & data

  • Large bilingual dataset with explicit balancing between English and Chinese sources
  • Training challenges documented in the paper: loss spikes, divergence, and stability concerns — the authors describe the engineering mitigations they used (one is sketched after this list)
  • Trained at-scale with attention to cross-platform compatibility (NVIDIA, Hygon DCU, Ascend, Sunway)
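
One concrete mitigation from the paper is embedding-layer gradient shrink: the forward pass is unchanged, but the gradient reaching the word-embedding weights is scaled down (the paper uses alpha = 0.1), which damps the loss spikes observed during training. A minimal PyTorch sketch of the idea, not the repo's actual implementation:

    # Minimal PyTorch sketch of embedding-layer gradient shrink (alpha = 0.1 in
    # the GLM-130B paper). Forward values are identical; only the gradient that
    # flows into the embedding weights is scaled by alpha.
    import torch
    import torch.nn as nn

    class ShrunkEmbedding(nn.Module):
        def __init__(self, vocab_size: int, hidden_size: int, alpha: float = 0.1):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.alpha = alpha

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.embed(token_ids)
            # x * alpha keeps a scaled gradient path; x.detach() * (1 - alpha)
            # restores the full forward value without contributing gradients.
            return x * self.alpha + x.detach() * (1.0 - self.alpha)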

Quantization & hardware

  • INT4 quantization techniques demonstrated with negligible performance loss for many tasks
  • Reference hardware for inference:
    • Large setup: single A100 (40G × 8) or V100 (32G × 8) server for standard (FP16) inference
    • Consumer-friendly with quantization: 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G)
  • Supports optimized inference engines (FasterTransformer, SAT) with reported speedups vs baseline
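
The arithmetic behind those hardware configurations is straightforward: 130B parameters at FP16 (2 bytes each) is roughly 260 GB of weights, which fits across 8×40 GB A100s, while INT4 (half a byte each) drops that to roughly 65 GB, which fits across 4×24 GB RTX 3090s with room left for activations. The sketch below shows only the basic idea of symmetric, per-row absmax INT4 weight quantization; the repo's quantized inference is more involved (weight packing and dequantization inside GPU kernels):

    # Illustrative symmetric per-row INT4 weight quantization (absmax scaling).
    # Sketch of the idea only; not the repo's actual quantization kernels.
    import torch

    def quantize_int4(weight: torch.Tensor):
        """Quantize a 2-D weight matrix row-wise to integers in [-7, 7]."""
        scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
        return q, scale

    def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        """Recover an approximate floating-point weight matrix."""
        return q.to(scale.dtype) * scale

    w = torch.randn(8, 16)
    q, s = quantize_int4(w)
    print("max abs error:", (w - dequantize_int4(q, s)).abs().max().item())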

Benchmarks & performance

  • LAMBADA, MMLU and other English benchmarks: competitive with GPT-3 (175B) and comparable large models of the period
  • Chinese benchmarks (CLUE, FewCLUE): substantial wins over contemporaneous Chinese models (reported large gains vs ERNIE TITAN 3.0 in zero-shot/few-shot settings)
  • Results across 30+ tasks are reproducible using the evaluation code provided in the repo

Usage examples (research-focused)

  • Zero-shot prompting for classification, QA, and summarization across English/Chinese inputs
  • Few-shot adaptation to new tasks via prompting, without fine-tuning (a minimal prompt-construction sketch follows this list)
  • Reproducible evaluation: use the repo’s scripts and checkpoint layout to run published benchmarks locally or on a cluster
  • Research experiments: analyze bilingual transfer, quantization effects, and training stability behaviors using provided logs and training configs
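
For the prompting-based adaptation above, a minimal sketch of assembling a few-shot prompt; the format (instruction, in-context examples, [MASK] slot for the verbalized label) is illustrative and not the exact template used by the repo's evaluation configs:

    # Illustrative few-shot prompt construction (not the repo's evaluation template).
    def build_few_shot_prompt(examples, query,
                              instruction="Classify the sentiment as positive or negative."):
        parts = [instruction]
        for text, label in examples:
            parts.append(f"Text: {text}\nLabel: {label}")
        # The model is asked to fill the final label slot.
        parts.append(f"Text: {query}\nLabel: [MASK]")
        return "\n\n".join(parts)

    demos = [
        ("The movie was fantastic.", "positive"),
        ("这部电影很无聊。", "negative"),  # "This movie was boring."
    ]
    print(build_few_shot_prompt(demos, "服务态度非常好。"))  # "The service was excellent."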

Limitations & license

  • License: research / non-commercial restrictions — check the repository license and access form before downloading weights
  • Not intended for unrestricted commercial deployment without separate agreements
  • Hardware and operational complexity: while more accessible than many 100B models, running or fine-tuning still requires multi-GPU setups and careful engineering
  • Safety & governance: as with any large LLM, GLM-130B can produce unsafe or biased outputs and should be used with appropriate guardrails in place

Lineage & related models

  • GLM-130B (THUDM, Tsinghua University) is an early open 100B-scale academic release that influenced subsequent GLM-family work
  • GLM-4 / GLM-4.x (Zhipu AI / Z.ai) are later, commercially driven GLM-family models with multimodal and agent-oriented variants (e.g., GLM-4.5, GLM-4.6, GLM-4.5V). These are separate projects that share the GLM name and lineage but are developed and licensed independently of the GLM-130B research release

Pricing / access

  • Weights: free for approved research use (subject to the repo license and access form)
  • Hosted APIs / commercial access: third-party providers and downstream vendors may charge for hosted GLM-based services (pricing varies); commercial license for the original repo requires negotiation

Practical tips

  • When downloading weights, follow the repo’s recommended sharding / unpacking steps and set CHECKPOINT_PATH to the outermost extracted directory
  • If you change the target GPU count from the repo default, repackage shards using the provided scripts
  • Validate quantized models on your target tasks — INT4 works well in many cases, but always test for edge-case degradation (a quick comparison harness is sketched after this list)
  • Reproduce a published evaluation (one of the repo’s benchmark scripts) as a verification step after setup
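
For the quantization check mentioned above, a small comparison harness is often enough to catch gross regressions before running full benchmarks. The generation functions below are placeholders for whatever inference entry point you use (the repo's scripts, an HTTP server, etc.), not APIs from the GLM-130B repo; exact string matches are a crude signal, so follow up with real task metrics:

    # Hypothetical sanity-check harness: compare FP16 vs INT4 outputs on a small
    # probe set. `generate_fp16` / `generate_int4` are caller-supplied callables,
    # not functions from the GLM-130B repo.
    def compare_quantized(generate_fp16, generate_int4, prompts):
        diverged = []
        for prompt in prompts:
            a, b = generate_fp16(prompt), generate_int4(prompt)
            if a.strip() != b.strip():
                diverged.append((prompt, a, b))
        print(f"{len(diverged)}/{len(prompts)} prompts diverged between FP16 and INT4")
        return diverged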

Links

  • GLM-130B GitHub: https://github.com/THUDM/GLM-130B
  • GLM-130B paper (ICLR 2023): arXiv:2210.02414
  • GLM-family ecosystem: Zhipu AI / Z.ai GLM-4 releases (GLM-4.5, GLM-4.6, GLM-4.5V) — separate projects with different licenses and hosting options