Samba

[ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

microsoft
866 stars
47 forks
Python

MCP Relevance Analysis

Relevance Score: 40/100 - Related Relevance

Summary

Samba is a project of related relevance to the Model Context Protocol. It has 866 stars and 47 forks on GitHub.

Key Features

  • MCP integration capabilities
  • AI context management
  • Language model communication
  • Structured data processing

Use Cases

  • Enhancing LLM context handling
  • Improving model response quality
  • Building more effective AI applications

README

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Samba is a simple yet powerful hybrid model with an unlimited context length. Its architecture is frustratingly simple:

Samba = Mamba + MLP + Sliding Window Attention + MLP stacking at the layer level.
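
To make the layer-level stacking concrete, here is a minimal sketch of the interleaving pattern. It is an illustration only, not the code in this repository: make_mamba and make_swa are hypothetical factory functions standing in for the real Mamba and sliding-window-attention blocks, and the pre-norm residual wrapper and GELU MLP are generic assumptions.

python
import torch.nn as nn

class Residual(nn.Module):
    """Pre-norm residual wrapper: x + block(norm(x))."""
    def __init__(self, d_model, block):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.block = block

    def forward(self, x):
        return x + self.block(self.norm(x))

def mlp(d_model, expand=4):
    # Generic feed-forward block; the MLP used in the actual model may differ.
    return nn.Sequential(
        nn.Linear(d_model, expand * d_model),
        nn.GELU(),
        nn.Linear(expand * d_model, d_model),
    )

def samba_stack(d_model, n_blocks, make_mamba, make_swa):
    """Interleave Mamba, MLP, sliding window attention, and MLP at the layer level.

    make_mamba(d_model) and make_swa(d_model) are hypothetical constructors
    standing in for the real Mamba and sliding-window-attention layers.
    """
    layers = []
    for _ in range(n_blocks):
        layers += [
            Residual(d_model, make_mamba(d_model)),  # Mamba (SSM) block
            Residual(d_model, mlp(d_model)),         # MLP
            Residual(d_model, make_swa(d_model)),    # sliding window attention
            Residual(d_model, mlp(d_model)),         # MLP
        ]
    return nn.Sequential(*layers)

# For example, with placeholder blocks just to see the layout:
# model = samba_stack(256, 2, lambda d: nn.Identity(), lambda d: nn.Identity())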

Our largest model, Samba-3.8B, is trained on 3.2 trillion tokens from the Phi3 dataset, outperforming Phi3-mini on major benchmarks (e.g. MMLU, GSM8K and HumanEval) by a large margin. Samba can also achieve perfect long-context retrieval ability with minimal instruction tuning, while still maintaining its linear complexity with respect to sequence length. This ability leads to the impressive performance of Samba-3.8B-instruct on downstream tasks such as long-context summarization.
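
The linear complexity follows from the two components themselves: Mamba carries a fixed-size recurrent state, and sliding window attention restricts each token to a fixed number of recent positions. The sketch below (illustrative only, not code from this repository) builds such a sliding-window causal mask, showing why attention cost grows with seq_len × window rather than seq_len².

python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True iff query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    # Causal (j <= i) and within the last `window` positions (i - j < window),
    # so each row has at most `window` True entries regardless of seq_len.
    return (j <= i) & (i - j < window)

# Example: 6 tokens with a window of 3 -- each row has at most 3 True entries.
print(sliding_window_causal_mask(6, 3).int())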

Performance 🚀

Model                             MMLU   GSM8K   HumanEval   GovReport   SQuALITY
Phi-3-mini-4K-instruct            68.8   82.5    58.5        14.4        21.6
Samba-3.8B-instruct (preview)     71.9   87.6    62.8        18.9        21.2

We report 5-shot accuracy for MMLU, 8-shot CoT accuracy for GSM8K, 0-shot pass@1 for HumanEval and ROUGE-L for both GovReport and SQuALITY.

Updates

  • [Jan. 22] Samba has been accepted to ICLR 2025!
  • [Dec. 8] Added the evaluation script and more baseline architectures.
  • [June 11] Released the codebase for training Samba-421M and Samba-1.3B on SlimPajama.

Code Overview

Our training infrastructure on SlimPajama is a modified version of TinyLlama and LitGPT. One can easily specify different architectural configurations by modifying the model_name and the config file, which includes many of the baseline architectures mentioned in the paper. Our RetNet and GLA implementations are from the awesome Flash Linear Attention repository.

Pretraining Samba from scratch

Please follow the Dockerfile to set up the environment. The data preparation mainly follows TinyLlama, except that we only use the SlimPajama dataset.

Data Preparation

Download the SlimPajama dataset to your chosen directory.

bash
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

The SlimPajama dataset takes 893 GB of disk space. Use the provided scripts to tokenize the datasets and divide them into chunks.

bash
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split train --percentage 1.0

You are now ready to launch a job!

Training

The following script trains a default Samba-421M model on a single node of 8 GPUs with 20B tokens.

bash
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=samba-421M --rdzv_backend=c10d  --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py --train_data_dir data/slim --val_data_dir data/slim 

You can change model_name to "Samba_1.3B" and train_config to "tsz512x4k_100B" to train a Samba-1.3B model on 100B tokens. This configuration assumes 8 nodes with 8 GPUs each; adjust the number of nodes to train on fewer GPUs.

Evaluation

We leverage lm-evaluation-harness for the evaluation of our pretrained models. Only non-generation-based tasks are supported for now.

bash
pip install lm-eval
python eval.py --model Samba \
          --model_args pretrained=path/to/ckpt.pth,config="Samba_1.3B" \
          --tasks lambada_openai,arc_easy,winogrande,hellaswag,piqa --device cuda:0 --batch_size 1 --trust_remote_code 

Citation

If you find our work useful, please consider citing:

bibtex
@article{ren2024samba,
      title={Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling}, 
      author={Liliang Ren and Yang Liu and Yadong Lu and Yelong Shen and Chen Liang and Weizhu Chen},
      journal = {arXiv preprint},
      year={2024},
      url={https://arxiv.org/abs/2406.07522}
}

Contact

Liliang Ren (liliangren@microsoft.com)
