TorchAO

TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as torch.compile and FSDP. Some benchmark numbers can be found here.

We recommend installing the latest torchao nightly build:

# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install \
    --pre "torchao>=10.0.0" \
    --index-url https://download.pytorch.org/whl/nightly/cu126
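
To confirm that the nightly build is the one being picked up, you can print the installed version (a quick sanity check, assuming torchao exposes __version__ as recent releases do):

import torchao

print(torchao.__version__)  # nightly builds typically report a .dev version string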

Quantizing HuggingFace Models

You can quantize your own Hugging Face models with torchao (for example, transformers and diffusers models) and push the quantized checkpoint to the Hugging Face Hub, as in the following example:

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

hub_repo = "YOUR_HUB_REPO_ID"  # replace with your Hugging Face Hub repo id
tokenizer.push_to_hub(hub_repo)
quantized_model.push_to_hub(hub_repo, safe_serialization=False)
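
Once pushed, the quantized checkpoint can be loaded back and used like any other Transformers model. A minimal sketch, assuming the repo pushed above and a CUDA device (the repo id is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

hub_repo = "YOUR_HUB_REPO_ID"  # the repo you pushed to above

# The quantization settings are read from the checkpoint's config,
# so no TorchAoConfig needs to be passed when reloading.
reloaded_model = AutoModelForCausalLM.from_pretrained(
    hub_repo,
    dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(hub_repo)

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(reloaded_model.device)
output = reloaded_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))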

Alternatively, you can use the TorchAO Quantization space for quantizing models with a simple UI.

Online Quantization in vLLM

To perform online quantization with TorchAO in vLLM, use --quantization torchao and pass the TorchAO config through --hf-overrides.

You can pass the override inline as a JSON string; in this example it points vLLM at a TorchAO config file on disk:

vllm serve meta-llama/Meta-Llama-3-8B \
  --quantization torchao \
  --hf-overrides '{"quantization_config_file": "/path/to/torchao_config.json"}'

When you need to skip specific modules (for example, excluding vocab_parallel_embedding), configure that in the TorchAO config with FqnToConfig rather than changing vLLM model code.
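
A sketch of such a config, assuming FqnToConfig maps module fully qualified names to per-module configs, with a "_default" entry covering everything else and None meaning "leave this module unquantized" (exact keys and skip semantics may vary across torchao versions):

from torchao.quantization import FqnToConfig, Int8WeightOnlyConfig

# Hypothetical FQN for the embedding; match it to the module name that
# backs vocab_parallel_embedding in your model.
quant_config = FqnToConfig({
    "_default": Int8WeightOnlyConfig(),  # applied to all other modules
    "model.embed_tokens": None,          # skip quantization for this module
})

The resulting config can be serialized to JSON in the same way as above and referenced from --hf-overrides.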