msmodelslim

Huawei Ascend NPU model compression tool (msModelSlim). Use for LLM quantization (W4A8, W8A8, W8A8S, W8A16), MoE model compression, multimodal model compression (Qwen-VL, InternVL, HunyuanVideo, FLUX, SD3), calibration data preparation, precision auto-tuning, sensitive layer analysis, custom model integration, and deployment in MindIE/vLLM-Ascend. Supports Qwen, LLaMA, DeepSeek, GLM, Kimi, InternLM and more.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "msmodelslim" with this command: npx skills add ascend-ai-coding/awesome-ascend-skills/ascend-ai-coding-awesome-ascend-skills-msmodelslim

msModelSlim - Ascend Model Compression Tool

MindStudio ModelSlim (msModelSlim) is a model compression tool optimized for Huawei Ascend AI processors. It supports quantization and compression for LLMs, MoE models, and multimodal models.


Quick Start

One-Click Quantization (V1 - Recommended)

V1 automatic quantization reads its settings from the configuration files in the lab_practice/ directory of the msmodelslim repository.

# Basic W8A8 quantization for Qwen2.5-7B-Instruct
# Config files located at: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
msmodelslim quant \
    --model_path /path/to/Qwen2.5-7B-Instruct \
    --save_path /path/to/output \
    --device npu \
    --model_type Qwen2.5-7B-Instruct \
    --config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml \
    --trust_remote_code True

# MoE model quantization (Qwen3-30B-A3B W4A8)
msmodelslim quant \
    --model_path /path/to/Qwen3-30B-A3B \
    --save_path /path/to/output \
    --device npu \
    --model_type Qwen3-30B \
    --config_path /path/to/msmodelslim/lab_practice/qwen3_moe/qwen3-30b-w4a8-v1.yaml \
    --trust_remote_code True

# Multi-device distributed quantization
msmodelslim quant \
    --model_path /path/to/model \
    --save_path /path/to/output \
    --device npu:0,1,2,3 \
    --model_type Qwen2.5-72B-Instruct \
    --config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-72b-w8a8c8-v1.yaml \
    --trust_remote_code True

Note: The config files live in the lab_practice/ directory of the msmodelslim repository:

  • Structure: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
  • Example: lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml
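The naming scheme above is regular enough to build config paths and command lines programmatically. The following is a minimal sketch, not part of msmodelslim: the helper names `lab_practice_config` and `quant_command` are hypothetical, and it only uses the flags that appear in the commands above.

```python
from pathlib import PurePosixPath
from typing import List


def lab_practice_config(repo_root: str, model_series: str, model: str, quant_type: str) -> str:
    """Return <repo_root>/lab_practice/<model_series>/<model>-<quant_type>-v1.yaml."""
    filename = f"{model}-{quant_type}-v1.yaml"
    return str(PurePosixPath(repo_root) / "lab_practice" / model_series / filename)


def quant_command(model_path: str, save_path: str, model_type: str,
                  config_path: str, device: str = "npu") -> List[str]:
    """Assemble the argv for `msmodelslim quant` from the flags shown in this guide."""
    return [
        "msmodelslim", "quant",
        "--model_path", model_path,
        "--save_path", save_path,
        "--device", device,
        "--model_type", model_type,
        "--config_path", config_path,
        "--trust_remote_code", "True",
    ]


config = lab_practice_config("/path/to/msmodelslim", "qwen2.5", "qwen2.5-7b", "w8a8")
argv = quant_command(
    model_path="/path/to/Qwen2.5-7B-Instruct",
    save_path="/path/to/output",
    model_type="Qwen2.5-7B-Instruct",
    config_path=config,
)
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; whether a given config file actually exists in the repository still has to be checked by hand.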

Traditional Quantization (V0)

cd msmodelslim
python3 example/Qwen/quant_qwen.py \
    --model_path /path/to/Qwen2.5-7B-Instruct \
    --save_directory /path/to/output \
    --calib_file example/common/boolq.jsonl \
    --w_bit 8 --a_bit 8 \
    --device_type npu \
    --trust_remote_code True
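The V0 script takes its calibration data from a .jsonl file, so a malformed line is an easy way to lose a long run. Below is a small pre-flight check, offered as a sketch: it assumes each non-empty line is one JSON object and makes no assumption about the field names that example/common/boolq.jsonl uses. The `text` field in the demo data is purely hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file, skipping blank lines and reporting the first bad line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            records.append(record)
    return records


# Demo on throwaway data; the "text" field is a made-up placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "calib.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"text": "hypothetical sample one"}\n\n')
        handle.write('{"text": "hypothetical sample two"}\n')
    demo_records = load_jsonl(demo_path)

print(len(demo_records), "records loaded")
```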

Installation

Prerequisites

  • Python: 3.8+ (3.9+ recommended for some environments)
  • CANN: 8.2.RC1+ (8.3.RC1 or 8.5.0 recommended)
  • PyTorch Ascend: Ascend Extension for PyTorch

Install Steps

# 1. Clone repository
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim

# 2. Run installation script
bash install.sh

# 3. For Atlas 300I Duo (sparse quantization support)
cd ${PYTHON_SITE_PACKAGES}/msmodelslim/pytorch/weight_compression/compress_graph/
sudo bash build.sh ${CANN_INSTALL_PATH}/ascend-toolkit/latest
chmod -R 550 build

Note: Do not run msmodelslim commands from within the source directory to avoid module path conflicts.

See references/installation.md for detailed environment setup.


Quantization Types

| Type | Weight | Activation | Description | Use Case |
| --- | --- | --- | --- | --- |
| W8A8 | INT8 | INT8 | Standard 8-bit quantization | General use, balanced precision/performance |
| W8A16 | INT8 | FP16 | Weight-only quantization | Higher precision needs (MindIE only) |
| W4A8 | INT4 | INT8 | Low-bit weight quantization | Higher compression ratio |
| W8A8C8 | INT8 | INT8 + KV Cache | With KV Cache quantization | Long sequence inference |
| W8A8S | INT8 Sparse | INT8 | Sparse quantization | Atlas 300I Duo optimization |
| W16A16S | FP16 Sparse | FP16 | Float sparse quantization | High compression needs |
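The weight bit-widths in the table translate directly into weight storage. The arithmetic below is a rough sketch only: it ignores quantization scales and offsets, any layers kept in higher precision, and activation or KV Cache memory, and the 7e9 parameter count is an illustrative assumption rather than a measured figure. Sparse types are left out because their compression ratio depends on the weights.

```python
# Weight bit-widths taken from the table above; FP16 is the unquantized baseline.
WEIGHT_BITS = {"FP16": 16, "W8A8": 8, "W8A16": 8, "W4A8": 4}


def weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight storage in GiB for num_params parameters at `bits` bits each."""
    return num_params * bits / 8 / 2**30


num_params = 7e9  # assumption: a model with roughly 7 billion parameters
for name, bits in WEIGHT_BITS.items():
    print(f"{name:6s} {weight_gib(num_params, bits):6.2f} GiB")
```

W8A8 and W8A16 store weights at the same width; they differ in activation precision, which this estimate does not capture.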

Quantization Type Selection

| Priority | Recommended Type |
| --- | --- |
| Precision first | W8A16 > W8A8 > W4A8 |
| Memory first | W4A8 > W8A8 > W8A16 |
| Long sequence | W8A8C8 (with KV Cache quant) |
| Atlas 300I Duo | W8A8S or W16A16S |
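The selection table can be encoded as a simple ordered lookup. This is a sketch for scripting purposes, not a feature of msmodelslim: the priority keys and the `recommend` helper are hypothetical names, and the optional `supported` filter is there because some types are limited to certain backends (W8A16 is listed as MindIE only).

```python
from typing import Dict, List, Optional, Set

# Preference order per priority, copied from the table above.
RECOMMENDED: Dict[str, List[str]] = {
    "precision": ["W8A16", "W8A8", "W4A8"],
    "memory": ["W4A8", "W8A8", "W8A16"],
    "long_sequence": ["W8A8C8"],
    "atlas_300i_duo": ["W8A8S", "W16A16S"],
}


def recommend(priority: str, supported: Optional[Set[str]] = None) -> Optional[str]:
    """Return the first recommended type for `priority` that the target supports."""
    for quant_type in RECOMMENDED[priority]:
        if supported is None or quant_type in supported:
            return quant_type
    return None


print(recommend("precision"))
print(recommend("precision", supported={"W8A8", "W4A8"}))
```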

BFLOAT16 Model Notes

For models with torch_dtype=bfloat16 weights (e.g., Qwen3-30B-A3B), an AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16 error most likely points to a Docker image issue, not an msmodelslim limitation.

Quick Diagnosis:

# Test if torch_npu works correctly
python3 -c "import torch; import torch_npu; a = torch.tensor(1).npu(); print('NPU OK')"

If this fails, your Docker image has compatibility issues. Try:

  1. Use a different/updated Docker image
  2. Reinstall torch_npu matching your CANN version
  3. Ensure CANN 8.3.RC1+ for BF16 support
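The one-liner above only confirms that a tensor can be moved to the NPU; it does not touch BF16. A slightly more targeted check, sketched below, runs one BF16 addition on the device and reports what it finds. The function name `check_bf16_on_npu` is hypothetical, the script degrades to a status message when torch or torch_npu is missing, and it assumes the failure surfaces as a RuntimeError.

```python
import importlib.util


def check_bf16_on_npu() -> str:
    """Return a short status string describing whether a BF16 op runs on the NPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    if importlib.util.find_spec("torch_npu") is None:
        return "torch_npu not installed"

    import torch
    import torch_npu  # noqa: F401  (importing it registers the NPU backend)

    if not torch.npu.is_available():
        return "no NPU visible to torch_npu"
    try:
        x = torch.ones(4, dtype=torch.bfloat16).npu()
        total = (x + x).sum().float().cpu().item()
    except RuntimeError as exc:  # assumption: kernel errors are raised as RuntimeError
        return f"BF16 op failed: {exc}"
    return "BF16 OK" if total == 8.0 else f"unexpected result: {total}"


print(check_bf16_on_npu())
```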

Container Setup: See ascend-docker for proper Docker container creation with NPU device mappings. Refer to references/docker-setup.md for msmodelslim-specific container configuration.


Algorithm Selection

Outlier Suppression Algorithms

| Algorithm | Description | When to Use |
| --- | --- | --- |
| SmoothQuant | Co-scale activation and weight | Standard outlier suppression |
| QuaRot | Orthogonal rotation matrix | High precision requirements |
| Iterative Smooth | Iterative smoothing | Complex distributions |
| Flex Smooth | Grid search for optimal alpha/beta | Different architectures |
| KV Smooth | KV Cache smoothing | KV Cache quantization |
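To make "co-scale activation and weight" concrete, here is a toy sketch of the smoothing idea as it is usually described for SmoothQuant: each input channel j gets a factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it and the matching weight rows are multiplied by it, so the layer output is unchanged while activation outliers shrink. This is plain Python on made-up numbers and is not msmodelslim's implementation; alpha = 0.5 is an assumed default.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x [tokens, in_channels] by w [in_channels, out_channels]."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]


def smooth_scales(act_absmax: List[float], weight_absmax: List[float],
                  alpha: float = 0.5) -> List[float]:
    """Per-input-channel smoothing factors s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Made-up data: input channel 1 carries activation outliers.
X = [[0.5, 40.0], [-1.0, -64.0]]
W = [[0.2, -0.4], [0.1, 0.3]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(len(W))]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(len(W))]
s = smooth_scales(act_absmax, weight_absmax)

X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]  # shrink activations
W_smooth = [[v * s[j] for v in W[j]] for j in range(len(W))]     # grow matching weight rows

print("activation abs-max before:", act_absmax)
print("activation abs-max after: ",
      [max(abs(row[j]) for row in X_smooth) for j in range(len(W))])
```

Because the scaling cancels inside the matrix product, the smoothed pair can be quantized in place of the original one.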

Quantization Algorithms

| Algorithm | Description | When to Use |
| --- | --- | --- |
| AutoRound | SignSGD optimization for rounding | 4-bit ultra-low quantization |
| GPTQ | Column-wise optimization | High precision weight quantization |
| SSZ | Iterative scale/offset search | Uneven weight distributions |
| PDMIX | Dynamic (prefill) + static (decode) | Large model inference |
| FA3 | Per-head INT8 attention | Long sequence, MLA models |
| MinMax | Min-max range statistics | Basic quantization |
| Histogram | Histogram distribution analysis | Filter outliers |
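As a reference point for the simplest entry in the table, the sketch below shows textbook min-max (asymmetric) INT8 quantization: the observed range is mapped linearly onto [-128, 127] through a scale and a zero point. It illustrates the general technique only and is not taken from msmodelslim; the function names and sample values are made up.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def minmax_params(values: Sequence[float]) -> Tuple[float, int]:
    """Derive scale and zero point from the observed min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for constant input
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(values: Sequence[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to INT8 codes, clamping to the representable range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(codes: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


values = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = minmax_params(values)
codes = quantize(values, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
print(codes)
print([round(v, 3) for v in restored])
```

A single extreme value stretches the range and coarsens every other code, which is the situation that range-clipping approaches such as the Histogram entry are meant to handle.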

Quick Selection Guide

  • Beginners: Use one-click quantization with --config_path pointing to lab_practice/ config files
  • Precision priority: QuaRot + AutoRound
  • Long sequence: FA3 + KVCache Quant
  • Custom model: See references/model-integration.md

See references/quantization-algorithms.md for algorithm details.


Supported Models

Large Language Models

| Model Series | One-Click | V0 Script | Notes |
| --- | --- | --- | --- |
| Qwen3 |  | example/Qwen/ | Qwen3-8B/14B/32B |
| Qwen2.5 |  | example/Qwen/ | 7B/32B/72B/Coder |
| Qwen2 | - | example/Qwen/ | 7B/72B |
| DeepSeek-V3 |  | example/DeepSeek/ | V3/V3.1/V3.2, R1 |
| LLaMA | - | example/Llama/ | LLaMA2, LLaMA3.1 |
| GLM | - | example/GLM/ | GLM-4, GLM-5 |
| InternLM2 | - | example/InternLM2/ | InternLM2-20B |
| Kimi | - | example/Kimi/ | Kimi K2 |
| HunYuan | - | example/HunYuan/ | HunYuan-A52B |

MoE Models

| Model | One-Click | Notes |
| --- | --- | --- |
| Qwen3-MoE |  | Qwen3-30B-A3B, Qwen3-235B-A22B |
| DeepSeek MoE |  | DeepSeek-V2, V3 series |

Multimodal Models

| Type | Models | Example Script |
| --- | --- | --- |
| Vision-Language | Qwen-VL, Qwen2-VL, Qwen3-VL, InternVL2, LLaVA, GLM-4.1V | example/multimodal_vlm/ |
| Generation | FLUX, SD3, HunyuanVideo, OpenSoraPlan, Wan2.1 | example/multimodal_sd/ |

See references/model-support.md for complete support matrix.


Custom Model Integration

Quick Overview

  1. Create adapter file: msmodelslim/model/my_model/model_adapter.py
  2. Define adapter class: Inherit TransformersModel + interface classes
  3. Implement interfaces: handle_dataset, init_model, generate_model_visit, etc.
  4. Register model: Add to config/config.ini

Example

from msmodelslim.model.interface_hub import ModelSlimPipelineInterfaceV1
from msmodelslim.model.common.transformers import TransformersModel

class MyModelAdapter(TransformersModel, ModelSlimPipelineInterfaceV1):
    def handle_dataset(self, dataset, device):
        return self._get_tokenized_data(dataset, device)
    
    def init_model(self, device):
        return self._load_model(device)
    
    def generate_model_visit(self, model):
        from msmodelslim.model.common.layer_wise_forward import generated_decoder_layer_visit_func
        yield from generated_decoder_layer_visit_func(model)
    
    def generate_model_forward(self, model, inputs):
        from msmodelslim.model.common.layer_wise_forward import transformers_generated_forward_func
        yield from transformers_generated_forward_func(model, inputs)

See references/model-integration.md and scripts/model_adapter_template.py for complete guide.


Precision Auto-Tuning

Sensitive Layer Analysis

# Analyze model sensitivity
msmodelslim analyze --model_path /path/to/model --model_type Qwen2.5-7B-Instruct

Analysis Algorithms:

  • std: Standard deviation based (recommended for general use)
  • quantile: Quantile/IQR based (for long-tail distributions)
  • kurtosis: Kurtosis based (for extreme value detection)
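The three method names correspond to familiar statistics. The sketch below computes each one on made-up value lists to show why they react differently to outliers; how msmodelslim turns such statistics into a per-layer sensitivity ranking is not described here, so treat the scoring functions as illustrative assumptions.

```python
import statistics
from typing import Sequence


def std_score(values: Sequence[float]) -> float:
    """Population standard deviation: overall spread."""
    return statistics.pstdev(values)


def iqr_score(values: Sequence[float]) -> float:
    """Interquartile range: spread of the middle half, robust to a long tail."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def kurtosis_score(values: Sequence[float]) -> float:
    """Fourth standardized moment: grows sharply with extreme values.

    Assumes the values are not all identical (non-zero variance).
    """
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    fourth_moment = sum((v - mean) ** 4 for v in values) / len(values)
    return fourth_moment / variance ** 2


well_behaved = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4
heavy_tailed = well_behaved + [25.0]  # one extreme value

for name, data in (("well_behaved", well_behaved), ("heavy_tailed", heavy_tailed)):
    print(f"{name:13s} std={std_score(data):.2f} "
          f"iqr={iqr_score(data):.2f} kurtosis={kurtosis_score(data):.2f}")
```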

Auto-Tuning Strategy

Standing High: Uses binary search to find the smallest set of fallback layers (layers kept at higher precision) that still meets the precision target.
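One way to read this strategy is as a binary search over a sensitivity-ranked layer list. The sketch below follows that reading under two stated assumptions: layers are already ordered from most to least sensitive, and precision never gets worse when more layers are kept in high precision. The `meets_target` callback stands in for a real quantize-and-evaluate run, and none of these names come from msmodelslim.

```python
from typing import Callable, List, Optional, Sequence


def min_fallback_layers(ranked_layers: Sequence[str],
                        meets_target: Callable[[List[str]], bool]) -> Optional[List[str]]:
    """Return the shortest prefix of ranked_layers whose fallback meets the target.

    Returns None when even falling back every layer misses the target.
    """
    if not meets_target(list(ranked_layers)):
        return None
    lo, hi = 0, len(ranked_layers)
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_target(list(ranked_layers[:mid])):
            hi = mid       # mid layers are enough; try fewer
        else:
            lo = mid + 1   # not enough; need more fallback layers
    return list(ranked_layers[:lo])


# Hypothetical ranking and a stand-in evaluator that passes with two or more fallbacks.
ranked = ["layer.3", "layer.0", "layer.7", "layer.1", "layer.5"]
evaluations = []


def fake_eval(fallback: List[str]) -> bool:
    evaluations.append(len(fallback))
    return len(fallback) >= 2


print(min_fallback_layers(ranked, fake_eval), "after", len(evaluations), "evaluations")
```

Each evaluation is a full quantization and accuracy run in practice, so the logarithmic number of probes is the point of searching this way.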

# Use auto-tuning config
msmodelslim quant \
    --model_path /path/to/model \
    --save_path /path/to/output \
    --model_type Qwen2.5-7B-Instruct \
    --config_path /path/to/auto_tuning_config.yaml

See references/precision-tuning.md for tuning strategies.


Deployment

vLLM-Ascend

# Online service
vllm serve /path/to/quantized-model \
    --served-model-name "Qwen2.5-7B-w8a8" \
    --max-model-len 4096 \
    --quantization ascend

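# Offline inference (Python)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    max_model_len=4096,
    quantization="ascend",
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6))
# Each result holds one or more completions; print the first for every prompt
for output in outputs:
    print(output.outputs[0].text)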

MindIE

# Deploy with MindIE
# See MindIE documentation for details

Weight Conversion

# Convert to AutoAWQ/AutoGPTQ format
python3 example/common/ms_to_vllm.py --input /path/to/quantized --output /path/to/converted

See references/deployment.md for deployment details.


Output Files

After quantization, the output directory contains:

output/
├── config.json                      # Original model config
├── generation_config.json           # Generation config
├── quant_model_description.json     # Quantization description
├── quant_model_weight_w8a8.safetensors  # Quantized weights
├── tokenizer_config.json            # Tokenizer config
├── tokenizer.json                   # Tokenizer vocabulary
└── vocab.json                       # Vocabulary (if applicable)
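Before handing the directory to a serving framework, it can help to confirm that the key files from the listing are present. The check below is a sketch: which files count as mandatory, and the glob used for the weight file (on the assumption that its suffix follows the quantization type), are guesses based on the listing above rather than documented guarantees.

```python
import tempfile
from pathlib import Path
from typing import List

EXPECTED_FILES = ["config.json", "quant_model_description.json"]
WEIGHT_PATTERN = "quant_model_weight_*.safetensors"  # assumption: suffix varies by type


def missing_outputs(output_dir: str) -> List[str]:
    """List expected files (or patterns) that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not any(root.glob(WEIGHT_PATTERN)):
        missing.append(WEIGHT_PATTERN)
    return missing


# Demo on a throwaway directory: first incomplete, then complete.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    incomplete = missing_outputs(tmp)
    (Path(tmp) / "quant_model_description.json").write_text("{}")
    (Path(tmp) / "quant_model_weight_w8a8.safetensors").write_bytes(b"")
    complete = missing_outputs(tmp)

print("missing before:", incomplete)
print("missing after: ", complete)
```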

Troubleshooting

Common Issues

Q: Out of memory during quantization?

# Use layer-by-layer quantization (default in V1)
# Or use CPU quantization
msmodelslim quant --device cpu ...

Q: Precision degradation after quantization?

  • Use higher precision type (W8A8 instead of W4A8)
  • Check lab_practice/ for best practice configs
  • Enable outlier suppression algorithms
  • See references/precision-tuning.md

Q: Model type not supported?

Q: How to enable debug logging?

export MSMODELSLIM_LOG_LEVEL=DEBUG
msmodelslim quant ...

Scripts & Assets

Scripts

Config Templates (assets/)


Official References


Related Skills

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
