Optimum-Intel


About Optimum-Intel

Optimum Intel, provided by Hugging Face, is the interface between the Transformers and Diffusers libraries and the tools and libraries Intel provides to accelerate end-to-end pipelines on Intel CPU and GPU architectures. Optimum Intel supports two backends.

1. Optimum Neural Compressor

An open-source library for applying the most widely used compression techniques to LLMs, such as quantization, pruning, and knowledge distillation. It supports accuracy-driven automatic tuning strategies so that users can easily produce quantized models: users can apply static, dynamic, and quantization-aware-training approaches while specifying an expected accuracy criterion. It also supports a variety of weight-pruning techniques, producing pruned models that meet a predefined sparsity target.
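
The accuracy-driven tuning loop is the core idea: you hand fit() an evaluation function, and it searches quantization configurations until the accuracy criterion is met. Below is a minimal sketch, assuming neural-compressor's 2.x config API and reusing the dummy ResNet-18 setup from the static-quantization example later on this page; the evaluate stub is hypothetical (a real eval_func returns accuracy on a held-out set).

from torchvision import models

from neural_compressor.config import (
    AccuracyCriterion,
    PostTrainingQuantConfig,
    TuningCriterion,
)
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

float_model = models.resnet18()
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)

def evaluate(model) -> float:
    # Hypothetical stub: a real eval_func scores the candidate model on a
    # held-out set; tuning stops once the accuracy criterion is satisfied.
    return 1.0

conf = PostTrainingQuantConfig(
    approach="static",
    # Accept at most a 1% relative accuracy drop, within 100 trials
    # (the same defaults that appear in the tuning logs below).
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),
    tuning_criterion=TuningCriterion(max_trials=100),
)
q_model = fit(
    model=float_model,
    conf=conf,
    calib_dataloader=calib_dataloader,
    eval_func=evaluate,
)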

2. Optimum-OpenVINO

OpenVINO is an open-source toolkit that delivers high-performance inference on Intel CPUs, GPUs, and dedicated deep-learning inference accelerators. It ships with a set of tools for optimizing models via quantization, pruning, and knowledge distillation. Optimum Intel-OpenVINO provides a simple interface for optimizing Transformers and Diffusers models, converting them to the OpenVINO IR (Intermediate Representation) format, and running inference with the OpenVINO runtime.
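
As a quick illustration of that workflow, here is a minimal sketch following the optimum-intel documentation (the model id and save path are only examples): it exports a Transformers checkpoint to OpenVINO IR at load time and runs it through a standard pipeline.

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The OVModel* classes drop into transformers pipelines as-is.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum Intel makes OpenVINO inference simple."))

# Persist the converted IR (openvino_model.xml / .bin) for reuse.
model.save_pretrained("ov-distilbert-sst2")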

The Hugging Face tasks supported by Optimum Intel-OpenVINO are listed below:

Task                    Auto Class
text-classification     OVModelForSequenceClassification
token-classification    OVModelForTokenClassification
question-answering      OVModelForQuestionAnswering
audio-classification    OVModelForAudioClassification
image-classification    OVModelForImageClassification
feature-extraction      OVModelForFeatureExtraction
fill-mask               OVModelForMaskedLM
text-generation         OVModelForCausalLM
text2text-generation    OVModelForSeq2SeqLM

Installing Optimum Intel-OpenVINO also requires the OpenVINO Runtime and IR tooling, so a Docker build is recommended where possible.
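
For a quick local (non-Docker) setup, the pip extra below is the documented route; the exact extras have shifted across releases, so check the optimum-intel README linked at the bottom of this page.

%pip install optimum[openvino]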

For detailed Optimum OpenVINO tutorials, see the links at the bottom of this page.

Optimum Intel neural-compressor

neural-compressor can effectively apply quantization, pruning, and knowledge distillation. That said, Hugging Face already makes these easy to drop into a pipeline via bitsandbytes and Accelerate (see the short sketch after the list below), so the practical benefit of Intel's neural-compressor seems limited. Two usage examples follow:

  1. Weight-only quantization of an LLM

  2. Static quantization of a general (non-LLM) model
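
For comparison, the Transformers-native 4-bit path mentioned above is nearly a one-liner. A sketch, assuming a CUDA GPU with bitsandbytes and accelerate installed:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weight-only loading via bitsandbytes -- no separate
# calibration or tuning step is needed.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)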

%pip install optimum[neural-compressor]
%pip install neural-compressor auto_round

1. Weight-Only Quantization (LLMs)

This example demonstrates weight-only quantization of an LLM. It supports Intel CPUs, Intel Gaudi2 AI accelerators, and NVIDIA GPUs, and the best available device is selected automatically.

from transformers import AutoModel, AutoTokenizer

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit
from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader

# model_name = "EleutherAI/gpt-neo-125m"
model_name = "google/gemma-2b-it"

# Load the full-precision model and tokenizer.
float_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Calibration dataloader with 2048-token sequences for AutoRound tuning.
dataloader = get_dataloader(
    tokenizer,
    seqlen=2048
)

# Weight-only quantization: 4-bit integer weights for every op,
# with rounding tuned by the AutoRound algorithm.
woq_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all ops
            "weight": {
                "dtype": "int",
                "bits": 4,
                "algorithm": "AUTOROUND",
            },
        }
    },
)

# Run post-training quantization against the calibration data.
quantized_model = fit(
    model=float_model,
    conf=woq_conf,
    calib_dataloader=dataloader
)
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

2024-05-31 22:52:35 [INFO] Start auto tuning.
2024-05-31 22:52:35 [INFO] Quantize model without tuning!
2024-05-31 22:52:35 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-05-31 22:52:35 [INFO] Adaptor has 5 recipes.
2024-05-31 22:52:35 [INFO] 0 recipes specified by user.
2024-05-31 22:52:35 [INFO] 3 recipes require future tuning.
2024-05-31 22:52:36 [INFO] *** Initialize auto tuning
2024-05-31 22:52:36 [INFO] {
2024-05-31 22:52:36 [INFO]     'PostTrainingQuantConfig': {
2024-05-31 22:52:36 [INFO]         'AccuracyCriterion': {
2024-05-31 22:52:36 [INFO]             'criterion': 'relative',
2024-05-31 22:52:36 [INFO]             'higher_is_better': True,
2024-05-31 22:52:36 [INFO]             'tolerable_loss': 0.01,
2024-05-31 22:52:36 [INFO]             'absolute': None,
2024-05-31 22:52:36 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x7f313bffde90>>,
2024-05-31 22:52:36 [INFO]             'relative': 0.01
2024-05-31 22:52:36 [INFO]         },
2024-05-31 22:52:36 [INFO]         'approach': 'post_training_weight_only',
2024-05-31 22:52:36 [INFO]         'backend': 'default',
2024-05-31 22:52:36 [INFO]         'calibration_sampling_size': [
2024-05-31 22:52:36 [INFO]             100
2024-05-31 22:52:36 [INFO]         ],
2024-05-31 22:52:36 [INFO]         'device': 'cpu',
2024-05-31 22:52:36 [INFO]         'diagnosis': False,
2024-05-31 22:52:36 [INFO]         'domain': 'auto',
2024-05-31 22:52:36 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-05-31 22:52:36 [INFO]         'excluded_precisions': [
2024-05-31 22:52:36 [INFO]         ],
2024-05-31 22:52:36 [INFO]         'framework': 'pytorch_fx',
2024-05-31 22:52:36 [INFO]         'inputs': [
2024-05-31 22:52:36 [INFO]         ],
2024-05-31 22:52:36 [INFO]         'model_name': '',
2024-05-31 22:52:36 [INFO]         'ni_workload_name': 'quantization',
2024-05-31 22:52:36 [INFO]         'op_name_dict': None,
2024-05-31 22:52:36 [INFO]         'op_type_dict': {
2024-05-31 22:52:36 [INFO]             '.*': {
2024-05-31 22:52:36 [INFO]                 'weight': {
2024-05-31 22:52:36 [INFO]                     'dtype': [
2024-05-31 22:52:36 [INFO]                         'int'
2024-05-31 22:52:36 [INFO]                     ],
2024-05-31 22:52:36 [INFO]                     'bits': [
2024-05-31 22:52:36 [INFO]                         4
2024-05-31 22:52:36 [INFO]                     ],
2024-05-31 22:52:36 [INFO]                     'algorithm': [
2024-05-31 22:52:36 [INFO]                         'AUTOROUND'
2024-05-31 22:52:36 [INFO]                     ]
2024-05-31 22:52:36 [INFO]                 }
2024-05-31 22:52:36 [INFO]             }
2024-05-31 22:52:36 [INFO]         },
2024-05-31 22:52:36 [INFO]         'outputs': [
2024-05-31 22:52:36 [INFO]         ],
2024-05-31 22:52:36 [INFO]         'quant_format': 'default',
2024-05-31 22:52:36 [INFO]         'quant_level': 'auto',
2024-05-31 22:52:36 [INFO]         'recipes': {
2024-05-31 22:52:36 [INFO]             'smooth_quant': False,
2024-05-31 22:52:36 [INFO]             'smooth_quant_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'layer_wise_quant': False,
2024-05-31 22:52:36 [INFO]             'layer_wise_quant_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'fast_bias_correction': False,
2024-05-31 22:52:36 [INFO]             'weight_correction': False,
2024-05-31 22:52:36 [INFO]             'gemm_to_matmul': True,
2024-05-31 22:52:36 [INFO]             'graph_optimization_level': None,
2024-05-31 22:52:36 [INFO]             'first_conv_or_matmul_quantization': True,
2024-05-31 22:52:36 [INFO]             'last_conv_or_matmul_quantization': True,
2024-05-31 22:52:36 [INFO]             'pre_post_process_quantization': True,
2024-05-31 22:52:36 [INFO]             'add_qdq_pair_to_weight': False,
2024-05-31 22:52:36 [INFO]             'optypes_to_exclude_output_quant': [
2024-05-31 22:52:36 [INFO]             ],
2024-05-31 22:52:36 [INFO]             'dedicated_qdq_pair': False,
2024-05-31 22:52:36 [INFO]             'rtn_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'awq_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'gptq_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'teq_args': {
2024-05-31 22:52:36 [INFO]             },
2024-05-31 22:52:36 [INFO]             'autoround_args': {
2024-05-31 22:52:36 [INFO]             }
2024-05-31 22:52:36 [INFO]         },
2024-05-31 22:52:36 [INFO]         'reduce_range': None,
2024-05-31 22:52:36 [INFO]         'TuningCriterion': {
2024-05-31 22:52:36 [INFO]             'max_trials': 100,
2024-05-31 22:52:36 [INFO]             'objective': [
2024-05-31 22:52:36 [INFO]                 'performance'
2024-05-31 22:52:36 [INFO]             ],
2024-05-31 22:52:36 [INFO]             'strategy': 'basic',
2024-05-31 22:52:36 [INFO]             'strategy_kwargs': None,
2024-05-31 22:52:36 [INFO]             'timeout': 0
2024-05-31 22:52:36 [INFO]         },
2024-05-31 22:52:36 [INFO]         'use_bf16': True
2024-05-31 22:52:36 [INFO]     }
2024-05-31 22:52:36 [INFO] }
2024-05-31 22:52:36 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-05-31 22:52:36 [INFO] Pass query framework capability elapsed time: 6.51 ms
2024-05-31 22:52:36 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-05-31 22:52:36 [INFO] Quantize the model with default config.
2024-05-31 22:52:36 [INFO] All algorithms to do: {'AUTOROUND'}
2024-05-31 22:52:36 [INFO] quantizing with the AutoRound algorithm
2024-05-31 22:52:36 INFO utils.py L570: Using GPU device
2024-05-31 22:52:38 INFO autoround.py L465: using torch.float16 for quantization tuning
2024-05-31 22:54:14 INFO autoround.py L981: quantizing 1/18, layers.0
2024-05-31 22:55:27 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.042549 -> iter 195: 0.008586
2024-05-31 22:55:36 INFO autoround.py L981: quantizing 2/18, layers.1
2024-05-31 22:56:48 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.012353 -> iter 188: 0.003782
2024-05-31 22:56:58 INFO autoround.py L981: quantizing 3/18, layers.2
2024-05-31 22:58:08 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.003564 -> iter 183: 0.001639
2024-05-31 22:58:18 INFO autoround.py L981: quantizing 4/18, layers.3
2024-05-31 22:59:29 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.001812 -> iter 168: 0.000810
2024-05-31 22:59:38 INFO autoround.py L981: quantizing 5/18, layers.4
2024-05-31 23:00:49 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.001076 -> iter 168: 0.000581
2024-05-31 23:00:59 INFO autoround.py L981: quantizing 6/18, layers.5
2024-05-31 23:02:10 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.000886 -> iter 186: 0.000537
2024-05-31 23:02:20 INFO autoround.py L981: quantizing 7/18, layers.6
2024-05-31 23:03:30 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.000899 -> iter 191: 0.000483
2024-05-31 23:03:40 INFO autoround.py L981: quantizing 8/18, layers.7
2024-05-31 23:04:51 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.000975 -> iter 139: 0.000531
2024-05-31 23:05:01 INFO autoround.py L981: quantizing 9/18, layers.8
2024-05-31 23:06:12 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.001110 -> iter 177: 0.000699
2024-05-31 23:06:22 INFO autoround.py L981: quantizing 10/18, layers.9
2024-05-31 23:07:32 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.001358 -> iter 197: 0.000854
2024-05-31 23:07:42 INFO autoround.py L981: quantizing 11/18, layers.10
2024-05-31 23:08:52 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.001754 -> iter 178: 0.001185
2024-05-31 23:09:03 INFO autoround.py L981: quantizing 12/18, layers.11
2024-05-31 23:10:13 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.002099 -> iter 168: 0.001497
2024-05-31 23:10:23 INFO autoround.py L981: quantizing 13/18, layers.12
2024-05-31 23:11:35 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.002648 -> iter 194: 0.001862
2024-05-31 23:11:45 INFO autoround.py L981: quantizing 14/18, layers.13
2024-05-31 23:12:55 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.003757 -> iter 106: 0.002598
2024-05-31 23:13:06 INFO autoround.py L981: quantizing 15/18, layers.14
2024-05-31 23:14:17 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.007845 -> iter 105: 0.004465
2024-05-31 23:14:27 INFO autoround.py L981: quantizing 16/18, layers.15
2024-05-31 23:15:37 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.010025 -> iter 126: 0.006707
2024-05-31 23:15:47 INFO autoround.py L981: quantizing 17/18, layers.16
2024-05-31 23:16:58 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.015236 -> iter 160: 0.010168
2024-05-31 23:17:08 INFO autoround.py L981: quantizing 18/18, layers.17
2024-05-31 23:18:18 INFO autoround.py L935: quantized 7/7 layers in the block, loss iter 0: 0.018019 -> iter 193: 0.012850
2024-05-31 23:18:29 INFO autoround.py L1096: quantization tuning time 1551.1706745624542
2024-05-31 23:18:29 INFO autoround.py L1112: Summary: quantized 126/126 in the model
2024-05-31 23:18:30 [INFO] |******Mixed Precision Statistics******|
2024-05-31 23:18:30 [INFO] +------------+---------+---------------+
2024-05-31 23:18:30 [INFO] |  Op Type   |  Total  |    A32W4G32   |
2024-05-31 23:18:30 [INFO] +------------+---------+---------------+
2024-05-31 23:18:30 [INFO] |   Linear   |   126   |      126      |
2024-05-31 23:18:30 [INFO] +------------+---------+---------------+
2024-05-31 23:18:30 [INFO] Pass quantize model elapsed time: 1554793.01 ms
2024-05-31 23:18:30 [INFO] Save tuning history to /home/kubwa/kubwai/15-Huggingface/06_Optimzation/nc_workspace/2024-05-31_22-51-42/./history.snapshot.
2024-05-31 23:18:30 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-05-31 23:18:30 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-05-31 23:18:30 [INFO] Save deploy yaml to /home/kubwa/kubwai/15-Huggingface/06_Optimzation/nc_workspace/2024-05-31_22-51-42/deploy.yaml
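
The final summary shows all 126 Linear ops converted to A32W4G32, i.e. FP32 activations and 4-bit weights (the G32 suffix denotes the quantization group size), matching the bits=4 setting above. To persist the tuned model, a sketch assuming the standard neural_compressor quantized-model API:

# Save the low-bit weights plus config to a directory (path is an example).
quantized_model.save("./gemma-2b-it-w4")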

2. Static Quantization (Non-LLMs)
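
The same fit() API handles classic INT8 post-training static quantization for non-LLM models. The example below calibrates a stock torchvision ResNet-18 with a dummy dataloader; as the log output shows, every op, from Conv2d down to the final Linear, is converted to INT8.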

%pip install torchvision
from torchvision import models

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

# FP32 baseline: a stock torchvision ResNet-18.
float_model = models.resnet18()

# Dummy calibration data in the expected input shape (N, C, H, W).
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))

calib_dataloader = DataLoader(
    framework="pytorch",
    dataset=dataset
)

# Default config = INT8 post-training static quantization.
static_quant_conf = PostTrainingQuantConfig()

quantized_model = fit(
    model=float_model,
    conf=static_quant_conf,
    calib_dataloader=calib_dataloader
)
2024-05-31 23:31:52 [INFO] Start auto tuning.
2024-05-31 23:31:52 [INFO] Quantize model without tuning!
2024-05-31 23:31:52 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-05-31 23:31:52 [INFO] Adaptor has 5 recipes.
2024-05-31 23:31:52 [INFO] 0 recipes specified by user.
2024-05-31 23:31:52 [INFO] 3 recipes require future tuning.
2024-05-31 23:31:52 [INFO] *** Initialize auto tuning
2024-05-31 23:31:52 [INFO] {
2024-05-31 23:31:52 [INFO]     'PostTrainingQuantConfig': {
2024-05-31 23:31:52 [INFO]         'AccuracyCriterion': {
2024-05-31 23:31:52 [INFO]             'criterion': 'relative',
2024-05-31 23:31:52 [INFO]             'higher_is_better': True,
2024-05-31 23:31:52 [INFO]             'tolerable_loss': 0.01,
2024-05-31 23:31:52 [INFO]             'absolute': None,
2024-05-31 23:31:52 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x7ff3c1b59a50>>,
2024-05-31 23:31:52 [INFO]             'relative': 0.01
2024-05-31 23:31:52 [INFO]         },
2024-05-31 23:31:52 [INFO]         'approach': 'post_training_static_quant',
2024-05-31 23:31:52 [INFO]         'backend': 'default',
2024-05-31 23:31:52 [INFO]         'calibration_sampling_size': [
2024-05-31 23:31:52 [INFO]             100
2024-05-31 23:31:52 [INFO]         ],
2024-05-31 23:31:52 [INFO]         'device': 'cpu',
2024-05-31 23:31:52 [INFO]         'diagnosis': False,
2024-05-31 23:31:52 [INFO]         'domain': 'auto',
2024-05-31 23:31:52 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-05-31 23:31:52 [INFO]         'excluded_precisions': [
2024-05-31 23:31:52 [INFO]         ],
2024-05-31 23:31:52 [INFO]         'framework': 'pytorch_fx',
2024-05-31 23:31:52 [INFO]         'inputs': [
2024-05-31 23:31:52 [INFO]         ],
2024-05-31 23:31:52 [INFO]         'model_name': '',
2024-05-31 23:31:52 [INFO]         'ni_workload_name': 'quantization',
2024-05-31 23:31:52 [INFO]         'op_name_dict': None,
2024-05-31 23:31:52 [INFO]         'op_type_dict': None,
2024-05-31 23:31:52 [INFO]         'outputs': [
2024-05-31 23:31:52 [INFO]         ],
2024-05-31 23:31:52 [INFO]         'quant_format': 'default',
2024-05-31 23:31:52 [INFO]         'quant_level': 'auto',
2024-05-31 23:31:52 [INFO]         'recipes': {
2024-05-31 23:31:52 [INFO]             'smooth_quant': False,
2024-05-31 23:31:52 [INFO]             'smooth_quant_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'layer_wise_quant': False,
2024-05-31 23:31:52 [INFO]             'layer_wise_quant_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'fast_bias_correction': False,
2024-05-31 23:31:52 [INFO]             'weight_correction': False,
2024-05-31 23:31:52 [INFO]             'gemm_to_matmul': True,
2024-05-31 23:31:52 [INFO]             'graph_optimization_level': None,
2024-05-31 23:31:52 [INFO]             'first_conv_or_matmul_quantization': True,
2024-05-31 23:31:52 [INFO]             'last_conv_or_matmul_quantization': True,
2024-05-31 23:31:52 [INFO]             'pre_post_process_quantization': True,
2024-05-31 23:31:52 [INFO]             'add_qdq_pair_to_weight': False,
2024-05-31 23:31:52 [INFO]             'optypes_to_exclude_output_quant': [
2024-05-31 23:31:52 [INFO]             ],
2024-05-31 23:31:52 [INFO]             'dedicated_qdq_pair': False,
2024-05-31 23:31:52 [INFO]             'rtn_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'awq_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'gptq_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'teq_args': {
2024-05-31 23:31:52 [INFO]             },
2024-05-31 23:31:52 [INFO]             'autoround_args': {
2024-05-31 23:31:52 [INFO]             }
2024-05-31 23:31:52 [INFO]         },
2024-05-31 23:31:52 [INFO]         'reduce_range': None,
2024-05-31 23:31:52 [INFO]         'TuningCriterion': {
2024-05-31 23:31:52 [INFO]             'max_trials': 100,
2024-05-31 23:31:52 [INFO]             'objective': [
2024-05-31 23:31:52 [INFO]                 'performance'
2024-05-31 23:31:52 [INFO]             ],
2024-05-31 23:31:52 [INFO]             'strategy': 'basic',
2024-05-31 23:31:52 [INFO]             'strategy_kwargs': None,
2024-05-31 23:31:52 [INFO]             'timeout': 0
2024-05-31 23:31:52 [INFO]         },
2024-05-31 23:31:52 [INFO]         'use_bf16': True
2024-05-31 23:31:52 [INFO]     }
2024-05-31 23:31:52 [INFO] }
2024-05-31 23:31:52 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/ao/quantization/fx/fuse.py:56: UserWarning: Passing a fuse_custom_config_dict to fuse is deprecated and will not be supported in a future version. Please pass in a FuseCustomConfig instead.
  warnings.warn(
2024-05-31 23:31:52 [INFO] Attention Blocks: 0
2024-05-31 23:31:52 [INFO] FFN Blocks: 0
2024-05-31 23:31:52 [INFO] Pass query framework capability elapsed time: 119.76 ms
2024-05-31 23:31:52 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-05-31 23:31:52 [INFO] Quantize the model with default config.
2024-05-31 23:31:53 [INFO] |******Mixed Precision Statistics******|
2024-05-31 23:31:53 [INFO] +----------------------+-------+-------+
2024-05-31 23:31:53 [INFO] |       Op Type        | Total |  INT8 |
2024-05-31 23:31:53 [INFO] +----------------------+-------+-------+
2024-05-31 23:31:53 [INFO] | quantize_per_tensor  |   1   |   1   |
2024-05-31 23:31:53 [INFO] |      ConvReLU2d      |   9   |   9   |
2024-05-31 23:31:53 [INFO] |      MaxPool2d       |   1   |   1   |
2024-05-31 23:31:53 [INFO] |        Conv2d        |   11  |   11  |
2024-05-31 23:31:53 [INFO] |       add_relu       |   8   |   8   |
2024-05-31 23:31:53 [INFO] |  AdaptiveAvgPool2d   |   1   |   1   |
2024-05-31 23:31:53 [INFO] |       flatten        |   1   |   1   |
2024-05-31 23:31:53 [INFO] |        Linear        |   1   |   1   |
2024-05-31 23:31:53 [INFO] |      dequantize      |   1   |   1   |
2024-05-31 23:31:53 [INFO] +----------------------+-------+-------+
2024-05-31 23:31:53 [INFO] Pass quantize model elapsed time: 1033.35 ms
2024-05-31 23:31:53 [INFO] Save tuning history to /home/kubwa/kubwai/15-Huggingface/06_Optimzation/nc_workspace/2024-05-31_23-31-49/./history.snapshot.
2024-05-31 23:31:53 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-05-31 23:31:53 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-05-31 23:31:53 [INFO] Save deploy yaml to /home/kubwa/kubwai/15-Huggingface/06_Optimzation/nc_workspace/2024-05-31_23-31-49/deploy.yaml
Optimum Inference with OpenVINO (Hugging Face documentation)
GitHub - huggingface/optimum-intel: 🤗 Optimum Intel: Accelerate inference with Intel optimization tools