Optimum-ONNX

About Optimum

Optimum is an extension of Transformers that provides a set of performance-optimization tools for training and running models as efficiently as possible on the target hardware.

The AI ecosystem is evolving rapidly, and more and more specialized hardware appears every day, each with its own optimizations. With Optimum, developers can use any of these platforms as efficiently and in the same way as with Transformers.

The table below summarizes which features Hugging Face Optimum supports on each backend.

| Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite |
|---|---|---|---|---|
| Graph optimization | ✔️ | N/A | ✔️ | N/A |
| Post-training dynamic quantization | ✔️ | ✔️ | N/A | ✔️ |
| Post-training static quantization | ✔️ | ✔️ | ✔️ | ✔️ |
| Quantization Aware Training (QAT) | N/A | ✔️ | ✔️ | N/A |
| FP16 (half precision) | ✔️ | N/A | ✔️ | ✔️ |
| Pruning | N/A | ✔️ | ✔️ | N/A |
| Knowledge Distillation | N/A | ✔️ | ✔️ | N/A |

Let's start with ONNX Runtime, the backend most commonly used for optimizing and converting deep learning models.

About ONNX Runtime

ONNX Runtime is a high-performance engine for accelerating DNN inference and training across a wide range of platforms and frameworks. It works natively with models in the ONNX format and interoperates with the major existing frameworks such as PyTorch and TensorFlow. In short, ONNX Runtime offers:

  • A high-performance runtime for ONNX models, with support for the full ONNX-ML specification

  • Availability on Linux, Windows, and Mac

  • Execution on CPU and GPU, with an extensible architecture for plugging in additional hardware accelerators

  • Importing ONNX models from the Model Zoo or converted from a variety of other frameworks

%pip install optimum[onnxruntime]
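
Before moving on to pipelines, here is a minimal sketch (not part of the original notebook; the model id and output directory are illustrative choices) showing the export path end to end: a Transformers checkpoint is converted to ONNX with export=True and the resulting files are saved locally for reuse.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, 
    export=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save model.onnx together with the config and tokenizer files (directory name is arbitrary)
ort_model.save_pretrained("distilbert_onnx")
tokenizer.save_pretrained("distilbert_onnx")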

Pipeline

The inference pipeline tasks supported by Hugging Face are the following:

  • feature-extraction

  • text-classification

  • token-classification

  • question-answering

  • zero-shot-classification

  • text-generation

  • text2text-generation

  • summarization

  • translation

  • image-classification

  • automatic-speech-recognition

  • image-to-text

Each task has an associated pipeline class, but it is simpler to use the pipeline() function, which wraps all the task-specific pipelines in a single object.

The pipeline() function automatically loads a default model and a tokenizer/feature extractor capable of running inference for the task.

from optimum.pipelines import pipeline

classifier = pipeline(
    task="text-classification", 
    accelerator="ort"
)
classifier("I like you. I love you.")
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121


[{'label': 'POSITIVE', 'score': 0.9998763799667358}]

Converting a Transformers model to ONNX

To load a model with the ONNX Runtime backend, ONNX export must be supported for the architecture in question.

from optimum.pipelines import pipeline

onnx_qa = pipeline(
    "question-answering", 
    model="deepset/roberta-base-squad2", 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: RobertaForQuestionAnswering *****
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> False

You can also load the model with the from_pretrained(model_name_or_path, export=True) method of the corresponding ORTModelForXXX class.

For example, here is how to load the ORTModelForQuestionAnswering class for question answering:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# Loading the PyTorch checkpoint and converting to the ONNX format by providing
# export=True
model = ORTModelForQuestionAnswering.from_pretrained(
    "deepset/roberta-base-squad2",
    export=True
)

onnx_qa = pipeline(
    "question-answering", 
    model=model, 
    tokenizer=tokenizer, 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: RobertaForQuestionAnswering *****
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> False

Optimum models

The pipeline() function is tightly integrated with the Hugging Face Hub and can directly load ONNX models published by Optimum.

from optimum.pipelines import pipeline

onnx_qa = pipeline(
    "question-answering", 
    model="optimum/roberta-base-squad2", 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
(The config.json, model.onnx, and tokenizer files are downloaded directly from the Hub; no ONNX export step is needed.)
You can also load the model with the from_pretrained(model_name_or_path) method of the corresponding ORTModelForXXX class.

For example, here is how to load the ORTModelForQuestionAnswering class for question answering:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")

# Loading directly an ONNX model from a model repo.
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

onnx_qa = pipeline(
    "question-answering", 
    model=model, 
    tokenizer=tokenizer, accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
pred
{'score': 0.9041661620140076, 'start': 11, 'end': 18, 'answer': 'Philipp'}
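
The ORT model can also be called directly on tokenized inputs, without going through a pipeline. The snippet below is a minimal sketch (not part of the original notebook) that reuses the model, tokenizer, question, and context defined above and decodes the answer span from the start/end logits; it should recover "Philipp" as well.

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

# Pick the most likely start/end token positions and decode that span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1]).strip()
print(answer)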

Optimizing & quantizing in pipelines

The pipeline() function can run inference not only on vanilla ONNX Runtime checkpoints but also on checkpoints optimized with ORTQuantizer and ORTOptimizer.

Below are two examples of optimizing/quantizing a model and then using it for inference.

ORTQuantizer

Let's quantize a model with ORTQuantizer, the quantizer backed by ONNX Runtime.

from transformers import AutoTokenizer
from optimum.onnxruntime import (
    AutoQuantizationConfig,
    ORTModelForSequenceClassification,
    ORTQuantizer
)
from optimum.pipelines import pipeline

# Load the tokenizer and export the model to the ONNX format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "distilbert_quantized"

model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Load the quantization configuration detailing the quantization we wish to apply
qconfig = AutoQuantizationConfig.avx512_vnni(
    is_static=False, 
    per_channel=True
)
quantizer = ORTQuantizer.from_pretrained(model)

# Apply dynamic quantization and save the resulting model
quantizer.quantize(
    save_dir=save_dir, 
    quantization_config=qconfig
)
# Load the quantized model from a local repository
model = ORTModelForSequenceClassification.from_pretrained(save_dir)

# Create the transformers pipeline
onnx_clx = pipeline(
    "text-classification", 
    model=model, 
    accelerator="ort"
)
text = "I like the new ORT pipeline"
pred = onnx_clx(text)
print(pred)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: True)
Quantizing model...
Saving quantized model at: distilbert_quantized (external data format: False)
Configuration saved in distilbert_quantized/ort_config.json


[{'label': 'POSITIVE', 'score': 0.9967412352561951}]
# Save and push the model to the hub (in practice save_dir could be used here instead)
model.save_pretrained("new_path_for_directory")
model.push_to_hub(
    "new_path_for_directory", 
    repository_id="my-onnx-repo", 
    use_auth_token=True
)
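
To get a feel for what dynamic quantization buys at inference time, a rough latency check can be run on the quantized pipeline. The sketch below is illustrative only (timings depend entirely on hardware and were not part of the original notebook); it reuses the onnx_clx pipeline and text defined above.

import time

def mean_latency_ms(pipe, sample, n_runs=20):
    pipe(sample)  # warm-up call so session initialization is not measured
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(sample)
    return (time.perf_counter() - start) / n_runs * 1000

print(f"quantized pipeline: {mean_latency_ms(onnx_clx, text):.1f} ms per inference")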

ORTOptimizer

Now let's optimize a model with ORTOptimizer, the graph optimizer backed by ONNX Runtime.

from transformers import AutoTokenizer
from optimum.onnxruntime import (
    AutoOptimizationConfig,
    ORTModelForSequenceClassification,
    ORTOptimizer
)
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.pipelines import pipeline

# Load the tokenizer and export the model to the ONNX format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "distilbert_optimized"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, 
    export=True
)

# Load the optimization configuration detailing the optimization we wish to apply
optimization_config = AutoOptimizationConfig.O3()
optimizer = ORTOptimizer.from_pretrained(model)

optimizer.optimize(
    save_dir=save_dir, 
    optimization_config=optimization_config
)
model = ORTModelForSequenceClassification.from_pretrained(save_dir)

onnx_clx = pipeline(
    "text-classification", 
    model=model, 
    accelerator="ort"
)
text = "I like the new ORT pipeline"
pred = onnx_clx(text)
print(pred)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121
FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
Optimizing model...
WARNING:onnx_model:Failed to remove node input: "/distilbert/transformer/layer.0/attention/Transpose_output_0"
input: "/distilbert/transformer/layer.0/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.0/attention/Div_output_0"
name: "/distilbert/transformer/layer.0/attention/Div"
op_type: "Div"
(the same warning is repeated for layers 1 through 5)
Configuration saved in distilbert_optimized/ort_config.json
Optimized model saved at: distilbert_optimized (external data format: False; saved all tensor to one file: True)


[{'label': 'POSITIVE', 'score': 0.9973127245903015}]
tokenizer.save_pretrained("new_path_for_directory")
model.save_pretrained("new_path_for_directory")
model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)