Optimum-ONNX

About Optimum

Optimum is an extension of Transformers that provides a set of performance-optimization tools for training and running models as efficiently as possible on the target hardware.

The AI ecosystem is evolving rapidly, and more and more specialized hardware appears every day, each with its own optimizations. With Optimum, developers can use any of these platforms as efficiently and in the same way as with Transformers.

The table below summarizes which features Hugging Face Optimum supports on each backend.

| Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite |
|---|---|---|---|---|
| Graph optimization | ✔️ | N/A | ✔️ | N/A |
| Post-training dynamic quantization | ✔️ | ✔️ | N/A | ✔️ |
| Post-training static quantization | ✔️ | ✔️ | ✔️ | ✔️ |
| Quantization Aware Training (QAT) | N/A | ✔️ | ✔️ | N/A |
| FP16 (half precision) | ✔️ | N/A | ✔️ | ✔️ |
| Pruning | N/A | ✔️ | ✔️ | N/A |
| Knowledge Distillation | N/A | ✔️ | ✔️ | N/A |

Let's start with ONNX Runtime, the backend most commonly used for optimizing and converting deep learning models.

About ONNX Runtime

ONNX Runtime is a high-performance engine for accelerating DNN inference and training across a wide range of platforms and frameworks. It works natively with models in the ONNX format and interoperates with the major existing frameworks such as PyTorch and TensorFlow. In short, ONNX Runtime offers:

  • A high-performance runtime for ONNX models, with support for the full ONNX-ML specification

  • Availability on Linux, Windows, and Mac

  • Execution on CPU and GPU, with an extensible architecture for plugging in additional hardware accelerators

  • Importing ONNX models from the Model Zoo or converted from a variety of other frameworks

%pip install optimum[onnxruntime]
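
Before moving on to pipelines, here is a minimal sketch (not part of the original notebook; the model id and output directory are illustrative choices) showing the export path end to end: a Transformers checkpoint is converted to ONNX with export=True and the resulting files are saved locally for reuse.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, 
    export=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save model.onnx together with the config and tokenizer files (directory name is arbitrary)
ort_model.save_pretrained("distilbert_onnx")
tokenizer.save_pretrained("distilbert_onnx")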

Pipeline

The inference pipeline tasks supported by Hugging Face are the following:

  • feature-extraction

  • text-classification

  • token-classification

  • question-answering

  • zero-shot-classification

  • text-generation

  • text2text-generation

  • summarization

  • translation

  • image-classification

  • automatic-speech-recognition

  • image-to-text

Each task has an associated pipeline class, but it is simpler to use the pipeline() function, which wraps all the task-specific pipelines in a single object.

The pipeline() function automatically loads a default model and a tokenizer/feature extractor capable of running inference for the task.

from optimum.pipelines import pipeline

classifier = pipeline(
    task="text-classification", 
    accelerator="ort"
)
classifier("I like you. I love you.")
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121


[{'label': 'POSITIVE', 'score': 0.9998763799667358}]

Converting a Transformers model to ONNX

To load a model with the ONNX Runtime backend, ONNX export must be supported for the architecture in question.

from optimum.pipelines import pipeline

onnx_qa = pipeline(
    "question-answering", 
    model="deepset/roberta-base-squad2", 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: RobertaForQuestionAnswering *****
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> False

You can also load the model with the from_pretrained(model_name_or_path, export=True) method of the corresponding ORTModelForXXX class.

For example, here is how to load the ORTModelForQuestionAnswering class for question answering:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# Loading the PyTorch checkpoint and converting to the ONNX format by providing
# export=True
model = ORTModelForQuestionAnswering.from_pretrained(
    "deepset/roberta-base-squad2",
    export=True
)

onnx_qa = pipeline(
    "question-answering", 
    model=model, 
    tokenizer=tokenizer, 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: RobertaForQuestionAnswering *****
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> False

Optimum models

The pipeline() function is tightly integrated with the Hugging Face Hub and can directly load ONNX models published by Optimum.

from optimum.pipelines import pipeline

onnx_qa = pipeline(
    "question-answering", 
    model="optimum/roberta-base-squad2", 
    accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
(The config.json, model.onnx, and tokenizer files are downloaded directly from the Hub; no ONNX export step is needed.)
You can also load the model with the from_pretrained(model_name_or_path) method of the corresponding ORTModelForXXX class.

For example, here is how to load the ORTModelForQuestionAnswering class for question answering:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")

# Loading directly an ONNX model from a model repo.
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

onnx_qa = pipeline(
    "question-answering", 
    model=model, 
    tokenizer=tokenizer, accelerator="ort"
)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(
    question=question, 
    context=context
)
pred
{'score': 0.9041661620140076, 'start': 11, 'end': 18, 'answer': 'Philipp'}
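
The ORT model can also be called directly on tokenized inputs, without going through a pipeline. The snippet below is a minimal sketch (not part of the original notebook) that reuses the model, tokenizer, question, and context defined above and decodes the answer span from the start/end logits; it should recover "Philipp" as well.

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

# Pick the most likely start/end token positions and decode that span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1]).strip()
print(answer)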

Optimizing & quantizing in pipelines

The pipeline() function can run inference not only on vanilla ONNX Runtime checkpoints but also on checkpoints optimized with ORTQuantizer and ORTOptimizer.

Below are two examples of optimizing/quantizing a model and then using it for inference.

ORTQuantizer

Let's quantize a model with ORTQuantizer, the quantizer backed by ONNX Runtime.

from transformers import AutoTokenizer
from optimum.onnxruntime import (
    AutoQuantizationConfig,
    ORTModelForSequenceClassification,
    ORTQuantizer
)
from optimum.pipelines import pipeline

# Load the tokenizer and export the model to the ONNX format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "distilbert_quantized"

model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Load the quantization configuration detailing the quantization we wish to apply
qconfig = AutoQuantizationConfig.avx512_vnni(
    is_static=False, 
    per_channel=True
)
quantizer = ORTQuantizer.from_pretrained(model)

# Apply dynamic quantization and save the resulting model
quantizer.quantize(
    save_dir=save_dir, 
    quantization_config=qconfig
)
# Load the quantized model from a local repository
model = ORTModelForSequenceClassification.from_pretrained(save_dir)

# Create the transformers pipeline
onnx_clx = pipeline(
    "text-classification", 
    model=model, 
    accelerator="ort"
)
text = "I like the new ORT pipeline"
pred = onnx_clx(text)
print(pred)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: True)
Quantizing model...
Saving quantized model at: distilbert_quantized (external data format: False)
Configuration saved in distilbert_quantized/ort_config.json


[{'label': 'POSITIVE', 'score': 0.9967412352561951}]
# Save and push the model to the hub (in practice save_dir could be used here instead)
model.save_pretrained("new_path_for_directory")
model.push_to_hub(
    "new_path_for_directory", 
    repository_id="my-onnx-repo", 
    use_auth_token=True
)
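
To get a feel for what dynamic quantization buys at inference time, a rough latency check can be run on the quantized pipeline. The sketch below is illustrative only (timings depend entirely on hardware and were not part of the original notebook); it reuses the onnx_clx pipeline and text defined above.

import time

def mean_latency_ms(pipe, sample, n_runs=20):
    pipe(sample)  # warm-up call so session initialization is not measured
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(sample)
    return (time.perf_counter() - start) / n_runs * 1000

print(f"quantized pipeline: {mean_latency_ms(onnx_clx, text):.1f} ms per inference")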

ORTOptimizer

Now let's optimize a model with ORTOptimizer, the graph optimizer backed by ONNX Runtime.

from transformers import AutoTokenizer
from optimum.onnxruntime import (
    AutoOptimizationConfig,
    ORTModelForSequenceClassification,
    ORTOptimizer
)
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.pipelines import pipeline

# Load the tokenizer and export the model to the ONNX format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "distilbert_optimized"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, 
    export=True
)

# Load the optimization configuration detailing the optimization we wish to apply
optimization_config = AutoOptimizationConfig.O3()
optimizer = ORTOptimizer.from_pretrained(model)

optimizer.optimize(
    save_dir=save_dir, 
    optimization_config=optimization_config
)
model = ORTModelForSequenceClassification.from_pretrained(save_dir)

onnx_clx = pipeline(
    "text-classification", 
    model=model, 
    accelerator="ort"
)
text = "I like the new ORT pipeline"
pred = onnx_clx(text)
print(pred)
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: DistilBertForSequenceClassification *****
Using framework PyTorch: 2.3.0+cu121
FutureWarning: disable_embed_layer_norm will be deprecated soon, use disable_embed_layer_norm_fusion instead, disable_embed_layer_norm_fusion is set to True.
Optimizing model...
WARNING:onnx_model:Failed to remove node input: "/distilbert/transformer/layer.0/attention/Transpose_output_0"
input: "/distilbert/transformer/layer.0/attention/Constant_11_output_0"
output: "/distilbert/transformer/layer.0/attention/Div_output_0"
name: "/distilbert/transformer/layer.0/attention/Div"
op_type: "Div"
(the same warning is repeated for layers 1 through 5)
Configuration saved in distilbert_optimized/ort_config.json
Optimized model saved at: distilbert_optimized (external data format: False; saved all tensor to one file: True)


[{'label': 'POSITIVE', 'score': 0.9973127245903015}]
tokenizer.save_pretrained("new_path_for_directory")
model.save_pretrained("new_path_for_directory")
model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)