Accelerator
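Large language models (LLMs) have revolutionized natural language processing. As these models grow in size and complexity, the compute required for inference grows significantly as well.
Using multiple GPUs is essential to address this problem.
Hugging Face's Accelerate library (the accelerate package, built around the Accelerator class) makes multi-GPU training and inference easy.
It is designed for PyTorch training and inference on large-scale data in general, and it also applies to LLMs.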
Accelerator Basic
%pip install accelerate
from accelerate import Accelerator
from accelerate.utils import gather_object
accelerator = Accelerator()
# Each GPU (process) creates a string.
message = [f"Hello this is GPU {accelerator.process_index}"]
# Collect the messages from all GPUs.
messages = gather_object(message)
# Use accelerator.print() so that only the main process prints the messages.
accelerator.print(messages)
['Hello this is GPU 0']
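The output above contains a single entry because the cell ran as one process. As a rough illustration of what gather_object() would return with more processes, the sketch below simulates the gathering step in plain Python. `simulate_gather_object` is a hypothetical helper, not part of Accelerate, and it assumes the per-process lists are concatenated in process-index order.

```python
def simulate_gather_object(per_process_objects):
    """Concatenate each process's list in process-index order (hypothetical helper)."""
    gathered = []
    for objects in per_process_objects:
        gathered.extend(objects)
    return gathered


# Pretend four processes each built their own one-element message list.
num_processes = 4
per_process = [[f"Hello this is GPU {rank}"] for rank in range(num_processes)]

messages = simulate_gather_object(per_process)
print(messages)
```

Each process passes in a list, which is why the tutorial code wraps its message (and later its results dictionary) in a list before gathering.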
Multi GPU LLM inference
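Uses Meta's Llama-3-8B model
Feeds 10 classic opening lines from Penguin Books, repeated 10 times (100 prompts in total), as prompts
Runs parallel inference across GPUs with Accelerator()
Gathers and merges all of the parallel inference results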
from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json
# Create the Accelerator
accelerator = Accelerator()
# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all = [
    "The King is dead. Long live the Queen.",
    "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
    "The story so far: in the beginning, the universe was created.",
    "It was a bright cold day in April, and the clocks were striking thirteen.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "The sweat wis lashing oafay Sick Boy; he wis trembling.",
    "124 was spiteful. Full of Baby's venom.",
    "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
    "I write this sitting in the kitchen sink.",
    "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
] * 10
# Load the base model and tokenizer
model_path = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Synchronize the GPUs and start the timer
accelerator.wait_for_everyone()
start = time.time()
# Split the prompt list across the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    # Store the generated outputs in a dictionary
    results = dict(outputs=[], num_tokens=0)
    # Each GPU runs inference one prompt at a time
    for prompt in prompts:
        prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]
        # Remove the prompt from the output
        output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]
        # Store the output and the token count in results
        results["outputs"].append(tokenizer.decode(output_tokenized))
        results["num_tokens"] += len(output_tokenized)
    results = [results]  # wrap results in a list so gather_object() can collect it
# Collect the results from all GPUs
results_gathered = gather_object(results)
if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum([r["num_tokens"] for r in results_gathered])
    print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
tokens/sec: 24.0, time 390.7256546020508, total tokens 9479, total prompts 100
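To make the splitting step more concrete, here is a minimal plain-Python sketch of dividing a prompt list into contiguous, near-equal slices, one per process. `split_for_process` is a hypothetical helper that only approximates what split_between_processes() does; it assumes any leftover items go to the lowest-ranked processes.

```python
def split_for_process(items, num_processes, process_index):
    """Return the contiguous slice of `items` owned by one process (hypothetical helper)."""
    base, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * base + min(process_index, extras)
    end = start + base + (1 if process_index < extras else 0)
    return items[start:end]


demo_prompts = [f"prompt {i}" for i in range(100)]
shares = [split_for_process(demo_prompts, 3, rank) for rank in range(3)]
sizes = [len(share) for share in shares]
print(sizes)  # [34, 33, 33]
```

Because the slices are contiguous and ordered by process index, concatenating the gathered results in rank order restores the original prompt order.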
Multi GPU inference (batched)
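Uses Meta's Llama-3-8B model
Feeds 10 classic opening lines from Penguin Books, repeated 10 times (100 prompts in total), as prompts
Groups the tokenized data into batches
Runs batched parallel inference across GPUs with Accelerator()
Gathers and merges all of the parallel inference results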
from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json
# Create the Accelerator
accelerator = Accelerator()
# Helper that writes data to a file as indented JSON
def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)
# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all = [
    "The King is dead. Long live the Queen.",
    "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
    "The story so far: in the beginning, the universe was created.",
    "It was a bright cold day in April, and the clocks were striking thirteen.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "The sweat wis lashing oafay Sick Boy; he wis trembling.",
    "124 was spiteful. Full of Baby's venom.",
    "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
    "I write this sitting in the kitchen sink.",
    "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
] * 10
# Load the base model and tokenizer
model_path = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Reuse the EOS token as the padding token so batches can be padded
tokenizer.pad_token = tokenizer.eos_token
# Batch, left-pad (for inference), and tokenize
def prepare_prompts(prompts, tokenizer, batch_size=16):
    batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
    batches_tok = []
    tokenizer.padding_side = "left"
    for prompt_batch in batches:
        batches_tok.append(
            tokenizer(
                prompt_batch,
                return_tensors="pt",
                padding="longest",
                truncation=False,
                pad_to_multiple_of=8,
                add_special_tokens=False).to("cuda")
        )
    tokenizer.padding_side = "right"
    return batches_tok
# Synchronize the GPUs and start the timer
accelerator.wait_for_everyone()
start = time.time()
# Split the prompt list across the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    results = dict(outputs=[], num_tokens=0)
    # Each GPU runs inference one batch at a time
    prompt_batches = prepare_prompts(prompts, tokenizer, batch_size=16)
    for prompts_tokenized in prompt_batches:
        outputs_tokenized = model.generate(**prompts_tokenized, max_new_tokens=100)
        # Remove the prompt from the generated tokens
        outputs_tokenized = [tok_out[len(tok_in):]
            for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized)]
        # Count the tokens and decode them
        num_tokens = sum([len(t) for t in outputs_tokenized])
        outputs = tokenizer.batch_decode(outputs_tokenized)
        # Store the results to be gathered in results
        results["outputs"].extend(outputs)
        results["num_tokens"] += num_tokens
    results = [results]  # wrap results in a list so gather_object() can collect it
# Collect the results from all GPUs
results_gathered = gather_object(results)
if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum([r["num_tokens"] for r in results_gathered])
    print(f"tokens/sec: {num_tokens//timediff}, time elapsed: {timediff}, num_tokens {num_tokens}")
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
tokens/sec: 57.0, time elapsed: 173.17202281951904, num_tokens 10000
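The batched version relies on two ideas: slicing the prompts into fixed-size batches, and left-padding each batch so that every prompt ends at the same position. The sketch below illustrates both on plain lists of token IDs, without a real tokenizer. `make_batches`, `left_pad`, the pad ID 0, and the toy token IDs are all assumptions chosen for illustration.

```python
def make_batches(items, batch_size):
    """Slice a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def left_pad(sequences, pad_id=0, multiple_of=8):
    """Pad on the left so all rows share one length, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the multiple
    return [[pad_id] * (target - len(seq)) + seq for seq in sequences]


# 100 prompts with batch_size=16 give six full batches and one batch of four.
batches = make_batches(list(range(100)), batch_size=16)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [16, 16, 16, 16, 16, 16, 4]

# Toy "tokenized prompts" of different lengths.
toy_prompts = [[11, 12, 13], [21, 22, 23, 24, 25], [31]]
padded = left_pad(toy_prompts)
for row in padded:
    print(row)

# With left padding, the new tokens always start right after the padded prompt,
# so slicing by the padded prompt length isolates the completion.
output_row = padded[0] + [101, 102]
completion = output_row[len(padded[0]):]
print(completion)  # [101, 102]
```

This is why the tutorial code can strip the prompt with a single slice per row: every padded prompt in a batch has the same length, and generation appends to the right.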
Conclusion
tokens/sec: 24.0, time 390.7256546020508, total tokens 9479, total prompts 100
tokens/sec: 57.0, time elapsed: 173.17202281951904, num_tokens 10000
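For 100 prompts, prompt-by-prompt inference with Accelerator took about 390 s (24 tokens/sec).
With the same prompts fed in batches of 16, generating 10,000 tokens took about 173 s (57 tokens/sec), roughly 2.3 times faster in wall-clock time.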
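These figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes throughput and speedup from the two logged lines; the variable names are illustrative. Two caveats apply: the scripts print tokens/sec using floor division, so the reported rates are rounded down, and the batched total of 10,000 equals 100 prompts × 100 new tokens, which suggests it likely includes padding emitted after shorter completions had finished.

```python
# Values copied from the two logged result lines.
sequential = {"tokens": 9479, "seconds": 390.7256546020508}
batched = {"tokens": 10000, "seconds": 173.17202281951904}


def tokens_per_second(run):
    """Exact throughput, without the floor division used in the scripts."""
    return run["tokens"] / run["seconds"]


seq_rate = tokens_per_second(sequential)
bat_rate = tokens_per_second(batched)
time_speedup = sequential["seconds"] / batched["seconds"]

print(f"sequential: {seq_rate:.2f} tokens/sec")
print(f"batched:    {bat_rate:.2f} tokens/sec")
print(f"wall-clock speedup: {time_speedup:.2f}x")
```

Comparing wall-clock time is the safer measure here, since both runs processed the same 100 prompts while their token counts were measured differently.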