AI-Master-Book
  • about AI-Master-Book
  • AI Master Book
    • 이상치 탐지 with Python
    • 베이지안 뉴럴네트워크 (BNN) with Python
    • 그래프 뉴럴네트워크 (GNN) with Python
    • 데이터 마케팅 분석 with Python
  • LLM MASTER BOOK
    • OpenAI API 쿡북 with Python
    • 기초부터 심화까지 RAG 쿡북 with Python
    • MCP 에이전트 쿡북 with Python
  • LLMs
    • OpenAI API
      • 1️⃣ChatCompletion
      • 2️⃣DALL-E
      • 3️⃣Text to Speech
      • 4️⃣Video to Transcripts
      • 5️⃣Assistants API
      • 6️⃣Prompt Engineering
      • 7️⃣OpenAI New GPT-4o
    • LangChain
      • LangChain Basic
        • 1️⃣Basic Modules
        • 2️⃣Model I/O
        • 3️⃣Prompts
        • 4️⃣Chains
        • 5️⃣Agents
        • 6️⃣Tools
        • 7️⃣Memory
      • LangChain Intermediate
        • 1️⃣OpenAI LLM
        • 2️⃣Prompt Template
        • 3️⃣Retrieval
        • 4️⃣RAG ChatBot
        • 5️⃣RAG with Gemini
        • 6️⃣New Huggingface-LangChain
        • 7️⃣Huggingface Hub
        • 8️⃣SQL Agent & Chain
        • 9️⃣Expression Language(LCEL)
        • 🔟Llama3-8B with LangChain
      • LangChain Advanced
        • 1️⃣LLM Evaluation
        • 2️⃣RAG Evaluation with RAGAS
        • 3️⃣LangChain with RAGAS
        • 4️⃣RAG Paradigms
        • 5️⃣LangChain: Advance Techniques
        • 6️⃣LangChain with NeMo-Guardrails
        • 7️⃣LangChain vs. LlamaIndex
        • 8️⃣LangChain LCEL vs. LangGraph
    • LlamaIndex
      • LlamaIndex Basic
        • 1️⃣Introduction
        • 2️⃣Customization
        • 3️⃣Data Connectors
        • 4️⃣Documents & Nodes
        • 5️⃣Naive RAG
        • 6️⃣Advanced RAG
        • 7️⃣Llama3-8B with LlamaIndex
        • 8️⃣LlmaPack
      • LlamaIndex Intermediate
        • 1️⃣QueryEngine
        • 2️⃣Agent
        • 3️⃣Evaluation
        • 4️⃣Evaluation-Driven Development
        • 5️⃣Fine-tuning
        • 6️⃣Prompt Compression with LLMLingua
      • LlamaIndex Advanced
        • 1️⃣Agentic RAG: Router Engine
        • 2️⃣Agentic RAG: Tool Calling
        • 3️⃣Building Agent Reasoning Loop
        • 4️⃣Building Multi-document Agent
    • Hugging Face
      • Huggingface Basic
        • 1️⃣Datasets
        • 2️⃣Tokenizer
        • 3️⃣Sentence Embeddings
        • 4️⃣Transformers
        • 5️⃣Sentence Transformers
        • 6️⃣Evaluate
        • 7️⃣Diffusers
      • Huggingface Tasks
        • NLP
          • 1️⃣Sentiment Analysis
          • 2️⃣Zero-shot Classification
          • 3️⃣Aspect-Based Sentiment Analysis
          • 4️⃣Feature Extraction
          • 5️⃣Intent Classification
          • 6️⃣Topic Modeling: BERTopic
          • 7️⃣NER: Token Classification
          • 8️⃣Summarization
          • 9️⃣Translation
          • 🔟Text Generation
        • Audio & Tabular
          • 1️⃣Text-to-Speech: TTS
          • 2️⃣Speech Recognition: Whisper
          • 3️⃣Audio Classification
          • 4️⃣Tabular Qustaion & Answering
        • Vision & Multimodal
          • 1️⃣Image-to-Text
          • 2️⃣Text to Image
          • 3️⃣Image to Image
          • 4️⃣Text or Image-to-Video
          • 5️⃣Depth Estimation
          • 6️⃣Image Classification
          • 7️⃣Object Detection
          • 8️⃣Segmentatio
      • Huggingface Optimization
        • 1️⃣Accelerator
        • 2️⃣Bitsandbytes
        • 3️⃣Flash Attention
        • 4️⃣Quantization
        • 5️⃣Safetensors
        • 6️⃣Optimum-ONNX
        • 7️⃣Optimum-NVIDIA
        • 8️⃣Optimum-Intel
      • Huggingface Fine-tuning
        • 1️⃣Transformer Fine-tuning
        • 2️⃣PEFT Fine-tuning
        • 3️⃣PEFT: Fine-tuning with QLoRA
        • 4️⃣PEFT: Fine-tuning Phi-2 with QLoRA
        • 5️⃣Axoltl Fine-tuning with QLoRA
        • 6️⃣TRL: RLHF Alignment Fine-tuning
        • 7️⃣TRL: DPO Fine-tuning with Phi-3-4k-instruct
        • 8️⃣TRL: ORPO Fine-tuning with Llama3-8B
        • 9️⃣Convert GGUF gemma-2b with llama.cpp
        • 🔟Apple Silicon Fine-tuning Gemma-2B with MLX
        • 🔢LLM Mergekit
    • Agentic LLM
      • Agentic LLM
        • 1️⃣Basic Agentic LLM
        • 2️⃣Multi-agent with CrewAI
        • 3️⃣LangGraph: Multi-agent Basic
        • 4️⃣LangGraph: Agentic RAG with LangChain
        • 5️⃣LangGraph: Agentic RAG with Llama3-8B by Groq
      • Autonomous Agent
        • 1️⃣LLM Autonomous Agent?
        • 2️⃣AutoGPT: Worldcup Winner Search with LangChain
        • 3️⃣BabyAGI: Weather Report with LangChain
        • 4️⃣AutoGen: Writing Blog Post with LangChain
        • 5️⃣LangChain: Autonomous-agent Debates with Tools
        • 6️⃣CAMEL Role-playing Autonomous Cooperative Agents
        • 7️⃣LangChain: Two-player Harry Potter D&D based CAMEL
        • 8️⃣LangChain: Multi-agent Bid for K-Pop Debate
        • 9️⃣LangChain: Multi-agent Authoritarian Speaker Selection
        • 🔟LangChain: Multi-Agent Simulated Environment with PettingZoo
    • Multimodal
      • 1️⃣PaliGemma: Open Vision LLM
      • 2️⃣FLUX.1: Generative Image
    • Building LLM
      • 1️⃣DSPy
      • 2️⃣DSPy RAG
      • 3️⃣DSPy with LangChain
      • 4️⃣Mamba
      • 5️⃣Mamba RAG with LangChain
      • 7️⃣PostgreSQL VectorDB with pgvorco.rs
Powered by GitBook
On this page
  • About TensorRT
  • Optimum-NVIDIA TensorRT-LLM 설치
  • Optium-NVIDIA TensorRT-LLM
  1. LLMs
  2. Hugging Face
  3. Huggingface Optimization

Optimum-NVIDIA

Huggingface에서 사용할 수 있는 Optimum-NVIDIA는 매우 간단한 API를 통해 NVIDIA 플랫폼에서 LLM 추론을 획기적으로 가속화합니다. 단 한 줄의 코드만 변경하면 NVIDIA 플랫폼에서 최대 28배 빠른 추론과 초당 1,200개의 토큰을 처리할 수 있습니다.

옵티멈-NVIDIA는 엔비디아 에이다 러브레이스 및 호퍼 아키텍처에서 지원되는 새로운 플로트8 형식의 이점을 활용하는 최초의 허깅 페이스 추론 라이브러리입니다. FP8은 NVIDIA TensorRT-LLM 소프트웨어의 고급 컴파일 기능에 더해 LLM 추론을 획기적으로 가속화합니다.

About TensorRT

TensorRT는 NVIDIA GPU 연산에 적합한 최적화 기법들을 이용하여 모델을 최적화하는 Optimizer와 다양한 GPU에서 모델연산을 수행하는 Runtime Engine을 지원하는 소프트웨어 라이브러리이다.TensorRT는 대부분의 딥러닝 프레임워크 (TensorFlow, PyTorch 등)에서 학습된 모델을 지원하며 GPU 환경에서 동일한 방식으로 적용 가능하여, 최적의 딥러닝 모델 Inference 가속을 지원합니다.

Optimum-NVIDIA TensorRT-LLM 설치

CUDA와 TensorRT 기반으로 설치해야하기 때문에 가능하면 CLI에서 Docker build를 권장합니다.

Optium-NVIDIA TensorRT-LLM

git lfs install
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
make -C docker release_build

Pipeline 예시입니다. Docker Build로 인하여 실행 결과는 생략합니다.

from optimum.nvidia.pipelines import pipeline

pipe = pipeline(
    'text-generation', 
    'mistralai/Mistral-7B-v0.3', 
    use_fp8=True
)
pipe("Describe a real-world application of AI in sustainable energy.")

from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.3", 
    padding_side="left"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    use_fp8=True,
    max_prompt_length=1024,
    ax_output_length=2048, # Must be at least size of max_prompt_length + max_new_tokens
    max_batch_size=8,
)

model_inputs = tokenizer(
    ["How is autonomous vehicle technology transforming the future of transportation and urban planning?"], 
    return_tensors="pt"
).to("cuda")

generated_ids = model.generate(
    **model_inputs, 
    top_k=40, 
    top_p=0.7, 
    repetition_penalty=10,
)
PreviousOptimum-ONNXNextOptimum-Intel

Last updated 1 year ago

7️⃣