Image-to-Text

Image-to-text tasks mainly cover image captioning and optical character recognition (OCR), and they are among the most widely used applications.

Image captioning uses a deep learning model to generate a text description that summarizes the content and context of an image.

!wget https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202307/04/637e9c09-4164-41f3-b3be-e174d9989dd8.jpg -O ./dataset/photo.jpg
--2024-05-19 16:19:05--  https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202307/04/637e9c09-4164-41f3-b3be-e174d9989dd8.jpg
Resolving pds.joongang.co.kr (pds.joongang.co.kr)... 139.150.249.11, 121.78.33.182
Connecting to pds.joongang.co.kr (pds.joongang.co.kr)|139.150.249.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52441 (51K) [image/jpeg]
Saving to: ‘./dataset/photo.jpg’

./dataset/photo.jpg 100%[===================>]  51.21K  --.-KB/s    in 0.008s  

2024-05-19 16:19:05 (6.27 MB/s) - ‘./dataset/photo.jpg’ saved [52441/52441]
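
The `wget -O` call above writes into `./dataset/`, and `wget -O` does not create that folder if it is missing. As a minimal sketch, assuming only the Python standard library, the same download could be done from Python with the folder created first. The helper name `download_file` is illustrative and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, dest: str) -> Path:
    """Download `url` to `dest`, creating the parent folder if needed."""
    dest_path = Path(dest)
    # Create ./dataset (or any other parent folder) before writing the file.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest_path))
    return dest_path
```

For the image used here, the call would be `download_file(url, "dataset/photo.jpg")`, with `url` set to the address from the `wget` command above.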

Image Captioning

from transformers import pipeline

image_to_text = pipeline(
    "image-to-text", 
    model="nlpconnect/vit-gpt2-image-captioning"
)

response = image_to_text("dataset/photo.jpg")
print(response)
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/generation/utils.py:1168: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
You may ignore this warning if your `pad_token_id` (50256) is identical to the `bos_token_id` (50256), `eos_token_id` (50256), or the `sep_token_id` (None), and your input is not padded.


[{'generated_text': 'a crowd of people standing on a beach watching a giant balloon float on the water '}]
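
The warning above recommends setting `max_new_tokens` instead of relying on the default `max_length`, and the generated caption ends with a stray space. The sketch below shows one way to handle both, assuming the pipeline call accepts `max_new_tokens` as it does in the `transformers` version shown in the log. The helper names `extract_captions` and `caption_image` and the value `30` are illustrative assumptions, not part of the original notebook.

```python
from typing import Dict, List


def extract_captions(response: List[Dict[str, str]]) -> List[str]:
    """Pull the caption strings out of an image-to-text pipeline response."""
    # Each item is a dict like {'generated_text': '...'}; strip stray spaces.
    return [item["generated_text"].strip() for item in response]


def caption_image(path: str, max_new_tokens: int = 30) -> List[str]:
    """Caption one image; needs transformers and downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    image_to_text = pipeline(
        "image-to-text",
        model="nlpconnect/vit-gpt2-image-captioning",
    )
    # Assumption: 30 is an arbitrary cap chosen for illustration.
    response = image_to_text(path, max_new_tokens=max_new_tokens)
    return extract_captions(response)


# The response printed above, reused here without loading the model.
sample = [{"generated_text": "a crowd of people standing on a beach "
                             "watching a giant balloon float on the water "}]
print(extract_captions(sample))
```

`caption_image("dataset/photo.jpg")` would run the full pipeline; the last two lines only clean up the response already shown.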

OCR

Tamil Library

%pip install ocr_tamil
from ocr_tamil.ocr import OCR

ocr = OCR(detect=True)
image_path = r"dataset/photo.jpg"

texts = ocr.predict(image_path)

print(texts[0])
saving to /home/kubwa/.model_weights/parseq_tamil_v3.pt
Download would take several minutes


100%|██████████| 95.5M/95.5M [00:00<00:00, 112MB/s] 


saving to /home/kubwa/.model_weights/craft_mlt_25k.pth
Download would take several minutes


100%|██████████| 83.2M/83.2M [00:00<00:00, 112MB/s] 
Downloading: "https://github.com/gnana70/tamil_ocr/raw/develop/ocr_tamil/model_weights/parseq.pt" to /home/kubwa/.cache/torch/hub/checkpoints/parseq.pt
100%|██████████| 91.0M/91.0M [00:00<00:00, 115MB/s]


['H', 'பயத்தட்']
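
`texts[0]` is the list of words recognized in the first image. As a small sketch, assuming `ocr.predict` returns one list of strings per input image (as the output above suggests), the words can be joined into one line per image. The helper name `join_ocr_words` is hypothetical.

```python
from typing import List


def join_ocr_words(texts: List[List[str]], sep: str = " ") -> List[str]:
    """Join the recognized words of each image into a single line of text."""
    return [sep.join(words) for words in texts]


# The inner list is the result printed above for dataset/photo.jpg.
print(join_ocr_words([["H", "பயத்தட்"]]))
```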

Image Text to Text

Multimodal image-text-to-text tasks take both an image and a text input and produce a text output. They rely on models that can understand and combine visual (image) and textual (word) information to generate coherent, context-appropriate responses.

%pip install einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-06"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True, 
    revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('dataset/photo.jpg')
enc_image = model.encode_image(image)

query = "Describe this image."

response = model.answer_question(
    enc_image, 
    query, 
    tokenizer
)
print(response)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


A group of people are standing on a beach, with a board displaying text in the foreground. The background features water, mountains, and a sky.

query = "How is the weather?"

response = model.answer_question(enc_image, query, tokenizer)
print(response)
The weather in the image is sunny.

query = "How many people are there in the photo?"

response = model.answer_question(
    enc_image, 
    query, tokenizer
)
print(response)
5
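
The cells above encode the image once and then call `answer_question` for each query. That pattern can be wrapped in a loop, sketched below under the assumption that `model`, `enc_image`, and `tokenizer` are the objects created earlier. The helper name `ask_questions` is illustrative and not part of the moondream2 API.

```python
from typing import Any, Dict, Iterable


def ask_questions(model: Any, enc_image: Any, tokenizer: Any,
                  questions: Iterable[str]) -> Dict[str, str]:
    """Ask several questions about one already-encoded image."""
    answers: Dict[str, str] = {}
    for question in questions:
        # Same call as in the cells above; the image is not re-encoded.
        answers[question] = model.answer_question(enc_image, question, tokenizer)
    return answers


# Usage with the objects created above (not run here):
# answers = ask_questions(
#     model, enc_image, tokenizer,
#     ["Describe this image.", "How is the weather?"],
# )
# for question, answer in answers.items():
#     print(question, "->", answer)
```

Reusing `enc_image` avoids running the vision encoder again for every question.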

Last updated 1 year ago