Image-to-Text
Image-to-text tasks mainly cover activities such as image captioning and optical character recognition (OCR), and they are among the most widely used applications.
Image captioning is the process of using a deep learning model to generate a text description that summarizes the content and context of an image.
!wget https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202307/04/637e9c09-4164-41f3-b3be-e174d9989dd8.jpg -O ./dataset/photo.jpg
--2024-05-19 16:19:05-- https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202307/04/637e9c09-4164-41f3-b3be-e174d9989dd8.jpg
Resolving pds.joongang.co.kr (pds.joongang.co.kr)... 139.150.249.11, 121.78.33.182
Connecting to pds.joongang.co.kr (pds.joongang.co.kr)|139.150.249.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52441 (51K) [image/jpeg]
Saving to: ‘./dataset/photo.jpg’
./dataset/photo.jpg 100%[===================>] 51.21K --.-KB/s in 0.008s
2024-05-19 16:19:05 (6.27 MB/s) - ‘./dataset/photo.jpg’ saved [52441/52441]
Image Captioning
from transformers import pipeline
image_to_text = pipeline(
    "image-to-text",
    model="nlpconnect/vit-gpt2-image-captioning"
)
response = image_to_text("dataset/photo.jpg")
print(response)
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/generation/utils.py:1168: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
You may ignore this warning if your `pad_token_id` (50256) is identical to the `bos_token_id` (50256), `eos_token_id` (50256), or the `sep_token_id` (None), and your input is not padded.
[{'generated_text': 'a crowd of people standing on a beach watching a giant balloon float on the water '}]
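The UserWarning above recommends setting `max_new_tokens` instead of relying on the model-agnostic default `max_length`. The pipeline forwards a `generate_kwargs` dict to `model.generate`, so the caption length can be capped explicitly. A minimal sketch (the value 30 is an illustrative choice, not a recommendation from the library):

```python
from transformers import pipeline

image_to_text = pipeline(
    "image-to-text",
    model="nlpconnect/vit-gpt2-image-captioning"
)

# generate_kwargs is forwarded to model.generate; setting max_new_tokens
# bounds the caption length and silences the max_length warning.
response = image_to_text(
    "dataset/photo.jpg",
    generate_kwargs={"max_new_tokens": 30}
)
print(response[0]["generated_text"])
```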
OCR
Tamil Library
%pip install ocr_tamil
from ocr_tamil.ocr import OCR
ocr = OCR(detect=True)
image_path = r"dataset/photo.jpg"
texts = ocr.predict(image_path)
print(texts[0])
saving to /home/kubwa/.model_weights/parseq_tamil_v3.pt
Download would take several minutes
100%|██████████| 95.5M/95.5M [00:00<00:00, 112MB/s]
saving to /home/kubwa/.model_weights/craft_mlt_25k.pth
Download would take several minutes
100%|██████████| 83.2M/83.2M [00:00<00:00, 112MB/s]
Downloading: "https://github.com/gnana70/tamil_ocr/raw/develop/ocr_tamil/model_weights/parseq.pt" to /home/kubwa/.cache/torch/hub/checkpoints/parseq.pt
100%|██████████| 91.0M/91.0M [00:00<00:00, 115MB/s]
['H', 'பயத்தட்']
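The recognizer picked up noisy tokens from this scene photo. OCR results often improve with simple preprocessing such as grayscaling, upscaling, and binarization before prediction. A minimal Pillow sketch (the scale factor and threshold are illustrative assumptions, not part of ocr_tamil):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(path, scale=2, threshold=160):
    """Grayscale, upscale, and binarize an image before OCR."""
    img = Image.open(path).convert("L")                        # grayscale
    img = img.resize((img.width * scale, img.height * scale))  # enlarge small text
    img = ImageOps.autocontrast(img)                           # stretch contrast
    return img.point(lambda p: 255 if p > threshold else 0)    # binarize

# processed = preprocess_for_ocr("dataset/photo.jpg")
# processed.save("dataset/photo_clean.jpg")
# texts = ocr.predict("dataset/photo_clean.jpg")
```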
Image Text to Text
Multimodal image-text-to-text tasks involve processing both image and text inputs to produce a text output. These tasks use models that can understand and integrate information from visual (image) and textual (word) data to generate coherent, contextually appropriate text responses.
%pip install einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
model_id = "vikhyatk/moondream2"
revision = "2024-03-06"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
image = Image.open('dataset/photo.jpg')
enc_image = model.encode_image(image)
query = "Describe this image."
response = model.answer_question(
    enc_image,
    query,
    tokenizer
)
print(response)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A group of people are standing on a beach, with a board displaying text in the foreground. The background features water, mountains, and a sky.
query = "How is the weather?"
response = model.answer_question(enc_image, query, tokenizer)
print(response)
The weather in the image is sunny.
query = "How many people are there in the photo?"
response = model.answer_question(
    enc_image,
    query,
    tokenizer
)
print(response)
5
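Because `encode_image` runs the vision encoder only once, the same `enc_image` can serve any number of follow-up questions, with only the language decoding repeated per query. A sketch of this pattern with the moondream2 setup from above (the question list is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-06"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Encode once; the heavy vision-encoder pass is shared by every question.
enc_image = model.encode_image(Image.open("dataset/photo.jpg"))

questions = [
    "What is the weather like?",
    "Is anyone holding a sign?",
    "What colors dominate the scene?",
]
for q in questions:
    answer = model.answer_question(enc_image, q, tokenizer)
    print(f"Q: {q}\nA: {answer}\n")
```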