Tabular Question & Answering

Question & Answering

  1. General Q&A

  2. Tabular Q&A

  3. TAPEX

General Q&A

from transformers import pipeline

# Extractive QA pipeline: multilingual DeBERTa-v3 fine-tuned on SQuAD2
qa_model = pipeline(
    "question-answering",
    model="timpal0l/mdeberta-v3-base-squad2",
)

context = "The Great Wall of China is one of the world's most famous landmarks. It was built over several centuries and is thousands of kilometers long. The wall was primarily constructed to protect against invasions and raids from various nomadic groups from the Eurasian Steppe."
question = "What was the primary purpose of building the Great Wall of China?"

qa_model(question=question, context=context)
{'score': 0.31094032526016235,
 'start': 176,
 'end': 215,
 'answer': ' to protect against invasions and raids'}
qa_model(
    question=question,
    context=context,
    top_k=3,                         # number of candidate answers to return
    max_answer_len=30,               # maximum length of a predicted answer
    max_seq_len=400,                 # maximum length of the tokenized question + context
    handle_impossible_answer=False,  # do not allow an empty "no answer" prediction
)
[{'score': 0.31094032526016235,
  'start': 176,
  'end': 215,
  'answer': ' to protect against invasions and raids'},
 {'score': 0.23844484984874725,
  'start': 179,
  'end': 215,
  'answer': ' protect against invasions and raids'},
 {'score': 0.1176154762506485,
  'start': 176,
  'end': 243,
  'answer': ' to protect against invasions and raids from various nomadic groups'}]
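
The start and end fields are character offsets into the context string, so the answer span can always be recovered by slicing. A minimal sketch, reusing the qa_model, question, and context defined above:

# start/end index into the original context string
best = qa_model(question=question, context=context)
print(context[best["start"]:best["end"]])
# -> ' to protect against invasions and raids'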

Tabular Q&A

Pandas DataFrame

import pandas as pd

# Sample data for a football player statistics dataset
data = {
    "Player Name": ["Lionel Messi", "Cristiano Ronaldo", "Neymar Jr", "Kevin De Bruyne", "Robert Lewandowski"],
    "Team": ["Paris Saint-Germain", "Al Nassr", "Paris Saint-Germain", "Manchester City", "Barcelona"],
    "Nationality": ["Argentina", "Portugal", "Brazil", "Belgium", "Poland"],
    "Goals": [25, 30, 18, 12, 34],
    "Assists": [18, 15, 20, 25, 10],
    "Passes Completed": [2050, 1800, 1900, 2300, 1500],
    "Matches Played": [30, 33, 29, 32, 31],
    "Yellow Cards": [2, 3, 4, 1, 5],
    "Red Cards": [0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
df
   Player Name         Team                 Nationality  Goals  Assists  Passes Completed  Matches Played  Yellow Cards  Red Cards
0  Lionel Messi        Paris Saint-Germain  Argentina    25     18       2050              30              2             0
1  Cristiano Ronaldo   Al Nassr             Portugal     30     15       1800              33              3             1
2  Neymar Jr           Paris Saint-Germain  Brazil       18     20       1900              29              4             0
3  Kevin De Bruyne     Manchester City      Belgium      12     25       2300              32              1             0
4  Robert Lewandowski  Barcelona            Poland       34     10       1500              31              5             1
DataFrame to String

The TAPAS pipeline expects every table cell to be a string, so cast the whole DataFrame first.

# TAPAS requires all table cells as strings
df = df.astype(str)

Tokenizer & Models

from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering, pipeline

# TAPAS fine-tuned on WikiTableQuestions (WTQ)
model = AutoModelForTableQuestionAnswering.from_pretrained(
    "google/tapas-large-finetuned-wtq"
)
tokenizer = AutoTokenizer.from_pretrained("google/tapas-large-finetuned-wtq")

Pipeline

nlp = pipeline('table-question-answering', model=model, tokenizer=tokenizer)

Inference

question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?",
]

result = nlp({'table': df, 'query': question_list[0]})
print(result)

for question in question_list:
    result = nlp({'table': df, 'query': question})
    print(result['cells'][0].strip())
{'answer': 'Robert Lewandowski', 'coordinates': [(4, 0)], 'cells': ['Robert Lewandowski'], 'aggregator': 'NONE'}
Robert Lewandowski
25
Kevin De Bruyne
0
Kevin De Bruyne
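
TAPAS also predicts an aggregation operator alongside the selected cells; in the runs above it is 'NONE', but sum/count/average style questions return 'SUM', 'COUNT', or 'AVERAGE', and the numeric answer is then computed from the cells on the client side. A minimal sketch, reusing the nlp pipeline and df from above (the query string is illustrative):

# For aggregation questions TAPAS returns the operator plus the selected cells
result = nlp({'table': df, 'query': "What is the total number of goals scored by all players?"})
if result['aggregator'] == 'SUM':
    print(sum(float(c) for c in result['cells']))
elif result['aggregator'] == 'AVERAGE':
    values = [float(c) for c in result['cells']]
    print(sum(values) / len(values))
elif result['aggregator'] == 'COUNT':
    print(len(result['cells']))
else:  # 'NONE': the selected cell is the answer itself
    print(result['answer'])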

TAPEX Model

The TAPEX model, developed by Microsoft, is a useful tool for table question answering in NLP.

  • Strong performance: TAPEX has shown impressive results on a range of table question-answering benchmarks, often outperforming other models.

  • Versatility: It handles both natural-language questions over tables and table fact verification, so it covers a broad range of use cases (a fact-verification sketch follows at the end of this section).

  • Accessibility: TAPEX ships with the Hugging Face Transformers library, making it accessible to a wide range of developers and researchers.

# DataFrame to str
df = df.astype(str)

# Transformers: TAPEX is a BART model pre-trained to execute queries over tables
from transformers import TapexTokenizer, BartForConditionalGeneration

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-finetuned-wtq")

# Questions
question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?",
]

# Encode the table and question, then generate the answer text
encoding = tokenizer(table=df, query=question_list[0], return_tensors="pt")
outputs = model.generate(**encoding)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Results
print(result)
for question in question_list:
    encoding = tokenizer(table=df, query=question, return_tensors="pt")
    outputs = model.generate(**encoding)
    result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(result[0].strip())
[' robert lewandowski']
robert lewandowski
25
kevin de bruyne
0
lionel messi, 2050
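
As noted in the bullets above, TAPEX also covers table fact verification. A minimal sketch, assuming the microsoft/tapex-large-finetuned-tabfact checkpoint and reusing the stringified df; the claim text is illustrative:

from transformers import TapexTokenizer, BartForSequenceClassification

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-tabfact")
model = BartForSequenceClassification.from_pretrained("microsoft/tapex-large-finetuned-tabfact")

# Classify whether the table entails or refutes the claim
claim = "Robert Lewandowski scored more goals than Cristiano Ronaldo."
encoding = tokenizer(table=df, query=claim, return_tensors="pt")
outputs = model(**encoding)
pred_id = outputs.logits.argmax(-1).item()
print(model.config.id2label[pred_id])  # label names come from the checkpoint config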