Tabular Question Answering
Question Answering
General Q&A
Tabular Q&A
TAPEX
General Q&A
from transformers import pipeline

# Build an extractive question-answering pipeline from a Hub checkpoint
qa_model = pipeline(
    "question-answering",
    "timpal0l/mdeberta-v3-base-squad2"
)
context = "The Great Wall of China is one of the world's most famous landmarks. It was built over several centuries and is thousands of kilometers long. The wall was primarily constructed to protect against invasions and raids from various nomadic groups from the Eurasian Steppe."
question = "What was the primary purpose of building the Great Wall of China?"
# Returns the best answer span with its score and character offsets
qa_model(question=question, context=context)
{'score': 0.31094032526016235,
'start': 176,
'end': 215,
'answer': ' to protect against invasions and raids'}
# Ask for the three best candidate spans; top_k replaces the deprecated topk
qa_model(
    question=question,
    context=context,
    top_k=3,
    max_answer_len=30,
    max_seq_len=400,
    handle_impossible_answer=False,
)
[{'score': 0.31094032526016235,
'start': 176,
'end': 215,
'answer': ' to protect against invasions and raids'},
{'score': 0.23844484984874725,
'start': 179,
'end': 215,
'answer': ' protect against invasions and raids'},
{'score': 0.1176154762506485,
'start': 176,
'end': 243,
'answer': ' to protect against invasions and raids from various nomadic groups'}]
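The `start` and `end` values in the output above are character offsets into `context`, which is why each answer keeps the leading space of its span. As a minimal sketch that needs no model download, the offsets copied from the output can be sliced back out of the context and trimmed. The helper name `span_text` is introduced here for illustration only.

```python
# Context string copied from the question-answering example above.
context = (
    "The Great Wall of China is one of the world's most famous landmarks. "
    "It was built over several centuries and is thousands of kilometers long. "
    "The wall was primarily constructed to protect against invasions and raids "
    "from various nomadic groups from the Eurasian Steppe."
)

# Candidate spans copied from the top-3 output above.
candidates = [
    {"score": 0.31094032526016235, "start": 176, "end": 215},
    {"score": 0.23844484984874725, "start": 179, "end": 215},
    {"score": 0.1176154762506485, "start": 176, "end": 243},
]


def span_text(context, candidate):
    """Slice the context with the character offsets and trim whitespace."""
    return context[candidate["start"]:candidate["end"]].strip()


answers = [span_text(context, c) for c in candidates]
for candidate, answer in zip(candidates, answers):
    print(f"{candidate['score']:.3f}  {answer}")
```

Trimming the slice is usually enough for display. Keep the raw offsets if the answer has to be highlighted in the original text.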
Tabular Q&A
Pandas DataFrame
import pandas as pd
# Sample data for a football player statistics dataset
data = {
    "Player Name": ["Lionel Messi", "Cristiano Ronaldo", "Neymar Jr", "Kevin De Bruyne", "Robert Lewandowski"],
    "Team": ["Paris Saint-Germain", "Al Nassr", "Paris Saint-Germain", "Manchester City", "Barcelona"],
    "Nationality": ["Argentina", "Portugal", "Brazil", "Belgium", "Poland"],
    "Goals": [25, 30, 18, 12, 34],
    "Assists": [18, 15, 20, 25, 10],
    "Passes Completed": [2050, 1800, 1900, 2300, 1500],
    "Matches Played": [30, 33, 29, 32, 31],
    "Yellow Cards": [2, 3, 4, 1, 5],
    "Red Cards": [0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
df
          Player Name                 Team  Nationality  Goals  Assists  Passes Completed  Matches Played  Yellow Cards  Red Cards
0        Lionel Messi  Paris Saint-Germain    Argentina     25       18              2050              30             2          0
1   Cristiano Ronaldo             Al Nassr     Portugal     30       15              1800              33             3          1
2           Neymar Jr  Paris Saint-Germain       Brazil     18       20              1900              29             4          0
3     Kevin De Bruyne      Manchester City      Belgium     12       25              2300              32             1          0
4  Robert Lewandowski            Barcelona       Poland     34       10              1500              31             5          1
DataFrame to String
df = df.astype(str)
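The cast above matters because the TAPAS tokenizer works on text, so numeric columns need to become strings before the table reaches the pipeline. The following is a small sketch of what the cast does, using a hypothetical two-row frame (`stats`) rather than the full table. It also shows that `astype` returns a new frame, so a numeric copy can be kept for checking answers later.

```python
import pandas as pd

# Hypothetical two-row subset of the player table, for illustration only.
stats = pd.DataFrame(
    {
        "Player Name": ["Lionel Messi", "Kevin De Bruyne"],
        "Goals": [25, 12],
    }
)

# astype(str) returns a new frame; the numeric original is left untouched.
stats_text = stats.astype(str)

print(stats.dtypes)        # "Goals" is still an integer column here
print(stats_text.dtypes)   # every column now holds text
print(repr(stats_text.loc[0, "Goals"]))
```

Keeping the numeric frame under a separate name makes it easy to compute ground-truth values with pandas and compare them with the model's answers.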
Tokenizer & Models
from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering, pipeline
model = AutoModelForTableQuestionAnswering.from_pretrained(
    "google/tapas-large-finetuned-wtq"
)
tokenizer = AutoTokenizer.from_pretrained("google/tapas-large-finetuned-wtq")
Pipeline
nlp = pipeline('table-question-answering', model=model, tokenizer=tokenizer)
Inference
question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?"
]
# Full result for the first question: answer, cell coordinates, cells, aggregator
result = nlp({'table': df, 'query': question_list[0]})
print(result)
# For every question, print only the first cell the model selected
for question in question_list:
    result = nlp({'table': df, 'query': question})
    print(result['cells'][0].strip())
{'answer': 'Robert Lewandowski', 'coordinates': [(4, 0)], 'cells': ['Robert Lewandowski'], 'aggregator': 'NONE'}
Robert Lewandowski
25
Kevin De Bruyne
0
Kevin De Bruyne
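The loop above prints only the first selected cell, which hides part of the answer whenever the model selects several cells together with an aggregator. The sketch below combines `cells` and `aggregator` into one value. It assumes the aggregator labels `NONE`, `SUM`, `AVERAGE` and `COUNT` used by the WTQ fine-tuned TAPAS checkpoints. The helper `resolve_answer` is hypothetical, and the second sample result is made up for illustration, not real model output.

```python
def resolve_answer(result):
    """Combine the selected cells according to the predicted aggregator."""
    cells = [cell.strip() for cell in result["cells"]]
    aggregator = result.get("aggregator", "NONE")

    if aggregator == "NONE":
        # Plain cell selection: report every selected cell, not just the first.
        return ", ".join(cells)
    if aggregator == "COUNT":
        return str(len(cells))
    if not cells:
        return ""

    # SUM and AVERAGE need numbers; drop thousands separators before parsing.
    values = [float(cell.replace(",", "")) for cell in cells]
    if aggregator == "SUM":
        total = sum(values)
    elif aggregator == "AVERAGE":
        total = sum(values) / len(values)
    else:
        raise ValueError(f"Unknown aggregator: {aggregator}")
    # Print whole numbers without a trailing ".0".
    return str(int(total)) if total.is_integer() else str(total)


# Result copied from the first question's output above.
selection = {
    "answer": "Robert Lewandowski",
    "coordinates": [(4, 0)],
    "cells": ["Robert Lewandowski"],
    "aggregator": "NONE",
}
# Hypothetical result shape for a SUM question (illustration only).
summed = {
    "answer": "SUM > 25, 18",
    "coordinates": [(0, 3), (2, 3)],
    "cells": ["25", "18"],
    "aggregator": "SUM",
}

print(resolve_answer(selection))
print(resolve_answer(summed))
```

In the loop, `print(resolve_answer(result))` could stand in for `print(result['cells'][0].strip())` when aggregated answers matter.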
TAPEX Model
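TAPEX, developed by Microsoft, is a model for table question answering in NLP.
Strong performance: TAPEX has shown strong results on a range of table question answering benchmarks, often outperforming other models.
Versatility: It handles both natural-language questions and fact verification tasks, so it applies to a wide range of use cases.
Accessibility: TAPEX is available through the Hugging Face Transformers library, which makes it accessible to developers and researchers.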
# Cast every cell to a string
df = df.astype(str)
# Load the TAPEX tokenizer and its BART-based model
from transformers import TapexTokenizer, BartForConditionalGeneration
tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-finetuned-wtq")
# Questions
question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?"
]
# Encode the table with the first question, generate, and decode the answer
encoding = tokenizer(table=df, query=question_list[0], return_tensors="pt")
outputs = model.generate(**encoding)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Result for the first question
print(result)
# Repeat for every question and print the decoded answer
for question in question_list:
    encoding = tokenizer(table=df, query=question, return_tensors="pt")
    outputs = model.generate(**encoding)
    result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(result[0].strip())
[' robert lewandowski']
robert lewandowski
25
kevin de bruyne
0
lionel messi, 2050
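The TAPEX answers above come back lowercased, most likely because the tokenizer lowercases its input by default. When the original spelling is needed, one option is to map each generated piece back to a matching table value. The sketch below assumes multi-part answers are comma-separated, as in the last line of the output. The helper `restore_case` is hypothetical, and the small `data` dict is a subset of the table used for illustration.

```python
def restore_case(answer, data):
    """Map each comma-separated piece of a lowercased answer back to the
    original spelling found in the table, when an exact match exists."""
    lookup = {}
    for column in data.values():
        for value in column:
            # Keep the first spelling seen for each lowercased value.
            lookup.setdefault(str(value).lower(), str(value))

    pieces = [piece.strip() for piece in answer.split(",")]
    # Pieces without a match in the table are returned unchanged.
    return ", ".join(lookup.get(piece.lower(), piece) for piece in pieces)


# Subset of the player table from the example above.
data = {
    "Player Name": ["Lionel Messi", "Kevin De Bruyne", "Robert Lewandowski"],
    "Passes Completed": [2050, 2300, 1500],
}

print(restore_case(" robert lewandowski", data))
print(restore_case("lionel messi, 2050", data))
```

Cell values that themselves contain commas would be split incorrectly by this approach, so it fits simple tables like this one best.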