Topic Modeling: BERTopic
Topic modeling is a subfield of NLP focused on discovering abstract topics within text. Its main goal is to uncover the hidden thematic structure of a large text corpus, making big collections of unstructured text easier to understand and organize.
By identifying the topics present in documents, topic modeling can be used to classify or cluster similar documents.
Search engines can leverage it by indexing documents based on their topic distributions.
It can also be used to recommend similar articles or papers based on their topics.
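The recommendation idea can be sketched with plain cosine similarity over per-document topic distributions. The three-topic vectors below are hypothetical, not the output of any real model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-document topic distributions over 3 topics.
doc_a = [0.7, 0.2, 0.1]  # mostly topic 0
doc_b = [0.6, 0.3, 0.1]  # a similar mix -> a good recommendation candidate
doc_c = [0.1, 0.1, 0.8]  # mostly topic 2

print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True: b is closer to a than c is
```

Documents whose topic distributions are most similar to the one a user is reading would be ranked first.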
BERTopic
BERTopic is a topic modeling technique that leverages BERT embeddings and class-based TF-IDF (c-TF-IDF) to create dense clusters, producing easily interpretable topics while keeping the important words in each topic description.
It differs from traditional topic modeling approaches by using state-of-the-art language models and a novel algorithmic pipeline.
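As a rough illustration of the class-based TF-IDF idea — a sketch of the weighting scheme, not BERTopic's actual implementation — each cluster's documents are merged into one pseudo-document, and each word is weighted by its in-class frequency against its frequency across all classes:

```python
import math
from collections import Counter

def class_tfidf(class_docs):
    """Toy c-TF-IDF: weight word x in class c by tf(x, c) * log(1 + A / f(x)),
    where A is the average word count per class and f(x) is x's total
    frequency across all classes."""
    class_counts = [Counter(" ".join(docs).split()) for docs in class_docs]
    avg_words = sum(sum(c.values()) for c in class_counts) / len(class_counts)
    total = Counter()
    for c in class_counts:
        total.update(c)
    scores = []
    for c in class_counts:
        n = sum(c.values())
        scores.append({w: (tf / n) * math.log(1 + avg_words / total[w])
                       for w, tf in c.items()})
    return scores

# Two toy "clusters" of documents.
scores = class_tfidf([
    ["happy fathers day", "love my dad"],
    ["olympics tokyo games", "tokyo olympics opening"],
])
top = max(scores[1], key=scores[1].get)
print(top)  # a frequent, cluster-specific word such as "olympics"
```

Words that are frequent inside one cluster but rare elsewhere get the highest scores, which is what makes the resulting topic descriptions interpretable.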
%pip install bertopic
Dataset
!mkdir dataset
!wget https://github.com/sharmaroshan/Twitter-Sentiment-Analysis/raw/master/train_tweet.csv -O ./dataset/tokyo_2020_tweets.csv
--2024-05-19 08:57:58-- https://github.com/sharmaroshan/Twitter-Sentiment-Analysis/raw/master/train_tweet.csv
Resolving github.com (github.com)... 20.200.245.247
Connecting to github.com (github.com)|20.200.245.247|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sharmaroshan/Twitter-Sentiment-Analysis/master/train_tweet.csv [following]
--2024-05-19 08:57:59-- https://raw.githubusercontent.com/sharmaroshan/Twitter-Sentiment-Analysis/master/train_tweet.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3103165 (3.0M) [text/plain]
Saving to: ‘./dataset/tokyo_2020_tweets.csv’
./dataset/tokyo_202 100%[===================>] 2.96M --.-KB/s in 0.03s
2024-05-19 08:58:00 (103 MB/s) - ‘./dataset/tokyo_2020_tweets.csv’ saved [3103165/3103165]
import pandas as pd
df = pd.read_csv("dataset/tokyo_2020_tweets.csv", engine='python')
df.head()
@user when a father is dysfunctional and is s...
@user @user thanks for #lyft credit i can't us...
#model i love u take with u all the time in ...
factsguide: society now #motivation
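The raw tweets contain `@user` placeholders, URLs, and mojibake emoji bytes. Before fitting a model they can optionally be cleaned; the following is a minimal sketch, not part of the original notebook:

```python
import re

def clean_tweet(text):
    """Light cleanup before topic modeling: drop @user mentions and URLs,
    strip non-ASCII mojibake bytes, and collapse whitespace."""
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"https?://\S+", "", text)
    text = text.encode("ascii", "ignore").decode()  # removes the garbled emoji bytes
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet(" @user when a father is dysfunctional... #run"))
# -> when a father is dysfunctional... #run
```

Hashtags are kept here on purpose, since they often carry the most topical signal in tweets.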
docs = df[0:10000].tweet.to_list()
docs
[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run',
"@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked",
' bihday your majesty',
'#model i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦ ',
' factsguide: society now #motivation',
'[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo ', ...]
from bertopic import BERTopic
model = BERTopic(verbose=True)
model.fit(docs)                                # learn the topic model
topics, probabilities = model.transform(docs)  # assign a topic to each document
# model.fit_transform(docs) combines both steps in a single call
2024-05-19 09:00:04,747 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 313/313 [00:03<00:00, 82.45it/s]
2024-05-19 09:00:10,998 - BERTopic - Embedding - Completed ✓
2024-05-19 09:00:11,000 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-19 09:00:42,984 - BERTopic - Dimensionality - Completed ✓
2024-05-19 09:00:42,987 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-19 09:00:43,454 - BERTopic - Cluster - Completed ✓
2024-05-19 09:00:43,462 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-19 09:00:43,724 - BERTopic - Representation - Completed ✓
Batches: 100%|██████████| 313/313 [00:02<00:00, 108.64it/s]
2024-05-19 09:00:47,024 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-05-19 09:00:47,055 - BERTopic - Dimensionality - Completed ✓
2024-05-19 09:00:47,056 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-05-19 09:00:47,437 - BERTopic - Cluster - Completed ✓
Topic Words
model.get_topic(0)  # top words of the largest topic with their c-TF-IDF scores
[('fathers', 0.09351424852788441),
('fathersday', 0.07314833819088623),
('dad', 0.058599370842283095),
('day', 0.04736091349511204),
('father', 0.03951218907262783),
('dads', 0.03821332920448442),
('daddy', 0.02461901170173862),
('all', 0.023692468898816973),
('happy', 0.015781514379331344),
('man', 0.01359205432211061)]
Visualization
model.visualize_topics()
model.visualize_barchart()
model.visualize_heatmap()
Reducing the Number of Topics
model = BERTopic(nr_topics=5)  # merge topics down to 5 after clustering
model.fit(docs)
topics, probs = model.transform(docs)
model.get_topic_freq().head(10)
Model Save & Load
model.save("new_model")
model = BERTopic.load("new_model")
Using Hugging Face
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst1")
dataset
Downloading readme: 100%|██████████| 10.2k/10.2k [00:00<00:00, 8.87MB/s]
Downloading data: 100%|██████████| 39.5M/39.5M [00:04<00:00, 8.15MB/s]
Downloading data: 100%|██████████| 2.08M/2.08M [00:01<00:00, 1.98MB/s]
Generating train split: 100%|██████████| 84437/84437 [00:00<00:00, 117346.11 examples/s]
Generating validation split: 100%|██████████| 4401/4401 [00:00<00:00, 115404.20 examples/s]
DatasetDict({
train: Dataset({
features: ['message_id', 'parent_id', 'user_id', 'created_date', 'text', 'role', 'lang', 'review_count', 'review_result', 'deleted', 'rank', 'synthetic', 'model_name', 'detoxify', 'message_tree_id', 'tree_state', 'emojis', 'labels'],
num_rows: 84437
})
validation: Dataset({
features: ['message_id', 'parent_id', 'user_id', 'created_date', 'text', 'role', 'lang', 'review_count', 'review_result', 'deleted', 'rank', 'synthetic', 'model_name', 'detoxify', 'message_tree_id', 'tree_state', 'emojis', 'labels'],
num_rows: 4401
})
})
topic_model = BERTopic.load("davanstrien/chat_topics")
train_texts = [item['text'] for item in dataset['train']][:10]
topics.json: 100%|██████████| 263k/263k [00:00<00:00, 120MB/s]
config.json: 100%|██████████| 271/271 [00:00<00:00, 726kB/s]
topic_embeddings.safetensors: 100%|██████████| 230k/230k [00:00<00:00, 410kB/s]
modules.json: 100%|██████████| 349/349 [00:00<00:00, 924kB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 340kB/s]
README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 14.4MB/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 145kB/s]
config.json: 100%|██████████| 571/571 [00:00<00:00, 1.69MB/s]
model.safetensors: 100%|██████████| 438M/438M [00:03<00:00, 115MB/s]
tokenizer_config.json: 100%|██████████| 363/363 [00:00<00:00, 699kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 621kB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 832kB/s]
special_tokens_map.json: 100%|██████████| 239/239 [00:00<00:00, 696kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 393kB/s]
train_topics, _ = topic_model.transform(train_texts)
train_topics
Batches: 100%|██████████| 1/1 [00:00<00:00, 9.16it/s]
2024-05-19 09:07:30,576 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
array([10, 10, 19, 10, 10, 10, 10, 10, 10, 52])
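The returned array assigns one topic id per document. A quick pure-Python way to summarize the assignments, using the ids shown above:

```python
from collections import Counter

# Topic ids returned by transform() for the ten sample documents.
train_topics = [10, 10, 19, 10, 10, 10, 10, 10, 10, 52]

counts = Counter(train_topics)
print(counts.most_common())  # [(10, 8), (19, 1), (52, 1)]
```

Most of the sampled prompts land in topic 10, which `get_topic_info` below identifies as an economics-related topic.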
topic_model.get_topic_info(10)
Topic: 10
Count: ...
Name: 10_communism_capitalism_marx_economic
CustomName: 10_communism_capitalism_marx_economic
Representation: [communism, capitalism, marx, economic, econom...
Representative_Docs: ...
'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.'