Topic Modeling: BERTopic
Topic modeling is a subfield of NLP focused on discovering abstract topics within text. Its main goal is to uncover the hidden thematic structure of a large text corpus, making big collections of unstructured text easier to understand and organize.
By identifying the topics present in documents, topic modeling can be used to classify or cluster similar documents.
Search engines can leverage it by indexing documents based on their topic distributions.
It can also be used to recommend similar articles or papers based on their topics.
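The recommendation idea can be sketched with plain cosine similarity over per-document topic distributions. The three-topic vectors below are hypothetical, not the output of any real model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-document topic distributions over 3 topics.
doc_a = [0.7, 0.2, 0.1]  # mostly topic 0
doc_b = [0.6, 0.3, 0.1]  # a similar mix -> a good recommendation candidate
doc_c = [0.1, 0.1, 0.8]  # mostly topic 2

print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True: b is closer to a than c is
```

Documents whose topic distributions are most similar to the one a user is reading would be ranked first.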
BERTopic
BERTopic is a topic modeling technique that leverages BERT embeddings and class-based TF-IDF (c-TF-IDF) to create dense clusters, producing easily interpretable topics while keeping the important words in each topic description.
It differs from traditional topic modeling approaches by using state-of-the-art language models and a novel algorithmic pipeline.
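As a rough illustration of the class-based TF-IDF idea — a sketch of the weighting scheme, not BERTopic's actual implementation — each cluster's documents are merged into one pseudo-document, and each word is weighted by its in-class frequency against its frequency across all classes:

```python
import math
from collections import Counter

def class_tfidf(class_docs):
    """Toy c-TF-IDF: weight word x in class c by tf(x, c) * log(1 + A / f(x)),
    where A is the average word count per class and f(x) is x's total
    frequency across all classes."""
    class_counts = [Counter(" ".join(docs).split()) for docs in class_docs]
    avg_words = sum(sum(c.values()) for c in class_counts) / len(class_counts)
    total = Counter()
    for c in class_counts:
        total.update(c)
    scores = []
    for c in class_counts:
        n = sum(c.values())
        scores.append({w: (tf / n) * math.log(1 + avg_words / total[w])
                       for w, tf in c.items()})
    return scores

# Two toy "clusters" of documents.
scores = class_tfidf([
    ["happy fathers day", "love my dad"],
    ["olympics tokyo games", "tokyo olympics opening"],
])
top = max(scores[1], key=scores[1].get)
print(top)  # a frequent, cluster-specific word such as "olympics"
```

Words that are frequent inside one cluster but rare elsewhere get the highest scores, which is what makes the resulting topic descriptions interpretable.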
%pip install bertopic
Dataset
!mkdir dataset
!wget https://github.com/sharmaroshan/Twitter-Sentiment-Analysis/raw/master/train_tweet.csv -O ./dataset/tokyo_2020_tweets.csv
--2024-05-19 08:57:58-- https://github.com/sharmaroshan/Twitter-Sentiment-Analysis/raw/master/train_tweet.csv
Resolving github.com (github.com)... 20.200.245.247
Connecting to github.com (github.com)|20.200.245.247|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sharmaroshan/Twitter-Sentiment-Analysis/master/train_tweet.csv [following]
--2024-05-19 08:57:59-- https://raw.githubusercontent.com/sharmaroshan/Twitter-Sentiment-Analysis/master/train_tweet.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3103165 (3.0M) [text/plain]
Saving to: ‘./dataset/tokyo_2020_tweets.csv’
./dataset/tokyo_202 100%[===================>] 2.96M --.-KB/s in 0.03s
2024-05-19 08:58:00 (103 MB/s) - ‘./dataset/tokyo_2020_tweets.csv’ saved [3103165/3103165]
import pandas as pd
df = pd.read_csv("dataset/tokyo_2020_tweets.csv", engine='python')
df.head()
@user when a father is dysfunctional and is s...
@user @user thanks for #lyft credit i can't us...
#model i love u take with u all the time in ...
factsguide: society now #motivation
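The raw tweets contain `@user` placeholders, URLs, and mojibake emoji bytes. Before fitting a model they can optionally be cleaned; the following is a minimal sketch, not part of the original notebook:

```python
import re

def clean_tweet(text):
    """Light cleanup before topic modeling: drop @user mentions and URLs,
    strip non-ASCII mojibake bytes, and collapse whitespace."""
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"https?://\S+", "", text)
    text = text.encode("ascii", "ignore").decode()  # removes the garbled emoji bytes
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet(" @user when a father is dysfunctional... #run"))
# -> when a father is dysfunctional... #run
```

Hashtags are kept here on purpose, since they often carry the most topical signal in tweets.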
docs = df[0:10000].tweet.to_list()
docs
[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run',
"@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked",
' bihday your majesty',
'#model i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦ ',
' factsguide: society now #motivation',
'[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo ', ...]
from bertopic import BERTopic
model = BERTopic(verbose=True)
model.fit(docs)                                # learn the topic model
topics, probabilities = model.transform(docs)  # assign a topic to each document
# model.fit_transform(docs) combines both steps in a single call
2024-05-19 09:00:04,747 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 313/313 [00:03<00:00, 82.45it/s]
2024-05-19 09:00:10,998 - BERTopic - Embedding - Completed ✓
2024-05-19 09:00:11,000 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-19 09:00:42,984 - BERTopic - Dimensionality - Completed ✓
2024-05-19 09:00:42,987 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-19 09:00:43,454 - BERTopic - Cluster - Completed ✓
2024-05-19 09:00:43,462 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-19 09:00:43,724 - BERTopic - Representation - Completed ✓
Batches: 100%|██████████| 313/313 [00:02<00:00, 108.64it/s]
2024-05-19 09:00:47,024 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-05-19 09:00:47,055 - BERTopic - Dimensionality - Completed ✓
2024-05-19 09:00:47,056 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-05-19 09:00:47,437 - BERTopic - Cluster - Completed ✓
Topic Words
model.get_topic(0)  # top words of the largest topic with their c-TF-IDF scores
[('fathers', 0.09351424852788441),
('fathersday', 0.07314833819088623),
('dad', 0.058599370842283095),
('day', 0.04736091349511204),
('father', 0.03951218907262783),
('dads', 0.03821332920448442),
('daddy', 0.02461901170173862),
('all', 0.023692468898816973),
('happy', 0.015781514379331344),
('man', 0.01359205432211061)]
Visualization
model.visualize_topics()
model.visualize_barchart()
model.visualize_heatmap()
Reducing the Number of Topics
model = BERTopic(nr_topics=5)  # merge topics down to 5 after clustering
model.fit(docs)
topics, probs = model.transform(docs)
model.get_topic_freq().head(10)
Model Save & Load
model.save("new_model")
model = BERTopic.load("new_model")
Using Hugging Face
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst1")
dataset
Downloading readme: 100%|██████████| 10.2k/10.2k [00:00<00:00, 8.87MB/s]
Downloading data: 100%|██████████| 39.5M/39.5M [00:04<00:00, 8.15MB/s]
Downloading data: 100%|██████████| 2.08M/2.08M [00:01<00:00, 1.98MB/s]
Generating train split: 100%|██████████| 84437/84437 [00:00<00:00, 117346.11 examples/s]
Generating validation split: 100%|██████████| 4401/4401 [00:00<00:00, 115404.20 examples/s]
DatasetDict({
train: Dataset({
features: ['message_id', 'parent_id', 'user_id', 'created_date', 'text', 'role', 'lang', 'review_count', 'review_result', 'deleted', 'rank', 'synthetic', 'model_name', 'detoxify', 'message_tree_id', 'tree_state', 'emojis', 'labels'],
num_rows: 84437
})
validation: Dataset({
features: ['message_id', 'parent_id', 'user_id', 'created_date', 'text', 'role', 'lang', 'review_count', 'review_result', 'deleted', 'rank', 'synthetic', 'model_name', 'detoxify', 'message_tree_id', 'tree_state', 'emojis', 'labels'],
num_rows: 4401
})
})
topic_model = BERTopic.load("davanstrien/chat_topics")
train_texts = [item['text'] for item in dataset['train']][:10]
topics.json: 100%|██████████| 263k/263k [00:00<00:00, 120MB/s]
config.json: 100%|██████████| 271/271 [00:00<00:00, 726kB/s]
topic_embeddings.safetensors: 100%|██████████| 230k/230k [00:00<00:00, 410kB/s]
modules.json: 100%|██████████| 349/349 [00:00<00:00, 924kB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 340kB/s]
README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 14.4MB/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 145kB/s]
config.json: 100%|██████████| 571/571 [00:00<00:00, 1.69MB/s]
model.safetensors: 100%|██████████| 438M/438M [00:03<00:00, 115MB/s]
tokenizer_config.json: 100%|██████████| 363/363 [00:00<00:00, 699kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 621kB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 832kB/s]
special_tokens_map.json: 100%|██████████| 239/239 [00:00<00:00, 696kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 393kB/s]
train_topics, _ = topic_model.transform(train_texts)
train_topics
Batches: 100%|██████████| 1/1 [00:00<00:00, 9.16it/s]
2024-05-19 09:07:30,576 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
array([10, 10, 19, 10, 10, 10, 10, 10, 10, 52])
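The returned array assigns one topic id per document. A quick pure-Python way to summarize the assignments, using the ids shown above:

```python
from collections import Counter

# Topic ids returned by transform() for the ten sample documents.
train_topics = [10, 10, 19, 10, 10, 10, 10, 10, 10, 52]

counts = Counter(train_topics)
print(counts.most_common())  # [(10, 8), (19, 1), (52, 1)]
```

Most of the sampled prompts land in topic 10, which `get_topic_info` below identifies as an economics-related topic.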
topic_model.get_topic_info(10)
Topic: 10
Count: ...
Name: 10_communism_capitalism_marx_economic
CustomName: 10_communism_capitalism_marx_economic
Representation: [communism, capitalism, marx, economic, econom...
Representative_Docs: ...
'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.'