NER: Token Classification

Named Entity Recognition (NER) is the task of identifying named entities in text and classifying them into predefined categories, such as person, organization, or location.

Token Classification

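Token classification is a fundamental natural language processing (NLP) task that assigns a label or category to each individual token in a text, where a token is a word or a subword.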

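Unlike text classification, which assigns a single category to an entire body of text, token classification works at a much finer granularity: it labels the smallest units of the text one by one.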
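To make the contrast concrete, the sketch below puts one whole-text label next to a per-token labeling of the same sentence. The labels are written by hand for illustration and are not model output; `text_label` and the whitespace split are simplifying assumptions (a real model works on subword tokens). The tags follow the BIO convention: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity.

```python
text = "My name is Wolfgang and I live in Berlin"

# Text classification: a single label for the whole text (hypothetical label).
text_label = "neutral"

# Token classification: one label per token (hand-written BIO tags).
tokens = text.split()
token_labels = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]
assert len(tokens) == len(token_labels)

for token, label in zip(tokens, token_labels):
    print(f"{token:10s} {label}")

# Keep only the tokens that belong to an entity.
entities = [(t, l) for t, l in zip(tokens, token_labels) if l != "O"]
print(entities)  # [('Wolfgang', 'B-PER'), ('Berlin', 'B-LOC')]
```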

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and the fine-tuned BERT NER checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Wrap the model and tokenizer in a token-classification ("ner") pipeline.
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
)


example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
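Each result is a dictionary: `entity` is the predicted tag, `score` the model's confidence, `index` the token's position in the tokenized sequence, and `start`/`end` are character offsets into the input string. As a small sketch, the snippet below reuses the printed output as literal data (no model is loaded) and slices the input with those offsets to recover each mention; the variable `mentions` is introduced here only for illustration.

```python
example = "My name is Wolfgang and I live in Berlin"

# Copied from the printed pipeline output; no model is loaded here.
ner_results = [
    {"entity": "B-PER", "score": 0.9990139, "index": 4, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity": "B-LOC", "score": 0.999645, "index": 9, "word": "Berlin", "start": 34, "end": 40},
]

# Slicing the input with the character offsets recovers the surface text.
mentions = [(r["entity"], example[r["start"]:r["end"]]) for r in ner_results]
print(mentions)  # [('B-PER', 'Wolfgang'), ('B-LOC', 'Berlin')]
```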

example = "Steve Jobs, co-founder of Apple Inc., delivered an inspiring speech in New York, discussing various innovative technologies, which was attended by many from different sectors."

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9997489, 'index': 1, 'word': 'Steve', 'start': 0, 'end': 5}, {'entity': 'I-PER', 'score': 0.99963474, 'index': 2, 'word': 'Job', 'start': 6, 'end': 9}, {'entity': 'I-PER', 'score': 0.9991823, 'index': 3, 'word': '##s', 'start': 9, 'end': 10}, {'entity': 'B-ORG', 'score': 0.99956256, 'index': 9, 'word': 'Apple', 'start': 26, 'end': 31}, {'entity': 'I-ORG', 'score': 0.999373, 'index': 10, 'word': 'Inc', 'start': 32, 'end': 35}, {'entity': 'B-LOC', 'score': 0.999461, 'index': 18, 'word': 'New', 'start': 71, 'end': 74}, {'entity': 'I-LOC', 'score': 0.9994356, 'index': 19, 'word': 'York', 'start': 75, 'end': 79}]
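Without aggregation, the pipeline reports every token separately, so "Jobs" appears as the two word pieces `Job` and `##s`, and multi-word names are spread over `B-` and `I-` tags. The pipeline can merge these itself through its `aggregation_strategy` argument; the sketch below does the merge by hand on the printed output to show the idea. The helper `group_entities` is hypothetical, it assumes BIO-style tags, and it joins any `I-` token to the preceding span with the same label without checking adjacency.

```python
text = (
    "Steve Jobs, co-founder of Apple Inc., delivered an inspiring speech in New York, "
    "discussing various innovative technologies, which was attended by many from different sectors."
)

# Tags and character offsets copied from the printed output (other fields omitted).
token_results = [
    {"entity": "B-PER", "start": 0, "end": 5},
    {"entity": "I-PER", "start": 6, "end": 9},
    {"entity": "I-PER", "start": 9, "end": 10},
    {"entity": "B-ORG", "start": 26, "end": 31},
    {"entity": "I-ORG", "start": 32, "end": 35},
    {"entity": "B-LOC", "start": 71, "end": 74},
    {"entity": "I-LOC", "start": 75, "end": 79},
]


def group_entities(text, tokens):
    """Merge token-level B-/I- predictions into entity spans (hypothetical helper)."""
    spans = []
    for tok in tokens:
        if tok["entity"] == "O":
            continue  # skip tokens outside any entity
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # An I- tag with the same label extends the current span.
            spans[-1]["end"] = tok["end"]
        else:
            # A B- tag, or a label change, starts a new span.
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


grouped = group_entities(text, token_results)
print(grouped)  # [('PER', 'Steve Jobs'), ('ORG', 'Apple Inc'), ('LOC', 'New York')]
```

Because the spans are rebuilt from character offsets rather than from the `word` field, the `##` word-piece markers never need to be stripped.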

Korean

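from transformers import pipeline

# aggregation_strategy="simple" merges adjacent tokens that share a label into one entity group.
ner = pipeline(
    task="ner",
    model="KPF/KPF-bert-ner",
    tokenizer="KPF/KPF-bert-ner",
    aggregation_strategy="simple",
)

# Korean sample text about the kpf-BERT tokenizer.
# Adjacent string literals are concatenated into a single input string.
result = ner(
    "BERT 모델의 학습을 위해서는 문장에서 토큰을 추출하는 과정이 필요하다."
    "이는 kpf-BERT에서 제공하는 토크나이저를 사용한다."
    "kpf-BERT 토크나이저는 문장을 토큰화해서 전체 문장벡터를 만든다."
    "이후 문장의 시작과 끝 그 외 몇가지 특수 토큰을 추가한다."
    "이 과정에서 문장별로 구별하는 세그먼트 토큰, 각 토큰의 위치를 표시하는 포지션 토큰 등을 생성한다."
)

result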

[{'entity_group': 'LABEL_299',
  'score': 0.99999666,
  'word': 'BERT 모델의 학습을 위해서는 문장에서 토큰을 추출하는 과정이 필요하다. 이는 kpf - BERT에서 제공하는 토크나이저를 사용한다. kpf - BERT 토크나이저는 문장을 토큰화해서 전체 문장벡터를 만든다. 이후 문장의 시작과 끝 그 외 몇가지 특수 토큰을 추가한다. 이 과정에서 문장별로 구별하는 세그먼트 토큰, 각 토큰의 위치를 표시하는 포지션 토큰 등을 생성한다.',
  'start': 0,
  'end': 200}]
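The result is a single group labeled `LABEL_299` that covers the entire input. Names of the form `LABEL_n` are the generic placeholders that transformers uses when a checkpoint's configuration carries no human-readable label names, so the class index has to be translated with the label list published alongside the model. One group spanning all 200 characters also means every token received the same class, which is probably the non-entity tag. The sketch below shows one way to post-process such output. The mapping `HYPOTHETICAL_ID2LABEL` (in particular, that index 299 means `O`) and both helper functions are assumptions made for illustration and are not taken from the model's documentation.

```python
# Hypothetical mapping from class index to tag name. The real list must come
# from the label file published with the model; 299 -> "O" is an assumption.
HYPOTHETICAL_ID2LABEL = {299: "O"}


def rename_generic_labels(results, id2label):
    """Replace LABEL_n placeholders with names from id2label where available."""
    renamed = []
    for item in results:
        group = item["entity_group"]
        if group.startswith("LABEL_"):
            index = int(group.split("_", 1)[1])
            group = id2label.get(index, group)  # keep the placeholder if unknown
        renamed.append({**item, "entity_group": group})
    return renamed


def drop_non_entities(results, outside_tag="O"):
    """Remove groups carrying the non-entity tag."""
    return [r for r in results if r["entity_group"] != outside_tag]


# Shape of the output printed above, with the long `word` field shortened.
sample = [{"entity_group": "LABEL_299", "score": 0.99999666, "word": "...", "start": 0, "end": 200}]

renamed = rename_generic_labels(sample, HYPOTHETICAL_ID2LABEL)
entities_only = drop_non_entities(renamed)
print(renamed[0]["entity_group"], len(entities_only))
```

Under the assumed mapping, nothing remains after filtering, which would mean the model found no named entities in this passage.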

