
NER: Token Classification

Named Entity Recognition: Token Classification

NER (Named Entity Recognition) is the task of identifying named entities in text and classifying them into predefined categories.

Token Classification

Token classification is a fundamental natural language processing (NLP) task in which a label or category is assigned to each individual token (e.g., a word or subword) in a text.

Unlike text classification, which assigns a category to an entire body of text, token classification works at a much finer granularity, focusing on the smallest units of the text.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the pretrained English NER checkpoint and its matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Build a token-classification (NER) pipeline around the model
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer
)
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
example = "Steve Jobs, co-founder of Apple Inc., delivered an inspiring speech in New York, discussing various innovative technologies, which was attended by many from different sectors."

ner_results = nlp(example)
print(ner_results)
[{'entity': 'B-PER', 'score': 0.9997489, 'index': 1, 'word': 'Steve', 'start': 0, 'end': 5}, {'entity': 'I-PER', 'score': 0.99963474, 'index': 2, 'word': 'Job', 'start': 6, 'end': 9}, {'entity': 'I-PER', 'score': 0.9991823, 'index': 3, 'word': '##s', 'start': 9, 'end': 10}, {'entity': 'B-ORG', 'score': 0.99956256, 'index': 9, 'word': 'Apple', 'start': 26, 'end': 31}, {'entity': 'I-ORG', 'score': 0.999373, 'index': 10, 'word': 'Inc', 'start': 32, 'end': 35}, {'entity': 'B-LOC', 'score': 0.999461, 'index': 18, 'word': 'New', 'start': 71, 'end': 74}, {'entity': 'I-LOC', 'score': 0.9994356, 'index': 19, 'word': 'York', 'start': 75, 'end': 79}]
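In the second example, the B-/I- prefixes follow the BIO tagging scheme (B- marks the first token of an entity, I- a continuation), and "Jobs" comes back as two subword pieces ("Job" and "##s"). To get whole entities rather than per-token tags, the pipeline can merge adjacent pieces. Below is a minimal sketch that reuses the model and tokenizer loaded above; the variable name nlp_grouped is only illustrative.

# Group subword tokens into complete entity spans (e.g. "Steve Jobs", "Apple Inc", "New York")
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # merge adjacent B-/I- pieces into one entity
)

for entity in nlp_grouped(example):
    # each item now carries an entity_group, an aggregated score, and the full surface form
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))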

Korean

from transformers import pipeline

# Korean NER pipeline built on the KPF BERT checkpoint;
# aggregation_strategy="simple" merges adjacent subword tokens into entity spans
ner = pipeline(
    task='ner',
    model="KPF/KPF-bert-ner",
    tokenizer="KPF/KPF-bert-ner",
    aggregation_strategy="simple"
)
# Adjacent string literals are concatenated into a single input passage
result = ner(
    "BERT 모델의 학습을 위해서는 문장에서 토큰을 추출하는 과정이 필요하다."
    "이는 kpf-BERT에서 제공하는 토크나이저를 사용한다."
    "kpf-BERT 토크나이저는 문장을 토큰화해서 전체 문장벡터를 만든다."
    "이후 문장의 시작과 끝 그 외 몇가지 특수 토큰을 추가한다."
    "이 과정에서 문장별로 구별하는 세그먼트 토큰, 각 토큰의 위치를 표시하는 포지션 토큰 등을 생성한다."
)

result
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[{'entity_group': 'LABEL_299',
  'score': 0.99999666,
  'word': 'BERT 모델의 학습을 위해서는 문장에서 토큰을 추출하는 과정이 필요하다. 이는 kpf - BERT에서 제공하는 토크나이저를 사용한다. kpf - BERT 토크나이저는 문장을 토큰화해서 전체 문장벡터를 만든다. 이후 문장의 시작과 끝 그 외 몇가지 특수 토큰을 추가한다. 이 과정에서 문장별로 구별하는 세그먼트 토큰, 각 토큰의 위치를 표시하는 포지션 토큰 등을 생성한다.',
  'start': 0,
  'end': 200}]
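The grouped result only reports a generic id such as LABEL_299, which usually means the checkpoint's config.json stores placeholder label names rather than readable entity types. One way to check what the checkpoint actually ships is to inspect its id2label mapping; a minimal sketch using only the standard transformers config API:

from transformers import AutoConfig

# Look up the label names stored in the checkpoint's config; if they are
# placeholders (LABEL_0, LABEL_1, ...), mapping them to real entity types
# requires the label table published by the model author.
config = AutoConfig.from_pretrained("KPF/KPF-bert-ner")
print(config.num_labels)
print(config.id2label.get(299))  # the id behind LABEL_299 in the output above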