Summarization

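Summarization creates a shorter version of a document while preserving its important information. Some models extract text directly from the original input (extractive summarization), while others generate entirely new text (abstractive summarization).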
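
To make the extractive case concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and returns the top-scoring sentences unchanged. The function name `extractive_summary`, the scoring rule, and the sample text are assumptions made for illustration; this is not how any particular library or model works.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences from `text`, in original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole document (lowercased word tokens).
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical sample document, invented for this sketch.
doc = (
    "The model reads text. "
    "The model reads text and the model writes text. "
    "Cats sleep all day."
)
print(extractive_summary(doc, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive method. An abstractive model such as the one used in this section instead generates new wording token by token.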

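max_length: The maximum length of the generated summary, in tokens.
min_length: The minimum length of the generated summary, in tokens.
do_sample: Controls how the summary is decoded. When do_sample is False, the model picks the highest-probability continuation at each step (greedy decoding, or beam search if num_beams is greater than 1), which generally gives more deterministic and less varied results. When do_sample is True, the next token is sampled from the model's probability distribution, so the output can differ between runs.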
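
The difference that do_sample makes can be sketched with a toy next-token distribution. The probabilities and the helper names `pick_greedy` and `pick_sampled` below are hypothetical and exist only for this illustration. The sketch covers the plain greedy case, not beam search, and does not call the transformers library.

```python
import random

# Hypothetical next-token distribution, invented for illustration only.
next_token_probs = {"model": 0.5, "system": 0.3, "network": 0.2}


def pick_greedy(probs: dict[str, float]) -> str:
    """Deterministic choice: always return the most probable token."""
    return max(probs, key=probs.get)


def pick_sampled(probs: dict[str, float], rng: random.Random) -> str:
    """Stochastic choice: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sampled draws are repeatable
greedy_picks = [pick_greedy(next_token_probs) for _ in range(5)]
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(5)]

print(greedy_picks)   # the same token every time
print(sampled_picks)  # can mix tokens; the exact draws depend on the seed
```

Greedy selection returns the same token on every call, which is why do_sample=False gives repeatable summaries. Sampling can pick lower-probability tokens, which adds variety but can also lead the output astray.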

from transformers import pipeline

summarizer = pipeline(
    "summarization", 
    model="facebook/bart-large-cnn"
)

ARTICLE = """
GPT-4o는 기존 'GPT-4' 'GPT-4V' 'GPT-4 터보' 등 기존 모델보다 더 빠르고 저렴하며 오디오와 비전 같은 입력으로부터 더 많은 정보를 유지하는 점에서 개선됐다는 설명이다.
기술적으로는 기존에 대형언어모델(LMM)을 구동하기 위해 텍스트와 이미지, 음성 부분을 따로 담당하는 것을 넘어, 모델 3개를 하나로 통합했다.
기존 모델들은 여러 다른 모델들을 연결하고 오디오 및 비주얼과 같은 다른 매체를 텍스트로 변환한 후 다시 변환하는 방식을 사용했지만, 새로운 GPT-4o는 단일 모델에서 처음부터 멀티미디어 토큰으로 훈련, 텍스트로 변환하지 않고도 비전과 오디오를 직접 분석하고 해석할 수 있다.
"""

print(
    summarizer(
        ARTICLE, 
        max_length=130, 
        min_length=30, 
        do_sample=False
    )
)

[{'summary_text': 'GPT-4o is a multi-role, long-range, high-altitude communications system. It is designed to be used by the U.S. Air Force, the United States Air Force and the Australian Air Force. The system has a range of more than 100,000 miles.'}]

print(
    summarizer(
        ARTICLE, 
        max_length=130, 
        min_length=30, 
        do_sample=True
    )
)

[{'summary_text': "GPT-4o    GPT-V   'GPT - 4V'  \xa0 'G PT-4 터 Korean' \xa0  'GPT-4'  'PGP-4V' 'GPG-4 Korean' 'PGT-5 Korean'"}]

Note that neither summary reflects the article. facebook/bart-large-cnn was fine-tuned on English news articles (CNN/DailyMail), so Korean input produces unreliable output whether or not sampling is enabled.
