Datasets


Huggingface: Datasets

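The datasets library from Hugging Face makes it easy to access and share datasets for audio, computer vision, and natural language processing (NLP) tasks. You can load a dataset in a single line of code and use its data processing methods to quickly prepare the dataset for training a deep learning model. Built on the Apache Arrow format, it processes large datasets with zero-copy reads and without memory constraints, for optimal speed and efficiency. It is also tightly integrated with the Hugging Face Hub, so datasets are easy to load and to share with the wider AI community.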

%pip install datasets
%pip install datasets[audio]
%pip install datasets[vision]

Load Datasets

load_dataset_builder

Load a dataset builder to inspect a dataset's attributes without downloading the dataset itself:

from datasets import load_dataset_builder

ds_builder = load_dataset_builder("rotten_tomatoes")

ds_builder

> <datasets.packaged_modules.parquet.parquet.ParquetRottenTomatoes at 0x7fce58300f90>

ds_builder.info

> DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='rotten_tomatoes', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, shard_lengths=None, dataset_name=None), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, shard_lengths=None, dataset_name=None), 'test': SplitInfo(name='test', num_bytes=135972, num_examples=1066, shard_lengths=None, dataset_name=None)}, download_checksums=None, download_size=487770, post_processing_size=None, dataset_size=1345461, size_in_bytes=None)

ds_builder.info.features

> {'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

load_dataset

Specifying a split

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

Downloading data: 100%|██████████| 699k/699k [00:02<00:00, 315kB/s]
Downloading data: 100%|██████████| 90.0k/90.0k [00:01<00:00, 62.2kB/s]
Downloading data: 100%|██████████| 92.2k/92.2k [00:01<00:00, 62.5kB/s]
Generating train split: 100%|██████████| 8530/8530 [00:00<00:00, 339196.35 examples/s]
Generating validation split: 100%|██████████| 1066/1066 [00:00<00:00, 216813.50 examples/s]
Generating test split: 100%|██████████| 1066/1066 [00:00<00:00, 233211.35 examples/s]

Split

A split is a specific subset of a dataset, such as train, validation, or test.

Use the get_dataset_split_names() function to list the split names of a dataset:

from datasets import get_dataset_split_names

get_dataset_split_names("rotten_tomatoes")

> ['train', 'validation', 'test']

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
dataset

> Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset

> DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

Configurations

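Some datasets contain several sub-datasets.

These sub-datasets are called configurations, and you must explicitly select one when you load the dataset. If you do not provide a configuration name, Datasets raises a ValueError and prompts you to choose one.

Use the get_dataset_config_names() function to retrieve a list of all the configurations available for a dataset: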

from datasets import get_dataset_config_names

configs = get_dataset_config_names("PolyAI/minds14")
print(configs)

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/datasets/load.py:1461: FutureWarning: The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 5.90k/5.90k [00:00<00:00, 5.93MB/s]
Downloading readme: 100%|██████████| 5.29k/5.29k [00:00<00:00, 5.23MB/s]

['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN', 'all']

from datasets import load_dataset

mindsFR = load_dataset(
    "PolyAI/minds14", 
    "fr-FR",  # select the configuration (sub-dataset) by name
    split="train"
)

Downloading data: 100%|██████████| 471M/471M [00:04<00:00, 106MB/s]  
Generating train split: 539 examples [00:00, 17375.93 examples/s]

Remote code

Some dataset repositories contain a loading script: Python code that is used to generate the dataset.

These datasets are generally exported to Parquet by Hugging Face, so that they can be loaded quickly without running the loading script.

To use a dataset that has a loading script, set trust_remote_code=True:

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

c4 = load_dataset(
    "c4", 
    "en", 
    split="train", 
    trust_remote_code=True
)
get_dataset_config_names(
    "c4", 
    trust_remote_code=True
)
get_dataset_split_names(
    "c4", 
    "en", 
    trust_remote_code=True
)

Dataset EDA

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

Indexing

dataset[0]

> {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

dataset[-1]

> {'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
 'label': 0}

dataset["text"][0]

> 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

dataset[0]["text"]

> 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

Slicing

dataset[:3]

> {'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}

dataset[3:6]

> {'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'],
 'label': [1, 1, 1]}

IterableDataset

Setting streaming=True in load_dataset() loads an IterableDataset:

from datasets import load_dataset

iterable_dataset = load_dataset(
    "food101", 
    split="train", 
    streaming=True
)
for example in iterable_dataset:
    print(example)
    break

Downloading readme: 100%|██████████| 10.5k/10.5k [00:00<00:00, 9.59MB/s]


{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7FCD06F9DD10>, 'label': 6}

from datasets import load_dataset

dataset = load_dataset(
    "rotten_tomatoes", 
    split="train"
)
iterable_dataset = dataset.to_iterable_dataset()

The examples of an IterableDataset cannot be accessed at random, so you have to iterate over its elements, for example by calling next(iter()) or by using a for loop, to get the next item from the IterableDataset:

next(iter(iterable_dataset))

for example in iterable_dataset:
    print(example)
    break

> {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

list(iterable_dataset.take(3))

[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 1},
 {'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'label': 1},
 {'text': 'effective but too-tepid biopic', 'label': 1}]
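
The access pattern can be mimicked in plain Python. The sketch below is an illustration under stated assumptions, not the library's implementation: `ToyStream` and its `take()` helper are hypothetical stand-ins for an IterableDataset. It shows that examples are produced one at a time, that `take(n)` yields only the first n examples, and that indexing raises an error.

```python
from itertools import islice


class ToyStream:
    """Hypothetical stand-in for a streamed dataset (not part of datasets)."""

    def __init__(self, num_examples):
        self.num_examples = num_examples

    def __iter__(self):
        # Examples are produced one at a time, only when requested.
        for i in range(self.num_examples):
            yield {"text": f"review {i}", "label": i % 2}

    def take(self, n):
        # Lazily yield only the first n examples.
        return islice(iter(self), n)


stream = ToyStream(num_examples=1_000_000)

first = next(iter(stream))          # first example of a fresh pass
first_three = list(stream.take(3))  # examples 0, 1 and 2

try:
    stream[0]                       # no __getitem__, so no random access
    supports_indexing = True
except TypeError:
    supports_indexing = False

print(first, len(first_three), supports_indexing)
```

Even though the toy stream describes a million examples, only the few that are actually requested are ever created.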

Preprocess

Tokenize a text dataset.
Resample an audio dataset.
Apply transforms to an image dataset.

Tokenize Text

It is important to use the same tokenizer that the pretrained model was trained with, so that the text is split in the same way.

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset(
    "rotten_tomatoes", 
    split="train"
)

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

tokenizer(dataset[0]["text"])

> {'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

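input_ids: the numbers that represent the tokens in the text.
token_type_ids: indicates which sequence a token belongs to when there is more than one sequence.
attention_mask: indicates whether a token should be attended to (1) or masked out, as padding is (0).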
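
To make the three fields concrete, the following plain-Python sketch builds them by hand. It is an illustration under stated assumptions, not the BERT tokenizer: `TOY_VOCAB`, `toy_encode` and `pad_batch` are hypothetical names, and words are split on whitespace rather than with WordPiece.

```python
# Toy illustration only: a hypothetical whitespace "tokenizer", not a real one.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "good": 4, "movie": 5, "bad": 6}


def toy_encode(first, second=None):
    """Encode one sentence, or a pair, into input_ids and token_type_ids."""
    def ids(text):
        return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.split()]

    input_ids = [TOY_VOCAB["[CLS]"]] + ids(first) + [TOY_VOCAB["[SEP]"]]
    token_type_ids = [0] * len(input_ids)       # first sequence is marked 0
    if second is not None:
        second_ids = ids(second) + [TOY_VOCAB["[SEP]"]]
        input_ids += second_ids
        token_type_ids += [1] * len(second_ids)  # second sequence is marked 1
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


def pad_batch(encodings):
    """Pad to the longest sequence; attention_mask is 1 for real tokens, 0 for padding."""
    max_len = max(len(e["input_ids"]) for e in encodings)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for e in encodings:
        n = len(e["input_ids"])
        pad = max_len - n
        batch["input_ids"].append(e["input_ids"] + [TOY_VOCAB["[PAD]"]] * pad)
        batch["token_type_ids"].append(e["token_type_ids"] + [0] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
    return batch


batch = pad_batch([toy_encode("good movie"), toy_encode("good", "bad movie")])
print(batch["input_ids"])
print(batch["token_type_ids"])
print(batch["attention_mask"])
```

The shorter sequence is padded with the [PAD] id, and its attention_mask is 0 at exactly those positions. The token_type_ids switch to 1 for the second sentence of the pair.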

The fastest way to tokenize an entire dataset is the map() function.

It speeds up tokenization by applying the tokenizer to batches of examples instead of individual examples. Set the batched parameter to True:

def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(
    tokenization, 
    batched=True
)

Map: 100%|██████████| 8530/8530 [00:00<00:00, 28674.03 examples/s]
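
With batched=True, the mapped function receives a dictionary that maps each column name to a list of values, and it returns lists of the same length. The sketch below imitates that calling convention in plain Python so the shape of a batch is visible. `to_batches` and `count_words` are hypothetical helpers and the three rows are made up; this is a simplified stand-in, not how the library implements map().

```python
def to_batches(columns, batch_size):
    """Yield column-oriented batches: {column name: list of values}."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size]
               for name, values in columns.items()}


def count_words(batch):
    # A batched function works on lists, one entry per example in the batch.
    return {"n_words": [len(text.split()) for text in batch["text"]]}


columns = {
    "text": ["a fine film", "too long", "effective but too-tepid biopic"],
    "label": [1, 0, 1],
}

n_words = []
for batch in to_batches(columns, batch_size=2):
    n_words.extend(count_words(batch)["n_words"])

print(n_words)
```

A function written this way can be passed to map() with batched=True, because it reads and returns whole columns rather than single values.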

Use the set_format() function to make the dataset format compatible with PyTorch:

dataset.set_format(
    type="torch", 
    columns=["input_ids", "token_type_ids", "attention_mask", "label"]
)

dataset.format['type']

> 'torch'

Resample audio signals

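Audio inputs, like text datasets, must be divided into discrete data points. This is called sampling, and the sampling rate tells you how many samples of the speech signal are captured per second. It is important to make sure the sampling rate of your dataset matches the sampling rate of the data used to pretrain the model you are using. If the sampling rates differ, the pretrained model may perform poorly on your dataset, because it does not recognize the difference in sampling rate.

1. Start by loading the dataset, the Audio feature, and the feature extractor that corresponds to a pretrained Wav2Vec2 model: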

from transformers import AutoFeatureExtractor
from datasets import load_dataset, Audio

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
dataset = load_dataset(
    "PolyAI/minds14", 
    "en-US", 
    split="train"
)

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/datasets/load.py:1461: FutureWarning: The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 563 examples [00:00, 20031.50 examples/s]

2. Index into the first row of the dataset. When you call the audio column of the dataset, it is automatically decoded and resampled:

dataset[0]["audio"]

> {'path': '/home/kubwa/.cache/huggingface/datasets/downloads/extracted/15659b678c0da93396580df7af6c5d75946c562316a369850d168cf78e44e4aa/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ]),
 'sampling_rate': 8000}

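3. Reading a dataset card is very useful and tells you a lot about a dataset. The MinDS-14 dataset card shows that the sampling rate is 8kHz. Likewise, a model card gives many details about a model. The Wav2Vec2 model card says the model was pretrained on speech audio sampled at 16kHz. This means you need to upsample the MinDS-14 dataset to match the model's sampling rate.

Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, the audio is decoded and resampled to 16kHz: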

dataset = dataset.cast_column(
    "audio", 
    Audio(
        sampling_rate=16_000
    )
)
dataset[0]["audio"]

{'path': '/home/kubwa/.cache/huggingface/datasets/downloads/extracted/15659b678c0da93396580df7af6c5d75946c562316a369850d168cf78e44e4aa/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 1.70562416e-05,  2.18727451e-04,  2.28099874e-04, ...,
         3.43842403e-05, -5.96364771e-06, -1.76846661e-05]),
 'sampling_rate': 16000}
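
The effect of upsampling on the array can be sketched without any audio library. The example below doubles the sampling rate of a tiny made-up signal by linear interpolation. This is a simplification chosen for readability, and `upsample_linear` is a hypothetical helper: resampling libraries typically use band-limited filters rather than straight-line interpolation.

```python
def upsample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative only)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # position on the source time grid
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# 8 samples at 8 kHz cover 1 ms of audio.
one_ms_at_8k = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
one_ms_at_16k = upsample_linear(one_ms_at_8k, 8_000, 16_000)

print(len(one_ms_at_8k), len(one_ms_at_16k))
print(one_ms_at_16k[:3])
```

The duration stays the same while the number of samples doubles, which is why the array printed after cast_column() holds different values from the 8kHz one.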

4. Use the map() function to resample the entire dataset to 16kHz. This function speeds up resampling by applying the feature extractor to batches of examples instead of individual examples. Set the batched parameter to True:

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, 
        sampling_rate=feature_extractor.sampling_rate, 
        max_length=16000, 
        truncation=True
    )
    return inputs

dataset = dataset.map(
    preprocess_function, 
    batched=True
)

Map: 100%|██████████| 563/563 [01:10<00:00,  7.95 examples/s]

Image Dataset & Data augmentations

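The most common preprocessing for image datasets is data augmentation, a process that introduces random variations to an image without changing the meaning of the data. This can mean changing the color properties of an image or randomly cropping it. You are free to use any data augmentation library you like, and Datasets helps you apply the augmentations to your dataset.

1. Start by loading the dataset, the Image feature, and the feature extractor that corresponds to a pretrained ViT model: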

from transformers import AutoFeatureExtractor
from datasets import load_dataset, Image

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
dataset = load_dataset(
    "beans", 
    split="train"
)

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/models/vit/feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead.
  warnings.warn(
Downloading readme: 100%|██████████| 4.95k/4.95k [00:00<00:00, 5.32MB/s]
Downloading data: 100%|██████████| 144M/144M [00:01<00:00, 81.0MB/s] 
Downloading data: 100%|██████████| 18.5M/18.5M [00:00<00:00, 29.7MB/s]
Downloading data: 100%|██████████| 17.7M/17.7M [00:00<00:00, 28.0MB/s]
Generating train split: 100%|██████████| 1034/1034 [00:00<00:00, 2039.97 examples/s]
Generating validation split: 100%|██████████| 133/133 [00:00<00:00, 2603.61 examples/s]
Generating test split: 100%|██████████| 128/128 [00:00<00:00, 1931.51 examples/s]

2. Index into the first row of the dataset. When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image.

dataset[0]["image"]

3. Now you can apply some transforms to the image. Take a look at the transforms available in torchvision and choose one you would like to experiment with. This example applies RandomRotation, which rotates the image randomly:

https://pytorch.org/vision/0.9/transforms.html

from torchvision.transforms import RandomRotation

rotate = RandomRotation(degrees=(0, 90))
def transforms(examples):
    examples["pixel_values"] = [rotate(image) for image in examples["image"]]
    return examples

4. Use the set_transform() function to apply the transform on the fly. When you index into the image pixel_values, the transform is applied and the image is rotated.

dataset.set_transform(transforms)
dataset[0]["pixel_values"]
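
"On the fly" means the transform runs each time a row is read, rather than once over the whole dataset. The plain-Python sketch below illustrates that idea under stated assumptions: `OnTheFlyRows` and `random_angle` are hypothetical, the file names are made up, and the transform only records a sampled angle instead of rotating pixels. It also simplifies by transforming one row at a time, whereas the `transforms` function above receives a batch.

```python
import random


class OnTheFlyRows:
    """Hypothetical wrapper: applies a transform when a row is read, not before."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0

    def __getitem__(self, index):
        self.calls += 1
        # Work on a copy so the stored rows stay untouched.
        return self.transform(dict(self.rows[index]))


rng = random.Random(0)


def random_angle(row):
    # Stand-in for a random rotation between 0 and 90 degrees.
    row["angle"] = rng.uniform(0, 90)
    return row


rows = [{"image": "leaf_0.jpg"}, {"image": "leaf_1.jpg"}]
lazy = OnTheFlyRows(rows, random_angle)

first_read = lazy[0]
second_read = lazy[0]

print(lazy.calls, first_read["angle"] != second_read["angle"])
```

Reading the same row twice draws a new random angle each time, and the stored data is never modified.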

Create Dataset

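When you work with your own data, you sometimes need to create a dataset. Creating a dataset with Datasets gives it all the advantages of the library: fast loading and processing, streaming of large datasets, memory mapping, and more. The low-code approaches in Datasets let you create a dataset quickly and easily, which reduces the time it takes to start training a model. In many cases it is as simple as dragging and dropping your data files into a dataset repository on the Hub.

This section covers two approaches:

Folder-based builders for quickly creating an image or audio dataset
from_ methods for creating a dataset from local files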

Folder-based builders

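There are two folder-based builders, ImageFolder and AudioFolder.

They are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples.

They are useful for rapidly prototyping computer vision and speech models before scaling to a larger dataset.

Folder-based builders take your data and automatically generate the dataset's features, splits, and labels.

ImageFolder uses the Image feature to decode image files. Many image extension formats are supported, such as jpg and png, among others. See the Datasets documentation for the complete list of supported image extensions.
AudioFolder uses the Audio feature to decode audio files. Audio extensions such as wav and mp3 are supported. See the Datasets documentation for the complete list of supported audio extensions.

Create an image dataset by specifying imagefolder in load_dataset():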

from datasets import load_dataset

dataset = load_dataset(
    "imagefolder", 
    data_dir="dataset/pokemon"
)

An audio dataset is created the same way, except that you specify audiofolder in load_dataset() instead:

from datasets import load_dataset

dataset = load_dataset(
    "audiofolder", 
    data_dir="dataset/audio"
)

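Additional information about your dataset, such as text captions or transcriptions, can be included in a metadata.csv file placed in the folder that contains your dataset.

The metadata file needs a file_name column that links each image or audio file to its corresponding metadata.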
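
As a rough sketch of the expected layout, the snippet below writes a small metadata.csv into a dataset folder using only the standard library. The folder name, the file names, and the `text` caption column are assumptions made for illustration; only the file_name column is required by the description above. The image files themselves are not created here.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical dataset folder; in practice this is where the images live.
data_dir = Path(tempfile.mkdtemp()) / "pokemon"
data_dir.mkdir()

rows = [
    {"file_name": "0001.png", "text": "a green seed pokemon"},
    {"file_name": "0002.png", "text": "a blue turtle pokemon"},
]

# file_name links each row to an image file in the same folder.
with open(data_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
    writer.writeheader()
    writer.writerows(rows)

metadata_text = (data_dir / "metadata.csv").read_text(encoding="utf-8")
print(metadata_text)
```

Any extra column, such as the caption column used here, sits alongside file_name, one row per file.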

From local files

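You can also create a dataset from local files by specifying the path to the data files. There are two ways to create a dataset using the from_ methods:

The from_generator() method is the most memory-efficient way to create a dataset from a generator, thanks to the generator's iterative behavior. It is especially useful when you work with a very large dataset that may not fit in memory, because the dataset is generated on disk progressively and then memory-mapped.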

from datasets import Dataset

def gen():
    yield {"pokemon": "bulbasaur", "type": "grass"}
    yield {"pokemon": "squirtle", "type": "water"}
ds = Dataset.from_generator(gen)
ds[0]

> Generating train split: 2 examples [00:00, 707.36 examples/s]

{'pokemon': 'bulbasaur', 'type': 'grass'}

A generator-based IterableDataset needs to be iterated over with a for loop, for example:

from datasets import IterableDataset

ds = IterableDataset.from_generator(gen)
for example in ds:
    print(example)

> {'pokemon': 'bulbasaur', 'type': 'grass'}
{'pokemon': 'squirtle', 'type': 'water'}

The from_dict() method is a straightforward way to create a dataset from a dictionary:

from datasets import Dataset

ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
ds[0]

> {'pokemon': 'bulbasaur', 'type': 'grass'}
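
The dictionary passed to from_dict() is column-oriented: each key is a column name and each value is the list of that column's entries. If your data starts out as a list of row records, it can be regrouped first. The helper below is a hypothetical sketch that assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Regroup a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = [
    {"pokemon": "bulbasaur", "type": "grass"},
    {"pokemon": "squirtle", "type": "water"},
]

columns = rows_to_columns(rows)
print(columns)
```

The result has the same shape as the dictionary used in the from_dict() call above.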

To create an image or audio dataset, chain the cast_column() method with from_dict() and specify the column and feature type. For example, to create an audio dataset:

from datasets import Audio, Dataset

audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

Dataset Push to Huggingface

There are two ways to push (upload) datasets, models, and other artifacts to Hugging Face:

Uploading directly on huggingface.co
Using huggingface-cli from Python

This section uses the second method.

%pip install huggingface_hub
%pip install ipywidgets

1. To upload a dataset to the Hub with Python, you need to log in to your Hugging Face account:

!huggingface-cli login

In a Jupyter Notebook, log in by running the code below:

from huggingface_hub import notebook_login

notebook_login()

> VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

2. Use the push_to_hub() function to add, commit, and push files to your repository:

from datasets import load_dataset

dataset = load_dataset("<your_repository_id>/demo")
dataset.push_to_hub("<your_repository_id>/processed_demo")

To make a dataset private, set private=True. This parameter only works if you are creating the repository for the first time.

dataset.push_to_hub(
    "<your_repository_id>/private_processed_demo", 
    private=True
)

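Only you can access a private dataset.

Similarly, if you share a dataset within your organization, members of the organization can also access it.

Load a private dataset by providing your authentication token to the token parameter: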

from datasets import load_dataset

dataset = load_dataset("stevhliu/demo", token=True)
dataset = load_dataset("organization/dataset_name", token=True)
