Text-to-Speech: TTS
Text-to-Speech (TTS) is the task of converting written text into spoken words using machine learning models.
These models can generate natural-sounding speech from text input, making them useful for applications such as voice assistants, audiobooks, and accessibility tools.
from transformers import pipeline
from IPython.display import Audio

TEXT = "And you know what they call a... a... a Quarter Pounder with Cheese in Seoul?"

# Bark (small variant) generates speech directly from text
pipe = pipeline("text-to-speech", model="suno/bark-small")
output = pipe(TEXT)
print(output)
{'audio': array([[-0.00355643, -0.00254505, -0.00216578, ..., 0.00740955,
0.00741918, 0.00732478]], dtype=float32), 'sampling_rate': 24000}
Audio(output["audio"], rate=output["sampling_rate"])
SpeechT5 Model
import soundfile as sf
import torch
from transformers import pipeline
from datasets import load_dataset

synthesiser = pipeline(
    "text-to-speech",
    "microsoft/speecht5_tts",
)

# x-vector speaker embeddings extracted from the CMU Arctic dataset
embeddings_dataset = load_dataset(
    "Matthijs/cmu-arctic-xvectors",
    split="validation",
)
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = synthesiser(
    TEXT,
    forward_params={"speaker_embeddings": speaker_embedding},
)

sf.write(
    "dataset/speech.wav",
    speech["audio"],
    samplerate=speech["sampling_rate"],
)
Audio("dataset/speech.wav")
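The x-vector used above is a learned speaker embedding, and `unsqueeze(0)` adds the batch dimension the pipeline expects. A minimal sketch of the expected tensor shape (the placeholder zeros are an assumption standing in for a real `dataset[7306]["xvector"]` entry; 512 is the x-vector dimensionality in this dataset):

```python
import torch

# Each CMU Arctic x-vector is a 512-dim speaker embedding; the pipeline
# expects shape (1, 512), hence the unsqueeze(0) above.
xvector = [0.0] * 512  # placeholder for dataset[7306]["xvector"]
speaker_embedding = torch.tensor(xvector).unsqueeze(0)
print(speaker_embedding.shape)  # torch.Size([1, 512])
```

Swapping in a different row of the embeddings dataset changes the voice of the synthesized speech without retraining anything.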