Text-to-Speech: TTS
Text-to-Speech (TTS) is the task of converting written text into spoken words using machine learning models.
These models can generate natural-sounding speech from text input, making them useful for applications such as voice assistants, audiobooks, and accessibility tools.
from transformers import pipeline
from IPython.display import Audio

TEXT = "And you know what they call a... a... a Quarter Pounder with Cheese in Seoul?"

# Bark (small variant) generates speech directly from text
pipe = pipeline("text-to-speech", model="suno/bark-small")
output = pipe(TEXT)
print(output)
{'audio': array([[-0.00355643, -0.00254505, -0.00216578, ..., 0.00740955,
0.00741918, 0.00732478]], dtype=float32), 'sampling_rate': 24000}
Audio(output["audio"], rate=output["sampling_rate"])
SpeechT5 Model
import soundfile as sf
import torch
from transformers import pipeline
from datasets import load_dataset

synthesiser = pipeline(
    "text-to-speech",
    "microsoft/speecht5_tts",
)

# x-vector speaker embeddings extracted from the CMU Arctic dataset
embeddings_dataset = load_dataset(
    "Matthijs/cmu-arctic-xvectors",
    split="validation",
)
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = synthesiser(
    TEXT,
    forward_params={"speaker_embeddings": speaker_embedding},
)

sf.write(
    "dataset/speech.wav",
    speech["audio"],
    samplerate=speech["sampling_rate"],
)
Audio("dataset/speech.wav")
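The x-vector used above is a learned speaker embedding, and `unsqueeze(0)` adds the batch dimension the pipeline expects. A minimal sketch of the expected tensor shape (the placeholder zeros are an assumption standing in for a real `dataset[7306]["xvector"]` entry; 512 is the x-vector dimensionality in this dataset):

```python
import torch

# Each CMU Arctic x-vector is a 512-dim speaker embedding; the pipeline
# expects shape (1, 512), hence the unsqueeze(0) above.
xvector = [0.0] * 512  # placeholder for dataset[7306]["xvector"]
speaker_embedding = torch.tensor(xvector).unsqueeze(0)
print(speaker_embedding.shape)  # torch.Size([1, 512])
```

Swapping in a different row of the embeddings dataset changes the voice of the synthesized speech without retraining anything.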