Speech Recognition: Whisper
Automatic Speech Recognition
Automatic speech recognition (ASR) is the task of converting spoken language into text. In the context of Hugging Face, this means performing ASR with the models and tools available on the Hugging Face platform.
!wget https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav -O ./dataset/speech.wav
--2024-05-19 12:58:31-- https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
Resolving www.voiptroubleshooter.com (www.voiptroubleshooter.com)... 162.241.218.124
Connecting to www.voiptroubleshooter.com (www.voiptroubleshooter.com)|162.241.218.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 538014 (525K) [audio/x-wav]
Saving to: ‘./dataset/speech.wav’
./dataset/speech.wa 100%[===================>] 525.40K 689KB/s in 0.8s
2024-05-19 12:58:33 (689 KB/s) - ‘./dataset/speech.wav’ saved [538014/538014]
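The downloaded file is 8 kHz audio (the `_8k` suffix), while most speech models expect 16 kHz input, so the sample rate matters later. One way to inspect a WAV header is Python's standard `wave` module. The sketch below writes a tiny synthetic 8 kHz file so it is self-contained; point `path` at `dataset/speech.wav` to inspect the real download instead.

```python
import math
import struct
import wave

# Write one second of a 440 Hz tone at 8 kHz, mono, 16-bit PCM,
# so the example runs without the downloaded file.
path = "tone_8k.wav"
with wave.open(path, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(8000)   # 8 kHz, like the OSR sample
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / 8000)))
        for t in range(8000)
    )
    w.writeframes(frames)

# Inspect the header: rate, channels, frame count.
with wave.open(path, "rb") as w:
    print(w.getframerate(), w.getnchannels(), w.getnframes())
```

A framerate of 8000 here means the audio must be resampled to 16 kHz before being fed to wav2vec2 or Whisper.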
wav2vec2-large-xlsr
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/wav2vec2-large-xlsr-53-english"
)

# wav2vec2 is a CTC model, so there is no autoregressive generation step
# (no generate_kwargs); this checkpoint is already fine-tuned for English.
result = pipe(["dataset/speech.wav"])
result
Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-english were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-english and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Could not load the `decoder` for jonatasgrosman/wav2vec2-large-xlsr-53-english. Defaulting to raw CTC. Error: No module named 'kenlm'
Try to install `kenlm`: `pip install kenlm`
Try to install `pyctcdecode`: `pip install pyctcdecode`
[{'text': 'the berch canoe slit on the smooth planks glue the sheet to the dark blue backgroulit is easy to tell the depth of a wel these days e chicken-leg is a rare dish rice is often served in round bulls he juice of lemons mixed fine punch the box was stone beside the park truckthe hogs were fed chopped corn and garbagefour hours of stady work faced us a large size and stockings is hard to sell'}]
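The "Could not load the `decoder`" warning above means the pipeline fell back to raw greedy CTC decoding, which is why the transcript contains errors like "berch canoe" and "round bulls". Installing the two optional packages the warning names lets the pipeline load this checkpoint's language-model-boosted decoder (note: `kenlm` builds from source and may require a C++ toolchain):

```shell
pip install pyctcdecode kenlm
```

After installation, rerunning the same pipeline code picks up the decoder automatically.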
Whisper
Whisper is a neural-network-based model developed by OpenAI for automatic speech recognition (ASR). It is designed to convert speech to text with high accuracy across a wide range of languages and domains.
https://openai.com/index/whisper/
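Whisper does not consume raw waveforms directly: its encoder takes a log-mel spectrogram computed over a fixed 30-second window of 16 kHz audio, which is what the processor below produces. A minimal sketch using `WhisperFeatureExtractor` with its default hyperparameters (80 mel bins, 160-sample hop) on synthetic audio, to show the shape the model expects:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Defaults match openai/whisper-small: 16 kHz input, 80 mel bins,
# and a 30-second window padded/truncated to 3000 frames.
fe = WhisperFeatureExtractor()

audio = np.random.randn(16000 * 5).astype(np.float32)  # 5 s of noise
feats = fe(audio, sampling_rate=16000, return_tensors="np").input_features
print(feats.shape)  # clips shorter than 30 s are zero-padded
```

Because the window is fixed, a 5-second clip and a 29-second clip yield the same (1, 80, 3000) feature tensor.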
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None

# Load the WAV file
audio_file = "dataset/speech.wav"
audio, sample_rate = sf.read(audio_file)

# Resample the audio to 16,000 Hz if necessary
if sample_rate != 16000:
    audio = librosa.resample(
        audio,
        orig_sr=sample_rate,
        target_sr=16000
    )
    sample_rate = 16000

# Process the audio file
input_features = processor(
    audio,
    sampling_rate=sample_rate,
    return_tensors="pt"
).input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)
print("Transcription:", transcription[0])
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Transcription: The birch canoe slid on the smooth planks. Glued the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. Four hours of steady work faced us.
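As the warning above notes, a multilingual Whisper checkpoint auto-detects the language by default. Instead of clearing `forced_decoder_ids` as in the code above, the transcription can be pinned to English via the processor's `get_decoder_prompt_ids`. The sketch below only downloads the tokenizer files, not the model weights; the resulting pairs would be passed as `forced_decoder_ids=...` to `model.generate(...)`:

```python
from transformers import WhisperProcessor

# Tokenizer/feature extractor only; no model weights are fetched.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Decoder prompt forcing English transcription without timestamps:
# a list of (position, token_id) pairs for the special tokens
# <|en|>, <|transcribe|>, <|notimestamps|>.
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
print(forced_ids)
```

In recent transformers versions the same effect can be had by passing `language="en", task="transcribe"` directly to `model.generate(...)`.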