Audio Classification

Audio classification is the task of assigning audio data to predefined categories or labels.

Before classification, the audio file is converted into a format the chosen model can process, such as a waveform or a spectrogram.

A waveform is a visual representation of an audio signal's amplitude over time; it shows how the amplitude of the sound wave changes.

In audio processing, waveforms are essential for analyzing characteristics of a sound such as loudness, pitch, and duration.

Load & Transformation

The librosa package can be used for loading and transformation. The load function of the librosa library reads the audio file specified by audio_path.

The returned waveform is an array of the audio signal's amplitude over time, that is, a sequence of floating-point numbers representing the sound wave; sample_rate is the number of samples per second.

import librosa
import numpy as np

# Load the audio file; sr=None keeps the file's native sampling rate
# instead of resampling to librosa's default of 22050 Hz.
audio_path = 'dataset/speech.wav'
waveform, sample_rate = librosa.load(
    audio_path,
    sr=None
)

import matplotlib.pyplot as plt

# One timestamp (in seconds) per sample: sample index / sampling rate.
# (librosa.times_like targets frame-based features such as spectrograms,
# so it would stretch this axis by the hop length.)
time_axis = np.arange(len(waveform)) / sample_rate

# Plot amplitude against time.
plt.figure(figsize=(10, 4))
plt.plot(time_axis, waveform)
plt.title('Waveform of Audio')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.show()
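
As noted earlier, some models consume spectrograms rather than raw waveforms. Below is a minimal sketch of that conversion with librosa, reusing the waveform loaded above; n_mels=128 is an illustrative choice, not a value required by any particular model.

import librosa.display

# Mel spectrogram of the waveform, converted to decibels for display.
mel_spec = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_mels=128
)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_spec_db, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.show()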

Transformer

from transformers import pipeline

# Audio Spectrogram Transformer (AST) fine-tuned on AudioSet for
# general-purpose audio tagging (speech, music, ambient sounds, ...).
pipe = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Pass the raw array together with its sampling rate as a dict, so the
# pipeline can resample the audio to the rate the model's feature
# extractor expects; a bare array is assumed to already be at that rate.
results = pipe({"raw": waveform, "sampling_rate": sample_rate})

print(results)


[{'score': 0.7802741527557373, 'label': 'Speech'}, {'score': 0.04130024090409279, 'label': 'Female speech, woman speaking'}, {'score': 0.03985799476504326, 'label': 'Writing'}, {'score': 0.01937979832291603, 'label': 'Narration, monologue'}, {'score': 0.008981602266430855, 'label': 'Whispering'}]
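The pipeline returns the five highest-scoring labels by default; the top_k parameter changes how many candidates come back. A brief sketch reusing the pipe defined above:

# Request the ten highest-scoring labels instead of the default five.
results = pipe({"raw": waveform, "sampling_rate": sample_rate}, top_k=10)
print(results)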
# wav2vec2 fine-tuned for speaker identification (the SUPERB SID task);
# its labels are VoxCeleb1 speaker IDs.
pipe = pipeline(
    "audio-classification",
    model="superb/wav2vec2-base-superb-sid"
)

# Same dict input, so the audio is resampled for this model as well.
results = pipe({"raw": waveform, "sampling_rate": sample_rate})

print(results)

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/configuration_utils.py:364: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(

Some weights of the model checkpoint at superb/wav2vec2-base-superb-sid were not used when initializing Wav2Vec2ForSequenceClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at superb/wav2vec2-base-superb-sid and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[{'score': 0.8375324606895447, 'label': 'id10870'}, {'score': 0.0740685909986496, 'label': 'id10699'}, {'score': 0.046333614736795425, 'label': 'id10259'}, {'score': 0.017094021663069725, 'label': 'id10829'}, {'score': 0.008926299400627613, 'label': 'id10587'}]
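
Each result carries a score between 0 and 1, so a common post-processing step is to keep the top prediction or only predictions above a confidence cutoff. A minimal sketch; the 0.5 threshold is an arbitrary illustrative value:

# Highest-scoring speaker, plus any predictions above a confidence cutoff.
best = max(results, key=lambda r: r['score'])
confident = [r for r in results if r['score'] >= 0.5]
print(f"Predicted speaker: {best['label']} (score {best['score']:.3f})")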