How to feed a numpy array as audio to a Whisper model

9rnv2umw · asked on 2023-06-23

So I want to open an mp3 with AudioSegment, convert the AudioSegment object to a numpy array, and use that numpy array as input to a Whisper model. I followed this question: How to create a numpy array from a pydub AudioSegment?, but it didn't help, because I always get errors like

Traceback (most recent call last):
  File "E:\Programmi\PythonProjects\whisper_real_time\test\converting_test.py", line 19, in <module>
    result = audio_model.transcribe(arr_copy, language="en", word_timestamps=True,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\transcribe.py", line 121, in transcribe
    mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\audio.py", line 146, in log_mel_spectrogram
    audio = F.pad(audio, (0, padding))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 86261939712 bytes.

This error is strange, because if I feed the file directly, as below, I get no problem at all:

result = audio_model.transcribe("../audio_test_files/1001_IEO_DIS_HI.mp3", language="en", word_timestamps=True,
                                        fp16=torch.cuda.is_available())

Here is the code I wrote:

from pydub import AudioSegment
import numpy as np
import whisper
import torch

audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")

dtype = getattr(np, "int{:d}".format(audio.sample_width * 8))  # or a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
# 2-D array of shape (frames, channels) with raw integer samples
arr = np.ndarray((int(audio.frame_count()), audio.channels), buffer=audio.raw_data, dtype=dtype)
arr_copy = arr.copy()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper...")
audio_model = whisper.load_model("small", download_root="../models", device=device)
print(f"Transcribing...")
result = audio_model.transcribe(audio=arr_copy, language="en", word_timestamps=True,
                                fp16=torch.cuda.is_available())  # , initial_prompt=result.get('text', ""))
text = result['text'].strip()
print(text)

How can I do this?

------- EDIT -------

I edited the code and now use the version below. I no longer get the previous error, but the model doesn't seem to transcribe correctly. I tested the audio I pass to the model by exporting it to a wav file; when I played it back there was a lot of noise and I couldn't understand what was being said, which is why the model doesn't transcribe. Is the normalization I'm doing on the channel okay?

from pydub import AudioSegment
import numpy as np
import whisper
import torch

language = "en"
model = "medium"
model_path = "../models"

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper {model} model {language}...")
audio_model = whisper.load_model(model, download_root=model_path, device=device)

# load wav file with pydub
audio_path = "20230611-004146_audio_chunk.wav"
audio_segment = AudioSegment.from_wav(audio_path)
#audio_segment = audio_segment.low_pass_filter(1000)
# get sample rate
sample_rate = audio_segment.frame_rate
arr = np.array(audio_segment.get_array_of_samples())
arr_copy = arr.copy()
arr_copy = torch.from_numpy(arr_copy)
arr_copy = arr_copy.to(torch.float32)
# normalize
arr_copy = arr_copy / 32768.0
# to device
arr_copy = arr_copy.to(device)

print(f"Transcribing...")
result = audio_model.transcribe(arr_copy, language=language, fp16=torch.cuda.is_available())
text = result['text'].strip()
print(text)

waveform = arr_copy.cpu().numpy()
audio_segment = AudioSegment(
    waveform.tobytes(),
    frame_rate=sample_rate,
    sample_width=waveform.dtype.itemsize,
    channels=1
)
audio_segment.export("test.wav", format="wav")

avkwfej4 1#

If I remember correctly, Whisper internally operates on 30-second segments of 16 kHz mono audio. Conversion to the correct format, splitting, and padding are all handled by the transcribe function, which is why everything works when you provide the MP3 path. (The huge allocation in your traceback is most likely because you passed a 2-D (frames, channels) array: log_mel_spectrogram pads the last axis with 480000 samples, so every frame row gets padded rather than a single time axis.)
If you want to feed a numpy array, you need to do the format and sample-rate conversion yourself. I suggest you first create a short (say 10-second) audio clip in WAV PCM format. Loading it should give you an int16 array of 160000 samples (10 s * 16 kHz = 160000). Convert the values to float32 and normalize them by dividing by 32768.0. Whisper should accept that result.

from pydub import AudioSegment
import numpy as np

# audio_path, audio_model and language are defined as in your script
audio_segment = AudioSegment.from_mp3(audio_path)

# convert to expected format
if audio_segment.frame_rate != 16000: # 16 kHz
    audio_segment = audio_segment.set_frame_rate(16000)
if audio_segment.sample_width != 2:   # int16
    audio_segment = audio_segment.set_sample_width(2)
if audio_segment.channels != 1:       # mono
    audio_segment = audio_segment.set_channels(1)        
arr = np.array(audio_segment.get_array_of_samples())
arr = arr.astype(np.float32)/32768.0

result = audio_model.transcribe(arr, language=language, fp16=torch.cuda.is_available())
print(result['text'])
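
Note that Whisper itself ships a helper that does this decoding and conversion for you via ffmpeg: whisper.load_audio returns exactly the kind of array transcribe expects, so if you don't strictly need pydub you can skip the manual conversion. A minimal sketch:

import whisper
import torch

# whisper.load_audio decodes any ffmpeg-readable file and returns a mono
# float32 numpy array resampled to 16 kHz, already normalized to [-1.0, 1.0],
# the same representation transcribe() builds internally from a file path
audio_model = whisper.load_model("small")
audio = whisper.load_audio("../audio_test_files/1001_IEO_DIS_HI.mp3")
result = audio_model.transcribe(audio, language="en", fp16=torch.cuda.is_available())
print(result['text'])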

If your original audio is noisy, it is hard to expect good transcription results.
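
One more note on your exported test.wav: part of the noise is probably an artifact of the export itself rather than the original recording. After normalization waveform is float32, so waveform.dtype.itemsize is 4 and pydub interprets the raw buffer as 32-bit integer PCM, which plays back as noise. Here is a sketch that converts back to int16 before exporting, assuming arr_copy and sample_rate from your edited script (and mono audio; for a stereo source get_array_of_samples returns interleaved samples, which the set_channels(1) step above avoids):

import numpy as np
from pydub import AudioSegment

# arr_copy (normalized float32 torch tensor) and sample_rate are assumed
# to come from the edited script in the question
waveform = arr_copy.cpu().numpy()  # float32 values in [-1.0, 1.0]

# undo the normalization and clip to the valid int16 range before export,
# otherwise pydub treats the float32 buffer as 32-bit integer PCM
pcm16 = np.clip(waveform * 32768.0, -32768, 32767).astype(np.int16)

audio_segment = AudioSegment(
    pcm16.tobytes(),
    frame_rate=sample_rate,
    sample_width=2,  # bytes per sample for int16
    channels=1
)
audio_segment.export("test.wav", format="wav")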
