Muting/silencing the non-speech parts of audio with Python (voice activity detection)

jvlzgdj9  posted 2023-05-21  in Python

My goal is to silence all parts of a .wav audio file that contain no speech. I am currently using webrtcvad, but all I have managed so far is to remove the non-speech parts from the audio entirely (using the example.py code: https://github.com/wiseman/py-webrtcvad/blob/master/example.py). I would appreciate it if anyone could point me in the right direction or show me how to achieve this! It also sounds like it could be a background-noise-removal problem.


sg2wtvxw 1#

I assume you want the output WAV to have the same duration as the input, with the non-speech regions replaced by silence and the speech regions left unchanged.
The way to do this is to multiply the audio signal with the output of the detector. The detector should output 1.0 for pass (speech) and 0.0 for mute (non-speech).
Sometimes a small value is used instead of 0.0 for the blocked parts, to just reduce the volume rather than mute it completely. For example 0.01 (-20 dB).
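As a minimal sketch of that idea (the signal, the segment location, and the 0.01 floor below are illustrative assumptions, not taken from the full code further down):

import numpy

sr = 16000
audio = numpy.random.uniform(-1.0, 1.0, size=10 * sr)  # stand-in for a loaded waveform

# gain mask: 1.0 where the detector found speech, 0.01 everywhere else
gain = numpy.full(len(audio), 0.01)
gain[2 * sr : 5 * sr] = 1.0  # pretend speech was detected between 2 s and 5 s

muted = audio * gain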
Sometimes the abrupt transitions can sound a bit harsh. In that case a little smoothing, or a fade, can be applied. A simple option is an exponential moving average, sketched below.
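Here is a minimal sketch of such smoothing (the coefficient value is an assumption you would tune by ear; this helper is not part of the full code below):

import numpy

def smooth_gain(gain, coeff=0.99):
    # exponential moving average over the gain curve,
    # turning hard 0/1 steps into gradual fades
    smoothed = numpy.empty_like(gain)
    acc = gain[0]
    for i, g in enumerate(gain):
        acc = coeff * acc + (1.0 - coeff) * g
        smoothed[i] = acc
    return smoothed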
Below is complete example code in Python, using the pretrained vad-crdnn-libriparty model from the SpeechBrain project.
The code is also available in this GitHub repo: https://github.com/jonnor/machinehearing/blob/master/handson/voice-activity-detection/supress.py

import math
import numpy
import pandas
import librosa
import librosa.display
import soundfile
from speechbrain.pretrained import VAD
import matplotlib
import matplotlib.pyplot as plt

def detect_voice(
    path,
    activation_threshold = 0.70,
    deactivation_threshold = 0.25,
    min_pause = 0.200,
    min_activation = 0.100,
    save_dir = 'model_dir',
    segment_pre = 0.0,
    segment_post = 0.0,
    double_check_threshold = None,
    parallel_chunks = 4,
    chunk_size = 1.0,
    overlap_chunks = True,
    ):

    # do initial, coarse detection
    vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", savedir=save_dir)

    probabilities = vad.get_speech_prob_file(path,
        large_chunk_size=chunk_size*parallel_chunks,
        small_chunk_size=chunk_size,
        overlap_small_chunk=overlap_chunks)

    thresholded = vad.apply_threshold(probabilities,
        activation_th=activation_threshold,
        deactivation_th=deactivation_threshold).float()

    boundaries = vad.get_boundaries(thresholded)

    # refine boundaries using energy-based VAD
    boundaries = vad.energy_VAD(path, boundaries,
            activation_th=activation_threshold,
            deactivation_th=deactivation_threshold)

    # post-process to clean up
    if min_pause is not None:
        boundaries = vad.merge_close_segments(boundaries, close_th=min_pause)

    if min_activation is not None:
        boundaries = vad.remove_short_segments(boundaries, len_th=min_activation)

    if double_check_threshold:
        boundaries = vad.double_check_speech_segments(boundaries, speech_th=double_check_threshold)

    # convert to friendly pandas DataFrames with time info 
    events = pandas.DataFrame(boundaries, columns=['start', 'end'])
    events['class'] = 'speech'

    p = numpy.squeeze(probabilities)
    times = pandas.Series(numpy.arange(0, len(p)) * vad.time_resolution, name='time')
    p = pandas.DataFrame(p, columns=['speech'], index=times)

    return p, events


def apply_gain(path, segments, default=0.0, out=None, sr=None):

    audio, sr = soundfile.read(path, always_2d=True)

    # compute gain curves
    gains = numpy.full_like(audio, librosa.db_to_power(default)) 

    for idx, seg in segments.iterrows():

        s = math.floor(sr * seg['start'])
        e = math.ceil(sr * seg['end'])
        gain = librosa.db_to_power(seg['gain'])

        gains[s:e, :] = gain

    # apply to audio
    audio = audio * gains

    if out is not None:
        soundfile.write(out, audio, samplerate=sr)

    return audio, sr

def plot_spectrogram(ax, path, sr=16000, hop_length=1024):

    audio, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=audio, sr=sr, hop_length=hop_length)
    S_db = librosa.power_to_db(S, ref=numpy.max)

    librosa.display.specshow(ax=ax, data=S_db,
            sr=sr, hop_length=hop_length,
            x_axis='time', y_axis='mel')

    return S_db

def plot_vad(input_path, probabilities, boundaries, output_path):

    fig, (input_spec_ax, vad_ax, output_spec_ax) = plt.subplots(3, figsize=(10, 5), sharex=True)

    # show spectrogram
    plot_spectrogram(ax=input_spec_ax, path=input_path)

    # show VAD results
    probabilities.reset_index().plot(ax=vad_ax, x='time', y='speech')

    for start, end in zip(boundaries['start'], boundaries['end']):
        vad_ax.axvspan(start, end, alpha=0.3, color='green')

    vad_ax.xaxis.set_minor_locator(matplotlib.ticker.MultipleLocator(1.0))
    vad_ax.grid(True, which='minor', axis='x')
    vad_ax.grid(True, which='major', axis='x')

    # show modified audio
    plot_spectrogram(ax=output_spec_ax, path=output_path)

    fig.tight_layout()
    return fig

# XXX: the model only supports a 16 kHz sample rate.
# If the input has a different sample rate, it must be resampled first (see the sketch after this code)
path = 'voiceandnot_16k.wav'
prob, segments = detect_voice(path)

segments['gain'] = 0.0  # 0 dB: leave detected speech unchanged

out_path = 'voice-supressed.wav'
apply_gain(path, segments, default=-20.0, out=out_path)  # attenuate non-speech by -20 dB

fig = plot_vad(path, prob, segments, out_path)
fig.savefig('vad-output.png')
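
The 16 kHz requirement noted in the comment above can be handled with a quick resampling step. A minimal sketch using librosa and soundfile (the input filename is a placeholder):

import librosa
import soundfile

audio, _ = librosa.load('input.wav', sr=16000, mono=True)  # librosa resamples on load
soundfile.write('voiceandnot_16k.wav', audio, samplerate=16000)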

Here is an example plot showing the input audio, the VAD activations/segmentation, and the modified output audio.
