CMU 11492/11692 Spring 2023: Speech Enhancement¶
In this demonstration, we will show how to run speech enhancement systems in ESPnet.
Main references:
- ESPnet repository
- ESPnet documentation
- ESPnet-SE repo
Author:
- Siddhant Arora (siddhana@andrew.cmu.edu)
The notebook is adapted from this Colab
❗Important Notes❗¶
We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use GPU for some time.
There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.
Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using
File -> Print
in the menu bar. You also need to submit the spectrograms and waveforms of the noisy and enhanced audio files to Gradescope.
Contents¶
Tutorials on the Basic Usage
Install
Speech Enhancement with Pretrained Models
We support various interfaces, e.g., the Python API, the Hugging Face API, and portable speech enhancement scripts for other tasks.
2.1 Single-channel Enhancement (CHiME-4)
2.2 Enhance Your Own Recordings
2.3 Multi-channel Enhancement (CHiME-4)
Speech Separation with Pretrained Models
3.1 Model Selection
3.2 Separate Speech Mixture
Evaluate Separated Speech with the Pretrained ASR Model
Tutorials on the Basic Usage
Install¶
Unlike the previous assignment, where we installed the full version of ESPnet, here we use a lightweight ESPnet package that is mainly designed for inference. Installing the light version can be much faster than a full installation.
[ ]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
%pip uninstall -y torch
%pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
%pip install -q git+https://github.com/espnet/espnet
%pip install -q espnet_model_zoo
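Optionally, you can sanity-check the installation before moving on. The cell below is a minimal check (not part of the original assignment) that the pinned torch build and the ESPnet package import correctly and that a GPU is visible:
[ ]:
# Quick sanity check of the lightweight install and GPU availability.
import torch
import espnet
print("torch:", torch.__version__)          # expect 1.13.0+cu117
print("espnet:", espnet.__version__)
print("CUDA available:", torch.cuda.is_available())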
Speech Enhancement with Pretrained Models¶
Single-Channel Enhancement, the CHiME example¶
Task 1 (✅ Checkpoint 1 (1 point))¶
Run inference of the pretrained single-channel enhancement model.
[ ]:
# Download one utterance from real noisy speech of CHiME4
!gdown --id 1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS -O /content/M05_440C0213_PED_REAL.wav
import os
import soundfile
from IPython.display import display, Audio
mixwav_mc, sr = soundfile.read("/content/M05_440C0213_PED_REAL.wav")
# mixwav_mc.shape: (num_samples, num_channels)
mixwav_sc = mixwav_mc[:,4]  # take the 5th channel as the single-channel input
display(Audio(mixwav_mc.T, rate=sr))
Download and load the pretrained Conv-Tasnet¶
[ ]:
!gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/chime_tasnet_singlechannel.zip
!unzip /content/chime_tasnet_singlechannel.zip -d /content/enh_model_sc
[ ]:
# Load the model
# If you encounter the error "No module named 'espnet2'", please re-run the first cell. This might be a Colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_sc = SeparateSpeech(
train_config="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/config.yaml",
model_file="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/5epoch.pth",
# for segment-wise process on long speech
normalize_segment_scale=False,
show_progressbar=True,
ref_channel=4,
normalize_output_wav=True,
device="cuda:0",
)
Enhance the single-channel real noisy speech in CHiME4¶
Please submit a screenshot of the output of the current block, as well as the spectrogram and waveform of the noisy and enhanced speech files, to Gradescope for Task 1.
[ ]:
# enhance and play the single-channel speech
wave = enh_model_sc(mixwav_sc[None, ...], sr)
print("Input real noisy speech", flush=True)
display(Audio(mixwav_sc, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))
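To produce the spectrogram and waveform figures requested for Task 1, one option is the matplotlib sketch below (not part of the original notebook); it assumes the mixwav_sc, wave, and sr variables from the cells above and uses matplotlib's built-in specgram rather than ESPnet's plotting utilities.
[ ]:
# A minimal sketch for plotting the noisy vs. enhanced waveform and spectrogram.
# mixwav_sc, wave, and sr come from the cells above.
import matplotlib.pyplot as plt
import numpy as np

enhwav_sc = wave[0].squeeze()

plt.figure(figsize=(16, 8))
for i, (name, sig) in enumerate([("Noisy", mixwav_sc), ("Enhanced", enhwav_sc)]):
    t = np.arange(len(sig)) / sr
    plt.subplot(2, 2, 2 * i + 1)
    plt.title(f"{name} waveform")
    plt.plot(t, sig)
    plt.xlabel("Time (s)")
    plt.subplot(2, 2, 2 * i + 2)
    plt.title(f"{name} spectrogram")
    plt.specgram(sig, Fs=sr, NFFT=512, noverlap=384)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
plt.tight_layout()
plt.show()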
Multi-Channel Enhancement¶
Download and load the pretrained MVDR neural beamformer¶
Task 2 (✅ Checkpoint 2 (1 point))¶
Run inference of the pretrained multi-channel enhancement model.
[ ]:
# Download the pretrained enhancement model
!gdown --id 1FohDfBlOa7ipc9v2luY-QIFQ_GJ1iW_i -O /content/mvdr_beamformer_16k_se_raw_valid.zip
!unzip /content/mvdr_beamformer_16k_se_raw_valid.zip -d /content/enh_model_mc
[ ]:
# Load the model
# If you encounter the error "No module named 'espnet2'", please re-run the first cell. This might be a Colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_mc = SeparateSpeech(
train_config="/content/enh_model_mc/exp/enh_train_enh_beamformer_mvdr_raw/config.yaml",
model_file="/content/enh_model_mc/exp/enh_train_enh_beamformer_mvdr_raw/11epoch.pth",
# for segment-wise process on long speech
normalize_segment_scale=False,
show_progressbar=True,
ref_channel=4,
normalize_output_wav=True,
device="cuda:0",
)
Enhance the multi-channel real noisy speech in CHiME4¶
Please submit a screenshot of the output of the current block, as well as the spectrogram and waveform of the noisy and enhanced speech files, to Gradescope for Task 2.
[ ]:
wave = enh_model_mc(mixwav_mc[None, ...], sr)
print("Input real noisy speech", flush=True)
display(Audio(mixwav_mc.T, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))
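If you want to keep the enhanced audio as a file (for example, to generate the figures for your submission), you can write it out with soundfile; the output path below is arbitrary, not prescribed by the assignment.
[ ]:
# Save the enhanced multi-channel output to a wav file (filename is arbitrary).
import soundfile
enhanced_mc = wave[0].squeeze()
soundfile.write("/content/M05_440C0213_PED_REAL_enh_mc.wav", enhanced_mc, sr)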
Portable speech enhancement scripts for other tasks¶
For an ESPnet ASR or TTS dataset like the one below:
data
`-- et05_real_isolated_6ch_track
|-- spk2utt
|-- text
|-- utt2spk
|-- utt2uniq
`-- wav.scp
Run the following script to create an enhanced dataset:
scripts/utils/enhance_dataset.sh \
--spk_num 1 \
--gpu_inference true \
--inference_nj 4 \
--fs 16k \
--id_prefix "" \
dump/raw/et05_real_isolated_6ch_track \
data/et05_real_isolated_6ch_track_enh \
exp/enh_train_enh_beamformer_mvdr_raw/valid.loss.best.pth
The above script will generate a new directory data/et05_real_isolated_6ch_track_enh:
data
`-- et05_real_isolated_6ch_track_enh
|-- spk2utt
|-- text
|-- utt2spk
|-- utt2uniq
|-- wav.scp
`-- wavs/
where wav.scp contains paths to the enhanced audios (stored in wavs/).
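Each line of wav.scp is a Kaldi-style entry mapping an utterance ID to an audio path. The sketch below (not part of the original recipe; the directory path is just the one generated above) shows how such a file can be parsed and one enhanced utterance loaded for inspection:
[ ]:
# Parse a Kaldi-style wav.scp ("<utt_id> <path>") and load one enhanced audio.
import soundfile

wav_scp = {}
with open("data/et05_real_isolated_6ch_track_enh/wav.scp") as f:
    for line in f:
        utt_id, path = line.strip().split(maxsplit=1)
        wav_scp[utt_id] = path

first_utt = next(iter(wav_scp))
audio, rate = soundfile.read(wav_scp[first_utt])
print(first_utt, audio.shape, rate)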
Speech Separation¶
Model Selection¶
In this demonstration, we will show different speech separation models on wsj0_2mix.
The pretrained models can be downloaded from a direct URL, or from Zenodo and Hugging Face with the corresponding model ID; in this demo we use a TF-GridNet checkpoint hosted on Google Drive.
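As an alternative to the Google Drive zip below, a model published with a model ID can typically be fetched through espnet_model_zoo's ModelDownloader. This is a sketch only; the model tag is a placeholder you would replace with an actual ID from the espnet_model_zoo table or Hugging Face.
[ ]:
# Sketch: load a separation model by model ID instead of a Google Drive zip.
# "<model_id_or_url>" is a placeholder, not a real tag.
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech

d = ModelDownloader()
# download_and_unpack returns a dict of config/checkpoint paths that can be
# passed directly as keyword arguments.
separate_speech = SeparateSpeech(
    **d.download_and_unpack("<model_id_or_url>"),
    device="cuda:0",
)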
[ ]:
!gdown --id 1TasZxZSnbSPsk_Wf7ZDhBAigS6zN8G9G -O enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw_valid.loss.ave.zip
!unzip enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw_valid.loss.ave.zip -d /content/enh_model_ss
[ ]:
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
# For models downloaded from GoogleDrive, you can use the following script:
separate_speech = SeparateSpeech(
train_config="/content/enh_model_ss/exp/enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw/config.yaml",
model_file="/content/enh_model_ss/exp/enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw/98epoch.pth",
# for segment-wise process on long speech
segment_size=2.4,
hop_size=0.8,
normalize_segment_scale=False,
show_progressbar=True,
ref_channel=None,
normalize_output_wav=True,
device="cuda:0",
)
Separate Speech Mixture¶
Separate the example in wsj0_2mix testing set¶
Task 3 (✅ Checkpoint 3 (1 point))¶
Run inference of the pretrained speech separation model based on TF-GridNet.
Please submit a screenshot of the output of the current block, as well as the spectrograms and waveforms of the mixed and separated speech files, to Gradescope for Task 3.
[ ]:
!gdown --id 1ZCUkd_Lb7pO2rpPr4FqYdtJBZ7JMiInx -O /content/447c020t_1.2106_422a0112_-1.2106.wav
import os
import soundfile
from IPython.display import display, Audio
mixwav, sr = soundfile.read("447c020t_1.2106_422a0112_-1.2106.wav")
waves_wsj = separate_speech(mixwav[None, ...], fs=sr)
print("Input mixture", flush=True)
display(Audio(mixwav, rate=sr))
print(f"========= Separated speech with model =========", flush=True)
print("Separated spk1", flush=True)
display(Audio(waves_wsj[0].squeeze(), rate=sr))
print("Separated spk2", flush=True)
display(Audio(waves_wsj[1].squeeze(), rate=sr))
Show spectrograms of separated speech¶
Show the waveform and spectrogram of the mixed and separated speech.
[ ]:
import matplotlib.pyplot as plt
import torch
from torch_complex.tensor import ComplexTensor
from espnet.asr.asr_utils import plot_spectrogram
from espnet2.layers.stft import Stft
stft = Stft(
n_fft=512,
win_length=None,
hop_length=128,
window="hann",
)
ilens = torch.LongTensor([len(mixwav)])
# specs: (T, F)
spec_mix = ComplexTensor(
*torch.unbind(
stft(torch.as_tensor(mixwav).unsqueeze(0), ilens)[0].squeeze(),
dim=-1
)
)
spec_sep1 = ComplexTensor(
*torch.unbind(
stft(torch.as_tensor(waves_wsj[0]), ilens)[0].squeeze(),
dim=-1
)
)
spec_sep2 = ComplexTensor(
*torch.unbind(
stft(torch.as_tensor(waves_wsj[1]), ilens)[0].squeeze(),
dim=-1
)
)
samples = torch.linspace(0, len(mixwav) / sr, len(mixwav))
plt.figure(figsize=(24, 12))
plt.subplot(3, 2, 1)
plt.title('Mixture Spectrogram')
plot_spectrogram(
plt, abs(spec_mix).transpose(-1, -2).numpy(), fs=sr,
mode='db', frame_shift=None,
bottom=False, labelbottom=False
)
plt.subplot(3, 2, 2)
plt.title('Mixture Waveform')
plt.plot(samples, mixwav)
plt.xlim(0, len(mixwav) / sr)
plt.subplot(3, 2, 3)
plt.title('Separated Spectrogram (spk1)')
plot_spectrogram(
plt, abs(spec_sep1).transpose(-1, -2).numpy(), fs=sr,
mode='db', frame_shift=None,
bottom=False, labelbottom=False
)
plt.subplot(3, 2, 4)
plt.title('Separated Waveform (spk1)')
plt.plot(samples, waves_wsj[0].squeeze())
plt.xlim(0, len(mixwav) / sr)
plt.subplot(3, 2, 5)
plt.title('Separated Spectrogram (spk2)')
plot_spectrogram(
plt, abs(spec_sep2).transpose(-1, -2).numpy(), fs=sr,
mode='db', frame_shift=None,
bottom=False, labelbottom=False
)
plt.subplot(3, 2, 6)
plt.title('Separated Waveform (spk2)')
plt.plot(samples, waves_wsj[1].squeeze())
plt.xlim(0, len(mixwav) / sr)
plt.xlabel("Time (s)")
plt.show()
Evaluate separated speech with pretrained ASR model¶
The ground truths are:
text_1: SOME CRITICS INCLUDING HIGH REAGAN ADMINISTRATION OFFICIALS ARE RAISING THE ALARM THAT THE FED'S POLICY IS TOO TIGHT AND COULD CAUSE A RECESSION NEXT YEAR
text_2: THE UNITED STATES UNDERTOOK TO DEFEND WESTERN EUROPE AGAINST SOVIET ATTACK
(This may take a while for the speech recognition.)
[ ]:
%pip install -q https://github.com/kpu/kenlm/archive/master.zip # ASR needs kenlm
Task 4 (✅ Checkpoint 4 (1 point))¶
Show inference of the pretrained ASR model on the mixed and separated speech.
[ ]:
!gdown --id 1H7--jXTTwmwxzfO8LT5kjZyBjng-HxED -O asr_train_asr_transformer_raw_char_1gpu_valid.acc.ave.zip
!unzip asr_train_asr_transformer_raw_char_1gpu_valid.acc.ave.zip -d /content/asr_model
!ln -sf /content/asr_model/exp .
Please submit a screenshot of the ASR inference output on the mixed speech and on separated speech 1 and 2 to Gradescope for Task 4.
[ ]:
import espnet_model_zoo
from espnet2.bin.asr_inference import Speech2Text
# For models downloaded from GoogleDrive, you can use the following script:
speech2text = Speech2Text(
asr_train_config="/content/asr_model/exp/asr_train_asr_transformer_raw_char_1gpu/config.yaml",
asr_model_file="/content/asr_model/exp/asr_train_asr_transformer_raw_char_1gpu/valid.acc.ave_10best.pth",
device="cuda:0"
)
text_est = [None, None]
text_est[0], *_ = speech2text(waves_wsj[0].squeeze())[0]
text_est[1], *_ = speech2text(waves_wsj[1].squeeze())[0]
text_m, *_ = speech2text(mixwav)[0]
print("Mix Speech to Text: ", text_m)
print("Separated Speech 1 to Text: ", text_est[0])
print("Separated Speech 2 to Text: ", text_est[1])
[ ]:
import difflib
from itertools import permutations
import editdistance
import numpy as np
colors = dict(
red=lambda text: f"\033[38;2;255;0;0m{text}\033[0m" if text else "",
green=lambda text: f"\033[38;2;0;255;0m{text}\033[0m" if text else "",
yellow=lambda text: f"\033[38;2;225;225;0m{text}\033[0m" if text else "",
white=lambda text: f"\033[38;2;255;255;255m{text}\033[0m" if text else "",
black=lambda text: f"\033[38;2;0;0;0m{text}\033[0m" if text else "",
)
def diff_strings(ref, est):
"""Reference: https://stackoverflow.com/a/64404008/7384873"""
ref_str, est_str, err_str = [], [], []
matcher = difflib.SequenceMatcher(None, ref, est)
for opcode, a0, a1, b0, b1 in matcher.get_opcodes():
if opcode == "equal":
txt = ref[a0:a1]
ref_str.append(txt)
est_str.append(txt)
err_str.append(" " * (a1 - a0))
elif opcode == "insert":
ref_str.append("*" * (b1 - b0))
est_str.append(colors["green"](est[b0:b1]))
err_str.append(colors["black"]("I" * (b1 - b0)))
elif opcode == "delete":
ref_str.append(ref[a0:a1])
est_str.append(colors["red"]("*" * (a1 - a0)))
err_str.append(colors["black"]("D" * (a1 - a0)))
elif opcode == "replace":
diff = a1 - a0 - b1 + b0
if diff >= 0:
txt_ref = ref[a0:a1]
txt_est = colors["yellow"](est[b0:b1]) + colors["red"]("*" * diff)
txt_err = "S" * (b1 - b0) + "D" * diff
elif diff < 0:
txt_ref = ref[a0:a1] + "*" * -diff
txt_est = colors["yellow"](est[b0:b1]) + colors["green"]("*" * -diff)
txt_err = "S" * (b1 - b0) + "I" * -diff
ref_str.append(txt_ref)
est_str.append(txt_est)
err_str.append(colors["black"](txt_err))
return "".join(ref_str), "".join(est_str), "".join(err_str)
text_ref = [
"SOME CRITICS INCLUDING HIGH REAGAN ADMINISTRATION OFFICIALS ARE RAISING THE ALARM THAT THE FED'S POLICY IS TOO TIGHT AND COULD CAUSE A RECESSION NEXT YEAR",
"THE UNITED STATES UNDERTOOK TO DEFEND WESTERN EUROPE AGAINST SOVIET ATTACK",
]
print("=====================" , flush=True)
perms = list(permutations(range(2)))
string_edit = [
[
editdistance.eval(text_ref[m], text_est[n])
for m, n in enumerate(p)
]
for p in perms
]
dist = [sum(edist) for edist in string_edit]
perm_idx = np.argmin(dist)
perm = perms[perm_idx]
for i, p in enumerate(perm):
print("\n--------------- Text %d ---------------" % (i + 1), flush=True)
ref, est, err = diff_strings(text_ref[i], text_est[p])
print("REF: " + ref + "\n" + "HYP: " + est + "\n" + "ERR: " + err, flush=True)
print("Edit Distance = {}\n".format(string_edit[perm_idx][i]), flush=True)
Task 5 (✅ Checkpoint 5 (1 point))¶
Enhance your own recordings. Your input speech can be recorded by yourself, or you can take it from other sources (e.g., YouTube).
Discuss whether the input speech was clearly denoised, and if not, what a potential reason might be.
[YOUR ANSWER HERE]
Please submit the spectrogram and waveform of your input and enhanced speech to Gradescope for Task 5, along with a screenshot of your answer.
[ ]:
from google.colab import files
from IPython.display import display, Audio
import soundfile
fs = 16000
uploaded = files.upload()
for file_name in uploaded.keys():
speech, rate = soundfile.read(file_name)
assert rate == fs, "mismatch in sampling rate"
wave = enh_model_sc(speech[None, ...], fs)
print(f"Your input speech {file_name}", flush=True)
display(Audio(speech, rate=fs))
print(f"Enhanced speech for {file_name}", flush=True)
display(Audio(wave[0].squeeze(), rate=fs))
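The cell above asserts that the upload is sampled at 16 kHz, which is what this model expects. If your recording uses a different rate (or is stereo), one option is to resample and downmix it first. The sketch below is a variant of the loop above using librosa, which Colab typically ships with (otherwise `%pip install librosa`).
[ ]:
# Variant of the loop above that downmixes stereo input and resamples to 16 kHz
# before enhancement. librosa.resample is used for resampling.
import librosa
import soundfile
from IPython.display import display, Audio

fs = 16000
for file_name in uploaded.keys():
    speech, rate = soundfile.read(file_name)
    if speech.ndim > 1:
        speech = speech.mean(axis=1)   # downmix multi-channel input to mono
    if rate != fs:
        speech = librosa.resample(speech, orig_sr=rate, target_sr=fs)
    wave = enh_model_sc(speech[None, ...], fs)
    print(f"Your input speech {file_name}", flush=True)
    display(Audio(speech, rate=fs))
    print(f"Enhanced speech for {file_name}", flush=True)
    display(Audio(wave[0].squeeze(), rate=fs))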