espnet.transform package

Initialize main package.

espnet.transform.spec_augment

SpecAugment module for preprocessing, i.e., data augmentation.

class espnet.transform.spec_augment.FreqMask(**kwargs)[source]

Bases: espnet.transform.functional.FuncTrans

Frequency mask for SpecAugment.

Parameters:
  • x (numpy.ndarray) – (time, freq)

  • F (int) – maximum width of each frequency mask

  • n_mask (int) – the number of masks

  • inplace (bool) – overwrite x in place

  • replace_with_zero (bool) – fill the masked region with zeros if True; otherwise fill it with the mean

class espnet.transform.spec_augment.SpecAugment(**kwargs)[source]

Bases: espnet.transform.functional.FuncTrans

SpecAugment.

Apply random time warping and time/freq masking. The default setting is based on LD (Librispeech double) in Table 2.

Parameters:
  • x (numpy.ndarray) – (time, freq)

  • resize_mode (str) – “PIL” (fast, nondifferentiable) or “sparse_image_warp” (slow, differentiable)

  • max_time_warp (int) – maximum frames to warp the center frame in spectrogram (W)

  • max_freq_width (int) – maximum width of each random freq mask (F)

  • n_freq_mask (int) – the number of random freq masks (m_F)

  • max_time_width (int) – maximum width of each random time mask (T)

  • n_time_mask (int) – the number of random time masks (m_T)

  • inplace (bool) – overwrite intermediate array

  • replace_with_zero (bool) – pad zero on mask if true else use mean

class espnet.transform.spec_augment.TimeMask(**kwargs)[source]

Bases: espnet.transform.functional.FuncTrans

Time mask for SpecAugment.

Parameters:
  • spec (numpy.ndarray) – (time, freq)

  • T (int) – maximum width of each time mask

  • n_mask (int) – the number of masks

  • inplace (bool) – overwrite spec in place

  • replace_with_zero (bool) – fill the masked region with zeros if True; otherwise fill it with the mean

class espnet.transform.spec_augment.TimeWarp(**kwargs)[source]

Bases: espnet.transform.functional.FuncTrans

Time warp for SpecAugment.

Move a random center frame by a random width ~ uniform(-window, window).

Parameters:
  • x (numpy.ndarray) – spectrogram (time, freq)

  • max_time_warp (int) – maximum time frames to warp

  • inplace (bool) – overwrite x with the result

  • mode (str) – “PIL” (default, fast, not differentiable) or “sparse_image_warp” (slow, differentiable)

Returns:
  numpy.ndarray – time warped spectrogram (time, freq)

espnet.transform.spec_augment.freq_mask(x, F=30, n_mask=2, replace_with_zero=True, inplace=False)[source]

Frequency mask for SpecAugment.

Parameters:
  • x (numpy.ndarray) – (time, freq)

  • F (int) – maximum width of each frequency mask

  • n_mask (int) – the number of masks

  • inplace (bool) – overwrite x in place

  • replace_with_zero (bool) – fill the masked region with zeros if True; otherwise fill it with the mean
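The masking described above can be sketched in plain NumPy. This is an illustration only, not the library's implementation: the helper name freq_mask_sketch and the rng argument are hypothetical, and the way the width and start position are drawn is an assumption.

```python
import numpy as np


def freq_mask_sketch(x, F=30, n_mask=2, replace_with_zero=True, rng=None):
    """Mask n_mask random frequency bands of a (time, freq) array."""
    rng = np.random.default_rng() if rng is None else rng
    out = x.copy()  # mimic inplace=False: the input is left untouched
    n_freq = out.shape[1]
    # Assumption: the fill value is computed once, before any masking.
    fill = 0.0 if replace_with_zero else out.mean()
    for _ in range(n_mask):
        width = int(rng.integers(0, F))  # mask width drawn from [0, F)
        start = int(rng.integers(0, max(1, n_freq - width)))
        out[:, start:start + width] = fill  # same band for every frame
    return out


spec = np.ones((100, 80), dtype=np.float32)
masked = freq_mask_sketch(spec, F=30, n_mask=2, rng=np.random.default_rng(0))
```

Each mask covers the same frequency bins in every time frame, so all rows of masked are identical here.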

espnet.transform.spec_augment.spec_augment(x, resize_mode='PIL', max_time_warp=80, max_freq_width=27, n_freq_mask=2, max_time_width=100, n_time_mask=2, inplace=True, replace_with_zero=True)[source]

SpecAugment.

Apply random time warping and time/freq masking. The default setting is based on LD (Librispeech double) in Table 2.

Parameters:
  • x (numpy.ndarray) – (time, freq)

  • resize_mode (str) – “PIL” (fast, nondifferentiable) or “sparse_image_warp” (slow, differentiable)

  • max_time_warp (int) – maximum frames to warp the center frame in spectrogram (W)

  • max_freq_width (int) – maximum width of each random freq mask (F)

  • n_freq_mask (int) – the number of random freq masks (m_F)

  • max_time_width (int) – maximum width of each random time mask (T)

  • n_time_mask (int) – the number of random time masks (m_T)

  • inplace (bool) – overwrite intermediate array

  • replace_with_zero (bool) – pad zero on mask if true else use mean

espnet.transform.spec_augment.time_mask(spec, T=40, n_mask=2, replace_with_zero=True, inplace=False)[source]

Time mask for SpecAugment.

Parameters:
  • spec (numpy.ndarray) – (time, freq)

  • T (int) – maximum width of each time mask

  • n_mask (int) – the number of masks

  • inplace (bool) – overwrite spec in place

  • replace_with_zero (bool) – fill the masked region with zeros if True; otherwise fill it with the mean
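A minimal NumPy sketch of masking along the time axis, here with replace_with_zero=False so the masked frames are filled with the mean. The helper name time_mask_sketch and the rng argument are hypothetical, and computing the mean once before masking is an assumption rather than documented behaviour.

```python
import numpy as np


def time_mask_sketch(spec, T=40, n_mask=2, replace_with_zero=True, rng=None):
    """Mask n_mask random runs of consecutive frames in a (time, freq) array."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # mimic inplace=False
    n_frames = out.shape[0]
    fill = 0.0 if replace_with_zero else out.mean()
    for _ in range(n_mask):
        width = int(rng.integers(0, T))  # mask width drawn from [0, T)
        start = int(rng.integers(0, max(1, n_frames - width)))
        out[start:start + width, :] = fill  # whole frames are replaced
    return out


spec = np.arange(200 * 4, dtype=np.float32).reshape(200, 4)
masked = time_mask_sketch(spec, T=40, n_mask=2, replace_with_zero=False,
                          rng=np.random.default_rng(0))
# Indices of the frames that were overwritten by the mask.
changed_frames = np.flatnonzero((masked != spec).any(axis=1))
```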

espnet.transform.spec_augment.time_warp(x, max_time_warp=80, inplace=False, mode='PIL')[source]

Time warp for SpecAugment.

Move a random center frame by a random width ~ uniform(-window, window).

Parameters:
  • x (numpy.ndarray) – spectrogram (time, freq)

  • max_time_warp (int) – maximum time frames to warp

  • inplace (bool) – overwrite x with the result

  • mode (str) – “PIL” (default, fast, not differentiable) or “sparse_image_warp” (slow, differentiable)

Returns:
  numpy.ndarray – time warped spectrogram (time, freq)
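The idea of moving a center frame can be sketched as follows: pick a center, pick its new position, then stretch the left part and squeeze the right part (or vice versa) so the total length is unchanged. This sketch assumes window equals max_time_warp and uses linear interpolation with np.interp as a stand-in for the image resize of the “PIL” mode; the names time_warp_sketch and _resize_time are hypothetical.

```python
import numpy as np


def _resize_time(block, new_len):
    """Linearly interpolate a (time, freq) block to new_len frames."""
    old_len = block.shape[0]
    src = np.linspace(0.0, old_len - 1, num=new_len)
    return np.stack([np.interp(src, np.arange(old_len), block[:, f])
                     for f in range(block.shape[1])], axis=1)


def time_warp_sketch(x, max_time_warp=80, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    window = max_time_warp
    n_frames = x.shape[0]
    if window <= 0 or n_frames - window <= window:
        return x  # nothing to do, or too short to warp
    center = int(rng.integers(window, n_frames - window))
    # New position of the center frame, within +/- window of the old one.
    warped = int(rng.integers(center - window, center + window)) + 1
    left = _resize_time(x[:center], warped)
    right = _resize_time(x[center:], n_frames - warped)
    return np.concatenate([left, right], axis=0)


x = np.random.default_rng(0).standard_normal((300, 8))
y = time_warp_sketch(x, max_time_warp=80, rng=np.random.default_rng(1))
```

The output keeps the (time, freq) shape of the input; only the time axis is locally stretched or compressed.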

espnet.transform.transformation

Transformation module.

class espnet.transform.transformation.Transformation(conffile=None)[source]

Bases: object

Apply some functions to the mini-batch

Examples

>>> import numpy as np
>>> from espnet.transform.transformation import Transformation
>>> kwargs = {"process": [{"type": "fbank",
...                        "n_mels": 80,
...                        "fs": 16000},
...                       {"type": "cmvn",
...                        "stats": "data/train/cmvn.ark",
...                        "norm_vars": True},
...                       {"type": "delta", "window": 2, "order": 2}]}
>>> transform = Transformation(kwargs)
>>> bs = 10
>>> xs = [np.random.randn(100, 80).astype(np.float32)
...       for _ in range(bs)]
>>> xs = transform(xs)
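The configuration in the example is a list of steps, each with a "type" and keyword options, applied in order to every array of the mini-batch. A minimal sketch of that pattern is shown below. It is not the library's code: PipelineSketch, REGISTRY, and the "scale" and "offset" step types are hypothetical names invented for illustration.

```python
import numpy as np

# Hypothetical registry: maps a "type" string to a factory of callables.
REGISTRY = {
    "scale": lambda factor=1.0: (lambda x: x * factor),
    "offset": lambda value=0.0: (lambda x: x + value),
}


class PipelineSketch:
    """Build a list of callables from a config and apply them in order."""

    def __init__(self, conf):
        self.steps = []
        for entry in conf["process"]:
            opts = dict(entry)       # copy so the config is not mutated
            kind = opts.pop("type")  # remaining keys become keyword args
            self.steps.append(REGISTRY[kind](**opts))

    def __call__(self, xs):
        for step in self.steps:
            xs = [step(x) for x in xs]  # each step sees every utterance
        return xs


conf = {"process": [{"type": "scale", "factor": 2.0},
                    {"type": "offset", "value": 1.0}]}
transform = PipelineSketch(conf)
xs = [np.ones((100, 80), dtype=np.float32) for _ in range(10)]
ys = transform(xs)
```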

espnet.transform.perturb

class espnet.transform.perturb.BandpassPerturbation(lower=0.0, upper=0.75, seed=None, axes=(-1, ))[source]

Bases: object

Randomly drop out components along the frequency axis.

The original idea comes from the following:

“randomly-selected frequency band was cut off under the constraint of leaving at least 1,000 Hz band within the range of less than 4,000Hz.”

(The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays; http://spandh.dcs.shef.ac.uk/chime_workshop/papers/CHiME_2018_paper_kanda.pdf)
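A rough sketch of frequency dropout on an STFT-like array whose last axis is frequency. It assumes that lower and upper bound the fraction of bins to drop, which this page does not state; the function name random_freq_dropout and the rng argument are hypothetical.

```python
import numpy as np


def random_freq_dropout(x_stft, lower=0.0, upper=0.75, rng=None):
    """Zero a random subset of bins along the last (frequency) axis."""
    rng = np.random.default_rng() if rng is None else rng
    n_freq = x_stft.shape[-1]
    ratio = rng.uniform(lower, upper)  # assumed: fraction of bins to drop
    n_drop = int(round(ratio * n_freq))
    dropped = rng.choice(n_freq, size=n_drop, replace=False)
    keep = np.ones(n_freq, dtype=x_stft.dtype)
    keep[dropped] = 0
    return x_stft * keep  # the mask broadcasts over the leading axes


x = np.ones((50, 257))
y = random_freq_dropout(x, rng=np.random.default_rng(0))
```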

class espnet.transform.perturb.NoiseInjection(utt2noise=None, lower=-20, upper=-5, utt2ratio=None, filetype='list', dbunit=True, seed=None)[source]

Bases: object

Add isotropic noise
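With dbunit=True, the lower and upper bounds read naturally as a noise level in dB relative to the signal. The sketch below assumes exactly that: a level is drawn in dB, converted to an amplitude ratio, and the noise is scaled so its RMS is that ratio times the signal RMS. The function inject_noise and the white Gaussian noise are stand-ins, not the class's actual noise source or scaling rule.

```python
import numpy as np


def inject_noise(x, noise, ratio_db):
    """Add noise whose RMS is ratio_db (in dB) relative to the RMS of x."""
    ratio = 10.0 ** (ratio_db / 20.0)  # dB -> amplitude ratio
    signal_rms = np.sqrt(np.mean(x ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    return x + noise * (ratio * signal_rms / noise_rms)


rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)  # stand-in for a recorded noise file
ratio_db = rng.uniform(-20, -5)     # the default lower and upper bounds
y = inject_noise(x, noise, ratio_db)

# Check: the level of the added component matches the requested level.
added_rms = np.sqrt(np.mean((y - x) ** 2))
measured_db = 20 * np.log10(added_rms / np.sqrt(np.mean(x ** 2)))
```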

class espnet.transform.perturb.RIRConvolve(utt2rir, filetype='list')[source]

Bases: object

class espnet.transform.perturb.SpeedPerturbation(lower=0.9, upper=1.1, utt2ratio=None, keep_length=True, res_type='kaiser_best', seed=None)[source]

Bases: object

Speed perturbation in Kaldi uses sox-speed instead of sox-tempo, and sox-speed simply resamples the input, i.e., both pitch and tempo are changed.

“Why use speed option instead of tempo -s in SoX for speed perturbation” https://groups.google.com/forum/#!topic/kaldi-help/8OOG7eE4sZ8

Warning

This function is very slow because of resampling. I recommend applying speed perturbation outside of training using sox.
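Why resampling changes both pitch and tempo can be seen in a small NumPy sketch. Linear interpolation with np.interp stands in for a proper resampler here, and the tail padding used for keep_length is an assumption; speed_perturb_sketch is a hypothetical name.

```python
import numpy as np


def speed_perturb_sketch(x, ratio, keep_length=True):
    """Play x back `ratio` times faster by resampling it."""
    n_out = int(round(len(x) / ratio))
    src = np.arange(n_out) * ratio  # read positions in the input signal
    y = np.interp(src, np.arange(len(x)), x)
    if keep_length:
        if len(y) < len(x):
            y = np.pad(y, (0, len(x) - len(y)))  # zero-pad the tail
        else:
            y = y[:len(x)]
    return y


fs = 16000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)  # 1 s tone at 100 Hz
y = speed_perturb_sketch(x, ratio=1.1, keep_length=False)
# Dominant frequency of the perturbed signal, still interpreted at fs.
peak_hz = np.argmax(np.abs(np.fft.rfft(y))) * fs / len(y)
```

With ratio=1.1 the signal becomes shorter and its spectral peak moves from 100 Hz to roughly 110 Hz, so tempo and pitch change together.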

class espnet.transform.perturb.VolumePerturbation(lower=-1.6, upper=1.6, utt2ratio=None, dbunit=True, seed=None)[source]

Bases: object

espnet.transform.add_deltas

class espnet.transform.add_deltas.AddDeltas(window=2, order=2)[source]

Bases: object

espnet.transform.add_deltas.add_deltas(x, window=2, order=2)[source]
espnet.transform.add_deltas.delta(feat, window)[source]
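This page gives only the signatures, so the following is a sketch of a common regression-based definition of delta features, not a statement about what this module computes. Edge frames are replicated for padding, which is an assumption; delta_sketch and add_deltas_sketch are hypothetical names.

```python
import numpy as np


def delta_sketch(feat, window=2):
    """Regression-based delta of a (time, dim) array with edge padding."""
    n = feat.shape[0]
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    acc = np.zeros(feat.shape, dtype=np.float64)
    for i in range(1, window + 1):
        ahead = padded[window + i: window + i + n]   # frames t + i
        behind = padded[window - i: window - i + n]  # frames t - i
        acc += i * (ahead - behind)
    return acc / (2 * sum(i * i for i in range(1, window + 1)))


def add_deltas_sketch(x, window=2, order=2):
    """Stack x with its deltas up to the given order along the last axis."""
    feats = [x]
    for _ in range(order):
        feats.append(delta_sketch(feats[-1], window))
    return np.concatenate(feats, axis=1)


ramp = np.arange(10, dtype=np.float64).reshape(10, 1)
d = delta_sketch(ramp, window=2)
stacked = add_deltas_sketch(ramp, window=2, order=2)
```

For a linear ramp, the delta is the slope (1.0) on all frames that are at least window frames away from either edge.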

espnet.transform.wpe

class espnet.transform.wpe.WPE(taps=10, delay=3, iterations=3, psd_context=0, statistics_mode='full')[source]

Bases: object

espnet.transform.channel_selector

class espnet.transform.channel_selector.ChannelSelector(train_channel='random', eval_channel=0, axis=1)[source]

Bases: object

Select 1ch from multi-channel signal
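A minimal sketch of selecting one channel, assuming the input is laid out as (time, channel) so that axis=1 indexes channels, and assuming train_channel applies during training and eval_channel otherwise. The function select_channel and its train and rng arguments are hypothetical.

```python
import numpy as np


def select_channel(x, train=True, train_channel="random", eval_channel=0,
                   axis=1, rng=None):
    """Return a single channel of a multi-channel array."""
    rng = np.random.default_rng() if rng is None else rng
    channel = train_channel if train else eval_channel
    if channel == "random":
        channel = int(rng.integers(0, x.shape[axis]))  # pick one at random
    return np.take(x, channel, axis=axis)


# Four channels; channel c holds the constant value c.
x = np.stack([np.full(1000, c, dtype=np.float32) for c in range(4)], axis=1)
picked_train = select_channel(x, train=True, rng=np.random.default_rng(0))
picked_eval = select_channel(x, train=False)
```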

espnet.transform.functional

class espnet.transform.functional.FuncTrans(**kwargs)[source]

Bases: espnet.transform.transform_interface.TransformInterface

Functional Transformation

Warning

Builtin or C/C++ functions may not work properly because this class heavily depends on the inspect module.

Usage:

>>> from espnet.transform.functional import FuncTrans
>>> def foo_bar(x, a=1, b=2):
...     '''Foo bar
...     :param x: input
...     :param int a: default 1
...     :param int b: default 2
...     '''
...     return x + a - b
>>> class FooBar(FuncTrans):
...     _func = foo_bar
...     __doc__ = foo_bar.__doc__
classmethod add_arguments(parser)[source]
classmethod default_params()[source]
property func
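The dependence on the inspect module can be illustrated with a small stand-in. This is a sketch of the general pattern (reading keyword defaults from a function signature and binding overrides), not the class's actual code; FuncTransSketch and default_params_sketch are hypothetical names.

```python
import inspect


def foo_bar(x, a=1, b=2):
    return x + a - b


def default_params_sketch(func):
    """Collect the keyword defaults of func via inspect.signature."""
    sig = inspect.signature(func)
    return {name: p.default for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}


class FuncTransSketch:
    """Hypothetical stand-in: binds keyword arguments to a plain function."""

    _func = None

    def __init__(self, **kwargs):
        # Start from the function's defaults, then apply the overrides.
        self.kwargs = dict(default_params_sketch(type(self)._func), **kwargs)

    def __call__(self, x):
        # Look _func up on the class so it is not bound as a method.
        return type(self)._func(x, **self.kwargs)


class FooBar(FuncTransSketch):
    _func = foo_bar


defaults = default_params_sketch(foo_bar)  # {'a': 1, 'b': 2}
result = FooBar(a=10)(1)                   # 1 + 10 - 2 = 9
```

Functions without an introspectable Python signature give inspect nothing to read, which is one way the warning above can come into play.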

espnet.transform.transform_interface

class espnet.transform.transform_interface.Identity[source]

Bases: espnet.transform.transform_interface.TransformInterface

Identity Function

class espnet.transform.transform_interface.TransformInterface[source]

Bases: object

Transform Interface

classmethod add_arguments(parser)[source]

espnet.transform.__init__

Initialize main package.

espnet.transform.cmvn

class espnet.transform.cmvn.CMVN(stats, norm_means=True, norm_vars=False, filetype='mat', utt2spk=None, spk2utt=None, reverse=False, std_floor=1e-20)[source]

Bases: object

class espnet.transform.cmvn.UtteranceCMVN(norm_means=True, norm_vars=False, std_floor=1e-20)[source]

Bases: object
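The signature of UtteranceCMVN suggests per-utterance cepstral mean and variance normalization. The sketch below shows that general technique with the same option names; it is an assumption about the intended behaviour, and utterance_cmvn_sketch is a hypothetical name.

```python
import numpy as np


def utterance_cmvn_sketch(x, norm_means=True, norm_vars=False,
                          std_floor=1e-20):
    """Normalize each feature dimension of one (time, dim) utterance."""
    out = x.astype(np.float64)
    if norm_means:
        out = out - out.mean(axis=0)  # zero mean per dimension
    if norm_vars:
        # Floor the standard deviation to avoid division by zero.
        std = np.maximum(out.std(axis=0), std_floor)
        out = out / std
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((100, 80)) * 3.0 + 5.0
y = utterance_cmvn_sketch(x, norm_means=True, norm_vars=True)
```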

espnet.transform.spectrogram

class espnet.transform.spectrogram.IStft(n_shift, win_length=None, window='hann', center=True)[source]

Bases: object

class espnet.transform.spectrogram.LogMelSpectrogram(fs, n_mels, n_fft, n_shift, win_length=None, window='hann', fmin=None, fmax=None, eps=1e-10)[source]

Bases: object

class espnet.transform.spectrogram.Spectrogram(n_fft, n_shift, win_length=None, window='hann')[source]

Bases: object

class espnet.transform.spectrogram.Stft(n_fft, n_shift, win_length=None, window='hann', center=True, pad_mode='reflect')[source]

Bases: object

class espnet.transform.spectrogram.Stft2LogMelSpectrogram(fs, n_mels, n_fft, fmin=None, fmax=None, eps=1e-10)[source]

Bases: object

espnet.transform.spectrogram.istft(x, n_shift, win_length=None, window='hann', center=True)[source]
espnet.transform.spectrogram.logmelspectrogram(x, fs, n_mels, n_fft, n_shift, win_length=None, window='hann', fmin=None, fmax=None, eps=1e-10, pad_mode='reflect')[source]
espnet.transform.spectrogram.spectrogram(x, n_fft, n_shift, win_length=None, window='hann')[source]
espnet.transform.spectrogram.stft(x, n_fft, n_shift, win_length=None, window='hann', center=True, pad_mode='reflect')[source]
espnet.transform.spectrogram.stft2logmelspectrogram(x_stft, fs, n_mels, n_fft, fmin=None, fmax=None, eps=1e-10)[source]