espnet2.enh package

espnet2.enh.espnet_model

Enhancement model module.

class espnet2.enh.espnet_model.ESPnetEnhancementModel(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, separator: espnet2.enh.separator.abs_separator.AbsSeparator, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, mask_module: Optional[espnet2.diar.layers.abs_mask.AbsMask], loss_wrappers: List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper], stft_consistency: bool = False, loss_type: str = 'mask_mse', mask_type: Optional[str] = None, extract_feats_in_collect_stats: bool = False)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speech enhancement or separation Frontend model

collect_feats(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
forward(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech_mix – (Batch, samples) or (Batch, samples, channels)

  • speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)

  • speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py.

  • kwargs – “utt_id” is among the input.

forward_enhance(speech_mix: torch.Tensor, speech_lengths: torch.Tensor, additional: Optional[Dict] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]
forward_loss(speech_pre: torch.Tensor, speech_lengths: torch.Tensor, feature_mix: torch.Tensor, feature_pre: torch.Tensor, others: OrderedDict, speech_ref: torch.Tensor, noise_ref: torch.Tensor = None, dereverb_speech_ref: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]
static sort_by_perm(nn_output, perm)[source]

Sort the input list of tensors by the specified permutation.

Parameters:
  • nn_output – List[torch.Tensor(Batch, …)], len(nn_output) == num_spk

  • perm – (Batch, num_spk) or List[torch.Tensor(num_spk)]

Returns:

List[torch.Tensor(Batch, …)]

Return type:

nn_output_new
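A toy illustration of the permutation sorting described above, written with plain torch (a sketch, not the espnet2 implementation):

    import torch

    batch, num_spk = 2, 2
    nn_output = [torch.randn(batch, 8) for _ in range(num_spk)]   # len == num_spk
    perm = torch.tensor([[0, 1], [1, 0]])                         # (Batch, num_spk)

    # Reorder the speaker axis of each utterance according to its permutation.
    stacked = torch.stack(nn_output, dim=1)                       # (Batch, num_spk, ...)
    sorted_stacked = torch.stack([stacked[b, perm[b]] for b in range(batch)], dim=0)
    nn_output_new = list(sorted_stacked.unbind(dim=1))            # List[(Batch, ...)]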

espnet2.enh.espnet_model_tse

Enhancement model module.

class espnet2.enh.espnet_model_tse.ESPnetExtractionModel(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, extractor: espnet2.enh.extractor.abs_extractor.AbsExtractor, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, loss_wrappers: List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper], num_spk: int = 1, share_encoder: bool = True, extract_feats_in_collect_stats: bool = False)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Target Speaker Extraction Frontend model

collect_feats(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
forward(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech_mix – (Batch, samples) or (Batch, samples, channels)

  • speech_ref1 – (Batch, samples) or (Batch, samples, channels)

  • speech_ref2 – (Batch, samples) or (Batch, samples, channels)

  • ..

  • speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py.

  • enroll_ref1 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 1

  • enroll_ref2 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 2

  • ..

  • kwargs – “utt_id” is among the input.

forward_enhance(speech_mix: torch.Tensor, speech_lengths: torch.Tensor, enroll_ref: torch.Tensor, enroll_ref_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]
forward_loss(speech_pre: torch.Tensor, speech_lengths: torch.Tensor, feature_mix: torch.Tensor, feature_pre: torch.Tensor, others: OrderedDict, speech_ref: torch.Tensor) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

espnet2.enh.abs_enh

class espnet2.enh.abs_enh.AbsEnhancement[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

espnet2.enh.espnet_enh_s2t_model

class espnet2.enh.espnet_enh_s2t_model.ESPnetEnhS2TModel(enh_model: espnet2.enh.espnet_model.ESPnetEnhancementModel, s2t_model: Union[espnet2.asr.espnet_model.ESPnetASRModel, espnet2.st.espnet_model.ESPnetSTModel, espnet2.diar.espnet_model.ESPnetDiarizationModel], calc_enh_loss: bool = True, bypass_enh_prob: float = 0)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Joint model of Enhancement and Speech to Text.

asr_pit_loss(speech, speech_lengths, text, text_lengths)[source]
batchify_nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)

Compute negative log likelihood (nll) from the transformer decoder.

To avoid OOM, this function separates the input into batches, then calls nll for each batch and combines the results.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

  • batch_size – int, number of samples in each batch when computing nll; you may change this to avoid OOM or to increase GPU memory usage
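A minimal sketch of this batching strategy, using a hypothetical nll_fn stand-in for the model's nll method (illustration only, not the espnet2 implementation):

    import torch

    def batchify_nll_sketch(nll_fn, encoder_out, encoder_out_lens,
                            ys_pad, ys_pad_lens, batch_size: int = 100):
        """Split the utterances into chunks, score each chunk, and concatenate."""
        results = []
        for start in range(0, encoder_out.size(0), batch_size):
            end = start + batch_size
            results.append(
                nll_fn(encoder_out[start:end], encoder_out_lens[start:end],
                       ys_pad[start:end], ys_pad_lens[start:end])
            )
        return torch.cat(results, dim=0)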

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

encode_diar(speech: torch.Tensor, speech_lengths: torch.Tensor, num_spk: int) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by diar_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • num_spk – int

forward(speech: torch.Tensor, speech_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py.

  • For the Enh+ASR task – text_spk1: (Batch, Length), text_spk2: (Batch, Length), …, text_spk1_lengths: (Batch,), text_spk2_lengths: (Batch,), …

  • For other tasks – text: (Batch, Length), default None just to keep the argument order; text_lengths: (Batch,), default None for the same reason as speech_lengths

inherite_attributes(inherite_enh_attrs: List[str] = [], inherite_s2t_attrs: List[str] = [])[source]
nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]

Compute negative log likelihood (nll) from the transformer decoder.

Normally, this function is called in batchify_nll.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

permutation_invariant_training(losses: torch.Tensor)[source]

Compute PIT loss.

Parameters:

losses (torch.Tensor) – (batch, nref, nhyp)

Returns:

list: (batch, n_spk)
loss: torch.Tensor: (batch)

Return type:

perm
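A brute-force illustration of PIT over a toy (batch, nref, nhyp) loss matrix, enumerating all speaker permutations (a sketch under these assumptions, not the espnet2 implementation):

    import itertools
    import torch

    batch, n_spk = 3, 2
    losses = torch.rand(batch, n_spk, n_spk)              # (batch, nref, nhyp)

    perms = list(itertools.permutations(range(n_spk)))    # all speaker orders
    # Total loss of each permutation: sum over reference index i of losses[:, i, p[i]].
    perm_losses = torch.stack(
        [sum(losses[:, i, p[i]] for i in range(n_spk)) for p in perms], dim=1
    )                                                     # (batch, n_perms)
    min_loss, min_idx = perm_losses.min(dim=1)            # best permutation per utterance
    perm = [perms[i] for i in min_idx.tolist()]           # list of length batch
    loss = (min_loss / n_spk).mean()                      # averaged over speakers and batch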

espnet2.enh.__init__

espnet2.enh.layers.complexnn

class espnet2.enh.layers.complexnn.ComplexBatchNorm(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, complex_axis=1)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()[source]
reset_running_stats()[source]
class espnet2.enh.layers.complexnn.ComplexConv2d(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), dilation=1, groups=1, causal=True, complex_axis=1)[source]

Bases: torch.nn.modules.module.Module

ComplexConv2d.

in_channels: real+imag
out_channels: real+imag
kernel_size: input [B,C,D,T], kernel size in [D,T]
padding: input [B,C,D,T], padding in [D,T]
causal: if causal, only the left side of the time dimension is padded; otherwise both sides are padded

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
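The following sketch only illustrates the data layout assumed by ComplexConv2d above (real and imaginary parts stacked along complex_axis=1, causal padding on the left of the time axis); it is not the espnet2 implementation:

    import torch
    import torch.nn.functional as F

    B, C, D, T = 2, 4, 257, 100                 # C stacks real+imag channels
    x = torch.randn(B, C, D, T)
    real, imag = torch.chunk(x, 2, dim=1)       # split along complex_axis=1

    kernel_t = 3
    x_causal = F.pad(x, (kernel_t - 1, 0))      # causal: pad only the left of T
    x_both = F.pad(x, ((kernel_t - 1) // 2, (kernel_t - 1) // 2))  # non-causal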

class espnet2.enh.layers.complexnn.ComplexConvTranspose2d(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), output_padding=(0, 0), causal=False, complex_axis=1, groups=1)[source]

Bases: torch.nn.modules.module.Module

ComplexConvTranspose2d.

in_channels: real+imag
out_channels: real+imag

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.complexnn.NavieComplexLSTM(input_size, hidden_size, projection_dim=None, bidirectional=False, batch_first=False)[source]

Bases: torch.nn.modules.module.Module

flatten_parameters()[source]
forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.complexnn.complex_cat(inputs, axis)[source]

espnet2.enh.layers.complex_utils

Beamformer module.

espnet2.enh.layers.complex_utils.cat(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]
espnet2.enh.layers.complex_utils.complex_norm(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=-1, keepdim=False) → torch.Tensor[source]
espnet2.enh.layers.complex_utils.einsum(equation, *operands)[source]
espnet2.enh.layers.complex_utils.inverse(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.complex_utils.is_complex(c)[source]
espnet2.enh.layers.complex_utils.is_torch_complex_tensor(c)[source]
espnet2.enh.layers.complex_utils.matmul(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.complex_utils.new_complex_like(ref: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], real_imag: Tuple[torch.Tensor, torch.Tensor])[source]
espnet2.enh.layers.complex_utils.reverse(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=0)[source]
espnet2.enh.layers.complex_utils.solve(b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]

Solve the linear equation ax = b.
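For native torch complex tensors, this reduces to torch.linalg.solve, as in the sketch below (the espnet2 wrapper additionally handles torch_complex ComplexTensor inputs and takes the arguments in (b, a) order):

    import torch

    a = torch.randn(4, 4, dtype=torch.complex64)
    b = torch.randn(4, 2, dtype=torch.complex64)
    x = torch.linalg.solve(a, b)   # x such that a @ x == b (up to numerical error)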

espnet2.enh.layers.complex_utils.stack(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]
espnet2.enh.layers.complex_utils.to_double(c)[source]
espnet2.enh.layers.complex_utils.to_float(c)[source]
espnet2.enh.layers.complex_utils.trace(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]

espnet2.enh.layers.tcndenseunet

class espnet2.enh.layers.tcndenseunet.Conv2DActNorm(in_channels, out_channels, ksz=(3, 3), stride=(1, 2), padding=(1, 0), upsample=False, activation=<class 'torch.nn.modules.activation.ELU'>)[source]

Bases: torch.nn.modules.module.Module

Basic Conv2D + activation + instance norm building block.

forward(inp)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.tcndenseunet.DenseBlock(in_channels, out_channels, num_freqs, pre_blocks=2, freq_proc_blocks=1, post_blocks=2, ksz=(3, 3), activation=<class 'torch.nn.modules.activation.ELU'>, hid_chans=32)[source]

Bases: torch.nn.modules.module.Module

single DenseNet block as used in iNeuBe model.

Parameters:
  • in_channels – number of input channels (image axis).

  • out_channels – number of output channels (image axis).

  • num_freqs – number of complex frequencies in the input STFT complex image-like tensor. The input is batch, image_channels, frames, freqs.

  • pre_blocks – dense block before point-wise convolution block over frequency axis.

  • freq_proc_blocks – number of frequency axis processing blocks.

  • post_blocks – dense block after point-wise convolution block over frequency axis.

  • ksz – kernel size used in densenet Conv2D layers.

  • activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.

  • hid_chans – number of hidden channels in densenet Conv2D.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.tcndenseunet.FreqWiseBlock(in_channels, num_freqs, out_channels, activation=<class 'torch.nn.modules.activation.ELU'>)[source]

Bases: torch.nn.modules.module.Module

FreqWiseBlock, see iNeuBe paper.

Block that applies pointwise 2D convolution over STFT-like image tensor on frequency axis. The input is assumed to be [batch, image_channels, frames, freq].

forward(inp)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.tcndenseunet.TCNDenseUNet(n_spk=1, in_freqs=257, mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation=<class 'torch.nn.modules.activation.ELU'>)[source]

Bases: torch.nn.modules.module.Module

TCNDenseNet block from iNeuBe

Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.

Parameters:
  • n_spk – number of output sources/speakers.

  • in_freqs – number of complex STFT frequencies.

  • mic_channels – number of microphones channels (only fixed-array geometry supported).

  • hid_chans – number of channels in the subsampling/upsampling conv layers.

  • hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).

  • ksz_dense – kernel size in the densenet layers throughout iNeuBe.

  • ksz_tcn – kernel size in the TCN submodule.

  • tcn_repeats – number of repetitions of blocks in the TCN submodule.

  • tcn_blocks – number of blocks in the TCN submodule.

  • tcn_channels – number of channels in the TCN submodule.

  • activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.

forward(tf_rep)[source]

forward.

Parameters:

tf_rep (torch.Tensor) – 4D tensor (multi-channel complex STFT of mixture) of shape [B, T, C, F] batch, frames, microphones, frequencies.

Returns:

complex 3D tensor (monaural STFT of the targets) of shape [B, T, F]: batch, frames, frequencies.

Return type:

out (torch.Tensor)
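A minimal usage sketch based on the constructor signature and the shapes documented above (assumed shapes; output sizes are taken from the docstring, not verified here):

    import torch
    from espnet2.enh.layers.tcndenseunet import TCNDenseUNet

    model = TCNDenseUNet(n_spk=1, in_freqs=257, mic_channels=1)
    tf_rep = torch.randn(2, 50, 1, 257, dtype=torch.complex64)  # [B, T, C, F]
    out = model(tf_rep)   # per the docstring: complex STFT of shape [B, T, F]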

class espnet2.enh.layers.tcndenseunet.TCNResBlock(in_chan, out_chan, ksz=3, stride=1, dilation=1, activation=<class 'torch.nn.modules.activation.ELU'>)[source]

Bases: torch.nn.modules.module.Module

single depth-wise separable TCN block as used in iNeuBe TCN.

Parameters:
  • in_chan – number of input feature channels.

  • out_chan – number of output feature channels.

  • ksz – kernel size.

  • stride – stride in depth-wise convolution.

  • dilation – dilation in depth-wise convolution.

  • activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.

forward(inp)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.dptnet

class espnet2.enh.layers.dptnet.DPTNet(rnn_type, input_size, hidden_size, output_size, att_heads=4, dropout=0, activation='relu', num_layers=1, bidirectional=True, norm_type='gLN')[source]

Bases: torch.nn.modules.module.Module

Dual-path transformer network.

Parameters:
  • rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size (int) – dimension of the input feature. Input size must be a multiple of att_heads.

  • hidden_size (int) – dimension of the hidden state.

  • output_size (int) – dimension of the output size.

  • att_heads (int) – number of attention heads.

  • dropout (float) – dropout ratio. Default is 0.

  • activation (str) – activation function applied at the output of RNN.

  • num_layers (int) – number of stacked RNN layers. Default is 1.

  • bidirectional (bool) – whether the RNN layers are bidirectional. Default is True.

  • norm_type (str) – type of normalization to use after each inter- or intra-chunk Transformer block.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

inter_chunk_process(x, layer_index)[source]
intra_chunk_process(x, layer_index)[source]
class espnet2.enh.layers.dptnet.ImprovedTransformerLayer(rnn_type, input_size, att_heads, hidden_size, dropout=0.0, activation='relu', bidirectional=True, norm='gLN')[source]

Bases: torch.nn.modules.module.Module

Container module of the (improved) Transformer proposed in [1].

Reference:

Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation; Chen et al, Interspeech 2020.

Parameters:
  • rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size (int) – Dimension of the input feature.

  • att_heads (int) – Number of attention heads.

  • hidden_size (int) – Dimension of the hidden state.

  • dropout (float) – Dropout ratio. Default is 0.

  • activation (str) – activation function applied at the output of RNN.

  • bidirectional (bool, optional) – True for bidirectional Inter-Chunk RNN (Intra-Chunk is always bidirectional).

  • norm (str, optional) – Type of normalization to use.

forward(x, attn_mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.dpmulcat

class espnet2.enh.layers.dpmulcat.DPMulCat(input_size: int, hidden_size: int, output_size: int, num_spk: int, dropout: float = 0.0, num_layers: int = 4, bidirectional: bool = True, input_normalize: bool = False)[source]

Bases: torch.nn.modules.module.Module

Dual-path RNN module with MulCat blocks.

Parameters:
  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • num_spk – int, the number of speakers in the output.

  • dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)

  • bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)

  • num_layers – int, number of stacked MulCat blocks. (Default: 4)

  • input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)

forward(input)[source]

Compute output after DPMulCat module.

Parameters:

input (torch.Tensor) – The input feature. Tensor of shape (batch, N, dim1, dim2) Apply RNN on dim1 first and then dim2

Returns:

list(torch.Tensor) or list(list(torch.Tensor))

In training mode, the module returns the output of each DPMulCat block. In eval mode, the module only returns the output of the last block.

class espnet2.enh.layers.dpmulcat.MulCatBlock(input_size: int, hidden_size: int, dropout: float = 0.0, bidirectional: bool = True)[source]

Bases: torch.nn.modules.module.Module

The MulCat block.

Parameters:
  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)

  • bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)

forward(input)[source]

Compute output after MulCatBlock.

Parameters:

input (torch.Tensor) – The input feature. Tensor of shape (batch, time, feature_dim)

Returns:

The output feature after MulCatBlock.

Tensor of shape (batch, time, feature_dim)

Return type:

(torch.Tensor)
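A minimal usage sketch following the documented shapes, assuming feature_dim equals input_size (illustration only):

    import torch
    from espnet2.enh.layers.dpmulcat import MulCatBlock

    block = MulCatBlock(input_size=64, hidden_size=128)
    x = torch.randn(4, 200, 64)    # (batch, time, feature_dim)
    y = block(x)                   # same shape per the docstring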

espnet2.enh.layers.tcn

class espnet2.enh.layers.tcn.ChannelwiseLayerNorm(channel_size, shape='BDT')[source]

Bases: torch.nn.modules.module.Module

Channel-wise Layer Normalization (cLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:

[M, N, K]

Return type:

cLN_y

reset_parameters()[source]
class espnet2.enh.layers.tcn.Chomp1d(chomp_size)[source]

Bases: torch.nn.modules.module.Module

To ensure the output length is the same as the input.

forward(x)[source]

Forward.

Parameters:

x – [M, H, Kpad]

Returns:

[M, H, K]
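Chomp1d follows the usual TCN trick of trimming the extra right-side frames introduced by padding; a plain-torch sketch of the assumed behaviour:

    import torch

    chomp_size = 2
    x = torch.randn(4, 8, 100 + chomp_size)   # [M, H, Kpad]
    y = x[:, :, :-chomp_size]                 # [M, H, K], same length as the original input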

class espnet2.enh.layers.tcn.DepthwiseSeparableConv(in_channels, out_channels, skip_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, H, K]

Returns:

[M, B, K]
skip_out: [M, Sc, K]

Return type:

res_out

class espnet2.enh.layers.tcn.GlobalLayerNorm(channel_size, shape='BDT')[source]

Bases: torch.nn.modules.module.Module

Global Layer Normalization (gLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:

[M, N, K]

Return type:

gLN_y

reset_parameters()[source]
class espnet2.enh.layers.tcn.TemporalBlock(in_channels, out_channels, skip_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, B, K]

Returns:

[M, B, K]

class espnet2.enh.layers.tcn.TemporalConvNet(N, B, H, P, X, R, C, Sc=None, out_channel=None, norm_type='gLN', causal=False, pre_mask_nonlinear='linear', mask_nonlinear='relu')[source]

Bases: torch.nn.modules.module.Module

Basic Module of tasnet.

Parameters:
  • N – Number of filters in autoencoder

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • C – Number of speakers

  • Sc – Number of channels in skip-connection paths’ 1x1-conv blocks

  • out_channel – Number of output channels; if None, N is used instead.

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

  • pre_mask_nonlinear – non-linear function applied before the mask network

  • mask_nonlinear – non-linear function used to generate the mask

forward(mixture_w)[source]

Keep this API same with TasNet.

Parameters:

mixture_w – [M, N, K], M is batch size

Returns:

[M, C, N, K]

Return type:

est_mask
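A minimal usage sketch based on the parameter list and shapes above (the hyperparameter values are arbitrary):

    import torch
    from espnet2.enh.layers.tcn import TemporalConvNet

    tcn = TemporalConvNet(N=64, B=16, H=32, P=3, X=4, R=2, C=2)
    mixture_w = torch.randn(1, 64, 100)   # [M, N, K]
    est_mask = tcn(mixture_w)             # [M, C, N, K] per the docstring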

class espnet2.enh.layers.tcn.TemporalConvNetInformed(N, B, H, P, X, R, Sc=None, out_channel=None, norm_type='gLN', causal=False, pre_mask_nonlinear='prelu', mask_nonlinear='relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, **adapt_layer_kwargs)[source]

Bases: espnet2.enh.layers.tcn.TemporalConvNet

Basic Module of TasNet with adaptation layers.

Parameters:
  • N – Number of filters in autoencoder

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • Sc – Number of channels in skip-connection paths’ 1x1-conv blocks

  • out_channel – Number of output channels; if None, N is used instead.

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

  • pre_mask_nonlinear – non-linear function applied before the mask network

  • mask_nonlinear – non-linear function used to generate the mask

  • i_adapt_layer – int, index of the adaptation layer

  • adapt_layer_type – str, type of adaptation layer see espnet2.enh.layers.adapt_layers for options

  • adapt_enroll_dim – int, dimensionality of the speaker embedding

forward(mixture_w, enroll_emb)[source]

TasNet forward with adaptation layers.

Parameters:
  • mixture_w – [M, N, K], M is batch size

  • enroll_emb – [M, 2*adapt_enroll_dim] if self.skip_connection [M, adapt_enroll_dim] if not self.skip_connection

Returns:

[M, N, K]

Return type:

est_mask

espnet2.enh.layers.tcn.check_nonlinear(nolinear_type)[source]
espnet2.enh.layers.tcn.choose_norm(norm_type, channel_size, shape='BDT')[source]

The input of normalization will be (M, C, K), where M is batch size.

C is channel size and K is sequence length.

espnet2.enh.layers.wpe

espnet2.enh.layers.wpe.get_correlations(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, taps, delay) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]

Calculates weighted correlations of a window of length taps

Parameters:
  • Y – Complex-valued STFT signal with shape (F, C, T)

  • inverse_power – Weighting factor with shape (F, T)

  • taps (int) – Length of the correlation window

  • delay (int) – Delay for the weighting factor

Returns:

Correlation matrix of shape (F, taps*C, taps*C)
Correlation vector of shape (F, taps, C, C)

espnet2.enh.layers.wpe.get_filter_matrix_conj(correlation_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], correlation_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], eps: float = 1e-10) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Calculate (conjugate) filter matrix based on correlations for one freq.

Parameters:
  • correlation_matrix – Correlation matrix (F, taps * C, taps * C)

  • correlation_vector – Correlation vector (F, taps, C, C)

  • eps

Returns:

(F, taps, C, C)

Return type:

filter_matrix_conj (torch.complex/ComplexTensor)

espnet2.enh.layers.wpe.get_power(signal, dim=-2) → torch.Tensor[source]

Calculates power for signal

Parameters:
  • signal – Single frequency signal with shape (F, C, T).

  • axis – reduce_mean axis

Returns:

Power with shape (F, T)

espnet2.enh.layers.wpe.is_torch_1_9_plus = True

WPE PyTorch version, ported from https://github.com/fgnt/nara_wpe. Many functions are not tested enough.

espnet2.enh.layers.wpe.perform_filter_operation(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], filter_matrix_conj: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps, delay) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Parameters:
  • Y – Complex-valued STFT signal of shape (F, C, T)

  • filter_matrix_conj – Filter matrix (conjugate)

espnet2.enh.layers.wpe.signal_framing(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, pad_value=0) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Expands signal into frames of frame_length.

Parameters:

signal – (B * F, D, T)

Returns:

(B * F, D, T, W)

Return type:

torch.Tensor

espnet2.enh.layers.wpe.wpe(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps=10, delay=3, iterations=3) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

WPE

Parameters:
  • Y – Complex valued STFT signal with shape (F, C, T)

  • taps – Number of filter taps

  • delay – Delay as a guard interval, such that X does not become zero.

  • iterations

Returns:

(F, C, T)

Return type:

enhanced
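A minimal usage sketch with a native torch complex STFT (the function also accepts torch_complex ComplexTensor inputs):

    import torch
    from espnet2.enh.layers.wpe import wpe

    Y = torch.randn(257, 4, 200, dtype=torch.complex128)  # (F, C, T)
    enhanced = wpe(Y, taps=10, delay=3, iterations=3)     # (F, C, T)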

espnet2.enh.layers.wpe.wpe_one_iteration(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], power: torch.Tensor, taps: int = 10, delay: int = 3, eps: float = 1e-10, inverse_power: bool = True) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

WPE for one iteration

Parameters:
  • Y – Complex valued STFT signal with shape (…, C, T)

  • power – : (…, T)

  • taps – Number of filter taps

  • delay – Delay as a guard interval, such that X does not become zero.

  • eps

  • inverse_power (bool) –

Returns:

(…, C, T)

Return type:

enhanced

espnet2.enh.layers.adapt_layers

class espnet2.enh.layers.adapt_layers.ConcatAdaptLayer(indim, enrolldim, ninputs=1)[source]

Bases: torch.nn.modules.module.Module

forward(main, enroll)[source]

ConcatAdaptLayer forward.

Parameters:
  • main – tensor, tuple, or list; activations in the main neural network, which are adapted. A tuple/list may be useful when we want to apply the adaptation to both the normal and the skip connection at once.

  • enroll – tensor, tuple, or list; embedding extracted from the enrollment. A tuple/list may be useful when we want to apply the adaptation to both the normal and the skip connection at once.

class espnet2.enh.layers.adapt_layers.MulAddAdaptLayer(indim, enrolldim, ninputs=1, do_addition=True)[source]

Bases: torch.nn.modules.module.Module

forward(main, enroll)[source]

MulAddAdaptLayer Forward.

Parameters:
  • main – tensor, tuple, or list; activations in the main neural network, which are adapted. A tuple/list may be useful when we want to apply the adaptation to both the normal and the skip connection at once.

  • enroll – tensor, tuple, or list; embedding extracted from the enrollment. A tuple/list may be useful when we want to apply the adaptation to both the normal and the skip connection at once.

espnet2.enh.layers.adapt_layers.into_orig_type(x, orig_type)[source]

Inverts into_tuple function.

espnet2.enh.layers.adapt_layers.into_tuple(x)[source]

Transforms tensor/list/tuple into tuple.

espnet2.enh.layers.adapt_layers.make_adapt_layer(type, indim, enrolldim, ninputs=1)[source]

espnet2.enh.layers.fasnet

class espnet2.enh.layers.fasnet.BF_module(input_dim, feature_dim, hidden_dim, output_dim, num_spk=2, layer=4, segment_size=100, bidirectional=True, dropout=0.0, fasnet_type='ifasnet')[source]

Bases: torch.nn.modules.module.Module

forward(input, num_mic)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.fasnet.FaSNet_TAC(*args, **kwargs)[source]

Bases: espnet2.enh.layers.fasnet.FaSNet_base

forward(input, num_mic)[source]

Abstract forward function.

input: shape (batch, max_num_ch, T)
num_mic: shape (batch,), the number of channels for each input. Zero for a fixed-geometry configuration.

class espnet2.enh.layers.fasnet.FaSNet_base(enc_dim, feature_dim, hidden_dim, layer, segment_size=24, nspk=2, win_len=16, context_len=16, dropout=0.0, sr=16000)[source]

Bases: torch.nn.modules.module.Module

forward(input, num_mic)[source]

Abstract forward function.

input: shape (batch, max_num_ch, T)
num_mic: shape (batch,), the number of channels for each input. Zero for a fixed-geometry configuration.

pad_input(input, window)[source]

Zero-padding input according to window/stride size.

seg_signal_context(x, window, context)[source]

Segmenting the signal into chunks with specific context.

input:
  x: size (B, ch, T)
  window: int
  context: int

seq_cos_sim(ref, target)[source]

Cosine similarity between some reference mics and some target mics.

ref: shape (nmic1, L, seg1)
target: shape (nmic2, L, seg2)

signal_context(x, context)[source]

Signal context function.

Segmenting the signal into chunks with specific context.

input:
  x: size (B, dim, nframe)
  context: int

espnet2.enh.layers.fasnet.test_model(model)[source]

espnet2.enh.layers.beamformer

Beamformer module.

espnet2.enh.layers.beamformer.apply_beamforming_vector(beamform_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.beamformer.blind_analytic_normalization(ws, psd_noise, eps=1e-08)[source]

Blind analytic normalization (BAN) for post-filtering

Parameters:
  • ws (torch.complex64/ComplexTensor) – beamformer vector (…, F, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise PSD matrix (…, F, C, C)

  • eps (float) –

Returns:

normalized beamformer vector (…, F)

Return type:

ws_ban (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.generalized_eigenvalue_decomposition(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]

Solves the generalized eigenvalue decomposition through Cholesky decomposition.

ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464

a @ e_vec = e_val * b @ e_vec

Cholesky decomposition on b: b = L @ L^H, where L is a lower triangular matrix.
Let C = L^-1 @ a @ L^-H; it is Hermitian.
=> C @ y = lambda * y
=> e_vec = L^-H @ y

Reference: https://www.netlib.org/lapack/lug/node54.html

Parameters:
  • a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)

  • b – A complex Hermitian or real symmetric definite positive matrix. (…, C, C)

Returns:

generalized eigenvalues (ascending order) e_vec: generalized eigenvectors

Return type:

e_val
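A plain-torch illustration of the Cholesky-based reduction described above, on a toy Hermitian pair (sketch only; the espnet2 helper is batched and adds eps regularization):

    import torch

    C = 4
    def random_hermitian(C):
        x = torch.randn(C, C, dtype=torch.complex64)
        return x @ x.conj().T + 1e-3 * torch.eye(C)

    a, b = random_hermitian(C), random_hermitian(C)   # b is positive definite
    L = torch.linalg.cholesky(b)                      # b = L @ L^H
    Linv = torch.linalg.inv(L)
    Cmat = Linv @ a @ Linv.conj().T                   # C = L^-1 @ a @ L^-H (Hermitian)
    e_val, y = torch.linalg.eigh(Cmat)                # ascending eigenvalues
    e_vec = Linv.conj().T @ y                         # e_vec = L^-H @ y
    # a @ e_vec ≈ b @ e_vec scaled by e_val (generalized eigenpairs)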

espnet2.enh.layers.beamformer.get_WPD_filter(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:

h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters:
  • Phi (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.

  • Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(B, F, (btaps + 1) * C)

Return type:

filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_WPD_filter_v2(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector (v2).

This implementation is more efficient than get_WPD_filter as

it skips unnecessary computation with zeros.

Parameters:
  • Phi (torch.complex64/ComplexTensor) – (B, F, C, C) is speech PSD.

  • Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, C) is the reference_vector.

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(B, F, (btaps+1) * C)

Return type:

filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_WPD_filter_with_rtf(psd_observed_bar: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector calculated with RTF.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:

h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters:
  • psd_observed_bar (torch.complex64/ComplexTensor) – stacked observation covariance matrix

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_covariances(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Calculates the power-normalized spatio-temporal covariance matrix of the framed signal.

Parameters:
  • Y – Complex STFT signal with shape (B, F, C, T)

  • inverse_power – Weighting factor with shape (B, F, T)

Returns:

(B, F, (btaps+1) * C, (btaps+1) * C)
Correlation vector: (B, F, btaps + 1, C, C)

Return type:

Correlation matrix

espnet2.enh.layers.beamformer.get_gev_vector(psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the generalized eigenvalue (GEV) beamformer vector:

psd_speech @ h = lambda * psd_noise @ h

Reference:

Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.

Parameters:
  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition (only for torch builtin complex tensors)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_lcmv_vector_with_rtf(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], rtf_mat: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Return the LCMV (Linearly Constrained Minimum Variance) vector calculated with RTF:

h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p

Reference:

H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)

Parameters:
  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (…, F, C, num_spk)

  • reference_vector (torch.Tensor or int) – (…, num_spk) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_mvdr_vector(psd_s, psd_n, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MVDR (Minimum Variance Distortionless Response) vector:

h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters:
  • psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor) – (…, C)

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)
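A plain-torch illustration of the MVDR formula above on toy covariance matrices (sketch only; the espnet2 helper additionally supports ComplexTensor inputs, diagonal loading, etc.):

    import torch

    F_, C = 3, 4
    def random_psd(F_, C):
        x = torch.randn(F_, C, C, dtype=torch.complex64)
        return x @ x.conj().transpose(-1, -2) + 1e-3 * torch.eye(C)

    psd_s, psd_n = random_psd(F_, C), random_psd(F_, C)
    u = torch.zeros(C, dtype=torch.complex64)
    u[0] = 1.0                                             # one-hot reference vector

    numerator = torch.linalg.solve(psd_n, psd_s)           # Npsd^-1 @ Spsd, (F, C, C)
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)   # (F,)
    h = (numerator / trace[..., None, None]) @ u           # beamforming vector, (F, C)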

espnet2.enh.layers.beamformer.get_mvdr_vector_with_rtf(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Return the MVDR (Minimum Variance Distortionless Response) vector calculated with RTF:

h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters:
  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_mwf_vector(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MWF (Minimum Multi-channel Wiener Filter) vector:

h = (Npsd^-1 @ Spsd) @ u

Parameters:
  • psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64/ComplexTensor) – power-normalized observation covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_power_spectral_density_matrix(xs, mask, normalization=True, reduction='mean', eps: float = 1e-15)[source]

Return cross-channel power spectral density (PSD) matrix

Parameters:
  • xs (torch.complex64/ComplexTensor) – (…, F, C, T)

  • reduction (str) – “mean” or “median”

  • mask (torch.Tensor) – (…, F, C, T)

  • normalization (bool) –

  • eps (float) –

Returns:

psd (torch.complex64/ComplexTensor): (…, F, C, C)

espnet2.enh.layers.beamformer.get_rank1_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the R1-MWF (Rank-1 Multi-channel Wiener Filter) vector

h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u

Reference:

[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018. https://hal.inria.fr/hal-01634449/document
[2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014. https://ieeexplore.ieee.org/document/6730918

Parameters:
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to MVDR beamformer.

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_rtf(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3)[source]

Calculate the relative transfer function (RTF)

Algorithm of power method:
  1. rtf = reference_vector

  2. for i in range(iterations):

    rtf = (psd_noise^-1 @ psd_speech) @ rtf
    rtf = rtf / ||rtf||_2  # this normalization can be skipped

  3. rtf = psd_noise @ rtf

  4. rtf = rtf / rtf[…, ref_channel, :]

Note: step 4 (normalization at the reference channel) is not performed here.

Parameters:
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

Returns:

(…, F, C, 1)

Return type:

rtf (torch.complex64/ComplexTensor)
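A single-frequency, plain-torch sketch of the power-method steps listed above (the espnet2 helper is batched over (…, F) and supports ComplexTensor inputs):

    import torch

    C, iterations = 4, 3
    def random_psd(C):
        x = torch.randn(C, C, dtype=torch.complex64)
        return x @ x.conj().T + 1e-3 * torch.eye(C)

    psd_speech, psd_noise = random_psd(C), random_psd(C)
    rtf = torch.zeros(C, dtype=torch.complex64)
    rtf[0] = 1.0                                       # step 1: reference vector
    phi = torch.linalg.solve(psd_noise, psd_speech)    # psd_noise^-1 @ psd_speech
    for _ in range(iterations):                        # step 2
        rtf = phi @ rtf
        rtf = rtf / torch.linalg.vector_norm(rtf)      # optional normalization
    rtf = psd_noise @ rtf                              # step 3 (step 4 is skipped here too)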

espnet2.enh.layers.beamformer.get_rtf_matrix(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Calculate the RTF matrix with each column the relative transfer function of the corresponding source.

espnet2.enh.layers.beamformer.get_sdw_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the SDW-MWF (Speech Distortion Weighted Multi-channel Wiener Filter) vector

h = (Spsd + mu * Npsd)^-1 @ Spsd @ u

Reference:

[1] Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction; A. Spriet et al, 2004. https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028
[2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018. https://hal.inria.fr/hal-01634449/document
[3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014. https://ieeexplore.ieee.org/document/6730918

Parameters:
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (by default).

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.gev_phase_correction(vector)[source]

Phase correction to reduce distortions due to phase inconsistencies.

ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169

Parameters:

vector – Beamforming vector with shape (…, F, C)

Returns:

Phase corrected beamforming vectors

Return type:

w

espnet2.enh.layers.beamformer.perform_WPD_filtering(filter_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], bdelay: int, btaps: int) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Perform WPD filtering.

Parameters:
  • filter_matrix – Filter matrix (B, F, (btaps + 1) * C)

  • Y – Complex STFT signal with shape (B, F, C, T)

Returns:

(B, F, T)

Return type:

enhanced (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.prepare_beamformer_stats(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]

Prepare necessary statistics for constructing the specified beamformer.

Parameters:
  • signal (torch.complex64/ComplexTensor) – (…, F, C, T)

  • masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources

  • mask_noise (torch.Tensor) – (…, F, C, T) noise mask

  • powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers

  • beamformer_type (str) – one of the pre-defined beamformer types

  • bdelay (int) – delay factor, used for the WPD beamformer

  • btaps (int) – number of filter taps, used for the WPD beamformer

  • eps (torch.Tensor) – tiny constant

Returns:

a dictionary containing all necessary statistics

a dictionary containing all necessary statistics, e.g. “psd_n”, “psd_speech”, “psd_distortion”.

Note:
  • When masks_speech is a tensor or a single-element list, all returned statistics are tensors;

  • When masks_speech is a multi-element list, some returned statistics can be a list, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.

Return type:

beamformer_stats (dict)

espnet2.enh.layers.beamformer.signal_framing(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Expand signal into several frames, with each frame of length frame_length.

Parameters:
  • signal – (…, T)

  • frame_length – length of each segment

  • frame_step – step for selecting frames

  • bdelay – delay for WPD

  • do_padding – whether or not to pad the input signal at the beginning of the time dimension

  • pad_value – value to fill in the padding

Returns:

if do_padding: (…, T, frame_length)
else: (…, T - bdelay - frame_length + 2, frame_length)

Return type:

torch.Tensor

espnet2.enh.layers.beamformer.tik_reg(mat, reg: float = 1e-08, eps: float = 1e-08)[source]

Perform Tikhonov regularization (only modifying real part).

Parameters:
  • mat (torch.complex64/ComplexTensor) – input matrix (…, C, C)

  • reg (float) – regularization factor

  • eps (float) –

Returns:

regularized matrix (…, C, C)

Return type:

ret (torch.complex64/ComplexTensor)
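A common variant of this regularization, shown as a sketch (assumed behaviour, not necessarily the exact espnet2 implementation): scale an identity matrix by the mean real diagonal value and add it to the input so that a later inversion is numerically stable.

    import torch

    def tik_reg_sketch(mat: torch.Tensor, reg: float = 1e-8, eps: float = 1e-8):
        C = mat.size(-1)
        eye = torch.eye(C, dtype=mat.dtype, device=mat.device)
        scale = mat.diagonal(dim1=-2, dim2=-1).sum(-1).real / C   # mean diagonal (real part)
        return mat + (reg * scale + eps)[..., None, None] * eye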

espnet2.enh.layers.dnn_wpe

class espnet2.enh.layers.dnn_wpe.DNN_WPE(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, nmask: int = 1, nonlinear: str = 'sigmoid', iterations: int = 1, normalization: bool = False, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True)[source]

Bases: torch.nn.modules.module.Module

forward(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]

DNN_WPE forward function.

Notation:

B: Batch
C: Channel
T: Time or Sequence length
F: Freq or some dimension of the feature vector

Parameters:
  • data – (B, T, C, F)

  • ilens – (B,)

Returns:

(B, T, C, F)
ilens: (B,)
masks (torch.Tensor or List[torch.Tensor]): (B, T, C, F)
power (List[torch.Tensor]): (B, F, T)

Return type:

enhanced (torch.Tensor or List[torch.Tensor])

predict_mask(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Predict mask for WPE dereverberation.

Parameters:
  • data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns:

(B, T, C, F)
ilens (torch.Tensor): (B,)

Return type:

masks (torch.Tensor or List[torch.Tensor])

espnet2.enh.layers.dnsmos

class espnet2.enh.layers.dnsmos.DNSMOS_local(primary_model_path, p808_model_path, use_gpu=False)[source]

Bases: object

audio_melspec(audio, n_mels=120, frame_size=320, hop_length=160, sr=16000, to_db=True)[source]
get_polyfit_val(sig, bak, ovr, is_personalized_MOS)[source]
class espnet2.enh.layers.dnsmos.DNSMOS_web(auth_key)[source]

Bases: object

espnet2.enh.layers.mask_estimator

class espnet2.enh.layers.mask_estimator.MaskEstimator(type, idim, layers, units, projs, dropout, nmask=1, nonlinear='sigmoid')[source]

Bases: torch.nn.modules.module.Module

forward(xs: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

Mask estimator forward function.

Parameters:
  • xs – (B, F, C, T)

  • ilens – (B,)

Returns:

The hidden vector (B, F, C, T) masks: A tuple of the masks. (B, F, C, T) ilens: (B,)

Return type:

hs (torch.Tensor)

espnet2.enh.layers.skim

class espnet2.enh.layers.skim.MemLSTM(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cLN')[source]

Bases: torch.nn.modules.module.Module

The Mem-LSTM of SkiM.

Parameters:
  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.

  • mem_type – ‘hc’, ‘h’, ‘c’ or ‘id’. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.

  • norm_type – gLN, cLN. cLN is for causal implementation.

extra_repr() → str[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(hc, S)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_one_step(hc, state)[source]
class espnet2.enh.layers.skim.SegLSTM(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]

Bases: torch.nn.modules.module.Module

The Seg-LSTM of SkiM.

Parameters:
  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.

  • norm_type – gLN, cLN. cLN is for causal implementation.

forward(input, hc)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.skim.SkiM(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='hc', norm_type='gLN', seg_overlap=False)[source]

Bases: torch.nn.modules.module.Module

Skipping Memory Net

Parameters:
  • input_size – int, dimension of the input feature. Input shape should be (batch, length, input_size)

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • dropout – float, dropout ratio. Default is 0.

  • num_blocks – number of basic SkiM blocks

  • segment_size – segmentation size for splitting long features

  • bidirectional – bool, whether the RNN layers are bidirectional.

  • mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.

  • norm_type – gLN, cLN. cLN is for causal implementation.

  • seg_overlap – bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_stream(input_frame, states)[source]
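
A hypothetical usage sketch based only on the parameter descriptions above; the exact return shape may vary across espnet2 versions:

    import torch
    from espnet2.enh.layers.skim import SkiM

    # Hypothetical usage sketch; shapes follow the parameter descriptions above.
    model = SkiM(input_size=64, hidden_size=128, output_size=64,
                 num_blocks=2, segment_size=20, mem_type="hc", norm_type="gLN")
    feats = torch.randn(2, 100, 64)   # (batch, length, input_size)
    out = model(feats)
    print(out.shape)                  # expected to be (batch, length, output_size)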

espnet2.enh.layers.dc_crn

class espnet2.enh.layers.dc_crn.DC_CRN(input_dim, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels=8, enc_kernel_size=(1, 3), enc_padding=(0, 1), enc_last_kernel_size=(1, 4), enc_last_stride=(1, 2), enc_last_padding=(0, 1), enc_layers=5, skip_last_kernel_size=(1, 3), skip_last_stride=(1, 1), skip_last_padding=(0, 1), glstm_groups=2, glstm_layers=2, glstm_bidirectional=False, glstm_rearrange=False, output_channels=2)[source]

Bases: torch.nn.modules.module.Module

Densely-Connected Convolutional Recurrent Network (DC-CRN).

Reference: Fig. 3 and Section III-B in [1]

Parameters:
  • input_dim (int) – input feature dimension

  • input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers. Its length should be (number of DenselyConnectedBlock layers). It is recommended to use an even number of channels to avoid an AssertionError when glstm_bidirectional=True.

  • enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder

  • enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder

  • enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder

  • enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder

  • skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • glstm_groups (int) – number of groups in each Grouped LSTM layer

  • glstm_layers (int) – number of Grouped LSTM layers

  • glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers

  • glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer

  • output_channels (int) – number of output channels (must be an even number to recover both real and imaginary parts)

forward(x)[source]

DC-CRN forward.

Parameters:

x (torch.Tensor) – Concatenated real and imaginary spectrum features (B, input_channels[0], T, F)

Returns:

(B, 2, output_channels, T, F)

Return type:

out (torch.Tensor)

class espnet2.enh.layers.dc_crn.DenselyConnectedBlock(in_channels, out_channels, hid_channels=8, kernel_size=(1, 3), padding=(0, 1), last_kernel_size=(1, 4), last_stride=(1, 2), last_padding=(0, 1), last_output_padding=(0, 0), layers=5, transposed=False)[source]

Bases: torch.nn.modules.module.Module

Densely-Connected Convolutional Block.

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • hid_channels (int) – number of output channels in intermediate Conv layers

  • kernel_size (tuple) – kernel size for all but the last Conv layers

  • padding (tuple) – padding for all but the last Conv layers

  • last_kernel_size (tuple) – kernel size for the last GluConv layer

  • last_stride (tuple) – stride for the last GluConv layer

  • last_padding (tuple) – padding for the last GluConv layer

  • last_output_padding (tuple) – output padding for the last GluConvTranspose2d (only used when transposed=True)

  • layers (int) – total number of Conv layers

  • transposed (bool) – True to use GluConvTranspose2d in the last layer False to use GluConv2d in the last layer

forward(input)[source]

DenselyConnectedBlock forward.

Parameters:

input (torch.Tensor) – (B, C, T_in, F_in)

Returns:

(B, C, T_out, F_out)

Return type:

out (torch.Tensor)

class espnet2.enh.layers.dc_crn.GLSTM(hidden_size=1024, groups=2, layers=2, bidirectional=False, rearrange=False)[source]

Bases: torch.nn.modules.module.Module

Grouped LSTM.

Reference:

Efficient Sequence Learning with Group Recurrent Networks; Gao et al., 2018

Parameters:
  • hidden_size (int) – total hidden size of all LSTMs in each grouped LSTM layer i.e., hidden size of each LSTM is hidden_size // groups

  • groups (int) – number of LSTMs in each grouped LSTM layer

  • layers (int) – number of grouped LSTM layers

  • bidirectional (bool) – whether to use BLSTM or unidirectional LSTM

  • rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer

forward(x)[source]

Grouped LSTM forward.

Parameters:

x (torch.Tensor) – (B, C, T, D)

Returns:

(B, C, T, D)

Return type:

out (torch.Tensor)

class espnet2.enh.layers.dc_crn.GluConv2d(in_channels, out_channels, kernel_size, stride, padding=0)[source]

Bases: torch.nn.modules.module.Module

Conv2d with Gated Linear Units (GLU).

Input and output shapes are the same as regular Conv2d layers.

Reference: Section III-B in [1]

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • kernel_size (int/tuple) – kernel size in Conv2d

  • stride (int/tuple) – stride size in Conv2d

  • padding (int/tuple) – padding size in Conv2d

forward(x)[source]

ConvGLU forward.

Parameters:

x (torch.Tensor) – (B, C_in, H_in, W_in)

Returns:

(B, C_out, H_out, W_out)

Return type:

out (torch.Tensor)

class espnet2.enh.layers.dc_crn.GluConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding=0, output_padding=(0, 0))[source]

Bases: torch.nn.modules.module.Module

ConvTranspose2d with Gated Linear Units (GLU).

Input and output shapes are the same as regular ConvTranspose2d layers.

Reference: Section III-B in [1]

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • kernel_size (int/tuple) – kernel size in ConvTranspose2d

  • stride (int/tuple) – stride size in ConvTranspose2d

  • padding (int/tuple) – padding size in ConvTranspose2d

  • output_padding (int/tuple) – Additional size added to one side of each dimension in the output shape

forward(x)[source]

DeconvGLU forward.

Parameters:

x (torch.Tensor) – (B, C_in, H_in, W_in)

Returns:

(B, C_out, H_out, W_out)

Return type:

out (torch.Tensor)

espnet2.enh.layers.__init__

espnet2.enh.layers.conv_utils

espnet2.enh.layers.conv_utils.conv2d_output_shape(h_w, kernel_size=1, stride=1, pad=0, dilation=1)[source]
espnet2.enh.layers.conv_utils.convtransp2d_output_shape(h_w, kernel_size=1, stride=1, pad=0, dilation=1, out_pad=0)[source]
espnet2.enh.layers.conv_utils.num2tuple(num)[source]
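
These helpers compute Conv2d / ConvTranspose2d output shapes from the usual formulas. A simplified sketch of the Conv2d case (assuming symmetric padding; the espnet2 helpers also accept per-side padding tuples) is:

    import math

    def conv2d_output_shape_sketch(h_w, kernel_size=1, stride=1, pad=0, dilation=1):
        """Standard Conv2d output-shape formula, applied per dimension:
        floor((size + 2*pad - dilation*(kernel - 1) - 1) / stride + 1).
        Illustrative only; not the espnet2 implementation.
        """
        def to2(v):  # broadcast scalars to (h, w) pairs
            return (v, v) if isinstance(v, int) else tuple(v)

        h_w, kernel_size, stride, pad, dilation = map(to2, (h_w, kernel_size, stride, pad, dilation))
        return tuple(
            math.floor((h_w[i] + 2 * pad[i] - dilation[i] * (kernel_size[i] - 1) - 1) / stride[i] + 1)
            for i in range(2)
        )

    print(conv2d_output_shape_sketch((257, 100), kernel_size=(1, 3), stride=(1, 2), pad=(0, 1)))
    # (257, 50)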

espnet2.enh.layers.dnn_beamformer

DNN beamformer module.

class espnet2.enh.layers.dnn_beamformer.AttentionReference(bidim, att_dim, eps=1e-06)[source]

Bases: torch.nn.modules.module.Module

forward(psd_in: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]

Attention-based reference forward function.

Parameters:
  • psd_in (torch.complex64/ComplexTensor) – (B, F, C, C)

  • ilens (torch.Tensor) – (B,)

  • scaling (float) –

Returns:

(B, C) ilens (torch.Tensor): (B,)

Return type:

u (torch.Tensor)

class espnet2.enh.layers.dnn_beamformer.DNN_Beamformer(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, use_torchaudio_api: bool = False, btaps: int = 5, bdelay: int = 3)[source]

Bases: torch.nn.modules.module.Module

DNN mask based Beamformer.

Citation:

Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf

apply_beamforming(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)[source]

Beamforming with the provided statistics.

Parameters:
  • data (torch.complex64/ComplexTensor) – (B, F, C, T)

  • ilens (torch.Tensor) – (B,)

  • psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (B, F, C, C) Observation covariance matrix for MPDR/wMPDR (B, F, C, C) Stacked observation covariance for WPD (B,F,(btaps+1)*C,(btaps+1)*C)

  • psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (B, F, C, C)

  • psd_distortion (torch.complex64/ComplexTensor) – Noise covariance matrix (B, F, C, C)

  • rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (B, F, C, num_spk)

  • spk (int) – speaker index

Returns:

(B, F, T) ws (torch.complex64/ComplexTensor): (B, F) or (B, F, (btaps+1)*C)

Return type:

enhanced (torch.complex64/ComplexTensor)

forward(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, powers: Optional[List[torch.Tensor]] = None, oracle_masks: Optional[List[torch.Tensor]] = None) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, torch.Tensor][source]

DNN_Beamformer forward function.

Notation:

B: Batch, C: Channel, T: Time or Sequence length, F: Freq

Parameters:
  • data (torch.complex64/ComplexTensor) – (B, T, C, F)

  • ilens (torch.Tensor) – (B,)

  • powers (List[torch.Tensor] or None) – used for wMPDR or WPD (B, F, T)

  • oracle_masks (List[torch.Tensor] or None) – oracle masks (B, F, C, T) if not None, oracle_masks will be used instead of self.mask

Returns:

(B, T, F) ilens (torch.Tensor): (B,) masks (torch.Tensor): (B, T, C, F)

Return type:

enhanced (torch.complex64/ComplexTensor)

predict_mask(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

Predict masks for beamforming.

Parameters:
  • data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns:

(B, T, C, F) ilens (torch.Tensor): (B,)

Return type:

masks (torch.Tensor)

espnet2.enh.layers.beamformer_th

Beamformer module.

espnet2.enh.layers.beamformer_th.apply_beamforming_vector(beamform_vector: torch.Tensor, mix: torch.Tensor) → torch.Tensor[source]
espnet2.enh.layers.beamformer_th.blind_analytic_normalization(ws, psd_noise, eps=1e-08)[source]

Blind analytic normalization (BAN) for post-filtering

Parameters:
  • ws (torch.complex64) – beamformer vector (…, F, C)

  • psd_noise (torch.complex64) – noise PSD matrix (…, F, C, C)

  • eps (float) –

Returns:

normalized beamformer vector (…, F)

Return type:

ws_ban (torch.complex64)

espnet2.enh.layers.beamformer_th.generalized_eigenvalue_decomposition(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]

Solves the generalized eigenvalue decomposition through Cholesky decomposition.

ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464

a @ e_vec = e_val * b @ e_vec

Cholesky decomposition on b: b = L @ L^H, where L is a lower triangular matrix

Let C = L^-1 @ a @ L^-H; it is Hermitian.
  => C @ y = lambda * y
  => e_vec = L^-H @ y

Reference: https://www.netlib.org/lapack/lug/node54.html

Parameters:
  • a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)

  • b – A complex Hermitian or real symmetric definite positive matrix. (…, C, C)

Returns:

generalized eigenvalues (ascending order) e_vec: generalized eigenvectors

Return type:

e_val
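
The derivation above can be reproduced step by step in plain torch. The sketch below is illustrative and omits the eps-based regularization of the actual function:

    import torch

    def gevd_via_cholesky(a: torch.Tensor, b: torch.Tensor):
        """Sketch of solving a @ e_vec = e_val * b @ e_vec via a Cholesky factorization of b.

        Follows the derivation quoted above; espnet2's implementation adds an
        eps term for numerical stability, which is omitted here.
        """
        L = torch.linalg.cholesky(b)                      # b = L @ L^H
        Linv = torch.linalg.inv(L)
        C = Linv @ a @ Linv.conj().transpose(-1, -2)      # Hermitian matrix
        e_val, y = torch.linalg.eigh(C)                   # eigenvalues in ascending order
        e_vec = Linv.conj().transpose(-1, -2) @ y         # map back: e_vec = L^-H @ y
        return e_val, e_vec

    n = 4
    a = torch.randn(n, n, dtype=torch.complex64)
    a = a @ a.conj().T
    b = torch.randn(n, n, dtype=torch.complex64)
    b = b @ b.conj().T + n * torch.eye(n)
    e_val, e_vec = gevd_via_cholesky(a, b)
    # residual of a @ e_vec = b @ e_vec @ diag(e_val); should be close to zero
    print((a @ e_vec - b @ e_vec @ torch.diag(e_val.to(e_vec.dtype))).abs().max())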

espnet2.enh.layers.beamformer_th.get_WPD_filter(Phi: torch.Tensor, Rf: torch.Tensor, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]

Return the WPD vector.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:

h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters:
  • Phi (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.

  • Rf (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(B, F, (btaps + 1) * C)

Return type:

filter_matrix (torch.complex64)

espnet2.enh.layers.beamformer_th.get_WPD_filter_v2(Phi: torch.Tensor, Rf: torch.Tensor, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]

Return the WPD vector (v2).

This implementation is more efficient than get_WPD_filter, as it skips unnecessary computation with zeros.

Parameters:
  • Phi (torch.complex64) – (B, F, C, C) is speech PSD.

  • Rf (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, C) is the reference_vector.

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(B, F, (btaps+1) * C)

Return type:

filter_matrix (torch.complex64)

espnet2.enh.layers.beamformer_th.get_WPD_filter_with_rtf(psd_observed_bar: torch.Tensor, psd_speech: torch.Tensor, psd_noise: torch.Tensor, iterations: int = 3, reference_vector: Union[int, torch.Tensor] = 0, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → torch.Tensor[source]

Return the WPD vector calculated with RTF.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:

h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters:
  • psd_observed_bar (torch.complex64) – stacked observation covariance matrix

  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_covariances(Y: torch.Tensor, inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → torch.Tensor[source]

Calculates the power normalized spatio-temporal covariance matrix of the framed signal.

Parameters:
  • Y – Complex STFT signal with shape (B, F, C, T)

  • inverse_power – Weighting factor with shape (B, F, T)

Returns:

(B, F, (btaps+1) * C, (btaps+1) * C) Correlation vector: (B, F, btaps + 1, C, C)

Return type:

Correlation matrix

espnet2.enh.layers.beamformer_th.get_gev_vector(psd_noise: torch.Tensor, psd_speech: torch.Tensor, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]

Return the generalized eigenvalue (GEV) beamformer vector:

psd_speech @ h = lambda * psd_noise @ h

Reference:

Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.

Parameters:
  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_lcmv_vector_with_rtf(psd_n: torch.Tensor, rtf_mat: torch.Tensor, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]

Return the LCMV (Linearly Constrained Minimum Variance) vector calculated with RTF:

h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p

Reference:

H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)

Parameters:
  • psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)

  • rtf_mat (torch.complex64) – RTF matrix (…, F, C, num_spk)

  • reference_vector (torch.Tensor or int) – (…, num_spk) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_mvdr_vector(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MVDR (Minimum Variance Distortionless Response) vector:

h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters:
  • psd_s (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor) – (…, C) or an integer

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)
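
For reference, the Souden-style solution above can be sketched in a few lines of plain torch. Diagonal loading and ComplexTensor support are omitted, and the reference is given as a channel index rather than a vector; this is an illustration, not the espnet2 implementation:

    import torch

    def mvdr_souden_sketch(psd_s: torch.Tensor, psd_n: torch.Tensor, ref_ch: int = 0) -> torch.Tensor:
        """Illustrative MVDR (Souden) solution h = (Npsd^-1 @ Spsd) / Tr(Npsd^-1 @ Spsd) @ u."""
        numerator = torch.linalg.solve(psd_n, psd_s)               # (..., F, C, C)
        trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)       # (..., F)
        ws = numerator / (trace[..., None, None] + 1e-15)
        return ws[..., ref_ch]                                     # (..., F, C): the ref_ch-th column

    C = 4
    psd_s = torch.randn(8, 257, C, C, dtype=torch.complex64)
    psd_s = psd_s @ psd_s.conj().transpose(-1, -2)                 # Hermitian, PSD-like
    psd_n = torch.randn(8, 257, C, C, dtype=torch.complex64)
    psd_n = psd_n @ psd_n.conj().transpose(-1, -2) + torch.eye(C)  # well-conditioned noise PSD
    print(mvdr_souden_sketch(psd_s, psd_n).shape)                  # torch.Size([8, 257, 4])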

espnet2.enh.layers.beamformer_th.get_mvdr_vector_with_rtf(psd_n: torch.Tensor, psd_speech: torch.Tensor, psd_noise: torch.Tensor, iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]

Return the MVDR (Minimum Variance Distortionless Response) vector calculated with RTF:

h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters:
  • psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_mwf_vector(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MWF (Multi-channel Wiener Filter) vector:

h = (Npsd^-1 @ Spsd) @ u

Parameters:
  • psd_s (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64) – power-normalized observation covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_rank1_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the R1-MWF (Rank-1 Multi-channel Wiener Filter) vector

h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u

Reference:

[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document [2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918

Parameters:
  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to MVDR beamformer.

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.get_rtf(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07)[source]

Calculate the relative transfer function (RTF).

Parameters:
  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

Returns:

(…, F, C)

Return type:

rtf (torch.complex64)

espnet2.enh.layers.beamformer_th.get_rtf_matrix(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Calculate the RTF matrix, with each column being the relative transfer function of the corresponding source.

espnet2.enh.layers.beamformer_th.get_sdw_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the SDW-MWF (Speech Distortion Weighted Multi-channel Wiener Filter) vector

h = (Spsd + mu * Npsd)^-1 @ Spsd @ u

Reference:

[1] Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction; A. Spriet et al, 2004 https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028 [2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document [3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918

Parameters:
  • psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (by default).

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns:

(…, F, C)

Return type:

beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.gev_phase_correction(vector)[source]

Phase correction to reduce distortions due to phase inconsistencies.

ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169

Parameters:

vector – Beamforming vector with shape (…, F, C)

Returns:

Phase corrected beamforming vectors

Return type:

w

espnet2.enh.layers.beamformer_th.perform_WPD_filtering(filter_matrix: torch.Tensor, Y: torch.Tensor, bdelay: int, btaps: int) → torch.Tensor[source]

Perform WPD filtering.

Parameters:
  • filter_matrix – Filter matrix (B, F, (btaps + 1) * C)

  • Y – Complex STFT signal with shape (B, F, C, T)

Returns:

(B, F, T)

Return type:

enhanced (torch.complex64)

espnet2.enh.layers.beamformer_th.prepare_beamformer_stats(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]

Prepare necessary statistics for constructing the specified beamformer.

Parameters:
  • signal (torch.complex64) – (…, F, C, T)

  • masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources

  • mask_noise (torch.Tensor) – (…, F, C, T) noise mask

  • powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers

  • beamformer_type (str) – one of the pre-defined beamformer types

  • bdelay (int) – delay factor, used for the WPD beamformer

  • btaps (int) – number of filter taps, used for the WPD beamformer

  • eps (torch.Tensor) – tiny constant

Returns:

a dictionary containing all necessary statistics

e.g. “psd_n”, “psd_speech”, “psd_distortion”

Note:
  • When masks_speech is a tensor or a single-element list, all returned statistics are tensors;

  • When masks_speech is a multi-element list, some returned statistics can be a list, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.

Return type:

beamformer_stats (dict)

espnet2.enh.layers.beamformer_th.signal_framing(signal: torch.Tensor, frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → torch.Tensor[source]

Expand signal into several frames, with each frame of length frame_length.

Parameters:
  • signal – (…, T)

  • frame_length – length of each segment

  • frame_step – step for selecting frames

  • bdelay – delay for WPD

  • do_padding – whether or not to pad the input signal at the beginning of the time dimension

  • pad_value – value to fill in the padding

Returns:

if do_padding: (…, T, frame_length) else: (…, T - bdelay - frame_length + 2, frame_length)

Return type:

torch.Tensor

espnet2.enh.layers.beamformer_th.tik_reg(mat, reg: float = 1e-08, eps: float = 1e-08)[source]

Perform Tikhonov regularization (only modifying real part).

Parameters:
  • mat (torch.complex64) – input matrix (…, C, C)

  • reg (float) – regularization factor

  • eps (float) –

Returns:

regularized matrix (…, C, C)

Return type:

ret (torch.complex64)

espnet2.enh.layers.ifasnet

class espnet2.enh.layers.ifasnet.iFaSNet(*args, **kwargs)[source]

Bases: espnet2.enh.layers.fasnet.FaSNet_base

forward(input, num_mic)[source]

Abstract forward function.

input: shape (batch, max_num_ch, T)
num_mic: shape (batch,), the number of channels for each input. Zero for fixed geometry configuration.

espnet2.enh.layers.ifasnet.test_model(model)[source]

espnet2.enh.layers.dprnn

class espnet2.enh.layers.dprnn.DPRNN(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]

Bases: torch.nn.modules.module.Module

Deep dual-path RNN.

Parameters:
  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • dropout – float, dropout ratio. Default is 0.

  • num_layers – int, number of stacked RNN layers. Default is 1.

  • bidirectional – bool, whether the RNN layers are bidirectional. Default is True.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.dprnn.DPRNN_TAC(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]

Bases: torch.nn.modules.module.Module

Deep dual-path RNN with TAC applied to each layer/block.

Parameters:
  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • dropout – float, dropout ratio. Default is 0.

  • num_layers – int, number of stacked RNN layers. Default is 1.

  • bidirectional – bool, whether the RNN layers are bidirectional. Default is True.

forward(input, num_mic)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.dprnn.SingleRNN(rnn_type, input_size, hidden_size, dropout=0, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

Container module for a single RNN layer.

Parameters:
  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the RNN layers are bidirectional. Default is False.

forward(input, state=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.dprnn.merge_feature(input, rest)[source]
espnet2.enh.layers.dprnn.split_feature(input, segment_size)[source]

espnet2.enh.encoder.abs_encoder

class espnet2.enh.encoder.abs_encoder.AbsEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_streaming(input: torch.Tensor)[source]
abstract property output_dim
streaming_frame(audio: torch.Tensor)[source]

streaming_frame. It splits the continuous audio into frame-level audio chunks in the streaming simulation. It is noted that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.

Parameters:

audio – (B, T)

Returns:

List [(B, frame_size),]

Return type:

chunked

espnet2.enh.encoder.stft_encoder

class espnet2.enh.encoder.stft_encoder.STFTEncoder(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_builtin_complex: bool = True)[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

STFT encoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

forward_streaming(input: torch.Tensor)[source]

Forward.

Parameters:

input (torch.Tensor) – mixed speech [Batch, frame_length]

Returns:

B, 1, F

property output_dim
streaming_frame(audio)[source]

streaming_frame. It splits the continuous audio into frame-level audio chunks in the streaming simulation. It is noted that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.

Parameters:

audio – (B, T)

Returns:

List [(B, frame_size),]

Return type:

chunked
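
A hypothetical usage sketch; the forward return values are assumed to be the complex spectrum and the corresponding feature lengths, which may differ slightly across espnet2 versions:

    import torch
    from espnet2.enh.encoder.stft_encoder import STFTEncoder

    # Hypothetical usage sketch; input shapes follow the docstring above.
    encoder = STFTEncoder(n_fft=512, hop_length=128)
    wav = torch.randn(2, 16000)              # (Batch, sample)
    ilens = torch.tensor([16000, 12000])
    spec, flens = encoder(wav, ilens)        # assumed: complex spectrum and feature lengths
    print(spec.shape, flens)                 # roughly (Batch, frames, n_fft // 2 + 1)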

espnet2.enh.encoder.conv_encoder

class espnet2.enh.encoder.conv_encoder.ConvEncoder(channel: int, kernel_size: int, stride: int)[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

Convolutional encoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns:

mixed feature after encoder [Batch, flens, channel]

Return type:

feature (torch.Tensor)

forward_streaming(input: torch.Tensor)[source]
property output_dim
streaming_frame(audio: torch.Tensor)[source]

streaming_frame. It splits the continuous audio into frame-level audio chunks in the streaming simulation. It is noted that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.

Parameters:

audio – (B, T)

Returns:

List [(B, frame_size),]

Return type:

chunked

espnet2.enh.encoder.null_encoder

class espnet2.enh.encoder.null_encoder.NullEncoder[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

Null encoder.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

property output_dim

espnet2.enh.encoder.__init__

espnet2.enh.loss.__init__

espnet2.enh.loss.wrappers.mixit_solver

class espnet2.enh.loss.wrappers.mixit_solver.MixITSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight: float = 1.0)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

Mixture Invariant Training Solver.

Parameters:
  • criterion (AbsEnhLoss) – an instance of AbsEnhLoss

  • weight (float) – weight (between 0 and 1) of current loss for multi-task learning.

forward(ref: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], inf: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], others: Dict = {})[source]

MixIT solver.

Parameters:
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …] x n_est

Returns:

(torch.Tensor): minimum loss with the best permutation stats: dict, for collecting training status others: dict, in this PIT solver, permutation order will be returned

Return type:

loss

property name

espnet2.enh.loss.wrappers.multilayer_pit_solver

class espnet2.enh.loss.wrappers.multilayer_pit_solver.MultiLayerPITSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True, layer_weights=None)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

Multi-Layer Permutation Invariant Training Solver.

Compute the PIT loss given inferences of multiple layers and a single reference. It also supports a single inference and a single reference in the evaluation stage.

Parameters:
  • criterion (AbsEnhLoss) – an instance of AbsEnhLoss

  • weight (float) – weight (between 0 and 1) of current loss for multi-task learning.

  • independent_perm (bool) – If True, PIT will be performed in forward to find the best permutation; If False, the permutation from the last LossWrapper output will be inherited. Note: You should be careful about the ordering of loss wrappers defined in the yaml config, if this argument is False.

  • layer_weights (Optional[List[float]]) – weights for each layer If not None, the loss of each layer will be weighted-summed using the specified weights.

forward(ref, infs, others={})[source]

Permutation invariant training solver.

Parameters:
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • infs (Union[List[torch.Tensor], List[List[torch.Tensor]]]) – [(batch, …), …]

Returns:

(torch.Tensor): minimum loss with the best permutation stats: dict, for collecting training status others: dict, in this PIT solver, permutation order will be returned

Return type:

loss

espnet2.enh.loss.wrappers.fixed_order

class espnet2.enh.loss.wrappers.fixed_order.FixedOrderSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward(ref, inf, others={})[source]

A naive fixed-order solver

Parameters:
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …]

Returns:

(torch.Tensor): minimum loss with the best permutation stats: dict, for collecting training status others: reserved

Return type:

loss

espnet2.enh.loss.wrappers.abs_wrapper

class espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Base class for all Enhancement loss wrapper modules.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(ref: List, inf: List, others: Dict) → Tuple[torch.Tensor, Dict, Dict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

weight = 1.0

espnet2.enh.loss.wrappers.__init__

espnet2.enh.loss.wrappers.pit_solver

class espnet2.enh.loss.wrappers.pit_solver.PITSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True, flexible_numspk=False)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

Permutation Invariant Training Solver.

Parameters:
  • criterion (AbsEnhLoss) – an instance of AbsEnhLoss

  • weight (float) – weight (between 0 and 1) of current loss for multi-task learning.

  • independent_perm (bool) –

    If True, PIT will be performed in forward to find the best permutation; If False, the permutation from the last LossWrapper output will be inherited. NOTE (wangyou): You should be careful about the ordering of loss wrappers defined in the yaml config, if this argument is False.

  • flexible_numspk (bool) – If True, num_spk will be taken from inf to handle flexible numbers of speakers. This is because ref may include dummy data in this case.

forward(ref, inf, others={})[source]

PITSolver forward.

Parameters:
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …]

Returns:

(torch.Tensor): minimum loss with the best permutation stats: dict, for collecting training status others: dict, in this PIT solver, permutation order will be returned

Return type:

loss
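
A hypothetical usage sketch combining PITSolver with a time-domain criterion; the return signature follows the AbsLossWrapper interface (loss, stats, others):

    import torch
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss
    from espnet2.enh.loss.wrappers.pit_solver import PITSolver

    # Hypothetical usage sketch: 2-speaker PIT over an SI-SNR criterion.
    solver = PITSolver(criterion=SISNRLoss(), weight=1.0)
    ref = [torch.randn(4, 16000) for _ in range(2)]   # [(batch, ...), ...] x n_spk
    inf = [torch.randn(4, 16000) for _ in range(2)]   # [(batch, ...), ...] x n_spk
    loss, stats, others = solver(ref, inf)
    print(loss, others.get("perm"))                   # best-permutation loss and (if returned) the permutation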

espnet2.enh.loss.wrappers.dpcl_solver

class espnet2.enh.loss.wrappers.dpcl_solver.DPCLSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward(ref, inf, others={})[source]

A naive DPCL solver

Parameters:
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …]

  • others (List) – other data included in this solver, e.g. “tf_embedding”, the learned embedding of all T-F bins (B, T * F, D)

Returns:

(torch.Tensor): minimum loss with the best permutation stats: (dict), for collecting training status others: reserved

Return type:

loss

espnet2.enh.loss.criterions.time_domain

class espnet2.enh.loss.criterions.time_domain.CISDRLoss(filter_length=512, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

CI-SDR loss

Reference:

Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation; C. Boeddeker et al., 2021; https://arxiv.org/abs/2011.15003

Parameters:
  • ref – (Batch, samples)

  • inf – (Batch, samples)

  • filter_length (int) – a time-invariant filter that allows slight distortion via filtering

Returns:

(Batch,)

Return type:

loss

forward(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.loss.criterions.time_domain.MultiResL1SpecLoss(window_sz=[512], hop_sz=None, eps=1e-08, time_domain_weight=0.5, name=None, only_for_test=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

Multi-Resolution L1 time-domain + STFT mag loss

Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-Se Submission to the L3DAS22 Challenge. ICASSP 2022 p. 9201-9205.

window_sz

(list) list of STFT window sizes.

hop_sz

(list, optional) list of hop_sizes, default is each window_sz // 2.

eps

(float) stability epsilon

time_domain_weight

(float) weight for time domain loss.

forward(target: torch.Tensor, estimate: torch.Tensor)[source]

forward.

Parameters:
  • target – (Batch, T)

  • estimate – (Batch, T)

Returns:

(Batch,)

Return type:

loss

get_magnitude(stft)[source]
property name
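
The general recipe (time-domain L1 combined with multi-resolution STFT-magnitude L1) can be sketched as follows; this is an illustration of the idea with assumed weighting, not espnet2's exact implementation:

    import torch

    def multires_l1_spec_sketch(target, estimate, window_sizes=(512,), time_weight=0.5):
        """Time-domain L1 plus multi-resolution STFT-magnitude L1 for (Batch, T) waveforms."""
        loss = time_weight * (target - estimate).abs().mean(-1)
        for win in window_sizes:
            spec_t = torch.stft(target, win, hop_length=win // 2,
                                window=torch.hann_window(win), return_complex=True)
            spec_e = torch.stft(estimate, win, hop_length=win // 2,
                                window=torch.hann_window(win), return_complex=True)
            mag_l1 = (spec_t.abs() - spec_e.abs()).abs().mean(dim=(-2, -1))
            loss = loss + (1.0 - time_weight) * mag_l1 / len(window_sizes)
        return loss  # (Batch,)

    print(multires_l1_spec_sketch(torch.randn(4, 16000), torch.randn(4, 16000)).shape)
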
class espnet2.enh.loss.criterions.time_domain.SDRLoss(filter_length=512, use_cg_iter=None, clamp_db=None, zero_mean=True, load_diag=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

SDR loss.

filter_length: int

The length of the distortion filter allowed (default: 512)

use_cg_iter:

If provided, an iterative method is used to solve for the distortion filter coefficients instead of direct Gaussian elimination. This can speed up the computation of the metrics in case the filters are long. Using a value of 10 here has been shown to provide good accuracy in most cases and is sufficient when using this loss to train neural separation networks.

clamp_db: float

clamp the output value in [-clamp_db, clamp_db]

zero_mean: bool

When set to True, the mean of all signals is subtracted beforehand.

load_diag:

If provided, this small value is added to the diagonal coefficients of the system matrices when solving for the filter coefficients. This can help stabilize the metric in the case where some of the reference signals may sometimes be zero.

forward(ref: torch.Tensor, est: torch.Tensor) → torch.Tensor[source]

SDR forward.

Parameters:
  • ref – Tensor, (…, n_samples) reference signal

  • est – Tensor (…, n_samples) estimated signal

Returns:

(…,)

the SDR loss (negative sdr)

Return type:

loss

class espnet2.enh.loss.criterions.time_domain.SISNRLoss(clamp_db=None, zero_mean=True, eps=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

SI-SNR (or named SI-SDR) loss

A more stable SI-SNR loss with clamp from fast_bss_eval.

clamp_db

float clamp the output value in [-clamp_db, clamp_db]

zero_mean

bool When set to True, the mean of all signals is subtracted beforehand.

eps

float Deprecated. Kept for compatibility.

forward(ref: torch.Tensor, est: torch.Tensor) → torch.Tensor[source]

SI-SNR forward.

Parameters:
  • ref – Tensor, (…, n_samples) reference signal

  • est – Tensor (…, n_samples) estimated signal

Returns:

(…,)

the SI-SDR loss (negative si-sdr)

Return type:

loss
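
For reference, the textbook SI-SNR definition can be sketched as below. espnet2's SISNRLoss wraps the more robust fast_bss_eval implementation (with optional clamping) and returns the negative value as a loss; this sketch only illustrates the underlying quantity:

    import torch

    def si_snr_sketch(ref: torch.Tensor, est: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Textbook SI-SNR in dB for (..., n_samples) signals (illustrative only)."""
        ref = ref - ref.mean(dim=-1, keepdim=True)   # zero-mean, as with zero_mean=True
        est = est - est.mean(dim=-1, keepdim=True)
        # project the estimate onto the reference to obtain the scaled target
        scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
        target = scale * ref
        noise = est - target
        return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

    ref = torch.randn(4, 16000)
    est = ref + 0.1 * torch.randn(4, 16000)
    print(si_snr_sketch(ref, est))                   # around 20 dB for this noise level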

class espnet2.enh.loss.criterions.time_domain.SNRLoss(eps=1.1920928955078125e-07, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.loss.criterions.time_domain.TimeDomainL1(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward(ref, inf) → torch.Tensor[source]

Time-domain L1 loss forward.

Parameters:
  • ref – (Batch, T) or (Batch, T, C)

  • inf – (Batch, T) or (Batch, T, C)

Returns:

(Batch,)

Return type:

loss

class espnet2.enh.loss.criterions.time_domain.TimeDomainLoss(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, abc.ABC

Base class for all time-domain Enhancement loss modules.

property is_dereverb_loss
property is_noise_loss
property name
property only_for_test
class espnet2.enh.loss.criterions.time_domain.TimeDomainMSE(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward(ref, inf) → torch.Tensor[source]

Time-domain MSE loss forward.

Parameters:
  • ref – (Batch, T) or (Batch, T, C)

  • inf – (Batch, T) or (Batch, T, C)

Returns:

(Batch,)

Return type:

loss

espnet2.enh.loss.criterions.tf_domain

class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainAbsCoherence(compute_on_mask=False, mask_type=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency absolute coherence loss.

Reference:

Independent Vector Analysis with Deep Neural Network Source Priors; Li et al 2020; https://arxiv.org/abs/2008.11273

Parameters:
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns:

(Batch,)

Return type:

loss

property mask_type
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainCrossEntropy(compute_on_mask=False, mask_type=None, ignore_id=-100, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency cross-entropy loss.

Parameters:
  • ref – (Batch, T) or (Batch, T, C)

  • inf – (Batch, T, nclass) or (Batch, T, C, nclass)

Returns:

(Batch,)

Return type:

loss

property mask_type
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainDPCL(compute_on_mask=False, mask_type='IBM', loss_type='dpcl', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency Deep Clustering loss.

References

[1] Deep clustering: Discriminative embeddings for segmentation and separation; John R. Hershey et al., 2016; https://ieeexplore.ieee.org/document/7471631

[2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html

Parameters:
  • ref – List[(Batch, T, F) * spks]

  • inf – (Batch, T*F, D)

Returns:

(Batch,)

Return type:

loss

property mask_type
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainL1(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency L1 loss.

Parameters:
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns:

(Batch,)

Return type:

loss

property mask_type
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, abc.ABC

Base class for all frequency-domain Enhancement loss modules.

abstract property compute_on_mask
create_mask_label(mix_spec, ref_spec, noise_spec=None)[source]
property is_dereverb_loss
property is_noise_loss
abstract property mask_type
property name
property only_for_test
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainMSE(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency MSE loss.

Parameters:
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns:

(Batch,)

Return type:

loss

property mask_type

espnet2.enh.loss.criterions.__init__

espnet2.enh.loss.criterions.abs_loss

class espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Base class for all Enhancement loss modules.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(ref, inf) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name
property only_for_test

espnet2.enh.separator.ineube_separator

class espnet2.enh.separator.ineube_separator.iNeuBe(n_spk=1, n_fft=512, stride=128, window='hann', mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation='elu', output_from='dnn1', n_chunks=3, freeze_dnn1=False, tik_eps=1e-08)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

iNeuBe, iterative neural/beamforming enhancement

Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-SE Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.

NOTES: As outlined in the Reference, this model works best when coupled with the MultiResL1SpecLoss defined in criterions/time_domain.py. The model is trained with variance-normalized mixture input and target, e.g., with a mixture of shape [batch, microphones, samples] you normalize it by dividing by torch.std(mixture, (1, 2)). You must do the same for the target signal. In the Reference, the variance normalization was performed offline (we normalized by the std computed on the entire training set, not for each input separately). However, we found that normalizing each input and target pair separately also works well; a minimal sketch of this per-utterance normalization is shown below.
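A minimal sketch of the per-utterance variance normalization described above; all shapes are hypothetical, the key point is that mixture and target are divided by the same per-utterance standard deviation.

```python
import torch

# Hypothetical batch: 4 utterances, 2 microphones, 4 s of 16 kHz audio.
mixture = torch.randn(4, 2, 64000)   # [batch, microphones, samples]
target = torch.randn(4, 2, 64000)    # same layout as the mixture

# Normalize both signals by the std of the mixture, computed per utterance.
std = torch.std(mixture, (1, 2), keepdim=True)
mixture = mixture / std
target = target / std
```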

Parameters:
  • n_spk – number of output sources/speakers.

  • n_fft – stft window size.

  • stride – stft stride.

  • window – STFT window type; choose between ‘hamming’, ‘hanning’, or None.

  • mic_channels – number of microphones channels (only fixed-array geometry supported).

  • hid_chans – number of channels in the subsampling/upsampling conv layers.

  • hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).

  • ksz_dense – kernel size in the densenet layers throughout iNeuBe.

  • ksz_tcn – kernel size in the TCN submodule.

  • tcn_repeats – number of repetitions of blocks in the TCN submodule.

  • tcn_blocks – number of blocks in the TCN submodule.

  • tcn_channels – number of channels in the TCN submodule.

  • activation – activation function used throughout the iNeuBe model; any torch-supported activation can be used, e.g. ‘relu’ or ‘elu’.

  • output_from – output the estimate from ‘dnn1’, ‘mfmcwf’ or ‘dnn2’.

  • n_chunks – number of future and past frames to consider for mfMCWF computation.

  • freeze_dnn1 – whether to freeze the dnn1 parameters while training dnn2.

  • tik_eps – diagonal loading in the mfMCWF computation.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor/ComplexTensor) – batched multi-channel audio tensor with C audio channels and T samples [B, T, C]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data, currently unused in this model.

Returns:

[(B, T), …] list of length n_spk of mono audio tensors with T samples.

ilens (torch.Tensor): (B,)

additional (Dict or None): other data, currently unused in this model; it is also returned in the output.

Return type:

enhanced (List[Union[torch.Tensor, ComplexTensor]])

static mfmcwf(mixture, estimate, n_chunks, tik_eps)[source]

Multi-frame multi-channel Wiener filter (mfMCWF).

Parameters:
  • mixture (torch.Tensor) – multi-channel STFT complex mixture tensor, of shape [B, T, C, F] batch, frames, microphones, frequencies.

  • estimate (torch.Tensor) – monaural STFT complex estimate of target source [B, T, F] batch, frames, frequencies.

  • n_chunks (int) – number of past and future mfMCWF frames. If 0 then standard MCWF.

  • tik_eps (float) – diagonal loading for matrix inversion in MCWF computation.

Returns:

monaural STFT complex estimate of the target source after mfMCWF [B, T, F]: batch, frames, frequencies.

Return type:

beamformed (torch.Tensor)

property num_spk
static pad2(input_tensor, target_len)[source]
static unfold(tf_rep, chunk_size)[source]

Unfold the STFT representation to add context along the microphone/channel axis.

Parameters:
  • tf_rep (torch.Tensor) – 3D tensor (monaural complex STFT) of shape [B, T, F]: batch, frames, frequencies.

  • chunk_size (int) – number of past and future frames to consider.

Returns:

complex STFT tensor with an added context channel.

The shape is now [B, T, C, F] (batch, frames, context, frequencies), i.e., the same shape as a multi-channel STFT with C microphones.

Return type:

est_unfolded (torch.Tensor)

espnet2.enh.separator.dan_separator

class espnet2.enh.separator.dan_separator.DANSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Deep Attractor Network Separator

Reference:

Deep Attractor Network for Single-Microphone Speaker Separation; Zhuo Chen et al., 2017; https://pubmed.ncbi.nlm.nih.gov/29430212/

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘blstm’, ‘lstm’ etc.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 2.

  • unit – int, dimension of the hidden state.

  • emb_D – int, dimension of the embedding vector for each T-F bin.

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model e.g. “feature_ref”: list of reference spectra List[(B, T, F)]

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk
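A hedged usage sketch, assuming espnet2 is installed: during training the reference spectra are passed through additional under the "feature_ref" key, as documented in forward above. All sizes below are illustrative.

```python
import torch
from espnet2.enh.separator.dan_separator import DANSeparator

B, T, F = 4, 100, 129
separator = DANSeparator(input_dim=F, num_spk=2, emb_D=40)

feature_mix = torch.randn(B, T, F)                 # encoded mixture feature (B, T, F)
ilens = torch.full((B,), T, dtype=torch.long)
refs = [torch.randn(B, T, F) for _ in range(2)]    # reference spectra used to form attractors

masked, olens, others = separator(
    feature_mix, ilens, additional={"feature_ref": refs}
)
# masked: list of 2 tensors (B, T, F); others contains 'mask_spk1' and 'mask_spk2'.
```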

espnet2.enh.separator.asteroid_models

class espnet2.enh.separator.asteroid_models.AsteroidModel_Converter(encoder_output_dim: int, model_name: str, num_spk: int, pretrained_path: str = '', loss_type: str = 'si_snr', **model_related_kwargs)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

The class to convert models from asteroid to AbsSeparator.

forward(input: torch.Tensor, ilens: torch.Tensor = None, additional: Optional[Dict] = None)[source]

Whole forward of asteroid models.

Parameters:
  • input (torch.Tensor) – Raw Waveforms [B, T]

  • ilens (torch.Tensor) – input lengths [B]

  • additional (Dict or None) – other data included in model

Returns:

[(B, T), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, T), ‘mask_spk2’: torch.Tensor(Batch, T), … ‘mask_spkn’: torch.Tensor(Batch, T),

]

Return type:

estimated waveforms (List[torch.Tensor])

forward_rawwav(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Output with waveforms.

property num_spk

espnet2.enh.separator.svoice_separator

class espnet2.enh.separator.svoice_separator.Decoder(kernel_size)[source]

Bases: torch.nn.modules.module.Module

forward(est_source)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.separator.svoice_separator.Encoder(enc_kernel_size: int, enc_feat_dim: int)[source]

Bases: torch.nn.modules.module.Module

forward(mixture)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.separator.svoice_separator.SVoiceSeparator(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

SVoice model for speech separation.

Reference:

Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531

Parameters:
  • enc_dim – int, dimension of the encoder module’s output. (Default: 128)

  • kernel_size – int, the kernel size of Conv1D layer in both encoder and decoder modules. (Default: 8)

  • hidden_size – int, dimension of the hidden state in RNN layers. (Default: 128)

  • num_spk – int, the number of speakers in the output. (Default: 2)

  • num_layers – int, number of stacked MulCat blocks. (Default: 4)

  • segment_size – dual-path segment size. (Default: 20)

  • bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)

  • input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)

forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk
espnet2.enh.separator.svoice_separator.overlap_and_add(signal, frame_step)[source]

Reconstructs a signal from a framed representation.

Adds potentially overlapping frames of a signal with shape […, frames, frame_length], offsetting subsequent frames by frame_step. The resulting tensor has shape […, output_size] where

output_size = (frames - 1) * frame_step + frame_length

Parameters:
  • signal – A […, frames, frame_length] Tensor. All dimensions may be unknown, and rank must be at least 2.

  • frame_step – An integer denoting overlap offsets. Must be less than or equal to frame_length.

Returns:

A Tensor with shape […, output_size] containing the overlap-added frames of signal’s inner-most two dimensions, where output_size = (frames - 1) * frame_step + frame_length.

Based on https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/signal/python/ops/reconstruction_ops.py
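A loop-based reference sketch of the same operation (not the vectorized implementation used in this module), useful for checking the output_size formula:

```python
import torch

def overlap_and_add_reference(signal, frame_step):
    """Naive reference: out[..., i*frame_step : i*frame_step + frame_length] += frame i."""
    *batch, frames, frame_length = signal.shape
    output_size = (frames - 1) * frame_step + frame_length
    out = signal.new_zeros(*batch, output_size)
    for i in range(frames):
        start = i * frame_step
        out[..., start : start + frame_length] += signal[..., i, :]
    return out

x = torch.ones(3, 4, 8)                         # 3 signals, 4 frames of length 8
print(overlap_and_add_reference(x, 2).shape)    # torch.Size([3, 14]) = (4 - 1) * 2 + 8
```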

espnet2.enh.separator.tfgridnet_separator

class espnet2.enh.separator.tfgridnet_separator.GridNetBlock(emb_dim, emb_ks, emb_hs, n_freqs, hidden_channels, n_head=4, approx_qk_dim=512, activation='prelu', eps=1e-05)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

GridNetBlock Forward.

Parameters:
  • x – [B, C, T, Q]

  • out – [B, C, T, Q]

class espnet2.enh.separator.tfgridnet_separator.LayerNormalization4D(input_dimension, eps=1e-05)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.separator.tfgridnet_separator.LayerNormalization4DCF(input_dimension, eps=1e-05)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.separator.tfgridnet_separator.TFGridNet(input_dim, n_srcs=2, n_fft=128, stride=64, window='hann', n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_approx_qk_dim=512, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05, use_builtin_complex=False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Offline TFGridNet

Reference: [1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in arXiv preprint arXiv:2211.12433, 2022. [2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in arXiv preprint arXiv:2209.03952, 2022.

NOTES: As outlined in the Reference, this model works best when trained with variance-normalized mixture input and target, e.g., with a mixture of shape [batch, samples, microphones], you normalize it by dividing by torch.std(mixture, (1, 2)). You must do the same for the target signals. This is especially encouraged when not using scale-invariant loss functions such as SI-SDR.

Parameters:
  • input_dim – placeholder, not used

  • n_srcs – number of output sources/speakers.

  • n_fft – stft window size.

  • stride – stft stride.

  • window – STFT window type; choose between ‘hamming’, ‘hanning’, or None.

  • n_imics – number of microphones channels (only fixed-array geometry supported).

  • n_layers – number of TFGridNet blocks.

  • lstm_hidden_units – number of hidden units in LSTM.

  • attn_n_head – number of heads in self-attention

  • attn_approx_qk_dim – approximate dimension of frame-level key and value tensors

  • emb_dim – embedding dimension

  • emb_ks – kernel size for unfolding and deconv1D

  • emb_hs – hop size for unfolding and deconv1D

  • activation – activation function used throughout the TFGridNet model; any torch-supported activation can be used, e.g. ‘relu’ or ‘elu’.

  • eps – small epsilon for normalization layers.

  • use_builtin_complex – whether to use builtin complex type or not.

forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor) – batched multi-channel audio tensor with M audio channels and N samples [B, N, M]

  • ilens (torch.Tensor) – input lengths [B]

  • additional (Dict or None) – other data, currently unused in this model.

Returns:

[(B, T), …] list of length n_srcs of mono audio tensors with T samples.

ilens (torch.Tensor): (B,)

additional (Dict or None): other data, currently unused in this model; it is also returned in the output.

Return type:

enhanced (List[torch.Tensor])

property num_spk
static pad2(input_tensor, target_len)[source]
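A minimal usage sketch, assuming espnet2 is installed. TFGridNet consumes raw multi-channel waveforms [B, N, M] and returns n_srcs mono waveforms; input_dim is a placeholder per the docstring, and the small n_layers value here is only to keep the example light.

```python
import torch
from espnet2.enh.separator.tfgridnet_separator import TFGridNet

B, N, M = 2, 32000, 1           # 2 utterances, 2 s at 16 kHz, single microphone
model = TFGridNet(input_dim=None, n_srcs=2, n_fft=128, stride=64, n_imics=M, n_layers=1)

mix = torch.randn(B, N, M)
mix = mix / mix.std(dim=(1, 2), keepdim=True)    # variance normalization (see NOTES above)
ilens = torch.full((B,), N, dtype=torch.long)

enhanced, olens, _ = model(mix, ilens)
# enhanced: list of n_srcs mono waveforms, each of shape (B, T) with T close to N
```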

espnet2.enh.separator.abs_separator

class espnet2.enh.separator.abs_separator.AbsSeparator[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_streaming(input_frame: torch.Tensor, buffer=None)[source]
abstract property num_spk

espnet2.enh.separator.skim_separator

class espnet2.enh.separator.skim_separator.SkiMSeparator(input_dim: int, causal: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_overlap: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Skipping Memory (SkiM) Separator

Parameters:
  • input_dim – input feature dimension

  • causal – bool, whether the system is causal.

  • num_spk – number of target speakers.

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of SkiM blocks. Default is 3.

  • unit – int, dimension of the hidden state.

  • segment_size – segmentation size for splitting long features

  • dropout – float, dropout ratio. Default is 0.

  • mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.

  • seg_overlap – Bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

forward_streaming(input_frame: torch.Tensor, states=None)[source]
property num_spk
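A minimal usage sketch of a causal SkiM configuration, assuming espnet2 is installed; in practice the encoded feature would come from a learned encoder such as a convolutional frontend, the random tensor here is only a stand-in.

```python
import torch
from espnet2.enh.separator.skim_separator import SkiMSeparator

B, T, N = 4, 200, 64
separator = SkiMSeparator(
    input_dim=N, causal=True, num_spk=2, layer=3, unit=128, segment_size=20, mem_type="hc"
)

feats = torch.randn(B, T, N)                    # encoded feature (B, T, N)
ilens = torch.full((B,), T, dtype=torch.long)

masked, olens, others = separator(feats, ilens)
# masked: list of 2 tensors (B, T, N); others holds the per-speaker masks.
```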

espnet2.enh.separator.transformer_separator

class espnet2.enh.separator.transformer_separator.TransformerSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, use_scaled_pos_enc: bool = True, nonlinear: str = 'relu')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Transformer separator.

Parameters:
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • adim (int) – Dimension of attention.

  • aheads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • layers (int) – The number of transformer blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • use_scaled_pos_enc (bool) – use scaled positional encoding or not

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.dpcl_separator

class espnet2.enh.separator.dpcl_separator.DPCLSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Deep Clustering Separator.

References

[1] Deep clustering: Discriminative embeddings for segmentation and separation; John R. Hershey et al., 2016; https://ieeexplore.ieee.org/document/7471631

[2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘blstm’, ‘lstm’ etc.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 2.

  • unit – int, dimension of the hidden state.

  • emb_D – int, dimension of the feature vector for a tf-bin.

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. tf_embedding: OrderedDict[

’tf_embedding’: learned embedding of all T-F bins (B, T * F, D),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.dptnet_separator

class espnet2.enh.separator.dptnet_separator.DPTNetSeparator(input_dim: int, post_enc_relu: bool = True, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, unit: int = 256, att_heads: int = 4, dropout: float = 0.0, activation: str = 'relu', norm_type: str = 'gLN', layer: int = 6, segment_size: int = 20, nonlinear: str = 'relu')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Dual-Path Transformer Network (DPTNet) Separator

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • unit – int, dimension of the hidden state.

  • att_heads – number of attention heads.

  • dropout – float, dropout ratio. Default is 0.

  • activation – activation function applied at the output of RNN.

  • norm_type – type of normalization to use after each inter- or intra-chunk Transformer block.

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 6.

  • segment_size – dual-path segment size

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

merge_feature(x, length=None)[source]
property num_spk
split_feature(x)[source]

espnet2.enh.separator.conformer_separator

class espnet2.enh.separator.conformer_separator.ConformerSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, input_layer: str = 'linear', positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, nonlinear: str = 'relu', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, padding_idx: int = -1)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Conformer separator.

Parameters:
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • adim (int) – Dimension of attention.

  • aheads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • layers (int) – The number of transformer blocks.

  • dropout_rate (float) – Dropout rate.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

  • conformer_pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • conformer_self_attn_layer_type (str) – Encoder attention layer type.

  • conformer_activation_type (str) – Encoder activation function type.

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • use_macaron_style_in_conformer (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_in_conformer (bool) – Whether to use convolution module.

  • conformer_enc_kernel_size (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.fasnet_separator

class espnet2.enh.separator.fasnet_separator.FaSNetSeparator(input_dim: int, enc_dim: int, feature_dim: int, hidden_dim: int, layer: int, segment_size: int, num_spk: int, win_len: int, context_len: int, fasnet_type: str, dropout: float = 0.0, sr: int = 16000, predict_noise: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Filter-and-sum Network (FaSNet) Separator

Parameters:
  • input_dim – required by AbsSeparator. Not used in this model.

  • enc_dim – encoder dimension

  • feature_dim – feature dimension

  • hidden_dim – hidden dimension in DPRNN

  • layer – number of DPRNN blocks in iFaSNet

  • segment_size – dual-path segment size

  • num_spk – number of speakers

  • win_len – window length in milliseconds

  • context_len – context length in milliseconds

  • fasnet_type – ‘fasnet’ or ‘ifasnet’. Select between the original FaSNet and the implicit FaSNet (iFaSNet).

  • dropout – dropout rate. Default is 0.

  • sr – sample rate of the input audio

  • predict_noise – whether to output the estimated noise signal

forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor) – (Batch, samples, channels)

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

separated (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.dprnn_separator

class espnet2.enh.separator.dprnn_separator.DPRNNSeparator(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Dual-Path RNN (DPRNN) Separator

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 3.

  • unit – int, dimension of the hidden state.

  • segment_size – dual-path segment size

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.tcn_separator

class espnet2.enh.separator.tcn_separator.TCNSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', nonlinear: str = 'relu')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Temporal Convolution Separator

Parameters:
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • layer – int, number of layers in each stack.

  • stack – int, number of stacks

  • bottleneck_dim – bottleneck dimension

  • hidden_dim – number of convolution channel

  • kernel – int, kernel size.

  • causal – bool, default False.

  • norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

forward_streaming(input_frame: torch.Tensor, buffer=None)[source]
property num_spk
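To show where a separator sits in the processing chain, here is a hedged end-to-end sketch wiring a learned encoder, the TCNSeparator, and the ConvDecoder documented later in this section. It assumes espnet2 is installed and that espnet2.enh.encoder.conv_encoder.ConvEncoder exists with the mirrored (channel, kernel_size, stride) signature; that encoder is not documented in this section.

```python
import torch
from espnet2.enh.encoder.conv_encoder import ConvEncoder   # assumed companion encoder
from espnet2.enh.separator.tcn_separator import TCNSeparator
from espnet2.enh.decoder.conv_decoder import ConvDecoder

B, samples = 2, 16000
encoder = ConvEncoder(channel=256, kernel_size=20, stride=10)
separator = TCNSeparator(input_dim=256, num_spk=2, layer=8, stack=3)
decoder = ConvDecoder(channel=256, kernel_size=20, stride=10)

mix = torch.randn(B, samples)                        # single-channel mixture waveforms
ilens = torch.full((B,), samples, dtype=torch.long)

feats, flens = encoder(mix, ilens)                   # (B, T, 256) encoded feature
masked, flens, others = separator(feats, flens)      # list of num_spk tensors (B, T, 256)
wavs = [decoder(m, ilens)[0] for m in masked]        # list of (B, ~samples) waveforms
```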

espnet2.enh.separator.__init__

espnet2.enh.separator.rnn_separator

class espnet2.enh.separator.rnn_separator.RNNSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'sigmoid', layer: int = 3, unit: int = 512, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

RNN Separator

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘blstm’, ‘lstm’ etc.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 3.

  • unit – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

forward_streaming(input_frame: torch.Tensor, states=None)[source]
property num_spk

espnet2.enh.separator.dpcl_e2e_separator

class espnet2.enh.separator.dpcl_e2e_separator.DPCLE2ESeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0, alpha: float = 5.0, max_iteration: int = 500, threshold: float = 1e-05)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Deep Clustering End-to-End Separator

References

Single-Channel Multi-Speaker Separation using Deep Clustering; Yusuf Isik et al., 2016; https://www.isca-speech.org/archive/interspeech_2016/isik16_interspeech.html

Parameters:
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘blstm’, ‘lstm’ etc.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 2.

  • unit – int, dimension of the hidden state.

  • emb_D – int, dimension of the feature vector for a tf-bin.

  • dropout – float, dropout ratio. Default is 0.

  • alpha – float, the clustering hardness parameter.

  • max_iteration – int, the max iterations of soft kmeans.

  • threshold – float, the threshold to end the soft k-means process.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.dc_crn_separator

class espnet2.enh.separator.dc_crn_separator.DC_CRNSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels: int = 8, enc_kernel_size: Tuple = (1, 3), enc_padding: Tuple = (0, 1), enc_last_kernel_size: Tuple = (1, 4), enc_last_stride: Tuple = (1, 2), enc_last_padding: Tuple = (0, 1), enc_layers: int = 5, skip_last_kernel_size: Tuple = (1, 3), skip_last_stride: Tuple = (1, 1), skip_last_padding: Tuple = (0, 1), glstm_groups: int = 2, glstm_layers: int = 2, glstm_bidirectional: bool = False, glstm_rearrange: bool = False, mode: str = 'masking', ref_channel: int = 0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Densely-Connected Convolutional Recurrent Network (DC-CRN) Separator

Reference:

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones; Tan et al., 2020 https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf

Parameters:
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • predict_noise – whether to output the estimated noise signal

  • input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers. Its length should be (number of DenselyConnectedBlock layers).

  • enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder

  • enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder

  • enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder

  • enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder

  • enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder

  • skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways

  • glstm_groups (int) – number of groups in each Grouped LSTM layer

  • glstm_layers (int) – number of Grouped LSTM layers

  • glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers

  • glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer

  • output_channels (int) – number of output channels (even number)

  • mode (str) – one of (“mapping”, “masking”); “mapping”: complex spectral mapping, “masking”: complex masking.

  • ref_channel (int) – index of the reference microphone

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

DC-CRN Separator Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [Batch, T, F] or [Batch, T, C, F]

  • ilens (torch.Tensor) – input lengths [Batch,]

Returns:

[(Batch, T, F), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.dccrn_separator

class espnet2.enh.separator.dccrn_separator.DCCRNSeparator(input_dim: int, num_spk: int = 1, rnn_layer: int = 2, rnn_units: int = 256, masking_mode: str = 'E', use_clstm: bool = True, bidirectional: bool = False, use_cbn: bool = False, kernel_size: int = 5, kernel_num: List[int] = [32, 64, 128, 256, 256, 256], use_builtin_complex: bool = True, use_noise_mask: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

DCCRN separator.

Parameters:
  • input_dim (int) – input dimension.

  • num_spk (int, optional) – number of speakers. Defaults to 1.

  • rnn_layer (int, optional) – number of lstm layers in the crn. Defaults to 2.

  • rnn_units (int, optional) – rnn units. Defaults to 256.

  • masking_mode (str, optional) – usage of the estimated mask. Defaults to “E”.

  • use_clstm (bool, optional) – whether to use complex LSTM. Defaults to True.

  • bidirectional (bool, optional) – whether to use BLSTM. Defaults to False.

  • use_cbn (bool, optional) – whether to use complex BN. Defaults to False.

  • kernel_size (int, optional) – convolution kernel size. Defaults to 5.

  • kernel_num (list, optional) – output dimension of each layer of the encoder.

  • use_builtin_complex (bool, optional) – torch.complex if True, else ComplexTensor.

  • use_noise_mask (bool, optional) – whether to estimate the mask of noise.

apply_masks(masks: List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], real: torch.Tensor, imag: torch.Tensor)[source]

apply masks

Parameters:
  • masks – est_masks, [(B, T, F), …]

  • real (torch.Tensor) – real part of the noisy spectrum, (B, F, T)

  • imag (torch.Tensor) – imag part of the noisy spectrum, (B, F, T)

Returns:

[(B, T, F), …]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

create_masks(mask_tensor: torch.Tensor)[source]

create estimated mask for each speaker

Parameters:

mask_tensor (torch.Tensor) – output of decoder, shape(B, 2*num_spk, F-1, T)

flatten_parameters()[source]
forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

[(B, T, F), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk

espnet2.enh.separator.neural_beamformer

class espnet2.enh.separator.neural_beamformer.NeuralBeamformer(input_dim: int, num_spk: int = 1, loss_type: str = 'mask_mse', use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', multi_source_wpe: bool = True, wnormalization: bool = False, use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, bdropout_rate: float = 0.0, shared_power: bool = True, use_torchaudio_api: bool = False, diagonal_loading: bool = True, diag_eps_wpe: float = 1e-07, diag_eps_bf: float = 1e-07, mask_flooring: bool = False, flooring_thres_wpe: float = 1e-06, flooring_thres_bf: float = 1e-06, use_torch_solver: bool = True)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters:
  • input (torch.complex64/ComplexTensor) – mixed speech [Batch, Frames, Channel, Freq]

  • ilens (torch.Tensor) – input lengths [Batch]

  • additional (Dict or None) – other data included in model NOTE: not used in this model

Returns:

List[torch.complex64/ComplexTensor] output lengths other predicted data: OrderedDict[

’dereverb1’: ComplexTensor(Batch, Frames, Channel, Freq), ‘mask_dereverb1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_noise1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_spk1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Channel, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Channel, Freq),

]

Return type:

enhanced speech (single-channel)

property num_spk

espnet2.enh.decoder.stft_decoder

class espnet2.enh.decoder.stft_decoder.STFTDecoder(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

STFT decoder for speech enhancement and separation

forward(input: torch_complex.tensor.ComplexTensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (ComplexTensor) – spectrum [Batch, T, (C,) F]

  • ilens (torch.Tensor) – input lengths [Batch]

forward_streaming(input_frame: torch.Tensor)[source]

Forward.

Parameters:
  • input (ComplexTensor) – spectrum [Batch, 1, F]

Returns:

output – wavs [Batch, 1, self.win_length]

streaming_merge(chunks, ilens=None)[source]

streaming_merge. It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.

Parameters:
  • chunks – List [(B, frame_size),]

  • ilens – [B]

Returns:

[B, T]

Return type:

merge_audio

espnet2.enh.decoder.null_decoder

class espnet2.enh.decoder.null_decoder.NullDecoder[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

Null decoder; returns its arguments unchanged.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward. The input should be the waveform already.

Parameters:
  • input (torch.Tensor) – wav [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

espnet2.enh.decoder.abs_decoder

class espnet2.enh.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_streaming(input_frame: torch.Tensor)[source]
streaming_merge(chunks: torch.Tensor, ilens: torch._VariableFunctionsClass.tensor = None)[source]

streaming_merge. It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.

Parameters:
  • chunks – List [(B, frame_size),]

  • ilens – [B]

Returns:

[B, T]

Return type:

merge_audio

espnet2.enh.decoder.__init__

espnet2.enh.decoder.conv_decoder

class espnet2.enh.decoder.conv_decoder.ConvDecoder(channel: int, kernel_size: int, stride: int)[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

Transposed Convolutional decoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – spectrum [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

forward_streaming(input_frame: torch.Tensor)[source]
streaming_merge(chunks: torch.Tensor, ilens: torch._VariableFunctionsClass.tensor = None)[source]

streaming_merge. It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.

Parameters:
  • chunks – List [(B, frame_size),]

  • ilens – [B]

Returns:

[B, T]

Return type:

merge_audio

espnet2.enh.extractor.abs_extractor

class espnet2.enh.extractor.abs_extractor.AbsExtractor[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor, input_aux: torch.Tensor, ilens_aux: torch.Tensor, suffix_tag: str = '') → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.extractor.__init__

espnet2.enh.extractor.td_speakerbeam_extractor

class espnet2.enh.extractor.td_speakerbeam_extractor.TDSpeakerBeamExtractor(input_dim: int, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, skip_dim: int = 128, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', pre_nonlinear: str = 'prelu', nonlinear: str = 'relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, use_spk_emb: bool = False, spk_emb_dim: int = 256)[source]

Bases: espnet2.enh.extractor.abs_extractor.AbsExtractor

Time-Domain SpeakerBeam Extractor.

Parameters:
  • input_dim – input feature dimension

  • layer – int, number of layers in each stack

  • stack – int, number of stacks

  • bottleneck_dim – bottleneck dimension

  • hidden_dim – number of convolution channel

  • skip_dim – int, number of skip connection channels

  • kernel – int, kernel size.

  • causal – bool, default False.

  • norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’

  • pre_nonlinear – the nonlinear function right before mask estimation; select from ‘prelu’, ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’

  • i_adapt_layer – int, index of adaptation layer

  • adapt_layer_type – str, type of adaptation layer; see espnet2.enh.layers.adapt_layers for options

  • adapt_enroll_dim – int, dimensionality of the speaker embedding

  • use_spk_emb – bool, whether to use speaker embeddings as enrollment

  • spk_emb_dim – int, dimension of input speaker embeddings; only used when use_spk_emb is True

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, input_aux: torch.Tensor, ilens_aux: torch.Tensor, suffix_tag: str = '') → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

TD-SpeakerBeam Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

  • input_aux (torch.Tensor or ComplexTensor) – Encoded auxiliary feature for the target speaker [B, T, N] or [B, N]

  • ilens_aux (torch.Tensor) – input lengths of auxiliary input for the target speaker [Batch]

  • suffix_tag (str) – suffix to append to the keys in others

Returns:

[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[

f’mask{suffix_tag}’: torch.Tensor(Batch, Frames, Freq), f’enroll_emb{suffix_tag}’: torch.Tensor(Batch, adapt_enroll_dim/adapt_enroll_dim*2),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])
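A minimal usage sketch, assuming espnet2 is installed: both the mixture and the enrollment utterance are first encoded into [B, T, N] features (e.g. by a convolutional encoder); all sizes below are illustrative.

```python
import torch
from espnet2.enh.extractor.td_speakerbeam_extractor import TDSpeakerBeamExtractor

B, T, T_aux, N = 2, 300, 150, 256
extractor = TDSpeakerBeamExtractor(input_dim=N, layer=8, stack=3, adapt_enroll_dim=128)

mix_feat = torch.randn(B, T, N)            # encoded mixture feature
enroll_feat = torch.randn(B, T_aux, N)     # encoded enrollment of the target speaker
ilens = torch.full((B,), T, dtype=torch.long)
ilens_aux = torch.full((B,), T_aux, dtype=torch.long)

masked, olens, others = extractor(mix_feat, ilens, enroll_feat, ilens_aux)
# masked[0]: (B, T, N) feature of the extracted target speaker;
# others contains the estimated mask and the enrollment embedding (see the keys above).
```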