espnet2.diar package

espnet2.diar.abs_diar

class espnet2.diar.abs_diar.AbsDiarization[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]
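
A minimal sketch of a subclass implementing this interface (the IdentityDiarization name and its pass-through behavior are illustrative, not part of ESPnet):

    import torch
    from collections import OrderedDict
    from typing import Tuple

    from espnet2.diar.abs_diar import AbsDiarization

    class IdentityDiarization(AbsDiarization):
        """Toy subclass that passes features through unchanged."""

        def forward(
            self, input: torch.Tensor, ilens: torch.Tensor
        ) -> Tuple[torch.Tensor, torch.Tensor, OrderedDict]:
            # A real model would predict per-frame speaker activities here.
            return input, ilens, OrderedDict()

        def forward_rawwav(
            self, input: torch.Tensor, ilens: torch.Tensor
        ) -> Tuple[torch.Tensor, torch.Tensor, OrderedDict]:
            # Same computation, but starting from raw waveforms.
            return self.forward(input, ilens)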

espnet2.diar.espnet_model

class espnet2.diar.espnet_model.ESPnetDiarizationModel(frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], label_aggregator: torch.nn.modules.module.Module, encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, decoder: espnet2.diar.decoder.abs_decoder.AbsDecoder, attractor: Optional[espnet2.diar.attractor.abs_attractor.AbsAttractor], diar_weight: float = 1.0, attractor_weight: float = 1.0)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speaker Diarization model

If “attractor” is “None”, SA-EEND is used; otherwise, EEND-EDA is used. For details on SA-EEND and EEND-EDA, refer to the following papers:

SA-EEND: https://arxiv.org/pdf/1909.06247.pdf
EEND-EDA: https://arxiv.org/pdf/2005.09921.pdf, https://arxiv.org/pdf/2106.10654.pdf
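
A hedged construction sketch showing how the attractor argument selects between the two variants (the TransformerEncoder choice, its hyperparameters, and the None placeholders for frontend/specaug/normalize are illustrative assumptions, not a prescribed config):

    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder
    from espnet2.diar.attractor.rnn_attractor import RnnAttractor
    from espnet2.diar.decoder.linear_decoder import LinearDecoder
    from espnet2.diar.espnet_model import ESPnetDiarizationModel
    from espnet2.diar.label_processor import LabelProcessor

    encoder = TransformerEncoder(input_size=80, output_size=256)  # assumed 80-dim input feats

    model = ESPnetDiarizationModel(
        frontend=None,      # a real recipe would plug in e.g. a log-mel frontend here
        specaug=None,
        normalize=None,
        label_aggregator=LabelProcessor(),
        encoder=encoder,
        decoder=LinearDecoder(encoder_output_size=256, num_spk=2),
        attractor=None,     # None -> SA-EEND
        # attractor=RnnAttractor(encoder_output_size=256, unit=256),  # not None -> EEND-EDA
    )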

attractor_loss(att_prob, label)[source]
static calc_diarization_error(pred, label, length)[source]
collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, spk_labels: torch.Tensor = None, spk_labels_lengths: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor][source]
create_length_mask(length, max_len, num_output)[source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, bottleneck_feats: torch.Tensor, bottleneck_feats_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch,)

  • bottleneck_feats – (Batch, Length, …): used for enh + diar

forward(speech: torch.Tensor, speech_lengths: torch.Tensor = None, spk_labels: torch.Tensor = None, spk_labels_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, samples)

  • speech_lengths – (Batch,); defaults to None for the chunk iterator, because the chunk iterator does not return speech_lengths (see espnet2/iterators/chunk_iter_factory.py)

  • spk_labels – (Batch, )

  • kwargs – “utt_id” is among the inputs.

pit_loss(pred, label, lengths)[source]
pit_loss_single_permute(pred, label, length)[source]

espnet2.diar.label_processor

class espnet2.diar.label_processor.LabelProcessor(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]

Bases: torch.nn.modules.module.Module

Label aggregator for speaker diarization

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input – (Batch, Nsamples, Label_dim)

  • ilens – (Batch)

Returns:

(Batch, Frames, Label_dim) olens: (Batch)

Return type:

output
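
Usage sketch based on the shapes documented above (the sample values are illustrative):

    import torch
    from espnet2.diar.label_processor import LabelProcessor

    proc = LabelProcessor(win_length=512, hop_length=128, center=True)
    labels = torch.randint(0, 2, (2, 16000, 3)).float()  # (Batch, Nsamples, Label_dim)
    ilens = torch.tensor([16000, 12000])

    output, olens = proc(labels, ilens)
    # output: (Batch, Frames, Label_dim); olens: (Batch,)
    # Sample-level speaker labels are aggregated into STFT-style frames.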

espnet2.diar.__init__

espnet2.diar.attractor.abs_attractor

class espnet2.diar.attractor.abs_attractor.AbsAttractor[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(enc_input: torch.Tensor, ilens: torch.Tensor, dec_input: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.diar.attractor.rnn_attractor

class espnet2.diar.attractor.rnn_attractor.RnnAttractor(encoder_output_size: int, layer: int = 1, unit: int = 512, dropout: float = 0.1, attractor_grad: bool = True)[source]

Bases: espnet2.diar.attractor.abs_attractor.AbsAttractor

Encoder-decoder attractor for speaker diarization

forward(enc_input: torch.Tensor, ilens: torch.Tensor, dec_input: torch.Tensor)[source]

Forward.

Parameters:
  • enc_input (torch.Tensor) – hidden_space [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

  • dec_input (torch.Tensor) – decoder input (zeros) [Batch, num_spk + 1, F]

Returns:

[Batch, num_spk + 1, F] att_prob: [Batch, num_spk + 1, 1]

Return type:

attractor
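
Usage sketch following the documented shapes (here unit is set equal to encoder_output_size so that the attractor dimension matches F; this, and the concrete sizes, are assumptions about a typical config):

    import torch
    from espnet2.diar.attractor.rnn_attractor import RnnAttractor

    attractor_net = RnnAttractor(encoder_output_size=256, unit=256)
    enc_input = torch.randn(2, 100, 256)   # hidden_space [Batch, T, F]
    ilens = torch.tensor([100, 80])        # input lengths [Batch]
    dec_input = torch.zeros(2, 3, 256)     # zeros [Batch, num_spk + 1, F], num_spk = 2

    attractor, att_prob = attractor_net(enc_input, ilens, dec_input)
    # attractor: [2, 3, 256]; att_prob: [2, 3, 1]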

espnet2.diar.attractor.__init__

espnet2.diar.layers.abs_mask

class espnet2.diar.layers.abs_mask.AbsMask[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input, ilens, bottleneck_feat, num_spk) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property max_num_spk

espnet2.diar.layers.multi_mask

class espnet2.diar.layers.multi_mask.MultiMask(input_dim: int, bottleneck_dim: int = 128, max_num_spk: int = 3, mask_nonlinear='relu')[source]

Bases: espnet2.diar.layers.abs_mask.AbsMask

Multiple 1x1 convolution layer Module.

This module corresponds to the final 1x1 conv block and non-linear function in TCNSeparator. It has multiple 1x1 conv blocks, one of which is selected according to the given num_spk to handle a flexible number of speakers.

Parameters:
  • input_dim – Number of filters in autoencoder

  • bottleneck_dim – Number of channels in the bottleneck 1x1-conv block

  • max_num_spk – Number of mask_conv1x1 modules (>= Max number of speakers in the dataset)

  • mask_nonlinear – which non-linear function to use to generate the mask

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, bottleneck_feat: torch.Tensor, num_spk: int) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Keep this API the same as TasNet.

Parameters:
  • input – [M, K, N], M is batch size

  • ilens (torch.Tensor) – (M,)

  • bottleneck_feat – [M, K, B]

  • num_spk – number of speakers (Training: oracle; Inference: estimated by another module, e.g., EEND-EDA)

Returns:

masked: [(M, K, N), …]
ilens (torch.Tensor): (M,)
others: predicted data, e.g. masks: OrderedDict[

    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …,
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),

]

Return type:

masked (List[Union(torch.Tensor, ComplexTensor)])

property max_num_spk
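
Usage sketch matching the shapes documented above (tensor sizes are illustrative):

    import torch
    from espnet2.diar.layers.multi_mask import MultiMask

    mask_module = MultiMask(input_dim=256, bottleneck_dim=128, max_num_spk=3)
    feats = torch.randn(4, 200, 256)                  # input [M, K, N]
    ilens = torch.full((4,), 200, dtype=torch.long)   # (M,)
    bottleneck = torch.randn(4, 200, 128)             # bottleneck_feat [M, K, B]

    masked, olens, others = mask_module(feats, ilens, bottleneck, num_spk=2)
    # masked: list of 2 tensors, each (4, 200, 256)
    # others: OrderedDict with keys "mask_spk1", "mask_spk2"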

espnet2.diar.layers.tcn_nomask

class espnet2.diar.layers.tcn_nomask.ChannelwiseLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Channel-wise Layer Normalization (cLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:

[M, N, K]

Return type:

cLN_y

reset_parameters()[source]
class espnet2.diar.layers.tcn_nomask.Chomp1d(chomp_size)[source]

Bases: torch.nn.modules.module.Module

To ensure the output length is the same as the input.

forward(x)[source]

Forward.

Parameters:

x – [M, H, Kpad]

Returns:

[M, H, K]

class espnet2.diar.layers.tcn_nomask.DepthwiseSeparableConv(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, H, K]

Returns:

[M, B, K]

Return type:

result

class espnet2.diar.layers.tcn_nomask.GlobalLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Global Layer Normalization (gLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:

[M, N, K]

Return type:

gLN_y

reset_parameters()[source]
class espnet2.diar.layers.tcn_nomask.TemporalBlock(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, B, K]

Returns:

[M, B, K]

class espnet2.diar.layers.tcn_nomask.TemporalConvNet(N, B, H, P, X, R, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

Basic module of TasNet.

Parameters:
  • N – Number of filters in autoencoder

  • B – Number of channels in the bottleneck 1x1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

forward(mixture_w)[source]

Keep this API the same as TasNet.

Parameters:

mixture_w – [M, N, K], M is batch size

Returns:

[M, B, K]

Return type:

bottleneck_feature
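
Usage sketch with the parameters described above (the values follow common Conv-TasNet-style settings but are illustrative):

    import torch
    from espnet2.diar.layers.tcn_nomask import TemporalConvNet

    tcn = TemporalConvNet(N=256, B=128, H=512, P=3, X=8, R=3, norm_type="gLN")
    mixture_w = torch.randn(4, 256, 1000)  # [M, N, K], M is batch size

    bottleneck_feature = tcn(mixture_w)
    # bottleneck_feature: [M, B, K] -> (4, 128, 1000)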

espnet2.diar.layers.tcn_nomask.check_nonlinear(nolinear_type)[source]
espnet2.diar.layers.tcn_nomask.chose_norm(norm_type, channel_size)[source]

The input of normalization will be (M, C, K), where M is the batch size, C is the channel size, and K is the sequence length.

espnet2.diar.layers.__init__

espnet2.diar.separator.__init__

espnet2.diar.separator.tcn_separator_nomask

class espnet2.diar.separator.tcn_separator_nomask.TCNSeparatorNomask(input_dim: int, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Temporal Convolution Separator

Note that this separator is equivalent to TCNSeparator except that it does not have the mask estimation part. Instead, it outputs intermediate bottleneck feats (used as the input to the diarization branch in the enh_diar task). It is followed by the MultiMask module, which estimates the masks.

Parameters:
  • input_dim – input feature dimension

  • layer – int, number of layers in each stack.

  • stack – int, number of stacks

  • bottleneck_dim – bottleneck dimension

  • hidden_dim – number of convolution channel

  • kernel – int, kernel size.

  • causal – bool, default False.

  • norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns:

[B, T, bottleneck_dim] ilens (torch.Tensor): (B,)

Return type:

feats (torch.Tensor)

property num_spk
property output_dim
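
Usage sketch showing this separator feeding the MultiMask module described above (tensor sizes, and the use of the encoded feature as MultiMask’s first input, are illustrative assumptions):

    import torch
    from espnet2.diar.layers.multi_mask import MultiMask
    from espnet2.diar.separator.tcn_separator_nomask import TCNSeparatorNomask

    separator = TCNSeparatorNomask(input_dim=256)     # bottleneck_dim defaults to 128
    x = torch.randn(4, 300, 256)                      # encoded feature [B, T, N]
    ilens = torch.full((4,), 300, dtype=torch.long)

    feats, olens = separator(x, ilens)                # feats: [4, 300, 128]

    # The bottleneck feats then drive mask estimation for a flexible num_spk.
    mask_module = MultiMask(input_dim=256, bottleneck_dim=128, max_num_spk=3)
    masked, olens, others = mask_module(x, olens, feats, num_spk=2)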

espnet2.diar.decoder.abs_decoder

class espnet2.diar.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property num_spk

espnet2.diar.decoder.__init__

espnet2.diar.decoder.linear_decoder

class espnet2.diar.decoder.linear_decoder.LinearDecoder(encoder_output_size: int, num_spk: int = 2)[source]

Bases: espnet2.diar.decoder.abs_decoder.AbsDecoder

Linear decoder for speaker diarization

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – hidden_space [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

property num_spk
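
Usage sketch (the return value, per-frame speaker activity scores of shape [Batch, T, num_spk] from a single linear projection, is an assumption consistent with the class description, as the Returns section is not documented):

    import torch
    from espnet2.diar.decoder.linear_decoder import LinearDecoder

    decoder = LinearDecoder(encoder_output_size=256, num_spk=2)
    hidden = torch.randn(2, 100, 256)   # hidden_space [Batch, T, F]
    ilens = torch.tensor([100, 80])     # input lengths [Batch]

    output = decoder(hidden, ilens)
    # assumed output: per-frame speaker activity logits, [Batch, T, num_spk]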