espnet2.asr package¶
espnet2.asr.espnet_model¶
-
class
espnet2.asr.espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], aux_ctc: dict = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', transducer_multi_blank_durations: List = [], transducer_multi_blank_sigma: float = 0.05, sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
CTC-attention hybrid Encoder-Decoder model
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood (nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, calls nll for each batch, and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage
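Below is a minimal sketch of this batching idea (a hypothetical helper, not the ESPnet implementation; nll_fn stands in for the model's nll method):

```python
import torch

def batchify_nll_sketch(nll_fn, encoder_out, encoder_out_lens,
                        ys_pad, ys_pad_lens, batch_size=100):
    """Split the inputs along the batch dimension, score each chunk with
    nll_fn, and concatenate the per-utterance results."""
    results = []
    for start in range(0, encoder_out.size(0), batch_size):
        end = start + batch_size
        results.append(
            nll_fn(
                encoder_out[start:end],
                encoder_out_lens[start:end],
                ys_pad[start:end],
                ys_pad_lens[start:end],
            )
        )
    return torch.cat(results)  # (Batch,) negative log likelihoods
```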
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
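A usage sketch of the calling convention (it assumes `model` is an ESPnetASRModel instance built elsewhere, e.g. by an ESPnet task/config; only the shapes and return types documented above are used):

```python
import torch

# `model`: an already-constructed ESPnetASRModel (construction omitted).
speech = torch.randn(2, 16000)                 # (Batch, Length) raw waveforms
speech_lengths = torch.tensor([16000, 12000])  # (Batch,)
text = torch.tensor([[3, 7, 9], [4, 5, -1]])   # (Batch, Length); -1 = ignore_id padding
text_lengths = torch.tensor([3, 2])            # (Batch,)

loss, stats, weight = model(speech, speech_lengths, text, text_lengths)
loss.backward()  # `stats` holds loss/CER/WER values for reporting
```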
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.discrete_asr_espnet_model¶
-
class
espnet2.asr.discrete_asr_espnet_model.
ESPnetDiscreteASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: Optional[espnet2.asr.ctc.CTC], ctc_weight: float = 0.5, interctc_weight: float = 0.0, src_vocab_size: int = 0, src_token_list: Union[Tuple[str, ...], List[str]] = [], ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_bleu: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, share_decoder_input_output_embed: bool = False, share_encoder_decoder_input_embed: bool = False)[source]¶ Bases:
espnet2.mt.espnet_model.ESPnetMTModel
Encoder-Decoder model
-
encode
(src_text: torch.Tensor, src_text_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by mt_inference.py
- Parameters:
src_text – (Batch, Length, …)
src_text_lengths – (Batch, )
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, src_text: torch.Tensor, src_text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
text – (Batch, Length)
text_lengths – (Batch,)
src_text – (Batch, length)
src_text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
espnet2.asr.pit_espnet_model¶
-
class
espnet2.asr.pit_espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1, num_inf: int = 1, num_ref: int = 1)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
CTC-attention hybrid Encoder-Decoder model
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
-
class
espnet2.asr.pit_espnet_model.
PITLossWrapper
(criterion_fn: Callable, num_ref: int)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
-
forward
(inf: torch.Tensor, inf_lens: torch.Tensor, ref: torch.Tensor, ref_lens: torch.Tensor, others: Dict = None)[source]¶ PITLoss wrapper function. Similar to espnet2/enh/loss/wrappers/pit_solver.py
- Parameters:
inf – Iterable[torch.Tensor], (batch, num_inf, …)
inf_lens – Iterable[torch.Tensor], (batch, num_inf, …)
ref – Iterable[torch.Tensor], (batch, num_ref, …)
ref_lens – Iterable[torch.Tensor], (batch, num_ref, …)
permute_inf – If true, permute the inference and inference_lens according to the optimal permutation.
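A simplified sketch of the permutation-invariant training (PIT) idea this wrapper implements (not the actual ESPnet code; criterion_fn is assumed to return a per-utterance loss of shape (batch,)):

```python
import itertools
import torch

def pit_loss_sketch(criterion_fn, inf, inf_lens, ref, ref_lens):
    """Conceptual PIT: evaluate the loss under every speaker permutation
    and, per utterance, keep the minimum. inf/ref: (batch, num_spk, ...)."""
    num_ref = ref.size(1)
    losses = torch.stack(
        [
            torch.stack(
                [
                    criterion_fn(inf[:, i], inf_lens[:, i],
                                 ref[:, p], ref_lens[:, p])
                    for i, p in enumerate(perm)
                ]
            ).mean(dim=0)  # average over speakers -> (batch,)
            for perm in itertools.permutations(range(num_ref))
        ]
    )  # (num_permutations, batch)
    min_loss, _ = losses.min(dim=0)  # best permutation per utterance
    return min_loss.mean()
```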
-
espnet2.asr.ctc¶
-
class
espnet2.asr.ctc.
CTC
(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = None, zero_infinity: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module.
- Parameters:
odim – dimension of outputs
encoder_output_size – number of encoder projection units
dropout_rate – dropout rate (0.0 ~ 1.0)
ctc_type – builtin or gtnctc
reduce – reduce the CTC loss into a scalar
ignore_nan_grad – Same as zero_infinity (kept for backward compatibility)
zero_infinity – Whether to zero infinite losses and the associated gradients.
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters:
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns:
argmax applied 2d tensor (B, Tmax)
- Return type:
torch.Tensor
-
forward
(hs_pad, hlens, ys_pad, ys_lens)[source]¶ Calculate CTC loss.
- Parameters:
hs_pad – batch of padded hidden state sequences (B, Tmax, D)
hlens – batch of lengths of hidden state sequences (B)
ys_pad – batch of padded character id sequence tensor (B, Lmax)
ys_lens – batch of lengths of character sequence (B)
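A usage sketch built only from the signatures documented above (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.ctc import CTC

ctc = CTC(odim=50, encoder_output_size=256)   # builtin torch CTC loss by default

hs_pad = torch.randn(4, 120, 256)             # (B, Tmax, D) encoder outputs
hlens = torch.tensor([120, 100, 90, 80])      # (B,)
ys_pad = torch.randint(1, 50, (4, 20))        # (B, Lmax) target ids; 0 is blank
ys_lens = torch.tensor([20, 18, 15, 10])      # (B,)

loss = ctc(hs_pad, hlens, ys_pad, ys_lens)    # scalar when reduce=True
best_paths = ctc.argmax(hs_pad)               # (B, Tmax) frame-level argmax
```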
espnet2.asr.maskctc_model¶
-
class
espnet2.asr.maskctc_model.
MaskCTCInference
(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]¶ Bases:
torch.nn.modules.module.Module
Mask-CTC-based non-autoregressive inference
Initialize Mask-CTC inference
-
class
espnet2.asr.maskctc_model.
MaskCTCModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood (nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, calls nll for each batch, and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.__init__¶
espnet2.asr.layers.cgmlp¶
MLP with convolutional gating (cgMLP) definition.
References
https://openreview.net/forum?id=RA-zVvZLYIy https://arxiv.org/abs/2105.08050
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalGatingMLP
(size: int, linear_units: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Gating MLP (cgMLP).
-
forward
(x, mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalSpatialGatingUnit
(size: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Spatial Gating Unit (CSGU).
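A conceptual sketch of the CSGU gating described in the cgMLP paper (arXiv:2105.08050); this is a simplified stand-in, not the ESPnet module, and it builds throwaway layers inline purely for illustration:

```python
import torch

def csgu_sketch(x):
    """Split channels in half, normalize and depthwise-convolve the gate
    half over time, then gate the other half elementwise."""
    b, t, d2 = x.shape
    d = d2 // 2
    norm = torch.nn.LayerNorm(d)                                # per-channel norm
    dw_conv = torch.nn.Conv1d(d, d, kernel_size=31, padding=15, groups=d)
    x_r, x_g = x.chunk(2, dim=-1)                               # (b, t, d) each
    x_g = dw_conv(norm(x_g).transpose(1, 2)).transpose(1, 2)    # conv over time
    return x_r * x_g                                            # (b, t, d)
```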
espnet2.asr.layers.__init__¶
espnet2.asr.layers.fastformer¶
Fastformer attention definition.
- Reference:
Wu et al., “Fastformer: Additive Attention Can Be All You Need” https://arxiv.org/abs/2108.09084 https://github.com/wuch15/Fastformer
-
class
espnet2.asr.layers.fastformer.
FastSelfAttention
(size, attention_heads, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Fast self-attention used in Fastformer.
espnet2.asr.encoder.rnn_encoder¶
-
class
espnet2.asr.encoder.rnn_encoder.
RNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
RNNEncoder class.
- Parameters:
input_size – The number of expected features in the input
output_size – The number of output features
hidden_size – The number of hidden features
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
dropout – dropout probability
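A usage sketch based on the documented defaults (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.encoder.rnn_encoder import RNNEncoder

encoder = RNNEncoder(input_size=80)    # defaults: 4-layer BLSTM with projection
xs_pad = torch.randn(2, 200, 80)       # (B, T, input_size) padded features
ilens = torch.tensor([200, 150])       # (B,) true lengths

out, out_lens, _ = encoder(xs_pad, ilens)
print(out.shape, out_lens)             # time axis reduced per `subsample`
```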
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.transformer_encoder¶
Transformer encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
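A usage sketch based on the documented signature (it assumes espnet2 is installed; with the default conv2d input layer the time axis is subsampled by roughly 4):

```python
import torch
from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

encoder = TransformerEncoder(input_size=80)  # output_size defaults to 256
xs_pad = torch.randn(2, 200, 80)             # (B, T, input_size)
ilens = torch.tensor([200, 180])             # (B,)

out, out_lens, _ = encoder(xs_pad, ilens)    # (B, T', 256), (B,), None
```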
espnet2.asr.encoder.abs_encoder¶
-
class
espnet2.asr.encoder.abs_encoder.
AbsEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.vgg_rnn_encoder¶
-
class
espnet2.asr.encoder.vgg_rnn_encoder.
VGGRNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
VGGRNNEncoder class.
- Parameters:
input_size – The number of expected features in the input
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
hidden_size – The number of hidden features
output_size – The number of output features
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.hubert_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertEncoder
(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning
- Parameters:
input_size – input dim
hubert_url – url to Hubert pretrained model
hubert_dir_path – directory to download the Hubert pretrained model.
output_size – dimension of attention
normalize_before – whether to use layer_norm before the first block
freeze_finetune_updates – number of steps during which all layers except the output layer are frozen before tuning the whole model (necessary to prevent overfitting).
dropout_rate – dropout rate
activation_dropout – dropout rate in activation function
attention_dropout – dropout rate in attention
- Hubert specific Args:
Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert ASR Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertPretrainEncoder
(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert pretrain encoder module, used only for the pretraining stage
- Parameters:
input_size – input dim
output_size – dimension of attention
linear_units – dimension of feedforward layers
attention_heads – the number of heads of multi head attention
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
hubert_dict – target dictionary for Hubert pretraining
label_rate – label frame rate. -1 for sequence label
sample_rate – target sample rate.
use_amp – whether to use automatic mixed precision
normalize_before – whether to use layer_norm before the first block
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
TorchAudioHuBERTPretrainEncoder
(input_size: int = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]] = [(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 2, 2), (512, 2, 2)], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: Optional[float] = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Torch Audio Hubert encoder module.
- Parameters:
extractor_mode – Operation mode of feature extractor. Valid values are “group_norm” or “layer_norm”.
extractor_conv_layer_config – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [(output_channel, kernel_size, stride), …]
extractor_conv_bias – Whether to include bias term to each convolution operation.
encoder_embed_dim – The dimension of embedding in encoder.
encoder_projection_dropout – The dropout probability applied after the input feature is projected to “encoder_embed_dim”.
encoder_pos_conv_kernel – Kernel size of convolutional positional embeddings.
encoder_pos_conv_groups – Number of groups of convolutional positional embeddings.
encoder_num_layers – Number of self attention layers in transformer block.
encoder_num_heads – Number of heads in self attention layers.
encoder_attention_dropout – Dropout probability applied after softmax in self-attention layer.
encoder_ff_interm_features – Dimension of hidden features in feed forward layer.
encoder_ff_interm_dropout – Dropout probability applied in feedforward layer.
encoder_dropout – Dropout probability applied at the end of feed forward layer.
encoder_layer_norm_first – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers.
encoder_layer_drop – Probability to drop each encoder layer during training.
mask_prob – Probability for each token to be chosen as start of the span to be masked.
mask_selection – How to choose the mask length. Options: [static, uniform, normal, poisson].
mask_other – Secondary mask argument (used for more complex distributions).
mask_length – The lengths of the mask.
no_mask_overlap – Whether to allow masks to overlap.
mask_min_space – Minimum space between spans (if no overlap is enabled).
mask_channel_prob – (float): The probability of replacing a feature with 0.
mask_channel_selection – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
mask_channel_other – Secondary mask argument for channel masking (used for more complex distributions).
mask_channel_length – The lengths of the mask for channel masking.
no_mask_channel_overlap – Whether to allow channel masks to overlap.
mask_channel_min_space – Minimum space between spans for channel masking (if no overlap is enabled).
skip_masked – If True, skip computing losses over masked frames.
skip_nomask – If True, skip computing losses over unmasked frames.
num_classes – The number of classes in the labels.
final_dim – Project final representations and targets to final_dim.
feature_grad_mult – The factor to scale the convolutional feature extraction layer gradients by. The scale factor will not affect the forward pass.
finetuning – Whether to fine-tune the model with ASR or other tasks.
freeze_encoder_updates – The number of steps to freeze the encoder parameters in ASR finetuning.
- Hubert specific Args:
Please refer to: https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor = None, ys_pad_length: torch.Tensor = None, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.wav2vec2_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.wav2vec2_encoder.
FairSeqWav2Vec2Encoder
(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Wav2Vec2 encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
w2v_url – url to Wav2Vec2.0 pretrained model
w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.
normalize_before – whether to use layer_norm before the first block
finetune_last_n_layers – last n layers to be fine-tuned in Wav2Vec2.0; 0 means to fine-tune every layer if freeze_w2v=False.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward FairSeqWav2Vec2 Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.transformer_encoder_multispkr¶
Encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder_multispkr.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, num_blocks_sd: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, num_inf: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of recognition encoder blocks
num_blocks_sd – the number of speaker dependent encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
num_inf – number of inference output
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.contextual_block_conformer_encoder¶
Created on Sat Aug 21 17:27:16 2021.
@author: Keqi Deng (UCAS)
-
class
espnet2.asr.encoder.contextual_block_conformer_encoder.
ContextualBlockConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Conformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to apply positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns:
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
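A hypothetical streaming loop using forward_infer (encoder construction is omitted; only the documented signature, with states carried between calls, is assumed):

```python
import torch

# `enc`: an already-built ContextualBlockConformerEncoder (construction omitted).
feats = torch.randn(1, 320, 80)          # (1, T, D) features of a full utterance
chunks = feats.split(64, dim=1)          # feed fixed-size chunks in order
states = None
for i, chunk in enumerate(chunks):
    ilens = torch.tensor([chunk.size(1)])
    out, out_lens, states = enc.forward_infer(
        chunk, ilens, prev_states=states, is_final=(i == len(chunks) - 1)
    )
    # `out` holds the newly encoded frames available after this block
```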
espnet2.asr.encoder.whisper_encoder¶
-
class
espnet2.asr.encoder.whisper_encoder.
OpenAIWhisperEncoder
(input_size: int = 1, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: str = None, use_specaug: bool = False, specaug_conf: Optional[dict] = None, do_pad_trim: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer-based Speech Encoder from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
log_mel_spectrogram
(audio: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]¶ Use log-mel spectrogram computation native to Whisper training
-
pad_or_trim
(array: torch.Tensor, length: int, axis: int = -1) → torch.Tensor[source]¶ Pad or trim the audio array to N_SAMPLES.
Used in zero-shot inference cases.
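A usage sketch (it assumes the openai-whisper dependency is available; pretrained weights are fetched on first construction):

```python
import torch
from espnet2.asr.encoder.whisper_encoder import OpenAIWhisperEncoder

encoder = OpenAIWhisperEncoder(whisper_model="small")
speech = torch.randn(1, 32000)            # (B, samples) raw 16 kHz audio
ilens = torch.tensor([32000])             # (B,)

out, out_lens, _ = encoder(speech, ilens)  # downsampled hidden states
```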
espnet2.asr.encoder.e_branchformer_encoder¶
E-Branchformer encoder definition. Reference:
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe, “E-Branchformer: Branchformer with Enhanced merging for speech recognition,” in SLT 2022.
-
class
espnet2.asr.encoder.e_branchformer_encoder.
EBranchformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, use_ffn: bool = False, macaron_ffn: bool = False, ffn_activation_type: str = 'swish', linear_units: int = 2048, positionwise_layer_type: str = 'linear', merge_conv_kernel: int = 3, interctc_layer_idx=None, interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
E-Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, max_layer: int = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
ctc (CTC) – Intermediate CTC module.
max_layer (int) – Layer depth below which InterCTC is applied.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
-
-
class
espnet2.asr.encoder.e_branchformer_encoder.
EBranchformerEncoderLayer
(size: int, attn: torch.nn.modules.module.Module, cgmlp: torch.nn.modules.module.Module, feed_forward: Optional[torch.nn.modules.module.Module], feed_forward_macaron: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_conv_kernel: int = 3)[source]¶ Bases:
torch.nn.modules.module.Module
E-Branchformer encoder layer module.
- Parameters:
size (int) – model dimension
attn – standard self-attention or efficient attention
cgmlp – ConvolutionalGatingMLP
feed_forward – feed-forward module, optional
feed_forward_macaron – macaron-style feed-forward module, optional
dropout_rate (float) – dropout probability
merge_conv_kernel (int) – kernel size of the depth-wise conv in merge module
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type:
torch.Tensor
espnet2.asr.encoder.__init__¶
espnet2.asr.encoder.longformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.longformer_encoder.
LongformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]¶ Bases:
espnet2.asr.encoder.conformer_encoder.ConformerEncoder
Longformer SA Conformer encoder module.
- Parameters:
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
attention_windows (list) – Layer-wise attention window sizes for longformer self-attn
attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn
attention_mode (str) – Implementation for longformer self-attn. Default=”sliding_chunks” Choose ‘n2’, ‘tvm’ or ‘sliding_chunks’. More details in https://github.com/allenai/longformer
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
espnet2.asr.encoder.branchformer_encoder¶
Branchformer encoder definition.
- Reference:
Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding,” in Proceedings of ICML, 2022.
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoder
(input_size: int, output_size: int = 256, use_attn: bool = True, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', use_cgmlp: bool = True, cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', merge_method: str = 'concat', cgmlp_weight: Union[float, List[float]] = 0.5, attn_branch_drop_rate: Union[float, List[float]] = 0.0, num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
-
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoderLayer
(size: int, attn: Optional[torch.nn.modules.module.Module], cgmlp: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_method: str, cgmlp_weight: float = 0.5, attn_branch_drop_rate: float = 0.0, stochastic_depth_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Branchformer encoder layer module.
- Parameters:
size (int) – model dimension
attn – standard self-attention or efficient attention, optional
cgmlp – ConvolutionalGatingMLP, optional
dropout_rate (float) – dropout probability
merge_method (str) – concat, learned_ave, fixed_ave
cgmlp_weight (float) – weight of the cgmlp branch, between 0 and 1, used if merge_method is fixed_ave
attn_branch_drop_rate (float) – probability of dropping the attn branch, used if merge_method is learned_ave
stochastic_depth_rate (float) – stochastic depth probability
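A conceptual sketch of the merge methods listed above (simplified, not the ESPnet code; merge_proj is a hypothetical Linear(2*d, d) used only by the concat variant, and learned_ave differs from fixed_ave only in where the weight comes from):

```python
import torch

def merge_branches_sketch(x_attn, x_cgmlp, method="fixed_ave",
                          cgmlp_weight=0.5, merge_proj=None):
    """Merge the attention and cgMLP branch outputs (both (B, T, d))."""
    if method == "concat":
        # concatenate both branches and project back to the model size
        return merge_proj(torch.cat([x_attn, x_cgmlp], dim=-1))
    # fixed_ave / learned_ave: weighted average of the two branches
    return (1.0 - cgmlp_weight) * x_attn + cgmlp_weight * x_cgmlp
```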
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type:
torch.Tensor
espnet2.asr.encoder.conformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.conformer_encoder.
ConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Conformer encoder module.
- Parameters:
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
espnet2.asr.encoder.contextual_block_transformer_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.contextual_block_transformer_encoder.
ContextualBlockTransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Transformer encoder module.
Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to apply positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns:
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.postencoder.abs_postencoder¶
-
class
espnet2.asr.postencoder.abs_postencoder.
AbsPostEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.postencoder.hugging_face_transformers_postencoder¶
Hugging Face Transformers PostEncoder.
-
class
espnet2.asr.postencoder.hugging_face_transformers_postencoder.
HuggingFaceTransformersPostEncoder
(input_size: int, model_name_or_path: str, length_adaptor_n_layers: int = 0, lang_token_id: int = -1)[source]¶ Bases:
espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder
Hugging Face Transformers PostEncoder.
Initialize the module.
espnet2.asr.postencoder.__init__¶
espnet2.asr.preencoder.abs_preencoder¶
-
class
espnet2.asr.preencoder.abs_preencoder.
AbsPreEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.sinc¶
Sinc convolutions for raw audio input.
-
class
espnet2.asr.preencoder.sinc.
LightweightSincConvs
(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Lightweight Sinc Convolutions.
Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597
To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and use the input of the sliding window frontend with frontend: sliding_window in your yaml configuration file, so that the process flow is:
Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
Note that this method also performs data augmentation in time domain (vs. in spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.
Initialize the module.
- Parameters:
fs – Sample rate.
in_channels – Number of input channels.
out_channels – Number of output channels (for each input channel).
activation_type – Choice of activation function.
dropout_type – Choice of dropout function.
windowing_type – Choice of windowing function.
scale_type – Choice of filter-bank initialization scale.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply Lightweight Sinc Convolutions.
The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.
The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.
The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization multiplied with C_in.
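A usage sketch matching the shapes described above (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.preencoder.sinc import LightweightSincConvs

preenc = LightweightSincConvs(fs=16000)  # defaults: 1 in-channel, 256 out
x = torch.randn(2, 50, 1, 400)           # (B, T, C_in, D_in) sliding windows
x_lens = torch.tensor([50, 40])          # (B,)

y, y_lens = preenc(x, x_lens)            # (B, T, C_out * D_out)
```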
-
gen_lsc_block
(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]¶ Generate a convolutional block for Lightweight Sinc convolutions.
Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.
- Parameters:
in_channels – Number of input channels.
out_channels – Number of output channels.
depthwise_kernel_size – Kernel size of the depthwise convolution.
depthwise_stride – Stride of the depthwise convolution.
depthwise_groups – Number of groups of the depthwise convolution.
pointwise_groups – Number of groups of the pointwise convolution.
dropout_probability – Dropout probability in the block.
avgpool – If True, an AvgPool layer is inserted.
- Returns:
Neural network building block.
- Return type:
torch.nn.Sequential
-
class
espnet2.asr.preencoder.sinc.
SpatialDropout
(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]¶ Bases:
torch.nn.modules.module.Module
Spatial dropout module.
Apply dropout to full channels on tensors of input (B, C, D)
Initialize.
- Parameters:
dropout_probability – Dropout probability.
shape (tuple, list) – Shape of input tensors.
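A usage sketch (assuming the module is called directly on a (B, C, D) tensor as described):

```python
import torch
from espnet2.asr.preencoder.sinc import SpatialDropout

drop = SpatialDropout(dropout_probability=0.15)
x = torch.randn(8, 64, 400)   # (B, C, D)
y = drop(x)                   # entire channels are zeroed during training
```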
espnet2.asr.preencoder.linear¶
Linear Projection.
-
class
espnet2.asr.preencoder.linear.
LinearProjection
(input_size: int, output_size: int, dropout: float = 0.0)[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Linear Projection Preencoder.
Initialize the module.
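A minimal sketch, assuming the standard pre-encoder interface of (input, input_lengths) -> (output, output_lengths); the sizes are illustrative:

    import torch
    from espnet2.asr.preencoder.linear import LinearProjection

    projection = LinearProjection(input_size=1024, output_size=80, dropout=0.1)
    feats = torch.randn(2, 100, 1024)               # e.g. upstream frontend features
    lengths = torch.full((2,), 100, dtype=torch.long)
    out, out_lengths = projection(feats, lengths)   # out: (2, 100, 80)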
espnet2.asr.preencoder.__init__¶
espnet2.asr.specaug.abs_specaug¶
-
class
espnet2.asr.specaug.abs_specaug.
AbsSpecAug
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract class for spectrogram augmentation.
The process flow:
Frontend -> SpecAug -> Normalization -> Encoder -> Decoder
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.specaug.__init__¶
espnet2.asr.specaug.specaug¶
SpecAugment module.
-
class
espnet2.asr.specaug.specaug.
SpecAug
(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]¶ Bases:
espnet2.asr.specaug.abs_specaug.AbsSpecAug
Implementation of SpecAug.
- Reference:
Daniel S. Park et al. “SpecAugment: A Simple Data
Augmentation Method for Automatic Speech Recognition”
Warning
When using CUDA, time_warp is not reproducible due to torch.nn.functional.interpolate.
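A minimal usage sketch on fbank-like features. The explicit time_mask_width_range below is an illustrative choice (one of time_mask_width_range / time_mask_width_ratio_range is expected when time masking is enabled):

    import torch
    from espnet2.asr.specaug.specaug import SpecAug

    specaug = SpecAug(time_mask_width_range=(0, 40))        # other options keep their defaults
    feats = torch.randn(2, 300, 80)                         # (Batch, Time, Freq)
    feat_lengths = torch.full((2,), 300, dtype=torch.long)
    masked, masked_lengths = specaug(feats, feat_lengths)   # same shapes, with warp/masks applied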
-
forward
(x, x_lengths=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.residual¶
Implementations of different types of residual functions.
-
class
espnet2.asr.state_spaces.residual.
Affine
(*args, scalar=True, gamma=0.0, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
Residual connection with learnable scalar multipliers on the main branch.
scalar: Single scalar multiplier, or one per dimension
scale, power: Initialize to scale * layer_num**(-power)
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
DecayResidual
(*args, power=0.5, l2=True)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
Residual connection that can decay the linear combination depending on depth.
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
Highway
(*args, scaling_correction=False, elemwise=False)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
-
forward
(x, y, transposed=False)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
Residual
(i_layer, d_input, d_model, alpha=1.0, beta=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Residual connection with constant affine weights.
Can simulate standard residual, no residual, and “constant gates”.
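A minimal sketch of the residual interface shared by the classes in this module, where x is the block input and y the output of the wrapped layer:

    import torch
    from espnet2.asr.state_spaces.residual import Residual

    residual = Residual(i_layer=1, d_input=256, d_model=256)
    x = torch.randn(2, 100, 256)            # block input
    y = torch.randn(2, 100, 256)            # wrapped layer output
    out = residual(x, y, transposed=False)  # alpha * x + beta * y with constant weights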
-
property
d_output
¶
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.cauchy¶
espnet2.asr.state_spaces.s4¶
Standalone version of Structured (Sequence) State Space (S4) model.
-
class
espnet2.asr.state_spaces.s4.
OptimModule
[source]¶ Bases:
torch.nn.modules.module.Module
Interface for a Module that allows registering buffers/parameters with configurable optimizer hyperparameters.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
class
espnet2.asr.state_spaces.s4.
S4
(d_model, d_state=64, l_max=None, channels=1, bidirectional=False, activation='gelu', postact='glu', hyper_act=None, dropout=0.0, tie_dropout=False, bottleneck=None, gate=None, transposed=True, verbose=False, **kernel_args)[source]¶ Bases:
torch.nn.modules.module.Module
Initialize S4 module.
d_state: the dimension of the state, also denoted by N
l_max: the maximum kernel length, also denoted by L. Set l_max=None to always use a global kernel
channels: can be interpreted as a number of “heads”; the SSM is a map from a 1-dim to C-dim sequence. It’s not recommended to change this unless desperate for things to tune; instead, increase d_model for larger models
bidirectional: if True, the convolution kernel will be two-sided
activation: activation in between SS and FF
postact: activation after FF
hyper_act: use a “hypernetwork” multiplication (experimental)
dropout: standard dropout argument. tie_dropout=True ties the dropout mask across the sequence length, emulating nn.Dropout1d
transposed: choose backbone axis ordering of (B, L, H) (if False) or (B, H, L) (if True) [B=batch size, L=sequence length, H=hidden dimension]
gate: add gated activation (GSS)
bottleneck: reduce SSM dimension (GSS)
See the class SSKernel for the kernel constructor, which accepts kernel_args. Relevant options that are worth considering and tuning include “mode” + “measure”, “dt_min”, “dt_max”, and “lr”; other options are experimental and should not need to be configured.
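A minimal sketch, assuming the default transposed=True axis ordering and that forward returns the transformed sequence along with an optional state:

    import torch
    from espnet2.asr.state_spaces.s4 import S4

    layer = S4(d_model=128, d_state=64, transposed=True)
    u = torch.randn(2, 128, 500)    # (B, H, L)
    y, state = layer(u)             # y: (2, 128, 500)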
-
property
d_output
¶
-
class
espnet2.asr.state_spaces.s4.
SSKernel
(H, N=64, L=None, measure='legs', rank=1, channels=1, dt_min=0.001, dt_max=0.1, deterministic=False, lr=None, mode='nplr', n_ssm=None, verbose=False, measure_args={}, **kernel_args)[source]¶ Bases:
torch.nn.modules.module.Module
Wrapper around SSKernel parameterizations.
The SSKernel is expected to support the interface forward() default_state() _setup_step() step()
State Space Kernel which computes the convolution kernel $\bar{K}$.
H: Number of independent SSM copies; controls the size of the model. Also called d_model in the config.
N: State size (dimensionality of parameters A, B, C). Also called d_state in the config. Generally shouldn’t need to be adjusted and doesn’t affect speed much.
L: Maximum length of the convolution kernel, if known. Should work in the majority of cases even if not known.
measure: Options for initialization of (A, B). For NPLR mode, recommendations are “legs”, “fout”, “hippo” (combination of both). For Diag mode, recommendations are “diag-inv”, “diag-lin”, “diag-legs”, and “diag” (combination of diag-inv and diag-lin).
rank: Rank of the low-rank correction for NPLR mode. Needs to be increased for measure “legt”.
channels: C channels turns the SSM from a 1-dim to a C-dim map; one can think of it as having C separate “heads” per SSM. This was partly a feature to make it easier to implement bidirectionality; it is recommended to set channels=1 and adjust H to control parameters instead.
dt_min, dt_max: min and max values for the step size dt (Delta).
mode: Which kernel algorithm to use. ‘nplr’ is the full S4 model; ‘diag’ is the simpler S4D; ‘slow’ is a dense version for testing.
n_ssm: Number of independent trainable (A, B) SSMs, e.g. n_ssm=1 means all A/B parameters are tied across the H different instantiations of C. n_ssm=None means all H SSMs are completely independent. Generally, changing this option can save parameters but doesn’t affect performance or speed much. This parameter must divide H.
lr: Passing in a number (e.g. 0.001) sets attributes of the SSM parameters (A, B, dt). A custom optimizer hook is needed to configure the optimizer to set the learning rates appropriately for these parameters.
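A minimal sketch of constructing the kernel directly, assuming forward returns the kernel tensor together with an optional state contribution:

    from espnet2.asr.state_spaces.s4 import SSKernel

    kernel = SSKernel(H=128, N=64, L=500, measure="legs", mode="nplr")
    k, k_state = kernel(L=500)      # k: (C, H, L) = (1, 128, 500) for channels=1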
-
forward
(state=None, L=None, rate=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.-
forward_state
(u, state)[source]¶ Forward the state through a sequence.
i.e. computes the state after passing the chunk through the SSM.
state: (B, H, N); u: (B, H, L)
Returns: (B, H, N)
-
class
espnet2.asr.state_spaces.s4.
SSKernelDiag
(A, B, C, log_dt, L=None, disc='bilinear', real_type='exp', lr=None, bandlimit=None)[source]¶ Bases:
espnet2.asr.state_spaces.s4.OptimModule
Version using (complex) diagonal state matrix (S4D).
-
class
espnet2.asr.state_spaces.s4.
SSKernelNPLR
(w, P, B, C, log_dt, L=None, lr=None, verbose=False, keops=False, real_type='exp', real_tolerance=0.001, bandlimit=None)[source]¶ Bases:
espnet2.asr.state_spaces.s4.OptimModule
Stores a representation of and computes the SSKernel function.
K_L(A^dt, B^dt, C) corresponding to a discretized state space, where A is Normal + Low Rank (NPLR)
Initialize kernel.
L: Maximum length; this module computes an SSM kernel of length L
A is represented by diag(w) - PP^*:
w: (S, N) diagonal part
P: (R, S, N) low-rank part
B: (S, N)
C: (C, H, N)
dt: (H) timescale per feature
lr: [dict | float | None] hook to set the lr of special parameters (A, B, dt)
Dimensions:
N (or d_state): state size
H (or d_model): total SSM copies
S (or n_ssm): number of trainable copies of (A, B, dt); must divide H
R (or rank): rank of the low-rank part
C (or channels): the system maps a 1-dim to a C-dim sequence
The forward pass of this Module returns a tensor of shape (C, H, L)
- Note: tensor shape N here denotes half the true state size,
because of conjugate symmetry
-
espnet2.asr.state_spaces.s4.
cauchy_naive
(v, z, w)[source]¶ Naive version.
v, w: (…, N); z: (…, L); returns: (…, L)
-
espnet2.asr.state_spaces.s4.
dplr
(scaling, N, rank=1, H=1, dtype=torch.float32, real_scale=1.0, imag_scale=1.0, random_real=False, random_imag=False, normalize=False, diagonal=True, random_B=False)[source]¶
-
espnet2.asr.state_spaces.s4.
get_logger
(name='espnet2.asr.state_spaces.s4', level=20) → logging.Logger[source]¶ Initialize multi-GPU-friendly python logger.
-
espnet2.asr.state_spaces.s4.
log
= <Logger espnet2.asr.state_spaces.s4 (INFO)>[source]¶
Cauchy and Vandermonde kernels
-
espnet2.asr.state_spaces.s4.
log_vandermonde
(v, x, L)[source]¶ Compute Vandermonde product.
v: (…, N); x: (…, N); returns: (…, L), computing sum v x^l
-
espnet2.asr.state_spaces.s4.
nplr
(measure, N, rank=1, dtype=torch.float32, diagonalize_precision=True)[source]¶ Decompose as Normal Plus Low-Rank (NPLR).
Return w, p, q, V, B such that (w - p q^*, B) is unitarily equivalent to the original HiPPO A, B by the matrix V, i.e. A = V [w - p q^*] V^*, B = V B
-
espnet2.asr.state_spaces.s4.
power
(L, A, v=None)[source]¶ Compute A^L and the scan sum_i A^i v_i.
A: (…, N, N); v: (…, N, L)
-
espnet2.asr.state_spaces.s4.
rank_correction
(measure, N, rank=1, dtype=torch.float32)[source]¶ Return low-rank matrix L such that A + L is normal.
-
espnet2.asr.state_spaces.s4.
rank_zero_only
(fn: Callable) → Callable[source]¶ Decorator function from PyTorch Lightning.
Function that can be used as a decorator to enable a function/method being called only on global rank 0.
-
espnet2.asr.state_spaces.s4.
ssm
(measure, N, R, H, **ssm_args)[source]¶ Dispatcher to create single SSM initialization.
N: state size; R: rank (for DPLR parameterization); H: number of independent SSM copies
-
espnet2.asr.state_spaces.s4.
transition
(measure, N)[source]¶ A, B transition matrices for different measures.
espnet2.asr.state_spaces.ff¶
Implementation of FFN block in the style of Transformers.
-
class
espnet2.asr.state_spaces.ff.
FF
(d_input, expand=2, d_output=None, transposed=False, activation='gelu', initializer=None, dropout=0.0, tie_dropout=False)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x, *args, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
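A minimal sketch, assuming the SequenceModule convention of returning (output, state):

    import torch
    from espnet2.asr.state_spaces.ff import FF

    ff = FF(d_input=256, expand=2)  # inner width 512, output width 256
    x = torch.randn(2, 100, 256)
    y, state = ff(x)                # y: (2, 100, 256)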
-
espnet2.asr.state_spaces.pool¶
Implements downsampling and upsampling on sequences.
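For instance, a stride-2 average-pooling layer halves the length axis; a sketch assuming the SequenceModule convention of returning (output, state):

    import torch
    from espnet2.asr.state_spaces.pool import DownAvgPool

    pool = DownAvgPool(d_input=256, stride=2, transposed=False)
    x = torch.randn(2, 100, 256)    # (batch, length, dim)
    y, _ = pool(x)                  # y: (2, 50, 256)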
-
class
espnet2.asr.state_spaces.pool.
DownAvgPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownLinearPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownPool
(d_input, d_output=None, expand=None, stride=1, transposed=True, weight_norm=True, initializer=None, activation=None)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
-
class
espnet2.asr.state_spaces.pool.
DownPool2d
(d_input, d_output, stride=1, transposed=True, weight_norm=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
-
class
espnet2.asr.state_spaces.pool.
DownSample
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownSpectralPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
UpPool
(d_input, d_output, stride, transposed=True, weight_norm=True, initializer=None, activation=None)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x, skip=None)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
class
espnet2.asr.state_spaces.pool.
UpSample
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
torch.nn.modules.module.Module
-
property
d_output
¶
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.block¶
Implements a full residual block around a black box layer.
Configurable options include:
- normalization position: prenorm or postnorm
- normalization type: batchnorm, layernorm, etc.
- subsampling/pooling
- residual options: feedforward, residual, affine scalars, depth-dependent scaling, etc.
-
class
espnet2.asr.state_spaces.block.
SequenceResidualBlock
(d_input, i_layer=None, prenorm=True, dropout=0.0, tie_dropout=False, transposed=False, layer=None, residual=None, norm=None, pool=None, drop_path=0.0)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Residual block wrapper for black box layer.
The SequenceResidualBlock class implements a generic (batch, length, d_input) -> (batch, length, d_input) transformation
- Parameters:
d_input – Input feature dimension
i_layer – Layer index, only needs to be passed into certain residuals like Decay
dropout – Dropout for black box module
tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
transposed – Transpose inputs so each layer receives (batch, dim, length)
layer – Config for black box module
residual – Config for residual function
norm – Config for normalization layer
pool – Config for pooling layer per stage
drop_path – Drop ratio for stochastic depth
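A hedged construction sketch; the layer, residual, and norm values below are illustrative registry configs, not the only valid choices:

    import torch
    from espnet2.asr.state_spaces.block import SequenceResidualBlock

    block = SequenceResidualBlock(
        d_input=256,
        i_layer=1,
        prenorm=True,
        dropout=0.1,
        layer={"_name_": "s4"},  # black box layer config (illustrative)
        residual="R",            # plain residual connection (illustrative)
        norm="layer",
    )
    x = torch.randn(2, 100, 256)
    y, state = block(x)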
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(x, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.base¶
-
class
espnet2.asr.state_spaces.base.
SequenceIdentity
(*args, transposed=False, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceIdentity
Simple SequenceModule for testing purposes.
-
class
espnet2.asr.state_spaces.base.
SequenceModule
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract sequence model class.
All models must adhere to this interface
A SequenceModule is generally a model that transforms an input of shape (n_batch, l_sequence, d_model) to (n_batch, l_sequence, d_output)
REQUIRED methods and attributes:
forward, d_model, d_output: control the standard forward pass, a sequence-to-sequence transformation
__init__ should also satisfy the following interface (see SequenceIdentity for an example):
def __init__(self, d_model, transposed=False, **kwargs)
OPTIONAL methods:
default_state, step: allow stepping the model recurrently with a hidden state
state_to_tensor, d_state: allow decoding from the hidden state
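A minimal sketch of a conforming subclass (illustrative; the real SequenceIdentity plays the same role):

    import torch
    from espnet2.asr.state_spaces.base import SequenceModule

    class PassThrough(SequenceModule):
        """Identity transformation satisfying the required interface."""

        def __init__(self, d_model, transposed=False, **kwargs):
            super().__init__()
            self.d_model = d_model   # required attribute
            self.d_output = d_model  # required attribute

        def forward(self, x, state=None, **kwargs):
            return x, state          # (n_batch, l_sequence, d_model) -> same shape

    module = PassThrough(d_model=8)
    y, _ = module(torch.randn(2, 10, 8))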
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
property
d_model
¶ Model dimension (generally same as input dimension).
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, encoder) to track the internal shapes of the full model.
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(x, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state=None, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.model¶
-
class
espnet2.asr.state_spaces.model.
SequenceModel
(d_model, n_layers=1, transposed=False, dropout=0.0, tie_dropout=False, prenorm=True, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, dropinp=0.0, drop_path=0.0)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Isotropic deep sequence model backbone, in the style of ResNets / Transformers.
The SequenceModel class implements a generic (batch, length, d_input) -> (batch, length, d_output) transformation
- Parameters:
d_model – Resize input (useful for deep models with residuals)
n_layers – Number of layers
transposed – Transpose inputs so each layer receives (batch, dim, length)
dropout – Dropout parameter applied on every residual and every layer
tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
prenorm – Pre-norm vs. post-norm
n_repeat – Each layer is repeated n times per stage before applying pooling
layer – Layer config, must be specified
residual – Residual config
norm – Normalization config (e.g. layer vs batch)
pool – Config for pooling layer per stage
track_norms – Log norms of each layer output
dropinp – Input dropout
drop_path – Stochastic depth for each residual path
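A hedged construction sketch (per the parameter list above, the layer config must be specified; the value below is illustrative):

    import torch
    from espnet2.asr.state_spaces.model import SequenceModel

    backbone = SequenceModel(d_model=256, n_layers=4, layer={"_name_": "s4"})
    x = torch.randn(2, 100, 256)    # (batch, length, d_input)
    y, state = backbone(x)          # y: (2, 100, backbone.d_output)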
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(inputs, *args, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.attention¶
Multi-Head Attention layer definition.
-
class
espnet2.asr.state_spaces.attention.
MultiHeadedAttention
(n_feat, n_head, dropout=0.0, transposed=False, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Multi-Head Attention layer inheriting SequenceModule.
Compared to the default MHA module in ESPnet, this module returns an additional dummy state and provides a step function for autoregressive inference.
- Parameters:
n_head (int) – The number of heads.
n_feat (int) – The number of features.
dropout_rate (float) – Dropout rate.
Construct a MultiHeadedAttention object.
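A minimal self-attention sketch; per the description above, the returned state is a dummy placeholder:

    import torch
    from espnet2.asr.state_spaces.attention import MultiHeadedAttention

    mha = MultiHeadedAttention(n_feat=256, n_head=4, dropout=0.1)
    x = torch.randn(2, 100, 256)    # (batch, time, size); memory defaults to the query
    y, state = mha(x)               # y: (2, 100, 256)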
-
forward
(query, memory=None, mask=None, *args, **kwargs)[source]¶ Compute scaled dot product attention.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
-
forward_attention
(value, scores, mask)[source]¶ Compute attention context vector.
- Parameters:
value (torch.Tensor) – Transformed value (#batch, n_head, time2, d_k).
scores (torch.Tensor) – Attention score (#batch, n_head, time1, time2).
mask (torch.Tensor) – Mask (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
- Transformed value (#batch, time1, d_model)
weighted by the attention score (#batch, time1, time2).
- Return type:
torch.Tensor
-
forward_qkv
(query, key, value)[source]¶ Transform query, key and value.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
- Returns:
Transformed query tensor (#batch, n_head, time1, d_k).
torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
- Return type:
torch.Tensor
-
step
(query, state, memory=None, mask=None, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.registry¶
espnet2.asr.state_spaces.utils¶
Utilities for dealing with collection objects (lists, dicts) and configs.
-
espnet2.asr.state_spaces.utils.
instantiate
(registry, config, *args, partial=False, wrap=None, **kwargs)[source]¶ Instantiate registered module.
- registry: Dictionary mapping names to functions or target paths
(e.g. {‘model’: ‘models.SequenceModel’})
- config: Dictionary with a ‘_name_’ key indicating which element of the registry
to grab, and kwargs to be passed into the target constructor
- wrap: wrap the target class (e.g. ema optimizer or tasks.wrap)
- *args, **kwargs: additional arguments to override the config and pass into the target constructor
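A hedged sketch with a hypothetical registry entry; the target path and config keys below are illustrative and assume string paths are resolved as described above:

    from espnet2.asr.state_spaces.utils import instantiate

    registry = {"linear": "torch.nn.Linear"}    # name -> target path (hypothetical)
    config = {"_name_": "linear", "in_features": 16, "out_features": 32}
    layer = instantiate(registry, config)       # equivalent to torch.nn.Linear(16, 32)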
-
espnet2.asr.state_spaces.utils.
omegaconf_filter_keys
(d, fn=None)[source]¶ Only keep keys where fn(key) is True. Support nested DictConfig.
espnet2.asr.state_spaces.components¶
-
class
espnet2.asr.state_spaces.components.
DropoutNd
(p: float = 0.5, tie=True, transposed=True)[source]¶ Bases:
torch.nn.modules.module.Module
Initialize dropout module.
tie: tie dropout mask across sequence lengths (Dropout1d/2d/3d)
-
espnet2.asr.state_spaces.components.
LinearActivation
(d_input, d_output, bias=True, zero_bias_init=False, transposed=False, initializer=None, activation=None, activate=False, weight_norm=False, **kwargs)[source]¶ Return a linear module, initialization, and activation.
-
class
espnet2.asr.state_spaces.components.
Normalization
(d, transposed=False, _name_='layer', **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
ReversibleInstanceNorm1dInput
(d, transposed=False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
ReversibleInstanceNorm1dOutput
(norm_input)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
SquaredReLU
[source]¶ Bases:
torch.nn.modules.module.Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
StochasticDepth
(p: float, mode: str)[source]¶ Bases:
torch.nn.modules.module.Module
Stochastic depth module.
See stochastic_depth().
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TSInverseNormalization
(method, normalizer)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TSNormalization
(method, horizon)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TransposedLN
(d, scalar=True)[source]¶ Bases:
torch.nn.modules.module.Module
Transposed LayerNorm module.
LayerNorm module over the second dimension. Assumes shape (B, D, L), where L can be one or more axes.
This is slow; a dedicated CUDA/Triton implementation should provide a substantial end-to-end speedup.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TransposedLinear
(d_input, d_output, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
Transposed linear module.
Linear module on the second-to-last dimension. Assumes shape (B, D, L), where L can be one or more axes.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.state_spaces.components.
stochastic_depth
(input: torch._VariableFunctionsClass.tensor, p: float, mode: str, training: bool = True)[source]¶ Apply stochastic depth.
Implements the Stochastic Depth from “Deep Networks with Stochastic Depth” used for randomly dropping residual branches of residual architectures.
- Parameters:
input (Tensor[N, ...]) – The input tensor of arbitrary dimensions with the first one being its batch, i.e. a batch with N rows.
p (float) – Probability of the input to be zeroed.
mode (str) – “batch” or “row”. “batch” randomly zeroes the entire input, “row” zeroes randomly selected rows from the batch.
training – Apply stochastic depth if True. Default: True.
- Returns:
The randomly zeroed tensor.
- Return type:
Tensor[N, ..]
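A minimal sketch in “row” mode, where entire batch rows (e.g. residual branches per example) are dropped:

    import torch
    from espnet2.asr.state_spaces.components import stochastic_depth

    x = torch.randn(8, 100, 256)
    y = stochastic_depth(x, p=0.2, mode="row", training=True)
    # Each of the 8 rows is zeroed independently with probability 0.2;
    # surviving rows are rescaled so the expected value is preserved.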
espnet2.asr.state_spaces.__init__¶
Initialize sub package.
espnet2.asr.decoder.transformer_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.transformer_decoder.
BaseTransformerDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Base class of Transformer decoder module.
- Parameters:
vocab_size – output dim
encoder_output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters:
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2 and dtype=torch.bool in PyTorch 1.2 and later
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns:
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type:
y, cache
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
TransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
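A minimal teacher-forcing sketch for the plain TransformerDecoder; shapes follow the forward documentation above and the sizes are illustrative:

    import torch
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    decoder = TransformerDecoder(vocab_size=5000, encoder_output_size=256)
    hs_pad = torch.randn(2, 120, 256)                   # encoded memory
    hlens = torch.full((2,), 120, dtype=torch.long)
    ys_in_pad = torch.randint(0, 5000, (2, 30))         # input token ids
    ys_in_lens = torch.full((2,), 30, dtype=torch.long)
    scores, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)  # scores: (2, 30, 5000)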
espnet2.asr.decoder.s4_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.s4_decoder.
S4Decoder
(vocab_size: int, encoder_output_size: int, input_layer: str = 'embed', dropinp: float = 0.0, dropout: float = 0.25, prenorm: bool = True, n_layers: int = 16, transposed: bool = False, tie_dropout: bool = False, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, drop_path: float = 0.0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
S4 decoder module.
- Parameters:
vocab_size – output dim
encoder_output_size – dimension of hidden vector
input_layer – input layer type
dropinp – input dropout
dropout – dropout parameter applied on every residual and every layer
prenorm – pre-norm vs. post-norm
n_layers – number of layers
transposed – transpose inputs so each layer receives (batch, dim, length)
tie_dropout – tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
n_repeat – each layer is repeated n times per stage before applying pooling
layer – layer config, must be specified
residual – residual config
norm – normalization config (e.g. layer vs batch)
pool – config for pooling layer per stage
track_norms – log norms of each layer output
drop_path – drop rate for stochastic depth
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor, state=None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
score
(ys, state, x)[source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns:
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
espnet2.asr.decoder.transducer_decoder¶
(RNN-)Transducer decoder definition.
-
class
espnet2.asr.decoder.transducer_decoder.
TransducerDecoder
(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
(RNN-)Transducer decoder module.
- Parameters:
vocab_size – Output dimension.
layers_type – (RNN-)Decoder layers type.
num_layers – Number of decoder layers.
hidden_size – Number of decoder units per layer.
dropout – Dropout rate for decoder layers.
dropout_embed – Dropout rate for embedding layer.
embed_pad – Embed/Blank symbol ID.
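A minimal sketch of encoding label sequences with the prediction network (sizes are illustrative; the output shape follows the forward documentation below):

    import torch
    from espnet2.asr.decoder.transducer_decoder import TransducerDecoder

    decoder = TransducerDecoder(vocab_size=5000, hidden_size=320)
    labels = torch.randint(1, 5000, (2, 25))    # (B, L) label ids; 0 is the blank/pad symbol
    dec_out = decoder(labels)                   # decoder output sequences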
-
batch_score
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
cache – Pairs of (dec_out, dec_states) for each label sequence (keys).
use_lm – Whether to compute label ID sequences for LM.
- Returns:
Decoder output sequences. (B, D_dec)
dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
lm_labels: Label ID sequences for LM. (B,)
- Return type:
dec_out
-
create_batch_states
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]
- Returns:
Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type:
states
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences. (B, L)
- Returns:
Decoder output sequences. (B, T, U, D_dec)
- Return type:
dec_out
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
rnn_forward
(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters:
sequence – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Returns:
RNN output sequences. (B, D_dec)
(h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type:
sequence
-
score
(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters:
hyp – Hypothesis.
cache – Pairs of (dec_out, state) for each label sequence. (key)
- Returns:
Decoder output sequence. (1, D_dec)
new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec))
label: Label ID for LM. (1,)
- Return type:
dec_out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
idx – State ID to extract.
- Returns:
- Decoder hidden state for given ID.
((N, 1, D_dec), (N, 1, D_dec))
espnet2.asr.decoder.mlm_decoder¶
Masked LM Decoder definition.
-
class
espnet2.asr.decoder.mlm_decoder.
MLMDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing: x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
espnet2.asr.decoder.whisper_decoder¶
-
class
espnet2.asr.decoder.whisper_decoder.
OpenAIWhisperDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: str = None)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Transformer-based Speech-to-Text Decoder from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters:
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2 and dtype=torch.bool in PyTorch 1.2 and later
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns:
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type:
y, cache
- NOTE (Shih-Lun):
cache implementation is ignored for now for simplicity & correctness
-
espnet2.asr.decoder.rnn_decoder¶
-
class
espnet2.asr.decoder.rnn_decoder.
RNNDecoder
(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns:
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
espnet2.asr.decoder.rnn_decoder.
build_attention_list
(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]¶
espnet2.asr.decoder.abs_decoder¶
-
class
espnet2.asr.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.decoder.hugging_face_transformers_decoder¶
Hugging Face Transformers Decoder.
-
class
espnet2.asr.decoder.hugging_face_transformers_decoder.
HuggingFaceTransformersDecoder
(vocab_size: int, encoder_output_size: int, model_name_or_path: str)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
Hugging Face Transformers Decoder.
- Parameters:
encoder_output_size – dimension of encoder attention
model_name_or_path – Hugging Face Transformers model name
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out)
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
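A construction sketch (requires the transformers package; the checkpoint name is illustrative and must point to a model whose decoder weights can be reused)::

    import torch
    from espnet2.asr.decoder.hugging_face_transformers_decoder import (
        HuggingFaceTransformersDecoder,
    )

    decoder = HuggingFaceTransformersDecoder(
        vocab_size=50257, encoder_output_size=256, model_name_or_path="gpt2"
    )
    hs_pad = torch.randn(2, 40, 256)              # encoder memory
    hlens = torch.tensor([40, 35])
    ys_in_pad = torch.randint(0, 50257, (2, 12))  # target token ids
    ys_in_lens = torch.tensor([12, 9])
    x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)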
espnet2.asr.decoder.__init__¶
espnet2.asr.transducer.error_calculator¶
Error Calculator module for Transducer.
-
class
espnet2.asr.transducer.error_calculator.
ErrorCalculatorTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
Calculate CER and WER for transducer models.
- Parameters:
decoder – Decoder module.
token_list – List of tokens.
sym_space – Space symbol.
sym_blank – Blank symbol.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculatorTransducer.
-
calculate_cer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level CER score.
-
calculate_wer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level WER score
-
convert_to_char
(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character sequences.
- Parameters:
pred – Prediction label ID sequences. (B, U)
target – Target label ID sequences. (B, L)
- Returns:
Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)
- Return type:
char_pred
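A usage sketch (decoder, joint_network, token_list, enc_out, and target stand for already-built modules and tensors; the instance is assumed to be callable on encoder outputs and padded targets, mirroring its use inside ESPnetASRModel)::

    from espnet2.asr.transducer.error_calculator import ErrorCalculatorTransducer

    error_calc = ErrorCalculatorTransducer(
        decoder=decoder,
        joint_network=joint_network,
        token_list=token_list,
        sym_space="<space>",
        sym_blank="<blank>",
        report_cer=True,
        report_wer=True,
    )
    cer, wer = error_calc(enc_out, target)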
espnet2.asr.transducer.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet2.asr.transducer.beam_search_transducer.
BeamSearchTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, multi_blank_durations: List[int] = [], multi_blank_indices: List[int] = [], score_norm: bool = True, nbest: int = 1, token_list: Optional[List[str]] = None)[source]¶ Bases:
object
Beam search implementation for Transducer.
Initialize Transducer search module.
- Parameters:
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Beam size.
lm – LM class.
lm_weight – LM weight for soft fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum output sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
multi_blank_durations – The duration of each blank token. (MBG)
multi_blank_indices – The index of each blank token in token_list. (MBG)
score_norm – Normalize final scores by length. (“default”)
nbest – Number of final hypothesis.
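A construction sketch (decoder and joint_network stand for already-built transducer modules; search_type selects among the strategies documented below)::

    from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

    beam_search = BeamSearchTransducer(
        decoder=decoder,              # AbsDecoder, assumed pre-built
        joint_network=joint_network,  # JointNetwork, assumed pre-built
        beam_size=5,
        search_type="default",
        nbest=3,
    )
    nbest_hyps = beam_search(enc_out)  # enc_out: (T, D) encoder output
    best_ids = nbest_hyps[0].yseq      # token ids of the best hypothesis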
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Beam search implementation.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy search implementation.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
1-best hypotheses.
- Return type:
hyp
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Modified Adaptive Expansion Search (mAES) implementation.
Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
multi_blank_greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy Search for Multi-Blank Transducer (Multi-Blank Greedy, MBG).
In this implementation, we assume: 1. the index of the standard blank is the last entry of self.multi_blank_indices rather than self.blank_id (to avoid changing the original transducer too much); 2. the other entries in self.multi_blank_indices are big blanks that account for multiple frames.
Based on https://arxiv.org/abs/2211.03541
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
1-best hypothesis.
- Return type:
hyp
-
nsc_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ N-step constrained beam search implementation.
Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
prefix_search
(hyps: List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis], enc_out_t: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Prefix search for NSC and mAES strategies.
Based on https://arxiv.org/pdf/1211.3711.pdf
-
sort_nbest
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]¶ Sort hypotheses by score or score given sequence length.
- Parameters:
hyps – Hypothesis.
- Returns:
Sorted hypothesis.
- Return type:
hyps
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
class
espnet2.asr.transducer.beam_search_transducer.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]¶ Bases:
espnet2.asr.transducer.beam_search_transducer.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
-
dec_out
= None¶
-
lm_scores
= None¶
-
-
class
espnet2.asr.transducer.beam_search_transducer.
Hypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
-
lm_state
= None¶
-
espnet2.asr.transducer.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.rnnt¶
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
multiblank_rnnt_loss_gpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, big_blank_durations: list, fastemit_lambda: float, clamp: float, num_threads: int, sigma: float)[source]¶ - Wrapper method for accessing GPU Multi-blank RNNT loss
- CUDA implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V + num_big_blanks + 1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V + num_big_blanks + 1] where the gradient will be set.
blank_label – Index of the standard blank token in the vocabulary.
big_blank_durations – A list of supported durations for big blank symbols in the model, e.g. [2, 4, 8]. Note that we only include durations for “big blanks” here; it should not include 1 for the standard blank. Those big blanks have vocabulary indices after the standard blank index.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
sigma – logit-undernormalization weight used in the multi-blank model. Refer to the multi-blank paper https://arxiv.org/pdf/2211.03541 for detailed explanations.
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
rnnt_loss_cpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]¶ Wrapper method for accessing CPU RNNT loss.
- CPU implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.
blank_label – Index of the blank token in the vocabulary.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
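A sketch of the calling convention (dtypes and sizes are illustrative; costs and grads must be pre-allocated zero tensors that the wrapper fills in-place)::

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt import rnnt_loss_cpu

    B, T, U, V = 2, 8, 4, 10          # U counts targets + 1 (illustrative)
    acts = torch.randn(B, T, U, V + 1)
    labels = torch.randint(1, V, (B, U - 1), dtype=torch.int32)
    input_lengths = torch.tensor([T, T - 2], dtype=torch.int32)
    label_lengths = torch.tensor([U - 1, U - 2], dtype=torch.int32)
    costs = torch.zeros(B)            # filled in-place with per-utterance costs
    grads = torch.zeros_like(acts)    # filled in-place with gradients
    rnnt_loss_cpu(acts, labels, input_lengths, label_lengths, costs, grads,
                  blank_label=0, fastemit_lambda=0.0, clamp=0.0, num_threads=1)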
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
rnnt_loss_gpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]¶ Wrapper method for accessing GPU RNNT loss.
- CUDA implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.
blank_label – Index of the blank token in the vocabulary.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
espnet2.asr.transducer.rnnt_multi_blank.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank¶
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
rnnt_loss
(acts, labels, act_lens, label_lens, blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = 0.0)[source]¶ RNN Transducer Loss (functional form).
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing size of each output sequence from the network
label_lens – Tensor of (batch) containing label length of each example
blank (int, optional) – blank label. Default: 0.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
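A minimal end-to-end sketch of the functional form (shapes are illustrative; U below is the target length, so the activation has U + 1 label positions)::

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import rnnt_loss

    B, T, U, V = 2, 8, 3, 10
    acts = torch.randn(B, T, U + 1, V + 1, requires_grad=True)
    labels = torch.randint(1, V, (B, U), dtype=torch.int32)  # zero-padded targets
    act_lens = torch.tensor([T, T - 2], dtype=torch.int32)
    label_lens = torch.tensor([U, U - 1], dtype=torch.int32)
    loss = rnnt_loss(acts, labels, act_lens, label_lens, blank=0, reduction="mean")
    loss.backward()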
-
class
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
RNNTLossNumba
(blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1)[source]¶ Bases:
torch.nn.modules.module.Module
- Parameters:
blank (int, optional) – blank label. Default: 0.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
-
forward
(acts, labels, act_lens, label_lens)[source]¶
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network (the log_probs)
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing the size of each output sequence from the network
label_lens – Tensor of (batch) containing the label length of each example
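The module form wraps the same computation; a sketch reusing the tensors from the functional example above::

    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import RNNTLossNumba

    loss_fn = RNNTLossNumba(blank=0, reduction="mean")
    loss = loss_fn(acts, labels, act_lens, label_lens)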
-
class
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
MultiblankRNNTLossNumba
(blank, big_blank_durations, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1, sigma: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
- Parameters:
blank (int) – standard blank label.
big_blank_durations – list of durations for multi-blank transducer, e.g. [2, 4, 8].
sigma – hyper-parameter for logit under-normalization method for training multi-blank transducers. Recommended value 0.05.
Refer to https://arxiv.org/pdf/2211.03541 for detailed explanations of the above parameters.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
-
forward
(acts, labels, act_lens, label_lens)[source]¶
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network (the log_probs)
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing the size of each output sequence from the network
label_lens – Tensor of (batch) containing the label length of each example
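A construction sketch (the layout assumption, per the GPU loss wrapper above, is that big blanks occupy vocabulary indices after the standard blank, so the model output dimension is V + 1 + len(big_blank_durations))::

    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import (
        MultiblankRNNTLossNumba,
    )

    loss_fn = MultiblankRNNTLossNumba(
        blank=0, big_blank_durations=[2, 4, 8], sigma=0.05
    )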
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.
RNNTStatus
[source]¶ Bases:
enum.Enum
An enumeration.
-
RNNT_STATUS_INVALID_VALUE
= 1¶
-
RNNT_STATUS_SUCCESS
= 0¶
-
-
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.
THRESHOLD
= 0.1¶
espnet2.asr.transducer.rnnt_multi_blank.utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel¶
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_alphas_kernel
[source]¶ Compute alpha (forward variable) probabilities over the transduction step.
- Parameters:
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.
llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- Updates:
Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_betas_kernel
[source]¶ Compute beta (backward variable) probabilities over the transduction step.
- Parameters:
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.
llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- Updates:
Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_grad_kernel
[source]¶ Compute gradients over the transduction step.
- Parameters:
grads – Zero Tensor of shape [B, T, U, V+1]. Is updated by this kernel to contain the gradients of this batch of samples.
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].
betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].
logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
- Updates:
Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_alphas_kernel
[source]¶ - Compute alpha (forward variable) probabilities for multi-blank transducer loss
- Parameters:
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.
llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT standard blank token in the vocabulary.
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_betas_kernel
[source]¶ - Compute beta (backward variable) probabilities for multi-blank transducer loss
- Parameters:
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.
llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT standard blank token in the vocabulary.
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_grad_kernel
[source]¶ - Compute gradients for multi-blank transducer loss
- Parameters:
grads – Zero Tensor of shape [B, T, U, V + 1 + num_big_blanks]. Is updated by this kernel to contain the gradients of this batch of samples.
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].
betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].
logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
logp
[source]¶ Compute the sum of log probability from the activation tensor and its denominator.
- Parameters:
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
mb – Batch indexer.
t – Acoustic sequence timestep indexer.
u – Target sequence timestep indexer.
v – Vocabulary token indexer.
- Returns:
The sum of logprobs[mb, t, u, v] + denom[mb, t, u]
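The flattened pointer arithmetic these kernels rely on can be sketched in plain Python (assuming the row-major layout implied by the shape descriptions above)::

    def logp_offset(mb, t, u, v, maxT, maxU, alphabet_size):
        # resolved offset of logprobs[mb, t, u, v] in the flattened acts tensor
        return ((mb * maxT + t) * maxU + u) * alphabet_size + v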
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce¶
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
CTAReduce
[source]¶ CUDA Warp reduction kernel.
It is a device kernel to be called by other kernels.
The data will be read from the right segment recursively and reduced (via the R_Op) onto the left half. Operation continues while warp size is larger than a given offset. Beyond this offset, warp reduction is performed via shfl_down_sync, which halves the reduction space and sums the two halves at each call.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
tid – CUDA thread index
x – activation. Single float.
storage – shared memory of size CTA_REDUCE_SIZE used for reduction in parallel threads.
count – equivalent to num_rows, which is equivalent to alphabet_size (V+1)
R_opid – Operator ID for reduction. See R_Op for more information.
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
I_Op
[source]¶ Bases:
enum.Enum
Represents an operation that is performed on the input tensor
-
EXPONENTIAL
= 0¶
-
IDENTITY
= 1¶
-
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
R_Op
[source]¶ Bases:
enum.Enum
Represents a reduction operation performed on the input tensor
-
ADD
= 0¶
-
MAXIMUM
= 1¶
-
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
ReduceHelper
(I_opid: int, R_opid: int, acts: torch.Tensor, output: torch.Tensor, num_rows: int, num_cols: int, minus: bool, stream)[source]¶ CUDA Warp reduction kernel helper which reduces via R_Op.Add and writes the result to output according to the I_opid.
The result is stored in the blockIdx.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
I_opid – Operator ID for input. See I_Op for more information.
R_opid – Operator ID for reduction. See R_Op for more information.
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
num_rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
num_cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
reduce_exp
(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]¶ Helper method to call the Warp Reduction Kernel to perform exp reduction.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
reduce_max
(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]¶ Helper method to call the Warp Reduction Kernel to perform max reduction.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.
GPURNNT
(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]¶ Bases:
object
Helper class to launch the CUDA Kernels to compute the Transducer Loss.
- Parameters:
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
stream – Numba Cuda Stream.
-
compute_cost_and_score
(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶ Compute both the loss and the gradients.
- Parameters:
acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
grads – A flattened zero tensor of the same shape as acts.
costs – A zero vector of length B which will be updated inplace with the log probability costs.
labels – A flattened matrix of labels of shape [B, U]
label_lengths – A vector of length B that contains the original lengths of the target sequence.
input_lengths – A vector of length B that contains the original lengths of the acoustic sequence.
- Updates:
This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.
- Returns:
An enum that either represents a successful RNNT operation or failure.
-
cost_and_grad
(acts: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]¶
-
log_softmax
(acts: torch.Tensor, denom: torch.Tensor)[source]¶ Computes the log softmax denominator of the input activation tensor and stores the result in denom.
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1]. The input must be represented as a flat tensor of shape [B * T * U * (V+1)] to allow pointer indexing.
denom – A zero tensor of same shape as acts.
- Updates:
This kernel inplace updates the denom tensor
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.
MultiblankGPURNNT
(sigma: float, num_big_blanks: int, minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, big_blank_workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]¶ Bases:
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT
- Helper class to launch the CUDA Kernels to compute Multi-blank Transducer Loss
- Parameters:
sigma – Hyper-parameter related to the logit-normalization method in training multi-blank transducers.
num_big_blanks – Number of big blank symbols the model has. This should not include the standard blank symbol.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V + 1 + num_big_blanks
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
big_blank_workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory specifically for the multi-blank related computations.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
stream – Numba Cuda Stream.
-
compute_cost_and_score
(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶ Compute both the loss and the gradients.
- Parameters:
acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
grads – A flattened zero tensor of the same shape as acts.
costs – A zero vector of length B which will be updated inplace with the log probability costs.
labels – A flattened matrix of labels of shape [B, U]
label_lengths – A vector of length B that contains the original lengths of the target sequence.
input_lengths – A vector of length B that contains the original lengths of the acoustic sequence.
- Updates:
This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.
- Returns:
An enum that either represents a successful RNNT operation or failure.
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CPURNNT
(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace: torch.Tensor, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, batch_first: bool)[source]¶ Bases:
object
Helper class to compute the Transducer Loss on CPU.
- Parameters:
minibatch – Size of the minibatch b.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
batch_first – Bool that decides if batch dimension is first or third.
-
compute_alphas
(log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor)[source]¶ Compute the probability of the forward variable alpha.
- Parameters:
log_probs – Flattened tensor [B, T, U, V+1]
T – Length of the acoustic sequence T (not padded).
U – Length of the target sequence U (not padded).
alphas – Working space memory for alpha of shape [B, T, U].
- Returns:
Loglikelihood of the forward variable alpha.
-
compute_betas_and_grads
(grad: torch.Tensor, log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor, betas: torch.Tensor, labels: torch.Tensor, logll: torch.Tensor)[source]¶ Compute backward variable beta as well as gradients of the activation matrix wrt loglikelihood of forward variable.
- Parameters:
grad – Working space memory of flattened shape [B, T, U, V+1]
log_probs – Activation tensor of flattened shape [B, T, U, V+1]
T – Length of the acoustic sequence T (not padded).
U – Length of the target sequence U (not padded).
alphas – Working space memory for alpha of shape [B, T, U].
betas – Working space memory for beta of shape [B, T, U].
labels – Ground truth label of shape [B, U]
logll – Loglikelihood of the forward variable.
- Returns:
Loglikelihood of the forward variable and inplace updates the grad tensor.
-
cost_and_grad
(log_probs: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, flat_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CpuRNNT_index
(U: int, maxU: int, minibatch: int, alphabet_size: int, batch_first: bool)[source]¶ Bases:
object
A placeholder Index computation class that emits the resolved index in a flattened tensor, mimicking pointer indexing in CUDA kernels on the CPU.
- Parameters:
U – Length of the current target sample (without padding).
maxU – Max Length of the padded target samples.
minibatch – Minibatch index
alphabet_size – Size of the vocabulary including RNNT blank - V+1.
batch_first – Bool flag determining if batch index is first or third.
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CpuRNNT_metadata
(T: int, U: int, workspace: torch.Tensor, bytes_used: int, blank: int, labels: torch.Tensor, log_probs: torch.Tensor, idx: espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_index)[source]¶ Bases:
object
Metadata for CPU based RNNT loss calculation. Holds the working space memory.
- Parameters:
T – Length of the acoustic sequence (without padding).
U – Length of the target sequence (without padding).
workspace – Working space memory for the CPU.
bytes_used – Number of bytes currently used for indexing the working space memory. Generally 0.
blank – Index of the blank token in the vocabulary.
labels – Ground truth padded labels matrix of shape [B, U]
log_probs – Log probs / activation matrix of flattened shape [B, T, U, V+1]
idx –
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
LogSoftmaxGradModification
(*args, **kwargs)[source]¶ Bases:
torch.autograd.function.Function
-
static
backward
(ctx, grad_output)[source]¶ Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context
ctx
as the first argument, followed by as many outputs as the
forward()
returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to
forward()
. Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input. The context can be used to retrieve tensors saved during the forward pass. It also has an attribute
ctx.needs_input_grad
, a tuple of booleans representing whether each input needs gradient. E.g.,
backward()
will have
ctx.needs_input_grad[0] = True
if the first input to
forward()
needs gradient computed w.r.t. the output.
-
static
forward
(ctx, acts, clamp)[source]¶ Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store arbitrary data that can then be retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with
ctx.save_for_backward()
if they are intended to be used in
backward
(equivalently,
vjp
) or with
ctx.save_for_forward()
if they are intended to be used in
jvp
.
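As an illustration of this forward/backward contract, a minimal gradient-clamping Function in the same spirit (a sketch, not this module's exact implementation)::

    import torch

    class ClampGrad(torch.autograd.Function):
        @staticmethod
        def forward(ctx, acts, clamp):
            ctx.clamp = clamp  # non-tensor value, stored directly on ctx
            return acts

        @staticmethod
        def backward(ctx, grad_output):
            # clamp the incoming gradient; return None for the non-tensor input
            return grad_output.clamp(-ctx.clamp, ctx.clamp), None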
espnet2.asr.frontend.s3prl¶
-
class
espnet2.asr.frontend.s3prl.
S3prlFrontend
(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str = None, multilayer_feature: bool = False, layer: int = -1)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Pretrained Representation frontend structure for ASR.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.default¶
-
class
espnet2.asr.frontend.default.
DefaultFrontend
(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Conventional frontend structure for ASR.
Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.fused¶
-
class
espnet2.asr.frontend.fused.
FusedFrontends
(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.abs_frontend¶
-
class
espnet2.asr.frontend.abs_frontend.
AbsFrontend
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.__init__¶
espnet2.asr.frontend.windowing¶
Sliding Window for raw audio input data.
-
class
espnet2.asr.frontend.windowing.
SlidingWindow
(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int = None, fs=None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Sliding Window.
Provides a sliding window over a batched continuous raw audio tensor. Optionally provides padding (currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.
Known issues: the output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded since padding is not implemented yet. There is currently no additional window function applied to input values.
Initialize.
- Parameters:
win_length – Length of frame.
hop_length – Relative starting point of next frame.
channels – Number of input channels.
padding – Padding (placeholder, currently not implemented).
fs – Sampling rate (placeholder for compatibility, not used).
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply a sliding window on the input.
- Parameters:
input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.
input_lengths – Input lengths within batch.
- Returns:
Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.
- Return type:
Tensor
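A usage sketch (sizes are illustrative)::

    import torch
    from espnet2.asr.frontend.windowing import SlidingWindow

    frontend = SlidingWindow(win_length=400, hop_length=160)
    speech = torch.randn(2, 16000)             # raw audio (B, T), with C=D=1
    lengths = torch.tensor([16000, 12000])
    frames, frame_lengths = frontend(speech, lengths)  # (B, T', C, win_length)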
espnet2.asr.frontend.whisper¶
-
class
espnet2.asr.frontend.whisper.
WhisperFrontend
(whisper_model: str = 'small', freeze_weights: bool = True, download_dir: str = None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Representation Using Encoder Outputs from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
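A usage sketch (requires the openai-whisper package; weights are downloaded on first use, and sizes are illustrative)::

    import torch
    from espnet2.asr.frontend.whisper import WhisperFrontend

    frontend = WhisperFrontend(whisper_model="small", freeze_weights=True)
    speech = torch.randn(1, 16000)      # 1 s of 16 kHz audio
    lengths = torch.tensor([16000])
    feats, feat_lengths = frontend(speech, lengths)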