espnet2.asr package¶
espnet2.asr.espnet_model¶
-
class
espnet2.asr.espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], aux_ctc: dict = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', transducer_multi_blank_durations: List = [], transducer_multi_blank_sigma: float = 0.05, sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
CTC-attention hybrid Encoder-Decoder model
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood (nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, calls nll for each batch, and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage
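Below is a minimal sketch of this batching idea (a hypothetical helper, not the ESPnet implementation; nll_fn stands in for the model's nll method):

```python
import torch

def batchify_nll_sketch(nll_fn, encoder_out, encoder_out_lens,
                        ys_pad, ys_pad_lens, batch_size=100):
    """Split the inputs along the batch dimension, score each chunk with
    nll_fn, and concatenate the per-utterance results."""
    results = []
    for start in range(0, encoder_out.size(0), batch_size):
        end = start + batch_size
        results.append(
            nll_fn(
                encoder_out[start:end],
                encoder_out_lens[start:end],
                ys_pad[start:end],
                ys_pad_lens[start:end],
            )
        )
    return torch.cat(results)  # (Batch,) negative log likelihoods
```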
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
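A usage sketch of the calling convention (it assumes `model` is an ESPnetASRModel instance built elsewhere, e.g. by an ESPnet task/config; only the shapes and return types documented above are used):

```python
import torch

# `model`: an already-constructed ESPnetASRModel (construction omitted).
speech = torch.randn(2, 16000)                 # (Batch, Length) raw waveforms
speech_lengths = torch.tensor([16000, 12000])  # (Batch,)
text = torch.tensor([[3, 7, 9], [4, 5, -1]])   # (Batch, Length); -1 = ignore_id padding
text_lengths = torch.tensor([3, 2])            # (Batch,)

loss, stats, weight = model(speech, speech_lengths, text, text_lengths)
loss.backward()  # `stats` holds loss/CER/WER values for reporting
```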
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.discrete_asr_espnet_model¶
-
class
espnet2.asr.discrete_asr_espnet_model.
ESPnetDiscreteASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: Optional[espnet2.asr.ctc.CTC], ctc_weight: float = 0.5, interctc_weight: float = 0.0, src_vocab_size: int = 0, src_token_list: Union[Tuple[str, ...], List[str]] = [], ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_bleu: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, share_decoder_input_output_embed: bool = False, share_encoder_decoder_input_embed: bool = False)[source]¶ Bases:
espnet2.mt.espnet_model.ESPnetMTModel
Encoder-Decoder model
-
encode
(src_text: torch.Tensor, src_text_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by mt_inference.py
- Parameters:
src_text – (Batch, Length, …)
src_text_lengths – (Batch, )
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, src_text: torch.Tensor, src_text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
text – (Batch, Length)
text_lengths – (Batch,)
src_text – (Batch, length)
src_text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
espnet2.asr.pit_espnet_model¶
-
class
espnet2.asr.pit_espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1, num_inf: int = 1, num_ref: int = 1)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
CTC-attention hybrid Encoder-Decoder model
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
-
class
espnet2.asr.pit_espnet_model.
PITLossWrapper
(criterion_fn: Callable, num_ref: int)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
-
forward
(inf: torch.Tensor, inf_lens: torch.Tensor, ref: torch.Tensor, ref_lens: torch.Tensor, others: Dict = None)[source]¶ PITLoss wrapper function. Similar to espnet2/enh/loss/wrappers/pit_solver.py
- Parameters:
inf – Iterable[torch.Tensor], (batch, num_inf, …)
inf_lens – Iterable[torch.Tensor], (batch, num_inf, …)
ref – Iterable[torch.Tensor], (batch, num_ref, …)
ref_lens – Iterable[torch.Tensor], (batch, num_ref, …)
permute_inf – If true, permute the inference and inference_lens according to the optimal permutation.
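A simplified sketch of the permutation-invariant training (PIT) idea this wrapper implements (not the actual ESPnet code; criterion_fn is assumed to return a per-utterance loss of shape (batch,)):

```python
import itertools
import torch

def pit_loss_sketch(criterion_fn, inf, inf_lens, ref, ref_lens):
    """Conceptual PIT: evaluate the loss under every speaker permutation
    and, per utterance, keep the minimum. inf/ref: (batch, num_spk, ...)."""
    num_ref = ref.size(1)
    losses = torch.stack(
        [
            torch.stack(
                [
                    criterion_fn(inf[:, i], inf_lens[:, i],
                                 ref[:, p], ref_lens[:, p])
                    for i, p in enumerate(perm)
                ]
            ).mean(dim=0)  # average over speakers -> (batch,)
            for perm in itertools.permutations(range(num_ref))
        ]
    )  # (num_permutations, batch)
    min_loss, _ = losses.min(dim=0)  # best permutation per utterance
    return min_loss.mean()
```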
-
espnet2.asr.ctc¶
-
class
espnet2.asr.ctc.
CTC
(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = None, zero_infinity: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module.
- Parameters:
odim – dimension of outputs
encoder_output_size – number of encoder projection units
dropout_rate – dropout rate (0.0 ~ 1.0)
ctc_type – builtin or gtnctc
reduce – reduce the CTC loss into a scalar
ignore_nan_grad – Same as zero_infinity (kept for backward compatibility)
zero_infinity – Whether to zero infinite losses and the associated gradients.
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters:
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns:
argmax applied 2d tensor (B, Tmax)
- Return type:
torch.Tensor
-
forward
(hs_pad, hlens, ys_pad, ys_lens)[source]¶ Calculate CTC loss.
- Parameters:
hs_pad – batch of padded hidden state sequences (B, Tmax, D)
hlens – batch of lengths of hidden state sequences (B)
ys_pad – batch of padded character id sequence tensor (B, Lmax)
ys_lens – batch of lengths of character sequence (B)
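A usage sketch built only from the signatures documented above (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.ctc import CTC

ctc = CTC(odim=50, encoder_output_size=256)   # builtin torch CTC loss by default

hs_pad = torch.randn(4, 120, 256)             # (B, Tmax, D) encoder outputs
hlens = torch.tensor([120, 100, 90, 80])      # (B,)
ys_pad = torch.randint(1, 50, (4, 20))        # (B, Lmax) target ids; 0 is blank
ys_lens = torch.tensor([20, 18, 15, 10])      # (B,)

loss = ctc(hs_pad, hlens, ys_pad, ys_lens)    # scalar when reduce=True
best_paths = ctc.argmax(hs_pad)               # (B, Tmax) frame-level argmax
```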
espnet2.asr.maskctc_model¶
-
class
espnet2.asr.maskctc_model.
MaskCTCInference
(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]¶ Bases:
torch.nn.modules.module.Module
Mask-CTC-based non-autoregressive inference
Initialize Mask-CTC inference
-
class
espnet2.asr.maskctc_model.
MaskCTCModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood (nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, calls nll for each batch, and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.__init__¶
espnet2.asr.layers.cgmlp¶
MLP with convolutional gating (cgMLP) definition.
References
https://openreview.net/forum?id=RA-zVvZLYIy https://arxiv.org/abs/2105.08050
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalGatingMLP
(size: int, linear_units: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Gating MLP (cgMLP).
-
forward
(x, mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalSpatialGatingUnit
(size: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Spatial Gating Unit (CSGU).
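A conceptual sketch of the CSGU gating described in the cgMLP paper (arXiv:2105.08050); this is a simplified stand-in, not the ESPnet module, and it builds throwaway layers inline purely for illustration:

```python
import torch

def csgu_sketch(x):
    """Split channels in half, normalize and depthwise-convolve the gate
    half over time, then gate the other half elementwise."""
    b, t, d2 = x.shape
    d = d2 // 2
    norm = torch.nn.LayerNorm(d)                                # per-channel norm
    dw_conv = torch.nn.Conv1d(d, d, kernel_size=31, padding=15, groups=d)
    x_r, x_g = x.chunk(2, dim=-1)                               # (b, t, d) each
    x_g = dw_conv(norm(x_g).transpose(1, 2)).transpose(1, 2)    # conv over time
    return x_r * x_g                                            # (b, t, d)
```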
espnet2.asr.layers.__init__¶
espnet2.asr.layers.fastformer¶
Fastformer attention definition.
- Reference:
Wu et al., “Fastformer: Additive Attention Can Be All You Need” https://arxiv.org/abs/2108.09084 https://github.com/wuch15/Fastformer
-
class
espnet2.asr.layers.fastformer.
FastSelfAttention
(size, attention_heads, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Fast self-attention used in Fastformer.
espnet2.asr.encoder.rnn_encoder¶
-
class
espnet2.asr.encoder.rnn_encoder.
RNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
RNNEncoder class.
- Parameters:
input_size – The number of expected features in the input
output_size – The number of output features
hidden_size – The number of hidden features
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
dropout – dropout probability
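A usage sketch based on the documented defaults (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.encoder.rnn_encoder import RNNEncoder

encoder = RNNEncoder(input_size=80)    # defaults: 4-layer BLSTM with projection
xs_pad = torch.randn(2, 200, 80)       # (B, T, input_size) padded features
ilens = torch.tensor([200, 150])       # (B,) true lengths

out, out_lens, _ = encoder(xs_pad, ilens)
print(out.shape, out_lens)             # time axis reduced per `subsample`
```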
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.transformer_encoder¶
Transformer encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
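A usage sketch based on the documented signature (it assumes espnet2 is installed; with the default conv2d input layer the time axis is subsampled by roughly 4):

```python
import torch
from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

encoder = TransformerEncoder(input_size=80)  # output_size defaults to 256
xs_pad = torch.randn(2, 200, 80)             # (B, T, input_size)
ilens = torch.tensor([200, 180])             # (B,)

out, out_lens, _ = encoder(xs_pad, ilens)    # (B, T', 256), (B,), None
```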
espnet2.asr.encoder.abs_encoder¶
-
class
espnet2.asr.encoder.abs_encoder.
AbsEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.vgg_rnn_encoder¶
-
class
espnet2.asr.encoder.vgg_rnn_encoder.
VGGRNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
VGGRNNEncoder class.
- Parameters:
input_size – The number of expected features in the input
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
hidden_size – The number of hidden features
output_size – The number of output features
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.hubert_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertEncoder
(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning
- Parameters:
input_size – input dim
hubert_url – url to Hubert pretrained model
hubert_dir_path – directory to download the Hubert pretrained model.
output_size – dimension of attention
normalize_before – whether to use layer_norm before the first block
freeze_finetune_updates – number of steps during which all layers except the output layer are frozen before tuning the whole model (necessary to prevent overfitting).
dropout_rate – dropout rate
activation_dropout – dropout rate in activation function
attention_dropout – dropout rate in attention
- Hubert specific Args:
Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert ASR Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertPretrainEncoder
(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert pretrain encoder module, used only for the pretraining stage
- Parameters:
input_size – input dim
output_size – dimension of attention
linear_units – dimension of feedforward layers
attention_heads – the number of heads of multi head attention
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
hubert_dict – target dictionary for Hubert pretraining
label_rate – label frame rate. -1 for sequence label
sample_rate – target sample rate.
use_amp – whether to use automatic mixed precision
normalize_before – whether to use layer_norm before the first block
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
TorchAudioHuBERTPretrainEncoder
(input_size: int = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]] = [(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 2, 2), (512, 2, 2)], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: Optional[float] = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Torch Audio Hubert encoder module.
- Parameters:
extractor_mode – Operation mode of feature extractor. Valid values are “group_norm” or “layer_norm”.
extractor_conv_layer_config – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [(output_channel, kernel_size, stride), …]
extractor_conv_bias – Whether to include bias term to each convolution operation.
encoder_embed_dim – The dimension of embedding in encoder.
encoder_projection_dropout – The dropout probability applied after the input feature is projected to “encoder_embed_dim”.
encoder_pos_conv_kernel – Kernel size of convolutional positional embeddings.
encoder_pos_conv_groups – Number of groups of convolutional positional embeddings.
encoder_num_layers – Number of self attention layers in transformer block.
encoder_num_heads – Number of heads in self attention layers.
encoder_attention_dropout – Dropout probability applied after softmax in self-attention layer.
encoder_ff_interm_features – Dimension of hidden features in feed forward layer.
encoder_ff_interm_dropout – Dropout probability applied in feedforward layer.
encoder_dropout – Dropout probability applied at the end of feed forward layer.
encoder_layer_norm_first – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers.
encoder_layer_drop – Probability to drop each encoder layer during training.
mask_prob – Probability for each token to be chosen as start of the span to be masked.
mask_selection – How to choose the mask length. Options: [static, uniform, normal, poisson].
mask_other – Secondary mask argument (used for more complex distributions).
mask_length – The lengths of the mask.
no_mask_overlap – Whether to allow masks to overlap.
mask_min_space – Minimum space between spans (if no overlap is enabled).
mask_channel_prob – (float): The probability of replacing a feature with 0.
mask_channel_selection – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
mask_channel_other – Secondary mask argument for channel masking (used for more complex distributions).
mask_channel_length – The lengths of the mask for channel masking.
no_mask_channel_overlap – Whether to allow channel masks to overlap.
mask_channel_min_space – Minimum space between spans for channel masking (if no overlap is enabled).
skip_masked – If True, skip computing losses over masked frames.
skip_nomask – If True, skip computing losses over unmasked frames.
num_classes – The number of classes in the labels.
final_dim – Project final representations and targets to final_dim.
feature_grad_mult – The factor to scale the convolutional feature extraction layer gradients by. The scale factor will not affect the forward pass.
finetuning – Whether to fine-tune the model with ASR or other tasks.
freeze_encoder_updates – The number of steps to freeze the encoder parameters in ASR finetuning.
- Hubert specific Args:
Please refer to: https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor = None, ys_pad_length: torch.Tensor = None, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.wav2vec2_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.wav2vec2_encoder.
FairSeqWav2Vec2Encoder
(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Wav2Vec2 encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
w2v_url – url to Wav2Vec2.0 pretrained model
w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.
normalize_before – whether to use layer_norm before the first block
finetune_last_n_layers – last n layers to be fine-tuned in Wav2Vec2.0; 0 means to fine-tune every layer if freeze_w2v=False.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward FairSeqWav2Vec2 Encoder.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.transformer_encoder_multispkr¶
Encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder_multispkr.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, num_blocks_sd: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, num_inf: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of recognition encoder blocks
num_blocks_sd – the number of speaker dependent encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
num_inf – number of inference output
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.encoder.contextual_block_conformer_encoder¶
Created on Sat Aug 21 17:27:16 2021.
@author: Keqi Deng (UCAS)
-
class
espnet2.asr.encoder.contextual_block_conformer_encoder.
ContextualBlockConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Conformer encoder module.
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to apply positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns:
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
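A hypothetical streaming loop using forward_infer (encoder construction is omitted; only the documented signature, with states carried between calls, is assumed):

```python
import torch

# `enc`: an already-built ContextualBlockConformerEncoder (construction omitted).
feats = torch.randn(1, 320, 80)          # (1, T, D) features of a full utterance
chunks = feats.split(64, dim=1)          # feed fixed-size chunks in order
states = None
for i, chunk in enumerate(chunks):
    ilens = torch.tensor([chunk.size(1)])
    out, out_lens, states = enc.forward_infer(
        chunk, ilens, prev_states=states, is_final=(i == len(chunks) - 1)
    )
    # `out` holds the newly encoded frames available after this block
```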
espnet2.asr.encoder.whisper_encoder¶
-
class
espnet2.asr.encoder.whisper_encoder.
OpenAIWhisperEncoder
(input_size: int = 1, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: str = None, use_specaug: bool = False, specaug_conf: Optional[dict] = None, do_pad_trim: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer-based Speech Encoder from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
log_mel_spectrogram
(audio: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]¶ Use log-mel spectrogram computation native to Whisper training
-
pad_or_trim
(array: torch.Tensor, length: int, axis: int = -1) → torch.Tensor[source]¶ Pad or trim the audio array to N_SAMPLES.
Used in zero-shot inference cases.
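A usage sketch (it assumes the openai-whisper dependency is available; pretrained weights are fetched on first construction):

```python
import torch
from espnet2.asr.encoder.whisper_encoder import OpenAIWhisperEncoder

encoder = OpenAIWhisperEncoder(whisper_model="small")
speech = torch.randn(1, 32000)            # (B, samples) raw 16 kHz audio
ilens = torch.tensor([32000])             # (B,)

out, out_lens, _ = encoder(speech, ilens)  # downsampled hidden states
```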
espnet2.asr.encoder.e_branchformer_encoder¶
E-Branchformer encoder definition. Reference:
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe, “E-Branchformer: Branchformer with Enhanced merging for speech recognition,” in SLT 2022.
-
class
espnet2.asr.encoder.e_branchformer_encoder.
EBranchformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, use_ffn: bool = False, macaron_ffn: bool = False, ffn_activation_type: str = 'swish', linear_units: int = 2048, positionwise_layer_type: str = 'linear', merge_conv_kernel: int = 3, interctc_layer_idx=None, interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
E-Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, max_layer: int = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
ctc (CTC) – Intermediate CTC module.
max_layer (int) – Layer depth below which InterCTC is applied.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
-
-
class
espnet2.asr.encoder.e_branchformer_encoder.
EBranchformerEncoderLayer
(size: int, attn: torch.nn.modules.module.Module, cgmlp: torch.nn.modules.module.Module, feed_forward: Optional[torch.nn.modules.module.Module], feed_forward_macaron: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_conv_kernel: int = 3)[source]¶ Bases:
torch.nn.modules.module.Module
E-Branchformer encoder layer module.
- Parameters:
size (int) – model dimension
attn – standard self-attention or efficient attention
cgmlp – ConvolutionalGatingMLP
feed_forward – feed-forward module, optional
feed_forward_macaron – macaron-style feed-forward module, optional
dropout_rate (float) – dropout probability
merge_conv_kernel (int) – kernel size of the depth-wise conv in merge module
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type:
torch.Tensor
espnet2.asr.encoder.__init__¶
espnet2.asr.encoder.longformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.longformer_encoder.
LongformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]¶ Bases:
espnet2.asr.encoder.conformer_encoder.ConformerEncoder
Longformer SA Conformer encoder module.
- Parameters:
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
attention_windows (list) – Layer-wise attention window sizes for longformer self-attn
attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn
attention_mode (str) – Implementation for longformer self-attn. Default=”sliding_chunks” Choose ‘n2’, ‘tvm’ or ‘sliding_chunks’. More details in https://github.com/allenai/longformer
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
espnet2.asr.encoder.branchformer_encoder¶
Branchformer encoder definition.
- Reference:
Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding,” in Proceedings of ICML, 2022.
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoder
(input_size: int, output_size: int = 256, use_attn: bool = True, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', use_cgmlp: bool = True, cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', merge_method: str = 'concat', cgmlp_weight: Union[float, List[float]] = 0.5, attn_branch_drop_rate: Union[float, List[float]] = 0.0, num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
-
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoderLayer
(size: int, attn: Optional[torch.nn.modules.module.Module], cgmlp: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_method: str, cgmlp_weight: float = 0.5, attn_branch_drop_rate: float = 0.0, stochastic_depth_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Branchformer encoder layer module.
- Parameters:
size (int) – model dimension
attn – standard self-attention or efficient attention, optional
cgmlp – ConvolutionalGatingMLP, optional
dropout_rate (float) – dropout probability
merge_method (str) – concat, learned_ave, fixed_ave
cgmlp_weight (float) – weight of the cgmlp branch, between 0 and 1, used if merge_method is fixed_ave
attn_branch_drop_rate (float) – probability of dropping the attn branch, used if merge_method is learned_ave
stochastic_depth_rate (float) – stochastic depth probability
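A conceptual sketch of the merge methods listed above (simplified, not the ESPnet code; merge_proj is a hypothetical Linear(2*d, d) used only by the concat variant, and learned_ave differs from fixed_ave only in where the weight comes from):

```python
import torch

def merge_branches_sketch(x_attn, x_cgmlp, method="fixed_ave",
                          cgmlp_weight=0.5, merge_proj=None):
    """Merge the attention and cgMLP branch outputs (both (B, T, d))."""
    if method == "concat":
        # concatenate both branches and project back to the model size
        return merge_proj(torch.cat([x_attn, x_cgmlp], dim=-1))
    # fixed_ave / learned_ave: weighted average of the two branches
    return (1.0 - cgmlp_weight) * x_attn + cgmlp_weight * x_cgmlp
```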
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type:
torch.Tensor
espnet2.asr.encoder.conformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.conformer_encoder.
ConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Conformer encoder module.
- Parameters:
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters:
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns:
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type:
torch.Tensor
espnet2.asr.encoder.contextual_block_transformer_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.contextual_block_transformer_encoder.
ContextualBlockTransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Transformer encoder module.
Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)
- Parameters:
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to apply positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns:
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters:
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns:
position embedded tensor and mask
espnet2.asr.postencoder.abs_postencoder¶
-
class
espnet2.asr.postencoder.abs_postencoder.
AbsPostEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.postencoder.hugging_face_transformers_postencoder¶
Hugging Face Transformers PostEncoder.
-
class
espnet2.asr.postencoder.hugging_face_transformers_postencoder.
HuggingFaceTransformersPostEncoder
(input_size: int, model_name_or_path: str, length_adaptor_n_layers: int = 0, lang_token_id: int = -1)[source]¶ Bases:
espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder
Hugging Face Transformers PostEncoder.
Initialize the module.
espnet2.asr.postencoder.__init__¶
espnet2.asr.preencoder.abs_preencoder¶
-
class
espnet2.asr.preencoder.abs_preencoder.
AbsPreEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.sinc¶
Sinc convolutions for raw audio input.
-
class
espnet2.asr.preencoder.sinc.
LightweightSincConvs
(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Lightweight Sinc Convolutions.
Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597
To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and use the input of the sliding window frontend with frontend: sliding_window in your yaml configuration file, so that the process flow is:
Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
Note that this method also performs data augmentation in time domain (vs. in spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.
Initialize the module.
- Parameters:
fs – Sample rate.
in_channels – Number of input channels.
out_channels – Number of output channels (for each input channel).
activation_type – Choice of activation function.
dropout_type – Choice of dropout function.
windowing_type – Choice of windowing function.
scale_type – Choice of filter-bank initialization scale.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply Lightweight Sinc Convolutions.
The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.
The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.
The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization multiplied with C_in.
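A usage sketch matching the shapes described above (it assumes espnet2 is installed):

```python
import torch
from espnet2.asr.preencoder.sinc import LightweightSincConvs

preenc = LightweightSincConvs(fs=16000)  # defaults: 1 in-channel, 256 out
x = torch.randn(2, 50, 1, 400)           # (B, T, C_in, D_in) sliding windows
x_lens = torch.tensor([50, 40])          # (B,)

y, y_lens = preenc(x, x_lens)            # (B, T, C_out * D_out)
```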
-
gen_lsc_block
(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]¶ Generate a convolutional block for Lightweight Sinc convolutions.
Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.
- Parameters:
in_channels – Number of input channels.
out_channels – Number of output channels.
depthwise_kernel_size – Kernel size of the depthwise convolution.
depthwise_stride – Stride of the depthwise convolution.
depthwise_groups – Number of groups of the depthwise convolution.
pointwise_groups – Number of groups of the pointwise convolution.
dropout_probability – Dropout probability in the block.
avgpool – If True, an AvgPool layer is inserted.
- Returns:
Neural network building block.
- Return type:
torch.nn.Sequential
-
class
espnet2.asr.preencoder.sinc.
SpatialDropout
(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]¶ Bases:
torch.nn.modules.module.Module
Spatial dropout module.
Apply dropout to full channels on tensors of input (B, C, D)
Initialize.
- Parameters:
dropout_probability – Dropout probability.
shape (tuple, list) – Shape of input tensors.
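A usage sketch (assuming the module is called directly on a (B, C, D) tensor as described):

```python
import torch
from espnet2.asr.preencoder.sinc import SpatialDropout

drop = SpatialDropout(dropout_probability=0.15)
x = torch.randn(8, 64, 400)   # (B, C, D)
y = drop(x)                   # entire channels are zeroed during training
```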
espnet2.asr.preencoder.linear¶
Linear Projection.
-
class
espnet2.asr.preencoder.linear.
LinearProjection
(input_size: int, output_size: int, dropout: float = 0.0)[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Linear Projection Preencoder.
Initialize the module.
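A minimal sketch, assuming the standard pre-encoder interface of (input, input_lengths) -> (output, output_lengths); the sizes are illustrative:

    import torch
    from espnet2.asr.preencoder.linear import LinearProjection

    projection = LinearProjection(input_size=1024, output_size=80, dropout=0.1)
    feats = torch.randn(2, 100, 1024)               # e.g. upstream frontend features
    lengths = torch.full((2,), 100, dtype=torch.long)
    out, out_lengths = projection(feats, lengths)   # out: (2, 100, 80)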
espnet2.asr.preencoder.__init__¶
espnet2.asr.specaug.abs_specaug¶
-
class
espnet2.asr.specaug.abs_specaug.
AbsSpecAug
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract class for spectrogram augmentation.
The process flow:
Frontend -> SpecAug -> Normalization -> Encoder -> Decoder
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.specaug.__init__¶
espnet2.asr.specaug.specaug¶
SpecAugment module.
-
class
espnet2.asr.specaug.specaug.
SpecAug
(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]¶ Bases:
espnet2.asr.specaug.abs_specaug.AbsSpecAug
Implementation of SpecAug.
- Reference:
Daniel S. Park et al. “SpecAugment: A Simple Data
Augmentation Method for Automatic Speech Recognition”
Warning
When using CUDA, time_warp is not reproducible due to torch.nn.functional.interpolate.
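A minimal usage sketch on fbank-like features. The explicit time_mask_width_range below is an illustrative choice (one of time_mask_width_range / time_mask_width_ratio_range is expected when time masking is enabled):

    import torch
    from espnet2.asr.specaug.specaug import SpecAug

    specaug = SpecAug(time_mask_width_range=(0, 40))        # other options keep their defaults
    feats = torch.randn(2, 300, 80)                         # (Batch, Time, Freq)
    feat_lengths = torch.full((2,), 300, dtype=torch.long)
    masked, masked_lengths = specaug(feats, feat_lengths)   # same shapes, with warp/masks applied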
-
forward
(x, x_lengths=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.residual¶
Implementations of different types of residual functions.
-
class
espnet2.asr.state_spaces.residual.
Affine
(*args, scalar=True, gamma=0.0, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
Residual connection with learnable scalar multipliers on the main branch.
scalar: Single scalar multiplier, or one per dimension
scale, power: Initialize to scale * layer_num**(-power)
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
DecayResidual
(*args, power=0.5, l2=True)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
Residual connection that can decay the linear combination depending on depth.
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
Highway
(*args, scaling_correction=False, elemwise=False)[source]¶ Bases:
espnet2.asr.state_spaces.residual.Residual
-
forward
(x, y, transposed=False)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.residual.
Residual
(i_layer, d_input, d_model, alpha=1.0, beta=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Residual connection with constant affine weights.
Can simulate standard residual, no residual, and “constant gates”.
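A minimal sketch of the residual interface shared by the classes in this module, where x is the block input and y the output of the wrapped layer:

    import torch
    from espnet2.asr.state_spaces.residual import Residual

    residual = Residual(i_layer=1, d_input=256, d_model=256)
    x = torch.randn(2, 100, 256)            # block input
    y = torch.randn(2, 100, 256)            # wrapped layer output
    out = residual(x, y, transposed=False)  # alpha * x + beta * y with constant weights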
-
property
d_output
¶
-
forward
(x, y, transposed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.cauchy¶
espnet2.asr.state_spaces.s4¶
Standalone version of Structured (Sequence) State Space (S4) model.
-
class
espnet2.asr.state_spaces.s4.
OptimModule
[source]¶ Bases:
torch.nn.modules.module.Module
Interface for a Module that allows registering buffers/parameters with configurable optimizer hyperparameters.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
class
espnet2.asr.state_spaces.s4.
S4
(d_model, d_state=64, l_max=None, channels=1, bidirectional=False, activation='gelu', postact='glu', hyper_act=None, dropout=0.0, tie_dropout=False, bottleneck=None, gate=None, transposed=True, verbose=False, **kernel_args)[source]¶ Bases:
torch.nn.modules.module.Module
Initialize S4 module.
d_state: the dimension of the state, also denoted by N
l_max: the maximum kernel length, also denoted by L. Set l_max=None to always use a global kernel
channels: can be interpreted as a number of “heads”; the SSM is a map from a 1-dim to C-dim sequence. It’s not recommended to change this unless desperate for things to tune; instead, increase d_model for larger models
bidirectional: if True, the convolution kernel will be two-sided
activation: activation in between SS and FF
postact: activation after FF
hyper_act: use a “hypernetwork” multiplication (experimental)
dropout: standard dropout argument. tie_dropout=True ties the dropout mask across the sequence length, emulating nn.Dropout1d
transposed: choose backbone axis ordering of (B, L, H) (if False) or (B, H, L) (if True) [B=batch size, L=sequence length, H=hidden dimension]
gate: add gated activation (GSS)
bottleneck: reduce SSM dimension (GSS)
See the class SSKernel for the kernel constructor, which accepts kernel_args. Relevant options that are worth considering and tuning include “mode” + “measure”, “dt_min”, “dt_max”, and “lr”; other options are experimental and should not need to be configured.
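A minimal sketch, assuming the default transposed=True axis ordering and that forward returns the transformed sequence along with an optional state:

    import torch
    from espnet2.asr.state_spaces.s4 import S4

    layer = S4(d_model=128, d_state=64, transposed=True)
    u = torch.randn(2, 128, 500)    # (B, H, L)
    y, state = layer(u)             # y: (2, 128, 500)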
-
property
d_output
¶
-
class
espnet2.asr.state_spaces.s4.
SSKernel
(H, N=64, L=None, measure='legs', rank=1, channels=1, dt_min=0.001, dt_max=0.1, deterministic=False, lr=None, mode='nplr', n_ssm=None, verbose=False, measure_args={}, **kernel_args)[source]¶ Bases:
torch.nn.modules.module.Module
Wrapper around SSKernel parameterizations.
The SSKernel is expected to support the interface forward() default_state() _setup_step() step()
State Space Kernel which computes the convolution kernel $\bar{K}$.
H: Number of independent SSM copies; controls the size of the model. Also called d_model in the config.
N: State size (dimensionality of parameters A, B, C). Also called d_state in the config. Generally shouldn’t need to be adjusted and doesn’t affect speed much.
L: Maximum length of the convolution kernel, if known. Should work in the majority of cases even if not known.
measure: Options for initialization of (A, B). For NPLR mode, recommendations are “legs”, “fout”, “hippo” (combination of both). For Diag mode, recommendations are “diag-inv”, “diag-lin”, “diag-legs”, and “diag” (combination of diag-inv and diag-lin).
rank: Rank of the low-rank correction for NPLR mode. Needs to be increased for measure “legt”.
channels: C channels turns the SSM from a 1-dim to a C-dim map; one can think of it as having C separate “heads” per SSM. This was partly a feature to make it easier to implement bidirectionality; it is recommended to set channels=1 and adjust H to control parameters instead.
dt_min, dt_max: min and max values for the step size dt (Delta).
mode: Which kernel algorithm to use. ‘nplr’ is the full S4 model; ‘diag’ is the simpler S4D; ‘slow’ is a dense version for testing.
n_ssm: Number of independent trainable (A, B) SSMs, e.g. n_ssm=1 means all A/B parameters are tied across the H different instantiations of C. n_ssm=None means all H SSMs are completely independent. Generally, changing this option can save parameters but doesn’t affect performance or speed much. This parameter must divide H.
lr: Passing in a number (e.g. 0.001) sets attributes of the SSM parameters (A, B, dt). A custom optimizer hook is needed to configure the optimizer to set the learning rates appropriately for these parameters.
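A minimal sketch of constructing the kernel directly, assuming forward returns the kernel tensor together with an optional state contribution:

    from espnet2.asr.state_spaces.s4 import SSKernel

    kernel = SSKernel(H=128, N=64, L=500, measure="legs", mode="nplr")
    k, k_state = kernel(L=500)      # k: (C, H, L) = (1, 128, 500) for channels=1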
-
forward
(state=None, L=None, rate=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.-
forward_state
(u, state)[source]¶ Forward the state through a sequence.
i.e. computes the state after passing the chunk through the SSM.
state: (B, H, N); u: (B, H, L)
Returns: (B, H, N)
-
class
espnet2.asr.state_spaces.s4.
SSKernelDiag
(A, B, C, log_dt, L=None, disc='bilinear', real_type='exp', lr=None, bandlimit=None)[source]¶ Bases:
espnet2.asr.state_spaces.s4.OptimModule
Version using (complex) diagonal state matrix (S4D).
-
class
espnet2.asr.state_spaces.s4.
SSKernelNPLR
(w, P, B, C, log_dt, L=None, lr=None, verbose=False, keops=False, real_type='exp', real_tolerance=0.001, bandlimit=None)[source]¶ Bases:
espnet2.asr.state_spaces.s4.OptimModule
Stores a representation of and computes the SSKernel function.
K_L(A^dt, B^dt, C) corresponding to a discretized state space, where A is Normal + Low Rank (NPLR)
Initialize kernel.
L: Maximum length; this module computes an SSM kernel of length L
A is represented by diag(w) - PP^*:
w: (S, N) diagonal part
P: (R, S, N) low-rank part
B: (S, N)
C: (C, H, N)
dt: (H) timescale per feature
lr: [dict | float | None] hook to set the lr of special parameters (A, B, dt)
Dimensions:
N (or d_state): state size
H (or d_model): total SSM copies
S (or n_ssm): number of trainable copies of (A, B, dt); must divide H
R (or rank): rank of the low-rank part
C (or channels): the system maps a 1-dim to a C-dim sequence
The forward pass of this Module returns a tensor of shape (C, H, L)
- Note: tensor shape N here denotes half the true state size,
because of conjugate symmetry
-
espnet2.asr.state_spaces.s4.
cauchy_naive
(v, z, w)[source]¶ Naive version.
v, w: (…, N); z: (…, L); returns: (…, L)
-
espnet2.asr.state_spaces.s4.
dplr
(scaling, N, rank=1, H=1, dtype=torch.float32, real_scale=1.0, imag_scale=1.0, random_real=False, random_imag=False, normalize=False, diagonal=True, random_B=False)[source]¶
-
espnet2.asr.state_spaces.s4.
get_logger
(name='espnet2.asr.state_spaces.s4', level=20) → logging.Logger[source]¶ Initialize multi-GPU-friendly python logger.
-
espnet2.asr.state_spaces.s4.
log
= <Logger espnet2.asr.state_spaces.s4 (INFO)>[source]¶
Cauchy and Vandermonde kernels
-
espnet2.asr.state_spaces.s4.
log_vandermonde
(v, x, L)[source]¶ Compute Vandermonde product.
v: (…, N); x: (…, N); returns: (…, L), computing sum v x^l
-
espnet2.asr.state_spaces.s4.
nplr
(measure, N, rank=1, dtype=torch.float32, diagonalize_precision=True)[source]¶ Decompose as Normal Plus Low-Rank (NPLR).
Return w, p, q, V, B such that (w - p q^*, B) is unitarily equivalent to the original HiPPO A, B by the matrix V, i.e. A = V [w - p q^*] V^*, B = V B
-
espnet2.asr.state_spaces.s4.
power
(L, A, v=None)[source]¶ Compute A^L and the scan sum_i A^i v_i.
A: (…, N, N); v: (…, N, L)
-
espnet2.asr.state_spaces.s4.
rank_correction
(measure, N, rank=1, dtype=torch.float32)[source]¶ Return low-rank matrix L such that A + L is normal.
-
espnet2.asr.state_spaces.s4.
rank_zero_only
(fn: Callable) → Callable[source]¶ Decorator function from PyTorch Lightning.
Function that can be used as a decorator to enable a function/method being called only on global rank 0.
-
espnet2.asr.state_spaces.s4.
ssm
(measure, N, R, H, **ssm_args)[source]¶ Dispatcher to create single SSM initialization.
N: state size; R: rank (for DPLR parameterization); H: number of independent SSM copies
-
espnet2.asr.state_spaces.s4.
transition
(measure, N)[source]¶ A, B transition matrices for different measures.
espnet2.asr.state_spaces.ff¶
Implementation of FFN block in the style of Transformers.
-
class
espnet2.asr.state_spaces.ff.
FF
(d_input, expand=2, d_output=None, transposed=False, activation='gelu', initializer=None, dropout=0.0, tie_dropout=False)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x, *args, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
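A minimal sketch, assuming the SequenceModule convention of returning (output, state):

    import torch
    from espnet2.asr.state_spaces.ff import FF

    ff = FF(d_input=256, expand=2)  # inner width 512, output width 256
    x = torch.randn(2, 100, 256)
    y, state = ff(x)                # y: (2, 100, 256)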
-
espnet2.asr.state_spaces.pool¶
Implements downsampling and upsampling on sequences.
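For instance, a stride-2 average-pooling layer halves the length axis; a sketch assuming the SequenceModule convention of returning (output, state):

    import torch
    from espnet2.asr.state_spaces.pool import DownAvgPool

    pool = DownAvgPool(d_input=256, stride=2, transposed=False)
    x = torch.randn(2, 100, 256)    # (batch, length, dim)
    y, _ = pool(x)                  # y: (2, 50, 256)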
-
class
espnet2.asr.state_spaces.pool.
DownAvgPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownLinearPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownPool
(d_input, d_output=None, expand=None, stride=1, transposed=True, weight_norm=True, initializer=None, activation=None)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
-
class
espnet2.asr.state_spaces.pool.
DownPool2d
(d_input, d_output, stride=1, transposed=True, weight_norm=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
-
class
espnet2.asr.state_spaces.pool.
DownSample
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
DownSpectralPool
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
-
class
espnet2.asr.state_spaces.pool.
UpPool
(d_input, d_output, stride, transposed=True, weight_norm=True, initializer=None, activation=None)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
forward
(x, skip=None)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
class
espnet2.asr.state_spaces.pool.
UpSample
(d_input, stride=1, expand=1, transposed=True)[source]¶ Bases:
torch.nn.modules.module.Module
-
property
d_output
¶
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.state_spaces.block¶
Implements a full residual block around a black box layer.
Configurable options include:
- normalization position: prenorm or postnorm
- normalization type: batchnorm, layernorm, etc.
- subsampling/pooling
- residual options: feedforward, residual, affine scalars, depth-dependent scaling, etc.
-
class
espnet2.asr.state_spaces.block.
SequenceResidualBlock
(d_input, i_layer=None, prenorm=True, dropout=0.0, tie_dropout=False, transposed=False, layer=None, residual=None, norm=None, pool=None, drop_path=0.0)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Residual block wrapper for black box layer.
The SequenceResidualBlock class implements a generic (batch, length, d_input) -> (batch, length, d_input) transformation
- Parameters:
d_input – Input feature dimension
i_layer – Layer index, only needs to be passed into certain residuals like Decay
dropout – Dropout for black box module
tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
transposed – Transpose inputs so each layer receives (batch, dim, length)
layer – Config for black box module
residual – Config for residual function
norm – Config for normalization layer
pool – Config for pooling layer per stage
drop_path – Drop ratio for stochastic depth
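A hedged construction sketch; the layer, residual, and norm values below are illustrative registry configs, not the only valid choices:

    import torch
    from espnet2.asr.state_spaces.block import SequenceResidualBlock

    block = SequenceResidualBlock(
        d_input=256,
        i_layer=1,
        prenorm=True,
        dropout=0.1,
        layer={"_name_": "s4"},  # black box layer config (illustrative)
        residual="R",            # plain residual connection (illustrative)
        norm="layer",
    )
    x = torch.randn(2, 100, 256)
    y, state = block(x)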
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(x, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.base¶
-
class
espnet2.asr.state_spaces.base.
SequenceIdentity
(*args, transposed=False, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceIdentity
Simple SequenceModule for testing purposes.
-
class
espnet2.asr.state_spaces.base.
SequenceModule
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract sequence model class.
All models must adhere to this interface
A SequenceModule is generally a model that transforms an input of shape (n_batch, l_sequence, d_model) to (n_batch, l_sequence, d_output)
REQUIRED methods and attributes:
forward, d_model, d_output: control the standard forward pass, a sequence-to-sequence transformation
__init__ should also satisfy the following interface (see SequenceIdentity for an example):
def __init__(self, d_model, transposed=False, **kwargs)
OPTIONAL methods:
default_state, step: allow stepping the model recurrently with a hidden state
state_to_tensor, d_state: allow decoding from the hidden state
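A minimal sketch of a conforming subclass (illustrative; the real SequenceIdentity plays the same role):

    import torch
    from espnet2.asr.state_spaces.base import SequenceModule

    class PassThrough(SequenceModule):
        """Identity transformation satisfying the required interface."""

        def __init__(self, d_model, transposed=False, **kwargs):
            super().__init__()
            self.d_model = d_model   # required attribute
            self.d_output = d_model  # required attribute

        def forward(self, x, state=None, **kwargs):
            return x, state          # (n_batch, l_sequence, d_model) -> same shape

    module = PassThrough(d_model=8)
    y, _ = module(torch.randn(2, 10, 8))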
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
property
d_model
¶ Model dimension (generally same as input dimension).
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, encoder) to track the internal shapes of the full model.
-
property
d_output
¶ Output dimension of model.
This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(x, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state=None, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.model¶
-
class
espnet2.asr.state_spaces.model.
SequenceModel
(d_model, n_layers=1, transposed=False, dropout=0.0, tie_dropout=False, prenorm=True, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, dropinp=0.0, drop_path=0.0)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Isotropic deep sequence model backbone, in the style of ResNets / Transformers.
The SequenceModel class implements a generic (batch, length, d_input) -> (batch, length, d_output) transformation
- Parameters:
d_model – Resize input (useful for deep models with residuals)
n_layers – Number of layers
transposed – Transpose inputs so each layer receives (batch, dim, length)
dropout – Dropout parameter applied on every residual and every layer
tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
prenorm – Pre-norm vs. post-norm
n_repeat – Each layer is repeated n times per stage before applying pooling
layer – Layer config, must be specified
residual – Residual config
norm – Normalization config (e.g. layer vs batch)
pool – Config for pooling layer per stage
track_norms – Log norms of each layer output
dropinp – Input dropout
drop_path – Stochastic depth for each residual path
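A hedged construction sketch (per the parameter list above, the layer config must be specified; the value below is illustrative):

    import torch
    from espnet2.asr.state_spaces.model import SequenceModel

    backbone = SequenceModel(d_model=256, n_layers=4, layer={"_name_": "s4"})
    x = torch.randn(2, 100, 256)    # (batch, length, d_input)
    y, state = backbone(x)          # y: (2, 100, backbone.d_output)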
-
property
d_state
¶ Return dimension of output of self.state_to_tensor.
-
forward
(inputs, *args, state=None, **kwargs)[source]¶ Forward pass.
A sequence-to-sequence transformation with an optional state.
Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)
Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.
-
property
state_to_tensor
¶ Return a function mapping a state to a single tensor.
This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.
-
step
(x, state, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.attention¶
Multi-Head Attention layer definition.
-
class
espnet2.asr.state_spaces.attention.
MultiHeadedAttention
(n_feat, n_head, dropout=0.0, transposed=False, **kwargs)[source]¶ Bases:
espnet2.asr.state_spaces.base.SequenceModule
Multi-Head Attention layer inheriting SequenceModule.
Compared to the default MHA module in ESPnet, this module returns an additional dummy state and provides a step function for autoregressive inference.
- Parameters:
n_head (int) – The number of heads.
n_feat (int) – The number of features.
dropout_rate (float) – Dropout rate.
Construct a MultiHeadedAttention object.
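A minimal self-attention sketch; per the description above, the returned state is a dummy placeholder:

    import torch
    from espnet2.asr.state_spaces.attention import MultiHeadedAttention

    mha = MultiHeadedAttention(n_feat=256, n_head=4, dropout=0.1)
    x = torch.randn(2, 100, 256)    # (batch, time, size); memory defaults to the query
    y, state = mha(x)               # y: (2, 100, 256)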
-
forward
(query, memory=None, mask=None, *args, **kwargs)[source]¶ Compute scaled dot product attention.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
-
forward_attention
(value, scores, mask)[source]¶ Compute attention context vector.
- Parameters:
value (torch.Tensor) – Transformed value (#batch, n_head, time2, d_k).
scores (torch.Tensor) – Attention score (#batch, n_head, time1, time2).
mask (torch.Tensor) – Mask (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
- Transformed value (#batch, time1, d_model)
weighted by the attention score (#batch, time1, time2).
- Return type:
torch.Tensor
-
forward_qkv
(query, key, value)[source]¶ Transform query, key and value.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
- Returns:
Transformed query tensor (#batch, n_head, time1, d_k).
torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
- Return type:
torch.Tensor
-
step
(query, state, memory=None, mask=None, **kwargs)[source]¶ Step the model recurrently for one step of the input sequence.
For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
espnet2.asr.state_spaces.registry¶
espnet2.asr.state_spaces.utils¶
Utilities for dealing with collection objects (lists, dicts) and configs.
-
espnet2.asr.state_spaces.utils.
instantiate
(registry, config, *args, partial=False, wrap=None, **kwargs)[source]¶ Instantiate registered module.
- registry: Dictionary mapping names to functions or target paths
(e.g. {‘model’: ‘models.SequenceModel’})
- config: Dictionary with a ‘_name_’ key indicating which element of the registry
to grab, and kwargs to be passed into the target constructor
- wrap: wrap the target class (e.g. ema optimizer or tasks.wrap)
- *args, **kwargs: additional arguments to override the config and pass into the target constructor
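A hedged sketch with a hypothetical registry entry; the target path and config keys below are illustrative and assume string paths are resolved as described above:

    from espnet2.asr.state_spaces.utils import instantiate

    registry = {"linear": "torch.nn.Linear"}    # name -> target path (hypothetical)
    config = {"_name_": "linear", "in_features": 16, "out_features": 32}
    layer = instantiate(registry, config)       # equivalent to torch.nn.Linear(16, 32)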
-
espnet2.asr.state_spaces.utils.
omegaconf_filter_keys
(d, fn=None)[source]¶ Only keep keys where fn(key) is True. Support nested DictConfig.
espnet2.asr.state_spaces.components¶
-
class
espnet2.asr.state_spaces.components.
DropoutNd
(p: float = 0.5, tie=True, transposed=True)[source]¶ Bases:
torch.nn.modules.module.Module
Initialize dropout module.
tie: tie dropout mask across sequence lengths (Dropout1d/2d/3d)
-
espnet2.asr.state_spaces.components.
LinearActivation
(d_input, d_output, bias=True, zero_bias_init=False, transposed=False, initializer=None, activation=None, activate=False, weight_norm=False, **kwargs)[source]¶ Return a linear module, initialization, and activation.
-
class
espnet2.asr.state_spaces.components.
Normalization
(d, transposed=False, _name_='layer', **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
ReversibleInstanceNorm1dInput
(d, transposed=False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
ReversibleInstanceNorm1dOutput
(norm_input)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
SquaredReLU
[source]¶ Bases:
torch.nn.modules.module.Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
StochasticDepth
(p: float, mode: str)[source]¶ Bases:
torch.nn.modules.module.Module
Stochastic depth module.
See stochastic_depth().
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TSInverseNormalization
(method, normalizer)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TSNormalization
(method, horizon)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TransposedLN
(d, scalar=True)[source]¶ Bases:
torch.nn.modules.module.Module
Transposed LayerNorm module.
LayerNorm module over the second dimension. Assumes shape (B, D, L), where L can be one or more axes.
This is slow; a dedicated CUDA/Triton implementation should provide a substantial end-to-end speedup.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.asr.state_spaces.components.
TransposedLinear
(d_input, d_output, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
Transposed linear module.
Linear module on the second-to-last dimension. Assumes shape (B, D, L), where L can be one or more axes.
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.state_spaces.components.
stochastic_depth
(input: torch._VariableFunctionsClass.tensor, p: float, mode: str, training: bool = True)[source]¶ Apply stochastic depth.
Implements the Stochastic Depth from “Deep Networks with Stochastic Depth” used for randomly dropping residual branches of residual architectures.
- Parameters:
input (Tensor[N, ...]) – The input tensor of arbitrary dimensions with the first one being its batch, i.e. a batch with N rows.
p (float) – Probability of the input to be zeroed.
mode (str) – “batch” or “row”. “batch” randomly zeroes the entire input, “row” zeroes randomly selected rows from the batch.
training – Apply stochastic depth if True. Default: True.
- Returns:
The randomly zeroed tensor.
- Return type:
Tensor[N, ..]
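A minimal sketch in “row” mode, where entire batch rows (e.g. residual branches per example) are dropped:

    import torch
    from espnet2.asr.state_spaces.components import stochastic_depth

    x = torch.randn(8, 100, 256)
    y = stochastic_depth(x, p=0.2, mode="row", training=True)
    # Each of the 8 rows is zeroed independently with probability 0.2;
    # surviving rows are rescaled so the expected value is preserved.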
espnet2.asr.state_spaces.__init__¶
Initialize sub package.
espnet2.asr.decoder.transformer_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.transformer_decoder.
BaseTransformerDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Base class of Transformer decoder module.
- Parameters:
vocab_size – output dim
encoder_output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters:
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2 and dtype=torch.bool in PyTorch 1.2 and later
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns:
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type:
y, cache
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
TransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
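A minimal teacher-forcing sketch for the plain TransformerDecoder; shapes follow the forward documentation above and the sizes are illustrative:

    import torch
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    decoder = TransformerDecoder(vocab_size=5000, encoder_output_size=256)
    hs_pad = torch.randn(2, 120, 256)                   # encoded memory
    hlens = torch.full((2,), 120, dtype=torch.long)
    ys_in_pad = torch.randint(0, 5000, (2, 30))         # input token ids
    ys_in_lens = torch.full((2,), 30, dtype=torch.long)
    scores, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)  # scores: (2, 30, 5000)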
espnet2.asr.decoder.s4_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.s4_decoder.
S4Decoder
(vocab_size: int, encoder_output_size: int, input_layer: str = 'embed', dropinp: float = 0.0, dropout: float = 0.25, prenorm: bool = True, n_layers: int = 16, transposed: bool = False, tie_dropout: bool = False, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, drop_path: float = 0.0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
S4 decoder module.
- Parameters:
vocab_size – output dim
encoder_output_size – dimension of hidden vector
input_layer – input layer type
dropinp – input dropout
dropout – dropout parameter applied on every residual and every layer
prenorm – pre-norm vs. post-norm
n_layers – number of layers
transposed – transpose inputs so each layer receives (batch, dim, length)
tie_dropout – tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d
n_repeat – each layer is repeated n times per stage before applying pooling
layer – layer config, must be specified
residual – residual config
norm – normalization config (e.g. layer vs batch)
pool – config for pooling layer per stage
track_norms – log norms of each layer output
drop_path – drop rate for stochastic depth
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor, state=None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
score
(ys, state, x)[source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns:
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
espnet2.asr.decoder.transducer_decoder¶
(RNN-)Transducer decoder definition.
-
class
espnet2.asr.decoder.transducer_decoder.
TransducerDecoder
(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
(RNN-)Transducer decoder module.
- Parameters:
vocab_size – Output dimension.
layers_type – (RNN-)Decoder layers type.
num_layers – Number of decoder layers.
hidden_size – Number of decoder units per layer.
dropout – Dropout rate for decoder layers.
dropout_embed – Dropout rate for embedding layer.
embed_pad – Embed/Blank symbol ID.
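A minimal sketch of encoding label sequences with the prediction network (sizes are illustrative; the output shape follows the forward documentation below):

    import torch
    from espnet2.asr.decoder.transducer_decoder import TransducerDecoder

    decoder = TransducerDecoder(vocab_size=5000, hidden_size=320)
    labels = torch.randint(1, 5000, (2, 25))    # (B, L) label ids; 0 is the blank/pad symbol
    dec_out = decoder(labels)                   # decoder output sequences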
-
batch_score
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
cache – Pairs of (dec_out, dec_states) for each label sequence (keys).
use_lm – Whether to compute label ID sequences for LM.
- Returns:
Decoder output sequences. (B, D_dec)
dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
lm_labels: Label ID sequences for LM. (B,)
- Return type:
dec_out
-
create_batch_states
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]
- Returns:
Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type:
states
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences. (B, L)
- Returns:
Decoder output sequences. (B, T, U, D_dec)
- Return type:
dec_out
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
rnn_forward
(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters:
sequence – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Returns:
RNN output sequences. (B, D_dec)
(h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type:
sequence
-
score
(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters:
hyp – Hypothesis.
cache – Pairs of (dec_out, state) for each label sequence. (key)
- Returns:
Decoder output sequence. (1, D_dec)
new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec))
label: Label ID for LM. (1,)
- Return type:
dec_out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
idx – State ID to extract.
- Returns:
- Decoder hidden state for given ID.
((N, 1, D_dec), (N, 1, D_dec))
espnet2.asr.decoder.mlm_decoder¶
Masked LM Decoder definition.
-
class
espnet2.asr.decoder.mlm_decoder.
MLMDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing: x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
espnet2.asr.decoder.whisper_decoder¶
-
class
espnet2.asr.decoder.whisper_decoder.
OpenAIWhisperDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: str = None)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Transformer-based Speech-to-Text Decoder from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters:
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2 and dtype=torch.bool in PyTorch 1.2 and later
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns:
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type:
y, cache
- NOTE (Shih-Lun):
cache implementation is ignored for now for simplicity & correctness
-
espnet2.asr.decoder.rnn_decoder¶
-
class
espnet2.asr.decoder.rnn_decoder.
RNNDecoder
(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns:
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
espnet2.asr.decoder.rnn_decoder.
build_attention_list
(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]¶
espnet2.asr.decoder.abs_decoder¶
-
class
espnet2.asr.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.decoder.hugging_face_transformers_decoder¶
Hugging Face Transformers Decoder.
-
class
espnet2.asr.decoder.hugging_face_transformers_decoder.
HuggingFaceTransformersDecoder
(vocab_size: int, encoder_output_size: int, model_name_or_path: str)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
Hugging Face Transformers Decoder.
- Parameters:
encoder_output_size – dimension of encoder attention
model_name_or_path – Hugging Face Transformers model name
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters:
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out)
ys_in_lens – (batch)
- Returns:
tuple containing:
- x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type:
(tuple)
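A construction sketch (requires the transformers package; the checkpoint name is illustrative and must point to a model whose decoder weights can be reused)::

    import torch
    from espnet2.asr.decoder.hugging_face_transformers_decoder import (
        HuggingFaceTransformersDecoder,
    )

    decoder = HuggingFaceTransformersDecoder(
        vocab_size=50257, encoder_output_size=256, model_name_or_path="gpt2"
    )
    hs_pad = torch.randn(2, 40, 256)              # encoder memory
    hlens = torch.tensor([40, 35])
    ys_in_pad = torch.randint(0, 50257, (2, 12))  # target token ids
    ys_in_lens = torch.tensor([12, 9])
    x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)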
espnet2.asr.decoder.__init__¶
espnet2.asr.transducer.error_calculator¶
Error Calculator module for Transducer.
-
class
espnet2.asr.transducer.error_calculator.
ErrorCalculatorTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
Calculate CER and WER for transducer models.
- Parameters:
decoder – Decoder module.
token_list – List of tokens.
sym_space – Space symbol.
sym_blank – Blank symbol.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculatorTransducer.
-
calculate_cer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level CER score.
-
calculate_wer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level WER score
-
convert_to_char
(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character sequences.
- Parameters:
pred – Prediction label ID sequences. (B, U)
target – Target label ID sequences. (B, L)
- Returns:
Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)
- Return type:
char_pred
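A usage sketch (decoder, joint_network, token_list, enc_out, and target stand for already-built modules and tensors; the instance is assumed to be callable on encoder outputs and padded targets, mirroring its use inside ESPnetASRModel)::

    from espnet2.asr.transducer.error_calculator import ErrorCalculatorTransducer

    error_calc = ErrorCalculatorTransducer(
        decoder=decoder,
        joint_network=joint_network,
        token_list=token_list,
        sym_space="<space>",
        sym_blank="<blank>",
        report_cer=True,
        report_wer=True,
    )
    cer, wer = error_calc(enc_out, target)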
espnet2.asr.transducer.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet2.asr.transducer.beam_search_transducer.
BeamSearchTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, multi_blank_durations: List[int] = [], multi_blank_indices: List[int] = [], score_norm: bool = True, nbest: int = 1, token_list: Optional[List[str]] = None)[source]¶ Bases:
object
Beam search implementation for Transducer.
Initialize Transducer search module.
- Parameters:
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Beam size.
lm – LM class.
lm_weight – LM weight for soft fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum output sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
multi_blank_durations – The duration of each blank token. (MBG)
multi_blank_indices – The index of each blank token in token_list. (MBG)
score_norm – Normalize final scores by length. (“default”)
nbest – Number of final hypothesis.
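A construction sketch (decoder and joint_network stand for already-built transducer modules; search_type selects among the strategies documented below)::

    from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

    beam_search = BeamSearchTransducer(
        decoder=decoder,              # AbsDecoder, assumed pre-built
        joint_network=joint_network,  # JointNetwork, assumed pre-built
        beam_size=5,
        search_type="default",
        nbest=3,
    )
    nbest_hyps = beam_search(enc_out)  # enc_out: (T, D) encoder output
    best_ids = nbest_hyps[0].yseq      # token ids of the best hypothesis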
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Beam search implementation.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy search implementation.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
1-best hypotheses.
- Return type:
hyp
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Modified Adaptive Expansion Search (mAES) implementation.
Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
multi_blank_greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy Search for Multi-Blank Transducer (Multi-Blank Greedy, MBG).
In this implementation, we assume: 1. the index of the standard blank is the last entry of self.multi_blank_indices rather than self.blank_id (to avoid changing the original transducer too much); 2. the other entries in self.multi_blank_indices are big blanks that account for multiple frames.
Based on https://arxiv.org/abs/2211.03541
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
1-best hypothesis.
- Return type:
hyp
-
nsc_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ N-step constrained beam search implementation.
Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
prefix_search
(hyps: List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis], enc_out_t: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Prefix search for NSC and mAES strategies.
Based on https://arxiv.org/pdf/1211.3711.pdf
-
sort_nbest
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]¶ Sort hypotheses by score or score given sequence length.
- Parameters:
hyps – Hypothesis.
- Returns:
Sorted hypothesis.
- Return type:
hyps
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
class
espnet2.asr.transducer.beam_search_transducer.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]¶ Bases:
espnet2.asr.transducer.beam_search_transducer.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
-
dec_out
= None¶
-
lm_scores
= None¶
-
-
class
espnet2.asr.transducer.beam_search_transducer.
Hypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
-
lm_state
= None¶
-
espnet2.asr.transducer.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.rnnt¶
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
multiblank_rnnt_loss_gpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, big_blank_durations: list, fastemit_lambda: float, clamp: float, num_threads: int, sigma: float)[source]¶ - Wrapper method for accessing GPU Multi-blank RNNT loss
- CUDA implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V + num_big_blanks + 1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V + num_big_blanks + 1] where the gradient will be set.
blank_label – Index of the standard blank token in the vocabulary.
big_blank_durations – A list of supported durations for big blank symbols in the model, e.g. [2, 4, 8]. Note that we only include durations for “big blanks” here; it should not include 1 for the standard blank. Those big blanks have vocabulary indices after the standard blank index.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
sigma – logit-undernormalization weight used in the multi-blank model. Refer to the multi-blank paper https://arxiv.org/pdf/2211.03541 for detailed explanations.
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
rnnt_loss_cpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]¶ Wrapper method for accessing CPU RNNT loss.
- CPU implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.
blank_label – Index of the blank token in the vocabulary.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
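A sketch of the calling convention (dtypes and sizes are illustrative; costs and grads must be pre-allocated zero tensors that the wrapper fills in-place)::

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt import rnnt_loss_cpu

    B, T, U, V = 2, 8, 4, 10          # U counts targets + 1 (illustrative)
    acts = torch.randn(B, T, U, V + 1)
    labels = torch.randint(1, V, (B, U - 1), dtype=torch.int32)
    input_lengths = torch.tensor([T, T - 2], dtype=torch.int32)
    label_lengths = torch.tensor([U - 1, U - 2], dtype=torch.int32)
    costs = torch.zeros(B)            # filled in-place with per-utterance costs
    grads = torch.zeros_like(acts)    # filled in-place with gradients
    rnnt_loss_cpu(acts, labels, input_lengths, label_lengths, costs, grads,
                  blank_label=0, fastemit_lambda=0.0, clamp=0.0, num_threads=1)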
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt.
rnnt_loss_gpu
(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]¶ Wrapper method for accessing GPU RNNT loss.
- CUDA implementation ported from [HawkAaron/warp-transducer]
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1].
labels – Ground truth labels of shape [B, U].
input_lengths – Lengths of the acoustic sequence as a vector of ints [B].
label_lengths – Lengths of the target sequence as a vector of ints [B].
costs – Zero vector of length [B] in which costs will be set.
grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.
blank_label – Index of the blank token in the vocabulary.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of threads for OpenMP.
espnet2.asr.transducer.rnnt_multi_blank.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank¶
-
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
rnnt_loss
(acts, labels, act_lens, label_lens, blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = 0.0)[source]¶ RNN Transducer Loss (functional form).
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing size of each output sequence from the network
label_lens – Tensor of (batch) containing label length of each example
blank (int, optional) – blank label. Default: 0.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
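A minimal end-to-end sketch of the functional form (shapes are illustrative; U below is the target length, so the activation has U + 1 label positions)::

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import rnnt_loss

    B, T, U, V = 2, 8, 3, 10
    acts = torch.randn(B, T, U + 1, V + 1, requires_grad=True)
    labels = torch.randint(1, V, (B, U), dtype=torch.int32)  # zero-padded targets
    act_lens = torch.tensor([T, T - 2], dtype=torch.int32)
    label_lens = torch.tensor([U, U - 1], dtype=torch.int32)
    loss = rnnt_loss(acts, labels, act_lens, label_lens, blank=0, reduction="mean")
    loss.backward()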
-
class
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
RNNTLossNumba
(blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1)[source]¶ Bases:
torch.nn.modules.module.Module
- Parameters:
blank (int, optional) – blank label. Default: 0.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
-
forward
(acts, labels, act_lens, label_lens)[source]¶
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network (the log_probs)
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing the size of each output sequence from the network
label_lens – Tensor of (batch) containing the label length of each example
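The module form wraps the same computation; a sketch reusing the tensors from the functional example above::

    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import RNNTLossNumba

    loss_fn = RNNTLossNumba(blank=0, reduction="mean")
    loss = loss_fn(acts, labels, act_lens, label_lens)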
-
class
espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.
MultiblankRNNTLossNumba
(blank, big_blank_durations, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1, sigma: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
- Parameters:
blank (int) – standard blank label.
big_blank_durations – list of durations for multi-blank transducer, e.g. [2, 4, 8].
sigma – hyper-parameter for logit under-normalization method for training multi-blank transducers. Recommended value 0.05.
Refer to https://arxiv.org/pdf/2211.03541 for detailed explanations of the above parameters.
reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
-
forward
(acts, labels, act_lens, label_lens)[source]¶
- Parameters:
acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from the network (the log_probs)
labels – 2-dimensional Tensor containing all the targets of the batch, zero-padded
act_lens – Tensor of size (batch) containing the size of each output sequence from the network
label_lens – Tensor of (batch) containing the label length of each example
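A construction sketch (the layout assumption, per the GPU loss wrapper above, is that big blanks occupy vocabulary indices after the standard blank, so the model output dimension is V + 1 + len(big_blank_durations))::

    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import (
        MultiblankRNNTLossNumba,
    )

    loss_fn = MultiblankRNNTLossNumba(
        blank=0, big_blank_durations=[2, 4, 8], sigma=0.05
    )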
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.
RNNTStatus
[source]¶ Bases:
enum.Enum
An enumeration.
-
RNNT_STATUS_INVALID_VALUE
= 1¶
-
RNNT_STATUS_SUCCESS
= 0¶
-
-
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.
THRESHOLD
= 0.1¶
espnet2.asr.transducer.rnnt_multi_blank.utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel¶
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_alphas_kernel
[source]¶ Compute alpha (forward variable) probabilities over the transduction step.
- Parameters:
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.
llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- Updates:
Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_betas_kernel
[source]¶ Compute beta (backward variable) probabilities over the transduction step.
- Parameters:
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.
llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- Updates:
Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_grad_kernel
[source]¶ Compute gradients over the transduction step.
- Parameters:
grads – Zero Tensor of shape [B, T, U, V+1]. Is updated by this kernel to contain the gradients of this batch of samples.
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].
betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].
logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
- Updates:
Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_alphas_kernel
[source]¶ - Compute alpha (forward variable) probabilities for multi-blank transducer loss
- Parameters:
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.
llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT standard blank token in the vocabulary.
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_betas_kernel
[source]¶ - Compute beta (backward variable) probabilities for multi-blank transducer loss
- Parameters:
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.
llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT standard blank token in the vocabulary.
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
compute_multiblank_grad_kernel
[source]¶ - Compute gradients for multi-blank transducer loss
- Parameters:
grads – Zero Tensor of shape [B, T, U, V + 1 + num_big_blanks]. Is updated by this kernel to contain the gradients of this batch of samples.
acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.
alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].
betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].
logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.
xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.
ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.
mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
big_blank_durations – Vector of supported big blank durations of the model.
num_big_blanks – Number of big blanks of the model.
- Updates:
Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.
logp
[source]¶ Compute the sum of log probability from the activation tensor and its denominator.
- Parameters:
denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.
acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
mb – Batch indexer.
t – Acoustic sequence timestep indexer.
u – Target sequence timestep indexer.
v – Vocabulary token indexer.
- Returns:
The sum of logprobs[mb, t, u, v] + denom[mb, t, u]
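The flattened pointer arithmetic these kernels rely on can be sketched in plain Python (assuming the row-major layout implied by the shape descriptions above)::

    def logp_offset(mb, t, u, v, maxT, maxU, alphabet_size):
        # resolved offset of logprobs[mb, t, u, v] in the flattened acts tensor
        return ((mb * maxT + t) * maxU + u) * alphabet_size + v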
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce¶
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
CTAReduce
[source]¶ CUDA Warp reduction kernel.
It is a device kernel to be called by other kernels.
The data will be read from the right segment recursively and reduced (via the R_Op) onto the left half. Operation continues while warp size is larger than a given offset. Beyond this offset, warp reduction is performed via shfl_down_sync, which halves the reduction space and sums the two halves at each call.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
tid – CUDA thread index
x – activation. Single float.
storage – shared memory of size CTA_REDUCE_SIZE used for reduction in parallel threads.
count – equivalent to num_rows, which is equivalent to alphabet_size (V+1)
R_opid – Operator ID for reduction. See R_Op for more information.
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
I_Op
[source]¶ Bases:
enum.Enum
Represents an operation that is performed on the input tensor
-
EXPONENTIAL
= 0¶
-
IDENTITY
= 1¶
-
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
R_Op
[source]¶ Bases:
enum.Enum
Represents a reduction operation performed on the input tensor
-
ADD
= 0¶
-
MAXIMUM
= 1¶
-
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
ReduceHelper
(I_opid: int, R_opid: int, acts: torch.Tensor, output: torch.Tensor, num_rows: int, num_cols: int, minus: bool, stream)[source]¶ CUDA Warp reduction kernel helper which reduces via R_Op.Add and writes the result to output according to the I_opid.
The result is stored in the blockIdx.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
I_opid – Operator ID for input. See I_Op for more information.
R_opid – Operator ID for reduction. See R_Op for more information.
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
num_rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
num_cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
reduce_exp
(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]¶ Helper method to call the Warp Reduction Kernel to perform exp reduction.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
-
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.
reduce_max
(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]¶ Helper method to call the Warp Reduction Kernel to perform max reduction.
Note
Efficient warp occurs at input shapes of 2 ^ K.
References
Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
acts – Flattened activation matrix of shape [B * T * U * (V+1)].
output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.
rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.
cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.
minus – Bool flag determining whether to add or subtract in the reduction. If minus is set, calls _reduce_minus; else calls the _reduce_rows kernel.
stream – CUDA Stream.
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.
GPURNNT
(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]¶ Bases:
object
Helper class to launch the CUDA Kernels to compute the Transducer Loss.
- Parameters:
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
stream – Numba Cuda Stream.
-
compute_cost_and_score
(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶ Compute both the loss and the gradients.
- Parameters:
acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
grads – A flattened zero tensor of the same shape as acts.
costs – A zero vector of length B which will be updated inplace with the log probability costs.
labels – A flattened matrix of labels of shape [B, U]
label_lengths – A vector of length B that contains the original lengths of the target sequence.
input_lengths – A vector of length B that contains the original lengths of the acoustic sequence.
- Updates:
This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.
- Returns:
An enum that either represents a successful RNNT operation or failure.
-
cost_and_grad
(acts: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]¶
-
log_softmax
(acts: torch.Tensor, denom: torch.Tensor)[source]¶ Computes the log softmax denominator of the input activation tensor and stores the result in denom.
- Parameters:
acts – Activation tensor of shape [B, T, U, V+1]. The input must be represented as a flat tensor of shape [B * T * U * (V+1)] to allow pointer indexing.
denom – A zero tensor of same shape as acts.
- Updates:
This kernel inplace updates the denom tensor
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.
MultiblankGPURNNT
(sigma: float, num_big_blanks: int, minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, big_blank_workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]¶ Bases:
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT
- Helper class to launch the CUDA Kernels to compute Multi-blank Transducer Loss
- Parameters:
sigma – Hyper-parameter related to the logit-normalization method in training multi-blank transducers.
num_big_blanks – Number of big blank symbols the model has. This should not include the standard blank symbol.
minibatch – Int representing the batch size.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V + 1 + num_big_blanks
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
big_blank_workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory specifically for the multi-blank related computations.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
stream – Numba Cuda Stream.
-
compute_cost_and_score
(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶ Compute both the loss and the gradients.
- Parameters:
acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
grads – A flattened zero tensor of the same shape as acts.
costs – A zero vector of length B which will be updated inplace with the log probability costs.
labels – A flattened matrix of labels of shape [B, U]
label_lengths – A vector of length B that contains the original lengths of the target sequence.
input_lengths – A vector of length B that contains the original lengths of the acoustic sequence.
- Updates:
This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.
- Returns:
An enum that either represents a successful RNNT operation or failure.
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.__init__¶
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CPURNNT
(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace: torch.Tensor, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, batch_first: bool)[source]¶ Bases:
object
Helper class to compute the Transducer Loss on CPU.
- Parameters:
minibatch – Size of the minibatch b.
maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
num_threads – Number of OMP threads to launch.
batch_first – Bool that decides if batch dimension is first or third.
-
compute_alphas
(log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor)[source]¶ Compute the probability of the forward variable alpha.
- Parameters:
log_probs – Flattened tensor [B, T, U, V+1]
T – Length of the acoustic sequence T (not padded).
U – Length of the target sequence U (not padded).
alphas – Working space memory for alpha of shape [B, T, U].
- Returns:
Loglikelihood of the forward variable alpha.
-
compute_betas_and_grads
(grad: torch.Tensor, log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor, betas: torch.Tensor, labels: torch.Tensor, logll: torch.Tensor)[source]¶ Compute backward variable beta as well as gradients of the activation matrix wrt loglikelihood of forward variable.
- Parameters:
grad – Working space memory of flattened shape [B, T, U, V+1]
log_probs – Activation tensor of flattened shape [B, T, U, V+1]
T – Length of the acoustic sequence T (not padded).
U – Length of the target sequence U (not padded).
alphas – Working space memory for alpha of shape [B, T, U].
betas – Working space memory for beta of shape [B, T, U].
labels – Ground truth label of shape [B, U]
logll – Loglikelihood of the forward variable.
- Returns:
Loglikelihood of the forward variable and inplace updates the grad tensor.
-
cost_and_grad
(log_probs: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, flat_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]¶
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CpuRNNT_index
(U: int, maxU: int, minibatch: int, alphabet_size: int, batch_first: bool)[source]¶ Bases:
object
A placeholder Index computation class that emits the resolved index in a flattened tensor, mimicking pointer indexing in CUDA kernels on the CPU.
- Parameters:
U – Length of the current target sample (without padding).
maxU – Max Length of the padded target samples.
minibatch – Minibatch index
alphabet_size – Size of the vocabulary including RNNT blank - V+1.
batch_first – Bool flag determining if batch index is first or third.
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
CpuRNNT_metadata
(T: int, U: int, workspace: torch.Tensor, bytes_used: int, blank: int, labels: torch.Tensor, log_probs: torch.Tensor, idx: espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_index)[source]¶ Bases:
object
Metadata for CPU based RNNT loss calculation. Holds the working space memory.
- Parameters:
T – Length of the acoustic sequence (without padding).
U – Length of the target sequence (without padding).
workspace – Working space memory for the CPU.
bytes_used – Number of bytes currently used for indexing the working space memory. Generally 0.
blank – Index of the blank token in the vocabulary.
labels – Ground truth padded labels matrix of shape [B, U]
log_probs – Log probs / activation matrix of flattened shape [B, T, U, V+1]
idx –
-
class
espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.
LogSoftmaxGradModification
(*args, **kwargs)[source]¶ Bases:
torch.autograd.function.Function
-
static
backward
(ctx, grad_output)[source]¶ Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context
ctx
as the first argument, followed by as many outputs as the
forward()
returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to
forward()
. Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input. The context can be used to retrieve tensors saved during the forward pass. It also has an attribute
ctx.needs_input_grad
, a tuple of booleans representing whether each input needs gradient. E.g.,
backward()
will have
ctx.needs_input_grad[0] = True
if the first input to
forward()
needs gradient computed w.r.t. the output.
-
static
forward
(ctx, acts, clamp)[source]¶ Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store arbitrary data that can then be retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with
ctx.save_for_backward()
if they are intended to be used in
backward
(equivalently,
vjp
) or with
ctx.save_for_forward()
if they are intended to be used in
jvp
.
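As an illustration of this forward/backward contract, a minimal gradient-clamping Function in the same spirit (a sketch, not this module's exact implementation)::

    import torch

    class ClampGrad(torch.autograd.Function):
        @staticmethod
        def forward(ctx, acts, clamp):
            ctx.clamp = clamp  # non-tensor value, stored directly on ctx
            return acts

        @staticmethod
        def backward(ctx, grad_output):
            # clamp the incoming gradient; return None for the non-tensor input
            return grad_output.clamp(-ctx.clamp, ctx.clamp), None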
espnet2.asr.frontend.s3prl¶
-
class
espnet2.asr.frontend.s3prl.
S3prlFrontend
(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str = None, multilayer_feature: bool = False, layer: int = -1)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Pretrained Representation frontend structure for ASR.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.default¶
-
class
espnet2.asr.frontend.default.
DefaultFrontend
(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Conventional frontend structure for ASR.
Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.fused¶
-
class
espnet2.asr.frontend.fused.
FusedFrontends
(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.abs_frontend¶
-
class
espnet2.asr.frontend.abs_frontend.
AbsFrontend
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.frontend.__init__¶
espnet2.asr.frontend.windowing¶
Sliding Window for raw audio input data.
-
class
espnet2.asr.frontend.windowing.
SlidingWindow
(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int = None, fs=None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Sliding Window.
Provides a sliding window over a batched continuous raw audio tensor. Optionally provides padding (currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.
Known issues: the output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded since padding is not implemented yet. There is currently no additional window function applied to input values.
Initialize.
- Parameters:
win_length – Length of frame.
hop_length – Relative starting point of next frame.
channels – Number of input channels.
padding – Padding (placeholder, currently not implemented).
fs – Sampling rate (placeholder for compatibility, not used).
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply a sliding window on the input.
- Parameters:
input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.
input_lengths – Input lengths within batch.
- Returns:
Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.
- Return type:
Tensor
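A usage sketch (sizes are illustrative)::

    import torch
    from espnet2.asr.frontend.windowing import SlidingWindow

    frontend = SlidingWindow(win_length=400, hop_length=160)
    speech = torch.randn(2, 16000)             # raw audio (B, T), with C=D=1
    lengths = torch.tensor([16000, 12000])
    frames, frame_lengths = frontend(speech, lengths)  # (B, T', C, win_length)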
espnet2.asr.frontend.whisper¶
-
class
espnet2.asr.frontend.whisper.
WhisperFrontend
(whisper_model: str = 'small', freeze_weights: bool = True, download_dir: str = None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Representation Using Encoder Outputs from OpenAI’s Whisper Model:
URL: https://github.com/openai/whisper
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
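A usage sketch (requires the openai-whisper package; weights are downloaded on first use, and sizes are illustrative)::

    import torch
    from espnet2.asr.frontend.whisper import WhisperFrontend

    frontend = WhisperFrontend(whisper_model="small", freeze_weights=True)
    speech = torch.randn(1, 16000)      # 1 s of 16 kHz audio
    lengths = torch.tensor([16000])
    feats, feat_lengths = frontend(speech, lengths)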