espnet2.asr_transducer package

espnet2.asr_transducer.activation

Activation functions for Transducer models.

class espnet2.asr_transducer.activation.FTSwish(threshold: float = -0.2, mean_shift: float = 0)[source]

Bases: torch.nn.modules.module.Module

Flatten-T Swish activation definition.

FTSwish(x) = x * sigmoid(x) + threshold, if x >= 0

FTSwish(x) = threshold, otherwise

Reference: https://arxiv.org/abs/1812.06247

Parameters:
  • threshold – Threshold value for FTSwish activation formulation. (threshold < 0)

  • mean_shift – Mean shifting value for FTSwish activation formulation. (applied only if != 0, disabled by default)

forward(x: torch.Tensor) → torch.Tensor[source]

Forward computation.
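
Example (a minimal usage sketch of the interface documented above; the input values are arbitrary):

   import torch

   from espnet2.asr_transducer.activation import FTSwish

   # Threshold defaults to -0.2; values in the negative region are clamped to it.
   act = FTSwish(threshold=-0.2)

   x = torch.linspace(-3.0, 3.0, steps=7)
   y = act(x)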

class espnet2.asr_transducer.activation.Mish(softplus_beta: float = 1.0, softplus_threshold: int = 20, use_builtin: bool = False)[source]

Bases: torch.nn.modules.module.Module

Mish activation definition.

Mish(x) = x * tanh(softplus(x))

Reference: https://arxiv.org/abs/1908.08681.

Parameters:
  • softplus_beta – Beta value for softplus activation formulation. (Usually 0 < softplus_beta <= 2)

  • softplus_threshold – Values above this revert to a linear function. (Usually 10 <= softplus_threshold <= 20)

  • use_builtin – Whether to use PyTorch activation function if available.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward computation.

class espnet2.asr_transducer.activation.Smish(alpha: float = 1.0, beta: float = 1.0)[source]

Bases: torch.nn.modules.module.Module

Smish activation definition.

Smish(x) = (alpha * x) * tanh(log(1 + sigmoid(beta * x)))

where alpha > 0 and beta > 0

Reference: https://www.mdpi.com/2079-9292/11/4/540/htm.

Parameters:
  • alpha – Alpha value for Smish activation formulation. (Usually, alpha = 1. If alpha <= 0, the value is set to 1.)

  • beta – Beta value for Smish activation formulation. (Usually, beta = 1. If beta <= 0, the value is set to 1.)

forward(x: torch.Tensor) → torch.Tensor[source]

Forward computation.

class espnet2.asr_transducer.activation.Swish(beta: float = 1.0, use_builtin: bool = False)[source]

Bases: torch.nn.modules.module.Module

Swish activation definition.

Swish(x) = (beta * x) * sigmoid(x)

where beta = 1 defines standard Swish activation.

References

https://arxiv.org/abs/2108.12943 / https://arxiv.org/abs/1710.05941v1. E-swish variant: https://arxiv.org/abs/1801.07145.

Parameters:
  • beta – Beta parameter for E-Swish. (beta >= 1. If beta < 1, use standard Swish).

  • use_builtin – Whether to use PyTorch function if available.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward computation.

espnet2.asr_transducer.activation.get_activation(activation_type: str, ftswish_threshold: float = -0.2, ftswish_mean_shift: float = 0.0, hardtanh_min_val: int = -1.0, hardtanh_max_val: int = 1.0, leakyrelu_neg_slope: float = 0.01, smish_alpha: float = 1.0, smish_beta: float = 1.0, softplus_beta: float = 1.0, softplus_threshold: int = 20, swish_beta: float = 1.0) → torch.nn.modules.module.Module[source]

Return activation function.

Parameters:
  • activation_type – Activation function type.

  • ftswish_threshold – Threshold value for FTSwish activation formulation.

  • ftswish_mean_shift – Mean shifting value for FTSwish activation formulation.

  • hardtanh_min_val – Minimum value of the linear region range for HardTanh.

  • hardtanh_max_val – Maximum value of the linear region range for HardTanh.

  • leakyrelu_neg_slope – Negative slope value for LeakyReLU activation formulation.

  • smish_alpha – Alpha value for Smish activation formulation.

  • smish_beta – Beta value for Smish activation formulation.

  • softplus_beta – Beta value for softplus activation formulation in Mish.

  • softplus_threshold – Values above this revert to a linear function in Mish.

  • swish_beta – Beta value for Swish variant formulation.

Returns:

Activation function.
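
Example (building an activation through the factory above; "swish" is assumed to be one of the accepted activation_type strings):

   import torch

   from espnet2.asr_transducer.activation import get_activation

   # Only the parameters relevant to the chosen activation are used.
   activation = get_activation("swish", swish_beta=1.0)
   out = activation(torch.randn(2, 4))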

espnet2.asr_transducer.error_calculator

Error Calculator module for Transducer.

class espnet2.asr_transducer.error_calculator.ErrorCalculator(decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, token_list: List[int], sym_space: str, sym_blank: str, nstep: int = 2, report_cer: bool = False, report_wer: bool = False)[source]

Bases: object

Calculate CER and WER for transducer models.

Parameters:
  • decoder – Decoder module.

  • joint_network – Joint Network module.

  • token_list – List of token units.

  • sym_space – Space symbol.

  • sym_blank – Blank symbol.

  • nstep – Maximum number of symbol expansions at each time step w/ mAES.

  • report_cer – Whether to compute CER.

  • report_wer – Whether to compute WER.

Construct an ErrorCalculator object.

calculate_cer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level CER score.

Parameters:
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns:

Average sentence-level CER score.

calculate_wer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level WER score.

Parameters:
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns:

Average sentence-level WER score.

convert_to_char(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]

Convert label ID sequences to character sequences.

Parameters:
  • pred – Prediction label ID sequences. (B, U)

  • target – Target label ID sequences. (B, L)

Returns:

Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)

Return type:

char_pred

espnet2.asr_transducer.espnet_transducer_model

ESPnet2 ASR Transducer model.

class espnet2.asr_transducer.espnet_transducer_model.ESPnetASRTransducerModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: espnet2.asr_transducer.encoder.encoder.Encoder, decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, transducer_weight: float = 1.0, use_k2_pruned_loss: bool = False, k2_pruned_loss_args: Dict = {}, warmup_steps: int = 25000, validation_nstep: int = 2, fastemit_lambda: float = 0.0, auxiliary_ctc_weight: float = 0.0, auxiliary_ctc_dropout_rate: float = 0.0, auxiliary_lm_loss_weight: float = 0.0, auxiliary_lm_loss_smoothing: float = 0.05, ignore_id: int = -1, sym_space: str = '<space>', sym_blank: str = '<blank>', report_cer: bool = False, report_wer: bool = False, extract_feats_in_collect_stats: bool = True)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

ESPnetASRTransducerModel module definition.

Parameters:
  • vocab_size – Size of complete vocabulary (w/ SOS/EOS and blank included).

  • token_list – List of tokens in vocabulary (minus reserved tokens).

  • frontend – Frontend module.

  • specaug – SpecAugment module.

  • normalize – Normalization module.

  • encoder – Encoder module.

  • decoder – Decoder module.

  • joint_network – Joint Network module.

  • transducer_weight – Weight of the Transducer loss.

  • use_k2_pruned_loss – Whether to use k2 pruned Transducer loss.

  • k2_pruned_loss_args – Arguments of the k2 pruned Transducer loss.

  • warmup_steps – Number of steps in warmup, used for pruned loss scaling.

  • validation_nstep – Maximum number of symbol expansions at each time step when reporting CER or/and WER using mAES.

  • fastemit_lambda – FastEmit lambda value.

  • auxiliary_ctc_weight – Weight of auxiliary CTC loss.

  • auxiliary_ctc_dropout_rate – Dropout rate for auxiliary CTC loss inputs.

  • auxiliary_lm_loss_weight – Weight of auxiliary LM loss.

  • auxiliary_lm_loss_smoothing – Smoothing rate for LM loss’ label smoothing.

  • ignore_id – Initial padding ID.

  • sym_space – Space symbol.

  • sym_blank – Blank Symbol.

  • report_cer – Whether to report Character Error Rate during validation.

  • report_wer – Whether to report Word Error Rate during validation.

  • extract_feats_in_collect_stats – Whether to use extract_feats stats collection.

Construct an ESPnetASRTransducerModel object.

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Collect feature sequences and feature length sequences.

Parameters:
  • speech – Speech sequences. (B, S)

  • speech_lengths – Speech sequences lengths. (B,)

  • text – Label ID sequences. (B, L)

  • text_lengths – Label ID sequences lengths. (B,)

  • kwargs – Contains “utts_id”.

Returns:

“feats”: Features sequences. (B, T, D_feats),

”feats_lengths”: Features sequences lengths. (B,)

Return type:

{}

encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Encode speech sequences.

Parameters:
  • speech – Speech sequences. (B, S)

  • speech_lengths – Speech sequences lengths. (B,)

Returns:

Encoder outputs. (B, T, D_enc) encoder_out_lens: Encoder outputs lengths. (B,)

Return type:

encoder_out

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Forward architecture and compute loss(es).

Parameters:
  • speech – Speech sequences. (B, S)

  • speech_lengths – Speech sequences lengths. (B,)

  • text – Label ID sequences. (B, L)

  • text_lengths – Label ID sequences lengths. (B,)

  • kwargs – Contains “utts_id”.

Returns:

Main loss value. stats: Task statistics. weight: Task weights.

Return type:

loss

espnet2.asr_transducer.normalization

Normalization modules for Transducer.

class espnet2.asr_transducer.normalization.BasicNorm(normalized_shape: int, eps: float = 0.25)[source]

Bases: torch.nn.modules.module.Module

BasicNorm module definition.

Reference: https://github.com/k2-fsa/icefall/pull/288

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

Construct a BasicNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute basic normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

class espnet2.asr_transducer.normalization.RMSNorm(normalized_shape: int, eps: float = 1e-05, partial: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

RMSNorm module definition.

Reference: https://arxiv.org/pdf/1910.07467.pdf

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

  • partial – Value defining the part of the input used for RMS stats.

Construct a RMSNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute RMS normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

Return type:

x
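
For reference, the core computation (ignoring the partial option) can be sketched as follows; this is an illustrative re-implementation following the cited paper, not the module above:

   import torch

   def rms_norm(x: torch.Tensor, scale: torch.Tensor, eps: float = 1e-05) -> torch.Tensor:
       # Normalize by the root mean square over the last dimension,
       # then apply a learned per-dimension scale.
       rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
       return scale * (x / rms)

   x = torch.randn(2, 10, 256)
   out = rms_norm(x, torch.ones(256))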

class espnet2.asr_transducer.normalization.ScaleNorm(normalized_shape: int, eps: float = 1e-05)[source]

Bases: torch.nn.modules.module.Module

ScaleNorm module definition.

Reference: https://arxiv.org/pdf/1910.05895.pdf

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

Construct a ScaleNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute scale normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

espnet2.asr_transducer.normalization.get_normalization(normalization_type: str, eps: Optional[float] = None, partial: Optional[float] = None) → Tuple[torch.nn.modules.module.Module, Dict][source]

Get normalization module and arguments given parameters.

Parameters:
  • normalization_type – Normalization module type.

  • eps – Value added to the denominator.

  • partial – Value defining the part of the input used for RMS stats (RMSNorm).

Returns:

Normalization module class : Normalization module arguments
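
Example (a hedged sketch of how the factory is typically used; "rms_norm" is assumed to be an accepted normalization_type string, and the returned argument dict is assumed not to contain normalized_shape):

   from espnet2.asr_transducer.normalization import get_normalization

   norm_class, norm_args = get_normalization("rms_norm", eps=1e-05, partial=0.5)
   norm = norm_class(256, **norm_args)  # normalized_shape is supplied by the caller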

espnet2.asr_transducer.beam_search_transducer

Search algorithms for Transducer models.

class espnet2.asr_transducer.beam_search_transducer.BeamSearchTransducer(decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: Optional[torch.nn.modules.module.Module] = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 3, u_max: int = 50, nstep: int = 2, expansion_gamma: float = 2.3, expansion_beta: int = 2, score_norm: bool = False, nbest: int = 1, streaming: bool = False)[source]

Bases: object

Beam search implementation for Transducer.

Parameters:
  • decoder – Decoder module.

  • joint_network – Joint network module.

  • beam_size – Size of the beam.

  • lm – LM module.

  • lm_weight – LM weight for soft fusion.

  • search_type – Search algorithm to use during inference.

  • max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)

  • u_max – Maximum expected target sequence length. (ALSD)

  • nstep – Number of maximum expansion steps at each time step. (mAES)

  • expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)

  • expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)

  • score_norm – Normalize final scores by length.

  • nbest – Number of final hypotheses.

  • streaming – Whether to perform chunk-by-chunk beam search.

Construct a BeamSearchTransducer object.
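
Example (a minimal decoding sketch built from modules documented elsewhere in this package; the beam search object is assumed to be callable on a single encoder output sequence):

   import torch

   from espnet2.asr_transducer.beam_search_transducer import BeamSearchTransducer
   from espnet2.asr_transducer.decoder.stateless_decoder import StatelessDecoder
   from espnet2.asr_transducer.joint_network import JointNetwork

   vocab_size, enc_size, dec_size = 50, 80, 64

   decoder = StatelessDecoder(vocab_size, embed_size=dec_size)
   joint_network = JointNetwork(vocab_size, enc_size, dec_size, joint_space_size=128)

   beam_search = BeamSearchTransducer(
       decoder=decoder,
       joint_network=joint_network,
       beam_size=5,
       nbest=1,
   )

   enc_out = torch.randn(20, enc_size)  # (T, D_enc)
   nbest_hyps = beam_search(enc_out)    # N-best hypotheses for one utterance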

align_length_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Alignment-length synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

h – Encoder output sequences. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

create_lm_batch_inputs(hyps_seq: List[List[int]]) → torch.Tensor[source]

Make batch of inputs with left padding for LM scoring.

Parameters:

hyps_seq – Hypothesis sequences.

Returns:

Padded batch of sequences.

default_beam_search(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Beam search implementation without prefix search.

Modified from https://arxiv.org/pdf/1211.3711.pdf

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

modified_adaptive_expansion_search(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis][source]

Modified version of Adaptive Expansion Search (mAES).

Based on AES (https://ieeexplore.ieee.org/document/9250505) and NSC (https://arxiv.org/abs/2201.05420).

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

recombine_hyps(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Recombine hypotheses with the same label ID sequence.

Parameters:

hyps – Hypotheses.

Returns:

Recombined hypotheses.

Return type:

final

reset_cache() → None[source]

Reset cache for streaming decoding.

select_k_expansions(hyps: List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis], topk_idx: torch.Tensor, topk_logp: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis][source]

Return K hypothesis candidates for expansion from a list of hypotheses.

K candidates are selected according to the extended hypotheses' probabilities and a prune-by-value method, where K is equal to beam_size + beta.

Parameters:
  • hyps – Hypotheses.

  • topk_idx – Indices of candidate hypotheses.

  • topk_logp – Log-probabilities of candidate hypotheses.

Returns:

Best K expansion hypotheses candidates.

Return type:

k_expansions

sort_nbest(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Sort hypotheses in place by score, or by score normalized by sequence length.

Parameters:

hyps – Hypotheses.

Returns:

Sorted hypotheses.

Return type:

hyps

time_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Time synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

class espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis(score: float, yseq: List[int], dec_state: Optional[Tuple[torch.Tensor, Optional[torch.Tensor]]] = None, lm_state: Union[Dict[str, Any], List[Any], None] = None, dec_out: torch.Tensor = None, lm_score: torch.Tensor = None)[source]

Bases: espnet2.asr_transducer.beam_search_transducer.Hypothesis

Extended hypothesis definition for NSC beam search and mAES.

Parameters (in addition to the Hypothesis dataclass arguments):
  • dec_out – Decoder output sequence. (B, D_dec)

  • lm_score – Log-probabilities of the LM for given label. (vocab_size)

dec_out = None
lm_score = None
class espnet2.asr_transducer.beam_search_transducer.Hypothesis(score: float, yseq: List[int], dec_state: Optional[Tuple[torch.Tensor, Optional[torch.Tensor]]] = None, lm_state: Union[Dict[str, Any], List[Any], None] = None)[source]

Bases: object

Default hypothesis definition for Transducer search algorithms.

Parameters:
  • score – Total log-probability.

  • yseq – Label sequence as integer ID sequence.

  • dec_state – RNN/MEGA Decoder state (None if Stateless).

  • lm_state – RNNLM state. ((N, D_lm), (N, D_lm)) or None

dec_state = None
lm_state = None

espnet2.asr_transducer.utils

Utility functions for Transducer models.

exception espnet2.asr_transducer.utils.TooShortUttError(message: str, actual_size: int, limit: int)[source]

Bases: Exception

Raised when the utterance is too short for subsampling.

Parameters:
  • message – Error message to display.

  • actual_size – The size that cannot pass the subsampling.

  • limit – The size limit for subsampling.

Construct a TooShortUttError object.

espnet2.asr_transducer.utils.check_short_utt(sub_factor: int, size: int) → Tuple[bool, int][source]

Check if the input is too short for subsampling.

Parameters:
  • sub_factor – Subsampling factor for Conv2DSubsampling.

  • size – Input size.

Returns:

Whether an error should be sent. : Size limit for specified subsampling factor.
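
Example (guarding against too-short inputs; the error message below is illustrative):

   from espnet2.asr_transducer.utils import TooShortUttError, check_short_utt

   size = 6
   too_short, limit = check_short_utt(4, size)  # subsampling factor 4

   if too_short:
       raise TooShortUttError(
           f"Input size {size} is below the subsampling limit {limit}", size, limit
       )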

espnet2.asr_transducer.utils.get_convinput_module_parameters(input_size: int, last_conv_size, subsampling_factor: int, is_vgg: bool = True) → Tuple[Union[Tuple[int, int], int], int][source]

Return the convolution module parameters.

Parameters:
  • input_size – Module input size.

  • last_conv_size – Last convolution size for module output size computation.

  • subsampling_factor – Total subsampling factor.

  • is_vgg – Whether the module type is VGG-like.

Returns:

First MaxPool2D kernel size or second Conv2d kernel size and stride. output_size: Convolution module output size.

espnet2.asr_transducer.utils.get_transducer_task_io(labels: torch.Tensor, encoder_out_lens: torch.Tensor, ignore_id: int = -1, blank_id: int = 0) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Get Transducer loss I/O.

Parameters:
  • labels – Label ID sequences. (B, L)

  • encoder_out_lens – Encoder output lengths. (B,)

  • ignore_id – Padding symbol ID.

  • blank_id – Blank symbol ID.

Returns:

Decoder inputs. (B, U) target: Target label ID sequences. (B, U) t_len: Time lengths. (B,) u_len: Label lengths. (B,)

Return type:

decoder_in
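
Example (preparing Transducer loss inputs from padded label sequences and encoder output lengths):

   import torch

   from espnet2.asr_transducer.utils import get_transducer_task_io

   labels = torch.tensor([[3, 5, 7], [2, 4, -1]])  # (B, L), padded with ignore_id
   encoder_out_lens = torch.tensor([10, 8])        # (B,)

   decoder_in, target, t_len, u_len = get_transducer_task_io(labels, encoder_out_lens)
   # decoder_in: blank-prefixed decoder inputs, target: padded target labels,
   # t_len / u_len: time and label lengths consumed by the Transducer loss.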

espnet2.asr_transducer.utils.make_chunk_mask(size: int, chunk_size: int, num_left_chunks: int = 0, device: torch.device = None) → torch.Tensor[source]

Create chunk mask for the subsequent steps (size, size).

Reference: https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py

Parameters:
  • size – Size of the source mask.

  • chunk_size – Number of frames in chunk.

  • num_left_chunks – Number of left chunks the attention module can see. (null or negative value means full context)

  • device – Device for the mask tensor.

Returns:

Chunk mask. (size, size)

Return type:

mask

espnet2.asr_transducer.utils.make_source_mask(lengths: torch.Tensor) → torch.Tensor[source]

Create source mask for given lengths.

Reference: https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py

Parameters:

lengths – Sequence lengths. (B,)

Returns:

Mask for the sequence lengths. (B, max_len)
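
Example (building the masks used for streaming; the boolean convention is the one defined by these helpers):

   import torch

   from espnet2.asr_transducer.utils import make_chunk_mask, make_source_mask

   src_mask = make_source_mask(torch.tensor([6, 4]))                 # (B, max_len)
   chunk_mask = make_chunk_mask(6, chunk_size=2, num_left_chunks=1)  # (6, 6)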

espnet2.asr_transducer.__init__

espnet2.asr_transducer.joint_network

Transducer joint network implementation.

class espnet2.asr_transducer.joint_network.JointNetwork(output_size: int, encoder_size: int, decoder_size: int, joint_space_size: int = 256, joint_activation_type: str = 'tanh', **activation_parameters)[source]

Bases: torch.nn.modules.module.Module

Transducer joint network module.

Parameters:
  • output_size – Output size.

  • encoder_size – Encoder output size.

  • decoder_size – Decoder output size.

  • joint_space_size – Joint space size.

  • joint_act_type – Type of activation for joint network.

  • **activation_parameters – Parameters for the activation function.

Construct a JointNetwork object.

forward(enc_out: torch.Tensor, dec_out: torch.Tensor, no_projection: bool = False) → torch.Tensor[source]

Joint computation of encoder and decoder hidden state sequences.

Parameters:
  • enc_out – Expanded encoder output state sequences. (B, T, s_range, D_enc) or (B, T, 1, D_enc)

  • dec_out – Expanded decoder output state sequences. (B, T, s_range, D_dec) or (B, 1, U, D_dec)

Returns:

Joint output state sequences.

(B, T, U, D_out) or (B, T, s_range, D_out)

Return type:

joint_out
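
Example (a joint computation over full encoder and decoder output sequences, using the broadcastable shapes documented above):

   import torch

   from espnet2.asr_transducer.joint_network import JointNetwork

   B, T, U, D_enc, D_dec, vocab_size = 2, 12, 5, 80, 64, 50

   joint_network = JointNetwork(vocab_size, D_enc, D_dec, joint_space_size=128)

   enc_out = torch.randn(B, T, 1, D_enc)        # (B, T, 1, D_enc)
   dec_out = torch.randn(B, 1, U, D_dec)        # (B, 1, U, D_dec)

   joint_out = joint_network(enc_out, dec_out)  # (B, T, U, vocab_size)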

espnet2.asr_transducer.encoder.encoder

Encoder for Transducer model.

class espnet2.asr_transducer.encoder.encoder.Encoder(input_size: int, body_conf: List[Dict[str, Any]], input_conf: Dict[str, Any] = {}, main_conf: Dict[str, Any] = {})[source]

Bases: torch.nn.modules.module.Module

Encoder module definition.

Parameters:
  • input_size – Input size.

  • body_conf – Encoder body configuration.

  • input_conf – Encoder input configuration.

  • main_conf – Encoder main configuration.

Construct an Encoder object.

chunk_forward(x: torch.Tensor, x_len: torch.Tensor, processed_frames: torch._VariableFunctionsClass.tensor, left_context: int = 32) → torch.Tensor[source]

Encode input sequences as chunks.

Parameters:
  • x – Encoder input features. (1, T_in, F)

  • x_len – Encoder input features lengths. (1,)

  • processed_frames – Number of frames already seen.

  • left_context – Number of previous frames (AFTER subsampling) the attention module can see in current chunk.

Returns:

Encoder outputs. (B, T_out, D_enc)

Return type:

x

forward(x: torch.Tensor, x_len: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – Encoder input features. (B, T_in, F)

  • x_len – Encoder input features lengths. (B,)

Returns:

Encoder outputs. (B, T_out, D_enc) x_len: Encoder outputs lengths. (B,)

Return type:

x

reset_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset encoder cache for streaming.

Parameters:
  • left_context – Number of previous frames (AFTER subsampling) the attention module can see in current chunk.

  • device – Device ID.

espnet2.asr_transducer.encoder.__init__

espnet2.asr_transducer.encoder.building

Set of methods to build Transducer encoder architecture.

espnet2.asr_transducer.encoder.building.build_body_blocks(configuration: List[Dict[str, Any]], main_params: Dict[str, Any], output_size: int) → espnet2.asr_transducer.encoder.modules.multi_blocks.MultiBlocks[source]

Build encoder body blocks.

Parameters:
  • configuration – Body blocks configuration.

  • main_params – Encoder main parameters.

  • output_size – Architecture output size.

Returns:

MultiBlocks function encapsulating all encoder blocks.

espnet2.asr_transducer.encoder.building.build_branchformer_block(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.branchformer.Branchformer[source]

Build Branchformer block.

Parameters:
  • configuration – Branchformer block configuration.

  • main_params – Encoder main parameters.

Returns:

Branchformer block function.

espnet2.asr_transducer.encoder.building.build_conformer_block(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.conformer.Conformer[source]

Build Conformer block.

Parameters:
  • configuration – Conformer block configuration.

  • main_params – Encoder main parameters.

Returns:

Conformer block function.

espnet2.asr_transducer.encoder.building.build_conv1d_block(configuration: List[Dict[str, Any]], causal: bool) → espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d[source]

Build Conv1d block.

Parameters:
  • configuration – Conv1d block configuration.

  • causal – Whether to use causal convolution.

Returns:

Conv1d block function.

espnet2.asr_transducer.encoder.building.build_ebranchformer_block(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.ebranchformer.EBranchformer[source]

Build E-Branchformer block.

Parameters:
  • configuration – E-Branchformer block configuration.

  • main_params – Encoder main parameters.

Returns:

E-Branchformer block function.

espnet2.asr_transducer.encoder.building.build_input_block(input_size: int, configuration: Dict[str, Union[str, int]]) → espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput[source]

Build encoder input block.

Parameters:
  • input_size – Input size.

  • configuration – Input block configuration.

Returns:

ConvInput block function.

espnet2.asr_transducer.encoder.building.build_main_parameters(pos_wise_act_type: str = 'swish', conv_mod_act_type: str = 'swish', pos_enc_dropout_rate: float = 0.0, pos_enc_max_len: int = 5000, simplified_att_score: bool = False, norm_type: str = 'layer_norm', conv_mod_norm_type: str = 'layer_norm', after_norm_eps: Optional[float] = None, after_norm_partial: Optional[float] = None, blockdrop_rate: float = 0.0, dynamic_chunk_training: bool = False, short_chunk_threshold: float = 0.75, short_chunk_size: int = 25, num_left_chunks: int = 0, **activation_parameters) → Dict[str, Any][source]

Build encoder main parameters.

Parameters:
  • pos_wise_act_type – X-former position-wise feed-forward activation type.

  • conv_mod_act_type – X-former convolution module activation type.

  • pos_enc_dropout_rate – Positional encoding dropout rate.

  • pos_enc_max_len – Positional encoding maximum length.

  • simplified_att_score – Whether to use simplified attention score computation.

  • norm_type – X-former normalization module type.

  • conv_mod_norm_type – Conformer convolution module normalization type.

  • after_norm_eps – Epsilon value for the final normalization.

  • after_norm_partial – Value for the final normalization with RMSNorm.

  • blockdrop_rate – Probability threshold of dropping out each encoder block.

  • dynamic_chunk_training – Whether to use dynamic chunk training.

  • short_chunk_threshold – Threshold for dynamic chunk selection.

  • short_chunk_size – Minimum number of frames during dynamic chunk training.

  • num_left_chunks – Number of left chunks the attention module can see. (null or negative value means full context)

  • **activation_parameters – Parameters of the activation functions. (See espnet2/asr_transducer/activation.py)

Returns:

Main encoder parameters
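
Example (building the shared encoder parameters with mostly default values; any keyword left out keeps its documented default):

   from espnet2.asr_transducer.encoder.building import build_main_parameters

   main_params = build_main_parameters(
       pos_wise_act_type="swish",
       conv_mod_act_type="swish",
       norm_type="layer_norm",
       dynamic_chunk_training=False,
   )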

espnet2.asr_transducer.encoder.building.build_positional_encoding(block_size: int, configuration: Dict[str, Any]) → espnet2.asr_transducer.encoder.modules.positional_encoding.RelPositionalEncoding[source]

Build positional encoding block.

Parameters:
  • block_size – Input/output size.

  • configuration – Positional encoding configuration.

Returns:

Positional encoding module.

espnet2.asr_transducer.encoder.validation

Set of methods to validate encoder architecture.

espnet2.asr_transducer.encoder.validation.validate_architecture(input_conf: Dict[str, Any], body_conf: List[Dict[str, Any]], input_size: int) → Tuple[int, int][source]

Validate that the specified encoder architecture is valid.

Parameters:
  • input_conf – Encoder input block configuration.

  • body_conf – Encoder body blocks configuration.

  • input_size – Encoder input size.

Returns:

Encoder input block output size. : Encoder body block output size.

Return type:

input_block_osize

espnet2.asr_transducer.encoder.validation.validate_block_arguments(configuration: Dict[str, Any], block_id: int, previous_block_output: int) → Tuple[int, int][source]

Validate block arguments.

Parameters:
  • configuration – Architecture configuration.

  • block_id – Block ID.

  • previous_block_output – Previous block output size.

Returns:

Block input size. output_size: Block output size.

Return type:

input_size

espnet2.asr_transducer.encoder.validation.validate_input_block(configuration: Dict[str, Any], body_first_conf: Dict[str, Any], input_size: int) → int[source]

Validate input block.

Parameters:
  • configuration – Encoder input block configuration.

  • body_first_conf – Encoder first body block configuration.

  • input_size – Encoder input block input size.

Returns:

Encoder input block output size.

Return type:

output_size

espnet2.asr_transducer.encoder.blocks.conformer

Conformer block for Transducer encoder.

class espnet2.asr_transducer.encoder.blocks.conformer.Conformer(block_size: int, self_att: torch.nn.modules.module.Module, feed_forward: torch.nn.modules.module.Module, feed_forward_macaron: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Conformer module definition.

Parameters:
  • block_size – Input/output size.

  • self_att – Self-attention module instance.

  • feed_forward – Feed-forward module instance.

  • feed_forward_macaron – Feed-forward module instance for macaron network.

  • conv_mod – Convolution module instance.

  • norm_class – Normalization module class.

  • norm_args – Normalization module arguments.

  • dropout_rate – Dropout rate.

Construct a Conformer object.

chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]

Encode chunk of input sequence.

Parameters:
  • x – Conformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T_2)

  • left_context – Number of previous frames the attention module can see in current chunk.

Returns:

Conformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – Conformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T)

  • chunk_mask – Chunk mask. (T_2, T_2)

Returns:

Conformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

reset_streaming_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset self-attention and convolution modules cache for streaming.

Parameters:
  • left_context – Number of previous frames the attention module can see in current chunk.

  • device – Device to use for cache tensor.

espnet2.asr_transducer.encoder.blocks.conv_input

ConvInput block for Transducer encoder.

class espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput(input_size: int, conv_size: Union[int, Tuple], subsampling_factor: int = 4, vgg_like: bool = True, output_size: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

ConvInput module definition.

Parameters:
  • input_size – Input size.

  • conv_size – Convolution size.

  • subsampling_factor – Subsampling factor.

  • vgg_like – Whether to use a VGG-like network.

  • output_size – Block output dimension.

Construct a ConvInput object.

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – ConvInput input sequences. (B, T, D_feats)

  • mask – Mask of input sequences. (B, 1, T)

Returns:

ConvInput output sequences. (B, sub(T), D_out) mask: Mask of output sequences. (B, 1, sub(T))

Return type:

x

espnet2.asr_transducer.encoder.blocks.ebranchformer

E-Branchformer block for Transducer encoder.

class espnet2.asr_transducer.encoder.blocks.ebranchformer.EBranchformer(block_size: int, linear_size: int, self_att: torch.nn.modules.module.Module, feed_forward: torch.nn.modules.module.Module, feed_forward_macaron: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, depthwise_conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

E-Branchformer module definition.

Reference: https://arxiv.org/pdf/2210.00077.pdf

Parameters:
  • block_size – Input/output size.

  • linear_size – Linear layers’ hidden size.

  • self_att – Self-attention module instance.

  • feed_forward – Feed-forward module instance.

  • feed_forward_macaron – Feed-forward module instance for macaron network.

  • conv_mod – ConvolutionalSpatialGatingUnit module instance.

  • depthwise_conv_mod – DepthwiseConvolution module instance.

  • norm_class – Normalization class.

  • norm_args – Normalization module arguments.

  • dropout_rate – Dropout rate.

Construct an E-Branchformer object.

chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]

Encode chunk of input sequence.

Parameters:
  • x – E-Branchformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T_2)

  • left_context – Number of previous frames the attention module can see in current chunk.

Returns:

E-Branchformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – E-Branchformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T)

  • chunk_mask – Chunk mask. (T_2, T_2)

Returns:

E-Branchformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

reset_streaming_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset self-attention and convolution modules cache for streaming.

Parameters:
  • left_context – Number of previous frames the attention module can see in current chunk.

  • device – Device to use for cache tensor.

espnet2.asr_transducer.encoder.blocks.branchformer

Branchformer block for Transducer encoder.

class espnet2.asr_transducer.encoder.blocks.branchformer.Branchformer(block_size: int, linear_size: int, self_att: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Branchformer module definition.

Reference: https://arxiv.org/pdf/2207.02971.pdf

Parameters:
  • block_size – Input/output size.

  • linear_size – Linear layers’ hidden size.

  • self_att – Self-attention module instance.

  • conv_mod – Convolution module instance.

  • norm_class – Normalization class.

  • norm_args – Normalization module arguments.

  • dropout_rate – Dropout rate.

Construct a Branchformer object.

chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]

Encode chunk of input sequence.

Parameters:
  • x – Branchformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T_2)

  • left_context – Number of previous frames the attention module can see in current chunk.

Returns:

Branchformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – Branchformer input sequences. (B, T, D_block)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)

  • mask – Source mask. (B, T)

  • chunk_mask – Chunk mask. (T_2, T_2)

Returns:

Branchformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)

Return type:

x

reset_streaming_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset self-attention and convolution modules cache for streaming.

Parameters:
  • left_context – Number of previous frames the attention module can see in current chunk.

  • device – Device to use for cache tensor.

espnet2.asr_transducer.encoder.blocks.conv1d

Conv1d block for Transducer encoder.

class espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d(input_size: int, output_size: int, kernel_size: Union[int, Tuple], stride: Union[int, Tuple] = 1, dilation: Union[int, Tuple] = 1, groups: Union[int, Tuple] = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, causal: bool = False, dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Conv1d module definition.

Parameters:
  • input_size – Input dimension.

  • output_size – Output dimension.

  • kernel_size – Size of the convolving kernel.

  • stride – Stride of the convolution.

  • dilation – Spacing between the kernel points.

  • groups – Number of blocked connections from input channels to output channels.

  • bias – Whether to add a learnable bias to the output.

  • batch_norm – Whether to use batch normalization after convolution.

  • relu – Whether to use a ReLU activation after convolution.

  • causal – Whether to use causal convolution (set to True if streaming).

  • dropout_rate – Dropout rate.

Construct a Conv1d object.

chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]

Encode chunk of input sequence.

Parameters:
  • x – Conv1d input sequences. (B, T, D_in)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)

  • mask – Source mask. (B, T)

  • left_context – Number of previous frames the attention module can see in current chunk (not used here).

Returns:

Conv1d output sequences. (B, T, D_out) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_out)

Return type:

x

create_new_mask(mask: torch.Tensor) → torch.Tensor[source]

Create new mask for output sequences.

Parameters:

mask – Mask of input sequences. (B, T)

Returns:

Mask of output sequences. (B, sub(T))

Return type:

mask

create_new_pos_enc(pos_enc: torch.Tensor) → torch.Tensor[source]

Create new positional embedding vector.

Parameters:

pos_enc – Input sequences positional embedding. (B, 2 * (T - 1), D_in)

Returns:

Output sequences positional embedding.

(B, 2 * (sub(T) - 1), D_in)

Return type:

pos_enc

forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: Optional[torch.Tensor] = None, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Encode input sequences.

Parameters:
  • x – Conv1d input sequences. (B, T, D_in)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)

  • mask – Source mask. (B, T)

  • chunk_mask – Chunk mask. (T_2, T_2)

Returns:

Conv1d output sequences. (B, sub(T), D_out) mask: Source mask. (B, T) or (B, sub(T)) pos_enc: Positional embedding sequences.

(B, 2 * (T - 1), D_att) or (B, 2 * (sub(T) - 1), D_out)

Return type:

x

reset_streaming_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset Conv1d cache for streaming.

Parameters:
  • left_context – Number of previous frames the attention module can see in current chunk (not used here).

  • device – Device to use for cache tensor.

espnet2.asr_transducer.encoder.blocks.__init__

espnet2.asr_transducer.encoder.modules.multi_blocks

MultiBlocks for encoder architecture.

class espnet2.asr_transducer.encoder.modules.multi_blocks.MultiBlocks(block_list: List[torch.nn.modules.module.Module], output_size: int, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Optional[Dict] = None, blockdrop_rate: int = 0.0)[source]

Bases: torch.nn.modules.module.Module

MultiBlocks definition.

Parameters:
  • block_list – Individual blocks of the encoder architecture.

  • output_size – Architecture output size.

  • norm_class – Normalization module class.

  • norm_args – Normalization module arguments.

  • blockdrop_rate – Probability threshold of dropping out each block.

Construct a MultiBlocks object.

chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → torch.Tensor[source]

Forward each block of the encoder architecture.

Parameters:
  • x – MultiBlocks input sequences. (B, T, D_block_1)

  • pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_att)

  • mask – Source mask. (B, T_2)

  • left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).

Returns:

MultiBlocks output sequences. (B, T, D_block_N)

Return type:

x

forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Forward each block of the encoder architecture.

Parameters:
  • x – MultiBlocks input sequences. (B, T, D_block_1)

  • pos_enc – Positional embedding sequences.

  • mask – Source mask. (B, T)

  • chunk_mask – Chunk mask. (T_2, T_2)

Returns:

Output sequences. (B, T, D_block_N)

Return type:

x

reset_streaming_cache(left_context: int, device: torch.device) → None[source]

Initialize/Reset encoder streaming cache.

Parameters:
  • left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).

  • device – Device to use for cache tensor.

espnet2.asr_transducer.encoder.modules.positional_encoding

Positional encoding modules.

class espnet2.asr_transducer.encoder.modules.positional_encoding.RelPositionalEncoding(size: int, dropout_rate: float = 0.0, max_len: int = 5000)[source]

Bases: torch.nn.modules.module.Module

Relative positional encoding.

Parameters:
  • size – Module size.

  • max_len – Maximum input length.

  • dropout_rate – Dropout rate.

Construct a RelPositionalEncoding object.

extend_pe(x: torch.Tensor, left_context: int = 0) → None[source]

Reset positional encoding.

Parameters:
  • x – Input sequences. (B, T, ?)

  • left_context – Number of previous frames the attention module can see in current chunk.

forward(x: torch.Tensor, left_context: int = 0) → torch.Tensor[source]

Compute positional encoding.

Parameters:
  • x – Input sequences. (B, T, ?)

  • left_context – Number of previous frames the attention module can see in current chunk.

Returns:

Positional embedding sequences. (B, 2 * (T - 1), ?)

Return type:

pos_enc
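
Example (computing relative positional embeddings for a batch of hidden sequences):

   import torch

   from espnet2.asr_transducer.encoder.modules.positional_encoding import (
       RelPositionalEncoding,
   )

   pos_enc_module = RelPositionalEncoding(size=256, dropout_rate=0.0)

   x = torch.randn(2, 10, 256)
   pos_enc = pos_enc_module(x)  # positional embedding sequences for x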

espnet2.asr_transducer.encoder.modules.normalization

Normalization modules for X-former blocks.

class espnet2.asr_transducer.encoder.modules.normalization.BasicNorm(normalized_shape: int, eps: float = 0.25)[source]

Bases: torch.nn.modules.module.Module

BasicNorm module definition.

Reference: https://github.com/k2-fsa/icefall/pull/288

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

Construct a BasicNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute basic normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

class espnet2.asr_transducer.encoder.modules.normalization.RMSNorm(normalized_shape: int, eps: float = 1e-05, partial: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

RMSNorm module definition.

Reference: https://arxiv.org/pdf/1910.07467.pdf

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

  • partial – Value defining the part of the input used for RMS stats.

Construct a RMSNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute RMS normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

Return type:

x

class espnet2.asr_transducer.encoder.modules.normalization.ScaleNorm(normalized_shape: int, eps: float = 1e-05)[source]

Bases: torch.nn.modules.module.Module

ScaleNorm module definition.

Reference: https://arxiv.org/pdf/1910.05895.pdf

Parameters:
  • normalized_shape – Expected size.

  • eps – Value added to the denominator for numerical stability.

Construct a ScaleNorm object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute scale normalization.

Parameters:

x – Input sequences. (B, T, D_hidden)

Returns:

Output sequences. (B, T, D_hidden)

espnet2.asr_transducer.encoder.modules.normalization.get_normalization(normalization_type: str, eps: Optional[float] = None, partial: Optional[float] = None) → Tuple[torch.nn.modules.module.Module, Dict][source]

Get normalization module and arguments given parameters.

Parameters:
  • normalization_type – Normalization module type.

  • eps – Value added to the denominator.

  • partial – Value defining the part of the input used for RMS stats (RMSNorm).

Returns:

Normalization module class : Normalization module arguments

espnet2.asr_transducer.encoder.modules.convolution

Convolution modules for X-former blocks.

class espnet2.asr_transducer.encoder.modules.convolution.ConformerConvolution(channels: int, kernel_size: int, activation: torch.nn.modules.module.Module = ReLU(), norm_args: Dict = {}, causal: bool = False)[source]

Bases: torch.nn.modules.module.Module

ConformerConvolution module definition.

Parameters:
  • channels – The number of channels.

  • kernel_size – Size of the convolving kernel.

  • activation – Activation function.

  • norm_args – Normalization module arguments.

  • causal – Whether to use causal convolution (set to True if streaming).

Construct a ConformerConvolution object.

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute convolution module.

Parameters:
  • x – ConformerConvolution input sequences. (B, T, D_hidden)

  • mask – Source mask. (B, T_2)

  • cache – ConformerConvolution input cache. (1, D_hidden, conv_kernel)

Returns:

ConformerConvolution output sequences. (B, ?, D_hidden) cache: ConformerConvolution output cache. (1, D_hidden, conv_kernel)

Return type:

x

class espnet2.asr_transducer.encoder.modules.convolution.ConvolutionalSpatialGatingUnit(size: int, kernel_size: int, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0, causal: bool = False)[source]

Bases: torch.nn.modules.module.Module

Convolutional Spatial Gating Unit module definition.

Parameters:
  • size – Initial size to determine the number of channels.

  • kernel_size – Size of the convolving kernel.

  • norm_class – Normalization module class.

  • norm_args – Normalization module arguments.

  • dropout_rate – Dropout rate.

  • causal – Whether to use causal convolution (set to True if streaming).

Construct a ConvolutionalSpatialGatingUnit object.

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute convolution module.

Parameters:
  • x – ConvolutionalSpatialGatingUnit input sequences. (B, T, D_hidden)

  • mask – Source mask. (B, T_2)

  • cache – ConvolutionalSpatialGatingUnit input cache. (1, D_hidden, conv_kernel)

Returns:

ConvolutionalSpatialGatingUnit output sequences. (B, ?, D_hidden)

Return type:

x

class espnet2.asr_transducer.encoder.modules.convolution.DepthwiseConvolution(size: int, kernel_size: int, causal: bool = False)[source]

Bases: torch.nn.modules.module.Module

Depth-wise Convolution module definition.

Parameters:
  • size – Initial size to determine the number of channels.

  • kernel_size – Size of the convolving kernel.

  • causal – Whether to use causal convolution (set to True if streaming).

Construct a DepthwiseConvolution object.

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute convolution module.

Parameters:
  • x – DepthwiseConvolution input sequences. (B, T, D_hidden)

  • mask – Source mask. (B, T_2)

  • cache – DepthwiseConvolution input cache. (1, conv_kernel, D_hidden)

Returns:

DepthwiseConvolution output sequences. (B, ?, D_hidden)

Return type:

x

espnet2.asr_transducer.encoder.modules.attention

Multi-Head attention layers with relative positional encoding.

class espnet2.asr_transducer.encoder.modules.attention.RelPositionMultiHeadedAttention(num_heads: int, embed_size: int, dropout_rate: float = 0.0, simplified_attention_score: bool = False)[source]

Bases: torch.nn.modules.module.Module

RelPositionMultiHeadedAttention definition.

Parameters:
  • num_heads – Number of attention heads.

  • embed_size – Embedding size.

  • dropout_rate – Dropout rate.

  • simplified_attention_score – Whether to use simplified attention score computation.

Construct a RelPositionMultiHeadedAttention object.

compute_attention_score(query: torch.Tensor, key: torch.Tensor, pos_enc: torch.Tensor, left_context: int = 0) → torch.Tensor[source]

Attention score computation.

Parameters:
  • query – Transformed query tensor. (B, H, T_1, d_k)

  • key – Transformed key tensor. (B, H, T_2, d_k)

  • pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)

  • left_context – Number of previous frames to use for current chunk attention computation.

Returns:

Attention score. (B, H, T_1, T_2)

compute_simplified_attention_score(query: torch.Tensor, key: torch.Tensor, pos_enc: torch.Tensor, left_context: int = 0) → torch.Tensor[source]

Simplified attention score computation.

Reference: https://github.com/k2-fsa/icefall/pull/458

Parameters:
  • query – Transformed query tensor. (B, H, T_1, d_k)

  • key – Transformed key tensor. (B, H, T_2, d_k)

  • pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)

  • left_context – Number of previous frames to use for current chunk attention computation.

Returns:

Attention score. (B, H, T_1, T_2)

forward(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None, left_context: int = 0) → torch.Tensor[source]

Compute scaled dot product attention with rel. positional encoding.

Parameters:
  • query – Query tensor. (B, T_1, size)

  • key – Key tensor. (B, T_2, size)

  • value – Value tensor. (B, T_2, size)

  • pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)

  • mask – Source mask. (B, T_2)

  • chunk_mask – Chunk mask. (T_1, T_1)

  • left_context – Number of previous frames to use for current chunk attention computation.

Returns:

Output tensor. (B, T_1, H * d_k)

forward_attention(value: torch.Tensor, scores: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Compute attention context vector.

Parameters:
  • value – Transformed value. (B, H, T_2, d_k)

  • scores – Attention score. (B, H, T_1, T_2)

  • mask – Source mask. (B, T_2)

  • chunk_mask – Chunk mask. (T_1, T_1)

Returns:

Transformed value weighted by attention score. (B, T_1, H * d_k)

Return type:

attn_output

forward_qkv(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Transform query, key and value.

Parameters:
  • query – Query tensor. (B, T_1, size)

  • key – Key tensor. (B, T_2, size)

  • value – Value tensor. (B, T_2, size)

Returns:

Transformed query tensor. (B, H, T_1, d_k) k: Transformed key tensor. (B, H, T_2, d_k) v: Transformed value tensor. (B, H, T_2, d_k)

Return type:

q

rel_shift(x: torch.Tensor, left_context: int = 0) → torch.Tensor[source]

Compute relative positional encoding.

Parameters:
  • x – Input sequence. (B, H, T_1, 2 * T_1 - 1)

  • left_context – Number of previous frames to use for current chunk attention computation.

Returns:

Output sequence. (B, H, T_1, T_2)

Return type:

x
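
Example (a self-attention sketch with dummy relative positional embeddings; the source mask is built with make_source_mask so that no frame is treated as padding):

   import torch

   from espnet2.asr_transducer.encoder.modules.attention import (
       RelPositionMultiHeadedAttention,
   )
   from espnet2.asr_transducer.utils import make_source_mask

   mha = RelPositionMultiHeadedAttention(num_heads=4, embed_size=256)

   B, T = 2, 10
   x = torch.randn(B, T, 256)
   pos_enc = torch.randn(B, 2 * T - 1, 256)       # (B, 2 * T_1 - 1, size), dummy values
   mask = make_source_mask(torch.tensor([T, T]))  # (B, T)

   out = mha(x, x, x, pos_enc, mask)              # (B, T, H * d_k)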

espnet2.asr_transducer.encoder.modules.__init__

espnet2.asr_transducer.decoder.stateless_decoder

Stateless decoder definition for Transducer models.

class espnet2.asr_transducer.decoder.stateless_decoder.StatelessDecoder(vocab_size: int, embed_size: int = 256, embed_dropout_rate: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder

Stateless Transducer decoder module.

Parameters:
  • vocab_size – Output size.

  • embed_size – Embedding size.

  • embed_dropout_rate – Dropout rate for embedding layer.

  • embed_pad – Embed/Blank symbol ID.

Construct a StatelessDecoder object.

batch_score(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, None][source]

One-step forward hypotheses.

Parameters:

hyps – Hypotheses.

Returns:

Decoder output sequences. (B, D_dec) states: Decoder hidden states. None

Return type:

out

create_batch_states(new_states: List[Optional[torch.Tensor]]) → None[source]

Create decoder hidden states.

Parameters:

new_states – Decoder hidden states. [N x None]

Returns:

Decoder hidden states. None

Return type:

states

forward(labels: torch.Tensor, states: Optional[Any] = None) → torch.Tensor[source]

Encode source label sequences.

Parameters:
  • labels – Label ID sequences. (B, L)

  • states – Decoder hidden states. None

Returns:

Decoder output sequences. (B, U, D_emb)

Return type:

embed

init_state(batch_size: int) → None[source]

Initialize decoder states.

Parameters:

batch_size – Batch size.

Returns:

Initial decoder hidden states. None

score(label_sequence: List[int], states: Optional[Any] = None) → Tuple[torch.Tensor, None][source]

One-step forward hypothesis.

Parameters:
  • label_sequence – Current label sequence.

  • states – Decoder hidden states. None

Returns:

Decoder output sequence. (1, D_emb) state: Decoder hidden states. None

select_state(states: Optional[torch.Tensor], idx: int) → None[source]

Get specified ID state from decoder hidden states.

Parameters:
  • states – Decoder hidden states. None

  • idx – State ID to extract.

Returns:

Decoder hidden state for given ID. None

set_device(device: torch.device) → None[source]

Set GPU device to use.

Parameters:

device – Device ID.
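
Example (encoding a batch of label ID sequences with the stateless decoder):

   import torch

   from espnet2.asr_transducer.decoder.stateless_decoder import StatelessDecoder

   decoder = StatelessDecoder(vocab_size=50, embed_size=64)

   labels = torch.tensor([[3, 5, 7], [2, 4, 0]])  # (B, L); 0 is the blank/pad ID
   dec_out = decoder(labels)                      # (B, U, D_emb)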

espnet2.asr_transducer.decoder.mega_decoder

MEGA decoder definition for Transducer models.

class espnet2.asr_transducer.decoder.mega_decoder.MEGADecoder(vocab_size: int, block_size: int = 512, linear_size: int = 1024, qk_size: int = 128, v_size: int = 1024, num_heads: int = 4, rel_pos_bias_type: str = 'simple', max_positions: int = 2048, truncation_length: Optional[int] = None, normalization_type: str = 'layer_norm', normalization_args: Dict = {}, activation_type: str = 'swish', activation_args: Dict = {}, chunk_size: int = -1, num_blocks: int = 4, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ema_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder

MEGA decoder module.

Based on https://arxiv.org/pdf/2209.10655.pdf.

Parameters:
  • vocab_size – Vocabulary size.

  • block_size – Input/Output size.

  • linear_size – NormalizedPositionwiseFeedForward hidden size.

  • qk_size – Shared query and key size for attention module.

  • v_size – Value size for attention module.

  • num_heads – Number of EMA heads.

  • rel_pos_bias_type – Type of relative position bias in attention module.

  • max_positions – Maximum number of positions for RelativePositionBias.

  • truncation_length – Maximum length for truncation in EMA module.

  • normalization_type – Normalization layer type.

  • normalization_args – Normalization layer arguments.

  • activation_type – Activation function type.

  • activation_args – Activation function arguments.

  • chunk_size – Chunk size for attention computation (-1 = full context).

  • num_blocks – Number of MEGA blocks.

  • dropout_rate – Dropout rate for MEGA internal modules.

  • embed_dropout_rate – Dropout rate for embedding layer.

  • att_dropout_rate – Dropout rate for the attention module.

  • ema_dropout_rate – Dropout rate for the EMA module.

  • ffn_dropout_rate – Dropout rate for the feed-forward module.

  • embed_pad – Embedding padding symbol ID.

Construct a MEGADecoder object.
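
Example (a minimal sketch using the default hyperparameters; the vocabulary and batch sizes are illustrative only):

    import torch

    from espnet2.asr_transducer.decoder.mega_decoder import MEGADecoder

    # Defaults: block_size=512, num_heads=4, num_blocks=4, chunk_size=-1 (full context).
    decoder = MEGADecoder(vocab_size=50)

    labels = torch.randint(1, 50, (2, 10))  # Decoder input sequences. (B, L)
    out = decoder(labels)                   # Decoder output sequences. (B, L, block_size)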

batch_score(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]

One-step forward hypotheses.

Parameters:

hyps – Hypotheses.

Returns:

Decoder output sequences. (B, D_dec) states: Decoder hidden states. [N x Dict]

Return type:

out

create_batch_states(new_states: List[List[Dict[str, torch.Tensor]]]) → List[Dict[str, torch.Tensor]][source]

Create batch of decoder hidden states given a list of new states.

Parameters:

new_states – Decoder hidden states. [B x [N x Dict]]

Returns:

Decoder hidden states. [N x Dict]

forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters:

labels – Decoder input sequences. (B, L)

Returns:

Decoder output sequences. (B, U, D_dec)

Return type:

out

inference(labels: torch.Tensor, states: List[Dict[str, torch.Tensor]]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]

Encode source label sequences.

Parameters:
  • labels – Decoder input sequences. (B, L)

  • states – Decoder hidden states. [B x Dict]

Returns:

Decoder output sequences. (B, U, D_dec) new_states: Decoder hidden states. [B x Dict]

Return type:

out

init_state(batch_size: int = 0) → List[Dict[str, torch.Tensor]][source]

Initialize MEGADecoder states.

Parameters:

batch_size – Batch size.

Returns:

Decoder hidden states. [N x Dict]

Return type:

states

score(label_sequence: List[int], states: List[Dict[str, torch.Tensor]]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]

One-step forward hypothesis.

Parameters:
  • label_sequence – Current label sequence.

  • states – Decoder hidden states. [N x Dict]

Returns:

Decoder output sequence. (D_dec) states: Decoder hidden states. [N x Dict]

select_state(states: List[Dict[str, torch.Tensor]], idx: int) → List[Dict[str, torch.Tensor]][source]

Select ID state from batch of decoder hidden states.

Parameters:

states – Decoder hidden states. [N x Dict]

Returns:

Decoder hidden states for given ID. [N x Dict]

set_device(device: torch.device) → None[source]

Set GPU device to use.

Parameters:

device – Device ID.

stack_qk_states(state_list: List[torch.Tensor], dim: int) → List[torch.Tensor][source]

Stack query or key states with different lengths.

Parameters:
  • state_list – List of query or key states.

  • dim – Dimension along which to stack the states.

Returns:

Query/Key state.

Return type:

new_state

espnet2.asr_transducer.decoder.rnn_decoder

RNN decoder definition for Transducer models.

class espnet2.asr_transducer.decoder.rnn_decoder.RNNDecoder(vocab_size: int, embed_size: int = 256, hidden_size: int = 256, rnn_type: str = 'lstm', num_layers: int = 1, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder

RNN decoder module.

Parameters:
  • vocab_size – Vocabulary size.

  • embed_size – Embedding size.

  • hidden_size – Hidden size.

  • rnn_type – Decoder layers type.

  • num_layers – Number of decoder layers.

  • dropout_rate – Dropout rate for decoder layers.

  • embed_dropout_rate – Dropout rate for embedding layer.

  • embed_pad – Embedding padding symbol ID.

Construct an RNNDecoder object.
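
Example (a minimal sketch; LSTM is the default rnn_type and the sizes below are illustrative only):

    import torch

    from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

    decoder = RNNDecoder(vocab_size=50, embed_size=128, hidden_size=128, num_layers=1)

    labels = torch.randint(1, 50, (4, 12))  # Label ID sequences. (B, L)
    out = decoder(labels)                   # Decoder output sequences. (B, L, D_dec) == (4, 12, 128)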

batch_score(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

One-step forward hypotheses.

Parameters:

hyps – Hypotheses.

Returns:

Decoder output sequences. (B, D_dec) states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

Return type:

out

create_batch_states(new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]]) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Create decoder hidden states.

Parameters:

new_states – Decoder hidden states. [B x ((N, 1, D_dec), (N, 1, D_dec) or None)]

Returns:

Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

Return type:

states

forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters:

labels – Label ID sequences. (B, L)

Returns:

Decoder output sequences. (B, U, D_dec)

Return type:

out

init_state(batch_size: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Initialize decoder states.

Parameters:

batch_size – Batch size.

Returns:

Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

rnn_forward(x: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Encode source label sequences.

Parameters:
  • x – RNN input sequences. (B, D_emb)

  • state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

Returns:

RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

Return type:

x

score(label_sequence: List[int], states: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

One-step forward hypothesis.

Parameters:
  • label_sequence – Current label sequence.

  • states – Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)

Returns:

Decoder output sequence. (1, D_dec) states: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)

Return type:

out

select_state(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Get specified ID state from decoder hidden states.

Parameters:
  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)

  • idx – State ID to extract.

Returns:

Decoder hidden state for given ID. ((N, 1, D_dec), (N, 1, D_dec) or None)

set_device(device: torch.device) → None[source]

Set GPU device to use.

Parameters:

device – Device ID.

espnet2.asr_transducer.decoder.rwkv_decoder

RWKV decoder definition for Transducer models.

class espnet2.asr_transducer.decoder.rwkv_decoder.RWKVDecoder(vocab_size: int, block_size: int = 512, context_size: int = 1024, linear_size: Optional[int] = None, attention_size: Optional[int] = None, normalization_type: str = 'layer_norm', normalization_args: Dict = {}, num_blocks: int = 4, rescale_every: int = 0, embed_dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder

RWKV decoder module.

Based on https://arxiv.org/pdf/2305.13048.pdf.

Parameters:
  • vocab_size – Vocabulary size.

  • block_size – Input/Output size.

  • context_size – Context size for WKV computation.

  • linear_size – FeedForward hidden size.

  • attention_size – SelfAttention hidden size.

  • normalization_type – Normalization layer type.

  • normalization_args – Normalization layer arguments.

  • num_blocks – Number of RWKV blocks.

  • rescale_every – If > 0, rescale the input every N blocks (inference only).

  • embed_dropout_rate – Dropout rate for embedding layer.

  • att_dropout_rate – Dropout rate for the attention module.

  • ffn_dropout_rate – Dropout rate for the feed-forward module.

  • embed_pad – Embedding padding symbol ID.

Construct an RWKVDecoder object.
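
Example (a hedged sketch: the decoder relies on the WKV CUDA kernel, so this assumes a CUDA device and a working CUDA toolkit for just-in-time compilation; the sizes below are illustrative only):

    import torch

    from espnet2.asr_transducer.decoder.rwkv_decoder import RWKVDecoder

    if torch.cuda.is_available():
        decoder = RWKVDecoder(vocab_size=50, block_size=256, context_size=64).to("cuda")

        # The label sequence length must not exceed context_size.
        labels = torch.randint(1, 50, (2, 10), device="cuda")  # Decoder input sequences. (B, L)
        out = decoder(labels)                                   # Decoder output sequences. (B, L, block_size)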

batch_score(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, List[torch.Tensor]][source]

One-step forward hypotheses.

Parameters:

hyps – Hypotheses.

Returns:

Decoder output sequence. (B, D_dec) states: Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]

Return type:

out

create_batch_states(new_states: List[List[torch.Tensor]]) → List[torch.Tensor][source]

Create batch of decoder hidden states given a list of new states.

Parameters:

new_states – Decoder hidden states. [B x [5 x (1, 1, D_att/D_dec, N)]]

Returns:

Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]

forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters:

labels – Decoder input sequences. (B, L)

Returns:

Decoder output sequences. (B, U, D_dec)

Return type:

out

inference(labels: torch.Tensor, states: torch.Tensor) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Encode source label sequences.

Parameters:
  • labels – Decoder input sequences. (B, L)

  • states – Decoder hidden states. [5 x (B, D_att/D_dec, N)]

Returns:

Decoder output sequences. (B, U, D_dec) states: Decoder hidden states. [5 x (B, D_att/D_dec, N)]

Return type:

out

init_state(batch_size: int = 1) → List[torch.Tensor][source]

Initialize RWKVDecoder states.

Parameters:

batch_size – Batch size.

Returns:

Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]

Return type:

states

score(label_sequence: List[int], states: List[torch.Tensor]) → Tuple[torch.Tensor, List[torch.Tensor]][source]

One-step forward hypothesis.

Parameters:
  • label_sequence – Current label sequence.

  • states – Decoder hidden states. [5 x (1, 1, D_att/D_dec, N)]

Returns:

Decoder output sequence. (D_dec) states: Decoder hidden states. [5 x (1, 1, D_att/D_dec, N)]

select_state(states: List[torch.Tensor], idx: int) → List[torch.Tensor][source]

Select ID state from batch of decoder hidden states.

Parameters:

states – Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]

Returns:

Decoder hidden states for given ID. [5 x (1, 1, D_att/D_dec, N)]

set_device(device: torch.device) → None[source]

Set GPU device to use.

Parameters:

device – Device ID.

espnet2.asr_transducer.decoder.abs_decoder

Abstract decoder definition for Transducer models.

class espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Abstract decoder module.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract batch_score(hyps: List[Any]) → Tuple[torch.Tensor, Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]][source]

One-step forward hypotheses.

Parameters:

hyps – Hypotheses.

Returns:

Decoder output sequences. states: Decoder hidden states.

Return type:

out

abstract create_batch_states(new_states: List[Union[List[Dict[str, Optional[torch.Tensor]]], List[List[torch.Tensor]], Tuple[torch.Tensor, Optional[torch.Tensor]]]]) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Create batch of decoder hidden states given a list of new states.

Parameters:

new_states – Decoder hidden states.

Returns:

Decoder hidden states.

abstract forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters:

labels – Label ID sequences.

Returns:

Decoder output sequences.

abstract init_state(batch_size: int) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Initialize decoder states.

Parameters:

batch_size – Batch size.

Returns:

Decoder hidden states.

abstract score(label_sequence: List[int], states: Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]) → Tuple[torch.Tensor, Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]][source]

One-step forward hypothesis.

Parameters:
  • label_sequence – Current label sequence.

  • states – Decoder hidden states.

Returns:

Decoder output sequence. state: Decoder hidden states.

Return type:

out

abstract select_state(states: Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]], idx: int = 0) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Get specified ID state from batch of states, if provided.

Parameters:
  • states – Decoder hidden states.

  • idx – State ID to extract.

Returns:

Decoder hidden state for given ID.

abstract set_device(device: torch.device) → None[source]

Set GPU device to use.

Parameters:

device – Device ID.
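
The abstract methods above define the contract any Transducer decoder must satisfy. Below is a minimal sketch of a custom stateless implementation (states are always None); treating hyp.yseq as the label sequence of a beam-search Hypothesis is an assumption made for illustration:

    from typing import Any, List, Tuple

    import torch

    from espnet2.asr_transducer.decoder.abs_decoder import AbsDecoder


    class EmbeddingDecoder(AbsDecoder):
        """Toy decoder: a single embedding lookup, no hidden state."""

        def __init__(self, vocab_size: int, embed_size: int = 256, embed_pad: int = 0) -> None:
            super().__init__()

            self.embed = torch.nn.Embedding(vocab_size, embed_size, padding_idx=embed_pad)
            self.device = next(self.parameters()).device

        def forward(self, labels: torch.Tensor) -> torch.Tensor:
            # Label ID sequences (B, L) -> decoder output sequences (B, L, D_emb).
            return self.embed(labels)

        def score(self, label_sequence: List[int], states: Any = None) -> Tuple[torch.Tensor, None]:
            # Embed only the last label of the hypothesis. (1, D_emb)
            label = torch.tensor([label_sequence[-1]], device=self.device)
            return self.embed(label), None

        def batch_score(self, hyps: List[Any]) -> Tuple[torch.Tensor, None]:
            # Assumption: each Hypothesis exposes its label sequence as `yseq`.
            labels = torch.tensor([h.yseq[-1] for h in hyps], device=self.device)
            return self.embed(labels), None

        def init_state(self, batch_size: int) -> None:
            return None

        def select_state(self, states: Any, idx: int = 0) -> None:
            return None

        def create_batch_states(self, new_states: List[Any]) -> None:
            return None

        def set_device(self, device: torch.device) -> None:
            self.device = device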

espnet2.asr_transducer.decoder.__init__

espnet2.asr_transducer.decoder.blocks.mega

Moving Average Equipped Gated Attention (MEGA) block definition.

Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/moving_average_gated_attention.py

Most variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/mega/modeling_mega.py.

class espnet2.asr_transducer.decoder.blocks.mega.MEGA(size: int = 512, num_heads: int = 4, qk_size: int = 128, v_size: int = 1024, activation: torch.nn.modules.module.Module = ReLU(), normalization: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, rel_pos_bias_type: str = 'simple', max_positions: int = 2048, truncation_length: Optional[int] = None, chunk_size: int = -1, dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ema_dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

MEGA module.

Parameters:
  • size – Input/Output size.

  • num_heads – Number of EMA heads.

  • qk_size – Shared query and key size for attention module.

  • v_size – Value size for attention module.

  • activation – Activation function type.

  • normalization – Normalization module.

  • rel_pos_bias_type – Type of relative position bias in attention module.

  • max_positions – Maximum number of positions for RelativePositionBias.

  • truncation_length – Maximum length for truncation in EMA module.

  • chunk_size – Chunk size for attention computation (-1 = full context).

  • dropout_rate – Dropout rate for inner modules.

  • att_dropout_rate – Dropout rate for the attention module.

  • ema_dropout_rate – Dropout rate for the EMA module.

Construct a MEGA object.

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None, attn_mask: Optional[torch.Tensor] = None, state: Optional[Dict[str, Optional[torch.Tensor]]] = None) → Tuple[torch.Tensor, Optional[Dict[str, Optional[torch.Tensor]]]][source]

Compute moving average equipped gated attention.

Parameters:
  • x – MEGA input sequences. (L, B, size)

  • mask – MEGA input sequence masks. (B, 1, L)

  • attn_mask – MEGA attention mask. (1, L, L)

  • state – Decoder hidden states.

Returns:

MEGA output sequences. (B, L, size) state: Decoder hidden states.

Return type:

x

reset_parameters(val: float = 0.0, std: float = 0.02) → None[source]

Reset module parameters.

Parameters:
  • val – Initialization value.

  • std – Standard deviation.

softmax_attention(query: torch.Tensor, key: torch.Tensor, mask: Optional[torch.Tensor] = None, attn_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Compute attention weights with softmax.

Parameters:
  • query – Query tensor. (B, 1, L, D)

  • key – Key tensor. (B, 1, L, D)

  • mask – Sequence mask. (B, 1, L)

  • attn_mask – Attention mask. (1, L, L)

Returns:

Attention weights. (B, 1, L, L)

Return type:

attn_weights

espnet2.asr_transducer.decoder.blocks.__init__

espnet2.asr_transducer.decoder.blocks.rwkv

Receptance Weighted Key Value (RWKV) block definition.

Based/modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py

class espnet2.asr_transducer.decoder.blocks.rwkv.RWKV(size: int, linear_size: int, attention_size: int, context_size: int, block_id: int, num_blocks: int, normalization_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, normalization_args: Dict = {}, att_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

RWKV module.

Parameters:
  • size – Input/Output size.

  • linear_size – Feed-forward hidden size.

  • attention_size – SelfAttention hidden size.

  • context_size – Context size for WKV computation.

  • block_id – Block index.

  • num_blocks – Number of blocks in the architecture.

  • normalization_class – Normalization layer class.

  • normalization_args – Normalization layer arguments.

  • att_dropout_rate – Dropout rate for the attention module.

  • ffn_dropout_rate – Dropout rate for the feed-forward module.

Construct an RWKV object.

forward(x: torch.Tensor, state: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Compute receptance weighted key value.

Parameters:
  • x – RWKV input sequences. (B, L, size)

  • state – Decoder hidden states. [5 x (B, D_att/size, N)]

Returns:

RWKV output sequences. (B, L, size) x: Decoder hidden states. [5 x (B, D_att/size, N)]

Return type:

x

espnet2.asr_transducer.decoder.modules.__init__

espnet2.asr_transducer.decoder.modules.rwkv.feed_forward

Feed-forward (channel mixing) module for RWKV block.

Based/Modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py

Some variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/rwkv/modeling_rwkv.py.

class espnet2.asr_transducer.decoder.modules.rwkv.feed_forward.FeedForward(size: int, hidden_size: int, block_id: int, num_blocks: int)[source]

Bases: torch.nn.modules.module.Module

FeedForward module definition.

Parameters:
  • size – Input/Output size.

  • hidden_size – Hidden size.

  • block_id – Block index.

  • num_blocks – Number of blocks in the architecture.

Construct a FeedForward object.
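
Example (a minimal sketch of the channel-mixing module in isolation; in practice it is built by the RWKV block, and the sizes below are illustrative only):

    import torch

    from espnet2.asr_transducer.decoder.modules.rwkv.feed_forward import FeedForward

    ffn = FeedForward(size=64, hidden_size=256, block_id=0, num_blocks=4)

    x = torch.randn(2, 10, 64)  # FeedForward input sequences. (B, U, size)
    out, state = ffn(x)         # out: (B, U, size); state stays None when no decoding state is passed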

forward(x: torch.Tensor, state: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, Optional[List[torch.Tensor]]][source]

Compute channel mixing.

Parameters:
  • x – FeedForward input sequences. (B, U, size)

  • state – Decoder hidden state. [5 x (B, 1, size, N)]

Returns:

FeedForward output sequences. (B, U, size) state: Decoder hidden state. [5 x (B, 1, size, N)]

Return type:

x

reset_parameters(size: int, block_id: int, num_blocks: int) → None[source]

Reset module parameters.

Parameters:
  • size – Block size.

  • block_id – Block index.

  • num_blocks – Number of blocks in the architecture.

espnet2.asr_transducer.decoder.modules.rwkv.attention

Attention (time mixing) modules for RWKV block.

Based/Modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py.

Some variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/rwkv/modeling_rwkv.py.

class espnet2.asr_transducer.decoder.modules.rwkv.attention.SelfAttention(size: int, attention_size: int, context_size: int, block_id: int, num_blocks: int)[source]

Bases: torch.nn.modules.module.Module

SelfAttention module definition.

Parameters:
  • size – Input/Output size.

  • attention_size – Attention hidden size.

  • context_size – Context size for WKV kernel.

  • block_id – Block index.

  • num_blocks – Number of blocks in the architecture.

Construct a SelfAttention object.

forward(x: torch.Tensor, state: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, Optional[List[torch.Tensor]]][source]

Compute time mixing.

Parameters:
  • x – SelfAttention input sequences. (B, U, size)

  • state – Decoder hidden states. [5 x (B, 1, D_att, N)]

Returns:

SelfAttention output sequences. (B, U, size)

Return type:

x

reset_parameters(size: int, attention_size: int, block_id: int, num_blocks: int) → None[source]

Reset module parameters.

Parameters:
  • size – Block size.

  • attention_size – Attention hidden size.

  • block_id – Block index.

  • num_blocks – Number of blocks in the architecture.

wkv_linear_attention(time_decay: torch.Tensor, time_first: torch.Tensor, key: torch.Tensor, value: torch.Tensor, state: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]][source]

Compute WKV with state (i.e., for inference).

Parameters:
  • time_decay – Channel-wise time decay vector. (D_att)

  • time_first – Channel-wise time first vector. (D_att)

  • key – Key tensor. (B, 1, D_att)

  • value – Value tensor. (B, 1, D_att)

  • state – Decoder hidden states. [3 x (B, D_att)]

Returns:

Weighted Key-Value. (B, 1, D_att) state: Decoder hidden states. [3 x (B, 1, D_att)]

Return type:

output

class espnet2.asr_transducer.decoder.modules.rwkv.attention.WKVLinearAttention(*args, **kwargs)[source]

Bases: torch.autograd.function.Function

WKVLinearAttention function definition.

static backward(ctx, grad_output: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

WKVLinearAttention function backward pass.

Parameters:

grad_output – Output gradient. (B, U, D_att)

Returns:

Gradient for channel-wise time decay vector. (D_att) grad_time_first: Gradient for channel-wise time first vector. (D_att) grad_key: Gradient for key tensor. (B, U, D_att) grad_value: Gradient for value tensor. (B, U, D_att)

Return type:

grad_time_decay

static forward(ctx, time_decay: torch.Tensor, time_first: torch.Tensor, key: torch.Tensor, value: torch.Tensor) → torch.Tensor[source]

WKVLinearAttention function forward pass.

Parameters:
  • time_decay – Channel-wise time decay vector. (D_att)

  • time_first – Channel-wise time first vector. (D_att)

  • key – Key tensor. (B, U, D_att)

  • value – Value tensor. (B, U, D_att)

Returns:

Weighted Key-Value tensor. (B, U, D_att)

Return type:

out

espnet2.asr_transducer.decoder.modules.rwkv.attention.load_wkv_kernel(context_size: int) → None[source]

Load WKV CUDA kernel.

Parameters:

context_size – Context size.
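
A hedged sketch of loading the kernel explicitly (it is assumed that the surrounding decoder normally triggers this step itself; a CUDA device and toolkit are required for just-in-time compilation):

    import torch

    from espnet2.asr_transducer.decoder.modules.rwkv.attention import load_wkv_kernel

    if torch.cuda.is_available():
        # Compile and load the WKV CUDA kernel for sequences up to 256 labels.
        load_wkv_kernel(context_size=256)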

espnet2.asr_transducer.decoder.modules.rwkv.__init__

espnet2.asr_transducer.decoder.modules.mega.feed_forward

Normalized position-wise feed-forward module for MEGA block.

class espnet2.asr_transducer.decoder.modules.mega.feed_forward.NormalizedPositionwiseFeedForward(size: int, hidden_size: int, normalization: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, activation: torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.ReLU'>, dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

NormalizedPositionwiseFeedForward module definition.

Parameters:
  • size – Input/Output size.

  • hidden_size – Hidden size.

  • normalization – Normalization module.

  • activation – Activation function.

  • dropout_rate – Dropout rate.

Construct a NormalizedPositionwiseFeedForward object.

forward(x: torch.Tensor) → torch.Tensor[source]

Compute feed-forward module.

Parameters:

x – NormalizedPositionwiseFeedForward input sequences. (B, L, size)

Returns:

NormalizedPositionwiseFeedForward output sequences. (B, L, size)

Return type:

x

reset_parameters(val: float = 0.0, std: float = 0.02) → None[source]

Reset module parameters.

Parameters:
  • val – Initialization value.

  • std – Standard deviation.

espnet2.asr_transducer.decoder.modules.mega.positional_bias

Positional bias related modules.

Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/relative_positional_bias.py

class espnet2.asr_transducer.decoder.modules.mega.positional_bias.RelativePositionBias(max_positions: int)[source]

Bases: torch.nn.modules.module.Module

RelativePositionBias module definition.

Parameters:

max_positions – Maximum number of relative positions.

Construct a RelativePositionBias object.
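
Example (a minimal sketch; the bias depends only on the requested sequence length, which must not exceed max_positions):

    from espnet2.asr_transducer.decoder.modules.mega.positional_bias import RelativePositionBias

    pos_bias = RelativePositionBias(max_positions=128)

    bias = pos_bias(16)  # Relative position bias. (L, L) == (16, 16)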

forward(length: int) → torch.Tensor[source]

Compute relative position bias.

Parameters:

length – Sequence length.

Returns:

Relative position bias. (L, L)

Return type:

tile

reset_parameters(val: float = 0.0, std: float = 0.02) → None[source]

Reset module parameters.

Parameters:
  • val – Initialization value.

  • std – Standard deviation.

class espnet2.asr_transducer.decoder.modules.mega.positional_bias.RotaryRelativePositionBias(size: int, max_positions: int = 2048)[source]

Bases: torch.nn.modules.module.Module

RotaryRelativePositionBias module definition.

Parameters:
  • size – Module embedding size.

  • max_positions – Maximum number of relative positions.

Construct a RotaryRelativePositionBias object.

forward(length: int) → torch.Tensor[source]

Compute rotary relative position bias.

Parameters:

length – Sequence length.

Returns:

Rotary relative position bias. (L, L)

Return type:

bias

static get_sinusoid_embeddings(max_positions: int, size: int) → Tuple[torch.Tensor, torch.Tensor][source]

Compute sinusoidal positional embeddings.

Parameters:
  • max_positions – Maximum number of positions.

  • size – Input size.

Returns:

Sine elements. (max_positions, size // 2) and cosine elements. (max_positions, size // 2)

reset_parameters(val: float = 0.0, std: float = 0.02) → None[source]

Reset module parameters.

Parameters:
  • val – Initialization value.

  • std – Standard deviation.

rotary(x: torch.Tensor) → torch.Tensor[source]

Compute rotary positional embeddings.

Parameters:

x – Input sequence. (L, size)

Returns:

Rotary positional embeddings. (L, size)

Return type:

x

espnet2.asr_transducer.decoder.modules.mega.__init__

espnet2.asr_transducer.decoder.modules.mega.multi_head_damped_ema

Multi-head Damped Exponential Moving Average (EMA) module for MEGA block.

Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/moving_average_gated_attention.py

Most variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/mega/modeling_mega.py.

class espnet2.asr_transducer.decoder.modules.mega.multi_head_damped_ema.MultiHeadDampedEMA(size: int, num_heads: int = 4, activation: torch.nn.modules.module.Module = ReLU(), truncation_length: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

MultiHeadDampedEMA module definition.

Parameters:
  • size – Module size.

  • num_heads – Number of attention heads.

  • activation – Activation function type.

  • truncation_length – Maximum length for truncation.

Construct a MultiHeadDampedEMA object.

compute_ema_coefficients() → Tuple[torch.Tensor, torch.Tensor][source]

Compute EMA coefficients.

Parameters:

None

Returns:

Damping factor / P-th order coefficient. (size, num_heads, 1) prev_timestep_weight: Previous timestep weight / Q-th order coefficient. (size, num_heads, 1)

Return type:

damping_factor

compute_ema_kernel(length: int) → torch.Tensor[source]

Compute EMA kernel / vandermonde product.

Parameters:

length – Sequence length.

Returns:

EMA kernel / Vandermonde product. (size, L)

ema_one_step(x: torch.Tensor, state: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Perform exponential moving average for a single step.

Parameters:
  • x – MultiHeadDampedEMA input sequences. (B, D, 1)

  • state – MultiHeadDampedEMA state. (B, D, num_heads)

Returns:

MultiHeadDampedEMA output sequences. (B, 1, D) new_state: MultiHeadDampedEMA state. (B, D, num_heads)

Return type:

out

forward(x: torch.Tensor, mask: Optional[torch.Tensor] = None, state: Optional[Dict[str, torch.Tensor]] = None) → Optional[torch.Tensor][source]

Compute multi-dimensional damped EMA.

Parameters:
  • x – MultiHeadDampedEMA input sequence. (L, B, D)

  • mask – Sequence mask. (B, 1, L)

  • state – MultiHeadDampedEMA state. (B, D, num_heads)

Returns:

MultiHeadDampedEMA output sequence. (B, L, D) new_state: MultiHeadDampedEMA state. (B, D, num_heads)

Return type:

x

get_ema_coefficients() → Tuple[torch.Tensor, torch.Tensor][source]

Get EMA coefficients.

Parameters:

None

Returns:

Damping factor / P-th order coefficient. (size, num_heads, 1) and previous timestep weight / Q-th order coefficient. (size, num_heads, 1)

reset_parameters(val: float = 0.0, std1: float = 0.2, std2: float = 1.0) → None[source]

Reset module parameters.

Parameters:
  • val – Initialization value.

  • std1 – Main standard deviation.

  • std2 – Secondary standard deviation.

espnet2.asr_transducer.frontend.online_audio_processor

Online processor for Transducer models chunk-by-chunk streaming decoding.

class espnet2.asr_transducer.frontend.online_audio_processor.OnlineAudioProcessor(feature_extractor: torch.nn.modules.module.Module, normalization_module: torch.nn.modules.module.Module, decoding_window: int, encoder_sub_factor: int, frontend_conf: Dict, device: torch.device, audio_sampling_rate: int = 16000)[source]

Bases: object

OnlineAudioProcessor module definition.

Parameters:
  • feature_extractor – Feature extractor module.

  • normalization_module – Normalization module.

  • decoding_window – Size of the decoding window (in ms).

  • encoder_sub_factor – Encoder subsampling factor.

  • frontend_conf – Frontend configuration.

  • device – Device to pin module tensors on.

  • audio_sampling_rate – Input sampling rate.

Construct an OnlineAudioProcessor.

compute_features(samples: torch.Tensor, is_final: bool) → None[source]

Compute features from input samples.

Parameters:
  • samples – Speech data. (S)

  • is_final – Whether speech corresponds to the final chunk of data.

Returns:

Features sequence. (1, chunk_sz_bs, D_feats) feats_length: Features length sequence. (1,)

Return type:

feats

get_current_feats(feats: torch.Tensor, feats_length: torch.Tensor, is_final: bool) → Tuple[torch.Tensor, torch.Tensor][source]

Get features for current decoding window.

Parameters:
  • feats – Computed features sequence. (1, F, D_feats)

  • feats_length – Computed features sequence length. (1,)

  • is_final – Whether feats corresponds to the final chunk of data.

Returns:

Decoding window features sequence. (1, chunk_sz_bs, D_feats) feats_length: Decoding window features length sequence. (1,)

Return type:

feats

get_current_samples(samples: torch.Tensor, is_final: bool) → torch.Tensor[source]

Get samples for feature computation.

Parameters:
  • samples – Speech data. (S)

  • is_final – Whether speech corresponds to the final chunk of data.

Returns:

New speech data. (1, decoding_samples)

Return type:

samples

reset_cache() → None[source]

Reset cache parameters.

Parameters:

None

Returns:

None

espnet2.asr_transducer.frontend.__init__