espnet2.asr_transducer package¶
espnet2.asr_transducer.activation¶
Activation functions for Transducer models.
-
class
espnet2.asr_transducer.activation.
FTSwish
(threshold: float = -0.2, mean_shift: float = 0)[source]¶ Bases:
torch.nn.modules.module.Module
Flatten-T Swish activation definition.
- FTSwish(x) = x * sigmoid(x) + threshold, for x >= 0
FTSwish(x) = threshold, for x < 0
Reference: https://arxiv.org/abs/1812.06247
- Parameters:
threshold – Threshold value for FTSwish activation formulation. (threshold < 0)
mean_shift – Mean shifting value for FTSwish activation formulation. (applied only if != 0, disabled by default)
-
class
espnet2.asr_transducer.activation.
Mish
(softplus_beta: float = 1.0, softplus_threshold: int = 20, use_builtin: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Mish activation definition.
Mish(x) = x * tanh(softplus(x))
Reference: https://arxiv.org/abs/1908.08681.
- Parameters:
softplus_beta – Beta value for softplus activation formulation. (Usually 0 < softplus_beta <= 2)
softplus_threshold – Values above this revert to a linear function. (Usually 10 <= softplus_threshold <= 20)
use_builtin – Whether to use PyTorch activation function if available.
-
class
espnet2.asr_transducer.activation.
Smish
(alpha: float = 1.0, beta: float = 1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Smish activation definition.
- Smish(x) = (alpha * x) * tanh(log(1 + sigmoid(beta * x)))
where alpha > 0 and beta > 0
Reference: https://www.mdpi.com/2079-9292/11/4/540/htm.
- Parameters:
alpha – Alpha value for Smish activation formulation. (Usually alpha = 1. If alpha <= 0, the value is reset to 1.)
beta – Beta value for Smish activation formulation. (Usually beta = 1. If beta <= 0, the value is reset to 1.)
-
class
espnet2.asr_transducer.activation.
Swish
(beta: float = 1.0, use_builtin: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Swish activation definition.
- Swish(x) = (beta * x) * sigmoid(x)
where beta = 1 defines standard Swish activation.
References
https://arxiv.org/abs/2108.12943 / https://arxiv.org/abs/1710.05941v1. E-swish variant: https://arxiv.org/abs/1801.07145.
- Parameters:
beta – Beta parameter for E-Swish. (beta >= 1. If beta < 1, use standard Swish).
use_builtin – Whether to use PyTorch function if available.
-
espnet2.asr_transducer.activation.
get_activation
(activation_type: str, ftswish_threshold: float = -0.2, ftswish_mean_shift: float = 0.0, hardtanh_min_val: int = -1.0, hardtanh_max_val: int = 1.0, leakyrelu_neg_slope: float = 0.01, smish_alpha: float = 1.0, smish_beta: float = 1.0, softplus_beta: float = 1.0, softplus_threshold: int = 20, swish_beta: float = 1.0) → torch.nn.modules.module.Module[source]¶ Return activation function.
- Parameters:
activation_type – Activation function type.
ftswish_threshold – Threshold value for FTSwish activation formulation.
ftswish_mean_shift – Mean shifting value for FTSwish activation formulation.
hardtanh_min_val – Minimum value of the linear region range for HardTanh.
hardtanh_max_val – Maximum value of the linear region range for HardTanh.
leakyrelu_neg_slope – Negative slope value for LeakyReLU activation formulation.
smish_alpha – Alpha value for Smish activation formulation.
smish_beta – Beta value for Smish activation formulation.
softplus_beta – Beta value for softplus activation formulation in Mish.
softplus_threshold – Values above this revert to a linear function in Mish.
swish_beta – Beta value for Swish variant formulation.
- Returns:
Activation function.
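As a quick reference, the sketch below re-implements the documented formulas in plain PyTorch. It is illustrative only and not the ESPnet classes themselves; threshold, alpha and beta follow the defaults listed above.

import torch


def ftswish(x: torch.Tensor, threshold: float = -0.2) -> torch.Tensor:
    # x * sigmoid(x) + threshold for x >= 0, threshold otherwise
    return torch.where(x >= 0, x * torch.sigmoid(x) + threshold, torch.full_like(x, threshold))


def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # (beta * x) * sigmoid(x); beta = 1 recovers the standard Swish
    return beta * x * torch.sigmoid(x)


def smish(x: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # (alpha * x) * tanh(log(1 + sigmoid(beta * x)))
    return alpha * x * torch.tanh(torch.log1p(torch.sigmoid(beta * x)))


x = torch.linspace(-3.0, 3.0, steps=7)
print(ftswish(x), swish(x), smish(x))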
espnet2.asr_transducer.error_calculator¶
Error Calculator module for Transducer.
-
class
espnet2.asr_transducer.error_calculator.
ErrorCalculator
(decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, token_list: List[int], sym_space: str, sym_blank: str, nstep: int = 2, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
Calculate CER and WER for transducer models.
- Parameters:
decoder – Decoder module.
joint_network – Joint Network module.
token_list – List of token units.
sym_space – Space symbol.
sym_blank – Blank symbol.
nstep – Maximum number of symbol expansions at each time step w/ mAES.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculator object.
-
calculate_cer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level CER score.
-
calculate_wer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters:
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns:
Average sentence-level WER score.
-
convert_to_char
(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character sequences.
- Parameters:
pred – Prediction label ID sequences. (B, U)
target – Target label ID sequences. (B, L)
- Returns:
Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)
- Return type:
char_pred
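For intuition, the sketch below computes an average sentence-level CER from character sequences with a plain Levenshtein distance. It only illustrates the metric; the ESPnet class additionally handles decoding, token conversion and the space/blank symbols.

def edit_distance(ref: str, hyp: str) -> int:
    # Space-optimized Levenshtein distance between two character sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]


def sentence_cer(char_pred, char_target) -> float:
    # Total edit distance normalized by the total target length.
    dists = [edit_distance(t, p) for p, t in zip(char_pred, char_target)]
    return float(sum(dists)) / max(sum(len(t) for t in char_target), 1)


print(sentence_cer(["hallo ther"], ["hello there"]))  # 2 edits / 11 chars ~ 0.18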
espnet2.asr_transducer.espnet_transducer_model¶
ESPnet2 ASR Transducer model.
-
class
espnet2.asr_transducer.espnet_transducer_model.
ESPnetASRTransducerModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: espnet2.asr_transducer.encoder.encoder.Encoder, decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, transducer_weight: float = 1.0, use_k2_pruned_loss: bool = False, k2_pruned_loss_args: Dict = {}, warmup_steps: int = 25000, validation_nstep: int = 2, fastemit_lambda: float = 0.0, auxiliary_ctc_weight: float = 0.0, auxiliary_ctc_dropout_rate: float = 0.0, auxiliary_lm_loss_weight: float = 0.0, auxiliary_lm_loss_smoothing: float = 0.05, ignore_id: int = -1, sym_space: str = '<space>', sym_blank: str = '<blank>', report_cer: bool = False, report_wer: bool = False, extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
ESPnetASRTransducerModel module definition.
- Parameters:
vocab_size – Size of complete vocabulary (w/ SOS/EOS and blank included).
token_list – List of tokens in vocabulary (minus reserved tokens).
frontend – Frontend module.
specaug – SpecAugment module.
normalize – Normalization module.
encoder – Encoder module.
decoder – Decoder module.
joint_network – Joint Network module.
transducer_weight – Weight of the Transducer loss.
use_k2_pruned_loss – Whether to use k2 pruned Transducer loss.
k2_pruned_loss_args – Arguments of the k2 pruned Transducer loss.
warmup_steps – Number of steps in warmup, used for pruned loss scaling.
validation_nstep – Maximum number of symbol expansions at each time step when reporting CER or/and WER using mAES.
fastemit_lambda – FastEmit lambda value.
auxiliary_ctc_weight – Weight of auxiliary CTC loss.
auxiliary_ctc_dropout_rate – Dropout rate for auxiliary CTC loss inputs.
auxiliary_lm_loss_weight – Weight of auxiliary LM loss.
auxiliary_lm_loss_smoothing – Smoothing rate for LM loss’ label smoothing.
ignore_id – Initial padding ID.
sym_space – Space symbol.
sym_blank – Blank Symbol.
report_cer – Whether to report Character Error Rate during validation.
report_wer – Whether to report Word Error Rate during validation.
extract_feats_in_collect_stats – Whether to use extract_feats stats collection.
Construct an ESPnetASRTransducerModel object.
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶ Collect feature sequences and their lengths.
- Parameters:
speech – Speech sequences. (B, S)
speech_lengths – Speech sequences lengths. (B,)
text – Label ID sequences. (B, L)
text_lengths – Label ID sequences lengths. (B,)
kwargs – Contains “utts_id”.
- Returns:
- “feats”: Features sequences. (B, T, D_feats),
”feats_lengths”: Features sequences lengths. (B,)
- Return type:
{}
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode speech sequences.
- Parameters:
speech – Speech sequences. (B, S)
speech_lengths – Speech sequences lengths. (B,)
- Returns:
Encoder outputs. (B, T, D_enc) encoder_out_lens: Encoder outputs lengths. (B,)
- Return type:
encoder_out
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Forward architecture and compute loss(es).
- Parameters:
speech – Speech sequences. (B, S)
speech_lengths – Speech sequences lengths. (B,)
text – Label ID sequences. (B, L)
text_lengths – Label ID sequences lengths. (B,)
kwargs – Contains “utts_id”.
- Returns:
Main loss value. stats: Task statistics. weight: Task weights.
- Return type:
loss
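As a rough illustration of how the documented weights enter the objective (an assumption based only on the parameter names above; FastEmit scaling, label smoothing and the k2 pruned-loss warmup are handled inside the model), the returned loss combines the Transducer loss with the optional auxiliary losses:

# Hedged sketch with dummy scalar losses; the real values come from the
# Transducer, auxiliary CTC and auxiliary LM branches of the model.
transducer_weight = 1.0
auxiliary_ctc_weight = 0.3
auxiliary_lm_loss_weight = 0.2

loss_trans, loss_ctc, loss_lm = 10.0, 25.0, 4.0

loss = (
    transducer_weight * loss_trans
    + auxiliary_ctc_weight * loss_ctc
    + auxiliary_lm_loss_weight * loss_lm
)
print(loss)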
espnet2.asr_transducer.normalization¶
Normalization modules for Transducer.
-
class
espnet2.asr_transducer.normalization.
BasicNorm
(normalized_shape: int, eps: float = 0.25)[source]¶ Bases:
torch.nn.modules.module.Module
BasicNorm module definition.
Reference: https://github.com/k2-fsa/icefall/pull/288
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
Construct a BasicNorm object.
-
class
espnet2.asr_transducer.normalization.
RMSNorm
(normalized_shape: int, eps: float = 1e-05, partial: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
RMSNorm module definition.
Reference: https://arxiv.org/pdf/1910.07467.pdf
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
partial – Value defining the part of the input used for RMS stats.
Construct a RMSNorm object.
-
class
espnet2.asr_transducer.normalization.
ScaleNorm
(normalized_shape: int, eps: float = 1e-05)[source]¶ Bases:
torch.nn.modules.module.Module
ScaleNorm module definition.
Reference: https://arxiv.org/pdf/1910.05895.pdf
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
Construct a ScaleNorm object.
-
espnet2.asr_transducer.normalization.
get_normalization
(normalization_type: str, eps: Optional[float] = None, partial: Optional[float] = None) → Tuple[torch.nn.modules.module.Module, Dict][source]¶ Get normalization module and arguments given parameters.
- Parameters:
normalization_type – Normalization module type.
eps – Value added to the denominator.
partial – Value defining the part of the input used for RMS stats (RMSNorm).
- Returns:
Normalization module class : Normalization module arguments
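For reference, a minimal sketch of the RMSNorm formula from the paper linked above (illustrative, not the ESPnet module): the input is rescaled by its root mean square over the last dimension, with eps added for numerical stability, then multiplied by a learnable gain.

import torch


class SketchRMSNorm(torch.nn.Module):
    def __init__(self, normalized_shape: int, eps: float = 1e-05) -> None:
        super().__init__()
        self.eps = eps
        self.scale = torch.nn.Parameter(torch.ones(normalized_shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the last dimension, eps added for numerical stability.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.scale * x / rms


y = SketchRMSNorm(8)(torch.randn(2, 4, 8))
print(y.shape)  # torch.Size([2, 4, 8])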
espnet2.asr_transducer.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet2.asr_transducer.beam_search_transducer.
BeamSearchTransducer
(decoder: espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: Optional[torch.nn.modules.module.Module] = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 3, u_max: int = 50, nstep: int = 2, expansion_gamma: float = 2.3, expansion_beta: int = 2, score_norm: bool = False, nbest: int = 1, streaming: bool = False)[source]¶ Bases:
object
Beam search implementation for Transducer.
- Parameters:
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Size of the beam.
lm – LM module.
lm_weight – LM weight for shallow fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum expected target sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
score_norm – Normalize final scores by length.
nbest – Number of final hypotheses.
streaming – Whether to perform chunk-by-chunk beam search.
Construct a BeamSearchTransducer object.
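A hypothetical usage sketch is shown below. It assumes an ESPnet2 installation and a trained Transducer model whose decoder and joint_network attributes match the types above; the "maes" search_type string and the callable interface on per-utterance encoder output are assumptions, not verified API details.

from espnet2.asr_transducer.beam_search_transducer import BeamSearchTransducer


def build_beam_search(model) -> BeamSearchTransducer:
    # `model` is assumed to be an already-built ESPnetASRTransducerModel.
    return BeamSearchTransducer(
        decoder=model.decoder,
        joint_network=model.joint_network,
        beam_size=5,
        search_type="maes",  # assumed identifier for the mAES algorithm
        nbest=1,
    )


# enc_out: encoder output for a single utterance, shape (T, D_enc)
# nbest_hyps = build_beam_search(model)(enc_out)  # assumed call interface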
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequences. (T, D)
- Returns:
N-best hypotheses.
- Return type:
nbest_hyps
-
create_lm_batch_inputs
(hyps_seq: List[List[int]]) → torch.Tensor[source]¶ Make batch of inputs with left padding for LM scoring.
- Parameters:
hyps_seq – Hypothesis sequences.
- Returns:
Padded batch of sequences.
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Beam search implementation without prefix search.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypotheses.
- Return type:
nbest_hyps
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Modified version of Adaptive Expansion Search (mAES).
- Based on AES (https://ieeexplore.ieee.org/document/9250505).
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypotheses.
- Return type:
nbest_hyps
-
recombine_hyps
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Recombine hypotheses with the same label ID sequence.
- Parameters:
hyps – Hypotheses.
- Returns:
Recombined hypotheses.
- Return type:
final
-
select_k_expansions
(hyps: List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis], topk_idx: torch.Tensor, topk_logp: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Return K candidate hypotheses for expansion from a list of hypotheses.
K candidates are selected according to the extended hypotheses' probabilities and a prune-by-value method, where K equals beam_size + beta.
- Parameters:
hyps – Hypotheses.
topk_idx – Indices of candidate hypotheses.
topk_logp – Log-probabilities of candidate hypotheses.
- Returns:
Best K expansion hypotheses candidates.
- Return type:
k_expansions
-
sort_nbest
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Sort hypotheses in place by score, or by score normalized by sequence length.
- Parameters:
hyps – Hypotheses.
- Returns:
Sorted hypotheses.
- Return type:
hyps
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypotheses.
- Return type:
nbest_hyps
-
class
espnet2.asr_transducer.beam_search_transducer.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Optional[Tuple[torch.Tensor, Optional[torch.Tensor]]] = None, lm_state: Union[Dict[str, Any], List[Any], None] = None, dec_out: torch.Tensor = None, lm_score: torch.Tensor = None)[source]¶ Bases:
espnet2.asr_transducer.beam_search_transducer.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
- Parameters:
Hypothesis dataclass arguments.
dec_out – Decoder output sequence. (B, D_dec)
lm_score – Log-probabilities of the LM for the given label. (vocab_size)
-
dec_out
= None¶
-
lm_score
= None¶
-
-
class
espnet2.asr_transducer.beam_search_transducer.
Hypothesis
(score: float, yseq: List[int], dec_state: Optional[Tuple[torch.Tensor, Optional[torch.Tensor]]] = None, lm_state: Union[Dict[str, Any], List[Any], None] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
- Parameters:
score – Total log-probability.
yseq – Label sequence as integer ID sequence.
dec_state – RNN/MEGA Decoder state (None if Stateless).
lm_state – RNNLM state. ((N, D_lm), (N, D_lm)) or None
-
dec_state
= None¶
-
lm_state
= None¶
espnet2.asr_transducer.utils¶
Utility functions for Transducer models.
-
exception
espnet2.asr_transducer.utils.
TooShortUttError
(message: str, actual_size: int, limit: int)[source]¶ Bases:
Exception
Raised when the utterance is too short for subsampling.
- Parameters:
message – Error message to display.
actual_size – The size that cannot pass the subsampling.
limit – The size limit for subsampling.
Construct a TooShortUttError exception.
-
espnet2.asr_transducer.utils.
check_short_utt
(sub_factor: int, size: int) → Tuple[bool, int][source]¶ Check if the input is too short for subsampling.
- Parameters:
sub_factor – Subsampling factor for Conv2DSubsampling.
size – Input size.
- Returns:
Whether an error should be sent. : Size limit for specified subsampling factor.
-
espnet2.asr_transducer.utils.
get_convinput_module_parameters
(input_size: int, last_conv_size, subsampling_factor: int, is_vgg: bool = True) → Tuple[Union[Tuple[int, int], int], int][source]¶ Return the convolution module parameters.
- Parameters:
input_size – Module input size.
last_conv_size – Last convolution size for module output size computation.
subsampling_factor – Total subsampling factor.
is_vgg – Whether the module type is VGG-like.
- Returns:
First MaxPool2D kernel size or second Conv2d kernel size and stride. output_size: Convolution module output size.
-
espnet2.asr_transducer.utils.
get_transducer_task_io
(labels: torch.Tensor, encoder_out_lens: torch.Tensor, ignore_id: int = -1, blank_id: int = 0) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Get Transducer loss I/O.
- Parameters:
labels – Label ID sequences. (B, L)
encoder_out_lens – Encoder output lengths. (B,)
ignore_id – Padding symbol ID.
blank_id – Blank symbol ID.
- Returns:
Decoder inputs. (B, U) target: Target label ID sequences. (B, U) t_len: Time lengths. (B,) u_len: Label lengths. (B,)
- Return type:
decoder_in
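The sketch below illustrates the usual Transducer I/O convention behind these outputs (a hand-rolled approximation, not the ESPnet helper itself): padding is replaced, the decoder input gets a leading blank, and t_len/u_len carry the per-utterance lengths.

import torch

labels = torch.tensor([[3, 5, 7, -1], [2, 4, -1, -1]])  # (B, L), padded with ignore_id
encoder_out_lens = torch.tensor([6, 4])                  # (B,)
ignore_id, blank_id = -1, 0

u_len = (labels != ignore_id).sum(dim=-1).to(torch.int32)  # label lengths (B,)
max_u = int(u_len.max())
target = labels[:, :max_u].masked_fill(labels[:, :max_u] == ignore_id, blank_id)  # (B, U)
decoder_in = torch.nn.functional.pad(target, (1, 0), value=blank_id)  # leading blank
t_len = encoder_out_lens.to(torch.int32)                   # time lengths (B,)
print(decoder_in, target, t_len, u_len)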
-
espnet2.asr_transducer.utils.
make_chunk_mask
(size: int, chunk_size: int, num_left_chunks: int = 0, device: torch.device = None) → torch.Tensor[source]¶ Create chunk mask for the subsequent steps (size, size).
Reference: https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py
- Parameters:
size – Size of the source mask.
chunk_size – Number of frames in chunk.
num_left_chunks – Number of left chunks the attention module can see. (null or negative value means full context)
device – Device for the mask tensor.
- Returns:
Chunk mask. (size, size)
- Return type:
mask
-
espnet2.asr_transducer.utils.
make_source_mask
(lengths: torch.Tensor) → torch.Tensor[source]¶ Create source mask for given lengths.
Reference: https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py
- Parameters:
lengths – Sequence lengths. (B,)
- Returns:
Mask for the sequence lengths. (B, max_len)
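A minimal sketch of a length-based source mask with the documented (B, max_len) shape, where True marks valid frames; it mirrors the idea rather than the exact helper.

import torch

lengths = torch.tensor([5, 3, 4])  # (B,)
max_len = int(lengths.max())
mask = torch.arange(max_len)[None, :] < lengths[:, None]  # (B, max_len), boolean
print(mask)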
espnet2.asr_transducer.__init__¶
espnet2.asr_transducer.joint_network¶
Transducer joint network implementation.
-
class
espnet2.asr_transducer.joint_network.
JointNetwork
(output_size: int, encoder_size: int, decoder_size: int, joint_space_size: int = 256, joint_activation_type: str = 'tanh', **activation_parameters)[source]¶ Bases:
torch.nn.modules.module.Module
Transducer joint network module.
- Parameters:
output_size – Output size.
encoder_size – Encoder output size.
decoder_size – Decoder output size.
joint_space_size – Joint space size.
joint_activation_type – Type of activation for the joint network.
**activation_parameters – Parameters for the activation function.
Construct a JointNetwork object.
-
forward
(enc_out: torch.Tensor, dec_out: torch.Tensor, no_projection: bool = False) → torch.Tensor[source]¶ Joint computation of encoder and decoder hidden state sequences.
- Parameters:
enc_out – Expanded encoder output state sequences. (B, T, s_range, D_enc) or (B, T, 1, D_enc)
dec_out – Expanded decoder output state sequences. (B, T, s_range, D_dec) or (B, 1, U, D_dec)
- Returns:
- Joint output state sequences.
(B, T, U, D_out) or (B, T, s_range, D_out)
- Return type:
joint_out
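A hedged sketch of the additive joint computation commonly used in Transducer joint networks (illustrative, not the ESPnet module): encoder and decoder states are projected to a shared joint space, broadcast over the time and label axes, combined, passed through the activation, and projected to the output vocabulary.

import torch

B, T, U, D_enc, D_dec, D_joint, V = 2, 6, 4, 8, 8, 16, 10
lin_enc = torch.nn.Linear(D_enc, D_joint)
lin_dec = torch.nn.Linear(D_dec, D_joint)
lin_out = torch.nn.Linear(D_joint, V)

enc_out = torch.randn(B, T, 1, D_enc)  # (B, T, 1, D_enc)
dec_out = torch.randn(B, 1, U, D_dec)  # (B, 1, U, D_dec)

joint = torch.tanh(lin_enc(enc_out) + lin_dec(dec_out))  # (B, T, U, D_joint)
logits = lin_out(joint)                                   # (B, T, U, V)
print(logits.shape)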
espnet2.asr_transducer.encoder.encoder¶
Encoder for Transducer model.
-
class
espnet2.asr_transducer.encoder.encoder.
Encoder
(input_size: int, body_conf: List[Dict[str, Any]], input_conf: Dict[str, Any] = {}, main_conf: Dict[str, Any] = {})[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module definition.
- Parameters:
input_size – Input size.
body_conf – Encoder body configuration.
input_conf – Encoder input configuration.
main_conf – Encoder main configuration.
Construct an Encoder object.
-
chunk_forward
(x: torch.Tensor, x_len: torch.Tensor, processed_frames: torch._VariableFunctionsClass.tensor, left_context: int = 32) → torch.Tensor[source]¶ Encode input sequences as chunks.
- Parameters:
x – Encoder input features. (1, T_in, F)
x_len – Encoder input features lengths. (1,)
processed_frames – Number of frames already seen.
left_context – Number of previous frames (AFTER subsampling) the attention module can see in current chunk.
- Returns:
Encoder outputs. (B, T_out, D_enc)
- Return type:
x
-
forward
(x: torch.Tensor, x_len: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – Encoder input features. (B, T_in, F)
x_len – Encoder input features lengths. (B,)
- Returns:
Encoder outputs. (B, T_out, D_enc) x_len: Encoder outputs lengths. (B,)
- Return type:
x
espnet2.asr_transducer.encoder.__init__¶
espnet2.asr_transducer.encoder.building¶
Set of methods to build Transducer encoder architecture.
-
espnet2.asr_transducer.encoder.building.
build_body_blocks
(configuration: List[Dict[str, Any]], main_params: Dict[str, Any], output_size: int) → espnet2.asr_transducer.encoder.modules.multi_blocks.MultiBlocks[source]¶ Build encoder body blocks.
- Parameters:
configuration – Body blocks configuration.
main_params – Encoder main parameters.
output_size – Architecture output size.
- Returns:
MultiBlocks module encapsulating all encoder blocks.
-
espnet2.asr_transducer.encoder.building.
build_branchformer_block
(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.branchformer.Branchformer[source]¶ Build Branchformer block.
- Parameters:
configuration – Branchformer block configuration.
main_params – Encoder main parameters.
- Returns:
Branchformer block function.
-
espnet2.asr_transducer.encoder.building.
build_conformer_block
(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.conformer.Conformer[source]¶ Build Conformer block.
- Parameters:
configuration – Conformer block configuration.
main_params – Encoder main parameters.
- Returns:
Conformer block function.
-
espnet2.asr_transducer.encoder.building.
build_conv1d_block
(configuration: List[Dict[str, Any]], causal: bool) → espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d[source]¶ Build Conv1d block.
- Parameters:
configuration – Conv1d block configuration.
causal – Whether to use causal convolution.
- Returns:
Conv1d block function.
-
espnet2.asr_transducer.encoder.building.
build_ebranchformer_block
(configuration: List[Dict[str, Any]], main_params: Dict[str, Any]) → espnet2.asr_transducer.encoder.blocks.ebranchformer.EBranchformer[source]¶ Build E-Branchformer block.
- Parameters:
configuration – E-Branchformer block configuration.
main_params – Encoder main parameters.
- Returns:
E-Branchformer block function.
-
espnet2.asr_transducer.encoder.building.
build_input_block
(input_size: int, configuration: Dict[str, Union[str, int]]) → espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput[source]¶ Build encoder input block.
- Parameters:
input_size – Input size.
configuration – Input block configuration.
- Returns:
ConvInput block function.
-
espnet2.asr_transducer.encoder.building.
build_main_parameters
(pos_wise_act_type: str = 'swish', conv_mod_act_type: str = 'swish', pos_enc_dropout_rate: float = 0.0, pos_enc_max_len: int = 5000, simplified_att_score: bool = False, norm_type: str = 'layer_norm', conv_mod_norm_type: str = 'layer_norm', after_norm_eps: Optional[float] = None, after_norm_partial: Optional[float] = None, blockdrop_rate: float = 0.0, dynamic_chunk_training: bool = False, short_chunk_threshold: float = 0.75, short_chunk_size: int = 25, num_left_chunks: int = 0, **activation_parameters) → Dict[str, Any][source]¶ Build encoder main parameters.
- Parameters:
pos_wise_act_type – X-former position-wise feed-forward activation type.
conv_mod_act_type – X-former convolution module activation type.
pos_enc_dropout_rate – Positional encoding dropout rate.
pos_enc_max_len – Positional encoding maximum length.
simplified_att_score – Whether to use simplified attention score computation.
norm_type – X-former normalization module type.
conv_mod_norm_type – Conformer convolution module normalization type.
after_norm_eps – Epsilon value for the final normalization.
after_norm_partial – Value for the final normalization with RMSNorm.
blockdrop_rate – Probability threshold of dropping out each encoder block.
dynamic_chunk_training – Whether to use dynamic chunk training.
short_chunk_threshold – Threshold for dynamic chunk selection.
short_chunk_size – Minimum number of frames during dynamic chunk training.
num_left_chunks – Number of left chunks the attention module can see. (null or negative value means full context)
**activation_parameters – Parameters of the activation functions. (See espnet2/asr_transducer/activation.py)
- Returns:
Main encoder parameters.
-
espnet2.asr_transducer.encoder.building.
build_positional_encoding
(block_size: int, configuration: Dict[str, Any]) → espnet2.asr_transducer.encoder.modules.positional_encoding.RelPositionalEncoding[source]¶ Build positional encoding block.
- Parameters:
block_size – Input/output size.
configuration – Positional encoding configuration.
- Returns:
Positional encoding module.
espnet2.asr_transducer.encoder.validation¶
Set of methods to validate encoder architecture.
-
espnet2.asr_transducer.encoder.validation.
validate_architecture
(input_conf: Dict[str, Any], body_conf: List[Dict[str, Any]], input_size: int) → Tuple[int, int][source]¶ Validate that the specified architecture is valid.
- Parameters:
input_conf – Encoder input block configuration.
body_conf – Encoder body blocks configuration.
input_size – Encoder input size.
- Returns:
Encoder input block output size. : Encoder body block output size.
- Return type:
input_block_osize
-
espnet2.asr_transducer.encoder.validation.
validate_block_arguments
(configuration: Dict[str, Any], block_id: int, previous_block_output: int) → Tuple[int, int][source]¶ Validate block arguments.
- Parameters:
configuration – Architecture configuration.
block_id – Block ID.
previous_block_output – Previous block output size.
- Returns:
Block input size. output_size: Block output size.
- Return type:
input_size
-
espnet2.asr_transducer.encoder.validation.
validate_input_block
(configuration: Dict[str, Any], body_first_conf: Dict[str, Any], input_size: int) → int[source]¶ Validate input block.
- Parameters:
configuration – Encoder input block configuration.
body_first_conf – Encoder first body block configuration.
input_size – Encoder input block input size.
- Returns:
Encoder input block output size.
- Return type:
output_size
espnet2.asr_transducer.encoder.blocks.conformer¶
Conformer block for Transducer encoder.
-
class
espnet2.asr_transducer.encoder.blocks.conformer.
Conformer
(block_size: int, self_att: torch.nn.modules.module.Module, feed_forward: torch.nn.modules.module.Module, feed_forward_macaron: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Conformer module definition.
- Parameters:
block_size – Input/output size.
self_att – Self-attention module instance.
feed_forward – Feed-forward module instance.
feed_forward_macaron – Feed-forward module instance for macaron network.
conv_mod – Convolution module instance.
norm_class – Normalization module class.
norm_args – Normalization module arguments.
dropout_rate – Dropout rate.
Construct a Conformer object.
-
chunk_forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode chunk of input sequence.
- Parameters:
x – Conformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T_2)
left_context – Number of previous frames the attention module can see in current chunk.
- Returns:
Conformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
-
forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – Conformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T)
chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
Conformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
espnet2.asr_transducer.encoder.blocks.conv_input¶
ConvInput block for Transducer encoder.
-
class
espnet2.asr_transducer.encoder.blocks.conv_input.
ConvInput
(input_size: int, conv_size: Union[int, Tuple], subsampling_factor: int = 4, vgg_like: bool = True, output_size: Optional[int] = None)[source]¶ Bases:
torch.nn.modules.module.Module
ConvInput module definition.
- Parameters:
input_size – Input size.
conv_size – Convolution size.
subsampling_factor – Subsampling factor.
vgg_like – Whether to use a VGG-like network.
output_size – Block output dimension.
Construct a ConvInput object.
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – ConvInput input sequences. (B, T, D_feats)
mask – Mask of input sequences. (B, 1, T)
- Returns:
ConvInput output sequences. (B, sub(T), D_out) mask: Mask of output sequences. (B, 1, sub(T))
- Return type:
x
espnet2.asr_transducer.encoder.blocks.ebranchformer¶
E-Branchformer block for Transducer encoder.
-
class
espnet2.asr_transducer.encoder.blocks.ebranchformer.
EBranchformer
(block_size: int, linear_size: int, self_att: torch.nn.modules.module.Module, feed_forward: torch.nn.modules.module.Module, feed_forward_macaron: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, depthwise_conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
E-Branchformer module definition.
Reference: https://arxiv.org/pdf/2210.00077.pdf
- Parameters:
block_size – Input/output size.
linear_size – Linear layers’ hidden size.
self_att – Self-attention module instance.
feed_forward – Feed-forward module instance.
feed_forward_macaron – Feed-forward module instance for macaron network.
conv_mod – ConvolutionalSpatialGatingUnit module instance.
depthwise_conv_mod – DepthwiseConvolution module instance.
norm_class – Normalization class.
norm_args – Normalization module arguments.
dropout_rate – Dropout rate.
Construct an E-Branchformer object.
-
chunk_forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode chunk of input sequence.
- Parameters:
x – E-Branchformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T_2)
left_context – Number of previous frames the attention module can see in current chunk.
- Returns:
E-Branchformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
-
forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – E-Branchformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T)
chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
E-Branchformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
espnet2.asr_transducer.encoder.blocks.branchformer¶
Branchformer block for Transducer encoder.
-
class
espnet2.asr_transducer.encoder.blocks.branchformer.
Branchformer
(block_size: int, linear_size: int, self_att: torch.nn.modules.module.Module, conv_mod: torch.nn.modules.module.Module, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Branchformer module definition.
Reference: https://arxiv.org/pdf/2207.02971.pdf
- Parameters:
block_size – Input/output size.
linear_size – Linear layers’ hidden size.
self_att – Self-attention module instance.
conv_mod – Convolution module instance.
norm_class – Normalization class.
norm_args – Normalization module arguments.
dropout_rate – Dropout rate.
Construct a Branchformer object.
-
chunk_forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode chunk of input sequence.
- Parameters:
x – Branchformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T_2)
left_context – Number of previous frames the attention module can see in current chunk.
- Returns:
Branchformer output sequences. (B, T, D_block) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
-
forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – Branchformer input sequences. (B, T, D_block)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_block)
mask – Source mask. (B, T)
chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
Branchformer output sequences. (B, T, D_block) mask: Source mask. (B, T) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_block)
- Return type:
x
espnet2.asr_transducer.encoder.blocks.conv1d¶
Conv1d block for Transducer encoder.
-
class
espnet2.asr_transducer.encoder.blocks.conv1d.
Conv1d
(input_size: int, output_size: int, kernel_size: Union[int, Tuple], stride: Union[int, Tuple] = 1, dilation: Union[int, Tuple] = 1, groups: Union[int, Tuple] = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, causal: bool = False, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Conv1d module definition.
- Parameters:
input_size – Input dimension.
output_size – Output dimension.
kernel_size – Size of the convolving kernel.
stride – Stride of the convolution.
dilation – Spacing between the kernel points.
groups – Number of blocked connections from input channels to output channels.
bias – Whether to add a learnable bias to the output.
batch_norm – Whether to use batch normalization after convolution.
relu – Whether to use a ReLU activation after convolution.
causal – Whether to use causal convolution (set to True if streaming).
dropout_rate – Dropout rate.
Construct a Conv1d object.
-
chunk_forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode chunk of input sequence.
- Parameters:
x – Conv1d input sequences. (B, T, D_in)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
mask – Source mask. (B, T)
left_context – Number of previous frames the attention module can see in current chunk (not used here).
- Returns:
Conv1d output sequences. (B, T, D_out) pos_enc: Positional embedding sequences. (B, 2 * (T - 1), D_out)
- Return type:
x
-
create_new_mask
(mask: torch.Tensor) → torch.Tensor[source]¶ Create new mask for output sequences.
- Parameters:
mask – Mask of input sequences. (B, T)
- Returns:
Mask of output sequences. (B, sub(T))
- Return type:
mask
-
create_new_pos_enc
(pos_enc: torch.Tensor) → torch.Tensor[source]¶ Create new positional embedding vector.
- Parameters:
pos_enc – Input sequences positional embedding. (B, 2 * (T - 1), D_in)
- Returns:
- Output sequences positional embedding.
(B, 2 * (sub(T) - 1), D_in)
- Return type:
pos_enc
-
forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: Optional[torch.Tensor] = None, chunk_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Encode input sequences.
- Parameters:
x – Conv1d input sequences. (B, T, D_in)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
mask – Source mask. (B, T)
chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
Conv1d output sequences. (B, sub(T), D_out) mask: Source mask. (B, T) or (B, sub(T)) pos_enc: Positional embedding sequences.
(B, 2 * (T - 1), D_att) or (B, 2 * (sub(T) - 1), D_out)
- Return type:
x
espnet2.asr_transducer.encoder.blocks.__init__¶
espnet2.asr_transducer.encoder.modules.multi_blocks¶
MultiBlocks for encoder architecture.
-
class
espnet2.asr_transducer.encoder.modules.multi_blocks.
MultiBlocks
(block_list: List[torch.nn.modules.module.Module], output_size: int, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Optional[Dict] = None, blockdrop_rate: int = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
MultiBlocks definition.
- Parameters:
block_list – Individual blocks of the encoder architecture.
output_size – Architecture output size.
norm_class – Normalization module class.
norm_args – Normalization module arguments.
blockdrop_rate – Probability threshold of dropping out each block.
Construct a MultiBlocks object.
-
chunk_forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) → torch.Tensor[source]¶ Forward each block of the encoder architecture.
- Parameters:
x – MultiBlocks input sequences. (B, T, D_block_1)
pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_att)
mask – Source mask. (B, T_2)
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
- Returns:
MultiBlocks output sequences. (B, T, D_block_N)
- Return type:
x
-
forward
(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶ Forward each block of the encoder architecture.
- Parameters:
x – MultiBlocks input sequences. (B, T, D_block_1)
pos_enc – Positional embedding sequences.
mask – Source mask. (B, T)
chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
Output sequences. (B, T, D_block_N)
- Return type:
x
-
reset_streaming_cache
(left_context: int, device: torch.device) → None[source]¶ Initialize/Reset encoder streaming cache.
- Parameters:
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
device – Device to use for cache tensor.
espnet2.asr_transducer.encoder.modules.positional_encoding¶
Positional encoding modules.
-
class
espnet2.asr_transducer.encoder.modules.positional_encoding.
RelPositionalEncoding
(size: int, dropout_rate: float = 0.0, max_len: int = 5000)[source]¶ Bases:
torch.nn.modules.module.Module
Relative positional encoding.
- Parameters:
size – Module size.
max_len – Maximum input length.
dropout_rate – Dropout rate.
Construct a RelPositionalEncoding object.
-
extend_pe
(x: torch.Tensor, left_context: int = 0) → None[source]¶ Reset positional encoding.
- Parameters:
x – Input sequences. (B, T, ?)
left_context – Number of previous frames the attention module can see in current chunk.
-
forward
(x: torch.Tensor, left_context: int = 0) → torch.Tensor[source]¶ Compute positional encoding.
- Parameters:
x – Input sequences. (B, T, ?)
left_context – Number of previous frames the attention module can see in current chunk.
- Returns:
Positional embedding sequences. (B, 2 * (T - 1), ?)
- Return type:
pos_enc
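For intuition, the sketch below builds sinusoidal embeddings for the relative offsets T - 1 down to -(T - 1) that underlie the shapes above. It is illustrative only (it assumes an even embedding size) and is not the ESPnet module.

import math

import torch


def rel_sinusoidal(T: int, size: int) -> torch.Tensor:
    # Relative offsets T - 1, ..., 0, ..., -(T - 1): 2 * T - 1 positions in total.
    positions = torch.arange(T - 1, -T, -1, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, size, 2, dtype=torch.float32) * (-math.log(10000.0) / size)
    )
    pe = torch.zeros(2 * T - 1, size)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe.unsqueeze(0)  # (1, 2 * T - 1, size)


print(rel_sinusoidal(T=10, size=64).shape)  # torch.Size([1, 19, 64])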
espnet2.asr_transducer.encoder.modules.normalization¶
Normalization modules for X-former blocks.
-
class
espnet2.asr_transducer.encoder.modules.normalization.
BasicNorm
(normalized_shape: int, eps: float = 0.25)[source]¶ Bases:
torch.nn.modules.module.Module
BasicNorm module definition.
Reference: https://github.com/k2-fsa/icefall/pull/288
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
Construct a BasicNorm object.
-
class
espnet2.asr_transducer.encoder.modules.normalization.
RMSNorm
(normalized_shape: int, eps: float = 1e-05, partial: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
RMSNorm module definition.
Reference: https://arxiv.org/pdf/1910.07467.pdf
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
partial – Value defining the part of the input used for RMS stats.
Construct a RMSNorm object.
-
class
espnet2.asr_transducer.encoder.modules.normalization.
ScaleNorm
(normalized_shape: int, eps: float = 1e-05)[source]¶ Bases:
torch.nn.modules.module.Module
ScaleNorm module definition.
Reference: https://arxiv.org/pdf/1910.05895.pdf
- Parameters:
normalized_shape – Expected size.
eps – Value added to the denominator for numerical stability.
Construct a ScaleNorm object.
-
espnet2.asr_transducer.encoder.modules.normalization.
get_normalization
(normalization_type: str, eps: Optional[float] = None, partial: Optional[float] = None) → Tuple[torch.nn.modules.module.Module, Dict][source]¶ Get normalization module and arguments given parameters.
- Parameters:
normalization_type – Normalization module type.
eps – Value added to the denominator.
partial – Value defining the part of the input used for RMS stats (RMSNorm).
- Returns:
Normalization module class : Normalization module arguments
espnet2.asr_transducer.encoder.modules.convolution¶
Convolution modules for X-former blocks.
-
class
espnet2.asr_transducer.encoder.modules.convolution.
ConformerConvolution
(channels: int, kernel_size: int, activation: torch.nn.modules.module.Module = ReLU(), norm_args: Dict = {}, causal: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
ConformerConvolution module definition.
- Parameters:
channels – The number of channels.
kernel_size – Size of the convolving kernel.
activation – Activation function.
norm_args – Normalization module arguments.
causal – Whether to use causal convolution (set to True if streaming).
Construct a ConformerConvolution object.
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute convolution module.
- Parameters:
x – ConformerConvolution input sequences. (B, T, D_hidden)
mask – Source mask. (B, T_2)
cache – ConformerConvolution input cache. (1, D_hidden, conv_kernel)
- Returns:
ConformerConvolution output sequences. (B, ?, D_hidden) cache: ConformerConvolution output cache. (1, D_hidden, conv_kernel)
- Return type:
x
-
class
espnet2.asr_transducer.encoder.modules.convolution.
ConvolutionalSpatialGatingUnit
(size: int, kernel_size: int, norm_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, norm_args: Dict = {}, dropout_rate: float = 0.0, causal: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Spatial Gating Unit module definition.
- Parameters:
size – Initial size to determine the number of channels.
kernel_size – Size of the convolving kernel.
norm_class – Normalization module class.
norm_args – Normalization module arguments.
dropout_rate – Dropout rate.
causal – Whether to use causal convolution (set to True if streaming).
Construct a ConvolutionalSpatialGatingUnit object.
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute convolution module.
- Parameters:
x – ConvolutionalSpatialGatingUnit input sequences. (B, T, D_hidden)
mask – Source mask. (B, T_2)
cache – ConvolutionalSpatialGatingUnit input cache. (1, D_hidden, conv_kernel)
- Returns:
ConvolutionalSpatialGatingUnit output sequences. (B, ?, D_hidden)
- Return type:
x
-
class
espnet2.asr_transducer.encoder.modules.convolution.
DepthwiseConvolution
(size: int, kernel_size: int, causal: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Depth-wise Convolution module definition.
- Parameters:
size – Initial size to determine the number of channels.
kernel_size – Size of the convolving kernel.
causal – Whether to use causal convolution (set to True if streaming).
Construct a DepthwiseConvolution object.
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute convolution module.
- Parameters:
x – DepthwiseConvolution input sequences. (B, T, D_hidden)
mask – Source mask. (B, T_2)
cache – DepthwiseConvolution input cache. (1, conv_kernel, D_hidden)
- Returns:
DepthwiseConvolution output sequences. (B, ?, D_hidden)
- Return type:
x
espnet2.asr_transducer.encoder.modules.attention¶
Multi-Head attention layers with relative positional encoding.
-
class
espnet2.asr_transducer.encoder.modules.attention.
RelPositionMultiHeadedAttention
(num_heads: int, embed_size: int, dropout_rate: float = 0.0, simplified_attention_score: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
RelPositionMultiHeadedAttention definition.
- Parameters:
num_heads – Number of attention heads.
embed_size – Embedding size.
dropout_rate – Dropout rate.
Construct a RelPositionMultiHeadedAttention object.
-
compute_attention_score
(query: torch.Tensor, key: torch.Tensor, pos_enc: torch.Tensor, left_context: int = 0) → torch.Tensor[source]¶ Attention score computation.
- Parameters:
query – Transformed query tensor. (B, H, T_1, d_k)
key – Transformed key tensor. (B, H, T_2, d_k)
pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)
left_context – Number of previous frames to use for current chunk attention computation.
- Returns:
Attention score. (B, H, T_1, T_2)
-
compute_simplified_attention_score
(query: torch.Tensor, key: torch.Tensor, pos_enc: torch.Tensor, left_context: int = 0) → torch.Tensor[source]¶ Simplified attention score computation.
Reference: https://github.com/k2-fsa/icefall/pull/458
- Parameters:
query – Transformed query tensor. (B, H, T_1, d_k)
key – Transformed key tensor. (B, H, T_2, d_k)
pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)
left_context – Number of previous frames to use for current chunk attention computation.
- Returns:
Attention score. (B, H, T_1, T_2)
-
forward
(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None, left_context: int = 0) → torch.Tensor[source]¶ Compute scaled dot product attention with rel. positional encoding.
- Parameters:
query – Query tensor. (B, T_1, size)
key – Key tensor. (B, T_2, size)
value – Value tensor. (B, T_2, size)
pos_enc – Positional embedding tensor. (B, 2 * T_1 - 1, size)
mask – Source mask. (B, T_2)
chunk_mask – Chunk mask. (T_1, T_1)
left_context – Number of previous frames to use for current chunk attention computation.
- Returns:
Output tensor. (B, T_1, H * d_k)
-
forward_attention
(value: torch.Tensor, scores: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶ Compute attention context vector.
- Parameters:
value – Transformed value. (B, H, T_2, d_k)
scores – Attention score. (B, H, T_1, T_2)
mask – Source mask. (B, T_2)
chunk_mask – Chunk mask. (T_1, T_1)
- Returns:
Transformed value weighted by attention score. (B, T_1, H * d_k)
- Return type:
attn_output
-
forward_qkv
(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Transform query, key and value.
- Parameters:
query – Query tensor. (B, T_1, size)
key – Key tensor. (B, T_2, size)
value – Value tensor. (B, T_2, size)
- Returns:
Transformed query tensor. (B, H, T_1, d_k) k: Transformed key tensor. (B, H, T_2, d_k) v: Transformed value tensor. (B, H, T_2, d_k)
- Return type:
q
-
rel_shift
(x: torch.Tensor, left_context: int = 0) → torch.Tensor[source]¶ Compute relative positional encoding.
- Parameters:
x – Input sequence. (B, H, T_1, 2 * T_1 - 1)
left_context – Number of previous frames to use for current chunk attention computation.
- Returns:
Output sequence. (B, H, T_1, T_2)
- Return type:
x
espnet2.asr_transducer.encoder.modules.__init__¶
espnet2.asr_transducer.decoder.stateless_decoder¶
Stateless decoder definition for Transducer models.
-
class
espnet2.asr_transducer.decoder.stateless_decoder.
StatelessDecoder
(vocab_size: int, embed_size: int = 256, embed_dropout_rate: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder
Stateless Transducer decoder module.
- Parameters:
vocab_size – Output size.
embed_size – Embedding size.
embed_dropout_rate – Dropout rate for embedding layer.
embed_pad – Embed/Blank symbol ID.
Construct a StatelessDecoder object.
-
batch_score
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, None][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
- Returns:
Decoder output sequences. (B, D_dec) states: Decoder hidden states. None
- Return type:
out
-
create_batch_states
(new_states: List[Optional[torch.Tensor]]) → None[source]¶ Create decoder hidden states.
- Parameters:
new_states – Decoder hidden states. [N x None]
- Returns:
Decoder hidden states. None
- Return type:
states
-
forward
(labels: torch.Tensor, states: Optional[Any] = None) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences. (B, L)
states – Decoder hidden states. None
- Returns:
Decoder output sequences. (B, U, D_emb)
- Return type:
embed
-
init_state
(batch_size: int) → None[source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states. None
-
score
(label_sequence: List[int], states: Optional[Any] = None) → Tuple[torch.Tensor, None][source]¶ One-step forward hypothesis.
- Parameters:
label_sequence – Current label sequence.
states – Decoder hidden states. None
- Returns:
Decoder output sequence. (1, D_emb) state: Decoder hidden states. None
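The idea behind the stateless decoder can be sketched in a few lines (illustrative, not the ESPnet class): the "decoder" is just an embedding lookup with dropout, so no hidden state is carried between steps, which is why the state-related methods above return None.

import torch

vocab_size, embed_size, embed_pad = 100, 256, 0
embed = torch.nn.Embedding(vocab_size, embed_size, padding_idx=embed_pad)
dropout = torch.nn.Dropout(p=0.0)

labels = torch.randint(1, vocab_size, (2, 5))  # (B, L) label IDs
dec_out = dropout(embed(labels))               # (B, U, D_emb); no recurrent state
print(dec_out.shape)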
espnet2.asr_transducer.decoder.mega_decoder¶
MEGA decoder definition for Transducer models.
-
class
espnet2.asr_transducer.decoder.mega_decoder.
MEGADecoder
(vocab_size: int, block_size: int = 512, linear_size: int = 1024, qk_size: int = 128, v_size: int = 1024, num_heads: int = 4, rel_pos_bias_type: str = 'simple', max_positions: int = 2048, truncation_length: Optional[int] = None, normalization_type: str = 'layer_norm', normalization_args: Dict = {}, activation_type: str = 'swish', activation_args: Dict = {}, chunk_size: int = -1, num_blocks: int = 4, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ema_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder
MEGA decoder module.
Based on https://arxiv.org/pdf/2209.10655.pdf.
- Parameters:
vocab_size – Vocabulary size.
block_size – Input/Output size.
linear_size – NormalizedPositionwiseFeedForward hidden size.
qk_size – Shared query and key size for attention module.
v_size – Value size for attention module.
num_heads – Number of EMA heads.
rel_pos_bias_type – Type of relative position bias in attention module.
max_positions – Maximum number of position for RelativePositionBias.
truncation_length – Maximum length for truncation in EMA module.
normalization_type – Normalization layer type.
normalization_args – Normalization layer arguments.
activation_type – Activation function type.
activation_args – Activation function arguments.
chunk_size – Chunk size for attention computation (-1 = full context).
num_blocks – Number of MEGA blocks.
dropout_rate – Dropout rate for MEGA internal modules.
embed_dropout_rate – Dropout rate for embedding layer.
att_dropout_rate – Dropout rate for the attention module.
ema_dropout_rate – Dropout rate for the EMA module.
ffn_dropout_rate – Dropout rate for the feed-forward module.
embed_pad – Embedding padding symbol ID.
Construct a MEGADecoder object.
-
batch_score
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
- Returns:
Decoder output sequences. (B, D_dec) states: Decoder hidden states. [N x Dict]
- Return type:
out
-
create_batch_states
(new_states: List[List[Dict[str, torch.Tensor]]]) → List[Dict[str, torch.Tensor]][source]¶ Create batch of decoder hidden states given a list of new states.
- Parameters:
new_states – Decoder hidden states. [B x [N x Dict]]
- Returns:
Decoder hidden states. [N x Dict]
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Decoder input sequences. (B, L)
- Returns:
Decoder output sequences. (B, U, D_dec)
- Return type:
out
-
inference
(labels: torch.Tensor, states: List[Dict[str, torch.Tensor]]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters:
labels – Decoder input sequences. (B, L)
states – Decoder hidden states. [B x Dict]
- Returns:
Decoder output sequences. (B, U, D_dec) new_states: Decoder hidden states. [B x Dict]
- Return type:
out
-
init_state
(batch_size: int = 0) → List[Dict[str, torch.Tensor]][source]¶ Initialize MEGADecoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Decoder hidden states. [N x Dict]
- Return type:
states
-
score
(label_sequence: List[int], states: List[Dict[str, torch.Tensor]]) → Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]][source]¶ One-step forward hypothesis.
- Parameters:
label_sequence – Current label sequence.
states – Decoder hidden states. (??)
- Returns:
Decoder output sequence. (D_dec) states: Decoder hidden states. (??)
-
select_state
(states: List[Dict[str, torch.Tensor]], idx: int) → List[Dict[str, torch.Tensor]][source]¶ Select ID state from batch of decoder hidden states.
- Parameters:
states – Decoder hidden states. [N x Dict]
idx – State ID to extract.
- Returns:
Decoder hidden states for given ID. [N x Dict]
espnet2.asr_transducer.decoder.rnn_decoder¶
RNN decoder definition for Transducer models.
-
class
espnet2.asr_transducer.decoder.rnn_decoder.
RNNDecoder
(vocab_size: int, embed_size: int = 256, hidden_size: int = 256, rnn_type: str = 'lstm', num_layers: int = 1, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder
RNN decoder module.
- Parameters:
vocab_size – Vocabulary size.
embed_size – Embedding size.
hidden_size – Hidden size.
rnn_type – Decoder layers type.
num_layers – Number of decoder layers.
dropout_rate – Dropout rate for decoder layers.
embed_dropout_rate – Dropout rate for embedding layer.
embed_pad – Embedding padding symbol ID.
Construct a RNNDecoder object.
-
batch_score
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
- Returns:
Decoder output sequences. (B, D_dec) states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Return type:
out
-
create_batch_states
(new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]]) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Create decoder hidden states.
- Parameters:
new_states – Decoder hidden states. [B x ((N, 1, D_dec), (N, 1, D_dec) or None)]
- Returns:
Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Return type:
states
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences. (B, L)
- Returns:
Decoder output sequences. (B, U, D_dec)
- Return type:
out
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
-
rnn_forward
(x: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters:
x – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Returns:
RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Return type:
x
-
score
(label_sequence: List[int], states: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ One-step forward hypothesis.
- Parameters:
label_sequence – Current label sequence.
states – Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)
- Returns:
Decoder output sequence. (1, D_dec) states: Decoder hidden states.
((N, 1, D_dec), (N, 1, D_dec) or None)
- Return type:
out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
idx – State ID to extract.
- Returns:
Decoder hidden state for given ID. ((N, 1, D_dec), (N, 1, D_dec) or None)
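Example: training-style and decoding-style calls, assuming a small configuration chosen only for illustration:

    import torch
    from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

    decoder = RNNDecoder(vocab_size=50, embed_size=64, hidden_size=64, num_layers=2)

    # Training-style call: encode a whole batch of label sequences.
    labels = torch.randint(1, 50, (4, 10))     # (B, L)
    out = decoder(labels)                      # (B, U, D_dec) = (4, 10, 64)

    # Decoding-style calls: one-step scoring with explicit hidden states.
    states = decoder.init_state(batch_size=1)           # ((N, 1, D_dec), (N, 1, D_dec))
    dec_out, states = decoder.score([0, 3, 7], states)  # dec_out: (1, D_dec)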
espnet2.asr_transducer.decoder.rwkv_decoder¶
RWKV decoder definition for Transducer models.
-
class
espnet2.asr_transducer.decoder.rwkv_decoder.
RWKVDecoder
(vocab_size: int, block_size: int = 512, context_size: int = 1024, linear_size: Optional[int] = None, attention_size: Optional[int] = None, normalization_type: str = 'layer_norm', normalization_args: Dict = {}, num_blocks: int = 4, rescale_every: int = 0, embed_dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr_transducer.decoder.abs_decoder.AbsDecoder
RWKV decoder module.
Based on https://arxiv.org/pdf/2305.13048.pdf.
- Parameters:
vocab_size – Vocabulary size.
block_size – Input/Output size.
context_size – Context size for WKV computation.
linear_size – FeedForward hidden size.
attention_size – SelfAttention hidden size.
normalization_type – Normalization layer type.
normalization_args – Normalization layer arguments.
num_blocks – Number of RWKV blocks.
rescale_every – Whether to rescale input every N blocks (inference only).
embed_dropout_rate – Dropout rate for embedding layer.
att_dropout_rate – Dropout rate for the attention module.
ffn_dropout_rate – Dropout rate for the feed-forward module.
embed_pad – Embedding padding symbol ID.
Construct a RWKVDecoder object.
-
batch_score
(hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
- Returns:
Decoder output sequence. (B, D_dec) states: Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]
- Return type:
out
-
create_batch_states
(new_states: List[List[Dict[str, torch.Tensor]]]) → List[torch.Tensor][source]¶ Create batch of decoder hidden states given a list of new states.
- Parameters:
new_states – Decoder hidden states. [B x [5 x (1, 1, D_att/D_dec, N)]]
- Returns:
Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Decoder input sequences. (B, L)
- Returns:
Decoder output sequences. (B, U, D_dec)
- Return type:
out
-
inference
(labels: torch.Tensor, states: torch.Tensor) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Encode source label sequences.
- Parameters:
labels – Decoder input sequences. (B, L)
states – Decoder hidden states. [5 x (B, D_att/D_dec, N)]
- Returns:
Decoder output sequences. (B, U, D_dec) states: Decoder hidden states. [5 x (B, D_att/D_dec, N)]
- Return type:
out
-
init_state
(batch_size: int = 1) → List[torch.Tensor][source]¶ Initialize RWKVDecoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Decoder hidden states. [5 x (B, 1, D_att/D_dec, N)]
- Return type:
states
-
score
(label_sequence: List[int], states: List[torch.Tensor]) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ One-step forward hypothesis.
- Parameters:
label_sequence – Current label sequence.
states – Decoder hidden states. [5 x (1, 1, D_att/D_dec, N)]
- Returns:
Decoder output sequence. (D_dec) states: Decoder hidden states. [5 x (1, 1, D_att/D_dec, N)]
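Example: the training-time forward pass relies on a custom WKV CUDA kernel, so this sketch assumes a CUDA device and a working build toolchain; the label sequence length must stay within context_size:

    import torch
    from espnet2.asr_transducer.decoder.rwkv_decoder import RWKVDecoder

    # Assumption: CUDA and a build toolchain are available so that the custom
    # WKV kernel can be compiled for the chosen context_size.
    decoder = RWKVDecoder(vocab_size=50, block_size=128, context_size=256,
                          num_blocks=2).to("cuda")

    labels = torch.randint(1, 50, (2, 16), device="cuda")   # (B, L), L <= context_size
    out = decoder(labels)                                    # (B, U, D_dec) = (2, 16, 128)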
espnet2.asr_transducer.decoder.abs_decoder¶
Abstract decoder definition for Transducer models.
-
class
espnet2.asr_transducer.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Abstract decoder module.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
batch_score
(hyps: List[Any]) → Tuple[torch.Tensor, Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
- Returns:
Decoder output sequences. states: Decoder hidden states.
- Return type:
out
-
abstract
create_batch_states
(new_states: List[Union[List[Dict[str, Optional[torch.Tensor]]], List[List[torch.Tensor]], Tuple[torch.Tensor, Optional[torch.Tensor]]]]) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create batch of decoder hidden states given a list of new states.
- Parameters:
new_states – Decoder hidden states.
- Returns:
Decoder hidden states.
-
abstract
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences.
- Returns:
Decoder output sequences.
-
abstract
init_state
(batch_size: int) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Decoder hidden states.
-
abstract
score
(label_sequence: List[int], states: Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]) → Tuple[torch.Tensor, Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]]][source]¶ One-step forward hypothesis.
- Parameters:
label_sequence – Current label sequence.
states – Decoder hidden states.
- Returns:
Decoder output sequence. states: Decoder hidden states.
- Return type:
out
-
abstract
select_state
(states: Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]], idx: int = 0) → Union[List[Dict[str, torch.Tensor]], List[torch.Tensor], Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Get specified ID state from batch of states, if provided.
- Parameters:
states – Decoder hidden states.
idx – State ID to extract.
- Returns:
Decoder hidden state for given ID.
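Example: a custom decoder plugs into the Transducer beam search by implementing this interface; a minimal stateless sketch (not an actual ESPnet decoder), assuming hypotheses expose their label history as yseq:

    from typing import Any, List, Tuple

    import torch

    from espnet2.asr_transducer.decoder.abs_decoder import AbsDecoder


    class EmbeddingDecoder(AbsDecoder):
        """Illustrative stateless decoder: each label is scored independently."""

        def __init__(self, vocab_size: int, embed_size: int = 256, embed_pad: int = 0):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, embed_size, padding_idx=embed_pad)
            self.output_size = embed_size

        def forward(self, labels: torch.Tensor) -> torch.Tensor:
            return self.embed(labels)                      # (B, U, D_dec)

        def score(self, label_sequence: List[int], states: Any) -> Tuple[torch.Tensor, Any]:
            label = torch.tensor([label_sequence[-1]], device=self.embed.weight.device)
            return self.embed(label)[0], states            # (D_dec), states unchanged

        def batch_score(self, hyps: List[Any]) -> Tuple[torch.Tensor, Any]:
            # Assumption: each hypothesis stores its label history in `yseq`.
            labels = torch.tensor([h.yseq[-1] for h in hyps],
                                  device=self.embed.weight.device)
            return self.embed(labels), [None] * len(hyps)  # (B, D_dec)

        def init_state(self, batch_size: int) -> Any:
            return [None] * batch_size

        def select_state(self, states: Any, idx: int = 0) -> Any:
            return None

        def create_batch_states(self, new_states: Any) -> Any:
            return new_states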
espnet2.asr_transducer.decoder.__init__¶
espnet2.asr_transducer.decoder.blocks.mega¶
Moving Average Equipped Gated Attention (MEGA) block definition.
Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/moving_average_gated_attention.py
Most variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/mega/modeling_mega.py.
-
class
espnet2.asr_transducer.decoder.blocks.mega.
MEGA
(size: int = 512, num_heads: int = 4, qk_size: int = 128, v_size: int = 1024, activation: torch.nn.modules.module.Module = ReLU(), normalization: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, rel_pos_bias_type: str = 'simple', max_positions: int = 2048, truncation_length: Optional[int] = None, chunk_size: int = -1, dropout_rate: float = 0.0, att_dropout_rate: float = 0.0, ema_dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
MEGA module.
- Parameters:
size – Input/Output size.
num_heads – Number of EMA heads.
qk_size – Shared query and key size for attention module.
v_size – Value size for attention module.
activation – Activation function type.
normalization – Normalization module.
rel_pos_bias_type – Type of relative position bias in attention module.
max_positions – Maximum number of position for RelativePositionBias.
truncation_length – Maximum length for truncation in EMA module.
chunk_size – Chunk size for attention computation (-1 = full context).
dropout_rate – Dropout rate for inner modules.
att_dropout_rate – Dropout rate for the attention module.
ema_dropout_rate – Dropout rate for the EMA module.
Construct a MEGA object.
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None, attn_mask: Optional[torch.Tensor] = None, state: Optional[Dict[str, Optional[torch.Tensor]]] = None) → Tuple[torch.Tensor, Optional[Dict[str, Optional[torch.Tensor]]]][source]¶ Compute moving average equipped gated attention.
- Parameters:
x – MEGA input sequences. (L, B, size)
mask – MEGA input sequence masks. (B, 1, L)
attn_mask – MEGA attention mask. (1, L, L)
state – Decoder hidden states.
- Returns:
MEGA output sequences. (B, L, size) state: Decoder hidden states.
- Return type:
x
-
reset_parameters
(val: float = 0.0, std: float = 0.02) → None[source]¶ Reset module parameters.
- Parameters:
val – Initialization value.
std – Standard deviation.
-
softmax_attention
(query: torch.Tensor, key: torch.Tensor, mask: Optional[torch.Tensor] = None, attn_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]¶ Compute attention weights with softmax.
- Parameters:
query – Query tensor. (B, 1, L, D)
key – Key tensor. (B, 1, L, D)
mask – Sequence mask. (B, 1, L)
attn_mask – Attention mask. (1, L, L)
- Returns:
Attention weights. (B, 1, L, L)
- Return type:
attn_weights
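Example: a single forward pass with default attention sizes; note that the input is time-major (L, B, size) while the output is batch-major, per the docstring above:

    import torch
    from espnet2.asr_transducer.decoder.blocks.mega import MEGA

    mega = MEGA(size=512)

    x = torch.randn(12, 2, 512)      # (L, B, size)
    out, state = mega(x)             # out: (B, L, size); the state is used during decoding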
espnet2.asr_transducer.decoder.blocks.__init__¶
espnet2.asr_transducer.decoder.blocks.rwkv¶
Receptance Weighted Key Value (RWKV) block definition.
Based/modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py
-
class
espnet2.asr_transducer.decoder.blocks.rwkv.
RWKV
(size: int, linear_size: int, attention_size: int, context_size: int, block_id: int, num_blocks: int, normalization_class: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, normalization_args: Dict = {}, att_dropout_rate: float = 0.0, ffn_dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
RWKV module.
- Parameters:
size – Input/Output size.
linear_size – Feed-forward hidden size.
attention_size – SelfAttention hidden size.
context_size – Context size for WKV computation.
block_id – Block index.
num_blocks – Number of blocks in the architecture.
normalization_class – Normalization layer class.
normalization_args – Normalization layer arguments.
att_dropout_rate – Dropout rate for the attention module.
ffn_dropout_rate – Dropout rate for the feed-forward module.
Construct a RWKV object.
-
forward
(x: torch.Tensor, state: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Compute receptance weighted key value.
- Parameters:
x – RWKV input sequences. (B, L, size)
state – Decoder hidden states. [5 x (B, D_att/size, N)]
- Returns:
RWKV output sequences. (B, L, size) state: Decoder hidden states. [5 x (B, D_att/size, N)]
- Return type:
x
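The block wraps the time-mixing (SelfAttention) and channel-mixing (FeedForward) modules in pre-normalization and residual connections. An illustrative, self-contained re-implementation of that wiring, with trivially simple stand-in mixing modules (not the library class):

    import torch


    class TinyRWKVBlock(torch.nn.Module):
        """Residual wiring of an RWKV block: LN -> time mix -> add, LN -> channel mix -> add."""

        def __init__(self, size: int, hidden: int) -> None:
            super().__init__()
            self.ln1 = torch.nn.LayerNorm(size)
            self.ln2 = torch.nn.LayerNorm(size)
            # Stand-ins for SelfAttention (time mixing) and FeedForward (channel mixing).
            self.time_mix = torch.nn.Linear(size, size)
            self.channel_mix = torch.nn.Sequential(
                torch.nn.Linear(size, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, size)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = x + self.time_mix(self.ln1(x))
            x = x + self.channel_mix(self.ln2(x))
            return x


    block = TinyRWKVBlock(size=128, hidden=512)
    out = block(torch.randn(2, 16, 128))       # (B, L, size) in, same shape out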
espnet2.asr_transducer.decoder.modules.__init__¶
espnet2.asr_transducer.decoder.modules.rwkv.feed_forward¶
Feed-forward (channel mixing) module for RWKV block.
Based/Modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py
Some variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/rwkv/modeling_rwkv.py.
-
class
espnet2.asr_transducer.decoder.modules.rwkv.feed_forward.
FeedForward
(size: int, hidden_size: int, block_id: int, num_blocks: int)[source]¶ Bases:
torch.nn.modules.module.Module
FeedForward module definition.
- Parameters:
size – Input/Output size.
hidden_size – Hidden size.
block_id – Block index.
num_blocks – Number of blocks in the architecture.
Construct a FeedForward object.
-
forward
(x: torch.Tensor, state: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, Optional[List[torch.Tensor]]][source]¶ Compute channel mixing.
- Parameters:
x – FeedForward input sequences. (B, U, size)
state – Decoder hidden state. [5 x (B, 1, size, N)]
- Returns:
FeedForward output sequences. (B, U, size) state: Decoder hidden state. [5 x (B, 1, size, N)]
- Return type:
x
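Example: channel mixing over a batch of label sequences, with no explicit state (training-style call):

    import torch
    from espnet2.asr_transducer.decoder.modules.rwkv.feed_forward import FeedForward

    ffn = FeedForward(size=64, hidden_size=256, block_id=0, num_blocks=2)

    x = torch.randn(2, 8, 64)        # (B, U, size)
    out, state = ffn(x)              # out: (B, U, size); state stays None in this mode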
espnet2.asr_transducer.decoder.modules.rwkv.attention¶
Attention (time mixing) modules for RWKV block.
Based/Modified from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py.
Some variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/rwkv/modeling_rwkv.py.
-
class
espnet2.asr_transducer.decoder.modules.rwkv.attention.
SelfAttention
(size: int, attention_size: int, context_size: int, block_id: int, num_blocks: int)[source]¶ Bases:
torch.nn.modules.module.Module
SelfAttention module definition.
- Parameters:
size – Input/Output size.
attention_size – Attention hidden size.
context_size – Context size for WKV kernel.
block_id – Block index.
num_blocks – Number of blocks in the architecture.
Construct a SelfAttention object.
-
forward
(x: torch.Tensor, state: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, Optional[List[torch.Tensor]]][source]¶ Compute time mixing.
- Parameters:
x – SelfAttention input sequences. (B, U, size)
state – Decoder hidden states. [5 x (B, 1, D_att, N)]
- Returns:
SelfAttention output sequences. (B, U, size) state: Decoder hidden states. [5 x (B, 1, D_att, N)]
- Return type:
x
-
reset_parameters
(size: int, attention_size: int, block_id: int, num_blocks: int) → None[source]¶ Reset module parameters.
- Parameters:
size – Block size.
attention_size – Attention hidden size.
block_id – Block index.
num_blocks – Number of blocks in the architecture.
-
wkv_linear_attention
(time_decay: torch.Tensor, time_first: torch.Tensor, key: torch.Tensor, value: torch.Tensor, state: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]][source]¶ Compute WKV with state (i.e.: for inference).
- Parameters:
time_decay – Channel-wise time decay vector. (D_att)
time_first – Channel-wise time first vector. (D_att)
key – Key tensor. (B, 1, D_att)
value – Value tensor. (B, 1, D_att)
state – Decoder hidden states. [3 x (B, D_att)]
- Returns:
Weighted Key-Value. (B, 1, D_att) state: Decoder hidden states. [3 x (B, 1, D_att)]
- Return type:
output
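For reference, the stateful WKV computation is a running, exponentially decayed weighted average of past values, with a separate bonus (time_first) applied to the current token. An illustrative, numerically naive re-implementation of that recurrence (the actual module additionally tracks a running maximum for numerical stability):

    import torch


    def naive_wkv_step(w, u, k_t, v_t, num, den):
        """One WKV step for a single timestep.

        w:   channel-wise decay (positive), shape (D,)
        u:   channel-wise "time first" bonus for the current token, shape (D,)
        k_t, v_t: current key/value, shape (B, D)
        num, den: running weighted sums from previous steps, shape (B, D)
        """
        # Output mixes the history with the current token, which gets the +u bonus.
        out = (num + torch.exp(u + k_t) * v_t) / (den + torch.exp(u + k_t))
        # History is decayed by exp(-w) and the current token is appended.
        num = torch.exp(-w) * num + torch.exp(k_t) * v_t
        den = torch.exp(-w) * den + torch.exp(k_t)
        return out, num, den


    # Tiny smoke test over a random sequence.
    B, T, D = 2, 5, 8
    w, u = torch.rand(D), torch.zeros(D)
    k, v = torch.randn(B, T, D), torch.randn(B, T, D)
    num, den = torch.zeros(B, D), torch.zeros(B, D)
    for t in range(T):
        out, num, den = naive_wkv_step(w, u, k[:, t], v[:, t], num, den)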
-
class
espnet2.asr_transducer.decoder.modules.rwkv.attention.
WKVLinearAttention
(*args, **kwargs)[source]¶ Bases:
torch.autograd.function.Function
WKVLinearAttention function definition.
-
static
backward
(ctx, grad_output: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ WKVLinearAttention function backward pass.
- Parameters:
grad_output – Output gradient. (B, U, D_att)
- Returns:
Gradient for channel-wise time decay vector. (D_att) grad_time_first: Gradient for channel-wise time first vector. (D_att) grad_key: Gradient for key tensor. (B, U, D_att) grad_value: Gradient for value tensor. (B, U, D_att)
- Return type:
grad_time_decay
-
static
forward
(ctx, time_decay: torch.Tensor, time_first: torch.Tensor, key: torch.Tensor, value: torch.Tensor) → torch.Tensor[source]¶ WKVLinearAttention function forward pass.
- Parameters:
time_decay – Channel-wise time decay vector. (D_att)
time_first – Channel-wise time first vector. (D_att)
key – Key tensor. (B, U, D_att)
value – Value tensor. (B, U, D_att)
- Returns:
Weighted Key-Value tensor. (B, U, D_att)
- Return type:
out
espnet2.asr_transducer.decoder.modules.rwkv.__init__¶
espnet2.asr_transducer.decoder.modules.mega.feed_forward¶
Normalized position-wise feed-forward module for MEGA block.
-
class
espnet2.asr_transducer.decoder.modules.mega.feed_forward.
NormalizedPositionwiseFeedForward
(size: int, hidden_size: int, normalization: torch.nn.modules.module.Module = <class 'torch.nn.modules.normalization.LayerNorm'>, activation: torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.ReLU'>, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
NormalizedPositionwiseFeedForward module definition.
- Parameters:
size – Input/Output size.
hidden_size – Hidden size.
normalization – Normalization module.
activation – Activation function.
dropout_rate – Dropout rate.
Construct a NormalizedPositionwiseFeedForward object.
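Example sketch, assuming the module is applied directly to a (B, L, size) tensor (the forward signature is not documented above, so this usage is an assumption):

    import torch
    from espnet2.asr_transducer.decoder.modules.mega.feed_forward import (
        NormalizedPositionwiseFeedForward,
    )

    ffn = NormalizedPositionwiseFeedForward(size=512, hidden_size=1024)

    x = torch.randn(2, 12, 512)      # assumed (B, L, size) input
    out = ffn(x)                     # expected to keep the input shape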
espnet2.asr_transducer.decoder.modules.mega.positional_bias¶
Positional bias related modules.
Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/relative_positional_bias.py
-
class
espnet2.asr_transducer.decoder.modules.mega.positional_bias.
RelativePositionBias
(max_positions: int)[source]¶ Bases:
torch.nn.modules.module.Module
RelativePositionBias module definition.
- Parameters:
max_positions – Maximum number of relative positions.
Construct a RelativePositionBias object.
-
class
espnet2.asr_transducer.decoder.modules.mega.positional_bias.
RotaryRelativePositionBias
(size: int, max_positions: int = 2048)[source]¶ Bases:
torch.nn.modules.module.Module
RotaryRelativePositionBias module definition.
- Parameters:
size – Module embedding size.
max_positions – Maximum number of relative positions.
Construct a RotaryRelativePositionBias object.
-
forward
(length: int) → torch.Tensor[source]¶ Compute rotary relative position bias.
- Parameters:
length – Sequence length.
- Returns:
Rotary relative position bias. (L, L)
- Return type:
bias
-
static
get_sinusoid_embeddings
(max_positions: int, size: int) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute sinusoidal positional embeddings.
- Parameters:
max_positions – Maximum number of positions.
size – Input size.
- Returns:
Sine elements. (max_positions, size // 2) and cosine elements. (max_positions, size // 2)
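Example: computing the (L, L) rotary relative position bias and inspecting the underlying sinusoidal table:

    from espnet2.asr_transducer.decoder.modules.mega.positional_bias import (
        RotaryRelativePositionBias,
    )

    bias_module = RotaryRelativePositionBias(size=64, max_positions=128)
    bias = bias_module(16)           # (16, 16); length must not exceed max_positions

    # Sine/cosine tables of shape (max_positions, size // 2) each.
    sine, cosine = RotaryRelativePositionBias.get_sinusoid_embeddings(128, 64)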
espnet2.asr_transducer.decoder.modules.mega.__init__¶
espnet2.asr_transducer.decoder.modules.mega.multi_head_damped_ema¶
Multi-head Damped Exponential Moving Average (EMA) module for MEGA block.
Based/modified from https://github.com/facebookresearch/mega/blob/main/fairseq/modules/moving_average_gated_attention.py
Most variables are renamed according to https://github.com/huggingface/transformers/blob/main/src/transformers/models/mega/modeling_mega.py.
-
class
espnet2.asr_transducer.decoder.modules.mega.multi_head_damped_ema.
MultiHeadDampedEMA
(size: int, num_heads: int = 4, activation: torch.nn.modules.module.Module = ReLU(), truncation_length: Optional[int] = None)[source]¶ Bases:
torch.nn.modules.module.Module
MultiHeadDampedEMA module definition.
- Parameters:
size – Module size.
num_heads – Number of attention heads.
activation – Activation function type.
truncation_length – Maximum length for truncation.
Construct a MultiHeadDampedEMA object.
-
compute_ema_coefficients
() → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute EMA coefficients.
- Parameters:
None
- Returns:
Damping factor / P-th order coefficient. (size, num_heads, 1) prev_timestep_weight: Previous timestep weight / Q-th order coefficient. (size, num_heads, 1)
- Return type:
damping_factor
-
compute_ema_kernel
(length: int) → torch.Tensor[source]¶ Compute EMA kernel / vandermonde product.
- Parameters:
length – Sequence length.
- Returns:
EMA kernel / Vandermonde product. (size, L)
-
ema_one_step
(x: torch.Tensor, state: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Perform exponential moving average for a single step.
- Parameters:
x – MultiHeadDampedEMA input sequences. (B, D, 1)
state – MultiHeadDampedEMA state. (B, D, num_heads)
- Returns:
MultiHeadDampedEMA output sequences. (B, 1, D) new_state: MultiHeadDampedEMA state. (B, D, num_heads)
- Return type:
out
-
forward
(x: torch.Tensor, mask: Optional[torch.Tensor] = None, state: Optional[Dict[str, torch.Tensor]] = None) → Optional[torch.Tensor][source]¶ Compute multi-dimensional damped EMA.
- Parameters:
x – MultiHeadDampedEMA input sequence. (L, B, D)
mask – Sequence mask. (B, 1, L)
state – MultiHeadDampedEMA state. (B, D, num_heads)
- Returns:
MultiHeadDampedEMA output sequence. (B, L, D) new_state: MultiHeadDampedEMA state. (B, D, num_heads)
- Return type:
x
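Example: step-by-step EMA as used during decoding, plus direct inspection of the EMA coefficients and the full-sequence kernel:

    import torch
    from espnet2.asr_transducer.decoder.modules.mega.multi_head_damped_ema import (
        MultiHeadDampedEMA,
    )

    ema = MultiHeadDampedEMA(size=256, num_heads=4)

    # One decoding step at a time; the state holds one value per channel and head.
    x_t = torch.randn(2, 256, 1)                   # (B, D, 1)
    out, state = ema.ema_one_step(x_t)             # out: (B, 1, D), state: (B, D, num_heads)
    out, state = ema.ema_one_step(torch.randn(2, 256, 1), state)

    damping, prev_weight = ema.compute_ema_coefficients()   # (size, num_heads, 1) each
    kernel = ema.compute_ema_kernel(20)                      # (size, L)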
espnet2.asr_transducer.frontend.online_audio_processor¶
Online processor for Transducer models chunk-by-chunk streaming decoding.
-
class
espnet2.asr_transducer.frontend.online_audio_processor.
OnlineAudioProcessor
(feature_extractor: torch.nn.modules.module.Module, normalization_module: torch.nn.modules.module.Module, decoding_window: int, encoder_sub_factor: int, frontend_conf: Dict, device: torch.device, audio_sampling_rate: int = 16000)[source]¶ Bases:
object
OnlineAudioProcessor module definition.
- Parameters:
feature_extractor – Feature extractor module.
normalization_module – Normalization module.
decoding_window – Size of the decoding window (in ms).
encoder_sub_factor – Encoder subsampling factor.
frontend_conf – Frontend configuration.
device – Device to pin module tensors on.
audio_sampling_rate – Input sampling rate.
Construct an OnlineAudioProcessor.
-
compute_features
(samples: torch.Tensor, is_final: bool) → None[source]¶ Compute features from input samples.
- Parameters:
samples – Speech data. (S)
is_final – Whether speech corresponds to the final chunk of data.
- Returns:
Features sequence. (1, chunk_sz_bs, D_feats) feats_length: Features length sequence. (1,)
- Return type:
feats
-
get_current_feats
(feats: torch.Tensor, feats_length: torch.Tensor, is_final: bool) → Tuple[torch.Tensor, torch.Tensor][source]¶ Get features for current decoding window.
- Parameters:
feats – Computed features sequence. (1, F, D_feats)
feats_length – Computed features sequence length. (1,)
is_final – Whether feats corresponds to the final chunk of data.
- Returns:
Decoding window features sequence. (1, chunk_sz_bs, D_feats) feats_length: Decoding window features length sequence. (1,)
- Return type:
feats
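Example: a chunk-by-chunk streaming sketch, assuming `processor` is an already constructed OnlineAudioProcessor (its feature extractor, normalization module, and frontend configuration come from the trained model) and following the documented return values of compute_features:

    import torch

    # Assumption: `processor` is an already constructed OnlineAudioProcessor.
    decoding_window_ms = 640
    sampling_rate = 16000
    chunk_samples = decoding_window_ms * sampling_rate // 1000

    speech = torch.randn(4 * sampling_rate)        # dummy 4 s utterance at 16 kHz
    chunks = list(speech.split(chunk_samples))

    for i, chunk in enumerate(chunks):
        is_final = i == len(chunks) - 1
        feats, feats_length = processor.compute_features(chunk, is_final)
        # `feats` covers the current decoding window and is passed to the
        # streaming encoder for chunk-by-chunk decoding.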