espnet.nets package¶
Initialize sub package.
espnet.nets.mt_interface¶
MT Interface module.
-
class
espnet.nets.mt_interface.
MTInterface
[source]¶ Bases:
object
MT Interface for ESPnet model implementation.
-
property
attention_plot_class
¶ Get attention plot class.
-
classmethod
build
(idim: int, odim: int, **kwargs)[source]¶ Initialize this class with python-level args.
- Parameters:
idim (int) – The number of an input feature dim.
odim (int) – The number of output vocab.
- Returns:
A new instance of MTInterface.
- Return type:
MTInterface
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ Calculate attention.
- Parameters:
xs (list) – list of padded input sequences [(T1, idim), (T2, idim), …]
ilens (ndarray) – batch of lengths of input sequences (B)
ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]
- Returns:
attention weights (B, Lmax, Tmax)
- Return type:
float ndarray
-
forward
(xs, ilens, ys)[source]¶ Compute loss for training.
- Parameters:
xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim). For chainer, list of source sequences chainer.Variable.
ilens – Batch of lengths of source sequences (B). For pytorch, torch.Tensor. For chainer, list of int.
ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax). For chainer, list of target sequences chainer.Variable.
- Returns:
loss value
- Return type:
torch.Tensor for pytorch, chainer.Variable for chainer
-
translate
(x, trans_args, char_list=None, rnnlm=None)[source]¶ Translate x for evaluation.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
trans_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
translate_batch
(x, trans_args, char_list=None, rnnlm=None)[source]¶ Beam search implementation for batch.
- Parameters:
x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)
trans_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.beam_search_timesync¶
Time Synchronous One-Pass Beam Search.
Implements joint CTC/attention decoding where hypotheses are expanded along the time (input) axis, as described in https://arxiv.org/abs/2210.05200. Supports CPU and GPU inference.
References: https://arxiv.org/abs/1408.2873 for CTC beam search.
Author: Brian Yan
-
class
espnet.nets.beam_search_timesync.
BeamSearchTimeSync
(sos: int, beam_size: int, scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], token_list=<class 'dict'>, pre_beam_ratio: float = 1.5, blank: int = 0, force_lid: bool = False, temp: float = 1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Time synchronous beam search algorithm.
Initialize beam search.
- Parameters:
beam_size – num hyps
sos – sos index
ctc – CTC module
pre_beam_ratio – pre_beam_ratio * beam_size = pre_beam; pre_beam is the number of candidates selected from the vocabulary to extend hypotheses
decoder – decoder ScorerInterface
ctc_weight – ctc_weight
blank – blank index
-
cached_score
(h: Tuple[int], cache: dict, scorer: espnet.nets.scorer_interface.ScorerInterface) → Any[source]¶ Retrieve decoder/LM scores which may be cached.
espnet.nets.e2e_mt_common¶
Common functions for ST and MT.
-
class
espnet.nets.e2e_mt_common.
ErrorCalculator
(char_list, sym_space, sym_pad, report_bleu=False)[source]¶ Bases:
object
Calculate BLEU for ST and MT models during training.
- Parameters:
y_hats – numpy array with predicted text
y_pads – numpy array with true (target) text
char_list – vocabulary list
sym_space – space symbol
sym_pad – pad symbol
report_bleu – report BLEU score if True
Construct an ErrorCalculator object.
espnet.nets.scorer_interface¶
Scorer interface module.
-
class
espnet.nets.scorer_interface.
BatchPartialScorerInterface
[source]¶ Bases:
espnet.nets.scorer_interface.BatchScorerInterface
,espnet.nets.scorer_interface.PartialScorerInterface
Batch partial scorer interface for beam search.
-
batch_score_partial
(ys: torch.Tensor, next_tokens: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token (required).
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
next_tokens (torch.Tensor) – torch.int64 tokens to score (n_batch, n_token).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
Tuple of a score tensor for ys that has a shape (n_batch, n_vocab) and next states for ys
- Return type:
tuple[torch.Tensor, Any]
-
-
class
espnet.nets.scorer_interface.
BatchScorerInterface
[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
Batch scorer interface.
-
batch_init_state
(x: torch.Tensor) → Any[source]¶ Get an initial state for decoding (optional).
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch (required).
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
-
class
espnet.nets.scorer_interface.
PartialScorerInterface
[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
Partial scorer interface for beam search.
The partial scorer runs after the non-partial scorers have finished scoring, and receives only pre-pruned next tokens, because scoring every token in the vocabulary would be too expensive.
Examples
- Prefix search for connectionist-temporal-classification models
-
score_partial
(y: torch.Tensor, next_tokens: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D prefix token
next_tokens (torch.Tensor) – torch.int64 next token to score
state – decoder state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys
- Returns:
Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
class
espnet.nets.scorer_interface.
ScorerInterface
[source]¶ Bases:
object
Scorer interface for beam search.
The scorer scores all tokens in the vocabulary.
Examples
- Search heuristics
- Decoder networks of the sequence-to-sequence models
espnet.nets.pytorch_backend.nets.transformer.decoder.Decoder
espnet.nets.pytorch_backend.nets.rnn.decoders.Decoder
-
final_score
(state: Any) → float[source]¶ Score eos (optional).
- Parameters:
state – Scorer state for prefix tokens
- Returns:
final score
- Return type:
float
-
init_state
(x: torch.Tensor) → Any[source]¶ Get an initial state for decoding (optional).
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token (required).
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns:
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
select_state
(state: Any, i: int, new_id: int = None) → Any[source]¶ Select state with relative ids in the main beam search.
- Parameters:
state – Decoder state for prefix tokens
i (int) – Index to select a state in the main beam search
new_id (int) – New label index to select a state if necessary
- Returns:
pruned state
- Return type:
state
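As a rough illustration of the contract described above, the following sketch implements a toy scorer with the same method names (init_state, select_state, score, final_score). It is a pure-Python stand-in: it uses lists where the real interface uses torch.Tensor, it does not import ESPnet, and the class name ToyLengthBonus and its bonus value are hypothetical.

```python
from typing import Any, List, Optional, Tuple


class ToyLengthBonus:
    """Hypothetical scorer that mimics the ScorerInterface method names."""

    def __init__(self, n_vocab: int, bonus: float = 1.0):
        self.n_vocab = n_vocab
        self.bonus = bonus

    def init_state(self, x: Any) -> Any:
        # This toy scorer keeps no state, so the initial state is None.
        return None

    def select_state(self, state: Any, i: int, new_id: Optional[int] = None) -> Any:
        # Nothing to prune for a stateless scorer.
        return state

    def score(self, y: List[int], state: Any, x: Any) -> Tuple[List[float], Any]:
        # One score per vocabulary entry, i.e. a sequence of length n_vocab.
        return [self.bonus] * self.n_vocab, state

    def final_score(self, state: Any) -> float:
        # No extra score is added when the hypothesis ends.
        return 0.0


scorer = ToyLengthBonus(n_vocab=4)
state = scorer.init_state(x=None)
scores, state = scorer.score(y=[0], state=state, x=None)
```

A real scorer would subclass ScorerInterface and return tensors; only score is marked as required in the listing above.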
espnet.nets.lm_interface¶
Language model interface.
-
class
espnet.nets.lm_interface.
LMInterface
[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
LM Interface for ESPnet model implementation.
-
classmethod
build
(n_vocab: int, **kwargs)[source]¶ Initialize this class with python-level args.
- Parameters:
n_vocab (int) – The size of the vocabulary.
- Returns:
A new instance of LMInterface.
- Return type:
LMInterface
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters:
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns:
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type:
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used to compute perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
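The perplexity relation in the note can be checked with a few lines of plain Python. This is a minimal sketch, assuming the second return value is the summed negative log-likelihood and the third is the element count; the helper name perplexity is hypothetical and not part of ESPnet.

```python
import math


def perplexity(nll: float, count: int) -> float:
    """Return exp(nll / count), where nll = -log p(t)."""
    return math.exp(nll / count)


# A uniform model over 4 tokens assigns log-probability -log(4) to each of
# 3 target tokens, so the total negative log-likelihood is 3 * log(4).
nll = 3 * math.log(4)
ppl = perplexity(nll, count=3)
```

For a uniform model the perplexity equals the vocabulary size, which makes this a convenient sanity check.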
espnet.nets.tts_interface¶
TTS Interface related modules.
-
class
espnet.nets.tts_interface.
Reporter
(**links)[source]¶ Bases:
chainer.link.Chain
Reporter module.
-
class
espnet.nets.tts_interface.
TTSInterface
[source]¶ Bases:
object
TTS Interface for ESPnet model implementation.
Initialize TTS module.
-
property
attention_plot_class
¶ Plot attention weights.
-
property
base_plot_keys
¶ Return base key names to plot during training.
The keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
Base keys to plot during training.
- Return type:
list[str]
-
calculate_all_attentions
(*args, **kwargs)[source]¶ Calculate TTS attention weights.
- Returns:
Batch of attention weights (B, Lmax, Tmax).
- Return type:
Tensor
-
forward
(*args, **kwargs)[source]¶ Calculate TTS forward propagation.
- Returns:
Loss value.
- Return type:
Tensor
espnet.nets.asr_interface¶
ASR Interface module.
-
class
espnet.nets.asr_interface.
ASRInterface
[source]¶ Bases:
object
ASR Interface for ESPnet model implementation.
-
property
attention_plot_class
¶ Get attention plot class.
-
classmethod
build
(idim: int, odim: int, **kwargs)[source]¶ Initialize this class with python-level args.
- Parameters:
idim (int) – The number of an input feature dim.
odim (int) – The number of output vocab.
- Returns:
A new instance of ASRInterface.
- Return type:
ASRInterface
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ Calculate attention.
- Parameters:
xs (list) – list of padded input sequences [(T1, idim), (T2, idim), …]
ilens (ndarray) – batch of lengths of input sequences (B)
ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]
- Returns:
attention weights (B, Lmax, Tmax)
- Return type:
float ndarray
-
calculate_all_ctc_probs
(xs, ilens, ys)[source]¶ Calculate CTC probability.
- Parameters:
xs (list) – list of padded input sequences [(T1, idim), (T2, idim), …]
ilens (ndarray) – batch of lengths of input sequences (B)
ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]
- Returns:
CTC probabilities (B, Tmax, vocab)
- Return type:
float ndarray
-
property
ctc_plot_class
¶ Get CTC plot class.
-
encode
(feat)[source]¶ Encode feature in beam_search (optional).
- Parameters:
feat (numpy.ndarray) – input feature (T, D)
- Returns:
encoded feature (T, D)
- Return type:
torch.Tensor for pytorch, chainer.Variable for chainer
-
forward
(xs, ilens, ys)[source]¶ Compute loss for training.
- Parameters:
xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim). For chainer, list of source sequences chainer.Variable.
ilens – Batch of lengths of source sequences (B). For pytorch, torch.Tensor. For chainer, list of int.
ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax). For chainer, list of target sequences chainer.Variable.
- Returns:
loss value
- Return type:
torch.Tensor for pytorch, chainer.Variable for chainer
-
recognize
(x, recog_args, char_list=None, rnnlm=None)[source]¶ Recognize x for evaluation.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
recognize_batch
(x, recog_args, char_list=None, rnnlm=None)[source]¶ Beam search implementation for batch.
- Parameters:
x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)
recog_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
scorers
()[source]¶ Get scorers for beam_search (optional).
- Returns:
dict of ScorerInterface objects
- Return type:
dict[str, ScorerInterface]
espnet.nets.e2e_asr_common¶
Common functions for ASR.
-
class
espnet.nets.e2e_asr_common.
ErrorCalculator
(char_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]¶ Bases:
object
Calculate CER and WER for E2E_ASR and CTC models during training.
- Parameters:
y_hats – numpy array with predicted text
y_pads – numpy array with true (target) text
char_list – vocabulary list
sym_space – space symbol
sym_blank – blank symbol
Construct an ErrorCalculator object.
-
calculate_cer
(seqs_hat, seqs_true)[source]¶ Calculate sentence-level CER score.
- Parameters:
seqs_hat (list) – prediction
seqs_true (list) – reference
- Returns:
average sentence-level CER score
- Return type:
float
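To make the metric concrete, here is a minimal sketch of a character error rate over a list of sentences. It assumes that spaces are ignored and that edit distances are pooled over all sentences before dividing by the total reference length; both choices, and the helper names edit_distance and sentence_cer, are assumptions for illustration rather than the ESPnet implementation.

```python
from typing import List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def sentence_cer(seqs_hat: List[str], seqs_true: List[str]) -> float:
    """Pooled character error rate: total edit distance / total reference length."""
    total_dist, total_len = 0, 0
    for hyp, ref in zip(seqs_hat, seqs_true):
        hyp_chars = hyp.replace(" ", "")
        ref_chars = ref.replace(" ", "")
        total_dist += edit_distance(ref_chars, hyp_chars)
        total_len += len(ref_chars)
    return total_dist / total_len


cer = sentence_cer(["hello world", "abc"], ["hallo world", "abd"])
```

Here each sentence contains one substitution, and the references hold 10 and 3 non-space characters, so the pooled rate is 2/13.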
-
calculate_cer_ctc
(ys_hat, ys_pad)[source]¶ Calculate sentence-level CER score for CTC.
- Parameters:
ys_hat (torch.Tensor) – prediction (batch, seqlen)
ys_pad (torch.Tensor) – reference (batch, seqlen)
- Returns:
average sentence-level CER score
- Return type:
float
-
espnet.nets.e2e_asr_common.
end_detect
(ended_hyps, i, M=3, D_end=-10.0)[source]¶ End detection.
Described in Eq. (50) of S. Watanabe et al., “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition”.
- Parameters:
ended_hyps –
i –
M –
D_end –
- Returns:
espnet.nets.transducer_decoder_interface¶
Transducer decoder interface module.
-
class
espnet.nets.transducer_decoder_interface.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]¶ Bases:
espnet.nets.transducer_decoder_interface.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
-
dec_out
= None¶
-
lm_scores
= None¶
-
-
class
espnet.nets.transducer_decoder_interface.
Hypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
-
lm_state
= None¶
-
-
class
espnet.nets.transducer_decoder_interface.
TransducerDecoderInterface
[source]¶ Bases:
object
Decoder interface for Transducer models.
-
batch_score
(hyps: Union[List[espnet.nets.transducer_decoder_interface.Hypothesis], List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]], dec_states: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
dec_states – Decoder hidden states.
cache – Pairs of (dec_out, dec_states) for each label sequence. (key)
use_lm – Whether to compute label ID sequences for LM.
- Returns:
dec_out: Decoder output sequences.
dec_states: Decoder hidden states.
lm_labels: Label ID sequences for LM.
-
create_batch_states
(states: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]], new_states: List[Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]]], l_tokens: List[List[int]]) → Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters:
states – Batch of decoder states
new_states – List of decoder states
l_tokens – List of token sequences for input batch
- Returns:
Batch of decoder states
- Return type:
batch_states
-
init_state
(batch_size: int) → Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states.
- Return type:
state
-
score
(hyp: espnet.nets.transducer_decoder_interface.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters:
hyp – Hypothesis.
cache – Pairs of (dec_out, dec_state) for each token sequence. (key)
- Returns:
dec_out: Decoder output sequence.
new_state: Decoder hidden states.
lm_tokens: Label ID for LM.
-
select_state
(batch_states: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[torch.Tensor]], idx: int) → Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
batch_states – Decoder hidden states.
idx – State ID to extract.
- Returns:
Decoder hidden state for given ID.
- Return type:
state_idx
-
espnet.nets.beam_search¶
Beam search module.
-
class
espnet.nets.beam_search.
BeamSearch
(scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], beam_size: int, vocab_size: int, sos: int, eos: int, token_list: List[str] = None, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = None, hyp_primer: List[int] = None)[source]¶ Bases:
torch.nn.modules.module.Module
Beam search implementation.
Initialize beam search.
- Parameters:
scorers (dict[str, ScorerInterface]) – Dict of decoder modules, e.g., Decoder, CTCPrefixScorer, LM. A scorer is ignored if it is None.
weights (dict[str, float]) – Dict of weights for each scorer. A scorer is ignored if its weight is 0.
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The size of the vocabulary
sos (int) – Start of sequence id
eos (int) – End of sequence id
token_list (list[str]) – List of tokens for debug log
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
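The pre-beam size described above can be sketched as a simple top-k selection. This is an illustrative stand-in that works on a plain list of scores instead of a tensor; the function name pre_beam_candidates is hypothetical, and how the scores are obtained from pre_beam_score_key is left out.

```python
import heapq
from typing import List


def pre_beam_candidates(scores: List[float], beam_size: int,
                        pre_beam_ratio: float = 1.5) -> List[int]:
    """Return the token ids with the highest scores, best first."""
    # Pre-beam size as stated in the parameter list above.
    pre_beam_size = int(pre_beam_ratio * beam_size)
    # Never ask for more candidates than the vocabulary holds.
    k = min(pre_beam_size, len(scores))
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


token_scores = [0.1, 0.9, 0.3, 0.7, 0.5]
candidates = pre_beam_candidates(token_scores, beam_size=2)
```

With beam_size=2 and the default ratio, three candidates survive, and only those would be handed to the partial scorers.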
-
static
append_token
(xs: torch.Tensor, x: int) → torch.Tensor[source]¶ Append new token to prefix tokens.
- Parameters:
xs (torch.Tensor) – The prefix token
x (int) – The new token to append
- Returns:
New tensor contains: xs + [x] with xs.dtype and xs.device
- Return type:
torch.Tensor
-
beam
(weighted_scores: torch.Tensor, ids: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute topk full token ids and partial token ids.
- Parameters:
weighted_scores (torch.Tensor) – The weighted sum scores for each token. Its shape is (self.n_vocab,).
ids (torch.Tensor) – The partial token ids to compute topk
- Returns:
The topk full token ids and partial token ids. Their shapes are (self.beam_size,)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
-
forward
(x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform beam search.
- Parameters:
x (torch.Tensor) – Encoded speech feature (T, D)
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to find the maximum hypothesis length automatically. If maxlenratio<0.0, its absolute value is interpreted as a constant max output length.
minlenratio (float) – Input length ratio to obtain min output length. If minlenratio<0.0, its absolute value is interpreted as a constant min output length.
- Returns:
N-best decoding results
- Return type:
list[Hypothesis]
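The three cases for maxlenratio and the two cases for minlenratio can be written out as follows. This is a sketch under stated assumptions: when maxlenratio is 0.0 the search relies on end detection, so the code falls back to the number of input frames as an upper bound, and that fallback, like the function name length_limits, is an assumption rather than something the listing above specifies.

```python
from typing import Tuple


def length_limits(n_frames: int, maxlenratio: float = 0.0,
                  minlenratio: float = 0.0) -> Tuple[int, int]:
    """Translate length ratios into (maxlen, minlen) in output tokens."""
    if maxlenratio == 0.0:
        # End detection decides when to stop; n_frames is only an upper bound (assumed).
        maxlen = n_frames
    elif maxlenratio < 0.0:
        # A negative value is a constant maximum length.
        maxlen = -int(maxlenratio)
    else:
        maxlen = max(1, int(maxlenratio * n_frames))

    if minlenratio < 0.0:
        # A negative value is a constant minimum length.
        minlen = -int(minlenratio)
    else:
        minlen = int(minlenratio * n_frames)
    return maxlen, minlen


limits_default = length_limits(100)
limits_constant = length_limits(100, maxlenratio=-20, minlenratio=-5)
limits_ratio = length_limits(100, maxlenratio=0.5, minlenratio=0.1)
```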
-
init_hyp
(x: torch.Tensor) → List[espnet.nets.beam_search.Hypothesis][source]¶ Get initial hypothesis data.
- Parameters:
x (torch.Tensor) – The encoder output feature
- Returns:
The initial hypothesis.
- Return type:
List[Hypothesis]
-
static
merge_scores
(prev_scores: Dict[str, float], next_full_scores: Dict[str, torch.Tensor], full_idx: int, next_part_scores: Dict[str, torch.Tensor], part_idx: int) → Dict[str, torch.Tensor][source]¶ Merge scores for new hypothesis.
- Parameters:
prev_scores (Dict[str, float]) – The previous hypothesis scores by self.scorers
next_full_scores (Dict[str, torch.Tensor]) – scores by self.full_scorers
full_idx (int) – The next token id for next_full_scores
next_part_scores (Dict[str, torch.Tensor]) – scores of partial tokens by self.part_scorers
part_idx (int) – The new token id for next_part_scores
- Returns:
- The new score dict.
Its keys are names of self.full_scorers and self.part_scorers. Its values are scalar tensors by the scorers.
- Return type:
Dict[str, torch.Tensor]
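One way to read the merge is as a per-scorer accumulation: each scorer's previous total is extended by the score of the token chosen for the new hypothesis. The sketch below assumes exactly that and uses plain floats and lists in place of tensors; it is an illustration, not the library code.

```python
from typing import Dict, List


def merge_scores(prev_scores: Dict[str, float],
                 next_full_scores: Dict[str, List[float]], full_idx: int,
                 next_part_scores: Dict[str, List[float]], part_idx: int
                 ) -> Dict[str, float]:
    """Add the chosen token's score to each scorer's running total."""
    new_scores = {}
    for name, values in next_full_scores.items():
        # Full scorers are indexed by the token id in the whole vocabulary.
        new_scores[name] = prev_scores[name] + values[full_idx]
    for name, values in next_part_scores.items():
        # Partial scorers are indexed by position within the pre-pruned ids.
        new_scores[name] = prev_scores[name] + values[part_idx]
    return new_scores


prev = {"decoder": -1.0, "ctc": -2.0}
full = {"decoder": [-0.5, -0.25]}
part = {"ctc": [-0.75]}
merged = merge_scores(prev, full, 1, part, 0)
```

The two index arguments differ because partial scorers only see the pre-pruned candidates, so the same token sits at a different position in their score list.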
-
merge_states
(states: Any, part_states: Any, part_idx: int) → Any[source]¶ Merge states for new hypothesis.
- Parameters:
states – states of self.full_scorers
part_states – states of self.part_scorers
part_idx (int) – The new token id for part_scores
- Returns:
- The new state dict.
Its keys are names of self.full_scorers and self.part_scorers. Its values are states of the scorers.
- Return type:
Dict[str, torch.Tensor]
-
post_process
(i: int, maxlen: int, maxlenratio: float, running_hyps: List[espnet.nets.beam_search.Hypothesis], ended_hyps: List[espnet.nets.beam_search.Hypothesis]) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform post-processing of beam search iterations.
- Parameters:
i (int) – The length of hypothesis tokens.
maxlen (int) – The maximum length of tokens in beam search.
maxlenratio (float) – The maximum length ratio in beam search.
running_hyps (List[Hypothesis]) – The running hypotheses in beam search.
ended_hyps (List[Hypothesis]) – The ended hypotheses in beam search.
- Returns:
The new running hypotheses.
- Return type:
List[Hypothesis]
-
score_full
(hyp: espnet.nets.beam_search.Hypothesis, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.full_scorers.
- Parameters:
hyp (Hypothesis) – Hypothesis with prefix tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns:
- Tuple of
score dict of hyp that has string keys of self.full_scorers and tensor score values of shape: (self.n_vocab,), and state dict that has string keys and state values of self.full_scorers
- Return type:
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
score_partial
(hyp: espnet.nets.beam_search.Hypothesis, ids: torch.Tensor, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.part_scorers.
- Parameters:
hyp (Hypothesis) – Hypothesis with prefix tokens to score
ids (torch.Tensor) – 1D tensor of new partial tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns:
- Tuple of
score dict of hyp that has string keys of self.part_scorers and tensor score values of shape: (len(ids),), and state dict that has string keys and state values of self.part_scorers
- Return type:
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
search
(running_hyps: List[espnet.nets.beam_search.Hypothesis], x: torch.Tensor) → List[espnet.nets.beam_search.Hypothesis][source]¶ Search new tokens for running hypotheses and encoded speech x.
- Parameters:
running_hyps (List[Hypothesis]) – Running hypotheses on beam
x (torch.Tensor) – Encoded speech feature (T, D)
- Returns:
Best sorted hypotheses
- Return type:
List[Hypothesis]
-
class
espnet.nets.beam_search.
Hypothesis
[source]¶ Bases:
tuple
Hypothesis data type.
Create new instance of Hypothesis(yseq, score, scores, states)
-
score
¶ Alias for field number 1
-
scores
¶ Alias for field number 2
-
states
¶ Alias for field number 3
-
yseq
¶ Alias for field number 0
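The field numbering above corresponds to an ordinary named tuple. The sketch below defines a hypothetical ToyHypothesis with the same four fields in the same order, to show how positional and named access line up; the field types are assumptions.

```python
from typing import Any, Dict, List, NamedTuple


class ToyHypothesis(NamedTuple):
    """Stand-in with the same field order: yseq, score, scores, states."""
    yseq: List[int]
    score: float
    scores: Dict[str, float]
    states: Dict[str, Any]


hyp = ToyHypothesis(yseq=[1], score=0.0, scores={}, states={})

# Tuples are immutable, so extending a hypothesis creates a new instance.
extended = hyp._replace(yseq=hyp.yseq + [5], score=hyp.score - 0.5)
```

Because the type is a tuple, hyp[0] and hyp.yseq refer to the same value, and the original hypothesis is left unchanged when a new one is derived from it.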
-
-
espnet.nets.beam_search.
beam_search
(x: torch.Tensor, sos: int, eos: int, beam_size: int, vocab_size: int, scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], token_list: List[str] = None, maxlenratio: float = 0.0, minlenratio: float = 0.0, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = 'full') → list[source]¶ Perform beam search with scorers.
- Parameters:
x (torch.Tensor) – Encoded speech feature (T, D)
sos (int) – Start of sequence id
eos (int) – End of sequence id
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The size of the vocabulary
scorers (dict[str, ScorerInterface]) – Dict of decoder modules, e.g., Decoder, CTCPrefixScorer, LM. A scorer is ignored if it is None.
weights (dict[str, float]) – Dict of weights for each scorer. A scorer is ignored if its weight is 0.
token_list (list[str]) – List of tokens for debug log
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to find the maximum hypothesis length automatically.
minlenratio (float) – Input length ratio to obtain min output length.
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
- Returns:
N-best decoding results
- Return type:
list
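The scorers and weights arguments suggest a weighted combination in which missing scorers and zero weights drop out. The following sketch shows that idea on plain lists; the function name weighted_sum and the scorer names in the example are hypothetical, and a real search would operate on tensors.

```python
from typing import Dict, List, Optional


def weighted_sum(scores: Dict[str, Optional[List[float]]],
                 weights: Dict[str, float]) -> List[float]:
    """Combine per-token scores, skipping None scorers and zero weights."""
    active = {name: vals for name, vals in scores.items()
              if vals is not None and weights.get(name, 0.0) != 0.0}
    if not active:
        return []
    n_tokens = len(next(iter(active.values())))
    total = [0.0] * n_tokens
    for name, vals in active.items():
        for i, value in enumerate(vals):
            total[i] += weights[name] * value
    return total


token_scores = {
    "decoder": [-1.0, -2.0],
    "ctc": [-3.0, -1.0],
    "lm": None,                  # ignored: no scorer supplied
    "length_bonus": [1.0, 1.0],  # ignored: weight is 0
}
token_weights = {"decoder": 0.5, "ctc": 0.5, "lm": 1.0, "length_bonus": 0.0}
combined = weighted_sum(token_scores, token_weights)
```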
espnet.nets.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet.nets.beam_search_transducer.
BeamSearchTransducer
(decoder: Union[espnet.nets.pytorch_backend.transducer.rnn_decoder.RNNDecoder, espnet.nets.pytorch_backend.transducer.custom_decoder.CustomDecoder], joint_network: espnet.nets.pytorch_backend.transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, softmax_temperature: float = 1.0, nbest: int = 1, quantization: bool = False)[source]¶ Bases:
object
Beam search implementation for Transducer.
Initialize Transducer search module.
- Parameters:
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Beam size.
lm – LM class.
lm_weight – LM weight for soft fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum output sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
score_norm – Normalize final scores by length. (“default”)
softmax_temperature – Penalization term for softmax function.
nbest – Number of final hypotheses.
quantization – Whether dynamic quantization is used.
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequences. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.Hypothesis][source]¶ Beam search implementation.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
greedy_search
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.Hypothesis][source]¶ Greedy search implementation.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
1-best hypotheses.
- Return type:
hyp
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis][source]¶ Modified Adaptive Expansion Search (mAES) implementation.
Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
nsc_beam_search
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis][source]¶ N-step constrained beam search implementation.
Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.
- Parameters:
enc_out – Encoder output sequence. (T, D_enc)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
-
prefix_search
(hyps: List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis], enc_out_t: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis][source]¶ Prefix search for NSC and mAES strategies.
Based on https://arxiv.org/pdf/1211.3711.pdf
-
sort_nbest
(hyps: Union[List[espnet.nets.transducer_decoder_interface.Hypothesis], List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]]) → Union[List[espnet.nets.transducer_decoder_interface.Hypothesis], List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]][source]¶ Sort hypotheses by score or score given sequence length.
- Parameters:
hyps – Hypothesis.
- Returns:
Sorted hypothesis.
- Return type:
hyps
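Sorting "by score or score given sequence length" can be pictured as choosing between a raw and a length-normalized sort key. The sketch below assumes the score_norm flag selects division by the label sequence length and that the list is truncated to nbest; ToyHyp and the free function sort_nbest are illustrative stand-ins, not the class method itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyHyp:
    score: float
    yseq: List[int]


def sort_nbest(hyps: List[ToyHyp], nbest: int = 1,
               score_norm: bool = True) -> List[ToyHyp]:
    """Sort best-first, optionally normalizing each score by sequence length."""
    if score_norm:
        def key(h: ToyHyp) -> float:
            return h.score / len(h.yseq)
    else:
        def key(h: ToyHyp) -> float:
            return h.score
    return sorted(hyps, key=key, reverse=True)[:nbest]


long_hyp = ToyHyp(score=-4.0, yseq=[0, 1, 2, 3])
short_hyp = ToyHyp(score=-3.0, yseq=[0, 1])
best_normalized = sort_nbest([long_hyp, short_hyp], score_norm=True)
best_raw = sort_nbest([long_hyp, short_hyp], score_norm=False)
```

Normalization favors the longer hypothesis here (-1.0 per token against -1.5), while the raw score favors the shorter one, which is the usual motivation for length normalization in beam search.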
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet.nets.transducer_decoder_interface.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters:
enc_out – Encoder output sequence. (T, D)
- Returns:
N-best hypothesis.
- Return type:
nbest_hyps
espnet.nets.st_interface¶
ST Interface module.
-
class
espnet.nets.st_interface.
STInterface
[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
ST Interface for ESPnet model implementation.
NOTE: This class is inherited from ASRInterface to enable joint translation and recognition when performing multi-task learning with the ASR task.
-
translate
(x, trans_args, char_list=None, rnnlm=None, ensemble_models=[])[source]¶ Translate x for evaluation.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
trans_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
translate_batch
(x, trans_args, char_list=None, rnnlm=None)[source]¶ Beam search implementation for batch.
- Parameters:
x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)
trans_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
espnet.nets.__init__¶
Initialize sub package.
espnet.nets.batch_beam_search_online_sim¶
Parallel beam search module for online simulation.
-
class
espnet.nets.batch_beam_search_online_sim.
BatchBeamSearchOnlineSim
(scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], beam_size: int, vocab_size: int, sos: int, eos: int, token_list: List[str] = None, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = None, hyp_primer: List[int] = None)[source]¶ Bases:
espnet.nets.batch_beam_search.BatchBeamSearch
Online beam search implementation.
This simulates streaming decoding. It requires the encoded features of the entire utterance and extracts them block by block, as would be done in streaming processing. This is based on Tsunoo et al., “STREAMING TRANSFORMER ASR WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH” (https://arxiv.org/abs/2006.14941).
Initialize beam search.
- Parameters:
scorers (dict[str, ScorerInterface]) – Dict of decoder modules, e.g., Decoder, CTCPrefixScorer, LM. A scorer is ignored if it is None.
weights (dict[str, float]) – Dict of weights for each scorer. A scorer is ignored if its weight is 0.
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The vocabulary size
sos (int) – Start of sequence id
eos (int) – End of sequence id
token_list (list[str]) – List of tokens for debug log
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
-
extend
(x: torch.Tensor, hyps: espnet.nets.beam_search.Hypothesis) → List[espnet.nets.beam_search.Hypothesis][source]¶ Extend probabilities and states with more encoded chunks.
- Parameters:
x (torch.Tensor) – The extended encoder output feature
hyps (Hypothesis) – Current list of hypotheses
- Returns:
The extended hypothesis
- Return type:
List[Hypothesis]
-
forward
(x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform beam search.
- Parameters:
x (torch.Tensor) – Encoded speech feature (T, D)
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
minlenratio (float) – Input length ratio to obtain min output length.
- Returns:
N-best decoding results
- Return type:
list[Hypothesis]
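The two length ratios can be read as multipliers on the number of encoder frames T. The following is a minimal sketch, assuming that a positive ratio is simply scaled by T and that maxlenratio=0.0 leaves T as the upper bound while end detection stops the search earlier. The helper name length_bounds is hypothetical and not part of ESPnet.

```python
def length_bounds(n_frames, maxlenratio=0.0, minlenratio=0.0):
    """Turn length ratios into absolute token-count bounds (illustrative)."""
    if maxlenratio == 0.0:
        # Assumption: no explicit cap, so the input length is the upper bound
        # and end detection is expected to stop the search earlier.
        maxlen = n_frames
    else:
        maxlen = max(1, int(maxlenratio * n_frames))
    minlen = int(minlenratio * n_frames)
    return minlen, maxlen


default_bounds = length_bounds(200)            # (0, 200)
scaled_bounds = length_bounds(200, 0.5, 0.25)  # (50, 100)
```

With 200 encoder frames, the defaults allow any length up to 200 tokens, whereas ratios of 0.5 and 0.25 restrict hypotheses to between 50 and 100 tokens.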
-
set_block_size
(block_size: int)[source]¶ Set block size for streaming decoding.
- Parameters:
block_size (int) – The block size of encoder
-
set_hop_size
(hop_size: int)[source]¶ Set hop size for streaming decoding.
- Parameters:
hop_size (int) – The hop size of encoder
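The block size and hop size together determine how the full encoder output is cut into overlapping blocks. The sketch below assumes that each block starts hop_size frames after the previous one and that the last block is truncated at the end of the utterance. iter_blocks is a hypothetical helper, and the actual decoder may treat boundaries and look-ahead frames differently.

```python
def iter_blocks(n_frames, block_size, hop_size):
    """Yield (start, end) frame ranges of successive overlapping blocks."""
    if hop_size <= 0:
        raise ValueError("hop_size must be positive")
    start = 0
    while True:
        end = min(start + block_size, n_frames)
        yield start, end
        if end == n_frames:
            break  # the utterance is exhausted
        start += hop_size


blocks = list(iter_blocks(n_frames=100, block_size=40, hop_size=16))
# blocks == [(0, 40), (16, 56), (32, 72), (48, 88), (64, 100)]
```

Each block shares block_size - hop_size frames with its predecessor, which is what lets a streaming decoder revisit recent context when a new chunk arrives.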
espnet.nets.batch_beam_search_online¶
Parallel beam search module for online simulation.
-
class
espnet.nets.batch_beam_search_online.
BatchBeamSearchOnline
(*args, block_size=40, hop_size=16, look_ahead=16, disable_repetition_detection=False, encoded_feat_length_limit=0, decoder_text_length_limit=0, **kwargs)[source]¶ Bases:
espnet.nets.batch_beam_search.BatchBeamSearch
Online beam search implementation.
This simulates streaming decoding. It requires the encoded features of the entire utterance and extracts them block by block, as would be done in streaming processing. This is based on Tsunoo et al, “STREAMING TRANSFORMER ASR WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH” (https://arxiv.org/abs/2006.14941).
Initialize beam search.
-
extend
(x: torch.Tensor, hyps: espnet.nets.beam_search.Hypothesis) → List[espnet.nets.beam_search.Hypothesis][source]¶ Extend probabilities and states with more encoded chunks.
- Parameters:
x (torch.Tensor) – The extended encoder output feature
hyps (Hypothesis) – Current list of hypotheses
- Returns:
The extended hypothesis
- Return type:
List[Hypothesis]
-
forward
(x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0, is_final: bool = True) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform beam search.
- Parameters:
x (torch.Tensor) – Encoded speech feature (T, D)
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
minlenratio (float) – Input length ratio to obtain min output length.
- Returns:
N-best decoding results
- Return type:
list[Hypothesis]
-
score_full
(hyp: espnet.nets.batch_beam_search.BatchHypothesis, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.full_scorers.
- Parameters:
hyp (Hypothesis) – Hypothesis with prefix tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns:
- Tuple of
score dict of hyp that has string keys of self.full_scorers and tensor score values of shape: (self.n_vocab,), and state dict that has string keys and state values of self.full_scorers
- Return type:
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
espnet.nets.ctc_prefix_score¶
-
class
espnet.nets.ctc_prefix_score.
CTCPrefixScore
(x, blank, eos, xp)[source]¶ Bases:
object
Compute CTC label sequence scores
which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the probabilities of multiple labels simultaneously.
-
class
espnet.nets.ctc_prefix_score.
CTCPrefixScoreTH
(x, xlens, blank, eos, margin=0)[source]¶ Bases:
object
Batch processing of CTCPrefixScore
which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al. “Vectorized Beam Search for CTC-Attention-Based Speech Recognition,” In INTERSPEECH (pp. 3825-3829), 2019.
Construct CTC prefix scorer
- Parameters:
x (torch.Tensor) – input label posterior sequences (B, T, O)
xlens (torch.Tensor) – input lengths (B,)
blank (int) – blank label id
eos (int) – end-of-sequence id
margin (int) – margin parameter for windowing (0 means no windowing)
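The prefix scoring described above can be illustrated on a single hypothesis. The following is a minimal sketch of the forward recursion behind CTC prefix scoring, written in the linear probability domain with plain lists. It is not the ESPnet implementation, which works in the log domain, handles eos and scores several hypotheses at once. The function names and the state layout are assumptions made for illustration.

```python
def initial_state(y, blank=0):
    """Forward variables (r_n, r_b) of the empty prefix for every frame."""
    state, p_blank = [], 1.0
    for frame in y:
        p_blank *= frame[blank]  # only all-blank paths yield the empty prefix
        state.append((0.0, p_blank))
    return state


def prefix_probs(y, prefix, state, blank=0):
    """Score every one-label extension of `prefix`.

    y      : T frames of label posteriors, y[t][k].
    prefix : tuple of labels decoded so far (empty at the start).
    state  : per-frame (r_n, r_b) of `prefix`, i.e. the mass of paths
             ending in a non-blank / blank label.
    Returns ({label: prefix probability}, {label: new state}).
    """
    last = prefix[-1] if prefix else None
    probs, states = {}, {}
    for c in range(len(y[0])):
        if c == blank:
            continue
        # Frame 0 can emit c directly only when extending the empty prefix.
        r_n = y[0][c] if not prefix else 0.0
        new_state, psi = [(r_n, 0.0)], r_n
        for t in range(1, len(y)):
            g_n, g_b = state[t - 1]
            # Paths of `prefix` that may emit c next; a repeated label
            # needs a blank in between.
            phi = g_b + (0.0 if c == last else g_n)
            h_n, h_b = new_state[t - 1]
            new_state.append(((h_n + phi) * y[t][c],
                              (h_n + h_b) * y[t][blank]))
            psi += phi * y[t][c]
        probs[c], states[c] = psi, new_state
    return probs, states


# Two frames, labels {0: blank, 1: "a"}, uniform posteriors.
y = [[0.5, 0.5], [0.5, 0.5]]
probs, states = prefix_probs(y, (), initial_state(y))
probs_aa, _ = prefix_probs(y, (1,), states[1])
```

In this toy case three of the four possible paths ("a-", "aa", "-a") start with "a", so probs[1] is 0.75, while the prefix "aa" would need at least three frames and therefore scores 0.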
espnet.nets.batch_beam_search¶
Parallel beam search module.
-
class
espnet.nets.batch_beam_search.
BatchBeamSearch
(scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], beam_size: int, vocab_size: int, sos: int, eos: int, token_list: List[str] = None, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = None, hyp_primer: List[int] = None)[source]¶ Bases:
espnet.nets.beam_search.BeamSearch
Batch beam search implementation.
Initialize beam search.
- Parameters:
scorers (dict[str, ScorerInterface]) – Dict of decoder modules, e.g., Decoder, CTCPrefixScorer, LM. A scorer is ignored if it is None.
weights (dict[str, float]) – Dict of weights for each scorer. A scorer is ignored if its weight is 0.
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The vocabulary size
sos (int) – Start of sequence id
eos (int) – End of sequence id
token_list (list[str]) – List of tokens for debug log
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
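The pre-beam parameters describe a two-stage pruning: the scores stored under pre_beam_score_key first narrow the vocabulary to int(pre_beam_ratio * beam_size) candidates, and only those candidates take part in the weighted ranking. The sketch below shows this on plain Python lists, assuming per-token scores are already available. The function names and the scorer keys "decoder" and "ctc" are illustrative only.

```python
import heapq


def pre_beam_ids(pre_scores, beam_size, pre_beam_ratio=1.5):
    """Keep the int(pre_beam_ratio * beam_size) best token ids."""
    pre_beam_size = int(pre_beam_ratio * beam_size)
    return heapq.nlargest(pre_beam_size, range(len(pre_scores)),
                          key=pre_scores.__getitem__)


def weighted_sum(scores, weights, ids):
    """Combine the per-scorer scores of the surviving token ids."""
    return {i: sum(weights[k] * scores[k][i] for k in scores) for i in ids}


scores = {
    "decoder": [-0.1, -2.0, -0.5, -3.0, -1.0, -4.0],
    "ctc": [-0.3, -0.2, -2.5, -0.9, -1.1, -3.0],
}
weights = {"decoder": 0.7, "ctc": 0.3}

ids = pre_beam_ids(scores["decoder"], beam_size=2)   # 3 candidates survive
combined = weighted_sum(scores, weights, ids)
best = heapq.nlargest(2, combined, key=combined.get)  # final beam of size 2
```

Token 1 has the best "ctc" score but never reaches the weighted ranking because the pre-beam is driven by the "decoder" scores alone; this is the trade-off that pre_beam_ratio controls.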
-
batch_beam
(weighted_scores: torch.Tensor, ids: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Batch-compute topk full token ids and partial token ids.
- Parameters:
weighted_scores (torch.Tensor) – The weighted sum scores for each token. Its shape is (n_beam, self.vocab_size).
ids (torch.Tensor) – The partial token ids to compute topk. Its shape is (n_beam, self.pre_beam_size).
- Returns:
The topk full (prev_hyp, new_token) ids and partial (prev_hyp, new_token) ids. Their shapes are all (self.beam_size,)
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
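Selecting the top-k entries of an (n_beam, vocab_size) score matrix amounts to a top-k over the flattened matrix, after which each flat index is split back into a hypothesis index and a token id. A minimal sketch with nested lists in place of tensors follows; flat_topk is a hypothetical name and the partial ids are not modelled.

```python
import heapq


def flat_topk(weighted_scores, beam_size):
    """Return (prev_hyp_ids, new_token_ids) of the best beam_size entries."""
    vocab_size = len(weighted_scores[0])
    flat = [s for row in weighted_scores for s in row]
    top = heapq.nlargest(beam_size, range(len(flat)), key=flat.__getitem__)
    prev_hyp_ids = [i // vocab_size for i in top]   # row: which hypothesis
    new_token_ids = [i % vocab_size for i in top]   # column: which token
    return prev_hyp_ids, new_token_ids


scores = [[-1.0, -0.2, -3.0],
          [-0.5, -2.0, -0.1]]
prev_hyp_ids, new_token_ids = flat_topk(scores, beam_size=2)
```

Here the two best entries are token 2 of hypothesis 1 and token 1 of hypothesis 0, so both hypotheses survive with one extension each.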
-
batchfy
(hyps: List[espnet.nets.beam_search.Hypothesis]) → espnet.nets.batch_beam_search.BatchHypothesis[source]¶ Convert list to batch.
-
init_hyp
(x: torch.Tensor) → espnet.nets.batch_beam_search.BatchHypothesis[source]¶ Get an initial hypothesis data.
- Parameters:
x (torch.Tensor) – The encoder output feature
- Returns:
The initial hypothesis.
- Return type:
BatchHypothesis
-
merge_states
(states: Any, part_states: Any, part_idx: int) → Any[source]¶ Merge states for new hypothesis.
- Parameters:
states – states of self.full_scorers
part_states – states of self.part_scorers
part_idx (int) – The new token id for part_scores
- Returns:
- The new score dict.
Its keys are names of self.full_scorers and self.part_scorers. Its values are states of the scorers.
- Return type:
Dict[str, torch.Tensor]
-
post_process
(i: int, maxlen: int, maxlenratio: float, running_hyps: espnet.nets.batch_beam_search.BatchHypothesis, ended_hyps: List[espnet.nets.beam_search.Hypothesis]) → espnet.nets.batch_beam_search.BatchHypothesis[source]¶ Perform post-processing of beam search iterations.
- Parameters:
i (int) – The length of hypothesis tokens.
maxlen (int) – The maximum length of tokens in beam search.
maxlenratio (float) – The maximum length ratio in beam search.
running_hyps (BatchHypothesis) – The running hypotheses in beam search.
ended_hyps (List[Hypothesis]) – The ended hypotheses in beam search.
- Returns:
The new running hypotheses.
- Return type:
BatchHypothesis
-
score_full
(hyp: espnet.nets.batch_beam_search.BatchHypothesis, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.full_scorers.
- Parameters:
hyp (Hypothesis) – Hypothesis with prefix tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns:
- Tuple of
score dict of hyp that has string keys of self.full_scorers and tensor score values of shape: (self.n_vocab,), and state dict that has string keys and state values of self.full_scorers
- Return type:
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
score_partial
(hyp: espnet.nets.batch_beam_search.BatchHypothesis, ids: torch.Tensor, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.part_scorers.
- Parameters:
hyp (Hypothesis) – Hypothesis with prefix tokens to score
ids (torch.Tensor) – 2D tensor of new partial tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns:
- Tuple of
score dict of hyp that has string keys of self.full_scorers and tensor score values of shape: (self.n_vocab,), and state dict that has string keys and state values of self.full_scorers
- Return type:
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
search
(running_hyps: espnet.nets.batch_beam_search.BatchHypothesis, x: torch.Tensor) → espnet.nets.batch_beam_search.BatchHypothesis[source]¶ Search new tokens for running hypotheses and encoded speech x.
- Parameters:
running_hyps (BatchHypothesis) – Running hypotheses on beam
x (torch.Tensor) – Encoded speech feature (T, D)
- Returns:
Best sorted hypotheses
- Return type:
BatchHypothesis
-
class
espnet.nets.batch_beam_search.
BatchHypothesis
[source]¶ Bases:
tuple
Batchfied/Vectorized hypothesis data type.
Create new instance of BatchHypothesis(yseq, score, length, scores, states)
-
length
¶ Alias for field number 2
-
score
¶ Alias for field number 1
-
scores
¶ Alias for field number 3
-
states
¶ Alias for field number 4
-
yseq
¶ Alias for field number 0
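Because BatchHypothesis is a named tuple, its fields can be read either by name or by the positions listed above. The sketch below uses a stand-in class with the same field order instead of importing ESPnet, and plain lists instead of tensors. The batchfy_sketch helper and its padding value are assumptions for illustration.

```python
from typing import Any, Dict, List, NamedTuple


class BatchHypothesisSketch(NamedTuple):
    """Stand-in with the same field order as BatchHypothesis."""
    yseq: List[List[int]]           # field 0: padded token sequences
    score: List[float]              # field 1: total score per hypothesis
    length: List[int]               # field 2: unpadded sequence lengths
    scores: Dict[str, List[float]]  # field 3: per-scorer scores
    states: Dict[str, Any]          # field 4: per-scorer states


def batchfy_sketch(hyps, pad_id):
    """Stack (yseq, score) pairs into one padded batch."""
    max_len = max(len(yseq) for yseq, _ in hyps)
    return BatchHypothesisSketch(
        yseq=[yseq + [pad_id] * (max_len - len(yseq)) for yseq, _ in hyps],
        score=[score for _, score in hyps],
        length=[len(yseq) for yseq, _ in hyps],
        scores={},
        states={},
    )


batch = batchfy_sketch([([1, 5, 7], -0.4), ([1, 3], -0.9)], pad_id=0)
```

Keeping length next to the padded yseq lets later steps recover each hypothesis without guessing where the padding begins.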
-
espnet.nets.chainer_backend.e2e_asr_transformer¶
Transformer-based model for End-to-end ASR.
-
class
espnet.nets.chainer_backend.e2e_asr_transformer.
E2E
(idim, odim, args, ignore_id=-1, flag_return=True)[source]¶ Bases:
espnet.nets.chainer_backend.asr_interface.ChainerASRInterface
E2E module.
- Parameters:
idim (int) – Input dimensions.
odim (int) – Output dimensions.
args (Namespace) – Training config.
ignore_id (int, optional) – Id for ignoring a character.
flag_return (bool, optional) – If true, return a list with (loss, loss_ctc, loss_att, acc) in forward. Otherwise, return loss.
Initialize the transformer.
-
static
add_arguments
(parser)[source]¶ Customize flags for transformer setup.
- Parameters:
parser (Namespace) – Training config.
-
property
attention_plot_class
¶ Attention plot function.
Redirects to PlotAttentionReport
- Returns:
PlotAttentionReport
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ E2E attention calculation.
- Parameters:
xs (List[tuple()]) – List of padded input sequences. [(T1, idim), (T2, idim), …]
ilens (ndarray) – Batch of lengths of input sequences. (B)
ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]
- Returns:
Attention weights. (B, Lmax, Tmax)
- Return type:
float ndarray
-
static
custom_parallel_updater
(iters, optimizer, converter, devices, accum_grad=1)[source]¶ Get custom_parallel_updater of the model.
-
static
custom_updater
(iters, optimizer, converter, device=-1, accum_grad=1)[source]¶ Get custom_updater of the model.
-
forward
(xs, ilens, ys_pad, calculate_attentions=False)[source]¶ E2E forward propagation.
- Parameters:
xs (chainer.Variable) – Batch of padded character ids. (B, Tmax)
ilens (chainer.Variable) – Batch of length of each input batch. (B,)
ys (chainer.Variable) – Batch of padded target features. (B, Lmax, odim)
calculate_attentions (bool) – If true, return value is the output of encoder.
- Returns:
Training loss. Optionally also the training loss for CTC (float), the training loss for attention (float), the accuracy (float) and the output of the encoder (chainer.Variable).
- Return type:
float
-
recognize
(x_block, recog_args, char_list=None, rnnlm=None)[source]¶ E2E recognition function.
- Parameters:
x (ndarray) – Input acoustic feature (B, T, D) or (T, D).
recog_args (Namespace) – Argument namespace containing options.
char_list (List[str]) – List of characters.
rnnlm (chainer.Chain) – Language model module defined at espnet.lm.chainer_backend.lm.
- Returns:
N-best decoding results.
- Return type:
List
-
recognize_beam
(h, lpz, recog_args, char_list=None, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
h (ndarray) – Encoder output features (B, T, D) or (T, D).
lpz (ndarray) – Log probabilities from CTC.
recog_args (Namespace) – Argument namespace containing options.
char_list (List[str]) – List of characters.
rnnlm (chainer.Chain) – Language model module defined at espnet.lm.chainer_backend.lm.
- Returns:
N-best decoding results.
- Return type:
List
espnet.nets.chainer_backend.ctc¶
espnet.nets.chainer_backend.asr_interface¶
ASR Interface module.
-
class
espnet.nets.chainer_backend.asr_interface.
ChainerASRInterface
(**links)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface, chainer.link.Chain
ASR Interface for ESPnet model implementation.
espnet.nets.chainer_backend.deterministic_embed_id¶
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedID
(in_size, out_size, initialW=None, ignore_label=None)[source]¶ Bases:
chainer.link.Link
Efficient linear layer for one-hot input.
This is a link that wraps the embed_id() function. This link holds the ID (word) embedding matrix W as a parameter.
- Parameters:
in_size (int) – Number of different identifiers (a.k.a. vocabulary size).
out_size (int) – Output dimension.
initialW (Initializer) – Initializer to initialize the weight.
ignore_label (int) – If ignore_label is an int value, the i-th row of the return value is filled with 0 wherever x[i] equals ignore_label.
See also
embed_id()
-
W
¶ Embedding parameter matrix.
- Type:
Variable
Examples
>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> l = L.EmbedID(W.shape[0], W.shape[1], initialW=W)
>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> y = l(x)
>>> y.data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
-
ignore_label
= None¶
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedIDFunction
(ignore_label=None)[source]¶ Bases:
chainer.function_node.FunctionNode
-
backward
(indexes, grad_outputs)[source]¶ Computes gradients w.r.t. specified inputs given output gradients.
This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.
Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.
The default implementation returns Nones, which means the function is not differentiable.
- Parameters:
target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.
grad_outputs (tuple of Variables) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.
- Returns:
Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.
See also
backward_accumulate() provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.
-
check_type_forward
(in_types)[source]¶ Checks types of input data before forward propagation.
This method is called before forward() and validates the types of input variables using the type checking utilities.
- Parameters:
in_types (TypeInfoTuple) – The type information of input variables for forward().
-
forward
(inputs)[source]¶ Computes the output arrays from the input arrays.
It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.
- Parameters:
inputs – Tuple of input array(s).
- Returns:
Tuple of output array(s).
Warning
Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedIDGrad
(w_shape, ignore_label=None)[source]¶ Bases:
chainer.function_node.FunctionNode
-
backward
(indexes, grads)[source]¶ Computes gradients w.r.t. specified inputs given output gradients.
This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.
Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.
The default implementation returns Nones, which means the function is not differentiable.
- Parameters:
target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.
grad_outputs (tuple of Variables) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.
- Returns:
Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.
See also
backward_accumulate() provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.
-
forward
(inputs)[source]¶ Computes the output arrays from the input arrays.
It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.
- Parameters:
inputs – Tuple of input array(s).
- Returns:
Tuple of output array(s).
Warning
Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.
-
espnet.nets.chainer_backend.deterministic_embed_id.
embed_id
(x, W, ignore_label=None)[source]¶ Efficient linear function for one-hot input.
This function implements so-called word embeddings. It takes two arguments: a set of IDs (words) x as a \(B\)-dimensional integer vector, and a set of all ID (word) embeddings W as a \(V \times d\) float32 matrix. It outputs a \(B \times d\) matrix whose i-th row is the x[i]-th row of W. This function is only differentiable on the input W.
- Parameters:
x (chainer.Variable | np.ndarray) – Batch vectors of IDs. Each element must be signed integer.
W (chainer.Variable | np.ndarray) – Distributed representation of each ID (a.k.a. word embeddings).
ignore_label (int) – If ignore_label is an int value, the i-th row of the return value is filled with 0 wherever x[i] equals ignore_label.
- Returns:
Embedded variable.
- Return type:
chainer.Variable
See also
EmbedID
Examples
>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> F.embed_id(x, W).data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
>>> F.embed_id(x, W, ignore_label=1).data
array([[ 2.,  2.,  2.],
       [ 0.,  0.,  0.]], dtype=float32)
espnet.nets.chainer_backend.e2e_asr¶
RNN sequence-to-sequence speech recognition model (chainer).
-
class
espnet.nets.chainer_backend.e2e_asr.
E2E
(idim, odim, args, flag_return=True)[source]¶ Bases:
espnet.nets.chainer_backend.asr_interface.ChainerASRInterface
E2E module for chainer backend.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (parser.args) – Training config.
flag_return (bool) – If True, train() would return additional metrics in addition to the training loss.
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ E2E attention calculation.
- Parameters:
xs (List) – List of padded input sequences. [(T1, idim), (T2, idim), …]
ilens (np.ndarray) – Batch of lengths of input sequences. (B)
ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]
- Returns:
Attention weights. (B, Lmax, Tmax)
- Return type:
float np.ndarray
-
static
custom_parallel_updater
(iters, optimizer, converter, devices, accum_grad=1)[source]¶ Get custom_parallel_updater of the model.
-
static
custom_updater
(iters, optimizer, converter, device=-1, accum_grad=1)[source]¶ Get custom_updater of the model.
-
forward
(xs, ilens, ys)[source]¶ E2E forward propagation.
- Parameters:
xs (chainer.Variable) – Batch of padded character ids. (B, Tmax)
ilens (chainer.Variable) – Batch of length of each input batch. (B,)
ys (chainer.Variable) – Batch of padded target features. (B, Lmax, odim)
- Returns:
Loss calculated from the attention and CTC losses. Optionally also the CTC loss (float), the attention loss (float) and the accuracy (float).
- Return type:
float
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E greedy/beam search.
- Parameters:
x (chainer.Variable) – Input tensor for recognition.
recog_args (parser.args) – Arguments of config file.
char_list (List[str]) – List of Characters.
rnnlm (Module) – RNNLM module defined at espnet.lm.chainer_backend.lm.
- Returns:
Result of recognition.
- Return type:
List[Dict[str, Any]]
espnet.nets.chainer_backend.nets_utils¶
espnet.nets.chainer_backend.__init__¶
Initialize sub package.
espnet.nets.chainer_backend.transformer.encoder_layer¶
Class Declaration of Transformer’s Encoder Block.
-
class
espnet.nets.chainer_backend.transformer.encoder_layer.
EncoderLayer
(n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Single encoder layer module.
- Parameters:
n_units (int) – Number of input/output dimension of a FeedForward layer.
d_units (int) – Number of units of hidden layer in a FeedForward layer.
h (int) – Number of attention heads.
dropout (float) – Dropout rate
Initialize EncoderLayer.
espnet.nets.chainer_backend.transformer.encoder¶
Class Declaration of Transformer’s Encoder.
-
class
espnet.nets.chainer_backend.transformer.encoder.
Encoder
(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.chainer_backend.transformer.embedding.PositionalEncoding'>, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Encoder.
- Parameters:
input_type (str) – Sampling type. input_type must be conv2d or ‘linear’ currently.
idim (int) – Dimension of inputs.
n_layers (int) – Number of encoder layers.
n_units (int) – Number of input/output dimension of a FeedForward layer.
d_units (int) – Number of units of hidden layer in a FeedForward layer.
h (int) – Number of attention heads.
dropout (float) – Dropout rate
Initialize Encoder.
- Parameters:
idim (int) – Input dimension.
args (Namespace) – Training config.
initialW (int, optional) – Initializer to initialize the weight.
initial_bias (bool, optional) – Initializer to initialize the bias.
-
forward
(e, ilens)[source]¶ Compute Encoder layer.
- Parameters:
e (chainer.Variable) – Batch of padded character. (B, Tmax)
ilens (chainer.Variable) – Batch of length of each input batch. (B,)
- Returns:
Computed variable of encoder, together with the mask (numpy.array) and the batch of lengths of each encoder output (chainer.Variable).
- Return type:
chainer.Variable
espnet.nets.chainer_backend.transformer.layer_norm¶
Class Declaration of Transformer’s Label Smoothing loss.
espnet.nets.chainer_backend.transformer.ctc¶
Class Declaration of Transformer’s CTC.
-
class
espnet.nets.chainer_backend.transformer.ctc.
CTC
(odim, eprojs, dropout_rate)[source]¶ Bases:
chainer.link.Chain
Chainer implementation of ctc layer.
- Parameters:
odim (int) – The output dimension.
eprojs (int | None) – Dimension of input vectors from encoder.
dropout_rate (float) – Dropout rate.
Initialize CTC.
espnet.nets.chainer_backend.transformer.decoder_layer¶
Class Declaration of Transformer’s Decoder Block.
-
class
espnet.nets.chainer_backend.transformer.decoder_layer.
DecoderLayer
(n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Single decoder layer module.
- Parameters:
n_units (int) – Number of input/output dimension of a FeedForward layer.
d_units (int) – Number of units of hidden layer in a FeedForward layer.
h (int) – Number of attention heads.
dropout (float) – Dropout rate
Initialize DecoderLayer.
espnet.nets.chainer_backend.transformer.embedding¶
Class Declaration of Transformer’s Positional Encoding.
-
class
espnet.nets.chainer_backend.transformer.embedding.
PositionalEncoding
(n_units, dropout=0.1, length=5000)[source]¶ Bases:
chainer.link.Chain
Positional encoding module.
- Parameters:
n_units (int) – embedding dim
dropout (float) – dropout rate
length (int) – maximum input length
Initialize Positional Encoding.
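The n_units and length parameters fix the shape of a precomputed table with one row per position. The sketch below assumes the standard sinusoidal formulation from the Transformer paper, with sine on even columns and cosine on odd columns, and uses plain Python lists. Dropout is not modelled, and positional_encoding is a hypothetical helper rather than this class's API.

```python
import math


def positional_encoding(length, n_units):
    """Return a (length, n_units) table of sinusoidal position codes."""
    table = [[0.0] * n_units for _ in range(length)]
    for pos in range(length):
        for i in range(0, n_units, 2):
            # Columns i and i + 1 share one wavelength.
            angle = pos / (10000 ** (i / n_units))
            table[pos][i] = math.sin(angle)
            if i + 1 < n_units:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = positional_encoding(length=5, n_units=4)
```

Position 0 always maps to alternating 0 and 1, and the first column advances as sin(pos), so nearby positions receive similar but distinct codes.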
espnet.nets.chainer_backend.transformer.positionwise_feed_forward¶
Class Declaration of Transformer’s Positionwise Feedforward.
-
class
espnet.nets.chainer_backend.transformer.positionwise_feed_forward.
PositionwiseFeedForward
(n_units, d_units=0, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Positionwise feed forward.
- Parameters:
idim (int) – input dimension
hidden_units (int) – number of hidden units
dropout_rate (float) – dropout rate
Initialize PositionwiseFeedForward.
- Parameters:
n_units (int) – Input dimension.
d_units (int, optional) – Output dimension of hidden layer.
dropout (float, optional) – Dropout ratio.
initialW (int, optional) – Initializer to initialize the weight.
initial_bias (bool, optional) – Initializer to initialize the bias.
espnet.nets.chainer_backend.transformer.label_smoothing_loss¶
Class Declaration of Transformer’s Label Smoothing loss.
-
class
espnet.nets.chainer_backend.transformer.label_smoothing_loss.
LabelSmoothingLoss
(smoothing, n_target_vocab, normalize_length=False, ignore_id=-1)[source]¶ Bases:
chainer.link.Chain
Label Smoothing Loss.
- Parameters:
smoothing (float) – smoothing rate (0.0 means the conventional CE).
n_target_vocab (int) – number of classes.
normalize_length (bool) – normalize loss by sequence length if True.
Initialize Loss.
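The smoothing rate controls how much probability mass is moved from the reference class to the other classes. The following is a minimal sketch of one common formulation, cross entropy against a smoothed target that spreads smoothing evenly over the n_target_vocab - 1 wrong classes. Whether this class uses exactly this variant is an assumption; length normalization and ignore_id are not modelled.

```python
import math


def label_smoothing_loss(log_probs, target, smoothing):
    """Cross entropy of one prediction against a smoothed one-hot target."""
    n_classes = len(log_probs)
    confidence = 1.0 - smoothing
    off_value = smoothing / (n_classes - 1)
    return -sum(
        (confidence if k == target else off_value) * lp
        for k, lp in enumerate(log_probs)
    )


log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
plain = label_smoothing_loss(log_probs, target=0, smoothing=0.0)
smoothed = label_smoothing_loss(log_probs, target=0, smoothing=0.1)
```

With smoothing=0.0 the loss reduces to the conventional cross entropy -log(0.7); a positive rate also penalizes low probabilities on the non-reference classes, which discourages over-confident predictions.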
espnet.nets.chainer_backend.transformer.attention¶
Class Declaration of Transformer’s Attention.
-
class
espnet.nets.chainer_backend.transformer.attention.
MultiHeadAttention
(n_units, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Multi Head Attention Layer.
- Parameters:
n_units (int) – Number of input units.
h (int) – Number of attention heads.
dropout (float) – Dropout rate.
initialW – Initializer to initialize the weight.
initial_bias – Initializer to initialize the bias.
h – the number of heads
n_units – the number of features
dropout_rate (float) – dropout rate
Initialize MultiHeadAttention.
-
forward
(e_var, s_var=None, mask=None, batch=1)[source]¶ Core function of the Multi-head attention layer.
- Parameters:
e_var (chainer.Variable) – Variable of input array.
s_var (chainer.Variable) – Variable of source array from encoder.
mask (chainer.Variable) – Attention mask.
batch (int) – Batch size.
- Returns:
Output of multi-head attention layer.
- Return type:
chainer.Variable
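The inputs above map onto queries (e_var), keys and values (s_var, or e_var itself for self-attention) and a mask. The sketch below shows the computation for a single head on plain lists, assuming the standard scaled dot-product formulation. The learned projections, dropout and the split into h heads are omitted, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_attention(queries, keys, values, mask=None):
    """Single-head attention; mask[i][j] is False where j must be ignored."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        if mask is not None:
            # A large negative score drives the softmax weight to zero.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


queries = [[1.0, 0.0]]
keys = [[0.0, 0.0], [0.0, 0.0]]
values = [[2.0, 0.0], [4.0, 0.0]]
unmasked = scaled_dot_attention(queries, keys, values)
masked = scaled_dot_attention(queries, keys, values, mask=[[True, False]])
```

With identical keys the unmasked output is the mean of the two values, while masking the second key makes the output equal to the first value.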
espnet.nets.chainer_backend.transformer.decoder¶
Class Declaration of Transformer’s Decoder.
-
class
espnet.nets.chainer_backend.transformer.decoder.
Decoder
(odim, args, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Decoder layer.
- Parameters:
odim (int) – The output dimension.
n_layers (int) – Number of decoder layers.
n_units (int) – Number of attention units.
d_units (int) – Dimension of input vector of decoder.
h (int) – Number of attention heads.
dropout (float) – Dropout rate.
initialW (Initializer) – Initializer to initialize the weight.
initial_bias (Initializer) – Initializer to initialize the bias.
Initialize Decoder.
-
forward
(ys_pad, source, x_mask)[source]¶ Forward decoder.
- Parameters:
e (xp.array) – input token ids, int64 (batch, maxlen_out)
yy_mask (xp.array) – input token mask, uint8 (batch, maxlen_out)
source (xp.array) – encoded memory, float32 (batch, maxlen_in, feat)
xy_mask (xp.array) – encoded memory mask, uint8 (batch, maxlen_in)
- Returns:
decoded token score before softmax (batch, maxlen_out, token)
- Return type:
chainer.Variable
espnet.nets.chainer_backend.transformer.__init__¶
Initialize sub package.
espnet.nets.chainer_backend.transformer.training¶
Class Declaration of Transformer’s Training Subprocess.
-
class
espnet.nets.chainer_backend.transformer.training.
CustomConverter
[source]¶ Bases:
object
Custom Converter.
- Parameters:
subsampling_factor (int) – The subsampling factor.
Initialize subsampling.
-
class
espnet.nets.chainer_backend.transformer.training.
CustomParallelUpdater
(train_iters, optimizer, converter, devices, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater
Custom Parallel Updater for chainer.
Defines the main update routine.
- Parameters:
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (torch.device) – Device to which the training data is sent. Negative value indicates the host memory (CPU).
accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.
Initialize custom parallel updater.
-
class
espnet.nets.chainer_backend.transformer.training.
CustomUpdater
(train_iter, optimizer, converter, device, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.standard_updater.StandardUpdater
Custom updater for chainer.
- Parameters:
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (int or dict) – The destination device info to send variables. In the case of CPU or single GPU, device=-1 or 0, respectively. In the case of multi-GPU, device={“main”:0, “sub_1”: 1, …}.
accum_grad (int) – Number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.
Initialize Custom Updater.
-
class
espnet.nets.chainer_backend.transformer.training.
VaswaniRule
(attr, d, warmup_steps=4000, init=None, target=None, optimizer=None, scale=1.0)[source]¶ Bases:
chainer.training.extension.Extension
Trainer extension to shift an optimizer attribute magically by Vaswani.
- Parameters:
attr (str) – Name of the attribute to shift.
rate (float) – Rate of the exponential shift. This value is multiplied to the attribute at each call.
init (float) – Initial value of the attribute. If it is None, the extension extracts the attribute at the first call and uses it as the initial value.
target (float) – Target value of the attribute. If the attribute reaches this value, the shift stops.
optimizer (Optimizer) – Target optimizer to adjust the attribute. If it is None, the main optimizer of the updater is used.
Initialize Vaswani rule extension.
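For orientation, the learning-rate schedule published by Vaswani et al. ("Attention Is All You Need") can be sketched as below. This is a minimal sketch, not the extension's code: it assumes that d is the model dimension and that scale multiplies the result, and the name vaswani_lr is hypothetical.

```python
def vaswani_lr(step, d, warmup_steps=4000, scale=1.0):
    """Hypothetical helper: learning rate at a 1-based update step."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup for the first warmup_steps updates,
    # then decay proportional to the inverse square root of the step.
    return scale * d ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = vaswani_lr(4000, d=256)
```

Under these assumptions the rate rises until step equals warmup_steps and falls afterwards, so warmup_steps controls where the peak sits and d controls its height.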
espnet.nets.chainer_backend.transformer.subsampling¶
Class Declaration of Transformer’s Input layers.
-
class
espnet.nets.chainer_backend.transformer.subsampling.
Conv2dSubsampling
(channels, idim, dims, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Convolutional 2D subsampling (to 1/4 length).
- Parameters:
idim (int) – input dim
odim (int) – output dim
dropout_rate (float) – dropout rate
Initialize Conv2dSubsampling.
-
class
espnet.nets.chainer_backend.transformer.subsampling.
LinearSampling
(idim, dims, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Linear 1D subsampling.
- Parameters:
idim (int) – input dim
odim (int) – output dim
dropout_rate (float) – dropout rate
Initialize LinearSampling.
espnet.nets.chainer_backend.transformer.mask¶
Create mask for subsequent steps.
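A mask for subsequent steps is usually a lower-triangular matrix that stops position i from attending to any later position. The NumPy sketch below illustrates the idea; the function name and the boolean convention (True means "may attend") are assumptions, since this page does not show the module's signature.

```python
import numpy as np


def subsequent_mask(size):
    """Hypothetical sketch: (size, size) mask, True where attention is allowed."""
    # Row i is True for columns 0..i, so a step never sees future steps.
    return np.tril(np.ones((size, size), dtype=bool))


mask = subsequent_mask(3)
```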
espnet.nets.chainer_backend.rnn.attentions¶
-
class
espnet.nets.chainer_backend.rnn.attentions.
AttDot
(eprojs, dunits, att_dim)[source]¶ Bases:
chainer.link.Chain
Compute attention based on dot product.
- Parameters:
eprojs (int | None) – Dimension of input vectors from encoder.
dunits (int | None) – Dimension of input vectors for decoder.
att_dim (int) – Dimension of input vectors for attention.
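The role of eprojs, dunits and att_dim can be illustrated with a small NumPy sketch of dot-product attention for a single utterance. Everything here is an assumption made for illustration (the tanh projections, the weight shapes, and the name dot_attention); it is not the Chainer link itself.

```python
import numpy as np


def dot_attention(enc_hs, dec_state, w_enc, w_dec):
    """Hypothetical sketch of dot-product attention.

    enc_hs: (T, eprojs), dec_state: (dunits,),
    w_enc: (eprojs, att_dim), w_dec: (dunits, att_dim).
    """
    keys = np.tanh(enc_hs @ w_enc)      # (T, att_dim)
    query = np.tanh(dec_state @ w_dec)  # (att_dim,)
    energy = keys @ query               # (T,) one score per encoder frame
    energy = energy - energy.max()      # stabilize the softmax
    weights = np.exp(energy) / np.exp(energy).sum()
    context = weights @ enc_hs          # (eprojs,) weighted sum of frames
    return context, weights


rng = np.random.default_rng(0)
T, eprojs, dunits, att_dim = 5, 4, 3, 2
context, weights = dot_attention(
    rng.normal(size=(T, eprojs)),
    rng.normal(size=dunits),
    rng.normal(size=(eprojs, att_dim)),
    rng.normal(size=(dunits, att_dim)),
)
```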
-
class
espnet.nets.chainer_backend.rnn.attentions.
AttLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
chainer.link.Chain
Compute location-based attention.
- Parameters:
eprojs (int | None) – Dimension of input vectors from encoder.
dunits (int | None) – Dimension of input vectors for decoder.
att_dim (int) – Dimension of input vectors for attention.
aconv_chans (int) – Number of channels of output arrays from convolutional layer.
aconv_filts (int) – Size of filters of convolutional layer.
espnet.nets.chainer_backend.rnn.decoders¶
-
class
espnet.nets.chainer_backend.rnn.decoders.
Decoder
(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0)[source]¶ Bases:
chainer.link.Chain
Decoder layer.
- Parameters:
eprojs (int) – Dimension of input variables from encoder.
odim (int) – The output dimension.
dtype (str) – Decoder type.
dlayers (int) – Number of layers for decoder.
dunits (int) – Dimension of input vector of decoder.
sos (int) – Number to indicate the start of sequences.
eos (int) – Number to indicate the end of sequences.
att (Module) – Attention module defined at espnet.nets.chainer_backend.rnn.attentions.
verbose (int) – Verbosity level.
char_list (List[str]) – List of all characters.
labeldist (numpy.array) – Distributed array of counted transcript length.
lsm_weight (float) – Weight to use when calculating the training loss.
sampling_probability (float) – Threshold for scheduled sampling.
-
calculate_all_attentions
(hs, ys)[source]¶ Calculate all of attentions.
- Parameters:
hs (list of chainer.Variable | N-dimensional array) – Input variable from encoder.
ys (list of chainer.Variable | N-dimensional array) – Input variable of decoder.
- Returns:
List of attention weights.
- Return type:
chainer.Variable
-
recognize_beam
(h, lpz, recog_args, char_list, rnnlm=None)[source]¶ Beam search implementation.
- Parameters:
h (chainer.Variable) – One of the output from the encoder.
lpz (chainer.Variable | None) – Result of net propagation.
recog_args (Namespace) – The argument.
char_list (List[str]) – List of all characters.
rnnlm (Module) – RNNLM module. Defined at espnet.lm.chainer_backend.lm
- Returns:
Result of recognition.
- Return type:
List[Dict[str,Any]]
-
espnet.nets.chainer_backend.rnn.decoders.
decoder_for
(args, odim, sos, eos, att, labeldist)[source]¶ Return the decoding layer corresponding to the args.
- Parameters:
args (Namespace) – The program arguments.
odim (int) – The output dimension.
sos (int) – Number to indicate the start of sequences.
eos (int) – Number to indicate the end of sequences.
att (Module) – Attention module defined at espnet.nets.chainer_backend.rnn.attentions.
labeldist (numpy.array) – Distributed array of length of transcript.
- Returns:
The decoder module.
- Return type:
chainer.Chain
espnet.nets.chainer_backend.rnn.encoders¶
-
class
espnet.nets.chainer_backend.rnn.encoders.
Encoder
(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]¶ Bases:
chainer.link.Chain
Encoder network class.
- Parameters:
etype (str) – Type of encoder network.
idim (int) – Number of dimensions of encoder network.
elayers (int) – Number of layers of encoder network.
eunits (int) – Number of lstm units of encoder network.
eprojs (int) – Number of projection units of encoder network.
subsample (np.array) – Subsampling number. e.g. 1_2_2_2_1
dropout (float) – Dropout rate.
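The subsample string such as 1_2_2_2_1 can be read as one frame-skipping factor per stage. The sketch below is a hedged illustration of that reading: the helper parse_subsample is hypothetical, and applying each factor by slicing is an assumption about how the factors are consumed.

```python
import numpy as np


def parse_subsample(spec):
    """Hypothetical helper: turn '1_2_2_2_1' into an integer array."""
    return np.array([int(v) for v in spec.split("_")], dtype=int)


factors = parse_subsample("1_2_2_2_1")
total_reduction = int(np.prod(factors))

# Apply the factors one after another to a dummy sequence of 40 frames.
frames = np.arange(40)
for factor in factors:
    frames = frames[::factor]
```

With these factors the sequence length shrinks by the product of all entries, which is why the overall time reduction is worth checking before choosing eprojs and the attention settings.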
-
class
espnet.nets.chainer_backend.rnn.encoders.
RNN
(idim, elayers, cdim, hdim, dropout, typ='lstm')[source]¶ Bases:
chainer.link.Chain
RNN Module.
- Parameters:
idim (int) – Dimension of the input.
elayers (int) – Number of encoder layers.
cdim (int) – Number of rnn units.
hdim (int) – Number of projection units.
dropout (float) – Dropout rate.
typ (str) – Rnn type.
-
class
espnet.nets.chainer_backend.rnn.encoders.
RNNP
(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]¶ Bases:
chainer.link.Chain
RNN with projection layer module.
- Parameters:
idim (int) – Dimension of inputs.
elayers (int) – Number of encoder layers.
cdim (int) – Number of rnn units. (resulted in cdim * 2 if bidirectional)
hdim (int) – Number of projection units.
subsample (np.ndarray) – List of subsampling factors applied to the input array.
dropout (float) – Dropout rate.
typ (str) – The RNN type.
espnet.nets.chainer_backend.rnn.__init__¶
Initialize sub package.
espnet.nets.chainer_backend.rnn.training¶
-
class
espnet.nets.chainer_backend.rnn.training.
CustomConverter
(subsampling_factor=1)[source]¶ Bases:
object
Custom Converter.
- Parameters:
subsampling_factor (int) – The subsampling factor.
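What a subsampling factor does to a batch can be sketched as follows. This is an illustrative assumption about the converter's behaviour (keep every subsampling_factor-th frame and recompute the lengths); the function name subsample_batch is hypothetical.

```python
import numpy as np


def subsample_batch(xs, subsampling_factor=1):
    """Hypothetical sketch: keep every subsampling_factor-th frame."""
    if subsampling_factor > 1:
        xs = [x[::subsampling_factor, :] for x in xs]
    # Lengths are recomputed after subsampling.
    ilens = [x.shape[0] for x in xs]
    return xs, ilens


xs = [np.zeros((7, 3)), np.zeros((4, 3))]
sub_xs, ilens = subsample_batch(xs, subsampling_factor=2)
```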
-
class
espnet.nets.chainer_backend.rnn.training.
CustomParallelUpdater
(train_iters, optimizer, converter, devices, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater
Custom Parallel Updater for chainer.
Defines the main update routine.
- Parameters:
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (torch.device) – Device to which the training data is sent. A negative value indicates the host memory (CPU).
accum_grad (int) – Number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.
-
class
espnet.nets.chainer_backend.rnn.training.
CustomUpdater
(train_iter, optimizer, converter, device, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.standard_updater.StandardUpdater
Custom updater for chainer.
- Parameters:
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (int or dict) – The destination device info to send variables. In the case of CPU or single GPU, device=-1 or 0, respectively. In the case of multi-GPU, device={“main”:0, “sub_1”: 1, …}.
accum_grad (int) – Number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.
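The effect of accum_grad can be shown with a toy scalar example. This is a sketch under stated assumptions, not the updater's code: gradients are plain floats, each one is divided by accum_grad so that the update uses their average, and the name train_with_accumulation is hypothetical.

```python
def train_with_accumulation(batch_grads, accum_grad=1, lr=0.1, param=0.0):
    """Hypothetical sketch: one parameter update per accum_grad mini-batches."""
    accumulated = 0.0
    n_updates = 0
    for i, grad in enumerate(batch_grads, start=1):
        # Assumption: average the gradients of the accumulated mini-batches.
        accumulated += grad / accum_grad
        if i % accum_grad == 0:
            param -= lr * accumulated
            accumulated = 0.0
            n_updates += 1
    return param, n_updates


param, n_updates = train_with_accumulation([1.0, 3.0, 2.0, 2.0], accum_grad=2)
```

Four mini-batches with accum_grad=2 give two updates, each computed from two mini-batches, which is the sense in which the batch size is doubled.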
espnet.nets.scorers.ngram¶
N-gram LM implementation.
-
class
espnet.nets.scorers.ngram.
NgramFullScorer
(ngram_model, token_list)[source]¶ Bases:
espnet.nets.scorers.ngram.Ngrambase, espnet.nets.scorer_interface.BatchScorerInterface
Fullscorer for ngram.
Initialize Ngrambase.
- Parameters:
ngram_model – ngram model path
token_list – token list from dict or model.json
-
score
(y, state, x)[source]¶ Score interface for both full and partial scorer.
- Parameters:
y – previous char
state – previous state
x – encoded feature
- Returns:
- Tuple of
batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
class
espnet.nets.scorers.ngram.
NgramPartScorer
(ngram_model, token_list)[source]¶ Bases:
espnet.nets.scorers.ngram.Ngrambase, espnet.nets.scorer_interface.PartialScorerInterface
Partialscorer for ngram.
Initialize Ngrambase.
- Parameters:
ngram_model – ngram model path
token_list – token list from dict or model.json
-
score_partial
(y, next_token, state, x)[source]¶ Score interface for both full and partial scorer.
- Parameters:
y – previous char
next_token – next token that needs to be scored
state – previous state
x – encoded feature
- Returns:
- Tuple of
batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
class
espnet.nets.scorers.ngram.
Ngrambase
(ngram_model, token_list)[source]¶ Bases:
abc.ABC
Ngram base implemented through ScorerInterface.
Initialize Ngrambase.
- Parameters:
ngram_model – ngram model path
token_list – token list from dict or model.json
-
score_partial_
(y, next_token, state, x)[source]¶ Score interface for both full and partial scorer.
- Parameters:
y – previous char
next_token – next token that needs to be scored
state – previous state
x – encoded feature
- Returns:
- Tuple of
batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
espnet.nets.scorers.length_bonus¶
Length bonus module.
-
class
espnet.nets.scorers.length_bonus.
LengthBonus
(n_vocab: int)[source]¶ Bases:
espnet.nets.scorer_interface.BatchScorerInterface
Length bonus in beam search.
Initialize class.
- Parameters:
n_vocab (int) – The number of tokens in vocabulary for beam search
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
score
(y, state, x)[source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns:
- Tuple of
torch.float32 scores for next token (n_vocab) and None
- Return type:
tuple[torch.Tensor, Any]
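How a length bonus interacts with other scores in beam search can be sketched with NumPy. The sketch assumes the scorer emits a constant 1.0 for every candidate token and that beam search adds it with a weight; both the constant and the helper names are assumptions made for illustration.

```python
import numpy as np


def length_bonus_score(n_vocab):
    """Hypothetical sketch: the same score of 1.0 for every candidate token."""
    return np.ones(n_vocab, dtype=np.float32)


def combine_scores(decoder_logp, bonus_weight):
    # Every appended token adds bonus_weight to the hypothesis score, which
    # offsets the bias of summed log-probabilities toward short outputs.
    return decoder_logp + bonus_weight * length_bonus_score(len(decoder_logp))


decoder_logp = np.log(np.array([0.5, 0.25, 0.25], dtype=np.float32))
combined = combine_scores(decoder_logp, bonus_weight=0.5)
```

Because the bonus is identical for all tokens, it does not change the ranking within one step; it only changes how hypotheses of different lengths compare.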
espnet.nets.scorers.ctc¶
ScorerInterface implementation for CTC.
-
class
espnet.nets.scorers.ctc.
CTCPrefixScorer
(ctc: torch.nn.modules.module.Module, eos: int)[source]¶ Bases:
espnet.nets.scorer_interface.BatchPartialScorerInterface
Decoder interface wrapper for CTCPrefixScore.
Initialize class.
- Parameters:
ctc (torch.nn.Module) – The CTC implementation. For example,
espnet.nets.pytorch_backend.ctc.CTC
eos (int) – The end-of-sequence id.
-
batch_init_state
(x: torch.Tensor)[source]¶ Get an initial state for decoding.
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
batch_score_partial
(y, ids, state, x)[source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D prefix token
ids (torch.Tensor) – torch.int64 next token to score
state – decoder state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys
- Returns:
Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
extend_prob
(x: torch.Tensor)[source]¶ Extend probs for decoding.
This extension is for streaming decoding as in Eq (14) in https://arxiv.org/abs/2006.14941
- Parameters:
x (torch.Tensor) – The encoded feature tensor
-
extend_state
(state)[source]¶ Extend state for decoding.
This extension is for streaming decoding as in Eq (14) in https://arxiv.org/abs/2006.14941
- Parameters:
state – The states of hyps
Returns: extended state
-
init_state
(x: torch.Tensor)[source]¶ Get an initial state for decoding.
- Parameters:
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score_partial
(y, ids, state, x)[source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D prefix token
ids (torch.Tensor) – torch.int64 next token to score
state – decoder state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys
- Returns:
Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
select_state
(state, i, new_id=None)[source]¶ Select state with relative ids in the main beam search.
- Parameters:
state – Decoder state for prefix tokens
i (int) – Index to select a state in the main beam search
new_id (int) – New label id to select a state if necessary
- Returns:
pruned state
- Return type:
state
espnet.nets.scorers.uasr¶
ScorerInterface implementation for UASR.
-
class
espnet.nets.scorers.uasr.
UASRPrefixScorer
(eos: int)[source]¶ Bases:
espnet.nets.scorers.ctc.CTCPrefixScorer
Decoder interface wrapper for CTCPrefixScore.
Initialize class.
espnet.nets.scorers.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.e2e_asr_mix¶
This script is used to construct End-to-End models of multi-speaker ASR.
- Copyright 2017 Johns Hopkins University (Shinji Watanabe)
Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Initialize multi-speaker E2E module.
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)
- Returns:
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray
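The (B, Tmax, idim) and (B) shapes used by these methods come from padding variable-length sequences into one batch. A minimal NumPy sketch of that step is shown below; the helper pad_sequences is hypothetical and is not a function of this package.

```python
import numpy as np


def pad_sequences(xs, pad_value=0.0):
    """Hypothetical helper: pad (T_i, idim) arrays to (B, Tmax, idim)."""
    ilens = np.array([x.shape[0] for x in xs])
    padded = np.full(
        (len(xs), int(ilens.max()), xs[0].shape[1]), pad_value, dtype=xs[0].dtype
    )
    for i, x in enumerate(xs):
        padded[i, : x.shape[0]] = x  # frames beyond T_i keep pad_value
    return padded, ilens


xs_pad, ilens = pad_sequences([np.ones((5, 2)), np.ones((3, 2))])
```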
-
enhance
(xs)[source]¶ Forward only the frontend stage.
- Parameters:
xs (ndarray) – input acoustic feature (T, C, F)
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)
- Returns:
ctc loss value
- Return type:
torch.Tensor
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
init_like_chainer
()[source]¶ Initialize weight like chainer.
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0.
PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know:
- EmbedID.W ~ Normal(0, 1)
- LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)
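The two weight distributions named above can be compared numerically. The NumPy sketch below is only an illustration: it reads the second argument of Normal as a standard deviation, and the helper names are hypothetical.

```python
import numpy as np


def lecun_normal(fan_out, fan_in, rng):
    """W ~ Normal(0, fan_in ** -0.5), with the second argument taken as std."""
    return rng.normal(0.0, fan_in ** -0.5, size=(fan_out, fan_in))


def uniform_fan_in(fan_out, fan_in, rng):
    """W ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5)."""
    bound = fan_in ** -0.5
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))


rng = np.random.default_rng(0)
w_normal = lecun_normal(256, 400, rng)
w_uniform = uniform_fan_in(256, 400, rng)
```

For the same fan_in the uniform variant has a smaller spread (its standard deviation is the bound divided by the square root of 3), which is one reason a model can train differently depending on the scheme.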
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
x (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
recognize_batch
(xs, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
xs (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
EncoderMix
(etype, idim, elayers_sd, elayers_rec, eunits, eprojs, subsample, dropout, num_spkrs=2, in_channel=1)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module for the case of multi-speaker mixture speech.
- Parameters:
etype (str) – type of encoder network
idim (int) – number of dimensions of encoder network
elayers_sd (int) – number of layers of the speaker-differentiating part in encoder network
elayers_rec (int) – number of layers of shared recognition part in encoder network
eunits (int) – number of lstm units of encoder network
eprojs (int) – number of projection units of encoder network
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
in_channel (int) – number of input channels
num_spkrs (int) – number of speakers
Initialize the encoder of single-channel multi-speaker ASR.
-
forward
(xs_pad, ilens)[source]¶ Encodermix forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
- Returns:
list: batch of hidden state sequences [num_spkrs x (B, Tmax, eprojs)]
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
PIT
(num_spkrs)[source]¶ Bases:
object
Permutation Invariant Training (PIT) module.
- Parameters:
num_spkrs (int) – number of speakers for PIT process (2 or 3)
Initialize PIT module.
-
min_pit_sample
(loss)[source]¶ Compute the PIT loss for each sample.
- Parameters:
loss (torch.Tensor, 1-D) – list of losses for one sample, including [h1r1, h1r2, h2r1, h2r2] or [h1r1, h1r2, h1r3, h2r1, h2r2, h2r3, h3r1, h3r2, h3r3]
- Returns:
minimum loss of the best permutation
- Return type:
torch.Tensor (1)
- Returns:
the best permutation
- Return type:
List: len=2
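The selection of the best permutation can be sketched in plain Python. The sketch assumes the flat layout described above, where entry h * num_spkrs + r pairs hypothesis h with reference r, and it averages the losses of one permutation; the averaging and the name min_pit_loss are assumptions.

```python
from itertools import permutations


def min_pit_loss(losses, num_spkrs):
    """Hypothetical sketch: pick the hypothesis-to-reference assignment
    with the lowest average loss."""
    best_loss, best_perm = None, None
    for perm in permutations(range(num_spkrs)):
        # perm[h] is the reference assigned to hypothesis h.
        total = sum(losses[h * num_spkrs + r] for h, r in enumerate(perm))
        total /= num_spkrs
        if best_loss is None or total < best_loss:
            best_loss, best_perm = total, list(perm)
    return best_loss, best_perm


# Layout: [h1r1, h1r2, h2r1, h2r2]
loss, perm = min_pit_loss([4.0, 1.0, 2.0, 5.0], num_spkrs=2)
```

In this example the swapped assignment (hypothesis 1 with reference 2, hypothesis 2 with reference 1) wins because its pairwise losses are the smaller ones.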
-
permutationDFS
(source, start)[source]¶ Get permutations with DFS.
The final result is all permutations of the ‘source’ sequence. e.g. [[1, 2], [2, 1]] or
[[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 2, 1], [3, 1, 2]]
- Parameters:
source (np.ndarray) – (num_spkrs, 1), e.g. [1, 2, …, N]
start (int) – the start point to permute
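A swap-based depth-first search reproduces the ordering shown in the example above. The sketch works on plain lists rather than np.ndarray and returns the result instead of storing it, so it is an illustration of the technique and not the method itself; the name permutation_dfs is hypothetical.

```python
def permutation_dfs(source, start=0, result=None):
    """Hypothetical sketch: collect all permutations of source by in-place swaps."""
    if result is None:
        result = []
    if start == len(source) - 1:
        result.append(list(source))  # copy the current arrangement
        return result
    for i in range(start, len(source)):
        source[start], source[i] = source[i], source[start]
        permutation_dfs(source, start + 1, result)
        source[start], source[i] = source[i], source[start]  # undo the swap
    return result


perms = permutation_dfs([1, 2, 3])
```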
espnet.nets.pytorch_backend.e2e_vc_transformer¶
Voice Transformer Network (Transformer-VC) related modules.
-
class
espnet.nets.pytorch_backend.e2e_vc_transformer.
Transformer
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module
VC Transformer module.
This is a module of the Voice Transformer Network (a.k.a. VTN or Transformer-VC) described in "Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining", which converts a sequence of acoustic features into another sequence of acoustic features.
Initialize Transformer-VC module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
eprenet_conv_layers (int): Number of encoder prenet convolution layers.
eprenet_conv_chans (int): Number of encoder prenet convolution channels.
eprenet_conv_filts (int): Filter size of encoder prenet convolution.
transformer_input_layer (str): Input layer before the encoder.
dprenet_layers (int): Number of decoder prenet layers.
dprenet_units (int): Number of decoder prenet hidden units.
elayers (int): Number of encoder layers.
eunits (int): Number of encoder hidden units.
adim (int): Number of attention transformation dimensions.
aheads (int): Number of heads for multi head attention.
dlayers (int): Number of decoder layers.
dunits (int): Number of decoder hidden units.
postnet_layers (int): Number of postnet layers.
postnet_chans (int): Number of postnet channels.
postnet_filts (int): Filter size of postnet.
use_scaled_pos_enc (bool): Whether to use trainable scaled positional encoding.
use_batch_norm (bool): Whether to use batch normalization in encoder prenet.
encoder_normalize_before (bool): Whether to perform layer normalization before encoder block.
decoder_normalize_before (bool): Whether to perform layer normalization before decoder block.
encoder_concat_after (bool): Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool): Whether to concatenate attention layer’s input and output in decoder.
reduction_factor (int): Reduction factor (for decoder).
encoder_reduction_factor (int): Reduction factor (for encoder).
spk_embed_dim (int): Number of speaker embedding dimensions.
spk_embed_integration_type: How to integrate speaker embedding.
transformer_init (float): How to initialize transformer parameters.
transformer_lr (float): Initial value of learning rate.
transformer_warmup_steps (int): Optimizer warmup steps.
transformer_enc_dropout_rate (float): Dropout rate in encoder except attention & positional encoding.
transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
transformer_enc_dec_attn_dropout_rate (float): Dropout rate in encoder-decoder attention module.
eprenet_dropout_rate (float): Dropout rate in encoder prenet.
dprenet_dropout_rate (float): Dropout rate in decoder prenet.
postnet_dropout_rate (float): Dropout rate in postnet.
use_masking (bool): Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
bce_pos_weight (float): Positive sample weight in bce calculation (only for use_masking=true).
loss_type (str): How to calculate loss.
use_guided_attn_loss (bool): Whether to use guided attention loss.
num_heads_applied_guided_attn (int): Number of heads in each layer to apply guided attention loss.
num_layers_applied_guided_attn (int): Number of layers to apply guided attention loss.
modules_applied_guided_attn (list): List of module names to apply guided attention loss.
guided-attn-loss-sigma (float): Sigma in guided attention loss.
guided-attn-loss-lambda (float): Lambda in guided attention loss.
-
property
attention_plot_class
¶ Return plot class for attention weight plot.
-
property
base_plot_keys
¶ Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
List of strings which are base keys to plot during training.
- Return type:
list
-
calculate_all_attentions
(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters:
xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
skip_output (bool, optional) – Whether to skip calculating the final output.
keep_tensor (bool, optional) – Whether to keep original tensor.
- Returns:
Dict of attention weights and outputs.
- Return type:
dict
-
forward
(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns:
Loss value.
- Return type:
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of acoustic features.
- Parameters:
x (Tensor) – Input sequence of acoustic features (T, idim).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
Tensor: Output sequence of features (L, odim).
Tensor: Output sequence of stop probabilities (L,).
Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
- Return type:
Tensor
espnet.nets.pytorch_backend.e2e_asr_mulenc¶
Define e2e module for multi-encoder network. https://arxiv.org/pdf/1811.04903.pdf.
-
class
espnet.nets.pytorch_backend.e2e_asr_mulenc.
E2E
(idims, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module
E2E module.
- Parameters:
idims (List) – List of dimensions of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Initialize this class with python-level args.
- Parameters:
idims (list) – list of the number of an input feature dim.
odim (int) – The number of output vocab.
args (Namespace) – arguments
-
static
attention_add_arguments
(parser)[source]¶ Add arguments for attentions in multi-encoder setting.
-
calculate_all_attentions
(xs_pad_list, ilens_list, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]
ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns:
attention weights with the following shape: 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) multi-encoder case => [(B, Lmax, Tmax1), (B, Lmax, Tmax2), …, (B, Lmax, NumEncs)], 3) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray or list
-
calculate_all_ctc_probs
(xs_pad_list, ilens_list, ys_pad)[source]¶ E2E CTC probability calculation.
- Parameters:
xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]
ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns:
CTC probability (B, Tmax, vocab)
- Return type:
float ndarray or list
-
encode
(x_list)[source]¶ Encode feature.
- Parameters:
x_list (list) – input feature [(T1, D), (T2, D), … ]
- Returns:
- list
encoded feature [(T1, D), (T2, D), … ]
-
forward
(xs_pad_list, ilens_list, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]
ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns:
loss value
- Return type:
torch.Tensor
-
init_like_chainer
()[source]¶ Initialize weight like chainer.
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0.
PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know:
- EmbedID.W ~ Normal(0, 1)
- LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)
-
recognize
(x_list, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
x_list (list of ndarray) – list of input acoustic features [(T1, D), (T2, D), …]
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
recognize_batch
(xs_list, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
xs_list (list) – list of list of input acoustic feature arrays [[(T1_1, D), (T1_2, D), …],[(T2_1, D), (T2_2, D), …], …]
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
scorers
()[source]¶ Get scorers for beam_search (optional).
- Returns:
dict of ScorerInterface objects
- Return type:
dict[str, ScorerInterface]
espnet.nets.pytorch_backend.e2e_mt_transformer¶
Transformer text translation model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_mt_transformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.mt_interface.MTInterface, torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
property
attention_plot_class
¶ Return PlotAttentionReport.
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights (B, H, Lmax, Tmax)
- Return type:
float ndarray
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
target_forcing
(xs_pad, ys_pad=None, tgt_lang=None)[source]¶ Prepend target language IDs to source sentences for multilingual MT.
These tags are prepended in source/target sentences as pre-processing.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
- Returns:
source text without language IDs
- Return type:
torch.Tensor
- Returns:
target text without language IDs
- Return type:
torch.Tensor
- Returns:
target language IDs
- Return type:
torch.Tensor (B, 1)
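The three return values listed above can be pictured with a small sketch. It assumes, purely for illustration, that column 0 of every padded row holds a language-ID token; the function name `strip_language_ids` and the token values are hypothetical, and plain lists stand in for tensors. The actual method may behave differently depending on model options.

```python
def strip_language_ids(xs_pad, ys_pad):
    """Split leading language-ID tokens off each padded sequence.

    Assumption (not confirmed by the reference): column 0 of every row
    in both batches is a language-ID token.
    """
    xs_without_ids = [row[1:] for row in xs_pad]   # source text without language IDs
    ys_without_ids = [row[1:] for row in ys_pad]   # target text without language IDs
    lang_ids = [[row[0]] for row in ys_pad]        # target language IDs, shape (B, 1)
    return xs_without_ids, ys_without_ids, lang_ids


# Two sentences; 101/102 and 201/202 are made-up language-ID tokens, 0 is padding.
xs_pad = [[101, 5, 6, 7], [102, 8, 9, 0]]
ys_pad = [[202, 11, 12], [201, 13, 0]]
xs, ys, lang_ids = strip_language_ids(xs_pad, ys_pad)
```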
espnet.nets.pytorch_backend.e2e_asr_transformer¶
Transformer speech recognition model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_asr_transformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
property
attention_plot_class
¶ Return PlotAttentionReport.
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights (B, H, Lmax, Tmax)
- Return type:
float ndarray
-
calculate_all_ctc_probs
(xs_pad, ilens, ys_pad)[source]¶ E2E CTC probability calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
CTC probability (B, Tmax, vocab)
- Return type:
float ndarray
-
encode
(x)[source]¶ Encode acoustic features.
- Parameters:
x (ndarray) – source acoustic feature (T, D)
- Returns:
encoder outputs
- Return type:
torch.Tensor
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
ctc loss value
- Return type:
torch.Tensor
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
recognize
(x, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]¶ Recognize input speech.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.pytorch_backend.e2e_asr_maskctc¶
Mask CTC based non-autoregressive speech recognition model (pytorch).
See https://arxiv.org/abs/2005.08700 for details.
-
class
espnet.nets.pytorch_backend.e2e_asr_maskctc.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_asr_transformer.E2E
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
ctc loss value
- Return type:
torch.Tensor
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
recognize
(x, recog_args, char_list=None, rnnlm=None)[source]¶ Recognize input speech.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
decoding result
- Return type:
list
espnet.nets.pytorch_backend.e2e_tts_fastspeech¶
FastSpeech related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_fastspeech.
FeedForwardTransformer
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Feed Forward Transformer for TTS a.k.a. FastSpeech.
This is a module of FastSpeech, feed-forward Transformer with duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech, which does not require any auto-regressive processing during inference, resulting in fast decoding compared with auto-regressive Transformer.
Initialize feed-forward Transformer module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
elayers (int): Number of encoder layers.
eunits (int): Number of encoder hidden units.
adim (int): Number of attention transformation dimensions.
aheads (int): Number of heads for multi head attention.
dlayers (int): Number of decoder layers.
dunits (int): Number of decoder hidden units.
- use_scaled_pos_enc (bool):
Whether to use trainable scaled positional encoding.
- encoder_normalize_before (bool):
Whether to perform layer normalization before encoder block.
- decoder_normalize_before (bool):
Whether to perform layer normalization before decoder block.
- encoder_concat_after (bool): Whether to concatenate attention
layer’s input and output in encoder.
- decoder_concat_after (bool): Whether to concatenate attention
layer’s input and output in decoder.
duration_predictor_layers (int): Number of duration predictor layers.
duration_predictor_chans (int): Number of duration predictor channels.
- duration_predictor_kernel_size (int):
Kernel size of duration predictor.
spk_embed_dim (int): Number of speaker embedding dimensions.
spk_embed_integration_type: How to integrate speaker embedding.
teacher_model (str): Teacher auto-regressive transformer model path.
reduction_factor (int): Reduction factor.
transformer_init (float): How to initialize transformer parameters.
transformer_lr (float): Initial value of learning rate.
transformer_warmup_steps (int): Optimizer warmup steps.
- transformer_enc_dropout_rate (float):
Dropout rate in encoder except attention & positional encoding.
- transformer_enc_positional_dropout_rate (float):
Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float):
Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float):
Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float):
Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float):
Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float):
Dropout rate in encoder-decoder attention module.
- use_masking (bool):
Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool):
Whether to apply weighted masking in loss calculation.
- transfer_encoder_from_teacher:
Whether to transfer encoder using teacher encoder parameters.
- transferred_encoder_module:
Encoder module to be initialized using teacher parameters.
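The class description mentions a duration predictor. In FastSpeech, durations are typically consumed by a length regulator that repeats each encoder state for as many output frames as its duration. The following is a minimal sketch of that idea using plain lists; `length_regulate` is a hypothetical helper and not the ESPnet implementation.

```python
def length_regulate(hidden_states, durations):
    """Repeat each encoder state `duration` times along the time axis.

    hidden_states: list of T feature vectors (one per input token).
    durations: list of T non-negative integers (frames per token).
    """
    expanded = []
    for state, duration in zip(hidden_states, durations):
        expanded.extend([state] * duration)
    return expanded


hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # T = 3 tokens
durations = [2, 0, 3]                                  # second token is skipped
frames = length_regulate(hidden_states, durations)     # L = 5 output frames
```

Because the output length is simply the sum of the durations, no auto-regressive loop is needed to decide when to stop.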
-
property
attention_plot_class
¶ Return plot class for attention weight plot.
-
property
base_plot_keys
¶ Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
List of strings which are base keys to plot during training.
- Return type:
list
-
calculate_all_attentions
(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
extras (Tensor, optional) – Batch of precalculated durations (B, Tmax, 1).
- Returns:
Dict of attention weights and outputs.
- Return type:
dict
-
forward
(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
extras (Tensor, optional) – Batch of precalculated durations (B, Tmax, 1).
- Returns:
Loss value.
- Return type:
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters:
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) – Dummy for compatibility.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
Output sequence of features (L, odim). None: Dummy for compatibility. None: Dummy for compatibility.
- Return type:
Tensor
-
class
espnet.nets.pytorch_backend.e2e_tts_fastspeech.
FeedForwardTransformerLoss
(use_masking=True, use_weighted_masking=False)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for feed-forward Transformer.
Initialize feed-forward Transformer loss module.
- Parameters:
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
-
forward
(after_outs, before_outs, d_outs, ys, ds, ilens, olens)[source]¶ Calculate forward propagation.
- Parameters:
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
d_outs (Tensor) – Batch of outputs of duration predictor (B, Tmax).
ys (Tensor) – Batch of target features (B, Lmax, odim).
ds (Tensor) – Batch of durations (B, Tmax).
ilens (LongTensor) – Batch of the lengths of each input (B,).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns:
L1 loss value. Tensor: Duration predictor loss value.
- Return type:
Tensor
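To make the effect of use_masking concrete, here is a small sketch of an L1 loss that ignores padded frames by using olens. It uses nested lists instead of tensors, and `masked_l1_loss` is a hypothetical name; the real module may normalize or weight the loss differently.

```python
def masked_l1_loss(outputs, targets, olens):
    """Mean absolute error over the valid (non-padded) frames only.

    outputs, targets: nested lists shaped (B, Lmax, odim).
    olens: list of B valid lengths.
    """
    total, count = 0.0, 0
    for out_seq, tgt_seq, olen in zip(outputs, targets, olens):
        # Frames at index >= olen are padding and are skipped.
        for out_frame, tgt_frame in zip(out_seq[:olen], tgt_seq[:olen]):
            for o, t in zip(out_frame, tgt_frame):
                total += abs(o - t)
                count += 1
    return total / count


outputs = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last frame is padding
           [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]]
targets = [[[1.0, 1.0], [3.0, 3.0], [0.0, 0.0]],
           [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]]
loss = masked_l1_loss(outputs, targets, olens=[2, 3])
```

Without the mask, the padded frame of the first utterance would contribute a large spurious error.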
espnet.nets.pytorch_backend.e2e_st_transformer¶
Transformer speech translation model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_st_transformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.st_interface.STInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
property
attention_plot_class
¶ Return PlotAttentionReport.
-
calculate_all_attentions
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights (B, H, Lmax, Tmax)
- Return type:
float ndarray
-
calculate_all_ctc_probs
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E CTC probability calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
CTC probability (B, Tmax, vocab)
- Return type:
float ndarray
-
encode
(x)[source]¶ Encode source acoustic features.
- Parameters:
x (ndarray) – source acoustic feature (T, D)
- Returns:
encoder outputs
- Return type:
torch.Tensor
-
forward
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
ys_pad_src (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
ctc loss value
- Return type:
torch.Tensor
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
forward_asr
(hs_pad, hs_mask, ys_pad)[source]¶ Forward pass in the auxiliary ASR task.
- Parameters:
hs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
hs_mask (torch.Tensor) – batch of input token mask (B, Lmax)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
ASR attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in ASR attention decoder
- Return type:
float
- Returns:
ASR CTC loss value
- Return type:
torch.Tensor
- Returns:
character error rate from CTC prediction
- Return type:
float
- Returns:
character error rate from attention decoder prediction
- Return type:
float
- Returns:
word error rate from attention decoder prediction
- Return type:
float
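Character and word error rates such as those returned above are conventionally computed from the edit distance between reference and hypothesis. The sketch below shows that standard calculation; the function names are hypothetical, and the scoring code used inside ESPnet may differ in details such as tokenization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    length = sum(len(r) for r in refs)
    return errors / length


cer = error_rate(["hello"], ["hallo"])                          # characters
wer = error_rate([["the", "cat", "sat"]], [["the", "cat"]])     # words
```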
-
forward_mt
(xs_pad, ys_in_pad, ys_out_pad, ys_mask)[source]¶ Forward pass in the auxiliary MT task.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
ys_out_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
ys_mask (torch.Tensor) – batch of input token mask (B, Lmax)
- Returns:
MT loss value
- Return type:
torch.Tensor
- Returns:
accuracy in MT decoder
- Return type:
float
espnet.nets.pytorch_backend.ctc¶
-
class
espnet.nets.pytorch_backend.ctc.
CTC
(odim, eprojs, dropout_rate, ctc_type='builtin', reduce=True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module
- Parameters:
odim (int) – dimension of outputs
eprojs (int) – number of encoder projection units
dropout_rate (float) – dropout rate (0.0 ~ 1.0)
ctc_type (str) – builtin
reduce (bool) – reduce the CTC loss into a scalar
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters:
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns:
argmax applied 2d tensor (B, Tmax)
- Return type:
torch.Tensor
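A frame-wise argmax is often followed by the usual CTC collapse (merge repeats, drop blanks) to obtain a label sequence. The sketch below illustrates both steps on a single utterance with plain lists. It assumes the activations have already been projected to the output vocabulary, and the helper names are hypothetical.

```python
def frame_argmax(activations):
    """Return the index of the largest activation for every frame.

    activations: nested list shaped (Tmax, odim).
    """
    return [max(range(len(frame)), key=frame.__getitem__) for frame in activations]


def ctc_collapse(ids, blank_id=0):
    """Merge consecutive repeats, then remove blank symbols."""
    collapsed = []
    previous = None
    for label in ids:
        if label != previous and label != blank_id:
            collapsed.append(label)
        previous = label
    return collapsed


activations = [
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05],
    [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
]
best_path = frame_argmax(activations)
labels = ctc_collapse(best_path)
```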
-
forced_align
(h, y, blank_id=0)[source]¶ forced alignment.
- Parameters:
h (torch.Tensor) – hidden state sequence, 2d tensor (T, D)
y (int) – id sequence, 1d tensor (L)
blank_id (int) – blank symbol index
- Returns:
best alignment results
- Return type:
list
-
forward
(hs_pad, hlens, ys_pad)[source]¶ CTC forward
- Parameters:
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns:
ctc loss value
- Return type:
torch.Tensor
-
espnet.nets.pytorch_backend.ctc.
ctc_for
(args, odim, reduce=True)[source]¶ Returns the CTC module for the given args and output dimension
- Parameters:
args (Namespace) – the program args
odim (int) – the output dimension
reduce (bool) – return the CTC loss as a scalar
- Returns:
the corresponding CTC module
espnet.nets.pytorch_backend.wavenet¶
This code is based on https://github.com/kan-bayashi/PytorchWaveNetVocoder.
-
class
espnet.nets.pytorch_backend.wavenet.
CausalConv1d
(in_channels, out_channels, kernel_size, dilation=1, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
1D dilated causal convolution.
-
class
espnet.nets.pytorch_backend.wavenet.
OneHot
(depth)[source]¶ Bases:
torch.nn.modules.module.Module
Convert to one-hot vector.
- Parameters:
depth (int) – Dimension of one-hot vector.
-
class
espnet.nets.pytorch_backend.wavenet.
UpSampling
(upsampling_factor, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
Upsampling layer with deconvolution.
- Parameters:
upsampling_factor (int) – Upsampling factor.
-
class
espnet.nets.pytorch_backend.wavenet.
WaveNet
(n_quantize=256, n_aux=28, n_resch=512, n_skipch=256, dilation_depth=10, dilation_repeat=3, kernel_size=2, upsampling_factor=0)[source]¶ Bases:
torch.nn.modules.module.Module
Conditional wavenet.
- Parameters:
n_quantize (int) – Number of quantization.
n_aux (int) – Number of aux feature dimension.
n_resch (int) – Number of filter channels for residual block.
n_skipch (int) – Number of filter channels for skip connection.
dilation_depth (int) – Number of dilation depth (e.g. if set 10, max dilation = 2^(10-1)).
dilation_repeat (int) – Number of dilation repeat.
kernel_size (int) – Filter size of dilated causal convolution.
upsampling_factor (int) – Upsampling factor.
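The relationship between dilation_depth, dilation_repeat and kernel_size can be worked through numerically. The sketch below assumes the common WaveNet layout, in which dilations double within each stack and the stack is repeated; the helper names are hypothetical and the formula is the standard receptive-field calculation for stacked dilated causal convolutions.

```python
def wavenet_dilations(dilation_depth=10, dilation_repeat=3):
    """Dilations 1, 2, 4, ..., 2^(depth-1), repeated `dilation_repeat` times."""
    return [2 ** i for i in range(dilation_depth)] * dilation_repeat


def receptive_field(dilations, kernel_size=2):
    """Number of past samples (plus the current one) seen by the last layer."""
    return (kernel_size - 1) * sum(dilations) + 1


dilations = wavenet_dilations()           # defaults from the signature above
max_dilation = max(dilations)             # 2^(10-1)
field = receptive_field(dilations)        # samples of context
```

With the default arguments this gives 30 layers, a maximum dilation of 512, and a receptive field of 3070 samples.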
-
forward
(x, h)[source]¶ Calculate forward propagation.
- Parameters:
x (LongTensor) – Quantized input waveform tensor with the shape (B, T).
h (Tensor) – Auxiliary feature tensor with the shape (B, n_aux, T).
- Returns:
Logits with the shape (B, T, n_quantize).
- Return type:
Tensor
-
generate
(x, h, n_samples, interval=None, mode='sampling')[source]¶ Generate a waveform with the fast generation algorithm.
This generation is based on the Fast WaveNet Generation Algorithm.
- Parameters:
x (LongTensor) – Initial waveform tensor with the shape (T,).
h (Tensor) – Auxiliary feature tensor with the shape (n_samples + T, n_aux).
n_samples (int) – Number of samples to be generated.
interval (int, optional) – Log interval.
mode (str, optional) – “sampling” or “argmax”.
- Returns:
Generated quantized waveform (n_samples).
- Return type:
ndarray
-
espnet.nets.pytorch_backend.wavenet.
decode_mu_law
(y, mu=256)[source]¶ Perform mu-law decoding.
- Parameters:
y (ndarray) – Quantized audio signal with the range from 0 to mu - 1.
mu (int) – Quantized level.
- Returns:
Audio signal with the range from -1 to 1.
- Return type:
ndarray
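For readers unfamiliar with mu-law companding, the following scalar sketch shows the textbook encode/decode pair. It assumes a linear mapping between the integer range 0 to mu - 1 and the interval [-1, 1]; the library function may use a slightly different mapping, and both function names here are hypothetical.

```python
import math


def encode_mu_law_sketch(x, mu=256):
    """Compress x in [-1, 1] and quantize it to an integer in [0, mu - 1]."""
    m = mu - 1
    fx = math.copysign(math.log1p(m * abs(x)) / math.log1p(m), x)
    return int(math.floor((fx + 1) / 2 * m + 0.5))


def decode_mu_law_sketch(y, mu=256):
    """Expand an integer in [0, mu - 1] back to a float in [-1, 1]."""
    m = mu - 1
    fx = 2 * y / m - 1                       # back to [-1, 1], still companded
    return math.copysign(((1 + m) ** abs(fx) - 1) / m, fx)


quantized = encode_mu_law_sketch(0.5)
restored = decode_mu_law_sketch(quantized)   # close to 0.5, up to quantization error
```

The logarithmic spacing gives finer resolution near zero, which is why a round trip is only approximately lossless.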
espnet.nets.pytorch_backend.e2e_tts_transformer¶
TTS-Transformer related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
GuidedMultiHeadAttentionLoss
(sigma=0.4, alpha=1.0, reset_always=True)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_tts_tacotron2.GuidedAttentionLoss
Guided attention loss function module for multi head attention.
- Parameters:
sigma (float, optional) – Standard deviation controlling how close the attention is to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
Initialize guided attention loss module.
- Parameters:
sigma (float, optional) – Standard deviation controlling how close the attention is to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
-
forward
(att_ws, ilens, olens)[source]¶ Calculate forward propagation.
- Parameters:
att_ws (Tensor) – Batch of multi head attention weights (B, H, T_max_out, T_max_in).
ilens (LongTensor) – Batch of input lengths (B,).
olens (LongTensor) – Batch of output lengths (B,).
- Returns:
Guided attention loss value.
- Return type:
Tensor
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
TTSPlot
(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_factor=1)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.plot.PlotAttentionReport
Attention plot module for TTS-Transformer.
-
plotfn
(data_dict, uttid_list, attn_dict, outdir, suffix='png', savefn=None)[source]¶ Plot multi head attentions.
- Parameters:
data_dict (dict) – Utts info from json file.
uttid_list (list) – List of utt_id.
attn_dict (dict) – Multi head attention dict. Values should be numpy.ndarray (H, L, T)
outdir (str) – Directory name to save figures.
suffix (str) – Filename suffix including image type (e.g., png).
savefn (function) – Function to save figures.
-
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
Transformer
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Text-to-Speech Transformer module.
This is a module of text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts the sequence of characters or phonemes into the sequence of Mel-filterbanks.
Initialize TTS-Transformer module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
embed_dim (int): Dimension of character embedding.
- eprenet_conv_layers (int):
Number of encoder prenet convolution layers.
- eprenet_conv_chans (int):
Number of encoder prenet convolution channels.
eprenet_conv_filts (int): Filter size of encoder prenet convolution.
dprenet_layers (int): Number of decoder prenet layers.
dprenet_units (int): Number of decoder prenet hidden units.
elayers (int): Number of encoder layers.
eunits (int): Number of encoder hidden units.
adim (int): Number of attention transformation dimensions.
aheads (int): Number of heads for multi head attention.
dlayers (int): Number of decoder layers.
dunits (int): Number of decoder hidden units.
postnet_layers (int): Number of postnet layers.
postnet_chans (int): Number of postnet channels.
postnet_filts (int): Filter size of postnet.
- use_scaled_pos_enc (bool):
Whether to use trainable scaled positional encoding.
- use_batch_norm (bool):
Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool):
Whether to perform layer normalization before encoder block.
- decoder_normalize_before (bool):
Whether to perform layer normalization before decoder block.
- encoder_concat_after (bool): Whether to concatenate attention
layer’s input and output in encoder.
- decoder_concat_after (bool): Whether to concatenate attention
layer’s input and output in decoder.
reduction_factor (int): Reduction factor.
spk_embed_dim (int): Number of speaker embedding dimensions.
spk_embed_integration_type: How to integrate speaker embedding.
transformer_init (float): How to initialize transformer parameters.
transformer_lr (float): Initial value of learning rate.
transformer_warmup_steps (int): Optimizer warmup steps.
- transformer_enc_dropout_rate (float):
Dropout rate in encoder except attention & positional encoding.
- transformer_enc_positional_dropout_rate (float):
Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float):
Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float):
Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float):
Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float):
Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float):
Dropout rate in encoder-decoder attention module.
eprenet_dropout_rate (float): Dropout rate in encoder prenet.
dprenet_dropout_rate (float): Dropout rate in decoder prenet.
postnet_dropout_rate (float): Dropout rate in postnet.
- use_masking (bool):
Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool):
Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float): Positive sample weight in bce calculation
(only for use_masking=true).
loss_type (str): How to calculate loss.
use_guided_attn_loss (bool): Whether to use guided attention loss.
- num_heads_applied_guided_attn (int):
Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int):
Number of layers to apply guided attention loss.
- modules_applied_guided_attn (list):
List of module names to apply guided attention loss.
guided-attn-loss-sigma (float): Sigma in guided attention loss.
guided-attn-loss-lambda (float): Lambda in guided attention loss.
-
property
attention_plot_class
¶ Return plot class for attention weight plot.
-
property
base_plot_keys
¶ Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
List of strings which are base keys to plot during training.
- Return type:
list
-
calculate_all_attentions
(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
skip_output (bool, optional) – Whether to skip calculating the final output.
keep_tensor (bool, optional) – Whether to keep original tensor.
- Returns:
Dict of attention weights and outputs.
- Return type:
dict
-
forward
(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns:
Loss value.
- Return type:
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters:
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
- Return type:
Tensor
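The three inference options (threshold, minlenratio, maxlenratio) usually interact as a stopping rule for the auto-regressive loop. The sketch below shows one plausible form of that rule. It is an assumption for illustration only: it ignores the reduction factor, `should_stop` is a hypothetical helper, and the default values are made up.

```python
def should_stop(step, stop_prob, input_length,
                threshold=0.5, minlenratio=0.0, maxlenratio=10.0):
    """Decide whether decoding should end at this output step.

    step: number of output frames generated so far.
    stop_prob: predicted stop probability for the current frame.
    input_length: number of input tokens (T).
    """
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    if step >= maxlen:
        return True                     # hard upper bound on output length
    # Only trust the stop probability once the minimum length is reached.
    return step >= minlen and stop_prob >= threshold


too_early = should_stop(3, 0.9, 10, minlenratio=0.5)
confident = should_stop(5, 0.9, 10, minlenratio=0.5)
forced = should_stop(100, 0.0, 10)
```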
espnet.nets.pytorch_backend.gtn_ctc¶
GTN CTC implementation.
-
class
espnet.nets.pytorch_backend.gtn_ctc.
GTNCTCLossFunction
(*args, **kwargs)[source]¶ Bases:
torch.autograd.function.Function
GTN CTC module.
-
static
backward
(ctx, grad_output)[source]¶ Backward computation.
- Parameters:
grad_output (torch.Tensor) – backward passed gradient value
- Returns:
cumulative gradient output
- Return type:
(torch.Tensor, None, None, None)
-
static
create_ctc_graph
(target, blank_idx)[source]¶ Build gtn graph.
- Parameters:
target (list) – single target sequence
blank_idx (int) – index of blank token
- Returns:
gtn graph of target sequence
- Return type:
gtn.Graph
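The graph built for a target follows the usual CTC topology: blanks are interleaved with the target tokens, every state has a self-loop, and a blank may be skipped only between two different tokens. The sketch below enumerates those states and arcs with plain Python tuples rather than the gtn library; `ctc_topology` is a hypothetical helper and the arc format (source, destination, label) is chosen for illustration.

```python
def ctc_topology(target, blank_idx=0):
    """Return the state labels and arcs of a standard CTC alignment graph."""
    # Interleave blanks: [blank, t1, blank, t2, ..., blank] gives 2L + 1 states.
    labels = []
    for token in target:
        labels.extend([blank_idx, token])
    labels.append(blank_idx)

    arcs = []
    for state, label in enumerate(labels):
        arcs.append((state, state, label))              # stay on the same label
        if state > 0:
            arcs.append((state - 1, state, label))      # advance by one state
        if state > 1 and label != blank_idx and label != labels[state - 2]:
            arcs.append((state - 2, state, label))      # skip the blank in between
    return labels, arcs


state_labels, arcs = ctc_topology([3, 3, 5])
skip_arcs = [a for a in arcs if a[1] - a[0] == 2]
```

For the repeated token in this example no skip arc is created, so a blank must separate the two occurrences of 3 in any valid alignment.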
-
static
forward
(ctx, log_probs, targets, ilens, blank_idx=0, reduction='none')[source]¶ Forward computation.
- Parameters:
log_probs (torch.Tensor) – batched log softmax probabilities (B, Tmax, odim)
targets (list) – batched target sequences, list of lists
blank_idx (int) – index of blank token
- Returns:
ctc loss value
- Return type:
torch.Tensor
espnet.nets.pytorch_backend.e2e_tts_tacotron2¶
Tacotron 2 related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
GuidedAttentionLoss
(sigma=0.4, alpha=1.0, reset_always=True)[source]¶ Bases:
torch.nn.modules.module.Module
Guided attention loss function module.
This module calculates the guided attention loss described in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention, which forces the attention to be diagonal.
Initialize guided attention loss module.
- Parameters:
sigma (float, optional) – Standard deviation controlling how close the attention is to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
-
forward
(att_ws, ilens, olens)[source]¶ Calculate forward propagation.
- Parameters:
att_ws (Tensor) – Batch of attention weights (B, T_max_out, T_max_in).
ilens (LongTensor) – Batch of input lengths (B,).
olens (LongTensor) – Batch of output lengths (B,).
- Returns:
Guided attention loss value.
- Return type:
Tensor
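The guided attention penalty from the cited paper can be written as W[t_out][t_in] = 1 - exp(-(t_in / T_in - t_out / T_out)^2 / (2 sigma^2)), so attention mass far from the diagonal is penalized more. The sketch below computes this for a single utterance with plain lists. The function names are hypothetical, and averaging over all positions is an assumption about the normalization.

```python
import math


def guided_attention_mask(ilen, olen, sigma=0.4):
    """Penalty matrix shaped (olen, ilen); zero on the diagonal."""
    return [
        [1.0 - math.exp(-((t_in / ilen - t_out / olen) ** 2) / (2 * sigma ** 2))
         for t_in in range(ilen)]
        for t_out in range(olen)
    ]


def guided_attention_loss(att_w, ilen, olen, sigma=0.4, alpha=1.0):
    """Mean of penalty * attention weight, scaled by alpha."""
    mask = guided_attention_mask(ilen, olen, sigma)
    total = sum(mask[o][i] * att_w[o][i] for o in range(olen) for i in range(ilen))
    return alpha * total / (ilen * olen)


n = 4
diagonal = [[1.0 if i == o else 0.0 for i in range(n)] for o in range(n)]
reversed_att = [[1.0 if i == n - 1 - o else 0.0 for i in range(n)] for o in range(n)]
loss_diagonal = guided_attention_loss(diagonal, n, n)
loss_reversed = guided_attention_loss(reversed_att, n, n)
```

A perfectly diagonal alignment incurs no penalty, while the reversed alignment is penalized; a smaller sigma makes the penalty rise more sharply away from the diagonal.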
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
Tacotron2
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Tacotron2 module for end-to-end text-to-speech (E2E-TTS).
This is a module of Spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts the sequence of characters into the sequence of Mel-filterbanks.
Initialize Tacotron2 module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
spk_embed_dim (int): Dimension of the speaker embedding.
embed_dim (int): Dimension of character embedding.
elayers (int): The number of encoder blstm layers.
eunits (int): The number of encoder blstm units.
econv_layers (int): The number of encoder conv layers.
econv_filts (int): The number of encoder conv filter size.
econv_chans (int): The number of encoder conv filter channels.
dlayers (int): The number of decoder lstm layers.
dunits (int): The number of decoder lstm units.
prenet_layers (int): The number of prenet layers.
prenet_units (int): The number of prenet units.
postnet_layers (int): The number of postnet layers.
postnet_filts (int): The number of postnet filter size.
postnet_chans (int): The number of postnet filter channels.
output_activation (str): The name of activation function for outputs.
adim (int): The number of dimension of mlp in attention.
aconv_chans (int): The number of attention conv filter channels.
aconv_filts (int): The number of attention conv filter size.
cumulate_att_w (bool): Whether to cumulate previous attention weight.
use_batch_norm (bool): Whether to use batch normalization.
- use_concate (int): Whether to concatenate encoder embedding
with decoder lstm outputs.
dropout_rate (float): Dropout rate.
zoneout_rate (float): Zoneout rate.
reduction_factor (int): Reduction factor.
spk_embed_dim (int): Number of speaker embedding dimensions.
- spc_dim (int): Number of spectrogram embedding dimensions
(only for use_cbhg=True).
use_cbhg (bool): Whether to use CBHG module.
cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.
- cbhg_conv_bank_chans (int): The number of channels of
convolutional bank in CBHG.
- cbhg_proj_filts (int):
The number of filter size of projection layer in CBHG.
- cbhg_proj_chans (int):
The number of channels of projection layer in CBHG.
- cbhg_highway_layers (int):
The number of layers of highway network in CBHG.
- cbhg_highway_units (int):
The number of units of highway network in CBHG.
cbhg_gru_units (int): The number of units of GRU in CBHG.
- use_masking (bool):
Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool):
Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float):
Weight of positive sample of stop token (only for use_masking=True).
use_guided_attn_loss (bool): Whether to use guided attention loss.
guided_attn_loss_sigma (float): Sigma in guided attention loss.
guided_attn_loss_lambda (float): Lambda in guided attention loss.
-
property
base_plot_keys
¶ Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
List of strings which are base keys to plot during training.
- Return type:
list
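The naming rule described above can be illustrated with a small sketch. The helper `expand_plot_keys` is hypothetical and not part of ESPnet; it only mirrors the mapping from a base key to the reported keys and the figure file name stated in the text.

```python
def expand_plot_keys(base_keys):
    """Map each base key to the reported keys and the figure it produces."""
    expanded = {}
    for key in base_keys:
        expanded[key] = {
            # Keys the reporter is described as emitting for this base key.
            "reported": [f"main/{key}", f"validation/main/{key}"],
            # Figure file created from those two curves.
            "figure": f"{key}.png",
        }
    return expanded


plots = expand_plot_keys(["loss"])
```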
-
calculate_all_attentions
(xs, ilens, ys, spembs=None, keep_tensor=False, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
keep_tensor (bool, optional) – Whether to keep original tensor.
- Returns:
Batch of attention weights (B, Lmax, Tmax).
- Return type:
Union[ndarray, Tensor]
-
forward
(xs, ilens, ys, labels, olens, spembs=None, extras=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
extras (Tensor, optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).
- Returns:
Loss value.
- Return type:
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters:
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
Output sequence of features (L, odim), output sequence of stop probabilities (L,), and attention weights (L, T).
- Return type:
Tuple[Tensor, Tensor, Tensor]
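A minimal sketch of how the three inference options might interact, under stated assumptions: generation is assumed to stop at the first step whose stop probability reaches `threshold`, bounded below and above by the input length scaled by `minlenratio` and `maxlenratio`. The function `decode_length`, its default values, and the omission of the reduction factor are all illustrative choices, not the library's implementation.

```python
def decode_length(stop_probs, input_len, threshold=0.5,
                  minlenratio=0.0, maxlenratio=10.0):
    """Return the number of decoder steps taken before generation stops."""
    minlen = int(input_len * minlenratio)
    maxlen = int(input_len * maxlenratio)
    steps = 0
    for prob in stop_probs:
        steps += 1
        # Stop on a confident stop token or when the length cap is hit.
        if prob >= threshold or steps >= maxlen:
            # Too early to stop: keep generating until minlen is reached.
            if steps < minlen:
                continue
            break
    return steps
```

Raising `minlenratio` guards against outputs that end too early, while `maxlenratio` caps runaway generation when the stop probability never crosses the threshold.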
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
Tacotron2Loss
(use_masking=True, use_weighted_masking=False, bce_pos_weight=20.0)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for Tacotron2.
Initialize Tacotron2 loss module.
- Parameters:
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
bce_pos_weight (float) – Weight of positive sample of stop token.
-
forward
(after_outs, before_outs, logits, ys, labels, olens)[source]¶ Calculate forward propagation.
- Parameters:
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
logits (Tensor) – Batch of stop logits (B, Lmax).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns:
L1 loss value, mean square error loss value, and binary cross entropy loss value.
- Return type:
Tuple[Tensor, Tensor, Tensor]
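The three loss terms can be sketched in plain Python over nested lists. This is an illustration under assumptions, not the module's code: frames at or beyond `olens[b]` are treated as padding and skipped, the L1 and MSE terms are assumed to sum the contributions of `after_outs` and `before_outs`, and `bce_pos_weight` is assumed to scale only the positive stop-token term. The name `tacotron2_loss_sketch` is hypothetical.

```python
import math


def tacotron2_loss_sketch(after_outs, before_outs, logits, ys, labels, olens,
                          bce_pos_weight=20.0):
    """Masked L1, MSE and stop-token BCE over nested lists (B, Lmax, odim)."""
    l1 = mse = bce = 0.0
    n_feats = n_frames = 0
    for b, olen in enumerate(olens):
        for t in range(olen):  # frames at t >= olen are padding
            for d, target in enumerate(ys[b][t]):
                for outs in (after_outs, before_outs):
                    diff = outs[b][t][d] - target
                    l1 += abs(diff)
                    mse += diff * diff
                n_feats += 1
            # Binary cross entropy on the stop logit, positives up-weighted.
            p = 1.0 / (1.0 + math.exp(-logits[b][t]))
            y = labels[b][t]
            bce -= bce_pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
            n_frames += 1
    return l1 / n_feats, mse / n_feats, bce / n_frames
```

The large default positive weight reflects that only the last frame or so of each utterance carries a positive stop label, so the positive class is heavily outnumbered.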
espnet.nets.pytorch_backend.e2e_asr¶
RNN sequence-to-sequence speech recognition model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_asr.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray
-
calculate_all_ctc_probs
(xs_pad, ilens, ys_pad)[source]¶ E2E CTC probability calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
CTC probability (B, Tmax, vocab)
- Return type:
float ndarray
-
encode
(x)[source]¶ Encode acoustic features.
- Parameters:
x (ndarray) – input acoustic feature (T, D)
- Returns:
encoder outputs
- Return type:
torch.Tensor
-
enhance
(xs)[source]¶ Forward only in the frontend stage.
- Parameters:
xs (ndarray) – input acoustic feature (T, C, F)
- Returns:
enhanced feature
- Return type:
torch.Tensor
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
loss value
- Return type:
torch.Tensor
-
init_like_chainer
()[source]¶ Initialize weight like chainer.
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0.
PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know:
- EmbedID.W ~ Normal(0, 1)
- LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)
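To make the Chainer-style rule concrete, the following sketch draws a weight matrix with standard deviation `fan_in ** -0.5` and a zero bias using only the standard library. The function `lecun_normal` and the layer sizes are hypothetical; the method itself operates on the model's real parameters.

```python
import random


def lecun_normal(fan_out, fan_in, seed=0):
    """Draw a (fan_out, fan_in) matrix with W ~ Normal(0, fan_in ** -0.5)."""
    rng = random.Random(seed)
    std = fan_in ** -0.5  # second argument of Normal is the standard deviation
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


weights = lecun_normal(fan_out=64, fan_in=256)
bias = [0.0] * 64  # b = 0 in the Chainer-style scheme
```

With `fan_in = 256` the target standard deviation is 1/16, so wider layers receive proportionally smaller initial weights.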
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
x (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
recognize_batch
(xs, recog_args, char_list, rnnlm=None)[source]¶ E2E batch beam search.
- Parameters:
xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.pytorch_backend.nets_utils¶
Network related utility tools.
-
espnet.nets.pytorch_backend.nets_utils.
get_subsample
(train_args, mode, arch)[source]¶ Parse the subsampling factors from the args for the specified mode and arch.
- Parameters:
train_args – argument Namespace containing options.
mode – one of (‘asr’, ‘mt’, ‘st’)
arch – one of (‘rnn’, ‘rnn-t’, ‘rnn_mix’, ‘rnn_mulenc’, ‘transformer’)
- Returns:
subsampling factors.
- Return type:
np.ndarray / List[np.ndarray]
-
espnet.nets.pytorch_backend.nets_utils.
make_non_pad_mask
(lengths, xs=None, length_dim=-1)[source]¶ Make mask tensor containing indices of non-padded part.
- Parameters:
lengths (LongTensor or List) – Batch of lengths (B,).
xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional) – Dimension indicator of the above tensor. See the example.
- Returns:
- Mask tensor containing indices of non-padded part.
dtype=torch.uint8 in PyTorch versions before 1.2; dtype=torch.bool in PyTorch 1.2+ (including 1.2).
- Return type:
ByteTensor
Examples
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1, 1],
         [1, 1, 1, 0, 0],
         [1, 1, 0, 0, 0]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1],
         [1, 1, 1, 1]],
        [[1, 1, 1, 0],
         [1, 1, 1, 0]],
        [[1, 1, 0, 0],
         [1, 1, 0, 0]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_non_pad_mask(lengths, xs, 1)
tensor([[[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
>>> make_non_pad_mask(lengths, xs, 2)
tensor([[[1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
-
espnet.nets.pytorch_backend.nets_utils.
make_pad_mask
(lengths, xs=None, length_dim=-1, maxlen=None)[source]¶ Make mask tensor containing indices of padded part.
- Parameters:
lengths (LongTensor or List) – Batch of lengths (B,).
xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional) – Dimension indicator of the above tensor. See the example.
- Returns:
- Mask tensor containing indices of padded part.
dtype=torch.uint8 in PyTorch versions before 1.2; dtype=torch.bool in PyTorch 1.2+ (including 1.2).
- Return type:
Tensor
Examples
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0, 0],
         [0, 0, 0, 1, 1],
         [0, 0, 1, 1, 1]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0],
         [0, 0, 0, 0]],
        [[0, 0, 0, 1],
         [0, 0, 0, 1]],
        [[0, 0, 1, 1],
         [0, 0, 1, 1]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_pad_mask(lengths, xs, 1)
tensor([[[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
>>> make_pad_mask(lengths, xs, 2)
tensor([[[0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
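For readers who want the lengths-only case spelled out without tensors, here is a pure-Python sketch of the same idea. The helpers `pad_mask` and `non_pad_mask` are hypothetical stand-ins that return nested lists of booleans; they do not cover the reference-tensor or `length_dim` variants.

```python
def pad_mask(lengths):
    """Return a (B, Tmax) nested list; True marks padded positions."""
    max_len = max(lengths)
    return [[t >= length for t in range(max_len)] for length in lengths]


def non_pad_mask(lengths):
    """Logical negation of pad_mask: True marks valid positions."""
    return [[not flag for flag in row] for row in pad_mask(lengths)]
```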
-
espnet.nets.pytorch_backend.nets_utils.
mask_by_length
(xs, lengths, fill=0)[source]¶ Mask tensor according to length.
- Parameters:
xs (Tensor) – Batch of input tensor (B, *).
lengths (LongTensor or List) – Batch of lengths (B,).
fill (int or float) – Value to fill masked part.
- Returns:
Batch of masked input tensor (B, *).
- Return type:
Tensor
Examples
>>> x = torch.arange(5).repeat(3, 1) + 1
>>> x
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]])
>>> lengths = [5, 3, 2]
>>> mask_by_length(x, lengths)
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 0, 0],
        [1, 2, 0, 0, 0]])
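The effect of a non-zero `fill` value is not shown in the example above. The sketch below reproduces the masking rule on nested lists so the role of `fill` is visible; `mask_rows_by_length` is a hypothetical helper for two-dimensional input only, not the library function.

```python
def mask_rows_by_length(rows, lengths, fill=0):
    """Replace every element at position >= length with `fill`, row by row."""
    return [
        [value if t < length else fill for t, value in enumerate(row)]
        for row, length in zip(rows, lengths)
    ]


rows = [[1, 2, 3, 4, 5] for _ in range(3)]
masked = mask_rows_by_length(rows, [5, 3, 2], fill=-1)
```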
-
espnet.nets.pytorch_backend.nets_utils.
pad_list
(xs, pad_value)[source]¶ Perform padding for the list of tensors.
- Parameters:
xs (List) – List of Tensors [(T_1, *), (T_2, *), …, (T_B, *)].
pad_value (float) – Value for padding.
- Returns:
Padded tensor (B, Tmax, *).
- Return type:
Tensor
Examples
>>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
>>> x
[tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
>>> pad_list(x, 0)
tensor([[1., 1., 1., 1.],
        [1., 1., 0., 0.],
        [1., 0., 0., 0.]])
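The padding rule itself is simple enough to state in a few lines of plain Python. This sketch, with the hypothetical name `pad_sequences`, handles one-dimensional sequences only and is meant to show the shape contract (B, Tmax), not to replace the tensor version.

```python
def pad_sequences(seqs, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in seqs)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in seqs]


padded = pad_sequences([[1.0] * 4, [1.0] * 2, [1.0]], 0.0)
```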
-
espnet.nets.pytorch_backend.nets_utils.
rename_state_dict
(old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor])[source]¶ Replace keys of old prefix with new prefix in state dict.
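Since this entry has no example, the following sketch shows what prefix renaming on a state dict could look like. It uses a plain `dict` with integer placeholders instead of tensors, and assumes the renaming happens in place; `rename_prefix` and the key names are hypothetical.

```python
def rename_prefix(old_prefix, new_prefix, state_dict):
    """Rename, in place, every key that starts with old_prefix."""
    # Collect matching keys first so the dict is not mutated while iterating.
    for key in [k for k in state_dict if k.startswith(old_prefix)]:
        new_key = new_prefix + key[len(old_prefix):]
        state_dict[new_key] = state_dict.pop(key)


state = {"input_layer.weight": 1, "input_layer.bias": 2, "decoder.weight": 3}
rename_prefix("input_layer.", "embed.", state)
```

This kind of renaming is typically useful when loading a checkpoint saved before a submodule was renamed.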
-
espnet.nets.pytorch_backend.nets_utils.
th_accuracy
(pad_outputs, pad_targets, ignore_label)[source]¶ Calculate accuracy.
- Parameters:
pad_outputs (Tensor) – Prediction tensors (B * Lmax, D).
pad_targets (LongTensor) – Target label tensors (B, Lmax).
ignore_label (int) – Ignore label id.
- Returns:
Accuracy value (0.0 - 1.0).
- Return type:
float
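A rough sketch of the accuracy computation in plain Python follows. It assumes the targets hold one label id per position, so that flattening them lines up with the (B * Lmax, D) score rows; positions whose label equals `ignore_label` are excluded from both numerator and denominator. The function name `accuracy` is hypothetical.

```python
def accuracy(outputs, targets, ignore_label):
    """outputs: (B * Lmax, D) score rows; targets: (B, Lmax) label ids."""
    flat_targets = [label for row in targets for label in row]
    correct = total = 0
    for scores, label in zip(outputs, flat_targets):
        if label == ignore_label:
            continue  # padded position, not counted
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == label
        total += 1
    return correct / total


acc = accuracy(
    [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]],
    [[1, 1], [1, -1]],
    ignore_label=-1,
)
```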
-
espnet.nets.pytorch_backend.nets_utils.
to_device
(m, x)[source]¶ Send tensor into the device of the module.
- Parameters:
m (torch.nn.Module) – Torch module.
x (Tensor) – Torch tensor.
- Returns:
Torch tensor located in the same place as torch module.
- Return type:
Tensor
-
espnet.nets.pytorch_backend.nets_utils.
to_torch_tensor
(x)[source]¶ Change to torch.Tensor or ComplexTensor from numpy.ndarray.
- Parameters:
x – Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
- Returns:
Type converted inputs.
- Return type:
Tensor or ComplexTensor
Examples
>>> xs = np.ones(3, dtype=np.float32)
>>> xs = to_torch_tensor(xs)
tensor([1., 1., 1.])
>>> xs = torch.ones(3, 4, 5)
>>> assert to_torch_tensor(xs) is xs
>>> xs = {'real': xs, 'imag': xs}
>>> to_torch_tensor(xs)
ComplexTensor(
Real:
tensor([1., 1., 1.])
Imag:
tensor([1., 1., 1.])
)
espnet.nets.pytorch_backend.e2e_mt¶
RNN sequence-to-sequence text translation model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_mt.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.mt_interface.MTInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
loss value
- Return type:
torch.Tensor
-
init_like_fairseq
()[source]¶ Initialize weight like Fairseq.
Fairseq basically uses W, b, EmbedID.W ~ Uniform(-0.1, 0.1).
-
target_language_biasing
(xs_pad, ilens, ys_pad)[source]¶ Prepend target language IDs to source sentences for multilingual MT.
These tags are prepended in source/target sentences as pre-processing.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
- Returns:
source text without language IDs
- Return type:
torch.Tensor
- Returns:
target text without language IDs
- Return type:
torch.Tensor
- Returns:
target language IDs
- Return type:
torch.Tensor (B, 1)
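One possible reading of this method, sketched on plain token-id lists: each source and target sentence is assumed to start with its own language ID token, the target ID is taken from the first target token, and it replaces the leading source token. This interpretation, the function name `bias_to_target_language`, and the token ids are assumptions made for illustration.

```python
def bias_to_target_language(xs, ys):
    """xs, ys: lists of token-id lists whose first token is a language ID."""
    lang_ids = [y[0] for y in ys]          # target language ID per sentence
    new_xs = [[lang] + x[1:] for lang, x in zip(lang_ids, xs)]
    new_ys = [y[1:] for y in ys]           # target text without language ID
    return new_xs, new_ys, lang_ids


new_xs, new_ys, lang_ids = bias_to_target_language(
    [[100, 5, 6]], [[200, 7, 8, 9]]
)
```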
-
translate
(x, trans_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
x (ndarray) – input source text feature (B, T, D)
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
translate_batch
(xs, trans_args, char_list, rnnlm=None)[source]¶ E2E batch beam search.
- Parameters:
xs (list) – list of input source text feature arrays [(T_1, D), (T_2, D), …]
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.pytorch_backend.e2e_asr_mix_transformer¶
Transformer speech recognition model for single-channel multi-speaker mixture speech.
- It is a fusion of e2e_asr_mix.py and e2e_asr_transformer.py. Refer to:
- The Transformer-based Encoder now consists of three stages:
(a) Enc_mix: encoding input mixture speech;
(b) Enc_SD: separating mixed speech representations;
(c) Enc_rec: transforming each separated speech representation.
PIT is used in CTC to determine the permutation with minimum loss.
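Permutation invariant training (PIT) can be sketched as a search over speaker assignments. In the example below, `loss_matrix[i][j]` is assumed to hold the loss of output stream `i` scored against reference `j`; the function `pit_min_loss` is hypothetical and uses brute-force enumeration, which is only practical for a small number of speakers.

```python
from itertools import permutations


def pit_min_loss(loss_matrix):
    """Return (total loss, permutation) for the cheapest speaker assignment."""
    n = len(loss_matrix)

    def total(perm):
        # perm[i] is the reference assigned to output stream i.
        return sum(loss_matrix[i][perm[i]] for i in range(n))

    best_perm = min(permutations(range(n)), key=total)
    return total(best_perm), best_perm


best_loss, best_perm = pit_min_loss([[5.0, 1.0], [2.0, 6.0]])
```

Here the swapped assignment wins, so output 0 is paired with reference 1 and output 1 with reference 0.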
-
class
espnet.nets.pytorch_backend.e2e_asr_mix_transformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_asr_transformer.E2E
,espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
decoder_and_attention
(hs_pad, hs_mask, ys_pad, batch_size)[source]¶ Forward decoder and attention loss.
-
encode
(x)[source]¶ Encode acoustic features.
- Parameters:
x (ndarray) – source acoustic feature (T, D)
- Returns:
encoder outputs
- Return type:
torch.Tensor
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, num_spkrs, Lmax)
- Returns:
CTC loss value
- Return type:
torch.Tensor
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in attention decoder
- Return type:
float
-
recog
(enc_output, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]¶ Recognize input speech of each speaker.
- Parameters:
enc_output (ndarray) – encoder outputs (B, T, D) or (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
recognize
(x, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]¶ Recognize input speech of each speaker.
- Parameters:
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.pytorch_backend.initialization¶
Initialization functions for RNN sequence-to-sequence models.
-
espnet.nets.pytorch_backend.initialization.
lecun_normal_init_parameters
(module)[source]¶ Initialize parameters in the LeCun’s manner.
espnet.nets.pytorch_backend.e2e_asr_conformer¶
Conformer speech recognition model (pytorch).
It is a fusion of e2e_asr_transformer.py. Refer to: https://arxiv.org/abs/2005.08100
-
class
espnet.nets.pytorch_backend.e2e_asr_conformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_asr_transformer.E2E
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
espnet.nets.pytorch_backend.e2e_asr_transducer¶
Transducer speech recognition model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_asr_transducer.
E2E
(idim: int, odim: int, args: argparse.Namespace, ignore_id: int = -1, blank_id: int = 0, training: bool = True)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module for Transducer models.
- Parameters:
idim – Dimension of inputs.
odim – Dimension of outputs.
args – Namespace containing model options.
ignore_id – Padding symbol ID.
blank_id – Blank symbol ID.
training – Whether the model is initialized in training or inference mode.
Construct an E2E object for Transducer model.
-
static
add_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for Transducer model.
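The static `*_add_arguments` methods in this class all share the same shape: take a parser, register options, and return the parser. A minimal sketch of that pattern with the standard library is shown below; the class `ToyModel` and the option names `--toy-units` and `--toy-dropout-rate` are invented for illustration and are not real Transducer options.

```python
import argparse


class ToyModel:
    """Hypothetical model showing the parser-in, parser-out pattern."""

    @staticmethod
    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        group = parser.add_argument_group("toy model setting")
        group.add_argument("--toy-units", type=int, default=320,
                           help="Number of hidden units (illustrative).")
        group.add_argument("--toy-dropout-rate", type=float, default=0.0,
                           help="Dropout rate (illustrative).")
        return parser


parser = ToyModel.add_arguments(argparse.ArgumentParser())
args = parser.parse_args(["--toy-units", "512"])
```

Note that argparse exposes hyphenated option names as underscored attributes, so `--toy-units` is read back as `args.toy_units`.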
-
property
attention_plot_class
¶ Get attention plot class.
-
static
auxiliary_task_add_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for auxiliary task.
-
calculate_all_attentions
(feats: torch.Tensor, feats_len: torch.Tensor, labels: torch.Tensor) → numpy.ndarray[source]¶ E2E attention calculation.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
feats_len – Feature sequences lengths. (B,)
labels – Label ID sequences. (B, L)
- Returns:
- Attention weights with the following shape,
multi-head case => attention weights. (B, D_att, U, T),
other case => attention weights. (B, U, T)
- Return type:
ret
-
static
decoder_add_custom_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for Custom decoder.
-
static
decoder_add_general_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add general arguments for decoder.
-
static
decoder_add_rnn_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for RNN decoder.
-
default_parameters
(args: argparse.Namespace)[source]¶ Initialize/reset parameters for Transducer.
- Parameters:
args – Namespace containing model options.
-
encode_custom
(feats: numpy.ndarray) → torch.Tensor[source]¶ Encode acoustic features.
- Parameters:
feats – Feature sequence. (F, D_feats)
- Returns:
Encoded feature sequence. (T, D_enc)
- Return type:
enc_out
-
encode_rnn
(feats: numpy.ndarray) → torch.Tensor[source]¶ Encode acoustic features.
- Parameters:
feats – Feature sequence. (F, D_feats)
- Returns:
Encoded feature sequence. (T, D_enc)
- Return type:
enc_out
-
static
encoder_add_custom_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for Custom encoder.
-
static
encoder_add_general_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add general arguments for encoder.
-
static
encoder_add_rnn_arguments
(parser: argparse.ArgumentParser) → argparse.ArgumentParser[source]¶ Add arguments for RNN encoder.
-
forward
(feats: torch.Tensor, feats_len: torch.Tensor, labels: torch.Tensor) → torch.Tensor[source]¶ E2E forward.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
feats_len – Feature sequences lengths. (B,)
labels – Label ID sequences. (B, L)
- Returns:
Transducer loss value
- Return type:
loss
-
recognize
(feats: numpy.ndarray, beam_search: espnet.nets.beam_search_transducer.BeamSearchTransducer) → List[source]¶ Recognize input features.
- Parameters:
feats – Feature sequence. (F, D_feats)
beam_search – Beam search class.
- Returns:
N-best decoding results.
- Return type:
nbest_hyps
-
class
espnet.nets.pytorch_backend.e2e_asr_transducer.
Reporter
(**links)[source]¶ Bases:
chainer.link.Chain
A chainer reporter wrapper for Transducer models.
-
report
(loss: float, loss_trans: float, loss_ctc: float, loss_aux_trans: float, loss_symm_kl_div: float, loss_lm: float, cer: float, wer: float)[source]¶ Instantiate reporter attributes.
- Parameters:
loss – Model loss.
loss_trans – Main Transducer loss.
loss_ctc – CTC loss.
loss_aux_trans – Auxiliary Transducer loss.
loss_symm_kl_div – Symmetric KL-divergence loss.
loss_lm – Label smoothing loss.
cer – Character Error Rate.
wer – Word Error Rate.
-
espnet.nets.pytorch_backend.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.e2e_st_conformer¶
Conformer speech translation model (pytorch).
It is a fusion of e2e_st_transformer.py. Refer to: https://arxiv.org/abs/2005.08100
-
class
espnet.nets.pytorch_backend.e2e_st_conformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_st_transformer.E2E
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
espnet.nets.pytorch_backend.e2e_st¶
RNN sequence-to-sequence speech translation model (pytorch).
-
class
espnet.nets.pytorch_backend.e2e_st.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.st_interface.STInterface
,torch.nn.modules.module.Module
E2E module.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
Construct an E2E object.
- Parameters:
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E attention calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray
-
calculate_all_ctc_probs
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E CTC probability calculation.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
CTC probability (B, Tmax, vocab)
- Return type:
float ndarray
-
encode
(x)[source]¶ Encode acoustic features.
- Parameters:
x (ndarray) – input acoustic feature (T, D)
- Returns:
encoder outputs
- Return type:
torch.Tensor
-
forward
(xs_pad, ilens, ys_pad, ys_pad_src)[source]¶ E2E forward.
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)
- Returns:
loss value
- Return type:
torch.Tensor
-
forward_asr
(hs_pad, hlens, ys_pad)[source]¶ Forward pass in the auxiliary ASR task.
- Parameters:
hs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
hlens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
ASR attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy in ASR attention decoder
- Return type:
float
- Returns:
ASR CTC loss value
- Return type:
torch.Tensor
- Returns:
character error rate from CTC prediction
- Return type:
float
- Returns:
character error rate from attention decoder prediction
- Return type:
float
- Returns:
word error rate from attention decoder prediction
- Return type:
float
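Character and word error rates are commonly computed as the edit distance between hypothesis and reference divided by the reference length. The sketch below illustrates that definition in plain Python; dropping spaces before computing the character error rate is an assumption, and the helpers `edit_distance` and `error_rate` are hypothetical rather than the functions this method calls.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


ref, hyp = "the cat sat", "the cat sit"
cer = error_rate(ref.replace(" ", ""), hyp.replace(" ", ""))
wer = error_rate(ref.split(), hyp.split())
```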
-
forward_mt
(xs_pad, ys_pad)[source]¶ Forward pass in the auxiliary MT task.
- Parameters:
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns:
MT loss value
- Return type:
torch.Tensor
- Returns:
accuracy in MT decoder
- Return type:
float
-
init_like_chainer
()[source]¶ Initialize weight like chainer.
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0.
PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know:
- EmbedID.W ~ Normal(0, 1)
- LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)
-
translate
(x, trans_args, char_list, rnnlm=None)[source]¶ E2E beam search.
- Parameters:
x (ndarray) – input acoustic feature (T, D)
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
-
translate_batch
(xs, trans_args, char_list, rnnlm=None)[source]¶ E2E batch beam search.
- Parameters:
xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns:
N-best decoding results
- Return type:
list
espnet.nets.pytorch_backend.e2e_vc_tacotron2¶
Tacotron2-VC related modules.
-
class
espnet.nets.pytorch_backend.e2e_vc_tacotron2.
Tacotron2
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
VC Tacotron2 module for VC.
This is a module of Tacotron2-based VC model, which converts the sequence of acoustic features into the sequence of acoustic features.
Initialize Tacotron2 module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
spk_embed_dim (int): Dimension of the speaker embedding.
elayers (int): The number of encoder blstm layers.
eunits (int): The number of encoder blstm units.
econv_layers (int): The number of encoder conv layers.
econv_filts (int): The number of encoder conv filter size.
econv_chans (int): The number of encoder conv filter channels.
dlayers (int): The number of decoder lstm layers.
dunits (int): The number of decoder lstm units.
prenet_layers (int): The number of prenet layers.
prenet_units (int): The number of prenet units.
postnet_layers (int): The number of postnet layers.
postnet_filts (int): The number of postnet filter size.
postnet_chans (int): The number of postnet filter channels.
output_activation (str): The name of activation function for outputs.
adim (int): The number of dimension of mlp in attention.
aconv_chans (int): The number of attention conv filter channels.
aconv_filts (int): The number of attention conv filter size.
cumulate_att_w (bool): Whether to cumulate previous attention weight.
use_batch_norm (bool): Whether to use batch normalization.
- use_concate (int):
Whether to concatenate encoder embedding with decoder lstm outputs.
dropout_rate (float): Dropout rate.
zoneout_rate (float): Zoneout rate.
reduction_factor (int): Reduction factor.
spk_embed_dim (int): Number of speaker embedding dimensions.
- spc_dim (int): Number of spectrogram embedding dimensions
(only for use_cbhg=True).
use_cbhg (bool): Whether to use CBHG module.
- cbhg_conv_bank_layers (int):
The number of convolutional banks in CBHG.
- cbhg_conv_bank_chans (int):
The number of channels of convolutional bank in CBHG.
- cbhg_proj_filts (int):
The filter size of projection layer in CBHG.
- cbhg_proj_chans (int):
The number of channels of projection layer in CBHG.
- cbhg_highway_layers (int):
The number of layers of highway network in CBHG.
- cbhg_highway_units (int):
The number of units of highway network in CBHG.
cbhg_gru_units (int): The number of units of GRU in CBHG.
use_masking (bool): Whether to mask padded part in loss calculation.
- bce_pos_weight (float): Weight of positive sample of stop token
(only for use_masking=True).
use_guided_attn_loss (bool): Whether to use guided attention loss.
guided_attn_loss_sigma (float): Sigma in guided attention loss.
guided_attn_loss_lambda (float): Lambda in guided attention loss.
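The sigma and lambda options above parameterize a guided attention loss. The following is a minimal sketch, assuming the commonly used penalty `W[l][t] = 1 - exp(-(t/T - l/L)^2 / (2 * sigma^2))`. The function names and the default values are hypothetical and are not taken from ESPnet.

```python
import math


def guided_attention_weights(olen, ilen, sigma=0.4):
    """Return an (olen, ilen) penalty matrix that is small near the diagonal."""
    weights = []
    for l in range(olen):
        row = []
        for t in range(ilen):
            # Distance of (l, t) from the length-normalized diagonal.
            diff = t / ilen - l / olen
            row.append(1.0 - math.exp(-(diff ** 2) / (2 * sigma ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(att_w, sigma=0.4, lam=1.0):
    """Mean of attention weights times the penalty, scaled by lambda."""
    olen, ilen = len(att_w), len(att_w[0])
    penalty = guided_attention_weights(olen, ilen, sigma)
    total = sum(
        penalty[l][t] * att_w[l][t] for l in range(olen) for t in range(ilen)
    )
    return lam * total / (olen * ilen)
```

In this sketch a smaller sigma penalizes off-diagonal attention more sharply, and lambda only rescales the resulting loss term.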
-
property
base_plot_keys
¶ Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns:
List of strings which are base keys to plot during training.
- Return type:
list
-
calculate_all_attentions
(xs, ilens, ys, spembs=None, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters:
xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns:
Batch of attention weights (B, Lmax, Tmax).
- Return type:
numpy.ndarray
-
forward
(xs, ilens, ys, labels, olens, spembs=None, spcs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
spcs (Tensor, optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).
- Returns:
Loss value.
- Return type:
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequence of acoustic features.
- Parameters:
x (Tensor) – Input sequence of acoustic features (T, idim).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
Tensor: Output sequence of features (L, odim).
Tensor: Output sequence of stop probabilities (L,).
Tensor: Attention weights (L, T).
- Return type:
tuple
espnet.nets.pytorch_backend.tacotron2.encoder¶
Tacotron2 encoder related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.encoder.
Encoder
(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module of Spectrogram prediction network.
This is a module of encoder of Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This is the encoder which converts either a sequence of characters or acoustic features into the sequence of hidden states.
Initialize Tacotron2 encoder module.
- Parameters:
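idim (int) – Dimension of the inputs.
input_layer (str) – Input layer type.
embed_dim (int, optional) – Dimension of character embedding.
elayers (int, optional) – The number of encoder blstm layers.
eunits (int, optional) – The number of encoder blstm units.
econv_layers (int, optional) – The number of encoder conv layers.
econv_filts (int, optional) – Filter size of the encoder conv layers.
econv_chans (int, optional) – The number of encoder conv filter channels.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_residual (bool, optional) – Whether to use residual connection.
dropout_rate (float, optional) – Dropout rate.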
-
forward
(xs, ilens=None)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.
ilens (LongTensor) – Batch of lengths of each input batch (B,).
- Returns:
Tensor: Batch of the sequences of encoder states (B, Tmax, eunits).
LongTensor: Batch of lengths of each sequence (B,).
- Return type:
tuple
espnet.nets.pytorch_backend.tacotron2.decoder¶
Tacotron2 decoder related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Decoder
(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)[source]¶ Bases:
torch.nn.modules.module.Module
Decoder module of Spectrogram prediction network.
This is a module of decoder of Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The decoder generates the sequence of features from the sequence of the hidden states.
Initialize Tacotron2 decoder module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
att (torch.nn.Module) – Instance of attention class.
dlayers (int, optional) – The number of decoder lstm layers.
dunits (int, optional) – The number of decoder lstm units.
prenet_layers (int, optional) – The number of prenet layers.
prenet_units (int, optional) – The number of prenet units.
postnet_layers (int, optional) – The number of postnet layers.
postnet_filts (int, optional) – The number of postnet filter size.
postnet_chans (int, optional) – The number of postnet filter channels.
output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.
cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.
dropout_rate (float, optional) – Dropout rate.
zoneout_rate (float, optional) – Zoneout rate.
reduction_factor (int, optional) – Reduction factor.
-
calculate_all_attentions
(hs, hlens, ys)[source]¶ Calculate all of the attention weights.
- Parameters:
hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
hlens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns:
Batch of attention weights (B, Lmax, Tmax).
- Return type:
numpy.ndarray
Note
This computation is performed in teacher-forcing manner.
-
forward
(hs, hlens, ys)[source]¶ Calculate forward propagation.
- Parameters:
hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
hlens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns:
Tensor: Batch of output tensors after postnet (B, Lmax, odim).
Tensor: Batch of output tensors before postnet (B, Lmax, odim).
Tensor: Batch of logits of stop prediction (B, Lmax).
Tensor: Batch of attention weights (B, Lmax, Tmax).
- Return type:
tuple
Note
This computation is performed in teacher-forcing manner.
-
inference
(h, threshold=0.5, minlenratio=0.0, maxlenratio=10.0, use_att_constraint=False, backward_window=None, forward_window=None)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters:
h (Tensor) – Input sequence of encoder hidden states (T, C).
threshold (float, optional) – Threshold to stop generation.
minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.
maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.
backward_window (int) – Backward window size in attention constraint.
forward_window (int) – Forward window size in attention constraint.
- Returns:
Tensor: Output sequence of features (L, odim).
Tensor: Output sequence of stop probabilities (L,).
Tensor: Attention weights (L, T).
- Return type:
tuple
Note
This computation is performed in auto-regressive manner.
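The interaction of threshold, minlenratio, and maxlenratio can be sketched as follows. This is an illustration only, assuming that the length bounds are the input length multiplied by each ratio and that generation stops at the first confident stop probability past the minimum length. The function name is hypothetical and the real decoder loop may differ in detail.

```python
def generated_length(stop_probs, ilen, threshold=0.5, minlenratio=0.0,
                     maxlenratio=10.0):
    """Return how many frames would be generated for the given stop probabilities."""
    minlen = int(ilen * minlenratio)
    maxlen = int(ilen * maxlenratio)
    for step, prob in enumerate(stop_probs, start=1):
        # Stop at the length cap, or on a confident stop signal past minlen.
        if step >= maxlen or (prob >= threshold and step >= minlen):
            return step
    # The sketch ran out of probabilities before any criterion fired.
    return len(stop_probs)
```

With an input of length 10, `minlenratio=1.0` gives a lower bound of 10 output frames and `maxlenratio=10.0` an upper bound of 100, matching the parameter descriptions above.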
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Postnet
(idim, odim, n_layers=5, n_chans=512, n_filts=5, dropout_rate=0.5, use_batch_norm=True)[source]¶ Bases:
torch.nn.modules.module.Module
Postnet module for Spectrogram prediction network.
This is a module of Postnet in Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Postnet refines the Mel-filterbank predicted by the decoder, which helps to compensate for the detailed structure of the spectrogram.
Initialize postnet module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
n_layers (int, optional) – The number of layers.
n_filts (int, optional) – The number of filter size.
n_chans (int, optional) – The number of filter channels.
use_batch_norm (bool, optional) – Whether to use batch normalization.
dropout_rate (float, optional) – Dropout rate.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Prenet
(idim, n_layers=2, n_units=256, dropout_rate=0.5)[source]¶ Bases:
torch.nn.modules.module.Module
Prenet module for decoder of Spectrogram prediction network.
This is a module of Prenet in the decoder of Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Prenet performs a nonlinear conversion of the inputs before they are fed to the auto-regressive lstm, which helps to learn diagonal attentions.
Note
This module always applies dropout, even in evaluation. See the details in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
Initialize prenet module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
n_layers (int, optional) – The number of prenet layers.
n_units (int, optional) – The number of prenet units.
-
forward
(x)[source]¶ Calculate forward propagation.
- Parameters:
x (Tensor) – Batch of input tensors (B, …, idim).
- Returns:
Batch of output tensors (B, …, odim).
- Return type:
Tensor
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
ZoneOutCell
(cell, zoneout_rate=0.1)[source]¶ Bases:
torch.nn.modules.module.Module
ZoneOut Cell module.
This is a module of zoneout described in Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. This code is modified from eladhoffer/seq2seq.pytorch.
Examples
>>> lstm = torch.nn.LSTMCell(16, 32)
>>> lstm = ZoneOutCell(lstm, 0.5)
Initialize zone out cell module.
- Parameters:
cell (torch.nn.Module) – PyTorch recurrent cell module, e.g. torch.nn.LSTMCell.
zoneout_rate (float, optional) – Probability of zoneout from 0.0 to 1.0.
-
forward
(inputs, hidden)[source]¶ Calculate forward propagation.
- Parameters:
inputs (Tensor) – Batch of input tensor (B, input_size).
hidden (tuple) –
Tensor: Batch of initial hidden states (B, hidden_size).
Tensor: Batch of initial cell states (B, hidden_size).
- Returns:
Tensor: Batch of next hidden states (B, hidden_size).
Tensor: Batch of next cell states (B, hidden_size).
- Return type:
tuple
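The zoneout rule itself can be sketched on plain Python lists. The sketch assumes the formulation from the zoneout paper: during training each unit keeps its previous value with probability zoneout_rate, and at evaluation time the expected value of that random choice is used. The helper name and the `training` and `rng` arguments are hypothetical.

```python
import random


def zoneout(prev, nxt, rate, training=True, rng=random):
    """Mix previous and next state vectors element-wise."""
    if training:
        # Each unit keeps its previous value with probability `rate`.
        return [p if rng.random() < rate else n for p, n in zip(prev, nxt)]
    # At evaluation time, use the expectation of the random mask.
    return [rate * p + (1.0 - rate) * n for p, n in zip(prev, nxt)]
```

In an LSTM cell the same rule would be applied separately to the hidden state and to the cell state.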
-
espnet.nets.pytorch_backend.tacotron2.decoder.
decoder_init
(m)[source]¶ Initialize decoder parameters.
espnet.nets.pytorch_backend.tacotron2.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.tacotron2.cbhg¶
CBHG related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
CBHG
(idim, odim, conv_bank_layers=8, conv_bank_chans=128, conv_proj_filts=3, conv_proj_chans=256, highway_layers=4, highway_units=128, gru_units=256)[source]¶ Bases:
torch.nn.modules.module.Module
CBHG module to convert log Mel-filterbanks to linear spectrogram.
This is a module of CBHG introduced in Tacotron: Towards End-to-End Speech Synthesis. The CBHG converts the sequence of log Mel-filterbanks into linear spectrogram.
Initialize CBHG module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
conv_bank_layers (int, optional) – The number of convolution bank layers.
conv_bank_chans (int, optional) – The number of channels in convolution bank.
conv_proj_filts (int, optional) – Kernel size of convolutional projection layer.
conv_proj_chans (int, optional) – The number of channels in convolutional projection layer.
highway_layers (int, optional) – The number of highway network layers.
highway_units (int, optional) – The number of highway network units.
gru_units (int, optional) – The number of GRU units (for both directions).
-
forward
(xs, ilens)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of the padded sequences of inputs (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- Returns:
Tensor: Batch of the padded sequence of outputs (B, Tmax, odim).
LongTensor: Batch of lengths of each output sequence (B,).
- Return type:
tuple
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
CBHGLoss
(use_masking=True)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for CBHG.
Initialize CBHG loss module.
- Parameters:
use_masking (bool) – Whether to mask padded part in loss calculation.
-
forward
(cbhg_outs, spcs, olens)[source]¶ Calculate forward propagation.
- Parameters:
cbhg_outs (Tensor) – Batch of CBHG outputs (B, Lmax, spc_dim).
spcs (Tensor) – Batch of groundtruth of spectrogram (B, Lmax, spc_dim).
olens (LongTensor) – Batch of the lengths of each sequence (B,).
- Returns:
Tensor: L1 loss value.
Tensor: Mean square error loss value.
- Return type:
tuple
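A minimal sketch of a masked L1 and mean square error computation, assuming that masking means frames beyond each sequence's length in olens are excluded from both averages. It works on nested lists rather than tensors, and the function name is hypothetical.

```python
def masked_l1_and_mse(outs, targets, olens):
    """outs and targets: nested lists (B, Lmax, D); olens: valid lengths (B,)."""
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for out_seq, tgt_seq, olen in zip(outs, targets, olens):
        # Only the first `olen` frames of each sequence are real data.
        for out_frame, tgt_frame in zip(out_seq[:olen], tgt_seq[:olen]):
            for o, t in zip(out_frame, tgt_frame):
                abs_sum += abs(o - t)
                sq_sum += (o - t) ** 2
                count += 1
    return abs_sum / count, sq_sum / count
```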
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
HighwayNet
(idim)[source]¶ Bases:
torch.nn.modules.module.Module
Highway Network module.
This is a module of Highway Network introduced in Highway Networks.
Initialize Highway Network module.
- Parameters:
idim (int) – Dimension of the inputs.
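The highway combination can be illustrated element-wise. This sketch assumes the standard formulation from Highway Networks, `y = T(x) * H(x) + (1 - T(x)) * x`, with a sigmoid gate. The projection output and gate logits are passed in directly, so no learned layers are modeled, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))


def highway(x, proj, gate_logit):
    """Blend the projected value with the input using a sigmoid gate."""
    out = []
    for xi, hi, gi in zip(x, proj, gate_logit):
        g = sigmoid(gi)
        # A gate near 1 passes the projection, a gate near 0 passes the input.
        out.append(g * hi + (1.0 - g) * xi)
    return out
```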
espnet.nets.pytorch_backend.fastspeech.length_regulator¶
Length regulator related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.length_regulator.
LengthRegulator
(pad_value=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Length regulator module for feed-forward Transformer.
This is a module of length regulator described in FastSpeech: Fast, Robust and Controllable Text to Speech. The length regulator expands char or phoneme-level embedding features to frame-level by repeating each feature based on the corresponding predicted durations.
Initialize length regulator module.
- Parameters:
pad_value (float, optional) – Value used for padding.
-
forward
(xs, ds, alpha=1.0)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of sequences of char or phoneme embeddings (B, Tmax, D).
ds (LongTensor) – Batch of durations of each frame (B, T).
alpha (float, optional) – Alpha value to control speed of speech.
- Returns:
replicated input tensor based on durations (B, T*, D).
- Return type:
Tensor
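The expansion described above can be sketched with nested lists. The sketch assumes that alpha scales the durations before rounding and that shorter sequences in the batch are padded with pad_value. The function name is hypothetical and no tensors are involved.

```python
def length_regulate(xs, ds, alpha=1.0, pad_value=0.0):
    """xs: (B, Tmax, D) nested lists; ds: (B, Tmax) integer durations."""
    expanded = []
    for seq, durs in zip(xs, ds):
        frames = []
        for feat, d in zip(seq, durs):
            # Scale the duration by alpha, then repeat the feature that often.
            repeat = int(round(d * alpha))
            frames.extend(list(feat) for _ in range(repeat))
        expanded.append(frames)
    # Pad every sequence to the longest expanded length in the batch.
    max_len = max(len(frames) for frames in expanded)
    dim = len(xs[0][0])
    for frames in expanded:
        frames.extend([pad_value] * dim for _ in range(max_len - len(frames)))
    return expanded
```

Under these assumptions a larger alpha yields more frames per input feature and therefore a longer output sequence.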
espnet.nets.pytorch_backend.fastspeech.duration_predictor¶
Duration predictor related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.duration_predictor.
DurationPredictor
(idim, n_layers=2, n_chans=384, kernel_size=3, dropout_rate=0.1, offset=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Duration predictor module.
This is a module of duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech. The duration predictor predicts a duration of each frame in log domain from the hidden embeddings of encoder.
Note
The calculation domain of the outputs differs between forward and inference: in forward the outputs are calculated in log domain, whereas in inference they are calculated in linear domain.
Initialize duration predictor module.
- Parameters:
idim (int) – Input dimension.
n_layers (int, optional) – Number of convolutional layers.
n_chans (int, optional) – Number of channels of convolutional layers.
kernel_size (int, optional) – Kernel size of convolutional layers.
dropout_rate (float, optional) – Dropout rate.
offset (float, optional) – Offset value to avoid nan in log domain.
-
forward
(xs, x_masks=None)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of input sequences (B, Tmax, idim).
x_masks (ByteTensor, optional) – Batch of masks indicating padded part (B, Tmax).
- Returns:
Batch of predicted durations in log domain (B, Tmax).
- Return type:
Tensor
-
inference
(xs, x_masks=None)[source]¶ Inference duration.
- Parameters:
xs (Tensor) – Batch of input sequences (B, Tmax, idim).
x_masks (ByteTensor, optional) – Batch of masks indicating padded part (B, Tmax).
- Returns:
Batch of predicted durations in linear domain (B, Tmax).
- Return type:
LongTensor
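The conversion from log domain to linear domain can be sketched as follows, assuming the log-domain value represents `log(duration + offset)` so that the inverse is `exp(value) - offset`, rounded and clamped at zero. The function name is hypothetical.

```python
import math


def log_to_linear_durations(log_durs, offset=1.0):
    """Convert log-domain predictions to non-negative integer durations."""
    # Invert log(d + offset), round to whole frames, clamp negatives to zero.
    return [max(int(round(math.exp(v) - offset)), 0) for v in log_durs]
```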
-
class
espnet.nets.pytorch_backend.fastspeech.duration_predictor.
DurationPredictorLoss
(offset=1.0, reduction='mean')[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for duration predictor.
The loss value is calculated in log domain to make it Gaussian.
Initialize duration predictor loss module.
- Parameters:
offset (float, optional) – Offset value to avoid nan in log domain.
reduction (str) – Reduction type in loss calculation.
-
forward
(outputs, targets)[source]¶ Calculate forward propagation.
- Parameters:
outputs (Tensor) – Batch of prediction durations in log domain (B, T)
targets (LongTensor) – Batch of groundtruth durations in linear domain (B, T)
- Returns:
Mean squared error loss value.
- Return type:
Tensor
Note
outputs is in log domain but targets is in linear domain.
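Because outputs and targets live in different domains, the targets have to be moved into log domain before the error is taken. A minimal sketch, assuming the transform is `log(target + offset)` followed by a mean squared error; the function name is hypothetical and flat lists stand in for (B, T) tensors.

```python
import math


def duration_loss(outputs, targets, offset=1.0):
    """outputs: predicted log-durations; targets: integer durations."""
    # The offset keeps log() finite when a target duration is zero.
    log_targets = [math.log(t + offset) for t in targets]
    squared = [(o - lt) ** 2 for o, lt in zip(outputs, log_targets)]
    return sum(squared) / len(squared)
```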
espnet.nets.pytorch_backend.fastspeech.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.fastspeech.duration_calculator¶
Duration calculator related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.duration_calculator.
DurationCalculator
(teacher_model)[source]¶ Bases:
torch.nn.modules.module.Module
Duration calculator module for FastSpeech.
Initialize duration calculator module.
- Parameters:
teacher_model (e2e_tts_transformer.Transformer) – Pretrained auto-regressive Transformer.
-
forward
(xs, ilens, ys, olens, spembs=None)[source]¶ Calculate forward propagation.
- Parameters:
xs (Tensor) – Batch of the padded sequences of character ids (B, Tmax).
ilens (Tensor) – Batch of lengths of each input sequence (B,).
ys (Tensor) – Batch of the padded sequence of target features (B, Lmax, odim).
olens (Tensor) – Batch of lengths of each output sequence (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns:
Batch of durations (B, Tmax).
- Return type:
Tensor
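The text above does not say how durations are derived from the teacher model. One common approach, assumed here purely for illustration, is to take an attention matrix from the teacher and count, for each input token, how many output frames attend to it most strongly. The function name is hypothetical.

```python
def durations_from_attention(att_w, ilen):
    """att_w: (L, T) attention weights for one utterance; returns T counts."""
    durations = [0] * ilen
    for row in att_w:
        # Assign each output frame to the input token with the largest weight.
        best = max(range(ilen), key=lambda t: row[t])
        durations[best] += 1
    return durations
```

With this construction the durations of one utterance always sum to the number of output frames.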
espnet.nets.pytorch_backend.conformer.encoder_layer¶
Encoder self-attention layer definition.
-
class
espnet.nets.pytorch_backend.conformer.encoder_layer.
EncoderLayer
(size, self_attn, feed_forward, feed_forward_macaron, conv_module, dropout_rate, normalize_before=True, concat_after=False, stochastic_depth_rate=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder layer module.
- Parameters:
size (int) – Input dimension.
self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention or RelPositionMultiHeadedAttention instance can be used as the argument.
feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
feed_forward_macaron (torch.nn.Module) – Additional feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
conv_module (torch.nn.Module) – Convolution module instance. ConvolutionModule instance can be used as the argument.
dropout_rate (float) – Dropout rate.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
stochastic_depth_rate (float) – Probability to skip this layer. During training, the layer may skip residual computation and return input as-is with given probability.
Construct an EncoderLayer object.
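The stochastic depth behaviour described for stochastic_depth_rate can be sketched on plain lists. The skip rule follows the description above; the rescaling of the surviving branch by `1 / (1 - rate)` is an assumption of this sketch, as are the function name and the `training` and `rng` arguments.

```python
import random


def residual_block(x, sublayer, rate=0.0, training=True, rng=random):
    """Apply x + sublayer(x) with stochastic depth on vectors given as lists."""
    coeff = 1.0
    if training and rate > 0.0:
        if rng.random() < rate:
            # Skip the residual computation and return the input as-is.
            return list(x)
        # Assumed rescaling so the expected output matches evaluation mode.
        coeff = 1.0 / (1.0 - rate)
    return [xi + coeff * yi for xi, yi in zip(x, sublayer(x))]
```

Outside of training the layer is never skipped, so the block reduces to an ordinary residual connection.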
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, 1, time).
- Return type:
tuple
espnet.nets.pytorch_backend.conformer.encoder¶
Encoder definition.
-
class
espnet.nets.pytorch_backend.conformer.encoder.
Encoder
(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, macaron_style=False, pos_enc_layer_type='abs_pos', selfattention_layer_type='selfattn', activation_type='swish', use_cnn_module=False, zero_triu=False, cnn_module_kernel=31, padding_idx=-1, stochastic_depth_rate=0.0, intermediate_layers=None, ctc_softmax=None, conditioning_layer_dim=None)[source]¶ Bases:
torch.nn.modules.module.Module
Conformer encoder module.
- Parameters:
idim (int) – Input dimension.
attention_dim (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
attention_dropout_rate (float) – Dropout rate in attention.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
pos_enc_layer_type (str) – Encoder positional encoding layer type.
selfattention_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
stochastic_depth_rate (float) – Maximum probability to skip the encoder layer.
intermediate_layers (Union[List[int], None]) – Indices of intermediate CTC layers. Indices start from 1. If not None, intermediate outputs are returned (which changes the return type signature).
Construct an Encoder object.
espnet.nets.pytorch_backend.conformer.convolution¶
ConvolutionModule definition.
-
class
espnet.nets.pytorch_backend.conformer.convolution.
ConvolutionModule
(channels, kernel_size, activation=ReLU(), bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
ConvolutionModule in Conformer model.
- Parameters:
channels (int) – The number of channels of conv layers.
kernel_size (int) – Kernel size of conv layers.
Construct a ConvolutionModule object.
espnet.nets.pytorch_backend.conformer.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.conformer.contextual_block_encoder_layer¶
Created on Sat Aug 21 16:57:31 2021.
@author: Keqi Deng (UCAS)
-
class
espnet.nets.pytorch_backend.conformer.contextual_block_encoder_layer.
ContextualBlockEncoderLayer
(size, self_attn, feed_forward, feed_forward_macaron, conv_module, dropout_rate, total_layer_num, normalize_before=True, concat_after=False)[source]¶ Bases:
torch.nn.modules.module.Module
Contextual Block Encoder layer module.
- Parameters:
size (int) – Input dimension.
self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention or RelPositionMultiHeadedAttention instance can be used as the argument.
feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
feed_forward_macaron (torch.nn.Module) – Additional feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
conv_module (torch.nn.Module) – Convolution module instance. ConvolutionModule instance can be used as the argument.
dropout_rate (float) – Dropout rate.
total_layer_num (int) – Total number of layers.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
Construct an EncoderLayer object.
-
forward
(x, mask, infer_mode=False, past_ctx=None, next_ctx=None, is_short_segment=False, layer_idx=0, cache=None)[source]¶ Calculate forward propagation.
-
forward_infer
(x, mask, past_ctx=None, next_ctx=None, is_short_segment=False, layer_idx=0, cache=None)[source]¶ Compute encoded features.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
past_ctx (torch.Tensor) – Previous contextual vector.
next_ctx (torch.Tensor) – Next contextual vector.
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, 1, time).
cur_ctx (torch.Tensor): Current contextual vector.
next_ctx (torch.Tensor): Next contextual vector.
layer_idx (int): Layer index number.
- Return type:
tuple
-
forward_train
(x, mask, past_ctx=None, next_ctx=None, layer_idx=0, cache=None)[source]¶ Compute encoded features.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, time).
past_ctx (torch.Tensor) – Previous contextual vector.
next_ctx (torch.Tensor) – Next contextual vector.
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, time).
cur_ctx (torch.Tensor): Current contextual vector.
next_ctx (torch.Tensor): Next contextual vector.
layer_idx (int): Layer index number.
- Return type:
tuple
espnet.nets.pytorch_backend.conformer.swish¶
Swish() activation function for Conformer.
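For reference, a scalar sketch of the Swish activation, assuming the usual definition `x * sigmoid(x)`. The function name is illustrative and is not the module's class.

```python
import math


def swish(x):
    """Return x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))
```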
espnet.nets.pytorch_backend.conformer.argument¶
Conformer common arguments.
espnet.nets.pytorch_backend.maskctc.add_mask_token¶
Token masking module for Masked LM.
-
espnet.nets.pytorch_backend.maskctc.add_mask_token.
mask_uniform
(ys_pad, mask_token, eos, ignore_id)[source]¶ Replace random tokens with <mask> label and add <eos> label.
The number of <mask> tokens is chosen from a uniform distribution between one and the target sequence’s length.
- Parameters:
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
mask_token (int) – index of <mask>
eos (int) – index of <eos>
ignore_id (int) – index of padding
- Returns:
torch.Tensor: padded tensor (B, Lmax)
torch.Tensor: padded tensor (B, Lmax)
- Return type:
tuple
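The masking step can be sketched for a single unpadded sequence. The sketch assumes that masked positions are drawn without replacement, and that the two returned sequences are the masked input and a target that keeps the original token only at masked positions. Batch padding and the `<eos>` handling are omitted, and the function name is hypothetical.

```python
import random


def mask_uniform_single(ys, mask_token, ignore_id, rng=random):
    """Mask a random subset of one unpadded token sequence."""
    # Number of masks is uniform between 1 and len(ys), inclusive.
    num_mask = rng.randint(1, len(ys))
    positions = rng.sample(range(len(ys)), num_mask)
    ys_in = list(ys)
    ys_out = [ignore_id] * len(ys)
    for pos in positions:
        ys_in[pos] = mask_token  # the model input sees <mask>
        ys_out[pos] = ys[pos]    # the target keeps the original token
    return ys_in, ys_out
```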
espnet.nets.pytorch_backend.maskctc.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.maskctc.mask¶
Attention masking module for Masked LM.
-
espnet.nets.pytorch_backend.maskctc.mask.
square_mask
(ys_in_pad, ignore_id)[source]¶ Create attention mask to avoid attending on padding tokens.
- Parameters:
ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
ignore_id (int) – index of padding
dtype (torch.dtype) – result dtype
- Return type:
torch.Tensor (B, Lmax, Lmax)
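A minimal sketch of such a mask on nested lists, assuming that entry (i, j) is True only when neither position i nor position j is padding. The function name is hypothetical and booleans stand in for a tensor.

```python
def square_mask_sketch(ys_in_pad, ignore_id):
    """ys_in_pad: (B, Lmax) nested lists; returns (B, Lmax, Lmax) booleans."""
    masks = []
    for seq in ys_in_pad:
        valid = [tok != ignore_id for tok in seq]
        # Position (i, j) is kept only when both tokens are real.
        masks.append([[vi and vj for vj in valid] for vi in valid])
    return masks
```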
espnet.nets.pytorch_backend.transformer.dynamic_conv¶
Dynamic Convolution module.
-
class
espnet.nets.pytorch_backend.transformer.dynamic_conv.
DynamicConvolution
(wshare, n_feat, dropout_rate, kernel_size, use_kernel_mask=False, use_bias=False)[source]¶ Bases:
torch.nn.modules.module.Module
Dynamic Convolution layer.
This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq
- Parameters:
wshare (int) – the number of kernel of convolution
n_feat (int) – the number of features
dropout_rate (float) – dropout_rate
kernel_size (int) – kernel size (length)
use_kernel_mask (bool) – Use causal mask or not for convolution kernel
use_bias (bool) – Use bias term or not.
Construct Dynamic Convolution layer.
-
forward
(query, key, value, mask)[source]¶ Forward of ‘Dynamic Convolution’.
This function takes query, key and value but uses only query. This is just for compatibility with self-attention layer (attention.py)
- Parameters:
query (torch.Tensor) – (batch, time1, d_model) input tensor
key (torch.Tensor) – (batch, time2, d_model) NOT USED
value (torch.Tensor) – (batch, time2, d_model) NOT USED
mask (torch.Tensor) – (batch, time1, time2) mask
- Returns:
(batch, time1, d_model) output
- Return type:
x (torch.Tensor)
espnet.nets.pytorch_backend.transformer.encoder_layer¶
Encoder self-attention layer definition.
-
class
espnet.nets.pytorch_backend.transformer.encoder_layer.
EncoderLayer
(size, self_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False, stochastic_depth_rate=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder layer module.
- Parameters:
size (int) – Input dimension.
self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention or RelPositionMultiHeadedAttention instance can be used as the argument.
feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
dropout_rate (float) – Dropout rate.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
stochastic_depth_rate (float) – Probability to skip this layer. During training, the layer may skip residual computation and return input as-is with given probability.
Construct an EncoderLayer object.
-
forward
(x, mask, cache=None)[source]¶ Compute encoded features.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, 1, time).
- Return type:
tuple
espnet.nets.pytorch_backend.transformer.lightconv2d¶
Lightweight 2-Dimensional Convolution module.
-
class
espnet.nets.pytorch_backend.transformer.lightconv2d.
LightweightConvolution2D
(wshare, n_feat, dropout_rate, kernel_size, use_kernel_mask=False, use_bias=False)[source]¶ Bases:
torch.nn.modules.module.Module
Lightweight 2-Dimensional Convolution layer.
This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq
- Parameters:
wshare (int) – the number of convolution kernels
n_feat (int) – the number of features
dropout_rate (float) – dropout_rate
kernel_size (int) – kernel size (length)
use_kernel_mask (bool) – Use causal mask or not for convolution kernel
use_bias (bool) – Use bias term or not.
Construct Lightweight 2-Dimensional Convolution layer.
-
forward
(query, key, value, mask)[source]¶ Forward of ‘Lightweight 2-Dimensional Convolution’.
This function takes query, key and value but uses only query. This is just for compatibility with self-attention layer (attention.py)
- Parameters:
query (torch.Tensor) – (batch, time1, d_model) input tensor
key (torch.Tensor) – (batch, time2, d_model) NOT USED
value (torch.Tensor) – (batch, time2, d_model) NOT USED
mask (torch.Tensor) – (batch, time1, time2) mask
- Returns:
(batch, time1, d_model) output
- Return type:
x (torch.Tensor)
espnet.nets.pytorch_backend.transformer.encoder¶
Encoder definition.
-
class
espnet.nets.pytorch_backend.transformer.encoder.
Encoder
(idim, attention_dim=256, attention_heads=4, conv_wshare=4, conv_kernel_length='11', conv_usebias=False, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, selfattention_layer_type='selfattn', padding_idx=-1, stochastic_depth_rate=0.0, intermediate_layers=None, ctc_softmax=None, conditioning_layer_dim=None)[source]¶ Bases:
torch.nn.modules.module.Module
Transformer encoder module.
- Parameters:
idim (int) – Input dimension.
attention_dim (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
conv_wshare (int) – The number of convolution kernels. Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
conv_kernel_length (Union[int, str]) – Kernel size str of convolution (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
conv_usebias (bool) – Whether to use bias in convolution. Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of encoder blocks.
dropout_rate (float) – Dropout rate.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
attention_dropout_rate (float) – Dropout rate in attention.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
pos_enc_class (torch.nn.Module) – Positional encoding module class. PositionalEncoding or ScaledPositionalEncoding
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
selfattention_layer_type (str) – Encoder attention layer type.
padding_idx (int) – Padding idx for input_layer=embed.
stochastic_depth_rate (float) – Maximum probability to skip the encoder layer.
intermediate_layers (Union[List[int], None]) – indices of intermediate CTC layer. indices start from 1. if not None, intermediate outputs are returned (which changes return type signature.)
Construct an Encoder object.
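The underscore-separated form of conv_kernel_length (e.g. 71_71_71_71_71_71) can be read as one kernel size per block. The sketch below shows that reading under stated assumptions: parse_kernel_lengths is a hypothetical helper, not an ESPnet function, and because this reference does not say how a spec with fewer entries than blocks is handled, the sketch simply rejects it.

```python
def parse_kernel_lengths(spec, num_blocks):
    """Split an underscore-separated kernel spec into one int per block."""
    # str() lets the helper accept either an int or a str, matching Union[int, str].
    sizes = [int(k) for k in str(spec).split("_")]
    if len(sizes) != num_blocks:
        raise ValueError(
            f"expected {num_blocks} kernel sizes, got {len(sizes)}: {spec!r}"
        )
    return sizes
```

For example, parse_kernel_lengths("71_71_71_71_71_71", 6) yields six entries of 71.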
-
forward
(xs, masks)[source]¶ Encode input sequence.
- Parameters:
xs (torch.Tensor) – Input tensor (#batch, time, idim).
masks (torch.Tensor) – Mask tensor (#batch, 1, time).
- Returns:
Output tensor (#batch, time, attention_dim). torch.Tensor: Mask tensor (#batch, 1, time).
- Return type:
torch.Tensor
-
forward_one_step
(xs, masks, cache=None)[source]¶ Encode input frame.
- Parameters:
xs (torch.Tensor) – Input tensor.
masks (torch.Tensor) – Mask tensor.
cache (List[torch.Tensor]) – List of cache tensors.
- Returns:
Output tensor. torch.Tensor: Mask tensor. List[torch.Tensor]: List of new cache tensors.
- Return type:
torch.Tensor
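The (#batch, 1, time) mask shape used by Encoder.forward can be illustrated with nested lists. This is a sketch only: make_valid_mask is a hypothetical name, and the convention that True marks a valid (non-padded) frame is an assumption, since this reference does not state the polarity.

```python
def make_valid_mask(lengths):
    """Return a (#batch, 1, time) nested list; True marks a valid frame."""
    max_len = max(lengths)
    # One row per utterance, wrapped in a singleton list for the middle axis.
    return [[[t < n for t in range(max_len)]] for n in lengths]
```

The singleton middle axis lets one mask row be shared across every query position of an utterance.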
espnet.nets.pytorch_backend.transformer.multi_layer_conv¶
Layer modules for FFT block in FastSpeech (Feed-forward Transformer).
-
class
espnet.nets.pytorch_backend.transformer.multi_layer_conv.
Conv1dLinear
(in_chans, hidden_chans, kernel_size, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Conv1D + Linear for Transformer block.
A variant of MultiLayeredConv1d that replaces the second conv layer with a linear layer.
Initialize Conv1dLinear module.
- Parameters:
in_chans (int) – Number of input channels.
hidden_chans (int) – Number of hidden channels.
kernel_size (int) – Kernel size of conv1d.
dropout_rate (float) – Dropout rate.
-
class
espnet.nets.pytorch_backend.transformer.multi_layer_conv.
MultiLayeredConv1d
(in_chans, hidden_chans, kernel_size, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Multi-layered conv1d for Transformer block.
This is a module of multi-layered conv1d designed to replace the positionwise feed-forward network in a Transformer block, as introduced in FastSpeech: Fast, Robust and Controllable Text to Speech.
Initialize MultiLayeredConv1d module.
- Parameters:
in_chans (int) – Number of input channels.
hidden_chans (int) – Number of hidden channels.
kernel_size (int) – Kernel size of conv1d.
dropout_rate (float) – Dropout rate.
espnet.nets.pytorch_backend.transformer.encoder_mix¶
Encoder Mix definition.
-
class
espnet.nets.pytorch_backend.transformer.encoder_mix.
EncoderMix
(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks_sd=4, num_blocks_rec=8, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, padding_idx=-1, num_spkrs=2)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.encoder.Encoder
,torch.nn.modules.module.Module
Transformer encoder module.
- Parameters:
idim (int) – input dim
attention_dim (int) – dimension of attention
attention_heads (int) – the number of heads of multi head attention
linear_units (int) – the number of units of position-wise feed forward
num_blocks (int) – the number of decoder blocks
dropout_rate (float) – dropout rate
attention_dropout_rate (float) – dropout rate in attention
positional_dropout_rate (float) – dropout rate after adding positional encoding
input_layer (str or torch.nn.Module) – input layer type
pos_enc_class (class) – PositionalEncoding or ScaledPositionalEncoding
normalize_before (bool) – whether to use layer_norm before the first block
concat_after (bool) – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – linear or conv1d
positionwise_conv_kernel_size (int) – kernel size of positionwise conv1d layer
padding_idx (int) – padding_idx for input_layer=embed
Construct an Encoder object.
-
forward
(xs, masks)[source]¶ Encode input sequence.
- Parameters:
xs (torch.Tensor) – input tensor
masks (torch.Tensor) – input mask
- Returns:
position embedded tensor and mask
- Return type:
Tuple[torch.Tensor, torch.Tensor]
-
forward_one_step
(xs, masks, cache=None)[source]¶ Encode input frame.
- Parameters:
xs (torch.Tensor) – input tensor
masks (torch.Tensor) – input mask
cache (List[torch.Tensor]) – cache tensors
- Returns:
position embedded tensor, mask and new cache
- Return type:
Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]
espnet.nets.pytorch_backend.transformer.subsampling_without_posenc¶
Subsampling layer definition.
-
class
espnet.nets.pytorch_backend.transformer.subsampling_without_posenc.
Conv2dSubsamplingWOPosEnc
(idim, odim, dropout_rate, kernels, strides)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 2D subsampling.
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
kernels (list) – kernel sizes
strides (list) – stride sizes
Construct a Conv2dSubsamplingWOPosEnc object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 4.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 4.
- Return type:
torch.Tensor
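The subsampled length depends on the kernels and strides arguments. A minimal sketch of the arithmetic, assuming each convolution uses no padding and no dilation (conv_out_len is a hypothetical helper, not part of ESPnet):

```python
def conv_out_len(time, kernels, strides):
    """Length after stacked convolutions without padding or dilation."""
    for k, s in zip(kernels, strides):
        if time < k:
            raise ValueError("input is shorter than the kernel")
        # Standard output-length formula for an unpadded convolution.
        time = (time - k) // s + 1
    return time
```

Under these assumptions, two layers with kernel 3 and stride 2 give a length close to, but not always equal to, time // 4.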
espnet.nets.pytorch_backend.transformer.layer_norm¶
Layer normalization module.
espnet.nets.pytorch_backend.transformer.decoder_layer¶
Decoder self-attention layer definition.
-
class
espnet.nets.pytorch_backend.transformer.decoder_layer.
DecoderLayer
(size, self_attn, src_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False)[source]¶ Bases:
torch.nn.modules.module.Module
Single decoder layer module.
- Parameters:
size (int) – Input dimension.
self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention instance can be used as the argument.
src_attn (torch.nn.Module) – Source-attention module instance. MultiHeadedAttention instance can be used as the argument.
feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
dropout_rate (float) – Dropout rate.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
Construct a DecoderLayer object.
-
forward
(tgt, tgt_mask, memory, memory_mask, cache=None)[source]¶ Compute decoded features.
- Parameters:
tgt (torch.Tensor) – Input tensor (#batch, maxlen_out, size).
tgt_mask (torch.Tensor) – Mask for input tensor (#batch, maxlen_out).
memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, size).
memory_mask (torch.Tensor) – Encoded memory mask (#batch, maxlen_in).
cache (List[torch.Tensor]) – List of cached tensors. Each tensor shape should be (#batch, maxlen_out - 1, size).
- Returns:
Output tensor(#batch, maxlen_out, size). torch.Tensor: Mask for output tensor (#batch, maxlen_out). torch.Tensor: Encoded memory (#batch, maxlen_in, size). torch.Tensor: Encoded memory mask (#batch, maxlen_in).
- Return type:
torch.Tensor
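The two residual forms selected by concat_after can be sketched on plain Python lists. The function names, the single-vector input, and the explicit weight and bias arguments are assumptions for illustration; normalization and dropout are omitted.

```python
def residual_plain(x, att):
    """concat_after=False: x -> x + att(x)."""
    a = att(x)
    return [xi + ai for xi, ai in zip(x, a)]


def residual_concat(x, att, weight, bias):
    """concat_after=True: x -> x + linear(concat(x, att(x))).

    weight has shape (len(x), 2 * len(x)) so that the linear layer maps the
    concatenated vector back to the input size.
    """
    cat = list(x) + list(att(x))
    proj = [
        sum(w * c for w, c in zip(row, cat)) + b
        for row, b in zip(weight, bias)
    ]
    return [xi + pi for xi, pi in zip(x, proj)]
```

If the linear layer merely selects the attention half of the concatenation, both forms give the same result, which shows that the concatenated variant generalizes the plain one.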
espnet.nets.pytorch_backend.transformer.dynamic_conv2d¶
Dynamic 2-Dimensional Convolution module.
-
class
espnet.nets.pytorch_backend.transformer.dynamic_conv2d.
DynamicConvolution2D
(wshare, n_feat, dropout_rate, kernel_size, use_kernel_mask=False, use_bias=False)[source]¶ Bases:
torch.nn.modules.module.Module
Dynamic 2-Dimensional Convolution layer.
This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq
- Parameters:
wshare (int) – the number of convolution kernels
n_feat (int) – the number of features
dropout_rate (float) – dropout_rate
kernel_size (int) – kernel size (length)
use_kernel_mask (bool) – Use causal mask or not for convolution kernel
use_bias (bool) – Use bias term or not.
Construct Dynamic 2-Dimensional Convolution layer.
-
forward
(query, key, value, mask)[source]¶ Forward of ‘Dynamic 2-Dimensional Convolution’.
This function takes query, key and value but uses only query. This is just for compatibility with self-attention layer (attention.py)
- Parameters:
query (torch.Tensor) – (batch, time1, d_model) input tensor
key (torch.Tensor) – (batch, time2, d_model) NOT USED
value (torch.Tensor) – (batch, time2, d_model) NOT USED
mask (torch.Tensor) – (batch, time1, time2) mask
- Returns:
(batch, time1, d_model) output
- Return type:
x (torch.Tensor)
espnet.nets.pytorch_backend.transformer.embedding¶
Positional Encoding Module.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
LearnableFourierPosEnc
(d_model, dropout_rate=0.0, max_len=5000, gamma=1.0, apply_scaling=False, hidden_dim=None)[source]¶ Bases:
torch.nn.modules.module.Module
Learnable Fourier Features for Positional Encoding.
See https://arxiv.org/pdf/2106.02795.pdf
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
gamma (float) – Initial parameter for the positional kernel variance; see https://arxiv.org/pdf/2106.02795.pdf.
apply_scaling (bool) – Whether to scale the input before adding the pos encoding.
hidden_dim (int) – if not None, we modulate the pos encodings with an MLP whose hidden layer has hidden_dim neurons.
Initialize class.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
LegacyRelPositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding
Relative positional encoding module (old version).
Details can be found in https://github.com/espnet/espnet/pull/2816.
See: Appendix B in https://arxiv.org/abs/1901.02860
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
Initialize class.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
PositionalEncoding
(d_model, dropout_rate, max_len=5000, reverse=False)[source]¶ Bases:
torch.nn.modules.module.Module
Positional encoding.
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
reverse (bool) – Whether to reverse the input position. Only for the class LegacyRelPositionalEncoding. We remove it in the current class RelPositionalEncoding.
Construct a PositionalEncoding object.
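This reference does not spell out the encoding formula. The sketch below assumes the standard sinusoidal scheme from the original Transformer paper, with the input scaled before the encoding is added. The function names and the scale argument are hypothetical (sqrt(d_model) is a common choice for the scale), and dropout is omitted.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(max_len):
        row = []
        # i walks over the even feature indices; each contributes a sin/cos pair.
        for i in range(0, d_model, 2):
            angle = pos / (10000.0 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def add_position(x, table, scale):
    """Scale each input frame and add the encoding for its time index."""
    return [
        [scale * v + p for v, p in zip(frame, table[t])]
        for t, frame in enumerate(x)
    ]
```

In this sketch max_len bounds the longest sequence that can be encoded, because the table is precomputed up to that length.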
-
class
espnet.nets.pytorch_backend.transformer.embedding.
RelPositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
torch.nn.modules.module.Module
Relative positional encoding module (new implementation).
Details can be found in https://github.com/espnet/espnet/pull/2816.
See: Appendix B in https://arxiv.org/abs/1901.02860
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
Construct a PositionalEncoding object.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
ScaledPositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding
Scaled positional encoding module.
See Sec. 3.2 https://arxiv.org/abs/1809.08895
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
Initialize class.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
StreamPositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
torch.nn.modules.module.Module
Streaming Positional encoding.
- Parameters:
d_model (int) – Embedding dimension.
dropout_rate (float) – Dropout rate.
max_len (int) – Maximum input length.
Construct a PositionalEncoding object.
espnet.nets.pytorch_backend.transformer.positionwise_feed_forward¶
Positionwise feed forward layer definition.
-
class
espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.
PositionwiseFeedForward
(idim, hidden_units, dropout_rate, activation=ReLU())[source]¶ Bases:
torch.nn.modules.module.Module
Positionwise feed forward layer.
- Parameters:
idim (int) – Input dimension.
hidden_units (int) – The number of hidden units.
dropout_rate (float) – Dropout rate.
Construct a PositionwiseFeedForward object.
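A positionwise feed-forward layer applies the same two linear maps to every frame independently. The sketch below assumes the default ReLU activation and omits dropout; the helper names and the explicit weight arguments are hypothetical.

```python
def linear(x, weight, bias):
    """Dense layer on one vector: weight is out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def positionwise_ffn(frames, w1, b1, w2, b2):
    """Apply linear -> ReLU -> linear to each frame separately."""
    out = []
    for x in frames:
        hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
        out.append(linear(hidden, w2, b2))
    return out
```

Here w1 maps idim to hidden_units and w2 maps back to idim, so the output keeps the input shape.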
espnet.nets.pytorch_backend.transformer.label_smoothing_loss¶
Label smoothing module.
-
class
espnet.nets.pytorch_backend.transformer.label_smoothing_loss.
LabelSmoothingLoss
(size, padding_idx, smoothing, normalize_length=False, criterion=KLDivLoss())[source]¶ Bases:
torch.nn.modules.module.Module
Label-smoothing loss.
- Parameters:
size (int) – the number of classes
padding_idx (int) – ignored class id
smoothing (float) – smoothing rate (0.0 means the conventional CE)
normalize_length (bool) – normalize loss by sequence length if True
criterion (torch.nn.Module) – loss function to be smoothed
Construct a LabelSmoothingLoss object.
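The effect of the smoothing parameter can be sketched for a single token. This is an illustration under assumptions: the smoothing mass is spread evenly over the size - 1 non-target classes, the criterion is a KL divergence as in the default, and both helper names are hypothetical. Padding handling and length normalization are omitted.

```python
import math


def smoothed_targets(target, size, smoothing):
    """Target distribution: 1 - smoothing on the label, the rest spread evenly."""
    off_value = smoothing / (size - 1)
    dist = [off_value] * size
    dist[target] = 1.0 - smoothing
    return dist


def kl_to_prediction(true_dist, log_probs):
    """KL(true_dist || prediction), treating 0 * log(0) as 0."""
    return sum(
        p * (math.log(p) - lq)
        for p, lq in zip(true_dist, log_probs)
        if p > 0.0
    )
```

With smoothing=0.0 the target is one-hot and the KL term equals the negative log-probability of the label, which matches the note that 0.0 means the conventional CE.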
espnet.nets.pytorch_backend.transformer.add_sos_eos¶
Utility functions for Transformer.
-
espnet.nets.pytorch_backend.transformer.add_sos_eos.
add_sos_eos
(ys_pad, sos, eos, ignore_id)[source]¶ Add <sos> and <eos> labels.
- Parameters:
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
sos (int) – index of <sos>
eos (int) – index of <eos>
ignore_id (int) – index of padding
- Returns:
two padded tensors, each (B, Lmax): the targets with <sos> prepended and the targets with <eos> appended
- Return type:
Tuple[torch.Tensor, torch.Tensor]
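The operation can be sketched on plain Python lists instead of tensors. The function name add_sos_eos_lists is hypothetical, and the padding choices (decoder inputs padded with eos, targets padded with ignore_id) are assumptions of this sketch rather than something stated in this reference.

```python
def add_sos_eos_lists(ys_pad, sos, eos, ignore_id):
    """Build decoder inputs (<sos> + y) and targets (y + <eos>) from padded ids."""
    # Drop padding entries from each target sequence.
    ys = [[tok for tok in y if tok != ignore_id] for y in ys_pad]
    ys_in = [[sos] + y for y in ys]
    ys_out = [y + [eos] for y in ys]
    max_len = max(len(y) for y in ys_in)
    # Assumption: inputs are padded with eos, targets with ignore_id.
    ys_in = [y + [eos] * (max_len - len(y)) for y in ys_in]
    ys_out = [y + [ignore_id] * (max_len - len(y)) for y in ys_out]
    return ys_in, ys_out
```

Padding the targets with ignore_id lets a loss function skip those positions.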
espnet.nets.pytorch_backend.transformer.attention¶
Multi-Head Attention layer definition.
-
class
espnet.nets.pytorch_backend.transformer.attention.
LegacyRelPositionMultiHeadedAttention
(n_head, n_feat, dropout_rate, zero_triu=False)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention
Multi-Head Attention layer with relative position encoding (old version).
Details can be found in https://github.com/espnet/espnet/pull/2816.
Paper: https://arxiv.org/abs/1901.02860
- Parameters:
n_head (int) – The number of heads.
n_feat (int) – The number of features.
dropout_rate (float) – Dropout rate.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
Construct a RelPositionMultiHeadedAttention object.
-
forward
(query, key, value, pos_emb, mask)[source]¶ Compute ‘Scaled Dot Product Attention’ with rel. positional encoding.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
pos_emb (torch.Tensor) – Positional embedding tensor (#batch, time1, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.attention.
MultiHeadedAttention
(n_head, n_feat, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Multi-Head Attention layer.
- Parameters:
n_head (int) – The number of heads.
n_feat (int) – The number of features.
dropout_rate (float) – Dropout rate.
Construct a MultiHeadedAttention object.
-
forward
(query, key, value, mask)[source]¶ Compute scaled dot product attention.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
-
forward_attention
(value, scores, mask)[source]¶ Compute attention context vector.
- Parameters:
value (torch.Tensor) – Transformed value (#batch, n_head, time2, d_k).
scores (torch.Tensor) – Attention score (#batch, n_head, time1, time2).
mask (torch.Tensor) – Mask (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
- Transformed value (#batch, time1, d_model)
weighted by the attention score (#batch, time1, time2).
- Return type:
torch.Tensor
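The context computation in forward_attention can be sketched for a single head and a single utterance. This is a plain-Python illustration under assumptions: attention_context is a hypothetical name, the mask uses True for positions that may be attended, and dropout and the head-merging projection are left out.

```python
import math


def attention_context(value, scores, mask):
    """Softmax the masked scores over time2 and take the weighted sum of values.

    value:  time2 x d_k nested list
    scores: time1 x time2 nested list
    mask:   time1 x time2 nested list of bools (True = may attend)
    """
    context = []
    for row, mask_row in zip(scores, mask):
        masked = [s if m else -math.inf for s, m in zip(row, mask_row)]
        peak = max(masked)
        if peak == -math.inf:
            # Every position is masked: use all-zero weights.
            weights = [0.0] * len(masked)
        else:
            exps = [math.exp(s - peak) for s in masked]
            total = sum(exps)
            weights = [e / total for e in exps]
        context.append(
            [sum(w * v[d] for w, v in zip(weights, value)) for d in range(len(value[0]))]
        )
    return context
```

Subtracting the row maximum before exponentiating keeps the softmax numerically stable.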
-
forward_qkv
(query, key, value)[source]¶ Transform query, key and value.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
- Returns:
Transformed query tensor (#batch, n_head, time1, d_k). torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k). torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.attention.
RelPositionMultiHeadedAttention
(n_head, n_feat, dropout_rate, zero_triu=False)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention
Multi-Head Attention layer with relative position encoding (new implementation).
Details can be found in https://github.com/espnet/espnet/pull/2816.
Paper: https://arxiv.org/abs/1901.02860
- Parameters:
n_head (int) – The number of heads.
n_feat (int) – The number of features.
dropout_rate (float) – Dropout rate.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
Construct a RelPositionMultiHeadedAttention object.
-
forward
(query, key, value, pos_emb, mask)[source]¶ Compute ‘Scaled Dot Product Attention’ with rel. positional encoding.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
pos_emb (torch.Tensor) – Positional embedding tensor (#batch, 2*time1-1, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
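The 2*time1-1 length of pos_emb follows from counting the distinct relative offsets between two positions in a sequence of length time1. A small sketch, in which the helper names and the ordering of the table from the largest positive offset to the most negative are assumptions:

```python
def relative_offsets(time1):
    """Offsets i - j covered by a relative position table, largest first."""
    return list(range(time1 - 1, -time1, -1))


def offset_index(i, j, time1):
    """Index into the table for query position i and key position j."""
    return (time1 - 1) - (i - j)
```

For time1 = 3 the offsets are 2, 1, 0, -1, -2, which gives 5 = 2*3 - 1 entries.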
espnet.nets.pytorch_backend.transformer.optimizer¶
Optimizer module.
espnet.nets.pytorch_backend.transformer.plot¶
-
class
espnet.nets.pytorch_backend.transformer.plot.
PlotAttentionReport
(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_factor=1)[source]¶
-
espnet.nets.pytorch_backend.transformer.plot.
plot_multi_head_attention
(data, uttid_list, attn_dict, outdir, suffix='png', savefn=<function savefig>, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_factor=4)[source]¶ Plot multi head attentions.
- Parameters:
data (dict) – utts info from json file
uttid_list (List) – utterance IDs
attn_dict (dict[str, torch.Tensor]) – multi head attention dict. values should be torch.Tensor (head, input_length, output_length)
outdir (str) – dir to save fig
suffix (str) – filename suffix including image type (e.g., png)
savefn – function to save
ikey (str) – key to access input
iaxis (int) – dimension to access input
okey (str) – key to access output
oaxis (int) – dimension to access output
subsampling_factor – subsampling factor in encoder
espnet.nets.pytorch_backend.transformer.initializer¶
Parameter initialization.
espnet.nets.pytorch_backend.transformer.decoder¶
Decoder definition.
-
class
espnet.nets.pytorch_backend.transformer.decoder.
Decoder
(odim, selfattention_layer_type='selfattn', attention_dim=256, attention_heads=4, conv_wshare=4, conv_kernel_length=11, conv_usebias=False, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, self_attention_dropout_rate=0.0, src_attention_dropout_rate=0.0, input_layer='embed', use_output_layer=True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False)[source]¶ Bases:
espnet.nets.scorer_interface.BatchScorerInterface
,torch.nn.modules.module.Module
Transformer decoder module.
- Parameters:
odim (int) – Output dimension.
selfattention_layer_type (str) – Self-attention layer type.
attention_dim (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
conv_wshare (int) – The number of convolution kernels. Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
conv_kernel_length (Union[int, str]) – Kernel size str of convolution (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
conv_usebias (bool) – Whether to use bias in convolution. Only used in selfattention_layer_type == “lightconv*” or “dynamiconv*”.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of decoder blocks.
dropout_rate (float) – Dropout rate.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
self_attention_dropout_rate (float) – Dropout rate in self-attention.
src_attention_dropout_rate (float) – Dropout rate in source-attention.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
use_output_layer (bool) – Whether to use output layer.
pos_enc_class (torch.nn.Module) – Positional encoding module class. PositionalEncoding or ScaledPositionalEncoding
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
Construct a Decoder object.
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch (required).
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
Tuple of batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(tgt, tgt_mask, memory, memory_mask)[source]¶ Forward decoder.
- Parameters:
tgt (torch.Tensor) – Input token ids, int64 (#batch, maxlen_out) if input_layer == “embed”. In the other case, input tensor (#batch, maxlen_out, odim).
tgt_mask (torch.Tensor) – Input token mask (#batch, maxlen_out). dtype=torch.uint8 in PyTorch versions before 1.2 and dtype=torch.bool in PyTorch 1.2 and later.
memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, feat).
memory_mask (torch.Tensor) – Encoded memory mask (#batch, maxlen_in). dtype=torch.uint8 in PyTorch versions before 1.2 and dtype=torch.bool in PyTorch 1.2 and later.
- Returns:
- Decoded token score before softmax (#batch, maxlen_out, odim)
if use_output_layer is True. Otherwise, final block outputs (#batch, maxlen_out, attention_dim).
torch.Tensor: Score mask before softmax (#batch, maxlen_out).
- Return type:
torch.Tensor
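Autoregressive decoders usually combine a padding mask with a causal ("subsequent") mask so that each output position sees only earlier tokens. The sketch below illustrates that idea on nested lists for one sequence. It is an assumption about typical usage rather than a statement of what this Decoder requires: the reference lists tgt_mask as (#batch, maxlen_out), and both helper names are hypothetical.

```python
def subsequent_mask(size):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]


def combine_masks(valid, causal):
    """AND a per-token validity list with a causal mask."""
    return [[c and valid[j] for j, c in enumerate(row)] for row in causal]
```

Combining the two keeps padded tokens hidden from every position while preserving the left-to-right constraint.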
-
forward_one_step
(tgt, tgt_mask, memory, cache=None)[source]¶ Forward one step.
- Parameters:
tgt (torch.Tensor) – Input token ids, int64 (#batch, maxlen_out).
tgt_mask (torch.Tensor) – Input token mask (#batch, maxlen_out). dtype=torch.uint8 in PyTorch versions before 1.2 and dtype=torch.bool in PyTorch 1.2 and later.
memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, feat).
cache (List[torch.Tensor]) – List of cached tensors. Each tensor shape should be (#batch, maxlen_out - 1, size).
- Returns:
Output tensor (batch, maxlen_out, odim). List[torch.Tensor]: List of cache tensors of each decoder layer.
- Return type:
torch.Tensor
espnet.nets.pytorch_backend.transformer.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.transformer.contextual_block_encoder_layer¶
Encoder self-attention layer definition.
-
class
espnet.nets.pytorch_backend.transformer.contextual_block_encoder_layer.
ContextualBlockEncoderLayer
(size, self_attn, feed_forward, dropout_rate, total_layer_num, normalize_before=True, concat_after=False)[source]¶ Bases:
torch.nn.modules.module.Module
Contextual Block Encoder layer module.
- Parameters:
size (int) – Input dimension.
self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention or RelPositionMultiHeadedAttention instance can be used as the argument.
feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.
dropout_rate (float) – Dropout rate.
total_layer_num (int) – Total number of layers
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
Construct an EncoderLayer object.
-
forward
(x, mask, infer_mode=False, past_ctx=None, next_ctx=None, is_short_segment=False, layer_idx=0, cache=None)[source]¶ Calculate forward propagation.
-
forward_infer
(x, mask, past_ctx=None, next_ctx=None, is_short_segment=False, layer_idx=0, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (torch.Tensor) – Input tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
past_ctx (torch.Tensor) – Previous contextual vector
next_ctx (torch.Tensor) – Next contextual vector
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, 1, time). cur_ctx (torch.Tensor): Current contextual vector next_ctx (torch.Tensor): Next contextual vector layer_idx (int): layer index number
- Return type:
torch.Tensor
-
forward_train
(x, mask, past_ctx=None, next_ctx=None, layer_idx=0, cache=None)[source]¶ Compute encoded features.
- Parameters:
x_input (torch.Tensor) – Input tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).
past_ctx (torch.Tensor) – Previous contextual vector
next_ctx (torch.Tensor) – Next contextual vector
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns:
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, 1, time). cur_ctx (torch.Tensor): Current contextual vector next_ctx (torch.Tensor): Next contextual vector layer_idx (int): layer index number
- Return type:
torch.Tensor
espnet.nets.pytorch_backend.transformer.repeat¶
Repeat the same layer definition.
-
class
espnet.nets.pytorch_backend.transformer.repeat.
MultiSequential
(*args, layer_drop_rate=0.0)[source]¶ Bases:
torch.nn.modules.container.Sequential
Multi-input multi-output torch.nn.Sequential.
Initialize MultiSequential with layer_drop.
- Parameters:
layer_drop_rate (float) – Probability of dropping out each fn (layer).
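A multi-input, multi-output sequential container with layer drop can be sketched without PyTorch. The class and function names below are hypothetical stand-ins for MultiSequential and repeat, the explicit training flag and rng attribute replace the module state a real torch.nn.Module would carry, and passing the layer index to fn is an assumption of this sketch.

```python
import random


class TupleSequential:
    """Chain layers that each take and return a tuple of values."""

    def __init__(self, *layers, layer_drop_rate=0.0, rng=random):
        self.layers = list(layers)
        self.layer_drop_rate = layer_drop_rate
        self.rng = rng
        self.training = True

    def __call__(self, *args):
        for layer in self.layers:
            # Draw one number per layer; layers are dropped only while training.
            keep = (not self.training) or self.rng.random() >= self.layer_drop_rate
            if keep:
                args = layer(*args)
        return args


def repeat_layers(n, fn, layer_drop_rate=0.0):
    """Build n layers with fn(index) and chain them."""
    return TupleSequential(*[fn(i) for i in range(n)], layer_drop_rate=layer_drop_rate)
```

Usage: repeat_layers(3, lambda i: (lambda x, m: (x + i, m)))(0, "mask") threads both values through all three layers and returns (3, "mask").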
-
espnet.nets.pytorch_backend.transformer.repeat.
repeat
(N, fn, layer_drop_rate=0.0)[source]¶ Repeat module N times.
- Parameters:
N (int) – Number of times to repeat.
fn (Callable) – Function to generate module.
layer_drop_rate (float) – Probability of dropping out each fn (layer).
- Returns:
Repeated model instance.
- Return type:
MultiSequential
espnet.nets.pytorch_backend.transformer.lightconv¶
Lightweight Convolution Module.
-
class
espnet.nets.pytorch_backend.transformer.lightconv.
LightweightConvolution
(wshare, n_feat, dropout_rate, kernel_size, use_kernel_mask=False, use_bias=False)[source]¶ Bases:
torch.nn.modules.module.Module
Lightweight Convolution layer.
This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq
- Parameters:
wshare (int) – the number of convolution kernels
n_feat (int) – the number of features
dropout_rate (float) – dropout_rate
kernel_size (int) – kernel size (length)
use_kernel_mask (bool) – Use causal mask or not for convolution kernel
use_bias (bool) – Use bias term or not.
Construct Lightweight Convolution layer.
-
forward
(query, key, value, mask)[source]¶ Forward of ‘Lightweight Convolution’.
This function takes query, key and value but uses only query. This is just for compatibility with self-attention layer (attention.py)
- Parameters:
query (torch.Tensor) – (batch, time1, d_model) input tensor
key (torch.Tensor) – (batch, time2, d_model) NOT USED
value (torch.Tensor) – (batch, time2, d_model) NOT USED
mask (torch.Tensor) – (batch, time1, time2) mask
- Returns:
(batch, time1, d_model) output
- Return type:
x (torch.Tensor)
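As a rough illustration of the idea behind a lightweight convolution, the sketch below applies a softmax-normalized kernel along time to a single channel. It assumes zero padding at the edges and that the kernel mask hides taps that look at future frames; the function name and the list-based layout are illustrative and do not reproduce ESPnet's tensor code or its weight sharing across channels.

```python
import math


def lightweight_conv_1ch(x, weights, causal=False):
    """Apply a softmax-normalized kernel along time to one channel.

    x: list of floats (time,); weights: raw kernel weights (odd length).
    """
    half = len(weights) // 2
    raw = list(weights)
    if causal:
        # Mask kernel taps that would look at future frames.
        raw = [w if i <= half else float("-inf") for i, w in enumerate(raw)]
    peak = max(raw)
    exps = [math.exp(w - peak) for w in raw]
    total = sum(exps)
    kernel = [e / total for e in exps]  # taps sum to 1 after softmax

    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            s = t + i - half  # source frame; outside [0, time) counts as zero
            if 0 <= s < len(x):
                acc += w * x[s]
        out.append(acc)
    return out


full = lightweight_conv_1ch([1.0] * 5, [0.0, 0.0, 0.0])
causal = lightweight_conv_1ch([1.0] * 5, [0.0, 0.0, 0.0], causal=True)
```

Because the taps are normalized, the output at interior frames is a weighted average of neighbouring inputs; with the causal mask only the current and past frames contribute.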
espnet.nets.pytorch_backend.transformer.argument¶
Transformer common arguments.
espnet.nets.pytorch_backend.transformer.subsampling¶
Subsampling layer definition.
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv1dSubsampling2
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 1D subsampling (to 1/2 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv1dSubsampling2 object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 2.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 2.
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv1dSubsampling3
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 1D subsampling (to 1/3 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv1dSubsampling3 object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 3.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 3.
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv2dSubsampling
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 2D subsampling (to 1/4 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv2dSubsampling object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 4.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 4.
- Return type:
torch.Tensor
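The relation between input and output length can be sketched with plain arithmetic. The example below assumes two unpadded convolutions with kernel size 3 and stride 2; those values are assumptions for illustration, as are the helper names conv_out_len and subsample_mask.

```python
def conv_out_len(time, kernel_size, stride):
    """Output length of an unpadded convolution along the time axis."""
    return (time - kernel_size) // stride + 1


def subsample_mask(mask, kernel_size, stride, num_layers):
    """Subsample a per-frame mask (list of bool) to match the conv output."""
    for _ in range(num_layers):
        # Keep the frame at the start of each convolution window.
        mask = mask[: len(mask) - kernel_size + 1 : stride]
    return mask


time = 100
length = time
for _ in range(2):  # two stride-2 layers give roughly 1/4 of the length
    length = conv_out_len(length, kernel_size=3, stride=2)

mask = subsample_mask([True] * time, kernel_size=3, stride=2, num_layers=2)
```

Under these assumed settings a 100-frame input yields 24 frames, so time // 4 is best read as an approximation of the output length rather than an exact formula.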
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv2dSubsampling1
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Similar to Conv2dSubsampling module, but without any subsampling performed.
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv2dSubsampling1 object.
-
forward
(x, x_mask)[source]¶ Pass x through 2 Conv2d layers without subsampling.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Output tensor (#batch, time’, odim),
where time’ = time - 4.
- torch.Tensor: Output mask (#batch, 1, time’),
where time’ = time - 4.
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv2dSubsampling2
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 2D subsampling (to 1/2 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv2dSubsampling2 object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 2.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 2.
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv2dSubsampling6
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 2D subsampling (to 1/6 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv2dSubsampling6 object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 6.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 6.
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.transformer.subsampling.
Conv2dSubsampling8
(idim, odim, dropout_rate, pos_enc=None)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional 2D subsampling (to 1/8 length).
- Parameters:
idim (int) – Input dimension.
odim (int) – Output dimension.
dropout_rate (float) – Dropout rate.
pos_enc (torch.nn.Module) – Custom position encoding layer.
Construct a Conv2dSubsampling8 object.
-
forward
(x, x_mask)[source]¶ Subsample x.
- Parameters:
x (torch.Tensor) – Input tensor (#batch, time, idim).
x_mask (torch.Tensor) – Input mask (#batch, 1, time).
- Returns:
- Subsampled tensor (#batch, time’, odim),
where time’ = time // 8.
- torch.Tensor: Subsampled mask (#batch, 1, time’),
where time’ = time // 8.
- Return type:
torch.Tensor
-
exception
espnet.nets.pytorch_backend.transformer.subsampling.
TooShortUttError
(message, actual_size, limit)[source]¶ Bases:
Exception
Raised when the utterance is too short for subsampling.
- Parameters:
message (str) – Message for error catch
actual_size (int) – the short size that cannot pass the subsampling
limit (int) – the limit size for subsampling
Construct a TooShortUttError for error handler.
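A hedged sketch of how an exception with these three fields might be raised and handled. TooShortUttErrorSketch is a stand-in defined here, not the ESPnet class, and the limit of 7 frames is a hypothetical value chosen for the example.

```python
class TooShortUttErrorSketch(Exception):
    """Stand-in mirroring the documented (message, actual_size, limit) fields."""

    def __init__(self, message, actual_size, limit):
        super().__init__(message)
        self.actual_size = actual_size
        self.limit = limit


def check_length(num_frames, limit):
    """Raise if the input has fewer frames than the subsampling needs."""
    if num_frames < limit:
        raise TooShortUttErrorSketch(
            f"has {num_frames} frames and is too short for subsampling "
            f"(it needs at least {limit} frames)",
            num_frames,
            limit,
        )
    return num_frames


try:
    check_length(5, limit=7)  # hypothetical limit
except TooShortUttErrorSketch as err:
    shortfall = err.limit - err.actual_size
```

Keeping the sizes as attributes lets a caller decide how to react, for example by skipping the utterance or padding it by the shortfall.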
espnet.nets.pytorch_backend.transformer.longformer_attention¶
Longformer based Local Attention Definition.
-
class
espnet.nets.pytorch_backend.transformer.longformer_attention.
LongformerAttention
(config: longformer.longformer.LongformerConfig, layer_id: int)[source]¶ Bases:
torch.nn.modules.module.Module
Longformer based Local Attention Definition.
Compute Longformer based Self-Attention.
- Parameters:
config – Longformer attention configuration
layer_id – Integer representing the layer index
-
forward
(query, key, value, mask)[source]¶ Compute Longformer Self-Attention with masking.
Expects len(hidden_states) to be a multiple of attention_window. Padding to attention_window happens in encoder.forward() to avoid redoing the padding on each layer.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
pos_emb – Positional embedding tensor (#batch, 2*time1-1, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
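The padding requirement can be illustrated with a small sketch that right-pads a sequence to the next multiple of the window size. The helper name, the right-padding choice, and the pad value are assumptions for illustration, not the encoder's actual code.

```python
def pad_to_window_multiple(frames, attention_window, pad_value=0.0):
    """Right-pad a sequence so its length is a multiple of attention_window."""
    remainder = len(frames) % attention_window
    pad_len = (attention_window - remainder) % attention_window
    return frames + [pad_value] * pad_len, pad_len


padded, pad_len = pad_to_window_multiple([1.0] * 10, attention_window=4)
```

Returning pad_len alongside the padded sequence makes it easy to strip the extra frames again after the last layer.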
espnet.nets.pytorch_backend.transformer.mask¶
Mask module.
-
espnet.nets.pytorch_backend.transformer.mask.
subsequent_mask
(size, device='cpu', dtype=torch.bool)[source]¶ Create mask for subsequent steps (size, size).
- Parameters:
size (int) – size of mask
device (str) – “cpu” or “cuda” or torch.Tensor.device
dtype (torch.dtype) – result dtype
- Return type:
torch.Tensor
>>> subsequent_mask(3)
[[1, 0, 0], [1, 1, 0], [1, 1, 1]]
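A pure-Python sketch of the same lower-triangular pattern, using nested lists instead of a tensor. The function name is illustrative and the device and dtype arguments are left out.

```python
def subsequent_mask_sketch(size):
    """Return a (size, size) lower-triangular mask as nested lists of 0/1."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]


mask3 = subsequent_mask_sketch(3)
```

Row i marks the positions that step i may attend to: itself and every earlier step.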
-
espnet.nets.pytorch_backend.transformer.mask.
target_mask
(ys_in_pad, ignore_id)[source]¶ Create mask for decoder self-attention.
- Parameters:
ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
ignore_id (int) – index of padding
dtype (torch.dtype) – result dtype
- Return type:
torch.Tensor (B, Lmax, Lmax)
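One way to picture the resulting (B, Lmax, Lmax) mask is as the combination of a padding mask with a causal mask. The sketch below assumes exactly that combination and uses nested lists; the function name is illustrative.

```python
def target_mask_sketch(ys_in_pad, ignore_id):
    """Combine a padding mask with a causal mask for each sequence in the batch."""
    masks = []
    for ys in ys_in_pad:
        length = len(ys)
        not_pad = [tok != ignore_id for tok in ys]
        masks.append(
            [
                # Position col is visible from row if it is real and not in the future.
                [not_pad[col] and col <= row for col in range(length)]
                for row in range(length)
            ]
        )
    return masks


batch_mask = target_mask_sketch([[1, 2, -1]], ignore_id=-1)
```

Padded positions are hidden in every row, while real positions follow the usual lower-triangular pattern.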
espnet.nets.pytorch_backend.transducer.custom_decoder¶
Custom decoder definition for Transducer model.
-
class
espnet.nets.pytorch_backend.transducer.custom_decoder.
CustomDecoder
(odim: int, dec_arch: List, input_layer: str = 'embed', repeat_block: int = 0, joint_activation_type: str = 'tanh', positional_encoding_type: str = 'abs_pos', positionwise_layer_type: str = 'linear', positionwise_activation_type: str = 'relu', input_layer_dropout_rate: float = 0.0, blank_id: int = 0)[source]¶ Bases:
espnet.nets.transducer_decoder_interface.TransducerDecoderInterface
,torch.nn.modules.module.Module
Custom decoder module for Transducer model.
- Parameters:
odim – Output dimension.
dec_arch – Decoder block architecture (type and parameters).
input_layer – Input layer type.
repeat_block – Number of times dec_arch is repeated.
joint_activation_type – Type of activation for joint network.
positional_encoding_type – Positional encoding type.
positionwise_layer_type – Positionwise layer type.
positionwise_activation_type – Positionwise activation type.
input_layer_dropout_rate – Dropout rate for input layer.
blank_id – Blank symbol ID.
Construct a CustomDecoder object.
-
batch_score
(hyps: Union[List[espnet.nets.transducer_decoder_interface.Hypothesis], List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]], dec_states: List[Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, List[Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
dec_states – Decoder hidden states. [N x (B, U, D_dec)]
cache – Pairs of (h_dec, dec_states) for each label sequence. (keys)
use_lm – Whether to compute label ID sequences for LM.
- Returns:
- dec_out: Decoder output sequences. (B, D_dec)
- dec_states: Decoder hidden states. [N x (B, U, D_dec)]
- lm_labels: Label ID sequences for LM. (B,)
-
create_batch_states
(states: List[Optional[torch.Tensor]], new_states: List[Optional[torch.Tensor]], check_list: List[List[int]]) → List[Optional[torch.Tensor]][source]¶ Create decoder hidden states sequences.
- Parameters:
states – Decoder hidden states. [N x (B, U, D_dec)]
new_states – Decoder hidden states. [B x [N x (1, U, D_dec)]]
check_list – Label ID sequences.
- Returns:
- states: New decoder hidden states. [N x (B, U, D_dec)]
-
forward
(dec_input: torch.Tensor, dec_mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Encode label ID sequences.
- Parameters:
dec_input – Label ID sequences. (B, U)
dec_mask – Label mask sequences. (B, U)
- Returns:
- dec_output: Decoder output sequences. (B, U, D_dec)
- dec_output_mask: Mask of decoder output sequences. (B, U)
-
init_state
(batch_size: Optional[int] = None) → List[Optional[torch.Tensor]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
- state: Initial decoder hidden states. [N x None]
-
score
(hyp: espnet.nets.transducer_decoder_interface.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, List[Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters:
hyp – Hypothesis.
cache – Pairs of (dec_out, dec_state) for each label sequence. (key)
- Returns:
- dec_out: Decoder output sequence. (1, D_dec)
- dec_state: Decoder hidden states. [N x (1, U, D_dec)]
- lm_label: Label ID for LM. (1,)
-
select_state
(states: List[Optional[torch.Tensor]], idx: int) → List[Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
states – Decoder hidden states. [N x (B, U, D_dec)]
idx – State ID to extract.
- Returns:
- state_idx: Decoder hidden state for given ID. [N x (1, U, D_dec)]
espnet.nets.pytorch_backend.transducer.rnn_encoder¶
RNN encoder implementation for Transducer model.
These classes are based on the ones in espnet.nets.pytorch_backend.rnn.encoders, modified to output intermediate representations for a given list of layers. To do so, the RNN class relies on a stack of 1-layer LSTMs instead of a multi-layer LSTM. The additional outputs are intended to be used with Transducer auxiliary tasks.
-
class
espnet.nets.pytorch_backend.transducer.rnn_encoder.
Encoder
(idim: int, etype: str, elayers: int, eunits: int, eprojs: int, subsample: numpy.ndarray, dropout_rate: float = 0.0, aux_enc_output_layers: List = [])[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module.
- Parameters:
idim – Input dimension.
etype – Encoder units type.
elayers – Number of encoder layers.
eunits – Number of encoder units per layer.
eprojs – Number of projection units per layer.
subsample – Subsampling rate per layer.
dropout_rate – Dropout rate for encoder layers.
aux_enc_output_layers – Layer IDs for auxiliary encoder output sequences.
Initialize Encoder module.
-
forward
(feats: torch.Tensor, feats_len: torch.Tensor, prev_states: Optional[List[torch.Tensor]] = None)[source]¶ Forward encoder.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
feats_len – Feature sequences lengths. (B,)
prev_states – Previous encoder hidden states. [N x (B, T, D_enc)]
- Returns:
- enc_out: Encoder output sequences. (B, T, D_enc) with or without encoder intermediate output sequences. ((B, T, D_enc), [N x (B, T, D_enc)])
- enc_out_len: Encoder output sequence lengths. (B,)
- current_states: Encoder hidden states. [N x (B, T, D_enc)]
-
class
espnet.nets.pytorch_backend.transducer.rnn_encoder.
RNN
(idim: int, rnn_type: str, elayers: int, eunits: int, eprojs: int, dropout_rate: float, aux_output_layers: List = [])[source]¶ Bases:
torch.nn.modules.module.Module
RNN module.
- Parameters:
idim – Input dimension.
rnn_type – RNN units type.
elayers – Number of RNN layers.
eunits – Number of units ((2 * eunits) if bidirectional)
eprojs – Number of final projection units.
dropout_rate – Dropout rate for RNN layers.
aux_output_layers – List of layer IDs for auxiliary RNN output sequences.
Initialize RNN module.
-
forward
(rnn_input: torch.Tensor, rnn_len: torch.Tensor, prev_states: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor][source]¶ RNN forward.
- Parameters:
rnn_input – RNN input sequences. (B, T, D_in)
rnn_len – RNN input sequences lengths. (B,)
prev_states – RNN hidden states. [N x (B, T, D_proj)]
- Returns:
- rnn_output: RNN output sequences. (B, T, D_proj) with or without intermediate RNN output sequences. ((B, T, D_proj), [N x (B, T, D_proj)])
- rnn_len: RNN output sequence lengths. (B,)
- current_states: RNN hidden states. [N x (B, T, D_proj)]
-
class
espnet.nets.pytorch_backend.transducer.rnn_encoder.
RNNP
(idim: int, rnn_type: str, elayers: int, eunits: int, eprojs: int, subsample: numpy.ndarray, dropout_rate: float, aux_output_layers: List = [])[source]¶ Bases:
torch.nn.modules.module.Module
RNN with projection layer module.
- Parameters:
idim – Input dimension.
rnn_type – RNNP units type.
elayers – Number of RNNP layers.
eunits – Number of units ((2 * eunits) if bidirectional).
eprojs – Number of projection units.
subsample – Subsampling rate per layer.
dropout_rate – Dropout rate for RNNP layers.
aux_output_layers – Layer IDs for auxiliary RNNP output sequences.
Initialize RNNP module.
-
forward
(rnn_input: torch.Tensor, rnn_len: torch.Tensor, prev_states: Optional[List[torch.Tensor]] = None) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor][source]¶ RNNP forward.
- Parameters:
rnn_input – RNN input sequences. (B, T, D_in)
rnn_len – RNN input sequences lengths. (B,)
prev_states – RNN hidden states. [N x (B, T, D_proj)]
- Returns:
- rnn_output: RNN output sequences. (B, T, D_proj) with or without intermediate RNN output sequences. ((B, T, D_proj), [N x (B, T, D_proj)])
- rnn_len: RNN output sequence lengths. (B,)
- current_states: RNN hidden states. [N x (B, T, D_proj)]
-
class
espnet.nets.pytorch_backend.transducer.rnn_encoder.
VGG2L
(in_channel: int = 1)[source]¶ Bases:
torch.nn.modules.module.Module
VGG-like module.
- Parameters:
in_channel – number of input channels
Initialize VGG-like module.
-
forward
(feats: torch.Tensor, feats_len: torch.Tensor, **kwargs) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ VGG2L forward.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
feats_len – Feature sequences lengths. (B, )
- Returns:
- vgg_out: VGG2L output sequences. (B, F // 4, 128 * D_feats // 4)
- vgg_out_len: VGG2L output sequence lengths. (B,)
-
espnet.nets.pytorch_backend.transducer.rnn_encoder.
encoder_for
(args: argparse.Namespace, idim: int, subsample: numpy.ndarray, aux_enc_output_layers: List = []) → torch.nn.modules.module.Module[source]¶ Instantiate a RNN encoder with specified arguments.
- Parameters:
args – The model arguments.
idim – Input dimension.
subsample – Subsampling rate per layer.
aux_enc_output_layers – Layer IDs for auxiliary encoder output sequences.
- Returns:
Encoder module.
-
espnet.nets.pytorch_backend.transducer.rnn_encoder.
reset_backward_rnn_state
(states: Union[torch.Tensor, List[Optional[torch.Tensor]]]) → Union[torch.Tensor, List[Optional[torch.Tensor]]][source]¶ Set backward BRNN states to zeroes.
- Parameters:
states – Encoder hidden states.
- Returns:
- states: Encoder hidden states with backward set to zero.
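A minimal sketch of zeroing the backward-direction states, assuming forward and backward states are interleaved along the first axis with the backward ones at odd indices. That layout is an assumption made for this example, and the function works on plain lists rather than tensors.

```python
def reset_backward_states_sketch(states):
    """Zero every second entry, assumed to hold the backward-direction state."""
    return [
        [0.0] * len(state) if idx % 2 == 1 else list(state)
        for idx, state in enumerate(states)
    ]


# Two layers, two directions each: fwd0, bwd0, fwd1, bwd1 (assumed order).
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reset = reset_backward_states_sketch(states)
```

The forward states are copied unchanged, so the input list is left untouched.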
espnet.nets.pytorch_backend.transducer.arguments¶
Transducer model arguments.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_auxiliary_task_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Add arguments for auxiliary task.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_custom_decoder_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define arguments for Custom decoder.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_custom_encoder_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define arguments for Custom encoder.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_custom_training_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define arguments for training with Custom architecture.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_decoder_general_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define general arguments for decoder.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_encoder_general_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define general arguments for encoder.
-
espnet.nets.pytorch_backend.transducer.arguments.
add_rnn_decoder_arguments
(group: argparse._ArgumentGroup) → argparse._ArgumentGroup[source]¶ Define arguments for RNN decoder.
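Each function in this module takes an argument group and returns it, so several of them can be chained on one parser. The sketch below shows that pattern with the standard argparse module; add_example_arguments and the --example-units flag are hypothetical and are not ESPnet options.

```python
import argparse


def add_example_arguments(group):
    """Hypothetical helper following the documented group-in, group-out pattern."""
    group.add_argument(
        "--example-units",  # hypothetical flag, not an ESPnet option
        type=int,
        default=320,
        help="Illustrative option only.",
    )
    return group


parser = argparse.ArgumentParser()
group = parser.add_argument_group("example arguments")
group = add_example_arguments(group)
args = parser.parse_args(["--example-units", "512"])
```

Grouping related options this way also keeps the generated help output organized by component.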
espnet.nets.pytorch_backend.transducer.error_calculator¶
CER/WER computation for Transducer model.
-
class
espnet.nets.pytorch_backend.transducer.error_calculator.
ErrorCalculator
(decoder: Union[espnet.nets.pytorch_backend.transducer.rnn_decoder.RNNDecoder, espnet.nets.pytorch_backend.transducer.custom_decoder.CustomDecoder], joint_network: espnet.nets.pytorch_backend.transducer.joint_network.JointNetwork, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
CER and WER computation for Transducer model.
- Parameters:
decoder – Decoder module.
joint_network – Joint network module.
token_list – Set of unique labels.
sym_space – Space symbol.
sym_blank – Blank symbol.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculator object for Transducer model.
-
calculate_cer
(hyps: torch.Tensor, refs: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters:
hyps – Hypotheses sequences. (B, L)
refs – References sequences. (B, L)
- Returns:
Average sentence-level CER score.
-
calculate_wer
(hyps: torch.Tensor, refs: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters:
hyps – Hypotheses sequences. (B, L)
refs – References sequences. (B, L)
- Returns:
Average sentence-level WER score.
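Both scores rest on an edit distance between hypothesis and reference. The sketch below is a plain Levenshtein implementation that averages per sentence. Whether ESPnet averages per sentence or pools over all reference units is not stated here, so treat the averaging, the space handling, and the function names as assumptions.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(hyps, refs, unit="char"):
    """Average sentence-level error rate over paired hypotheses and references."""
    rates = []
    for hyp, ref in zip(hyps, refs):
        if unit == "word":
            hyp_units, ref_units = hyp.split(), ref.split()
        else:
            # Characters with spaces removed (an assumption of this sketch).
            hyp_units, ref_units = hyp.replace(" ", ""), ref.replace(" ", "")
        rates.append(edit_distance(hyp_units, ref_units) / len(ref_units))
    return sum(rates) / len(rates)


cer = error_rate(["abc"], ["abd"])
wer = error_rate(["a b c"], ["a b d"], unit="word")
```

The sketch assumes non-empty references; an empty reference would need separate handling to avoid a division by zero.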
-
convert_to_char
(hyps: torch.Tensor, refs: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character.
- Parameters:
hyps – Hypotheses sequences. (B, L)
refs – References sequences. (B, L)
- Returns:
- char_hyps: Character list of hypotheses.
- char_refs: Character list of references.
espnet.nets.pytorch_backend.transducer.blocks¶
Set of methods to create custom architecture.
-
espnet.nets.pytorch_backend.transducer.blocks.
build_blocks
(net_part: str, idim: int, input_layer_type: str, blocks: List[Dict[str, Any]], repeat_block: int = 0, self_attn_type: str = 'self_attn', positional_encoding_type: str = 'abs_pos', positionwise_layer_type: str = 'linear', positionwise_activation_type: str = 'relu', conv_mod_activation_type: str = 'relu', input_layer_dropout_rate: float = 0.0, input_layer_pos_enc_dropout_rate: float = 0.0, padding_idx: int = -1) → Tuple[Union[espnet.nets.pytorch_backend.transformer.subsampling.Conv2dSubsampling, espnet.nets.pytorch_backend.transducer.vgg2l.VGG2L, torch.nn.modules.container.Sequential], espnet.nets.pytorch_backend.transformer.repeat.MultiSequential, int, int][source]¶ Build custom model blocks.
- Parameters:
net_part – Network part, either ‘encoder’ or ‘decoder’.
idim – Input dimension.
input_layer_type – Input layer type.
blocks – Blocks parameters for network part.
repeat_block – Number of times provided blocks are repeated.
positional_encoding_type – Positional encoding layer type.
positionwise_layer_type – Positionwise layer type.
positionwise_activation_type – Positionwise activation type.
conv_mod_activation_type – Convolutional module activation type.
input_layer_dropout_rate – Dropout rate for input layer.
input_layer_pos_enc_dropout_rate – Dropout rate for input layer pos. enc.
padding_idx – Padding symbol ID for embedding layer.
- Returns:
- in_layer: Input layer
- all_blocks: Encoder/Decoder network.
- out_dim: Network output dimension.
- conv_subsampling_factor: Subsampling factor in frontend CNN.
-
espnet.nets.pytorch_backend.transducer.blocks.
build_conformer_block
(block: Dict[str, Any], self_attn_class: str, pw_layer_type: str, pw_activation_type: str, conv_mod_activation_type: str) → espnet.nets.pytorch_backend.conformer.encoder_layer.EncoderLayer[source]¶ Build function for conformer block.
- Parameters:
block – Conformer block parameters.
self_attn_class – Self-attention module type.
pw_layer_type – Positionwise layer type.
pw_activation_type – Positionwise activation type.
conv_mod_activation_type – Convolutional module activation type.
- Returns:
Function to create conformer (encoder) block.
-
espnet.nets.pytorch_backend.transducer.blocks.
build_conv1d_block
(block: Dict[str, Any], block_type: str) → espnet.nets.pytorch_backend.transducer.conv1d_nets.CausalConv1d[source]¶ Build function for causal conv1d block.
- Parameters:
block – CausalConv1d or Conv1D block parameters.
- Returns:
Function to create conv1d (encoder) or causal conv1d (decoder) block.
-
espnet.nets.pytorch_backend.transducer.blocks.
build_input_layer
(block: Dict[str, Any], pos_enc_class: torch.nn.modules.module.Module, padding_idx: int) → Tuple[Union[espnet.nets.pytorch_backend.transformer.subsampling.Conv2dSubsampling, espnet.nets.pytorch_backend.transducer.vgg2l.VGG2L, torch.nn.modules.container.Sequential], int][source]¶ Build input layer.
- Parameters:
block – Architecture definition of input layer.
pos_enc_class – Positional encoding class.
padding_idx – Padding symbol ID for embedding layer (if provided).
- Returns:
Input layer module. subsampling_factor: Subsampling factor.
-
espnet.nets.pytorch_backend.transducer.blocks.
build_transformer_block
(net_part: str, block: Dict[str, Any], pw_layer_type: str, pw_activation_type: str) → Union[espnet.nets.pytorch_backend.transformer.encoder_layer.EncoderLayer, espnet.nets.pytorch_backend.transducer.transformer_decoder_layer.TransformerDecoderLayer][source]¶ Build function for transformer block.
- Parameters:
net_part – Network part, either ‘encoder’ or ‘decoder’.
block – Transformer block parameters.
pw_layer_type – Positionwise layer type.
pw_activation_type – Positionwise activation type.
- Returns:
Function to create transformer (encoder or decoder) block.
-
espnet.nets.pytorch_backend.transducer.blocks.
get_pos_enc_and_att_class
(net_part: str, pos_enc_type: str, self_attn_type: str) → Tuple[Union[espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding, espnet.nets.pytorch_backend.transformer.embedding.ScaledPositionalEncoding, espnet.nets.pytorch_backend.transformer.embedding.RelPositionalEncoding], Union[espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention, espnet.nets.pytorch_backend.transformer.attention.RelPositionMultiHeadedAttention]][source]¶ Get positional encoding and self attention module class.
- Parameters:
net_part – Network part, either ‘encoder’ or ‘decoder’.
pos_enc_type – Positional encoding type.
self_attn_type – Self-attention type.
- Returns:
- pos_enc_class: Positional encoding class.
- self_attn_class: Self-attention class.
-
espnet.nets.pytorch_backend.transducer.blocks.
prepare_body_model
(net_part: str, blocks: List[Dict[str, Any]]) → Tuple[int][source]¶ Prepare model body blocks.
- Parameters:
net_part – Network part, either ‘encoder’ or ‘decoder’.
blocks – Blocks parameters for network part.
- Returns:
Network output dimension.
-
espnet.nets.pytorch_backend.transducer.blocks.
prepare_input_layer
(input_layer_type: str, feats_dim: int, blocks: List[Dict[str, Any]], dropout_rate: float, pos_enc_dropout_rate: float) → Dict[str, Any][source]¶ Prepare input layer arguments.
- Parameters:
input_layer_type – Input layer type.
feats_dim – Dimension of input features.
blocks – Blocks parameters for network part.
dropout_rate – Dropout rate for input layer.
pos_enc_dropout_rate – Dropout rate for input layer pos. enc.
- Returns:
- input_block: Input block parameters.
-
espnet.nets.pytorch_backend.transducer.blocks.
verify_block_arguments
(net_part: str, block: Dict[str, Any], num_block: int) → Tuple[int, int][source]¶ Verify block arguments are valid.
- Parameters:
net_part – Network part, either ‘encoder’ or ‘decoder’.
block – Block parameters.
num_block – Block ID.
- Returns:
- block_io: Input and output dimension of the block.
espnet.nets.pytorch_backend.transducer.custom_encoder¶
Custom encoder definition for transducer models.
-
class
espnet.nets.pytorch_backend.transducer.custom_encoder.
CustomEncoder
(idim: int, enc_arch: List, input_layer: str = 'linear', repeat_block: int = 1, self_attn_type: str = 'selfattn', positional_encoding_type: str = 'abs_pos', positionwise_layer_type: str = 'linear', positionwise_activation_type: str = 'relu', conv_mod_activation_type: str = 'relu', aux_enc_output_layers: List = [], input_layer_dropout_rate: float = 0.0, input_layer_pos_enc_dropout_rate: float = 0.0, padding_idx: int = -1)[source]¶ Bases:
torch.nn.modules.module.Module
Custom encoder module for transducer models.
- Parameters:
idim – Input dimension.
enc_arch – Encoder block architecture (type and parameters).
input_layer – Input layer type.
repeat_block – Number of times enc_arch is repeated.
self_attn_type – Self-attention type.
positional_encoding_type – Positional encoding type.
positionwise_layer_type – Positionwise layer type.
positionwise_activation_type – Positionwise activation type.
conv_mod_activation_type – Convolutional module activation type.
aux_enc_output_layers – Layer IDs for auxiliary encoder output sequences.
input_layer_dropout_rate – Dropout rate for input layer.
input_layer_pos_enc_dropout_rate – Dropout rate for input layer pos. enc.
padding_idx – Padding symbol ID for embedding layer.
Construct a CustomEncoder object.
-
forward
(feats: torch.Tensor, mask: torch.Tensor) → Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]][source]¶ Encode feature sequences.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
mask – Feature mask sequences. (B, 1, F)
- Returns:
- enc_out: Encoder output sequences. (B, T, D_enc) with/without auxiliary encoder output sequences. (B, T, D_enc_aux)
- enc_out_mask: Mask for encoder output sequences. (B, 1, T) with/without mask for auxiliary encoder output sequences. (B, T, D_enc_aux)
espnet.nets.pytorch_backend.transducer.transformer_decoder_layer¶
Transformer decoder layer definition for custom Transducer model.
-
class
espnet.nets.pytorch_backend.transducer.transformer_decoder_layer.
TransformerDecoderLayer
(hdim: int, self_attention: espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention, feed_forward: espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.PositionwiseFeedForward, dropout_rate: float)[source]¶ Bases:
torch.nn.modules.module.Module
Transformer decoder layer module for custom Transducer model.
- Parameters:
hdim – Hidden dimension.
self_attention – Self-attention module.
feed_forward – Feed forward module.
dropout_rate – Dropout rate.
Construct a TransformerDecoderLayer object.
-
forward
(sequence: torch.Tensor, mask: torch.Tensor, cache: Optional[torch.Tensor] = None)[source]¶ Compute previous decoder output sequences.
- Parameters:
sequence – Transformer input sequences. (B, U, D_dec)
mask – Transformer input mask sequences. (B, U)
cache – Cached decoder output sequences. (B, (U - 1), D_dec)
- Returns:
- sequence: Transformer output sequences. (B, U, D_dec)
- mask: Transformer output mask sequences. (B, U)
espnet.nets.pytorch_backend.transducer.rnn_decoder¶
RNN decoder definition for Transducer model.
-
class
espnet.nets.pytorch_backend.transducer.rnn_decoder.
RNNDecoder
(odim: int, dtype: str, dlayers: int, dunits: int, embed_dim: int, dropout_rate: float = 0.0, dropout_rate_embed: float = 0.0, blank_id: int = 0)[source]¶ Bases:
espnet.nets.transducer_decoder_interface.TransducerDecoderInterface
,torch.nn.modules.module.Module
RNN decoder module for Transducer model.
- Parameters:
odim – Output dimension.
dtype – Decoder units type.
dlayers – Number of decoder layers.
dunits – Number of decoder units per layer.
embed_dim – Embedding layer dimension.
dropout_rate – Dropout rate for decoder layers.
dropout_rate_embed – Dropout rate for embedding layer.
blank_id – Blank symbol ID.
Transducer initializer.
-
batch_score
(hyps: Union[List[espnet.nets.transducer_decoder_interface.Hypothesis], List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters:
hyps – Hypotheses.
dec_states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
cache – Pairs of (dec_out, dec_states) for each label sequence. (keys)
use_lm – Whether to compute label ID sequences for LM.
- Returns:
- dec_out: Decoder output sequences. (B, D_dec)
- dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- lm_labels: Label ID sequences for LM. (B,)
-
create_batch_states
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]
- Returns:
- states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters:
labels – Label ID sequences. (B, L)
- Returns:
- dec_out: Decoder output sequences. (B, T, U, D_dec)
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]¶ Initialize decoder states.
- Parameters:
batch_size – Batch size.
- Returns:
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
rnn_forward
(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters:
sequence – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Returns:
RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type:
sequence
-
score
(hyp: espnet.nets.transducer_decoder_interface.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters:
hyp – Hypothesis.
cache – Pairs of (dec_out, state) for each label sequence. (key)
- Returns:
Decoder output sequence. (1, D_dec) new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec)) label: Label ID for LM. (1,)
- Return type:
dec_out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters:
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
idx – State ID to extract.
- Returns:
- Decoder hidden state for given ID.
((N, 1, D_dec), (N, 1, D_dec))
espnet.nets.pytorch_backend.transducer.vgg2l¶
VGG2L module definition for custom encoder.
-
class
espnet.nets.pytorch_backend.transducer.vgg2l.
VGG2L
(idim: int, odim: int, pos_enc: torch.nn.modules.module.Module = None)[source]¶ Bases:
torch.nn.modules.module.Module
VGG2L module for custom encoder.
- Parameters:
idim – Input dimension.
odim – Output dimension.
pos_enc – Positional encoding class.
Construct a VGG2L object.
-
create_new_mask
(feats_mask: torch.Tensor) → torch.Tensor[source]¶ Create a subsampled mask of feature sequences.
- Parameters:
feats_mask – Mask of feature sequences. (B, 1, F)
- Returns:
Mask of VGG2L output sequences. (B, 1, sub(F))
- Return type:
vgg_mask
-
forward
(feats: torch.Tensor, feats_mask: torch.Tensor) → Union[Tuple[torch.Tensor, torch.Tensor], Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]][source]¶ Forward VGG2L bottleneck.
- Parameters:
feats – Feature sequences. (B, F, D_feats)
feats_mask – Mask of feature sequences. (B, 1, F)
- Returns:
- VGG output sequences.
(B, sub(F), D_out) or ((B, sub(F), D_out), (B, sub(F), D_att))
vgg_mask: Mask of VGG output sequences. (B, 1, sub(F))
- Return type:
vgg_output
espnet.nets.pytorch_backend.transducer.transducer_tasks¶
Module implementing Transducer main and auxiliary tasks.
-
class
espnet.nets.pytorch_backend.transducer.transducer_tasks.
TransducerTasks
(encoder_dim: int, decoder_dim: int, joint_dim: int, output_dim: int, joint_activation_type: str = 'tanh', transducer_loss_weight: float = 1.0, ctc_loss: bool = False, ctc_loss_weight: float = 0.5, ctc_loss_dropout_rate: float = 0.0, lm_loss: bool = False, lm_loss_weight: float = 0.5, lm_loss_smoothing_rate: float = 0.0, aux_transducer_loss: bool = False, aux_transducer_loss_weight: float = 0.2, aux_transducer_loss_mlp_dim: int = 320, aux_trans_loss_mlp_dropout_rate: float = 0.0, symm_kl_div_loss: bool = False, symm_kl_div_loss_weight: float = 0.2, fastemit_lambda: float = 0.0, blank_id: int = 0, ignore_id: int = -1, training: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Transducer tasks module.
Initialize module for Transducer tasks.
- Parameters:
encoder_dim – Encoder outputs dimension.
decoder_dim – Decoder outputs dimension.
joint_dim – Joint space dimension.
output_dim – Output dimension.
joint_activation_type – Type of activation for joint network.
transducer_loss_weight – Weight for main transducer loss.
ctc_loss – Compute CTC loss.
ctc_loss_weight – Weight of CTC loss.
ctc_loss_dropout_rate – Dropout rate for CTC loss inputs.
lm_loss – Compute LM loss.
lm_loss_weight – Weight of LM loss.
lm_loss_smoothing_rate – Smoothing rate for LM loss’ label smoothing.
aux_transducer_loss – Compute auxiliary transducer loss.
aux_transducer_loss_weight – Weight of auxiliary transducer loss.
aux_transducer_loss_mlp_dim – Hidden dimension for aux. transducer MLP.
aux_trans_loss_mlp_dropout_rate – Dropout rate for aux. transducer MLP.
symm_kl_div_loss – Compute KL divergence loss.
symm_kl_div_loss_weight – Weight of KL divergence loss.
fastemit_lambda – Regularization parameter for FastEmit.
blank_id – Blank symbol ID.
ignore_id – Padding symbol ID.
training – Whether the model was initialized in training or inference mode.
-
compute_aux_transducer_and_symm_kl_div_losses
(aux_enc_out: torch.Tensor, dec_out: torch.Tensor, joint_out: torch.Tensor, target: torch.Tensor, aux_t_len: torch.Tensor, u_len: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute auxiliary Transducer loss and symmetric KL divergence loss.
- Parameters:
aux_enc_out – Encoder auxiliary output sequences. [N x (B, T_aux, D_enc_aux)]
dec_out – Decoder output sequences. (B, U, D_dec)
joint_out – Joint output sequences. (B, T, U, D_joint)
target – Target character ID sequences. (B, L)
aux_t_len – Auxiliary time lengths. [N x (B,)]
u_len – True U lengths. (B,)
- Returns:
Auxiliary Transducer loss and KL divergence loss values.
-
compute_ctc_loss
(enc_out: torch.Tensor, target: torch.Tensor, t_len: torch.Tensor, u_len: torch.Tensor)[source]¶ Compute CTC loss.
- Parameters:
enc_out – Encoder output sequences. (B, T, D_enc)
target – Target character ID sequences. (B, U)
t_len – Time lengths. (B,)
u_len – Label lengths. (B,)
- Returns:
CTC loss value.
-
compute_lm_loss
(dec_out: torch.Tensor, target: torch.Tensor) → torch.Tensor[source]¶ Forward LM loss.
- Parameters:
dec_out – Decoder output sequences. (B, U, D_dec)
target – Target label ID sequences. (B, U)
- Returns:
LM loss value.
-
compute_transducer_loss
(enc_out: torch.Tensor, dec_out: torch._VariableFunctionsClass.tensor, target: torch.Tensor, t_len: torch.Tensor, u_len: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute Transducer loss.
- Parameters:
enc_out – Encoder output sequences. (B, T, D_enc)
dec_out – Decoder output sequences. (B, U, D_dec)
target – Target label ID sequences. (B, L)
t_len – Time lengths. (B,)
u_len – Label lengths. (B,)
- Returns:
Joint output sequences. (B, T, U, D_joint), Transducer loss value.
- Return type:
(joint_out, loss_trans)
-
forward
(enc_out: torch.Tensor, aux_enc_out: List[torch.Tensor], dec_out: torch.Tensor, labels: torch.Tensor, enc_out_len: torch.Tensor, aux_enc_out_len: torch.Tensor) → Tuple[Tuple[Any], float, float][source]¶ Forward main and auxiliary task.
- Parameters:
enc_out – Encoder output sequences. (B, T, D_enc)
aux_enc_out – Encoder intermediate output sequences. (B, T_aux, D_enc_aux)
dec_out – Decoder output sequences. (B, U, D_dec)
labels – Label ID sequences. (B, L)
enc_out_len – Time lengths. (B,)
aux_enc_out_len – Auxiliary time lengths. (B,)
u_len – Label lengths. (B,)
- Returns:
- Weighted losses.
(transducer loss, ctc loss, aux Transducer loss, KL div loss, LM loss)
cer: Sentence-level CER score. wer: Sentence-level WER score.
-
get_target
()[source]¶ Get target label ID sequences.
- Returns:
Target label ID sequences. (B, L)
- Return type:
target
-
get_transducer_tasks_io
(labels: torch.Tensor, enc_out_len: torch.Tensor, aux_enc_out_len: Optional[List]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Get Transducer tasks inputs and outputs.
- Parameters:
labels – Label ID sequences. (B, U)
enc_out_len – Time lengths. (B,)
aux_enc_out_len – Auxiliary time lengths. [N x (B,)]
- Returns:
Target label ID sequences. (B, L) lm_loss_target: LM loss target label ID sequences. (B, U) t_len: Time lengths. (B,) aux_t_len: Auxiliary time lengths. [N x (B,)] u_len: Label lengths. (B,)
- Return type:
target
espnet.nets.pytorch_backend.transducer.utils¶
Utility functions for Transducer models.
-
espnet.nets.pytorch_backend.transducer.utils.
check_batch_states
(states, max_len, pad_id)[source]¶ Check decoder hidden states and left pad or trim if necessary.
- Parameters:
states – Decoder hidden states. [N x (B, ?, D_dec)]
max_len – Maximum sequence length.
pad_id – Padding symbol ID.
- Returns:
Decoder hidden states. [N x (B, max_len, D_dec)]
- Return type:
final
-
espnet.nets.pytorch_backend.transducer.utils.
check_state
(state: List[Optional[torch.Tensor]], max_len: int, pad_id: int) → List[Optional[torch.Tensor]][source]¶ Check decoder hidden states and left pad or trim if necessary.
- Parameters:
state – Decoder hidden states. [N x (?, D_dec)]
max_len – Maximum sequence length.
pad_id – Padding symbol ID.
- Returns:
Decoder hidden states. [N x (1, max_len, D_dec)]
- Return type:
final
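The "left pad or trim" behavior can be illustrated with plain Python lists standing in for tensors. This is a sketch under stated assumptions, not the library code: `check_state_sketch` is a hypothetical name, each state is modeled as a list of per-step vectors, and trimming is assumed to drop the oldest (leftmost) steps.

```python
from typing import List


def check_state_sketch(
    state: List[List[List[float]]], max_len: int, pad_id: float
) -> List[List[List[float]]]:
    """Left pad or trim each of the N sequences in `state` to `max_len` steps.

    Each sequence is a list of per-step vectors (hypothetical stand-in
    for a (?, D_dec) tensor).
    """
    if not state or max_len < 1:
        return state

    fixed = []
    for seq in state:
        if len(seq) > max_len:
            # Assumption: keep the most recent steps, drop the oldest ones.
            fixed.append(seq[len(seq) - max_len:])
        else:
            dim = len(seq[0]) if seq else 0
            pad = [[pad_id] * dim for _ in range(max_len - len(seq))]
            fixed.append(pad + seq)
    return fixed


padded = check_state_sketch([[[1.0, 1.0]]], max_len=2, pad_id=0.0)
trimmed = check_state_sketch([[[2.0], [3.0], [4.0]]], max_len=2, pad_id=0.0)
```

Here `padded` gains one leading padding vector, while `trimmed` keeps only its last two steps.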
-
espnet.nets.pytorch_backend.transducer.utils.
create_lm_batch_states
(lm_states: Union[List[Any], Dict[str, Any]], lm_layers, is_wordlm: bool) → Union[List[Any], Dict[str, Any]][source]¶ Create LM hidden states.
- Parameters:
lm_states – LM hidden states.
lm_layers – Number of LM layers.
is_wordlm – Whether provided LM is a word-level LM.
- Returns:
LM hidden states.
- Return type:
new_states
-
espnet.nets.pytorch_backend.transducer.utils.
custom_torch_load
(model_path: str, model: torch.nn.modules.module.Module, training: bool = True)[source]¶ Load Transducer model with training-only modules and parameters removed.
- Parameters:
model_path – Model path.
model – Transducer model.
-
espnet.nets.pytorch_backend.transducer.utils.
get_decoder_input
(labels: torch.Tensor, blank_id: int, ignore_id: int) → torch.Tensor[source]¶ Prepare decoder input.
- Parameters:
labels – Label ID sequences. (B, L)
- Returns:
Label ID sequences with blank prefix. (B, U)
- Return type:
decoder_input
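A minimal sketch of what "label ID sequences with blank prefix" can look like, using nested lists instead of tensors. The name `decoder_input_sketch` is hypothetical, and two details are assumptions rather than documented behavior: entries equal to `ignore_id` are treated as padding and removed, and shorter rows are right-padded with `blank_id`.

```python
from typing import List


def decoder_input_sketch(
    labels: List[List[int]], blank_id: int, ignore_id: int
) -> List[List[int]]:
    """Prefix each label sequence with the blank symbol.

    (B, L) nested list in, (B, U) nested list out, where U = longest
    unpadded length + 1.
    """
    # Assumption: ignore_id marks padding positions, so drop them first.
    unpadded = [[t for t in seq if t != ignore_id] for seq in labels]
    with_blank = [[blank_id] + seq for seq in unpadded]

    # Assumption: rows are right-padded with blank_id to a common length.
    max_len = max(len(seq) for seq in with_blank)
    return [seq + [blank_id] * (max_len - len(seq)) for seq in with_blank]


decoder_input = decoder_input_sketch(
    [[3, 4, -1], [5, 6, 7]], blank_id=0, ignore_id=-1
)
```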
-
espnet.nets.pytorch_backend.transducer.utils.
init_lm_state
(lm_model: torch.nn.modules.module.Module)[source]¶ Initialize LM hidden states.
- Parameters:
lm_model – LM module.
- Returns:
Initial LM hidden states.
- Return type:
lm_state
-
espnet.nets.pytorch_backend.transducer.utils.
is_prefix
(x: List[int], pref: List[int]) → bool[source]¶ Check if pref is a prefix of x.
- Parameters:
x – Label ID sequence.
pref – Prefix label ID sequence.
- Returns:
Whether pref is a prefix of x.
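The check can be sketched in a few lines of plain Python. `is_prefix_sketch` is a hypothetical name, and the treatment of equal-length sequences is an assumption: this sketch only accepts prefixes that are strictly shorter than `x`, so consult the library source for the exact convention.

```python
from typing import List


def is_prefix_sketch(x: List[int], pref: List[int]) -> bool:
    """Return True if `pref` is a strictly shorter prefix of `x`."""
    # Assumption: a sequence is not counted as a prefix of itself.
    if len(pref) >= len(x):
        return False
    return x[: len(pref)] == pref


shorter = is_prefix_sketch([1, 2, 3], [1, 2])
suffix = is_prefix_sketch([1, 2, 3], [2, 3])
same = is_prefix_sketch([1, 2], [1, 2])
```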
-
espnet.nets.pytorch_backend.transducer.utils.
pad_sequence
(labels: List[int], pad_id: int) → List[int][source]¶ Left pad label ID sequences.
- Parameters:
labels – Label ID sequence.
pad_id – Padding symbol ID.
- Returns:
Padded label ID sequences.
- Return type:
final
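Left padding can be sketched as follows. The helper name `left_pad_sequences` is hypothetical, and the input is assumed to be a list of label ID sequences (one list per hypothesis), padded to the longest sequence in the batch.

```python
from typing import List


def left_pad_sequences(labels: List[List[int]], pad_id: int) -> List[List[int]]:
    """Left pad every sequence with `pad_id` up to the longest length."""
    max_len = max(len(seq) for seq in labels)
    return [[pad_id] * (max_len - len(seq)) + seq for seq in labels]


padded_labels = left_pad_sequences([[5, 6, 7], [8]], pad_id=0)
```

Padding on the left keeps the most recent labels aligned at the end of every row.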
-
espnet.nets.pytorch_backend.transducer.utils.
recombine_hyps
(hyps: List[espnet.nets.transducer_decoder_interface.Hypothesis]) → List[espnet.nets.transducer_decoder_interface.Hypothesis][source]¶ Recombine hypotheses with same label ID sequence.
- Parameters:
hyps – Hypotheses.
- Returns:
Recombined hypotheses.
- Return type:
final
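One way to picture recombination: hypotheses that share a label ID sequence are merged into one entry whose score accumulates their probability mass. The sketch below is an illustration only. `Hyp` is a hypothetical stand-in for `Hypothesis`, scores are assumed to be log-probabilities, and combining them with log-add-exp is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hyp:
    """Hypothetical, reduced stand-in for a decoding hypothesis."""

    yseq: List[int]
    score: float  # log-probability


def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def recombine_sketch(hyps: List[Hyp]) -> List[Hyp]:
    """Merge hypotheses with identical label sequences, summing probabilities."""
    merged: Dict[Tuple[int, ...], Hyp] = {}
    for hyp in hyps:
        key = tuple(hyp.yseq)
        if key in merged:
            merged[key].score = logaddexp(merged[key].score, hyp.score)
        else:
            merged[key] = Hyp(list(hyp.yseq), hyp.score)
    return list(merged.values())


hyps = [
    Hyp([0, 1], math.log(0.2)),
    Hyp([0, 2], math.log(0.1)),
    Hyp([0, 1], math.log(0.3)),
]
recombined = recombine_sketch(hyps)
```

The two `[0, 1]` hypotheses collapse into one entry carrying the combined probability of both.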
-
espnet.nets.pytorch_backend.transducer.utils.
select_k_expansions
(hyps: List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis], topk_idxs: torch.Tensor, topk_logps: torch.Tensor, gamma: float) → List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis][source]¶ Return K hypothesis candidates for expansion from a list of hypotheses.
K candidates are selected according to the extended hypotheses' probabilities and a prune-by-value method, where K is equal to beam_size + beta.
- Parameters:
hyps – Hypotheses.
topk_idxs – Indices of candidate hypotheses.
topk_logps – Log-probabilities for hypotheses expansions.
gamma – Allowed logp difference for prune-by-value method.
- Returns:
Best K expansion hypotheses candidates.
- Return type:
k_expansions
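The prune-by-value step can be sketched with plain lists. This is an illustration under assumptions, not the library implementation: `select_k_expansions_sketch` is a hypothetical name, hypotheses are reduced to their scores, and a candidate is kept when its total log-probability is within `gamma` of the best candidate for the same hypothesis.

```python
from typing import List, Tuple


def select_k_expansions_sketch(
    hyp_scores: List[float],
    topk_idxs: List[List[int]],
    topk_logps: List[List[float]],
    gamma: float,
) -> List[List[Tuple[int, float]]]:
    """For each hypothesis, keep the expansions within `gamma` of its best one."""
    k_expansions = []
    for score, idxs, logps in zip(hyp_scores, topk_idxs, topk_logps):
        # Total score of each candidate = hypothesis score + expansion log-prob.
        candidates = [(k, score + v) for k, v in zip(idxs, logps)]
        best = max(s for _, s in candidates)
        kept = [c for c in candidates if c[1] >= best - gamma]
        k_expansions.append(sorted(kept, key=lambda c: c[1], reverse=True))
    return k_expansions


expansions = select_k_expansions_sketch(
    hyp_scores=[-1.0],
    topk_idxs=[[4, 7, 9]],
    topk_logps=[[-0.5, -0.25, -4.0]],
    gamma=2.0,
)
```

In this example label 9 is pruned because its total score of -5.0 falls more than `gamma` below the best total of -1.25.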
-
espnet.nets.pytorch_backend.transducer.utils.
select_lm_state
(lm_states: Union[List[Any], Dict[str, Any]], idx: int, lm_layers: int, is_wordlm: bool) → Union[List[Any], Dict[str, Any]][source]¶ Get ID state from LM hidden states.
- Parameters:
lm_states – LM hidden states.
idx – LM state ID to extract.
lm_layers – Number of LM layers.
is_wordlm – Whether provided LM is a word-level LM.
- Returns:
LM hidden state for given ID.
- Return type:
idx_state
-
espnet.nets.pytorch_backend.transducer.utils.
subtract
(x: List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis], subset: List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis]) → List[espnet.nets.transducer_decoder_interface.ExtendedHypothesis][source]¶ Remove elements of subset if corresponding label ID sequence already exists in x.
- Parameters:
x – Set of hypotheses.
subset – Subset of x.
- Returns:
New set of hypotheses.
- Return type:
final
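One reading of this description is a set difference keyed on label ID sequences: keep the hypotheses of `x` whose label sequence does not appear in `subset`. That reading is an assumption, and the sketch below reduces each hypothesis to its label sequence; `subtract_sketch` is a hypothetical name.

```python
from typing import List


def subtract_sketch(x: List[List[int]], subset: List[List[int]]) -> List[List[int]]:
    """Keep the label sequences of `x` that do not occur in `subset`."""
    seen = {tuple(seq) for seq in subset}
    return [seq for seq in x if tuple(seq) not in seen]


remaining = subtract_sketch([[0, 1], [0, 2], [0, 3]], [[0, 2]])
```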
-
espnet.nets.pytorch_backend.transducer.utils.
valid_aux_encoder_output_layers
(aux_layer_id: List[int], enc_num_layers: int, use_symm_kl_div_loss: bool, subsample: List[int]) → List[int][source]¶ Check whether provided auxiliary encoder layer IDs are valid.
Return the valid list sorted with duplicates removed.
- Parameters:
aux_layer_id – Auxiliary encoder layer IDs.
enc_num_layers – Number of encoder layers.
use_symm_kl_div_loss – Whether symmetric KL divergence loss is used.
subsample – Subsampling rate per layer.
- Returns:
Valid list of auxiliary encoder layers.
- Return type:
valid
espnet.nets.pytorch_backend.transducer.initializer¶
Parameter initialization for Transducer model.
espnet.nets.pytorch_backend.transducer.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.transducer.joint_network¶
Transducer joint network implementation.
-
class
espnet.nets.pytorch_backend.transducer.joint_network.
JointNetwork
(joint_output_size: int, encoder_output_size: int, decoder_output_size: int, joint_space_size: int, joint_activation_type: int)[source]¶ Bases:
torch.nn.modules.module.Module
Transducer joint network module.
- Parameters:
joint_output_size – Joint network output dimension.
encoder_output_size – Encoder output dimension.
decoder_output_size – Decoder output dimension.
joint_space_size – Dimension of joint space.
joint_activation_type – Type of activation for joint network.
Joint network initializer.
-
forward
(enc_out: torch.Tensor, dec_out: torch.Tensor, is_aux: bool = False, quantization: bool = False) → torch.Tensor[source]¶ Joint computation of encoder and decoder hidden state sequences.
- Parameters:
enc_out – Expanded encoder output state sequences (B, T, 1, D_enc)
dec_out – Expanded decoder output state sequences (B, 1, U, D_dec)
is_aux – Whether auxiliary tasks are used.
quantization – Whether dynamic quantization is used.
- Returns:
Joint output state sequences. (B, T, U, D_out)
- Return type:
joint_out
espnet.nets.pytorch_backend.transducer.conv1d_nets¶
Convolution networks definition for custom architecture.
-
class
espnet.nets.pytorch_backend.transducer.conv1d_nets.
CausalConv1d
(idim: int, odim: int, kernel_size: int, stride: int = 1, dilation: int = 1, groups: int = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
1D causal convolution module for custom decoder.
- Parameters:
idim – Input dimension.
odim – Output dimension.
kernel_size – Size of the convolving kernel.
stride – Stride of the convolution.
dilation – Spacing between the kernel points.
groups – Number of blocked connections from input channels to output channels.
bias – Whether to add a learnable bias to the output.
batch_norm – Whether to apply batch normalization.
relu – Whether to pass final output through ReLU activation.
dropout_rate – Dropout rate.
Construct a CausalConv1d object.
-
forward
(sequence: torch.Tensor, mask: torch.Tensor, cache: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward CausalConv1d for custom decoder.
- Parameters:
sequence – CausalConv1d input sequences. (B, U, D_in)
mask – Mask of CausalConv1d input sequences. (B, 1, U)
- Returns:
CausalConv1d output sequences. (B, sub(U), D_out) mask: Mask of CausalConv1d output sequences. (B, 1, sub(U))
- Return type:
sequence
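To make "causal" concrete: each output step may depend only on the current and earlier input steps. The sketch below shows this for a single channel with stride 1. It is not the module's implementation; the function name is hypothetical, and left zero-padding by `(kernel_size - 1) * dilation` is the usual construction, assumed here.

```python
from typing import List


def causal_conv1d_sketch(
    seq: List[float], kernel: List[float], dilation: int = 1
) -> List[float]:
    """Single-channel causal convolution with left zero-padding."""
    k_size = len(kernel)
    pad = (k_size - 1) * dilation
    # Pad only on the left so output[t] never sees inputs after step t.
    padded = [0.0] * pad + list(seq)
    return [
        sum(kernel[k] * padded[t + k * dilation] for k in range(k_size))
        for t in range(len(seq))
    ]


out = causal_conv1d_sketch([1.0, 2.0, 3.0, 4.0], kernel=[0.5, 0.5])
# Changing only the last input must leave earlier outputs untouched.
out_changed = causal_conv1d_sketch([1.0, 2.0, 3.0, 99.0], kernel=[0.5, 0.5])
```

The output has the same length as the input, and `out[:3]` equals `out_changed[:3]` because the modified step lies in the future of those positions.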
-
class
espnet.nets.pytorch_backend.transducer.conv1d_nets.
Conv1d
(idim: int, odim: int, kernel_size: Union[int, Tuple], stride: Union[int, Tuple] = 1, dilation: Union[int, Tuple] = 1, groups: Union[int, Tuple] = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
1D convolution module for custom encoder.
- Parameters:
idim – Input dimension.
odim – Output dimension.
kernel_size – Size of the convolving kernel.
stride – Stride of the convolution.
dilation – Spacing between the kernel points.
groups – Number of blocked connections from input channels to output channels.
bias – Whether to add a learnable bias to the output.
batch_norm – Whether to use batch normalization after convolution.
relu – Whether to use a ReLU activation after convolution.
dropout_rate – Dropout rate.
Construct a Conv1d module object.
-
create_new_mask
(mask: torch.Tensor) → torch.Tensor[source]¶ Create new mask.
- Parameters:
mask – Mask of input sequences. (B, 1, T)
- Returns:
Mask of output sequences. (B, 1, sub(T))
- Return type:
mask
-
create_new_pos_embed
(pos_embed: torch.Tensor) → torch.Tensor[source]¶ Create new positional embedding vector.
- Parameters:
pos_embed – Input sequences positional embedding. (B, 2 * (T - 1), D_att)
- Returns:
- Output sequences positional embedding.
(B, 2 * (sub(T) - 1), D_att)
- Return type:
pos_embed
-
forward
(sequence: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], mask: torch.Tensor) → Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor][source]¶ Forward ConvEncoderLayer module object.
- Parameters:
sequence –
Input sequences. (B, T, D_in)
or (B, T, D_in), (B, 2 * (T - 1), D_att)
mask – Mask of input sequences. (B, 1, T)
- Returns:
- Output sequences.
(B, sub(T), D_out)
or (B, sub(T), D_out), (B, 2 * (sub(T) - 1), D_att)
mask: Mask of output sequences. (B, 1, sub(T))
- Return type:
sequence
espnet.nets.pytorch_backend.lm.default¶
Default Recurrent Neural Network Language Model in lm_train.py.
-
class
espnet.nets.pytorch_backend.lm.default.
ClassifierWithState
(predictor, lossfun=CrossEntropyLoss(), label_key=-1)[source]¶ Bases:
torch.nn.modules.module.Module
A wrapper for pytorch RNNLM.
Initialize class.
- Parameters:
predictor (torch.nn.Module) – The RNNLM
lossfun (function) – The loss function to use
label_key (int/str) –
-
final
(state, index=None)[source]¶ Predict final log probabilities for given state using the predictor.
- Parameters:
state – The state
- Returns:
The final log probabilities
- Return type:
torch.Tensor
-
forward
(state, *args, **kwargs)[source]¶ Compute the loss value for an input and label pair.
Notes
It also computes accuracy and stores it to the attribute. When label_key is int, the corresponding element in args is treated as ground truth labels. And when it is str, the element in kwargs is used. All elements of args and kwargs except the ground truth labels are features. It feeds features to the predictor and compares the result with the ground truth labels.
- Parameters:
state (torch.Tensor) – The LM state
args (list[torch.Tensor]) – Input minibatch
kwargs (dict[torch.Tensor]) – Input minibatch
- Returns:
loss value
- Return type:
torch.Tensor
-
-
class
espnet.nets.pytorch_backend.lm.default.
DefaultRNNLM
(n_vocab, args)[source]¶ Bases:
espnet.nets.scorer_interface.BatchScorerInterface
,espnet.nets.lm_interface.LMInterface
,torch.nn.modules.module.Module
Default RNNLM for LMInterface Implementation.
Note
PyTorch seems to leak memory when a single GPU computes this after data parallel. If parallel GPUs compute this, it seems to be fine. See also https://github.com/espnet/espnet/issues/1075
Initialize class.
- Parameters:
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – Configurations. See add_arguments().
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
final_score
(state)[source]¶ Score eos.
- Parameters:
state – Scorer state for prefix tokens
- Returns:
final score
- Return type:
float
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters:
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns:
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type:
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
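As a small worked illustration of how the last two return values combine, the sketch below computes perplexity from a negative log-likelihood and a token count. The helper name and the numbers are hypothetical; only the relation exp(-log p(t) / n) comes from the note above.

```python
import math


def perplexity(nll: float, count: int) -> float:
    """Perplexity from nll = -log p(t) and the number of tokens n."""
    return math.exp(nll / count)


# Hypothetical batch: 4 tokens, each predicted with probability 0.25.
nll = -4 * math.log(0.25)
ppl = perplexity(nll, 4)
```

A model that assigns probability 0.25 to every token is as uncertain as a uniform choice among four symbols, so `ppl` comes out at (approximately) 4.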
-
score
(y, state, x)[source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns:
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
-
class
espnet.nets.pytorch_backend.lm.default.
RNNLM
(n_vocab, n_layers, n_units, n_embed=None, typ='lstm', dropout_rate=0.5, emb_dropout_rate=0.0, tie_weights=False)[source]¶ Bases:
torch.nn.modules.module.Module
A pytorch RNNLM.
Initialize class.
- Parameters:
n_vocab (int) – The size of the vocabulary
n_layers (int) – The number of layers to create
n_units (int) – The number of units per layer
typ (str) – The RNN type
espnet.nets.pytorch_backend.lm.seq_rnn¶
Sequential implementation of Recurrent Neural Network Language Model.
-
class
espnet.nets.pytorch_backend.lm.seq_rnn.
SequentialRNNLM
(n_vocab, args)[source]¶ Bases:
espnet.nets.lm_interface.LMInterface
,torch.nn.modules.module.Module
Sequential RNNLM.
Initialize class.
- Parameters:
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – Configurations. See add_arguments().
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters:
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns:
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type:
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
-
init_state
(x)[source]¶ Get an initial state for decoding.
- Parameters:
x (torch.Tensor) – The encoded feature tensor
- Returns:
initial state
-
score
(y, state, x)[source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns:
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.lm.transformer¶
Transformer language model.
-
class
espnet.nets.pytorch_backend.lm.transformer.
TransformerLM
(n_vocab, args)[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.lm_interface.LMInterface
,espnet.nets.scorer_interface.BatchScorerInterface
Transformer language model.
Initialize class.
- Parameters:
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – Configurations. See add_arguments().
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch (required).
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
- Tuple of
batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
-
forward
(x: torch.Tensor, t: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Compute LM loss value from buffer sequences.
- Parameters:
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns:
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type:
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
-
score
(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token.
- Parameters:
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – encoder feature that generates ys.
- Returns:
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type:
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.lm.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.frontends.feature_transform¶
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
FeatureTransform
(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = 0.0, fmax: float = None, stats_file: str = None, apply_uttmvn: bool = True, uttmvn_norm_means: bool = True, uttmvn_norm_vars: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
GlobalMVN
(stats_file: str, norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]¶ Bases:
torch.nn.modules.module.Module
Apply global mean and variance normalization
- Parameters:
stats_file (str) – npy file of a 1-dim array, or a text file. The elements from the first to the {(len(array) - 1) / 2}th are treated as the sum of features, the rest excluding the last element are treated as the sum of the squared features, and the last element equals the number of samples.
std_floor (float) –
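The statistics layout described for stats_file can be unpacked as sketched below, using a plain list instead of a NumPy array. `stats_to_mean_std` is a hypothetical helper, and the way `eps` floors the variance is an assumption rather than documented behavior.

```python
import math
from typing import List, Tuple


def stats_to_mean_std(
    stats: List[float], eps: float = 1e-20
) -> Tuple[List[float], List[float]]:
    """Split [sum(x), sum(x**2), count] into per-dimension mean and std."""
    count = stats[-1]
    dim = (len(stats) - 1) // 2
    sums = stats[:dim]
    sq_sums = stats[dim:2 * dim]

    mean = [s / count for s in sums]
    # var = E[x^2] - E[x]^2, floored at eps (assumption) to avoid sqrt of <= 0.
    var = [sq / count - m * m for sq, m in zip(sq_sums, mean)]
    std = [math.sqrt(max(v, eps)) for v in var]
    return mean, std


# Two feature dimensions, two samples (1, 2) and (3, 6):
# sums = [4, 8], squared sums = [10, 40], count = 2.
mean, std = stats_to_mean_std([4.0, 8.0, 10.0, 40.0, 2.0])
```

For this input the per-dimension means are 2 and 4, and the standard deviations are 1 and 2.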
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
LogMel
(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = 0.0, fmax: float = None, htk: bool = False, norm=1)[source]¶ Bases:
torch.nn.modules.module.Module
Convert STFT to fbank feats
The arguments are the same as librosa.filters.mel
- Parameters:
fs – number > 0 [scalar] sampling rate of the incoming signal
n_fft – int > 0 [scalar] number of FFT components
n_mels – int > 0 [scalar] number of Mel bands to generate
fmin – float >= 0 [scalar] lowest frequency (in Hz)
fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0
htk – use HTK formula instead of Slaney
norm – {None, 1, np.inf} [scalar] if 1, divide the triangular mel weights by the width of the mel band (area normalization). Otherwise, leave all the triangles aiming for a peak value of 1.0
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(feat: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
UtteranceMVN
(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]¶ Bases:
torch.nn.modules.module.Module
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet.nets.pytorch_backend.frontends.feature_transform.
utterance_mvn
(x: torch.Tensor, ilens: torch.LongTensor, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Apply utterance mean and variance normalization
- Parameters:
x – (B, T, D), assumed zero padded
ilens – (B,)
norm_means –
norm_vars –
eps –
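Utterance-level normalization can be sketched with nested lists in place of tensors. This is an illustration under assumptions: the function name is hypothetical, `ilens` is taken to hold one valid length per utterance, statistics are computed over the valid frames only, and padded frames are kept at zero.

```python
from typing import List, Tuple

Batch = List[List[List[float]]]  # (B, T, D) as nested lists


def utterance_mvn_sketch(
    x: Batch,
    ilens: List[int],
    norm_means: bool = True,
    norm_vars: bool = False,
    eps: float = 1e-20,
) -> Tuple[Batch, List[int]]:
    """Normalize each utterance by its own mean (and optionally variance)."""
    out = []
    for utt, length in zip(x, ilens):
        dim = len(utt[0])
        valid = utt[:length]  # ignore zero-padded frames in the statistics
        mean = [sum(f[d] for f in valid) / length for d in range(dim)]
        var = [sum((f[d] - mean[d]) ** 2 for f in valid) / length for d in range(dim)]
        std = [max(v ** 0.5, eps) for v in var]

        normed = []
        for t, frame in enumerate(utt):
            if t >= length:
                normed.append([0.0] * dim)  # keep padding at zero
                continue
            row = []
            for d in range(dim):
                value = frame[d] - mean[d] if norm_means else frame[d]
                if norm_vars:
                    value /= std[d]
                row.append(value)
            normed.append(row)
        out.append(normed)
    return out, ilens


normed_x, out_lens = utterance_mvn_sketch([[[1.0], [3.0], [0.0]]], ilens=[2])
```

With a valid length of 2, the mean is taken over the frames 1.0 and 3.0 only, and the third (padded) frame stays at zero.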
espnet.nets.pytorch_backend.frontends.beamformer¶
-
espnet.nets.pytorch_backend.frontends.beamformer.
apply_beamforming_vector
(beamform_vector: torch_complex.tensor.ComplexTensor, mix: torch_complex.tensor.ComplexTensor) → torch_complex.tensor.ComplexTensor[source]¶
-
espnet.nets.pytorch_backend.frontends.beamformer.
get_mvdr_vector
(psd_s: torch_complex.tensor.ComplexTensor, psd_n: torch_complex.tensor.ComplexTensor, reference_vector: torch.Tensor, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector:
h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
- Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
- Parameters:
psd_s (ComplexTensor) – (…, F, C, C)
psd_n (ComplexTensor) – (…, F, C, C)
reference_vector (torch.Tensor) – (…, C)
eps (float) –
- Returns:
(…, F, C)
- Return type:
beamform_vector (ComplexTensor)
-
espnet.nets.pytorch_backend.frontends.beamformer.
get_power_spectral_density_matrix
(xs: torch_complex.tensor.ComplexTensor, mask: torch.Tensor, normalization=True, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]¶ Return cross-channel power spectral density (PSD) matrix
- Parameters:
xs (ComplexTensor) – (…, F, C, T)
mask (torch.Tensor) – (…, F, C, T)
normalization (bool) –
eps (float) –
- Returns:
(…, F, C, C)
- Return type:
psd (ComplexTensor)
espnet.nets.pytorch_backend.frontends.dnn_wpe¶
-
class
espnet.nets.pytorch_backend.frontends.dnn_wpe.
DNN_WPE
(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, iterations: int = 1, normalization: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]¶ The forward function
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq or Some dimension of the feature vector
- Parameters:
data – (B, C, T, F)
ilens – (B,)
- Returns:
data – (B, C, T, F)
ilens – (B,)
-
espnet.nets.pytorch_backend.frontends.frontend¶
-
class
espnet.nets.pytorch_backend.frontends.frontend.
Frontend
(idim: int, use_wpe: bool = False, wtype: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, use_beamformer: bool = False, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, bnmask: int = 2, badim: int = 320, ref_channel: int = -1, bdropout_rate=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, Optional[torch_complex.tensor.ComplexTensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet.nets.pytorch_backend.frontends.mask_estimator¶
-
class
espnet.nets.pytorch_backend.frontends.mask_estimator.
MaskEstimator
(type, idim, layers, units, projs, dropout, nmask=1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(xs: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ The forward function
- Parameters:
xs – (B, F, C, T)
ilens – (B,)
- Returns:
hs (torch.Tensor) – The hidden vector (B, F, C, T)
masks – A tuple of the masks. (B, F, C, T)
ilens – (B,)
-
espnet.nets.pytorch_backend.frontends.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.frontends.dnn_beamformer¶
DNN beamformer module.
-
class
espnet.nets.pytorch_backend.frontends.dnn_beamformer.
AttentionReference
(bidim, att_dim)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(psd_in: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]¶ The forward function
- Parameters:
psd_in (ComplexTensor) – (B, F, C, C)
ilens (torch.Tensor) – (B,)
scaling (float) –
- Returns:
u (torch.Tensor) – (B, C)
ilens (torch.Tensor) – (B,)
-
-
class
espnet.nets.pytorch_backend.frontends.dnn_beamformer.
DNN_Beamformer
(bidim, btype='blstmp', blayers=3, bunits=300, bprojs=320, bnmask=2, dropout_rate=0.0, badim=320, ref_channel: int = -1, beamformer_type='mvdr')[source]¶ Bases:
torch.nn.modules.module.Module
DNN mask based Beamformer
- Citation:
Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; https://arxiv.org/abs/1703.04783
-
forward
(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]¶ The forward function
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq
- Parameters:
data (ComplexTensor) – (B, T, C, F)
ilens (torch.Tensor) – (B,)
- Returns:
enhanced (ComplexTensor) – (B, T, F)
ilens (torch.Tensor) – (B,)
espnet.nets.pytorch_backend.rnn.attentions¶
Attention modules for RNN.
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttAdd
(eprojs, dunits, att_dim, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Additive attention
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttAdd forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B x T_max)
- Return type:
torch.Tensor
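As a rough illustration of the AttAdd interface, additive attention can be sketched in NumPy as below. This is not the ESPnet module: the projection matrices `w_enc`, `w_dec` and the vector `gvec` are hypothetical stand-ins for learned parameters, and the padding mask derived from `enc_hs_len` is an assumption about how padded frames are excluded.

```python
import numpy as np


def additive_attention(enc_hs_pad, enc_hs_len, dec_z, w_enc, w_dec, gvec, scaling=2.0):
    """enc_hs_pad (B, T_max, D_enc), dec_z (B, D_dec) -> context (B, D_enc), weights (B, T_max)."""
    # Energy e[b, t] = gvec . tanh(W_enc h[b, t] + W_dec z[b])
    hidden = np.tanh(enc_hs_pad @ w_enc + (dec_z @ w_dec)[:, None, :])
    e = hidden @ gvec  # (B, T_max)
    # Padded frames get -inf so that they receive zero weight after the softmax.
    t_idx = np.arange(enc_hs_pad.shape[1])[None, :]
    e = np.where(t_idx < np.asarray(enc_hs_len)[:, None], e, -np.inf)
    e = scaling * e
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    # Context vector: attention-weighted sum of encoder states.
    c = np.einsum("bt,btd->bd", w, enc_hs_pad)
    return c, w


rng = np.random.default_rng(0)
B, T_max, D_enc, D_dec, att_dim = 2, 6, 4, 3, 5
enc_hs_pad = rng.standard_normal((B, T_max, D_enc))
dec_z = rng.standard_normal((B, D_dec))
c, w = additive_attention(
    enc_hs_pad, [6, 4], dec_z,
    rng.standard_normal((D_enc, att_dim)),
    rng.standard_normal((D_dec, att_dim)),
    rng.standard_normal(att_dim),
)
```

The `scaling` factor multiplies the energies before the softmax, so larger values give sharper attention distributions.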
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttCov
(eprojs, dunits, att_dim, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Coverage mechanism attention
- Reference: Get To The Point: Summarization with Pointer-Generator Network
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]¶ AttCov forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_list (list) – list of previous attention weight
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weights
- Return type:
list
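The list-in, list-out interface of AttCov suggests that the coverage vector is accumulated from all previous attention weights. A minimal NumPy sketch of that bookkeeping, under that assumption, is given below. The function `coverage_attention_step` and the `energy_fn` callback are hypothetical; the real module computes its energies from learned projections.

```python
import numpy as np


def coverage_attention_step(energy_fn, att_prev_list, scaling=2.0):
    """att_prev_list: list of (B, T_max) arrays; returns (weights, updated list)."""
    # Coverage vector: how much attention each frame has received so far.
    cov_vec = np.sum(att_prev_list, axis=0)  # (B, T_max)
    e = scaling * energy_fn(cov_vec)  # (B, T_max)
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    # Return a new list so that the caller's history is not mutated.
    return w, att_prev_list + [w]


# Toy energy that simply penalizes frames that were already attended.
att_prev_list = [np.array([[1.0, 0.0, 0.0, 0.0]])]
w, att_list = coverage_attention_step(lambda cov: -cov, att_prev_list)
```

In this toy setup the frame that was fully attended in the previous step receives the smallest new weight, which is the behaviour a coverage term is meant to encourage.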
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttCovLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Coverage mechanism location aware attention
This attention is a combination of coverage and location-aware attentions.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]¶ AttCovLoc forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_list (list) – list of previous attention weight
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weights
- Return type:
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttDot
(eprojs, dunits, att_dim, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Dot product attention
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttDot forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – dummy (not used)
att_prev (torch.Tensor) – dummy (not used)
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weight (B x T_max)
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttForward
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Forward attention module.
Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=1.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]¶ Calculate AttForward forward propagation.
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – attention weights of previous step
scaling (float) – scaling parameter before applying softmax
last_attended_idx (int) – index of the last attended input
backward_window (int) – backward window size in attention constraint
forward_window (int) – forward window size in attention constraint
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B x T_max)
- Return type:
torch.Tensor
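The forward-attention recursion from the cited paper combines the previous weights with a copy shifted by one frame, so attention can only stay or advance. A minimal NumPy sketch of one step follows, assuming `w` already holds softmax-normalized content-based weights. The name `forward_attention_step` is hypothetical, and any clamping the actual module applies is omitted.

```python
import numpy as np


def forward_attention_step(att_prev, w):
    """att_prev, w: (B, T_max). Returns the normalized forward-attention weights."""
    # Previous weights shifted right by one frame (zero enters at t = 0).
    shifted = np.pad(att_prev, ((0, 0), (1, 0)))[:, :-1]
    # Attention may stay on the same frame or move forward by exactly one.
    w_new = (att_prev + shifted) * w
    denom = np.maximum(w_new.sum(axis=1, keepdims=True), 1e-12)
    return w_new / denom


att_prev = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])
w = np.full((1, 5), 0.2)
att_new = forward_attention_step(att_prev, w)
```

With the previous attention concentrated on frame 2 and uniform content weights, the new weights split evenly between frames 2 and 3.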
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttForwardTA
(eunits, dunits, att_dim, aconv_chans, aconv_filts, odim)[source]¶ Bases:
torch.nn.modules.module.Module
Forward attention with transition agent module.
Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
- Parameters:
eunits (int) – # units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
odim (int) – output dimension
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, out_prev, scaling=1.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]¶ Calculate AttForwardTA forward propagation.
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B, Tmax, eunits)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B, dunits)
att_prev (torch.Tensor) – attention weights of previous step
out_prev (torch.Tensor) – decoder outputs of previous step (B, odim)
scaling (float) – scaling parameter before applying softmax
last_attended_idx (int) – index of the last attended input
backward_window (int) – backward window size in attention constraint
forward_window (int) – forward window size in attention constraint
- Returns:
attention weighted encoder state (B, dunits)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B, Tmax)
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
location-aware attention module.
- Reference: Attention-Based Models for Speech Recognition
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]¶ Calculate AttLoc forward propagation.
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – previous attention weight (B x T_max)
scaling (float) – scaling parameter before applying softmax
last_attended_idx (int) – index of the last attended input
backward_window (int) – backward window size in attention constraint
forward_window (int) – forward window size in attention constraint
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B x T_max)
- Return type:
torch.Tensor
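The `last_attended_idx`, `backward_window`, and `forward_window` arguments describe a window constraint on the attention energies. The pure-Python sketch below shows one way such a constraint could work for a single utterance. The half-open interval [last_attended_idx - backward_window, last_attended_idx + forward_window) is an assumption, and `constrain_and_softmax` is a hypothetical helper, not an ESPnet function.

```python
import math


def constrain_and_softmax(e, last_attended_idx, backward_window=1, forward_window=3):
    """e: list of energies for one utterance; returns attention weights."""
    lo = last_attended_idx - backward_window  # first index that stays attendable
    hi = last_attended_idx + forward_window  # first index that is masked again
    # Energies outside the window are set to -inf, so their weight becomes zero.
    masked = [x if lo <= t < hi else -math.inf for t, x in enumerate(e)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [x / total for x in exps]


weights = constrain_and_softmax([0.0] * 10, last_attended_idx=4)
```

With ten equal energies and the last attended index at 4, only frames 3 to 6 keep non-zero weight, each receiving 0.25.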
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLoc2D
(eprojs, dunits, att_dim, att_win, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
2D location-aware attention
This attention is an extended version of location-aware attention. It takes into account not only the attention weights of the previous frame but also those of earlier frames.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
att_win (int) – attention window size (default=5)
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttLoc2D forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – previous attention weight (B x att_win x T_max)
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B x att_win x T_max)
- Return type:
torch.Tensor
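Because `att_prev` carries the last `att_win` sets of weights, each step has to slide that window forward. A minimal NumPy sketch of the update is shown below. The helper `roll_attention_window` is hypothetical, and dropping the oldest entry is an assumption about how the window is maintained.

```python
import numpy as np


def roll_attention_window(att_prev, w):
    """att_prev: (B, att_win, T_max), w: (B, T_max) -> updated (B, att_win, T_max)."""
    # Drop the oldest set of weights and append the newest one.
    return np.concatenate([att_prev[:, 1:], w[:, None, :]], axis=1)


B, att_win, T_max = 1, 3, 4
# Window holding the weights of steps 0, 1, 2 (filled with the step index for clarity).
att_prev = np.stack([np.full((B, T_max), float(i)) for i in range(att_win)], axis=1)
w = np.full((B, T_max), 3.0)
att_next = roll_attention_window(att_prev, w)
```

After the update the window holds steps 1, 2, and 3, and its shape is unchanged.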
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLocRec
(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
location-aware recurrent attention
This attention is an extended version of location-aware attention. By using an RNN, it takes the history of attention weights into account.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_states, scaling=2.0)[source]¶ AttLocRec forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_states (tuple) – previous attention weight and lstm states ((B, T_max), ((B, att_dim), (B, att_dim)))
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights and lstm states (w, (hx, cx)) ((B, T_max), ((B, att_dim), (B, att_dim)))
- Return type:
tuple
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadAdd
(eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head additive attention
- Reference: Attention is all you need
This attention is multi head attention using additive attention for each head.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_k and pre_compute_v
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadAdd forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weight (B x T_max) * aheads
- Return type:
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadDot
(eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head dot product attention
- Reference: Attention is all you need
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_k and pre_compute_v
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadDot forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
- Returns:
attention weighted encoder state (B x D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weight (B x T_max) * aheads
- Return type:
list
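To make the return values concrete (one context vector plus a list of `aheads` weight arrays), here is a NumPy sketch of multi-head scaled dot-product attention in the style of the cited reference. The per-head projection matrices and `w_o` are hypothetical stand-ins for learned parameters, the padding mask is omitted for brevity, and the actual module may apply extra non-linearities that are not shown here.

```python
import numpy as np


def multi_head_dot_attention(enc_hs_pad, dec_z, w_k, w_v, w_q, w_o):
    """w_k, w_v, w_q: per-head lists of projection matrices; w_o: output projection."""
    cs, ws = [], []
    for k_proj, v_proj, q_proj in zip(w_k, w_v, w_q):
        k = enc_hs_pad @ k_proj  # (B, T_max, att_dim_k)
        v = enc_hs_pad @ v_proj  # (B, T_max, att_dim_v)
        q = dec_z @ q_proj  # (B, att_dim_k)
        # Scaled dot product between the query and every key.
        e = np.einsum("bk,btk->bt", q, k) / np.sqrt(k.shape[-1])
        w = np.exp(e - e.max(axis=1, keepdims=True))
        w = w / w.sum(axis=1, keepdims=True)  # (B, T_max)
        cs.append(np.einsum("bt,btv->bv", w, v))  # (B, att_dim_v)
        ws.append(w)
    # Concatenate the head outputs and project back to D_enc.
    c = np.concatenate(cs, axis=1) @ w_o  # (B, D_enc)
    return c, ws


rng = np.random.default_rng(0)
B, T_max, D_enc, D_dec, aheads, dk, dv = 2, 5, 4, 3, 2, 6, 6
enc_hs_pad = rng.standard_normal((B, T_max, D_enc))
dec_z = rng.standard_normal((B, D_dec))
w_k = [rng.standard_normal((D_enc, dk)) for _ in range(aheads)]
w_v = [rng.standard_normal((D_enc, dv)) for _ in range(aheads)]
w_q = [rng.standard_normal((D_dec, dk)) for _ in range(aheads)]
w_o = rng.standard_normal((aheads * dv, D_enc))
c, ws = multi_head_dot_attention(enc_hs_pad, dec_z, w_k, w_v, w_q, w_o)
```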
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadLoc
(eprojs, dunits, aheads, att_dim_k, att_dim_v, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head location based attention
- Reference: Attention is all you need
This attention is multi head attention using location-aware attention for each head.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_k and pre_compute_v
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttMultiHeadLoc forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – list of previous attention weight (B x T_max) * aheads
scaling (float) – scaling parameter before applying softmax
- Returns:
attention weighted encoder state (B x D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weight (B x T_max) * aheads
- Return type:
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadMultiResLoc
(eprojs, dunits, aheads, att_dim_k, att_dim_v, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head multi resolution location based attention
- Reference: Attention is all you need
This attention is multi head attention using location-aware attention for each head. Furthermore, it uses different filter size for each head.
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
aconv_chans (int) – maximum # channels of attention convolution; each head uses #ch = aconv_chans * (head + 1) / aheads, e.g. aheads=4, aconv_chans=100 => filter size = 25, 50, 75, 100
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_k and pre_compute_v
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadMultiResLoc forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – list of previous attention weight (B x T_max) * aheads
- Returns:
attention weighted encoder state (B x D_enc)
- Return type:
torch.Tensor
- Returns:
list of previous attention weight (B x T_max) * aheads
- Return type:
list
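The per-head arithmetic in the `aconv_chans` description can be reproduced with a short helper. The function name `per_head_sizes` is hypothetical. Note that the parameter description labels the result both as a channel count and as a filter size, so this sketch only reproduces the numbers and takes no position on which quantity the module actually scales.

```python
def per_head_sizes(aconv_chans, aheads):
    """Evaluate aconv_chans * (head + 1) / aheads for each head, using integer division."""
    return [aconv_chans * (head + 1) // aheads for head in range(aheads)]


sizes = per_head_sizes(aconv_chans=100, aheads=4)
```

For aheads=4 and aconv_chans=100 this gives 25, 50, 75, 100, matching the example in the parameter list.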
-
class
espnet.nets.pytorch_backend.rnn.attentions.
GDCAttLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]¶ Bases:
torch.nn.modules.module.Module
Global duration control attention module.
Reference: Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis (https://arxiv.org/abs/2202.07907)
- Parameters:
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h
-
forward
(enc_hs_pad, enc_hs_len, trans_token, dec_z, att_prev, scaling=1.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]¶ Calculate AttLoc forward propagation.
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
trans_token (torch.Tensor) – global transition token for duration (B x T_max x 1)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – previous attention weight (B x T_max)
scaling (float) – scaling parameter before applying softmax
last_attended_idx (int) – index of the last attended input
backward_window (int) – backward window size in attention constraint
forward_window (int) – forward window size in attention constraint
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights (B x T_max)
- Return type:
torch.Tensor
-
-
class
espnet.nets.pytorch_backend.rnn.attentions.
NoAtt
[source]¶ Bases:
torch.nn.modules.module.Module
No attention
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ NoAtt forward
- Parameters:
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B, T_max, D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – dummy (not used)
att_prev (torch.Tensor) – dummy (not used)
- Returns:
attention weighted encoder state (B, D_enc)
- Return type:
torch.Tensor
- Returns:
previous attention weights
- Return type:
torch.Tensor
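This entry only says "No attention", and both `dec_z` and `att_prev` are unused. One natural reading, taken here as an assumption, is that every non-padded encoder frame gets the same weight, so the context vector is the mean of the valid frames. A NumPy sketch with the hypothetical name `uniform_attention`:

```python
import numpy as np


def uniform_attention(enc_hs_pad, enc_hs_len):
    """enc_hs_pad (B, T_max, D_enc) -> context (B, D_enc), weights (B, T_max)."""
    t_idx = np.arange(enc_hs_pad.shape[1])[None, :]
    lens = np.asarray(enc_hs_len, dtype=float)[:, None]
    # Weight 1 / length on valid frames, zero on padded frames.
    w = np.where(t_idx < lens, 1.0 / lens, 0.0)
    c = np.einsum("bt,btd->bd", w, enc_hs_pad)
    return c, w


enc_hs_pad = np.arange(12, dtype=float).reshape(2, 3, 2)
c, w = uniform_attention(enc_hs_pad, [3, 2])
```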
-
-
espnet.nets.pytorch_backend.rnn.attentions.
att_for
(args, num_att=1, han_mode=False)[source]¶ Instantiates an attention module given the program arguments
- Parameters:
args (Namespace) – The arguments
num_att (int) – number of attention modules (in multi-speaker case, it can be 2 or more)
han_mode (bool) – switch on/off mode of hierarchical attention network (HAN)
- Returns:
The attention module
- Return type:
torch.nn.Module
-
espnet.nets.pytorch_backend.rnn.attentions.
att_to_numpy
(att_ws, att)[source]¶ Converts attention weights to a numpy array given the attention
- Parameters:
att_ws (list) – The attention weights
att (torch.nn.Module) – The attention
- Return type:
np.ndarray
- Returns:
The numpy array of the attention weights
-
espnet.nets.pytorch_backend.rnn.attentions.
initial_att
(atype, eprojs, dunits, aheads, adim, awin, aconv_chans, aconv_filts, han_mode=False)[source]¶ Instantiates a single attention module
- Parameters:
atype (str) – attention type
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
adim (int) – attention dimension
awin (int) – attention window size
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
han_mode (bool) – flag to switch on hierarchical attention mode
- Returns:
The attention module
espnet.nets.pytorch_backend.rnn.decoders¶
RNN decoder module.
-
class
espnet.nets.pytorch_backend.rnn.decoders.
Decoder
(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0, dropout=0.0, context_residual=False, replace_sos=False, num_encs=1)[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
Decoder module
- Parameters:
eprojs (int) – encoder projection units
odim (int) – dimension of outputs
dtype (str) – gru or lstm
dlayers (int) – decoder layers
dunits (int) – decoder units
sos (int) – start of sequence symbol id
eos (int) – end of sequence symbol id
att (torch.nn.Module) – attention module
verbose (int) – verbose level
char_list (list) – list of character strings
labeldist (ndarray) – distribution of label smoothing
lsm_weight (float) – label smoothing weight
sampling_probability (float) – scheduled sampling probability
dropout (float) – dropout rate
context_residual (bool) – if True, use context vector for token generation
replace_sos (bool) – use for multilingual (speech/text) translation
-
calculate_all_attentions
(hs_pad, hlen, ys_pad, strm_idx=0, lang_ids=None)[source]¶ Calculate all attentions
- Parameters:
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D) [in multi-encoder case, list of torch.Tensor, [(B, Tmax_1, D), (B, Tmax_2, D), …]]
hlen (torch.Tensor) – batch of lengths of hidden state sequences (B) [in multi-encoder case, list of torch.Tensor, [(B), (B), …]]
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
strm_idx (int) – stream index for parallel speaker attention in multi-speaker case
lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)
- Returns:
attention weights with the following shape: 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) multi-encoder case => [(B, Lmax, Tmax1), (B, Lmax, Tmax2), …, (B, Lmax, NumEncs)], 3) other case => attention weights (B, Lmax, Tmax).
- Return type:
float ndarray
-
forward
(hs_pad, hlens, ys_pad, strm_idx=0, lang_ids=None)[source]¶ Decoder forward
- Parameters:
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D) [in multi-encoder case, list of torch.Tensor, [(B, Tmax_1, D), (B, Tmax_2, D), …]]
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B) [in multi-encoder case, list of torch.Tensor, [(B), (B), …]]
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
strm_idx (int) – stream index indicates the index of decoding stream.
lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)
- Returns:
attention loss value
- Return type:
torch.Tensor
- Returns:
accuracy
- Return type:
float
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters:
x (torch.Tensor) – The encoded feature tensor
- Returns:
initial state
-
recognize_beam
(h, lpz, recog_args, char_list, rnnlm=None, strm_idx=0)[source]¶ beam search implementation
- Parameters:
h (torch.Tensor) – encoder hidden state (T, eprojs) [in multi-encoder case, list of torch.Tensor, [(T1, eprojs), (T2, eprojs), …] ]
lpz (torch.Tensor) – ctc log softmax output (T, odim) [in multi-encoder case, list of torch.Tensor, [(T1, odim), (T2, odim), …] ]
recog_args (Namespace) – argument Namespace containing options
char_list – list of character strings
rnnlm (torch.nn.Module) – language model module
strm_idx (int) – stream index for speaker parallel attention in multi-speaker case
- Returns:
N-best decoding results
- Return type:
list of dicts
-
recognize_beam_batch
(h, hlens, lpz, recog_args, char_list, rnnlm=None, normalize_score=True, strm_idx=0, lang_ids=None)[source]¶
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters:
yseq (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates yseq.
- Returns:
Tuple of scores for next token that has a shape of (n_vocab) and next state for yseq
- Return type:
tuple[torch.Tensor, Any]
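The `init_state` and `score` methods together define an incremental scoring protocol: each call takes the current prefix and state and returns next-token scores plus the next state. The pure-Python sketch below shows how a greedy decoding loop might drive such a scorer. `ToyScorer` and `greedy_decode` are hypothetical, and plain lists stand in for the tensors of the real interface.

```python
class ToyScorer:
    """Hypothetical scorer exposing init_state/score in the documented shape."""

    def __init__(self, n_vocab, eos):
        self.n_vocab = n_vocab
        self.eos = eos

    def init_state(self, x):
        return 0  # the state here is just the number of tokens scored so far

    def score(self, yseq, state, x):
        # Prefer (last token + 1) until len(x) tokens have been emitted, then eos.
        target = self.eos if state >= len(x) else (yseq[-1] + 1) % self.n_vocab
        scores = [0.0 if v == target else -1.0 for v in range(self.n_vocab)]
        return scores, state + 1


def greedy_decode(scorer, x, sos, eos, maxlen=20):
    """Repeatedly pick the best-scoring next token until eos or maxlen."""
    yseq, state = [sos], scorer.init_state(x)
    for _ in range(maxlen):
        scores, state = scorer.score(yseq, state, x)
        best = max(range(len(scores)), key=scores.__getitem__)
        yseq.append(best)
        if best == eos:
            break
    return yseq


result = greedy_decode(ToyScorer(n_vocab=6, eos=5), x=[0.1, 0.2, 0.3], sos=0, eos=5)
```

Keeping the state outside the scorer is what lets a beam search hold one state per hypothesis and extend hypotheses independently.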
espnet.nets.pytorch_backend.rnn.encoders¶
-
class
espnet.nets.pytorch_backend.rnn.encoders.
Encoder
(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module
- Parameters:
etype (str) – type of encoder network
idim (int) – number of dimensions of encoder network
elayers (int) – number of layers of encoder network
eunits (int) – number of lstm units of encoder network
eprojs (int) – number of projection units of encoder network
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
in_channel (int) – number of input channels
-
forward
(xs_pad, ilens, prev_states=None)[source]¶ Encoder forward
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_states (torch.Tensor) – batch of previous encoder hidden states (?, …)
- Returns:
batch of hidden state sequences (B, Tmax, eprojs)
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.encoders.
RNN
(idim, elayers, cdim, hdim, dropout, typ='blstm')[source]¶ Bases:
torch.nn.modules.module.Module
RNN module
- Parameters:
idim (int) – dimension of inputs
elayers (int) – number of encoder layers
cdim (int) – number of rnn units (resulted in cdim * 2 if bidirectional)
hdim (int) – number of final projection units
dropout (float) – dropout rate
typ (str) – The RNN type
-
forward
(xs_pad, ilens, prev_state=None)[source]¶ RNN forward
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_state (torch.Tensor) – batch of previous RNN states
- Returns:
batch of hidden state sequences (B, Tmax, eprojs)
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.encoders.
RNNP
(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]¶ Bases:
torch.nn.modules.module.Module
RNN with projection layer module
- Parameters:
idim (int) – dimension of inputs
elayers (int) – number of encoder layers
cdim (int) – number of rnn units (resulted in cdim * 2 if bidirectional)
hdim (int) – number of projection units
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
typ (str) – The RNN type
-
forward
(xs_pad, ilens, prev_state=None)[source]¶ RNNP forward
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_state (torch.Tensor) – batch of previous RNN states
- Returns:
batch of hidden state sequences (B, Tmax, hdim)
- Return type:
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.encoders.
VGG2L
(in_channel=1)[source]¶ Bases:
torch.nn.modules.module.Module
VGG-like module
- Parameters:
in_channel (int) – number of input channels
-
forward
(xs_pad, ilens, **kwargs)[source]¶ VGG2L forward
- Parameters:
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
- Returns:
batch of padded hidden state sequences (B, Tmax // 4, 128 * D // 4)
- Return type:
torch.Tensor
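The output shape stated above can be evaluated with a small helper. `vgg2l_output_shape` is a hypothetical name, and it simply applies the documented expression. How lengths that are not multiples of 4 are rounded depends on the pooling configuration, which this entry does not specify, so the example uses sizes divisible by 4.

```python
def vgg2l_output_shape(batch, tmax, d):
    """Shape as stated in the VGG2L entry: (B, Tmax // 4, 128 * D // 4)."""
    return (batch, tmax // 4, 128 * d // 4)


shape = vgg2l_output_shape(batch=8, tmax=100, d=80)
```

For 100 frames of 80-dimensional features this yields 25 frames of 2560 dimensions per utterance.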
-
espnet.nets.pytorch_backend.rnn.encoders.
encoder_for
(args, idim, subsample)[source]¶ Instantiates an encoder module given the program arguments
- Parameters:
args (Namespace) – The arguments
idim (int or List of integer) – dimension of input, e.g. 83, or List of dimensions of inputs, e.g. [83,83]
subsample (List or List of List) – subsample factors, e.g. [1,2,2,1,1], or List of subsample factors of each encoder, e.g. [[1,2,2,1,1], [1,2,2,1,1]]
- Returns:
The encoder module
- Return type:
torch.nn.Module
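To see what subsample factors such as [1,2,2,1,1] do to a sequence length, the sketch below applies each factor in turn, assuming that a factor `sub` keeps every `sub`-th frame starting from the first. The helper `subsampled_length` is hypothetical, and the exact rounding used by the encoder layers is not stated in this entry.

```python
def subsampled_length(ilen, subsample):
    """Length after keeping every sub-th frame for each factor in turn."""
    for sub in subsample:
        if sub > 1:
            # Same as len(range(0, ilen, sub)), i.e. ceiling division.
            ilen = (ilen + sub - 1) // sub
    return ilen


lengths = [subsampled_length(t, [1, 2, 2, 1, 1]) for t in (100, 83)]
```

Under this assumption the two factors of 2 reduce 100 frames to 25 and 83 frames to 21.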
espnet.nets.pytorch_backend.rnn.__init__¶
Initialize sub package.
espnet.nets.pytorch_backend.rnn.argument¶
RNN common arguments.
-
espnet.nets.pytorch_backend.rnn.argument.
add_arguments_rnn_attention_common
(group)[source]¶ Define common arguments for RNN attention.
espnet.nets.pytorch_backend.streaming.window¶
-
class
espnet.nets.pytorch_backend.streaming.window.
WindowStreamingE2E
(e2e, recog_args, rnnlm=None)[source]¶ Bases:
object
WindowStreamingE2E constructor.
- Parameters:
e2e (E2E) – E2E ASR object
recog_args – arguments for “recognize” method of E2E
-
decode_with_attention_offline
()[source]¶ Run the attention decoder offline.
Works even if the previous layers (encoder and CTC decoder) were being run in the online mode. This method should be run after all the audio has been consumed. This is used mostly to compare the results between offline and online implementation of the previous layers.
espnet.nets.pytorch_backend.streaming.segment¶
espnet.nets.pytorch_backend.streaming.__init__¶
Initialize sub package.