espnet.asr package

Initialize sub package.

espnet.asr.__init__

Initialize sub package.

espnet.asr.asr_mix_utils

This script is used to provide utility functions designed for multi-speaker ASR.

Copyright 2017 Johns Hopkins University (Shinji Watanabe)

Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

Most functions can be directly used as in asr_utils.py:

CompareValueTrigger, restore_snapshot, adadelta_eps_decay, chainer_load, torch_snapshot, torch_save, torch_resume, AttributeDict, get_model_conf.

class espnet.asr.asr_mix_utils.PlotAttentionReport(att_vis_fn, data, outdir, converter, device, reverse=False)[source]

Bases: chainer.training.extension.Extension

Plot attention reporter.

Parameters:
  • att_vis_fn (espnet.nets.*_backend.e2e_asr.calculate_all_attentions) – Function of attention visualization.

  • data (list[tuple(str, dict[str, dict[str, Any]])]) – List of JSON utterance key items.

  • outdir (str) – Directory to save figures.

  • converter (espnet.asr.*_backend.asr.CustomConverter) – CustomConverter object. Function to convert data.

  • device (torch.device) – The destination device to send tensor.

  • reverse (bool) – If True, input and output length are reversed.

Initialize PlotAttentionReport.

draw_attention_plot(att_w)[source]

Visualize attention weights matrix.

Parameters:

att_w (Tensor) – Attention weight matrix.

Returns:

pyplot object with attention matrix image.

Return type:

matplotlib.pyplot

get_attention_weight(idx, att_w, spkr_idx)[source]

Transform attention weight in regard to self.reverse.

get_attention_weights()[source]

Return attention weights.

Returns:

Attention weights (float). The shape depends on the backend:

  • pytorch -> 1) multi-head case => (B, H, Lmax, Tmax); 2) other case => (B, Lmax, Tmax).

  • chainer -> (B, Lmax, Tmax).

Return type:

arr_ws_sd (numpy.ndarray)

log_attentions(logger, step)[source]

Add image files of attention matrix to tensorboard.

espnet.asr.asr_mix_utils.add_results_to_json(js, nbest_hyps_sd, char_list)[source]

Add N-best results to json.

Parameters:
  • js (dict[str, Any]) – Groundtruth utterance dict.

  • nbest_hyps_sd (list[dict[str, Any]]) – List of hypotheses for multiple speakers (# Utts x # Spkrs).

  • char_list (list[str]) – List of characters.

Returns:

N-best results added utterance dict.

Return type:

dict[str, Any]

espnet.asr.asr_utils

class espnet.asr.asr_utils.CompareValueTrigger(key, compare_fn, trigger=(1, 'epoch'))[source]

Bases: object

Trigger invoked when the value of key becomes larger or smaller than before.

Parameters:
  • key (str) – Key of value.

  • compare_fn ((float, float) -> bool) – Function to compare the values.

  • trigger (tuple(int, str)) – Trigger that decides the comparison interval.
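
The comparison logic can be illustrated with a small standalone sketch. It is not the ESPnet class: the name SimpleCompareTrigger and its check method are hypothetical, and the interval handling and averaging over trainer observations are left out. It assumes the trigger fires when compare_fn(best, current) returns True and otherwise updates the stored reference value.

```python
class SimpleCompareTrigger:
    """Hypothetical stand-in: fires when compare_fn(best, current) is True."""

    def __init__(self, compare_fn):
        self._compare_fn = compare_fn
        self._best_value = None

    def check(self, value):
        # The first observation only initializes the reference value.
        if self._best_value is None:
            self._best_value = value
            return False
        if self._compare_fn(self._best_value, value):
            return True
        # Otherwise the current value becomes the new reference.
        self._best_value = value
        return False


# Fire when validation accuracy drops below the stored reference value.
trigger = SimpleCompareTrigger(lambda best, current: best > current)
fired = [trigger.check(v) for v in (0.70, 0.75, 0.73, 0.80)]
```

In this run only the third value (0.73, after 0.75) makes the trigger fire.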

class espnet.asr.asr_utils.PlotAttentionReport(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_factor=1)[source]

Bases: chainer.training.extension.Extension

Plot attention reporter.

Parameters:
  • att_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_attentions) – Function of attention visualization.

  • data (list[tuple(str, dict[str, list[Any]])]) – List of JSON utterance key items.

  • outdir (str) – Directory to save figures.

  • converter (espnet.asr.*_backend.asr.CustomConverter) – Function to convert data.

  • device (int | torch.device) – Device.

  • reverse (bool) – If True, input and output length are reversed.

  • ikey (str) – Key to access input (for ASR/ST ikey=”input”, for MT ikey=”output”.)

  • iaxis (int) – Dimension to access input (for ASR/ST iaxis=0, for MT iaxis=1.)

  • okey (str) – Key to access output (for ASR/ST okey=”input”, for MT okey=”output”.)

  • oaxis (int) – Dimension to access output (for ASR/ST oaxis=0, for MT oaxis=0.)

  • subsampling_factor (int) – Subsampling factor in the encoder.

draw_attention_plot(att_w)[source]

Plot the att_w matrix.

Returns:

pyplot object with attention matrix image.

Return type:

matplotlib.pyplot

draw_han_plot(att_w)[source]

Plot the att_w matrix for hierarchical attention.

Returns:

pyplot object with attention matrix image.

Return type:

matplotlib.pyplot

get_attention_weights()[source]

Return attention weights.

Returns:

Attention weights (float). The shape depends on the backend:

  • pytorch -> 1) multi-head case => (B, H, Lmax, Tmax); 2) other case => (B, Lmax, Tmax).

  • chainer -> (B, Lmax, Tmax).

Return type:

numpy.ndarray

log_attentions(logger, step)[source]

Add image files of att_ws matrix to the tensorboard.

trim_attention_weight(uttid, att_w)[source]

Transform attention matrix with regard to self.reverse.
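
A minimal sketch of what trimming a padded attention matrix might look like for one utterance is given below. The function name trim_attention and its arguments are hypothetical. It assumes the per-utterance matrix is either (Lmax, Tmax) or, in the multi-head case, (H, Lmax, Tmax), and that the encoder length is divided by the subsampling factor. Swapping the input and output roles when reverse is True is left to the caller.

```python
import numpy as np


def trim_attention(att_w, dec_len, enc_len, subsampling_factor=1):
    """Cut padding from one utterance's attention matrix (illustrative only)."""
    if subsampling_factor > 1:
        # The encoder output is shorter than the raw input by this factor.
        enc_len //= subsampling_factor
    if att_w.ndim == 3:
        # Multi-head case: (H, Lmax, Tmax)
        return att_w[:, :dec_len, :enc_len]
    # Single-head case: (Lmax, Tmax)
    return att_w[:dec_len, :enc_len]


single = trim_attention(np.zeros((10, 50)), dec_len=7, enc_len=40)
multi = trim_attention(
    np.zeros((4, 10, 50)), dec_len=7, enc_len=80, subsampling_factor=4
)
```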

class espnet.asr.asr_utils.PlotCTCReport(ctc_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_factor=1)[source]

Bases: chainer.training.extension.Extension

Plot CTC reporter.

Parameters:
  • ctc_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_ctc_probs) – Function of CTC visualization.

  • data (list[tuple(str, dict[str, list[Any]])]) – List of JSON utterance key items.

  • outdir (str) – Directory to save figures.

  • converter (espnet.asr.*_backend.asr.CustomConverter) – Function to convert data.

  • device (int | torch.device) – Device.

  • reverse (bool) – If True, input and output length are reversed.

  • ikey (str) – Key to access input (for ASR/ST ikey=”input”, for MT ikey=”output”.)

  • iaxis (int) – Dimension to access input (for ASR/ST iaxis=0, for MT iaxis=1.)

  • okey (str) – Key to access output (for ASR/ST okey=”input”, for MT okey=”output”.)

  • oaxis (int) – Dimension to access output (for ASR/ST oaxis=0, for MT oaxis=0.)

  • subsampling_factor (int) – Subsampling factor in the encoder.

draw_ctc_plot(ctc_prob)[source]

Plot the ctc_prob matrix.

Returns:

pyplot object with CTC prob matrix image.

Return type:

matplotlib.pyplot

get_ctc_probs()[source]

Return CTC probs.

Returns:

CTC probabilities (float). The shape may differ by backend: (B, Tmax, vocab).

Return type:

numpy.ndarray

log_ctc_probs(logger, step)[source]

Add image files of ctc probs to the tensorboard.

trim_ctc_prob(uttid, prob)[source]

Trim CTC posteriors according to input lengths.

espnet.asr.asr_utils.adadelta_eps_decay(eps_decay)[source]

Extension to perform adadelta eps decay.

Parameters:

eps_decay (float) – Decay rate of eps.

Returns:

An extension function.
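
As a rough sketch of the decay step itself, assuming the optimizer exposes PyTorch-style parameter groups that each carry an "eps" entry: the helper name decay_eps is hypothetical, and a plain list of dicts stands in for optimizer.param_groups so the example runs without any framework.

```python
def decay_eps(param_groups, eps_decay):
    """Multiply the eps of every parameter group by eps_decay, in place."""
    for group in param_groups:
        group["eps"] *= eps_decay


# Stand-in for optimizer.param_groups (an assumption for illustration).
param_groups = [{"eps": 1e-8}, {"eps": 1e-8}]
decay_eps(param_groups, eps_decay=0.01)
```

In training, a step like this would typically be attached to a trigger so that it runs only when a monitored validation value stops improving.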

espnet.asr.asr_utils.adam_lr_decay(eps_decay)[source]

Extension to perform adam lr decay.

Parameters:

eps_decay (float) – Decay rate of lr.

Returns:

An extension function.

espnet.asr.asr_utils.add_gradient_noise(model, iteration, duration=100, eta=1.0, scale_factor=0.55)[source]

Adds noise from a standard normal distribution to the gradients.

The standard deviation (sigma) is controlled by the three hyper-parameters below. sigma goes to zero (no noise) with more iterations.

Parameters:
  • model (torch.nn.Module) – Model.

  • iteration (int) – Number of iterations.

  • duration (int) – Number of durations to control the interval of the sigma change.

  • eta (float) – The magnitude of sigma.

  • scale_factor (float) – The scale of sigma.
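
The decay of sigma can be sketched as follows. The schedule used here, sigma = eta * interval ** (-scale_factor) with interval = iteration // duration + 1, is an assumption chosen to match the description above, not a statement of the exact ESPnet code. The helper names are hypothetical, and plain Python floats stand in for gradient tensors.

```python
import random


def noise_sigma(iteration, duration=100, eta=1.0, scale_factor=0.55):
    """Assumed schedule: sigma shrinks as the interval index grows."""
    interval = iteration // duration + 1
    return eta * interval ** (-scale_factor)


def add_noise(grads, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [g + sigma * rng.gauss(0.0, 1.0) for g in grads]


sigmas = [noise_sigma(it) for it in (0, 100, 1000, 10000)]
noisy = add_noise([0.5, -0.2, 0.1], sigmas[1], random.Random(0))
```

Under this assumed schedule sigma stays constant within each block of duration iterations and decreases from one block to the next.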

espnet.asr.asr_utils.add_results_to_json(js, nbest_hyps, char_list)[source]

Add N-best results to json.

Parameters:
  • js (dict[str, Any]) – Groundtruth utterance dict.

  • nbest_hyps (list[dict[str, Any]]) – List of N-best hypotheses.

  • char_list (list[str]) – List of characters.

Returns:

N-best results added utterance dict.

Return type:

dict[str, Any]

espnet.asr.asr_utils.chainer_load(path, model)[source]

Load chainer model parameters.

Parameters:
  • path (str) – Model path or snapshot file path to be loaded.

  • model (chainer.Chain) – Chainer model.

espnet.asr.asr_utils.format_mulenc_args(args)[source]

Format args for multi-encoder setup.

It deals with the following situations (when args.num_encs=2):

  1. args.elayers = None -> args.elayers = [4, 4];

  2. args.elayers = 4 -> args.elayers = [4, 4];

  3. args.elayers = [4, 4, 4] -> args.elayers = [4, 4].
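
The three documented situations can be mimicked with a small helper. This is a sketch only: normalize_per_encoder is a hypothetical name, the default value is passed in explicitly, and a list shorter than num_encs is returned unchanged here, which may differ from the real function.

```python
def normalize_per_encoder(value, num_encs, default):
    """Return a per-encoder list for the three documented cases."""
    if value is None:
        # Case 1: nothing given, so fall back to the default for every encoder.
        return [default] * num_encs
    if not isinstance(value, list):
        # Case 2: a single value is replicated for every encoder.
        return [value] * num_encs
    if len(value) > num_encs:
        # Case 3: a list that is too long is truncated.
        return value[:num_encs]
    return value


examples = [
    normalize_per_encoder(v, num_encs=2, default=4) for v in (None, 3, [4, 5, 6])
]
```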

espnet.asr.asr_utils.get_model_conf(model_path, conf_path=None)[source]

Get model config information by reading a model config file (model.json).

Parameters:
  • model_path (str) – Model path.

  • conf_path (str) – Optional model config path.

Returns:

Config information loaded from json file.

Return type:

list[int, int, dict[str, Any]]
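
A minimal sketch of reading such a config file follows. It assumes that model.json sits in the same directory as the model file when conf_path is not given, and that it stores a three-element list [idim, odim, args]. The function name read_model_conf, the file contents, and the argument names inside the dict are made up for the example.

```python
import json
import os
import tempfile


def read_model_conf(model_path, conf_path=None):
    """Load [idim, odim, args] from a JSON config (illustrative sketch)."""
    if conf_path is None:
        # Assumption: model.json sits next to the model file.
        conf_path = os.path.join(os.path.dirname(model_path), "model.json")
    with open(conf_path, "r", encoding="utf-8") as f:
        idim, odim, args = json.load(f)
    return idim, odim, args


with tempfile.TemporaryDirectory() as tmpdir:
    # Write a toy config so the example is self-contained.
    with open(os.path.join(tmpdir, "model.json"), "w", encoding="utf-8") as f:
        json.dump([83, 52, {"elayers": 4, "eunits": 320}], f)
    idim, odim, train_args = read_model_conf(os.path.join(tmpdir, "model.acc.best"))
```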

espnet.asr.asr_utils.parse_hypothesis(hyp, char_list)[source]

Parse hypothesis.

Parameters:
  • hyp (list[dict[str, Any]]) – Recognition hypothesis.

  • char_list (list[str]) – List of characters.

Returns:

tuple(str, str, str, float)
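
One plausible reading of the returned tuple is (text, token, tokenid, score). The sketch below follows that reading under several assumptions: the hypothesis is a single dict with "yseq" and "score" keys, the first id in "yseq" is a start symbol that is skipped, and "<space>" marks word boundaries. The function name parse_hyp and the toy character list are hypothetical.

```python
def parse_hyp(hyp, char_list):
    """Turn a hypothesis dict into (text, token, tokenid, score)."""
    # Assumption: the first id is a start-of-sequence symbol and is dropped.
    token_ids = [int(i) for i in hyp["yseq"][1:]]
    tokens = [char_list[i] for i in token_ids]
    text = "".join(tokens).replace("<space>", " ")
    token = " ".join(tokens)
    tokenid = " ".join(str(i) for i in token_ids)
    return text, token, tokenid, float(hyp["score"])


char_list = ["<blank>", "<space>", "a", "b", "<eos>"]
hyp = {"yseq": [4, 2, 1, 3], "score": -1.5}
result = parse_hyp(hyp, char_list)
```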

espnet.asr.asr_utils.plot_spectrogram(plt, spec, mode='db', fs=None, frame_shift=None, bottom=True, left=True, right=True, top=False, labelbottom=True, labelleft=True, labelright=True, labeltop=False, cmap='inferno')[source]

Plot spectrogram using matplotlib.

Parameters:
  • plt (matplotlib.pyplot) – pyplot object.

  • spec (numpy.ndarray) – Input stft (Freq, Time)

  • mode (str) – db or linear.

  • fs (int) – Sample frequency. To convert y-axis to kHz unit.

  • frame_shift (int) – The frame shift of stft. To convert x-axis to second unit.

  • bottom (bool) – Whether to draw the respective ticks.

  • left (bool) –

  • right (bool) –

  • top (bool) –

  • labelbottom (bool) – Whether to draw the respective tick labels.

  • labelleft (bool) –

  • labelright (bool) –

  • labeltop (bool) –

  • cmap (str) – Colormap defined in matplotlib.

espnet.asr.asr_utils.restore_snapshot(model, snapshot, load_fn=None)[source]

Extension to restore snapshot.

Returns:

An extension function.

espnet.asr.asr_utils.snapshot_object(target, filename)[source]

Returns a trainer extension to take snapshots of a given object.

Parameters:
  • target (model) – Object to serialize.

  • filename (str) – Name of the file into which the object is serialized. It can be a format string, where the trainer object is passed to the str.format method. For example, 'snapshot_{.updater.iteration}' is converted to 'snapshot_10000' at the 10,000th iteration.

Returns:

An extension function.
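
The format-string behavior can be checked with plain str.format. Here a SimpleNamespace stands in for the trainer object, which is an assumption made only to keep the example self-contained.

```python
from types import SimpleNamespace

# Stand-in for a trainer whose updater has reached iteration 10000.
trainer = SimpleNamespace(updater=SimpleNamespace(iteration=10000))

# "{.updater.iteration}" reads the attribute chain from the first argument.
filename = "snapshot_{.updater.iteration}".format(trainer)
```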

espnet.asr.asr_utils.torch_load(path, model)[source]

Load torch model states.

Parameters:
  • path (str) – Model path or snapshot file path to be loaded.

  • model (torch.nn.Module) – Torch model.

espnet.asr.asr_utils.torch_resume(snapshot_path, trainer)[source]

Resume from snapshot for pytorch.

Parameters:
  • snapshot_path (str) – Snapshot file path.

  • trainer (chainer.training.Trainer) – Chainer’s trainer instance.

espnet.asr.asr_utils.torch_save(path, model)[source]

Save torch model states.

Parameters:
  • path (str) – Model path to be saved.

  • model (torch.nn.Module) – Torch model.

espnet.asr.asr_utils.torch_snapshot(savefun=<function save>, filename='snapshot.ep.{.updater.epoch}')[source]

Extension to take snapshot of the trainer for pytorch.

Returns:

An extension function.

espnet.asr.chainer_backend.__init__

Initialize sub package.

espnet.asr.chainer_backend.asr

Training/decoding definition for the speech recognition task.

espnet.asr.chainer_backend.asr.recog(args)[source]

Decode with the given args.

Parameters:

args (namespace) – The program arguments.

espnet.asr.chainer_backend.asr.train(args)[source]

Train with the given args.

Parameters:

args (namespace) – The program arguments.

espnet.asr.pytorch_backend.asr_mix

This script is used for multi-speaker speech recognition.

Copyright 2017 Johns Hopkins University (Shinji Watanabe)

Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

class espnet.asr.pytorch_backend.asr_mix.CustomConverter(subsampling_factor=1, dtype=torch.float32, num_spkrs=2)[source]

Bases: object

Custom batch converter for Pytorch.

Parameters:
  • subsampling_factor (int) – The subsampling factor.

  • dtype (torch.dtype) – Data type to convert.

Initialize the converter.

espnet.asr.pytorch_backend.asr_mix.recog(args)[source]

Decode with the given args.

Parameters:

args (namespace) – The program arguments.

espnet.asr.pytorch_backend.asr_mix.train(args)[source]

Train with the given args.

Parameters:

args (namespace) – The program arguments.

espnet.asr.pytorch_backend.recog

V2 backend for asr_recog.py using espnet.nets.beam_search.BeamSearch.

espnet.asr.pytorch_backend.recog.recog_v2(args)[source]

Decode with custom models that implement ScorerInterface.

Notes

The previous backend espnet.asr.pytorch_backend.asr.recog only supports E2E and RNNLM.

Parameters:

args (namespace) – The program arguments. See espnet.bin.asr_recog.get_parser for details.

espnet.asr.pytorch_backend.__init__

Initialize sub package.

espnet.asr.pytorch_backend.asr_init

Finetuning methods.

espnet.asr.pytorch_backend.asr_init.create_transducer_compatible_state_dict(model_state_dict, encoder_type, encoder_units)[source]

Create a compatible transducer model state dict for transfer learning.

If RNN encoder modules from a non-Transducer model are found in the pre-trained model state dict, the corresponding modules keys are renamed for compatibility.

Parameters:
  • model_state_dict (Dict) – Pre-trained model state dict

  • encoder_type (str) – Type of pre-trained encoder.

  • encoder_units (int) – Number of encoder units in pre-trained model.

Returns:

Transducer compatible pre-trained model state dict.

Return type:

new_state_dict (Dict)

espnet.asr.pytorch_backend.asr_init.filter_modules(model_state_dict, modules)[source]

Filter non-matched modules in model state dict.

Parameters:
  • model_state_dict (Dict) – Pre-trained model state dict.

  • modules (List) – Specified module(s) to transfer.

Returns:

Filtered module list.

Return type:

new_mods (List)

espnet.asr.pytorch_backend.asr_init.freeze_modules(model, modules)[source]

Freeze model parameters according to modules list.

Parameters:
  • model (torch.nn.Module) – Main model.

  • modules (List) – Specified module(s) to freeze.

Returns:

  • model (torch.nn.Module) – Updated main model.

  • model_params (filter) – Filtered model parameters.

espnet.asr.pytorch_backend.asr_init.get_lm_state_dict(lm_state_dict)[source]

Create compatible ASR decoder state dict from LM state dict.

Parameters:

lm_state_dict (Dict) – Pre-trained LM state dict.

Returns:

State dict with compatible key names.

Return type:

new_state_dict (Dict)

espnet.asr.pytorch_backend.asr_init.get_partial_state_dict(model_state_dict, modules)[source]

Create state dict with specified modules matching input model modules.

Parameters:
  • model_state_dict (Dict) – Pre-trained model state dict.

  • modules (Dict) – Specified module(s) to transfer.

Returns:

State dict with specified modules weights.

Return type:

new_state_dict (Dict)
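
As an illustration of selecting modules from a state dict, the sketch below keeps every entry whose key starts with one of the requested module names. Prefix matching is an assumption here, the function name partial_state_dict is hypothetical, and shape tuples stand in for the weight tensors so that no framework is needed.

```python
from collections import OrderedDict


def partial_state_dict(model_state_dict, modules):
    """Keep only entries whose key starts with a requested module name."""
    new_state_dict = OrderedDict()
    for key, value in model_state_dict.items():
        if any(key.startswith(m) for m in modules):
            new_state_dict[key] = value
    return new_state_dict


# Toy pre-trained state dict; values are shapes rather than tensors.
pretrained = OrderedDict(
    [
        ("enc.l0.weight", (320, 83)),
        ("enc.l0.bias", (320,)),
        ("dec.embed.weight", (52, 320)),
    ]
)
subset = partial_state_dict(pretrained, ["enc."])
```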

espnet.asr.pytorch_backend.asr_init.get_trained_model_state_dict(model_path, new_is_transducer)[source]

Extract the trained model state dict for pre-initialization.

Parameters:
  • model_path (str) – Path to trained model.

  • new_is_transducer (bool) – Whether the new model is Transducer-based.

Returns:

Trained model state dict.

Return type:

(Dict)

espnet.asr.pytorch_backend.asr_init.load_trained_model(model_path, training=True)[source]

Load the trained model for recognition.

Parameters:
  • model_path (str) – Path to model.***.best

  • training (bool) – Training mode specification for transducer model.

Returns:

  • model (torch.nn.Module) – Trained model.

  • train_args (Namespace) – Trained model arguments.

espnet.asr.pytorch_backend.asr_init.load_trained_modules(idim, odim, args, interface=<class 'espnet.nets.asr_interface.ASRInterface'>)[source]

Load ASR/MT/TTS model with pre-trained weights for specified modules.

Parameters:
  • idim (int) – Input dimension.

  • odim (int) – Output dimension.

  • args (Namespace) – Model arguments.

  • interface (ASRInterface|MTInterface|TTSInterface) – Model interface.

Returns:

Model with pre-initialized weights.

Return type:

main_model (torch.nn.Module)

espnet.asr.pytorch_backend.asr_init.transfer_verification(model_state_dict, partial_state_dict, modules)[source]

Verify tuples (key, shape) for input model modules match specified modules.

Parameters:
  • model_state_dict (Dict) – Main model state dict.

  • partial_state_dict (Dict) – Pre-trained model state dict.

  • modules (List) – Specified module(s) to transfer.

Returns:

Whether transfer learning is allowed.

Return type:

(bool)
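
The check can be pictured as comparing sorted (key, shape) pairs for the selected modules. The sketch below is an assumption-laden simplification: shapes_match is a hypothetical name, the dicts map keys directly to shape tuples instead of tensors, modules are matched by key prefix, and an empty selection is treated as a failed match.

```python
def shapes_match(model_shapes, partial_shapes, modules):
    """Compare (key, shape) pairs of the selected modules in both dicts."""

    def select(shapes):
        return sorted(
            (key, shape)
            for key, shape in shapes.items()
            if any(key.startswith(m) for m in modules)
        )

    selected_model = select(model_shapes)
    selected_partial = select(partial_shapes)
    # Assumption: selecting nothing counts as a mismatch.
    return len(selected_model) > 0 and selected_model == selected_partial


main = {"enc.l0.weight": (320, 83), "dec.out.weight": (52, 320)}
same_shape = shapes_match(main, {"enc.l0.weight": (320, 83)}, ["enc."])
other_shape = shapes_match(main, {"enc.l0.weight": (256, 83)}, ["enc."])
```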

espnet.asr.pytorch_backend.asr

Training/decoding definition for the speech recognition task.

class espnet.asr.pytorch_backend.asr.CustomConverter(subsampling_factor=1, dtype=torch.float32)[source]

Bases: object

Custom batch converter for Pytorch.

Parameters:
  • subsampling_factor (int) – The subsampling factor.

  • dtype (torch.dtype) – Data type to convert.

Construct a CustomConverter object.

class espnet.asr.pytorch_backend.asr.CustomConverterMulEnc(subsampling_factors=[1, 1], dtype=torch.float32)[source]

Bases: object

Custom batch converter for Pytorch in multi-encoder case.

Parameters:
  • subsampling_factors (list) – List of subsampling factors for each encoder.

  • dtype (torch.dtype) – Data type to convert.

Initialize the converter.

class espnet.asr.pytorch_backend.asr.CustomEvaluator(model, iterator, target, device, ngpu=None, use_ddp=False)[source]

Bases: espnet.utils.training.evaluator.BaseEvaluator

Custom Evaluator for Pytorch.

Parameters:
  • model (torch.nn.Module) – The model to evaluate.

  • iterator (chainer.dataset.Iterator) – The train iterator.

  • target (link | dict[str, link]) – Link object or a dictionary of links to evaluate. If this is just a link object, the link is registered by the name 'main'.

  • device (torch.device) – The device used.

  • ngpu (int) – The number of GPUs.

  • use_ddp (bool) – The flag to use DDP.

evaluate()[source]

Main evaluate routine for CustomEvaluator.

class espnet.asr.pytorch_backend.asr.CustomUpdater(model, grad_clip_threshold, train_iter, optimizer, device, ngpu, grad_noise=False, accum_grad=1, use_apex=False, use_ddp=False)[source]

Bases: chainer.training.updaters.standard_updater.StandardUpdater

Custom Updater for Pytorch.

Parameters:
  • model (torch.nn.Module) – The model to update.

  • grad_clip_threshold (float) – The gradient clipping value to use.

  • train_iter (chainer.dataset.Iterator) – The training iterator.

  • optimizer (torch.optim.Optimizer) – The training optimizer.

  • device (torch.device) – The device to use.

  • ngpu (int) – The number of gpus to use.

  • use_apex (bool) – The flag to use Apex in backprop.

  • use_ddp (bool) – The flag to use DDP for multi-GPU training.

update()[source]

Updates the parameters of the target model.

This method implements an update formula for the training task, including data loading, forward/backward computations, and actual updates of parameters.

This method is called once at each iteration of the training loop.

update_core()[source]

Main update routine of the CustomUpdater.

class espnet.asr.pytorch_backend.asr.DistributedDictSummary(device=None)[source]

Bases: object

Distributed version of DictSummary.

This implementation is based on an official implementation below. https://github.com/chainer/chainer/blob/v6.7.0/chainer/reporter.py

To gather stats information from all processes and calculate exact mean values, this class is running AllReduce operation in compute_mean().
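
The reason for reducing across processes can be shown with a toy calculation. In the sketch below, Python's built-in sum stands in for the AllReduce step and each process is represented by a hypothetical (sum, count) pair. The point is that dividing the global sum by the global count gives the exact mean, while averaging per-process means does not when the counts differ.

```python
def exact_mean(per_process_stats):
    """Combine (sum, count) pairs from all processes into one mean."""
    total_sum = sum(s for s, _ in per_process_stats)
    total_count = sum(n for _, n in per_process_stats)
    return total_sum / total_count


# Process 0 saw three values summing to 6.0; process 1 saw one value, 10.0.
stats = [(6.0, 3), (10.0, 1)]
global_mean = exact_mean(stats)
naive_mean = sum(s / n for s, n in stats) / len(stats)
```

Here the exact mean is 4.0, while the mean of the per-process means is 6.0.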

add(d)[source]

compute_mean()[source]

espnet.asr.pytorch_backend.asr.enhance(args)[source]

Dumping enhanced speech and mask.

Parameters:

args (namespace) – The program arguments.

espnet.asr.pytorch_backend.asr.is_writable_process(args, worldsize, rank, localrank)[source]

espnet.asr.pytorch_backend.asr.recog(args)[source]

Decode with the given args.

Parameters:

args (namespace) – The program arguments.

espnet.asr.pytorch_backend.asr.train(args)[source]

Train with the given args.

Parameters:

args (namespace) – The program arguments.