Common usages
ESPnet1
Please first check the ESPnet1 tutorial.
ESPnet2
Please first check the ESPnet2 tutorial.
Multiple GPU TIPs
Note that if you want to use multiple GPUs, NCCL must be installed before setup.
Currently, espnet1 only supports multiple GPU training within a single node. Distributed training across multiple nodes is only supported in espnet2.
We don’t support multiple GPU inference. Instead, please split the recognition task into multiple jobs and distribute these jobs across multiple GPUs.
If you cannot get enough speed improvement with multiple GPUs, first check the GPU usage with nvidia-smi. If the GPU-Util percentage is low, the bottleneck is likely disk access. You can apply data prefetching with --n-iter-processes 2 in your run.sh to mitigate the problem. Note that data prefetching consumes a lot of CPU memory, so be careful when you increase the number of processes.
The behavior of batch size in ESPnet2 during multi-GPU training is different from that in ESPnet1. In ESPnet2, the total batch size does not change with the number of GPUs. Therefore, you need to manually increase the batch size if you increase the number of GPUs. Please refer to this doc for more information.
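The ESPnet2 batch-size behavior described above comes down to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the total batch is divided evenly among the GPUs, which may not hold for every batching scheme.

```python
def per_gpu_batch_size(total_batch_size: int, ngpu: int) -> int:
    """Share of a fixed total batch that each GPU receives (even split assumed)."""
    if ngpu < 1:
        raise ValueError("ngpu must be at least 1")
    return total_batch_size // ngpu


def scaled_total_batch_size(single_gpu_batch_size: int, ngpu: int) -> int:
    """Total batch size that keeps the per-GPU share equal to the single-GPU value."""
    if ngpu < 1:
        raise ValueError("ngpu must be at least 1")
    return single_gpu_batch_size * ngpu


if __name__ == "__main__":
    # With a fixed total of 32, four GPUs each see a quarter of the batch.
    print(per_gpu_batch_size(32, 4))
    # To keep 32 samples per GPU on four GPUs, the total must be raised.
    print(scaled_total_batch_size(32, 4))
```

Under this assumption, moving from one GPU to four without changing the configured batch size shrinks the per-GPU share, which is why the batch size has to be increased by hand.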
Start from the middle stage or stop at the specified stage
run.sh has multiple stages, including data preparation, training, etc., so you may want to start from a specific stage, for example when an earlier stage failed.
You can start from a specified stage and stop the process at a specified stage as follows:
# Start from 3rd stage and stop at 5th stage
$ ./run.sh --stage 3 --stop-stage 5
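The selection made by --stage and --stop-stage can be pictured as an inclusive range check. The following is a minimal sketch, not the recipe's actual implementation: the function name and the stage numbering are hypothetical, and it assumes both bounds are inclusive, as the comment in the example above suggests.

```python
from typing import List, Sequence


def stages_to_run(stage: int, stop_stage: int, available: Sequence[int]) -> List[int]:
    """Return every stage k with stage <= k <= stop_stage (both ends inclusive)."""
    return [k for k in available if stage <= k <= stop_stage]


if __name__ == "__main__":
    # Hypothetical recipe with stages numbered 0 to 7.
    print(stages_to_run(3, 5, range(8)))
```

Setting both options to the same value runs a single stage under this assumption.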
CTC, attention, and hybrid CTC/attention
ESPnet can easily switch the model’s training/decoding mode among CTC, attention, and hybrid CTC/attention.
Each mode can be trained by specifying mtlalpha (espnet1) or ctc_weight (espnet2):
espnet1
# hybrid CTC/attention (default)
mtlalpha: 0.3
# CTC
mtlalpha: 1.0
# attention
mtlalpha: 0.0
espnet2
# hybrid CTC/attention (default)
model_conf:
    ctc_weight: 0.3
# CTC
model_conf:
    ctc_weight: 1.0
# attention
model_conf:
    ctc_weight: 0.0
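The three training settings above are the endpoints and an interior point of a single interpolation weight. The sketch below assumes the usual multi-task formulation of hybrid CTC/attention, in which the weight linearly interpolates the two losses. It is an illustration with a hypothetical function name, not a copy of ESPnet code.

```python
def hybrid_loss(loss_ctc: float, loss_att: float, weight: float) -> float:
    """Interpolate CTC and attention losses (assumed multi-task formulation).

    weight plays the role of mtlalpha (espnet1) or ctc_weight (espnet2):
    1.0 gives pure CTC, 0.0 gives pure attention, values in between give hybrid.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * loss_ctc + (1.0 - weight) * loss_att


if __name__ == "__main__":
    for w in (1.0, 0.3, 0.0):
        print(w, hybrid_loss(2.0, 4.0, w))
```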
Decoding for each mode can be done using the following decoding configurations:
espnet1
# hybrid CTC/attention (default)
ctc-weight: 0.3
beam-size: 10

# CTC
ctc-weight: 1.0
## for best path decoding
api: v1 # default setting (can be omitted)
## for prefix search decoding w/ beam search
api: v2
beam-size: 10

# attention
ctc-weight: 0.0
beam-size: 10
maxlenratio: 0.8
minlenratio: 0.3
espnet2
# hybrid CTC/attention (default)
ctc_weight: 0.3
beam_size: 10

# CTC
ctc_weight: 1.0
beam_size: 10

# attention
ctc_weight: 0.0
beam_size: 10
maxlenratio: 0.8
minlenratio: 0.3
The CTC mode does not compute the validation accuracy, so the optimum model is selected by its loss value, e.g.,
espnet1
./run.sh --recog_model model.loss.best
espnet2
best_model_criterion:
-   - valid
    - cer_ctc
    - min
The pure attention mode requires setting the maximum and minimum hypothesis lengths (--maxlenratio and --minlenratio) appropriately. In general, if you have more insertion errors, you can decrease the maxlenratio value, while if you have more deletion errors, you can increase the minlenratio value. Note that the optimum values depend on the ratio of the input frame length to the output label length, which changes for each language and each BPE unit.
A negative maxlenratio can be used to set a constant maximum hypothesis length independently of the number of input frames. If maxlenratio is set to -1, decoding always stops after the first output, which can be used to emulate utterance classification tasks. This is suitable for some spoken language understanding and speaker identification tasks.
About the effectiveness of hybrid CTC/attention during training and recognition, see [2] and [3]. For example, hybrid CTC/attention is not sensitive to the above maximum and minimum hypothesis heuristics.
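The length heuristics discussed above can be sketched as a small function that turns the two ratios into hypothesis length bounds. This is an illustration under stated assumptions, not the decoder's actual code: the function name is hypothetical, positive ratios are assumed to scale the number of input frames, a negative maxlenratio is assumed to give a constant limit equal to its magnitude, and a maxlenratio of 0 is assumed to fall back to the number of input frames.

```python
from typing import Tuple


def length_bounds(num_frames: int, maxlenratio: float, minlenratio: float) -> Tuple[int, int]:
    """Return (minlen, maxlen) for a hypothesis, given the input length in frames."""
    if maxlenratio < 0:
        # Constant limit, independent of the input length (e.g. -1 -> one output).
        maxlen = -int(maxlenratio)
    elif maxlenratio == 0:
        # Assumption: no ratio given, so fall back to the input length.
        maxlen = num_frames
    else:
        maxlen = max(1, int(maxlenratio * num_frames))
    minlen = int(minlenratio * num_frames)
    return minlen, maxlen


if __name__ == "__main__":
    print(length_bounds(200, 0.5, 0.25))  # ratios scale with the input length
    print(length_bounds(200, -1, 0.0))    # classification-style single output
```

Lowering maxlenratio tightens the upper bound, which limits insertions, while raising minlenratio lifts the lower bound, which limits deletions.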