core tools¶
ESPnet provides several command-line tools for training and evaluating neural networks (NN) under espnet/bin:
asr_align.py: Align text to audio using CTC segmentation and a pre-trained speech recognition model
asr_enhance.py: Enhance noisy speech for speech recognition
asr_recog.py: Transcribe text from speech using a speech recognition model on one CPU or GPU
asr_train.py: Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs
lm_train.py: Train a new language model on one CPU or one GPU
mt_train.py: Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs
mt_trans.py: Translate text using a neural machine translation (NMT) model on one CPU or GPU
st_train.py: Train a speech translation (ST) model on one CPU, one or multiple GPUs
st_trans.py: Translate text from speech using a speech translation model on one CPU or GPU
tts_decode.py: Synthesize speech from text using a TTS model on one CPU
tts_train.py: Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs
vc_decode.py: Convert speech using a VC model on one CPU
vc_train.py: Train a new voice conversion (VC) model on one CPU, one or multiple GPUs
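All of these are standard argparse scripts, so the complete and up-to-date option list of any tool can be printed with the -h/--help flag that appears in every usage string below, e.g.:

asr_train.py --help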
asr_align.py¶
Align text to audio using CTC segmentation and a pre-trained speech recognition model.
usage: asr_align.py [-h] [--config CONFIG] [--ngpu NGPU]
[--dtype {float16,float32,float64}] [--backend {pytorch}]
[--debugmode DEBUGMODE] [--verbose VERBOSE]
[--preprocess-conf PREPROCESS_CONF]
[--data-json DATA_JSON] [--utt-text UTT_TEXT] --model
MODEL [--model-conf MODEL_CONF] [--num-encs NUM_ENCS]
[--subsampling-factor SUBSAMPLING_FACTOR]
[--frame-duration FRAME_DURATION]
[--min-window-size MIN_WINDOW_SIZE]
[--max-window-size MAX_WINDOW_SIZE]
[--use-dict-blank USE_DICT_BLANK] [--set-blank SET_BLANK]
[--gratis-blank GRATIS_BLANK]
[--replace-spaces-with-blanks REPLACE_SPACES_WITH_BLANKS]
[--scoring-length SCORING_LENGTH] --output OUTPUT
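As a rough sketch, an alignment run might be invoked as follows; the data json, utterance text, model, and output paths are placeholders, not files shipped with ESPnet:

asr_align.py --ngpu 0 \
    --data-json dump/eval/data.json \
    --utt-text data/eval/utt_text \
    --model exp/asr_model/results/model.acc.best \
    --output aligned_segments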
Named Arguments¶
- --config
Decoding config file path.
- --ngpu
Number of GPUs (max. 1 is supported)
Default: 0
- --dtype
Possible choices: float16, float32, float64
Float precision (only available in --api v2)
Default: “float32”
- --backend
Possible choices: pytorch
Backend library
Default: “pytorch”
- --debugmode
Debugmode
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --data-json
Json of recognition data for audio and text
- --utt-text
Text separated into utterances
- --model
Model file parameters to read
- --model-conf
Model config file
- --num-encs
Number of encoders in the model.
Default: 1
- --subsampling-factor
Subsampling factor. If the encoder sub-samples its input, the number of frames at the CTC layer is reduced by this factor. For example, a BLSTMP with subsampling 1_2_2_1_1 has a subsampling factor of 4.
- --frame-duration
Non-overlapping duration of a single frame in milliseconds.
- --min-window-size
Minimum window size considered for utterance.
- --max-window-size
Maximum window size considered for utterance.
- --use-dict-blank
DEPRECATED.
- --set-blank
Index of model dictionary for blank token (default: 0).
- --gratis-blank
Set the transition cost of the blank token to zero. Audio sections labeled with blank tokens can then be skipped without penalty. Useful if there are unrelated audio segments between utterances.
- --replace-spaces-with-blanks
Fill blanks in between words to better model pauses between words. Segments can be misaligned if this option is combined with --gratis-blank. May increase length of ground truth.
- --scoring-length
Changes partitioning length L for calculation of the confidence score.
- --output
Output segments file
asr_enhance.py¶
Enhance noisy speech for speech recognition
usage: asr_enhance.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE]
[--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF]
[--recog-json RECOG_JSON] --model MODEL
[--model-conf MODEL_CONF]
[--enh-wspecifier ENH_WSPECIFIER]
[--enh-filetype {mat,hdf5,sound.hdf5,sound}] [--fs FS]
[--keep-length KEEP_LENGTH] [--image-dir IMAGE_DIR]
[--num-images NUM_IMAGES] [--apply-istft APPLY_ISTFT]
[--istft-win-length ISTFT_WIN_LENGTH]
[--istft-n-shift ISTFT_N_SHIFT]
[--istft-window ISTFT_WINDOW]
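A minimal sketch of an invocation, assuming placeholder paths and the example wspecifier format given for --enh-wspecifier below:

asr_enhance.py --ngpu 0 \
    --recog-json dump/eval/data.json \
    --model exp/enh_asr/results/model.acc.best \
    --enh-wspecifier ark,scp:enhanced,enhanced/wav.scp \
    --fs 16000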
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0: means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --recog-json
Filename of recognition data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --enh-wspecifier
Specify how to write the enhanced speech, e.g. ark,scp:outdir,wav.scp
- --enh-filetype
Possible choices: mat, hdf5, sound.hdf5, sound
Specify the file format for enhanced speech. “mat” is the matrix format in kaldi
Default: “sound”
- --fs
The sample frequency
Default: 16000
- --keep-length
Adjust the output length to match with the input for enhanced speech
Default: True
- --image-dir
The directory in which to save the images.
- --num-images
The number of image files to be saved. If negative, all samples are saved.
Default: 20
- --apply-istft
Apply istft to the output from the network
Default: True
- --istft-win-length
The window length for istft. This option is ignored if stft is found in the preprocess-conf
Default: 512
- --istft-n-shift
The number of shift samples (hop size) for istft. This option is ignored if stft is found in the preprocess-conf
Default: 256
- --istft-window
The window type for istft. This option is ignored if stft is found in the preprocess-conf
Default: “hann”
asr_recog.py¶
Transcribe text from speech using a speech recognition model on one CPU or GPU
usage: asr_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--dtype {float16,float32,float64}]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
[--recog-json RECOG_JSON] --result-label RESULT_LABEL
--model MODEL [--model-conf MODEL_CONF]
[--num-spkrs {1,2}] [--num-encs NUM_ENCS] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--ctc-weight CTC_WEIGHT]
[--weights-ctc-dec WEIGHTS_CTC_DEC]
[--ctc-window-margin CTC_WINDOW_MARGIN]
[--search-type {default,nsc,tsd,alsd,maes}]
[--nstep NSTEP] [--prefix-alpha PREFIX_ALPHA]
[--max-sym-exp MAX_SYM_EXP] [--u-max U_MAX]
[--expansion-gamma EXPANSION_GAMMA]
[--expansion-beta EXPANSION_BETA]
[--score-norm [SCORE_NORM]]
[--softmax-temperature SOFTMAX_TEMPERATURE]
[--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
[--word-rnnlm WORD_RNNLM]
[--word-rnnlm-conf WORD_RNNLM_CONF]
[--word-dict WORD_DICT] [--lm-weight LM_WEIGHT]
[--ngram-model NGRAM_MODEL] [--ngram-weight NGRAM_WEIGHT]
[--ngram-scorer {full,part}]
[--streaming-mode {window,segment}]
[--streaming-window STREAMING_WINDOW]
[--streaming-min-blank-dur STREAMING_MIN_BLANK_DUR]
[--streaming-onset-margin STREAMING_ONSET_MARGIN]
[--streaming-offset-margin STREAMING_OFFSET_MARGIN]
[--maskctc-n-iterations MASKCTC_N_ITERATIONS]
[--maskctc-probability-threshold MASKCTC_PROBABILITY_THRESHOLD]
[--quantize-config [QUANTIZE_CONFIG [QUANTIZE_CONFIG ...]]]
[--quantize-dtype {float16,qint8}]
[--quantize-asr-model QUANTIZE_ASR_MODEL]
[--quantize-lm-model QUANTIZE_LM_MODEL]
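For illustration, a decoding run with an external RNNLM might look like the following sketch (all paths are placeholders; the flags are documented below):

asr_recog.py --ngpu 0 \
    --recog-json dump/test/data.json \
    --result-label exp/asr_model/decode_test/data.json \
    --model exp/asr_model/results/model.acc.best \
    --beam-size 20 --ctc-weight 0.3 \
    --rnnlm exp/lm/rnnlm.model.best --lm-weight 0.3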
Named Arguments¶
- --config
Config file path
- --config2
Second config file path that overwrites the settings in --config
- --config3
Third config file path that overwrites the settings in --config and --config2
- --ngpu
Number of GPUs
Default: 0
- --dtype
Possible choices: float16, float32, float64
Float precision (only available in --api v2)
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0: means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --api
Possible choices: v1, v2
Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.
Default: “v1”
- --recog-json
Filename of recognition data (json)
- --result-label
Filename of result label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --num-spkrs
Possible choices: 1, 2
Number of speakers in the speech
Default: 1
- --num-encs
Number of encoders in the model.
Default: 1
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 1
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths. If maxlenratio<0.0, its absolute value is interpreted as a constant max output length
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --ctc-weight
CTC weight in joint decoding
Default: 0.0
- --weights-ctc-dec
CTC weight assigned to each encoder during decoding (multi-encoder mode only)
- --ctc-window-margin
Use a CTC window with the given margin parameter to accelerate CTC/attention decoding, especially on GPU. A smaller margin makes decoding faster but may increase search errors. If margin=0 (default), this function is disabled
Default: 0
- --search-type
Possible choices: default, nsc, tsd, alsd, maes
Type of beam search implementation to use during inference. Can be either: default beam search (“default”), N-Step Constrained beam search (“nsc”), Time-Synchronous Decoding (“tsd”), Alignment-Length Synchronous Decoding (“alsd”) or modified Adaptive Expansion Search (“maes”).
Default: “default”
- --nstep
Number of expansion steps allowed in NSC beam search or mAES (nstep > 0 for NSC and nstep > 1 for mAES).
Default: 1
- --prefix-alpha
Length prefix difference allowed in NSC beam search or mAES.
Default: 2
- --max-sym-exp
Number of symbol expansions allowed in TSD.
Default: 2
- --u-max
Length prefix difference allowed in ALSD.
Default: 400
- --expansion-gamma
Allowed logp difference for prune-by-value method in mAES.
Default: 2.3
- --expansion-beta
Number of additional candidates for expanded hypotheses selection in mAES.
Default: 2
- --score-norm
Normalize final hypotheses’ score by length
Default: True
- --softmax-temperature
Penalization term for softmax function.
Default: 1.0
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --word-rnnlm
Word RNNLM model file to read
- --word-rnnlm-conf
Word RNNLM model config file to read
- --word-dict
Word list to read
- --lm-weight
RNNLM weight
Default: 0.1
- --ngram-model
ngram model file to read
- --ngram-weight
ngram weight
Default: 0.1
- --ngram-scorer
Possible choices: full, part
If the ngram is set as a partial scorer (similar to the CTC scorer), it only scores the top-K hypotheses; if it is set as a full scorer, it scores all hypotheses. Decoding with the partial scorer is much faster than with the full one.
Default: “part”
- --streaming-mode
Possible choices: window, segment
Use streaming recognizer for inference. --batchsize must be set to 0 to enable this mode
- --streaming-window
Window size
Default: 10
- --streaming-min-blank-dur
Minimum blank duration threshold
Default: 10
- --streaming-onset-margin
Onset margin
Default: 1
- --streaming-offset-margin
Offset margin
Default: 1
- --maskctc-n-iterations
Number of decoding iterations. For Mask CTC, set 0 to predict 1 mask/iter.
Default: 10
- --maskctc-probability-threshold
Threshold probability for CTC output
Default: 0.999
- --quantize-config
Config for dynamic quantization provided as a list of modules, separated by a comma, e.g.: --quantize-config=[Linear,LSTM,GRU]. Each specified module should be an attribute of torch.nn, e.g.: torch.nn.Linear, torch.nn.LSTM, torch.nn.GRU, …
- --quantize-dtype
Possible choices: float16, qint8
Dtype for dynamic quantization.
Default: “qint8”
- --quantize-asr-model
Apply dynamic quantization to ASR model.
Default: False
- --quantize-lm-model
Apply dynamic quantization to LM.
Default: False
asr_train.py¶
Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs
usage: asr_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU] [--use-ddp]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--debugdir DEBUGDIR] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--save-interval-iters SAVE_INTERVAL_ITERS]
[--train-json TRAIN_JSON] [--valid-json VALID_JSON]
[--model-module MODEL_MODULE] [--num-encs NUM_ENCS]
[--ctc_type {builtin,gtnctc,cudnnctc}]
[--mtlalpha MTLALPHA] [--lsm-weight LSM_WEIGHT]
[--report-cer] [--report-wer] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
[--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
[--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
[--sortagrad [SORTAGRAD]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
[--preprocess-conf [PREPROCESS_CONF]]
[--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
[--eps EPS] [--eps-decay EPS_DECAY]
[--weight-decay WEIGHT_DECAY]
[--criterion {loss,loss_eps_decay_only,acc}]
[--threshold THRESHOLD] [--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
[--num-spkrs {1,2}]
[--context-residual [CONTEXT_RESIDUAL]]
[--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
[--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
[--freeze-mods FREEZE_MODS] [--use-frontend USE_FRONTEND]
[--use-wpe USE_WPE]
[--wtype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
[--wlayers WLAYERS] [--wunits WUNITS] [--wprojs WPROJS]
[--wdropout-rate WDROPOUT_RATE] [--wpe-taps WPE_TAPS]
[--wpe-delay WPE_DELAY]
[--use-dnn-mask-for-wpe USE_DNN_MASK_FOR_WPE]
[--use-beamformer USE_BEAMFORMER]
[--btype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
[--blayers BLAYERS] [--bunits BUNITS] [--bprojs BPROJS]
[--badim BADIM] [--bnmask BNMASK]
[--ref-channel REF_CHANNEL]
[--bdropout-rate BDROPOUT_RATE] [--stats-file STATS_FILE]
[--apply-uttmvn APPLY_UTTMVN]
[--uttmvn-norm-means UTTMVN_NORM_MEANS]
[--uttmvn-norm-vars UTTMVN_NORM_VARS]
[--fbank-fs FBANK_FS] [--n-mels N_MELS]
[--fbank-fmin FBANK_FMIN] [--fbank-fmax FBANK_FMAX]
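A minimal training sketch with placeholder paths taken from a typical recipe layout (--outdir and --dict are the required arguments):

asr_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/train.yaml \
    --outdir exp/asr_model/results \
    --dict data/lang_char/train_units.txt \
    --train-json dump/train/data.json \
    --valid-json dump/dev/data.json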
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --use-ddp
Enable process-based data parallel training (DDP). The GPUs specified by --ngpu will be used. If --ngpu is not given, the number of usable GPUs is detected automatically; if detection fails, the application aborts. Currently, only single-node multi-GPU jobs are supported.
Default: False
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary
- --seed
Random seed
Default: 1
- --debugdir
Output directory for debugging
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --save-interval-iters
Save snapshot interval iterations
Default: 0
- --train-json
Filename of train label data (json)
- --valid-json
Filename of validation label data (json)
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.e2e_asr:E2E)
- --num-encs
Number of encoders in the model.
Default: 1
- --ctc_type
Possible choices: builtin, gtnctc, cudnnctc
Type of CTC implementation to calculate loss.
Default: “builtin”
- --mtlalpha
Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss
Default: 0.5
- --lsm-weight
Label smoothing weight
Default: 0.0
- --report-cer
Compute CER on development set
Default: False
- --report-wer
Compute WER on development set
Default: False
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 4
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --ctc-weight
CTC weight in joint decoding
Default: 0.3
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight.
Default: 0.1
- --sym-space
Space symbol
Default: “<space>”
- --sym-blank
Blank symbol
Default: “<blank>”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-count
Possible choices: auto, seq, bin, frame
How to count batch_size. The default (auto) chooses the counting method based on the other batch arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 800
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 150
- --n-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --opt
Possible choices: adadelta, adam, noam
Optimizer
Default: “adadelta”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --eps
Epsilon constant for optimizer
Default: 1e-08
- --eps-decay
Decaying ratio of epsilon
Default: 0.01
- --weight-decay
Weight decay ratio
Default: 0.0
- --criterion
Possible choices: loss, loss_eps_decay_only, acc
Criterion to perform epsilon decay
Default: “acc”
- --threshold
Threshold to stop iteration
Default: 0.0001
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/acc”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 5
- --num-save-attention
Number of samples of attention to be saved
Default: 3
- --num-save-ctc
Number of samples of CTC probability to be saved
Default: 3
- --grad-noise
Flag to enable noise injection into gradients during training
Default: False
- --num-spkrs
Possible choices: 1, 2
Number of speakers in the speech.
Default: 1
- --context-residual
Flag to enable context-vector residual in the decoder network
Default: False
- --enc-init
Pre-trained ASR model to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.enc.
- --dec-init
Pre-trained ASR, MT or LM model to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: att.,dec.
- --freeze-mods
List of modules to freeze, separated by a comma.
- --use-frontend
Flag to enable the frontend system.
Default: False
- --use-wpe
Apply Weighted Prediction Error
Default: False
- --wtype
Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru
Type of encoder network architecture of the mask estimator for WPE.
Default: “blstmp”
- --wlayers
Default: 2
- --wunits
Default: 300
- --wprojs
Default: 300
- --wdropout-rate
Default: 0.0
- --wpe-taps
Default: 5
- --wpe-delay
Default: 3
- --use-dnn-mask-for-wpe
Use DNN to estimate the power spectrogram. This option is experimental.
Default: False
- --use-beamformer
Default: True
- --btype
Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru
Type of encoder network architecture of the mask estimator for Beamformer.
Default: “blstmp”
- --blayers
Default: 2
- --bunits
Default: 300
- --bprojs
Default: 300
- --badim
Default: 320
- --bnmask
Number of beamforming masks, default is 2 for [speech, noise].
Default: 2
- --ref-channel
The reference channel used for beamformer. By default, the channel is estimated by DNN.
Default: -1
- --bdropout-rate
Default: 0.0
- --stats-file
The stats file for the feature normalization
- --apply-uttmvn
Apply utterance level mean variance normalization.
Default: True
- --uttmvn-norm-means
Default: True
- --uttmvn-norm-vars
Default: False
- --fbank-fs
The sample frequency used for the mel-fbank creation.
Default: 16000
- --n-mels
The number of mel-frequency bins.
Default: 80
- --fbank-fmin
Default: 0.0
- --fbank-fmax
lm_train.py¶
Train a new language model on one CPU or one GPU
usage: lm_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--resume [RESUME]] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
--train-label TRAIN_LABEL --valid-label VALID_LABEL
[--test-label TEST_LABEL] [--dump-hdf5-path DUMP_HDF5_PATH]
[--opt OPT] [--sortagrad [SORTAGRAD]]
[--batchsize BATCHSIZE] [--accum-grad ACCUM_GRAD]
[--epoch EPOCH]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--schedulers SCHEDULERS]
[--gradclip GRADCLIP] [--maxlen MAXLEN]
[--model-module MODEL_MODULE]
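A minimal sketch with placeholder paths (--outdir, --dict, --train-label, and --valid-label are the required arguments):

lm_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/lm.yaml \
    --outdir exp/lm/results \
    --dict data/lang_char/train_units.txt \
    --train-label data/lm_train.txt \
    --valid-label data/lm_valid.txt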
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary
- --seed
Random seed
Default: 1
- --resume, -r
Resume the training from snapshot
Default: “”
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --train-label
Filename of train label data
- --valid-label
Filename of validation label data
- --test-label
Filename of test label data
- --dump-hdf5-path
Path to dump a preprocessed dataset as hdf5
- --opt
Optimizer
Default: “sgd”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batchsize, -b
Number of examples in each mini-batch
Default: 300
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --epoch, -e
Number of sweeps over the dataset to train
Default: 20
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/loss”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --schedulers
Optimizer schedulers. Parameters can be configured as <optimizer-param>-<scheduler-name>-<scheduler-param>, e.g., “--schedulers lr=noam --lr-noam-warmup 1000”.
- --gradclip, -c
Gradient norm threshold to clip
Default: 5
- --maxlen
Batch size is reduced if the input sequence > ML
Default: 40
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)
Default: “default”
mt_train.py¶
Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs
usage: mt_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--debugdir DEBUGDIR] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--save-interval-iters SAVE_INTERVAL_ITERS]
[--train-json TRAIN_JSON] [--valid-json VALID_JSON]
[--model-module MODEL_MODULE] [--lsm-weight LSM_WEIGHT]
[--report-bleu] [--nbest NBEST] [--beam-size BEAM_SIZE]
[--penalty PENALTY] [--maxlenratio MAXLENRATIO]
[--minlenratio MINLENRATIO] [--rnnlm RNNLM]
[--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
[--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
[--sortagrad [SORTAGRAD]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
[--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
[--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
[--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
[--criterion {loss,acc}] [--threshold THRESHOLD]
[--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--context-residual [CONTEXT_RESIDUAL]]
[--tie-src-tgt-embedding [TIE_SRC_TGT_EMBEDDING]]
[--tie-classifier [TIE_CLASSIFIER]] [--enc-init [ENC_INIT]]
[--enc-init-mods ENC_INIT_MODS] [--dec-init [DEC_INIT]]
[--dec-init-mods DEC_INIT_MODS]
[--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]
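A minimal sketch with placeholder paths, following the same pattern as asr_train.py:

mt_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/train_mt.yaml \
    --outdir exp/mt_model/results \
    --dict data/lang_1spm/train_units.txt \
    --train-json dump/train/data.json \
    --valid-json dump/dev/data.json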
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary for source/target languages
- --seed
Random seed
Default: 1
- --debugdir
Output directory for debugging
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --save-interval-iters
Save snapshot interval iterations
Default: 0
- --train-json
Filename of train label data (json)
- --valid-json
Filename of validation label data (json)
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.e2e_mt:E2E)
- --lsm-weight
Label smoothing weight
Default: 0.0
- --report-bleu
Compute BLEU on development set
Default: True
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 4
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight.
Default: 0.0
- --sym-space
Space symbol
Default: “<space>”
- --sym-blank
Blank symbol
Default: “<blank>”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-count
Possible choices: auto, seq, bin, frame
How to count batch_size. The default (auto) chooses the counting method based on the other batch arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 100
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 100
- --n-iter-processes
Number of processes of iterator
Default: 0
- --opt
Possible choices: adadelta, adam, noam
Optimizer
Default: “adadelta”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --eps
Epsilon constant for optimizer
Default: 1e-08
- --eps-decay
Decaying ratio of epsilon
Default: 0.01
- --lr
Learning rate for optimizer
Default: 0.001
- --lr-decay
Decaying ratio of learning rate
Default: 1.0
- --weight-decay
Weight decay ratio
Default: 0.0
- --criterion
Possible choices: loss, acc
Criterion to perform epsilon decay
Default: “acc”
- --threshold
Threshold to stop iteration
Default: 0.0001
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/acc”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 5
- --num-save-attention
Number of samples of attention to be saved
Default: 3
- --context-residual
Flag to enable context-vector residual in the decoder network
Default: False
- --tie-src-tgt-embedding
Tie parameters of source embedding and target embedding.
Default: False
- --tie-classifier
Tie parameters of target embedding and output projection layer.
Default: False
- --enc-init
Pre-trained ASR model to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.enc.
- --dec-init
Pre-trained ASR, MT or LM model to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: att., dec.
- --multilingual
Prepend target language ID to the source sentence. Both source/target language IDs must be prepended in the pre-processing stage.
Default: False
- --replace-sos
Replace <sos> in the decoder with a target language ID (the first token in the target sequence)
Default: False
mt_trans.py¶
Translate text using a neural machine translation (NMT) model on one CPU or GPU
usage: mt_trans.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--dtype {float16,float32,float64}]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
[--trans-json TRANS_JSON] --result-label RESULT_LABEL
--model MODEL [--model-conf MODEL_CONF] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--tgt-lang TGT_LANG]
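A sketch of a translation run with placeholder paths:

mt_trans.py --ngpu 0 \
    --trans-json dump/test/data.json \
    --result-label exp/mt_model/trans_test/data.json \
    --model exp/mt_model/results/model.acc.best \
    --beam-size 5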
Named Arguments¶
- --config
Config file path
- --config2
Second config file path that overwrites the settings in --config
- --config3
Third config file path that overwrites the settings in --config and --config2
- --ngpu
Number of GPUs
Default: 0
- --dtype
Possible choices: float16, float32, float64
Float precision (only available in --api v2)
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0: means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --api
Possible choices: v1, v2
Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.
Default: “v1”
- --trans-json
Filename of translation data (json)
- --result-label
Filename of result label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 1
- --penalty
Insertion penalty
Default: 0.1
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths
Default: 3.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --tgt-lang
Target language ID (e.g., <en>, <de>, <fr>, etc.)
Default: False
st_train.py¶
Train a speech translation (ST) model on one CPU, one or multiple GPUs
usage: st_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--debugdir DEBUGDIR] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--save-interval-iters SAVE_INTERVAL_ITERS]
[--train-json TRAIN_JSON] [--valid-json VALID_JSON]
[--model-module MODEL_MODULE]
[--ctc_type {builtin,gtnctc,cudnnctc}]
[--mtlalpha MTLALPHA] [--asr-weight ASR_WEIGHT]
[--mt-weight MT_WEIGHT] [--lsm-weight LSM_WEIGHT]
[--report-cer] [--report-wer] [--report-bleu]
[--nbest NBEST] [--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
[--lm-weight LM_WEIGHT] [--sym-space SYM_SPACE]
[--sym-blank SYM_BLANK] [--sortagrad [SORTAGRAD]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
[--preprocess-conf [PREPROCESS_CONF]]
[--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
[--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
[--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
[--criterion {loss,acc}] [--threshold THRESHOLD]
[--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
[--context-residual [CONTEXT_RESIDUAL]]
[--enc-init [ENC_INIT]] [--enc-init-mods ENC_INIT_MODS]
[--dec-init [DEC_INIT]] [--dec-init-mods DEC_INIT_MODS]
[--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]
[--stats-file STATS_FILE] [--apply-uttmvn APPLY_UTTMVN]
[--uttmvn-norm-means UTTMVN_NORM_MEANS]
[--uttmvn-norm-vars UTTMVN_NORM_VARS] [--fbank-fs FBANK_FS]
[--n-mels N_MELS] [--fbank-fmin FBANK_FMIN]
[--fbank-fmax FBANK_FMAX]
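A minimal sketch with placeholder paths:

st_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/train_st.yaml \
    --outdir exp/st_model/results \
    --dict data/lang_1spm/train_units.txt \
    --train-json dump/train_sp/data.json \
    --valid-json dump/dev/data.json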
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary
- --seed
Random seed
Default: 1
- --debugdir
Output directory for debugging
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --save-interval-iters
Save snapshot interval iterations
Default: 0
- --train-json
Filename of train label data (json)
- --valid-json
Filename of validation label data (json)
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.e2e_st:E2E)
- --ctc_type
Possible choices: builtin, gtnctc, cudnnctc
Type of CTC implementation to calculate loss.
Default: “builtin”
- --mtlalpha
Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss
Default: 0.0
- --asr-weight
Multitask learning coefficient for ASR task, weight: asr_weight*(alpha*ctc_loss + (1-alpha)*att_loss) + (1-asr_weight-mt_weight)*st_loss
Default: 0.0
- --mt-weight
Multitask learning coefficient for MT task, weight: mt_weight*mt_loss + (1-mt_weight-asr_weight)*st_loss
Default: 0.0
- --lsm-weight
Label smoothing weight
Default: 0.0
- --report-cer
Compute CER on development set
Default: False
- --report-wer
Compute WER on development set
Default: False
- --report-bleu
Compute BLEU on development set
Default: True
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 4
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight.
Default: 0.0
- --sym-space
Space symbol
Default: “<space>”
- --sym-blank
Blank symbol
Default: “<blank>”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-count
Possible choices: auto, seq, bin, frame
How to count batch_size. The default (auto) chooses the counting method based on the other batch arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 800
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 150
- --n-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --opt
Possible choices: adadelta, adam, noam
Optimizer
Default: “adadelta”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --eps
Epsilon constant for optimizer
Default: 1e-08
- --eps-decay
Decaying ratio of epsilon
Default: 0.01
- --lr
Learning rate for optimizer
Default: 0.001
- --lr-decay
Decaying ratio of learning rate
Default: 1.0
- --weight-decay
Weight decay ratio
Default: 0.0
- --criterion
Possible choices: loss, acc
Criterion to perform epsilon decay
Default: “acc”
- --threshold
Threshold to stop iteration
Default: 0.0001
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/acc”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 5
- --num-save-attention
Number of samples of attention to be saved
Default: 3
- --num-save-ctc
Number of samples of CTC probability to be saved
Default: 3
- --grad-noise
Flag to enable noise injection into gradients during training
Default: False
- --context-residual
Flag to enable context-vector residual in the decoder network
Default: False
- --enc-init
Pre-trained ASR model to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.enc.
- --dec-init
Pre-trained ASR, MT or LM model to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: att., dec.
- --multilingual
Prepend target language ID to the source sentence. Both source/target language IDs must be prepended in the pre-processing stage.
Default: False
- --replace-sos
Replace <sos> in the decoder with a target language ID (the first token in the target sequence)
Default: False
- --stats-file
The stats file for the feature normalization
- --apply-uttmvn
Apply utterance level mean variance normalization.
Default: True
- --uttmvn-norm-means
Default: True
- --uttmvn-norm-vars
Default: False
- --fbank-fs
The sample frequency used for the mel-fbank creation.
Default: 16000
- --n-mels
The number of mel-frequency bins.
Default: 80
- --fbank-fmin
Default: 0.0
- --fbank-fmax
st_trans.py¶
Translate text from speech using a speech translation model on one CPU or GPU
usage: st_trans.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--dtype {float16,float32,float64}]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
[--trans-json TRANS_JSON] --result-label RESULT_LABEL
--model MODEL [--nbest NBEST] [--beam-size BEAM_SIZE]
[--penalty PENALTY] [--maxlenratio MAXLENRATIO]
[--minlenratio MINLENRATIO] [--tgt-lang TGT_LANG]
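A sketch of a translation run with placeholder paths:

st_trans.py --ngpu 0 \
    --trans-json dump/test/data.json \
    --result-label exp/st_model/trans_test/data.json \
    --model exp/st_model/results/model.acc.best \
    --beam-size 10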
Named Arguments¶
- --config
Config file path
- --config2
Second config file path that overwrites the settings in --config
- --config3
Third config file path that overwrites the settings in --config and --config2
- --ngpu
Number of GPUs
Default: 0
- --dtype
Possible choices: float16, float32, float64
Float precision (only available in --api v2)
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0: means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --api
Possible choices: v1, v2
Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.
Default: “v1”
- --trans-json
Filename of translation data (json)
- --result-label
Filename of result label data (json)
- --model
Model file parameters to read
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 1
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --tgt-lang
Target language ID (e.g., <en>, <de>, <fr>, etc.)
Default: False
tts_decode.py¶
Synthesize speech from text using a TTS model on one CPU
usage: tts_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] --out OUT [--verbose VERBOSE]
[--preprocess-conf PREPROCESS_CONF] --json JSON --model
MODEL [--model-conf MODEL_CONF]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--threshold THRESHOLD]
[--use-att-constraint USE_ATT_CONSTRAINT]
[--backward-window BACKWARD_WINDOW]
[--forward-window FORWARD_WINDOW]
[--fastspeech-alpha FASTSPEECH_ALPHA]
[--save-durations SAVE_DURATIONS]
[--save-focus-rates SAVE_FOCUS_RATES]
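A minimal synthesis sketch with placeholder paths (--out, --json, and --model are the required arguments):

tts_decode.py --ngpu 0 \
    --json dump/eval/data.json \
    --model exp/tts_model/results/model.loss.best \
    --out exp/tts_model/outputs/feats \
    --threshold 0.5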
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --out
Output filename
- --verbose, -V
Verbose option
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --json
Filename of train label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --maxlenratio
Maximum length ratio in decoding
Default: 5
- --minlenratio
Minimum length ratio in decoding
Default: 0
- --threshold
Threshold value in decoding
Default: 0.5
- --use-att-constraint
Whether to use the attention constraint
Default: False
- --backward-window
Backward window size in the attention constraint
Default: 1
- --forward-window
Forward window size in the attention constraint
Default: 3
- --fastspeech-alpha
Alpha to change the speed for FastSpeech
Default: 1.0
- --save-durations
Whether to save durations converted from attentions
Default: False
- --save-focus-rates
Whether to save focus rates of attentions
Default: False
tts_train.py¶
Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs
usage: tts_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
[--save-interval-epochs SAVE_INTERVAL_EPOCHS]
[--report-interval-iters REPORT_INTERVAL_ITERS]
--train-json TRAIN_JSON --valid-json VALID_JSON
[--model-module MODEL_MODULE] [--sortagrad [SORTAGRAD]]
[--batch-sort-key [{shuffle,output,input}]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML]
[--num-iter-processes NUM_ITER_PROCESSES]
[--preprocess-conf PREPROCESS_CONF]
[--use-speaker-embedding USE_SPEAKER_EMBEDDING]
[--use-second-target USE_SECOND_TARGET]
[--opt {adam,noam}] [--accum-grad ACCUM_GRAD] [--lr LR]
[--eps EPS] [--weight-decay WEIGHT_DECAY]
[--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
[--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
[--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
[--freeze-mods FREEZE_MODS]
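A minimal training sketch with placeholder paths (--outdir, --train-json, and --valid-json are the required arguments):

tts_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/train_tacotron2.yaml \
    --outdir exp/tts_model/results \
    --train-json dump/train/data.json \
    --valid-json dump/dev/data.json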
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log directory path
- --eval-interval-epochs
Evaluation interval epochs
Default: 1
- --save-interval-epochs
Save interval epochs
Default: 1
- --report-interval-iters
Report interval iterations
Default: 100
- --train-json
Filename of training json
- --valid-json
Filename of validation json
- --model-module
Module that defines the model
Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-sort-key
Possible choices: shuffle, output, input
Batch sorting key. “shuffle” only works with --batch-count “seq”.
Default: “shuffle”
- --batch-count
Possible choices: auto, seq, bin, frame
How to count batch_size. The default (auto) chooses the counting method based on the other batch arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 100
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 200
- --num-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --use-speaker-embedding
Whether to use speaker embedding
Default: False
- --use-second-target
Whether to use second target
Default: False
- --opt
Possible choices: adam, noam
Optimizer
Default: “adam”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --lr
Learning rate for optimizer
Default: 0.001
- --eps
Epsilon for optimizer
Default: 1e-06
- --weight-decay
Weight decay coefficient for optimizer
Default: 1e-06
- --epochs, -e
Number of maximum epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/loss”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 1
- --num-save-attention
Number of samples of attention to be saved
Default: 5
- --keep-all-data-on-mem
Whether to keep all data on memory
Default: False
- --enc-init
Pre-trained TTS model path to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.
- --dec-init
Pre-trained TTS model path to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: dec.
- --freeze-mods
List of modules to freeze (not to train), separated by a comma.
vc_decode.py¶
Convert speech using a VC model on one CPU
usage: vc_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] --out OUT [--verbose VERBOSE]
[--preprocess-conf PREPROCESS_CONF] --json JSON --model
MODEL [--model-conf MODEL_CONF]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--threshold THRESHOLD]
[--use-att-constraint USE_ATT_CONSTRAINT]
[--backward-window BACKWARD_WINDOW]
[--forward-window FORWARD_WINDOW]
[--save-durations SAVE_DURATIONS]
[--save-focus-rates SAVE_FOCUS_RATES]
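A minimal conversion sketch with placeholder paths, mirroring tts_decode.py:

vc_decode.py --ngpu 0 \
    --json dump/eval/data.json \
    --model exp/vc_model/results/model.loss.best \
    --out exp/vc_model/outputs/feats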
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --out
Output filename
- --verbose, -V
Verbose option
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --json
Filename of train label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --maxlenratio
Maximum length ratio in decoding
Default: 5
- --minlenratio
Minimum length ratio in decoding
Default: 0
- --threshold
Threshold value in decoding
Default: 0.5
- --use-att-constraint
Whether to use the attention constraint
Default: False
- --backward-window
Backward window size in the attention constraint
Default: 1
- --forward-window
Forward window size in the attention constraint
Default: 3
- --save-durations
Whether to save durations converted from attentions
Default: False
- --save-focus-rates
Whether to save focus rates of attentions
Default: False
vc_train.py¶
Train a new voice conversion (VC) model on one CPU, one or multiple GPUs
usage: vc_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
[--save-interval-epochs SAVE_INTERVAL_EPOCHS]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--srcspk SRCSPK] [--trgspk TRGSPK] --train-json TRAIN_JSON
--valid-json VALID_JSON [--model-module MODEL_MODULE]
[--sortagrad [SORTAGRAD]]
[--batch-sort-key [{shuffle,output,input}]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--num-iter-processes NUM_ITER_PROCESSES]
[--preprocess-conf PREPROCESS_CONF]
[--use-speaker-embedding USE_SPEAKER_EMBEDDING]
[--use-second-target USE_SECOND_TARGET]
[--opt {adam,noam,lamb}] [--accum-grad ACCUM_GRAD]
[--lr LR] [--eps EPS] [--weight-decay WEIGHT_DECAY]
[--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
[--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
[--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
[--freeze-mods FREEZE_MODS]
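A minimal training sketch; the paths and the source/target speaker IDs are placeholders:

vc_train.py --ngpu 1 \
    --backend pytorch \
    --config conf/train_vc.yaml \
    --outdir exp/vc_model/results \
    --srcspk source_spk --trgspk target_spk \
    --train-json dump/train/data.json \
    --valid-json dump/dev/data.json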
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log directory path
- --eval-interval-epochs
Evaluation interval epochs
Default: 100
- --save-interval-epochs
Save interval epochs
Default: 1
- --report-interval-iters
Report interval iterations
Default: 10
- --srcspk
Source speaker
- --trgspk
Target speaker
- --train-json
Filename of training json
- --valid-json
Filename of validation json
- --model-module
Module that defines the model
Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-sort-key
Possible choices: shuffle, output, input
Batch sorting key. “shuffle” only works with --batch-count “seq”.
Default: “shuffle”
- --batch-count
Possible choices: auto, seq, bin, frame
How to count batch_size. The default (auto) chooses the counting method based on the other batch arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 100
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 200
- --num-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --use-speaker-embedding
Whether to use speaker embedding
Default: False
- --use-second-target
Whether to use second target
Default: False
- --opt
Possible choices: adam, noam, lamb
Optimizer
Default: “adam”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --lr
Learning rate for optimizer
Default: 0.001
- --eps
Epsilon for optimizer
Default: 1e-06
- --weight-decay
Weight decay coefficient for optimizer
Default: 1e-06
- --epochs, -e
Number of maximum epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/loss”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 1
- --num-save-attention
Number of samples of attention to be saved
Default: 5
- --keep-all-data-on-mem
Whether to keep all data on memory
Default: False
- --enc-init
Pre-trained model path to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.
- --dec-init
Pre-trained model path to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: dec.
- --freeze-mods
List of modules to freeze (not to train), separated by a comma.