core tools (espnet2)
ESPnet2 provides several command-line tools for training and evaluating neural networks (NN) under espnet2/bin:
aggregate_stats_dirs.py
usage: aggregate_stats_dirs.py [-h]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--skip_sum_stats] [--input_dir INPUT_DIR]
--output_dir OUTPUT_DIR
Aggregate statistics directories into one directory
optional arguments:
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbosity level of logging (default: INFO)
--skip_sum_stats Skip computing the sum of statistics. (default: False)
--input_dir INPUT_DIR
Input directories (default: None)
--output_dir OUTPUT_DIR
Output directory (default: None)
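The sketch below illustrates one way that "aggregating statistics" can work. It sums the accumulators from several stats directories, then derives a global mean and standard deviation of the kind used for feature normalization. The accumulator names `count`, `sum`, and `sum_square` and the in-memory dict layout are assumptions made for illustration. They do not describe the on-disk format that aggregate_stats_dirs.py actually reads.

```python
import math
from typing import Any, Dict, List, Tuple

# Hypothetical layout (an assumption): each stats directory is summarized as
# {"count": int, "sum": [per-dim sums], "sum_square": [per-dim sums of squares]}.
Stats = Dict[str, Any]


def aggregate_stats(stats_list: List[Stats]) -> Stats:
    """Sum the accumulators of several stats directories into one."""
    dim = len(stats_list[0]["sum"])
    total: Stats = {"count": 0, "sum": [0.0] * dim, "sum_square": [0.0] * dim}
    for stats in stats_list:
        total["count"] += stats["count"]
        for i in range(dim):
            total["sum"][i] += stats["sum"][i]
            total["sum_square"][i] += stats["sum_square"][i]
    return total


def mean_and_std(stats: Stats) -> Tuple[List[float], List[float]]:
    """Derive the per-dimension mean and standard deviation from the sums."""
    n = stats["count"]
    mean = [s / n for s in stats["sum"]]
    # E[x^2] - E[x]^2, clipped at zero to guard against rounding error.
    var = [sq / n - m * m for sq, m in zip(stats["sum_square"], mean)]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean, std
```

Because sums and counts are additive, the directories can be processed independently (for example, one per parallel job) and merged afterwards without revisiting the features.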
asr_align.py
usage: asr_align.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--ngpu NGPU] [--dtype {float16,float32,float64}]
--asr_train_config ASR_TRAIN_CONFIG --asr_model_file
ASR_MODEL_FILE [--token_type {char,bpe,None}]
[--bpemodel BPEMODEL] [--fs FS]
[--min_window_size MIN_WINDOW_SIZE]
[--max_window_size MAX_WINDOW_SIZE]
[--set_blank SET_BLANK] [--gratis_blank GRATIS_BLANK]
[--replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS]
[--scoring_length SCORING_LENGTH]
[--time_stamps {auto,fixed}]
[--text_converter {tokenize,classic}]
[--kaldi_style_text KALDI_STYLE_TEXT]
[--print_utt_text PRINT_UTT_TEXT]
[--print_utt_score PRINT_UTT_SCORE] -a AUDIO -t TEXT
[-o OUTPUT]
ASR Decoding
optional arguments:
--config CONFIG Give a config file in YAML format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbosity level of logging (default: INFO)
--ngpu NGPU The number of GPUs. 0 indicates CPU mode (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
Model configuration related:
--asr_train_config ASR_TRAIN_CONFIG
--asr_model_file ASR_MODEL_FILE
Text converter related:
--token_type {char,bpe,None}
The token type for the ASR model. If not given, it is
taken from the training args (default: None)
--bpemodel BPEMODEL The model path of sentencepiece. If not given, it is
taken from the training args (default: None)
CTC segmentation related:
--fs FS Sampling Frequency. The sampling frequency (in Hz) is
needed to correctly determine the starting and ending
time of aligned segments. (default: 16000)
--min_window_size MIN_WINDOW_SIZE
Minimum window size considered for an utterance.
(default: None)
--max_window_size MAX_WINDOW_SIZE
Maximum window size considered for an utterance.
(default: None)
--set_blank SET_BLANK
Index of the blank token in the model dictionary.
(default: None)
--gratis_blank GRATIS_BLANK
Set the transition cost of the blank token to zero.
Audio sections labeled with blank tokens can then be
skipped without penalty. Useful if there are unrelated
audio segments between utterances. (default: False)
--replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS
Fill blanks in between words to better model pauses
between words. This option is only active for
`--text_converter classic`. Segments can be misaligned
if this option is combined with --gratis_blank.
(default: False)
--scoring_length SCORING_LENGTH
Change the partitioning length L used to calculate the
confidence score. (default: None)
--time_stamps {auto,fixed}
Select how the CTC index duration is estimated and,
thus, how the time stamps are calculated. (default:
auto)
--text_converter {tokenize,classic}
How CTC segmentation handles text. (default: tokenize)
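CTC segmentation works on CTC output frame indices, so `--fs` is needed to turn those indices into seconds. The following is a minimal sketch of an "auto"-style estimate. It assumes that one CTC index covers the total audio duration divided by the number of CTC output frames. The helper names are hypothetical, and this is not the code used by asr_align.py or the ctc-segmentation library.

```python
def index_duration(num_samples: int, num_ctc_frames: int, fs: int = 16000) -> float:
    """Seconds covered by one CTC output index (assumed uniform)."""
    audio_seconds = num_samples / fs
    return audio_seconds / num_ctc_frames


def index_to_seconds(index: int, duration: float) -> float:
    """Convert a CTC frame index into a time stamp in seconds."""
    return index * duration
```

For example, 10 s of 16 kHz audio (160000 samples) encoded into 250 CTC frames gives about 0.04 s per index, so index 25 maps to about 1.0 s. If `--fs` does not match the real sampling rate, every time stamp is scaled by the same wrong factor.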
Input/output arguments:
--kaldi_style_text KALDI_STYLE_TEXT
Assume that the input text file is kaldi-style
formatted, i.e., the utterance name is at the
beginning of each line. (default: True)
--print_utt_text PRINT_UTT_TEXT
Include the utterance text in the segments output.
(default: True)
--print_utt_score PRINT_UTT_SCORE
Include the confidence score in the segments output.
(default: True)
-a AUDIO, --audio AUDIO
Input audio file. (default: None)
-t TEXT, --text TEXT Input text file. Each line contains the ground truth
of a single utterance. Kaldi-style text files include
the name of the utterance as the first word in the
line. (default: None)
-o OUTPUT, --output OUTPUT
Output in the form of a `segments` file. If not given,
output is written to stdout. (default: -)
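When `--kaldi_style_text` is enabled, the first word of each line in the `-t` text file is the utterance name and the rest of the line is the ground truth. The sketch below shows that split. `parse_text` is a hypothetical helper, not part of asr_align.py. Returning `None` as the name for plain-text files is also a choice made only for this illustration.

```python
from typing import Iterable, List, Optional, Tuple


def parse_text(lines: Iterable[str], kaldi_style: bool = True) -> List[Tuple[Optional[str], str]]:
    """Split each non-empty line into (utterance name, transcript)."""
    utterances: List[Tuple[Optional[str], str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if kaldi_style:
            # First whitespace-separated word is the utterance name.
            parts = line.split(maxsplit=1)
            text = parts[1] if len(parts) > 1 else ""
            utterances.append((parts[0], text))
        else:
            # Plain ground-truth text: no name column (name left as None).
            utterances.append((None, line))
    return utterances
```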
asr_inference_streaming.py
usage: asr_inference_streaming.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU]
[--seed SEED]
[--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE
[--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--sim_chunk_length SIM_CHUNK_LENGTH]
--asr_train_config ASR_TRAIN_CONFIG
--asr_model_file ASR_MODEL_FILE
[--lm_train_config LM_TRAIN_CONFIG]
[--lm_file LM_FILE]
[--word_lm_train_config WORD_LM_TRAIN_CONFIG]
[--word_lm_file WORD_LM_FILE]
[--batch_size BATCH_SIZE] [--nbest NBEST]
[--beam_size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO]
[--minlenratio MINLENRATIO]
[--ctc_weight CTC_WEIGHT]
[--lm_weight LM_WEIGHT]
[--disable_repetition_detection DISABLE_REPETITION_DETECTION]
[--encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT]
[--decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT]
[--token_type {char,bpe,None}]
[--bpemodel BPEMODEL]
ASR Decoding
optional arguments:
--config CONFIG Give a config file in YAML format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbosity level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of GPUs. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
--sim_chunk_length SIM_CHUNK_LENGTH
The length of one chunk into which speech is divided
to evaluate streaming processing. (default: 0)
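`--sim_chunk_length` simulates streaming by feeding the input to the recognizer in fixed-size pieces. The sketch below shows only the splitting step. It assumes that the length is counted in samples, that the final chunk may be shorter than the others, and that a value of 0 or less disables chunking (this matches the default of 0, but it is an assumption). `split_into_chunks` is a hypothetical name.

```python
from typing import List, Sequence


def split_into_chunks(speech: Sequence[float], chunk_length: int) -> List[Sequence[float]]:
    """Divide a waveform into consecutive chunks of chunk_length samples."""
    if chunk_length <= 0:
        # Assumed: a non-positive length means "process the whole input at once".
        return [speech]
    return [speech[i:i + chunk_length] for i in range(0, len(speech), chunk_length)]
```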
The model configuration related:
--asr_train_config ASR_TRAIN_CONFIG
--asr_model_file ASR_MODEL_FILE
--lm_train_config LM_TRAIN_CONFIG
--lm_file LM_FILE
--word_lm_train_config WORD_LM_TRAIN_CONFIG
--word_lm_file WORD_LM_FILE
Beam-search related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--nbest NBEST Output N-best hypotheses (default: 1)
--beam_size BEAM_SIZE
Beam size (default: 20)
--penalty PENALTY Insertion penalty (default: 0.0)
--maxlenratio MAXLENRATIO
Input length ratio to obtain max output length. If
maxlenratio=0.0 (default), it uses an end-detect
function to automatically find maximum hypothesis
lengths (default: 0.0)
--minlenratio MINLENRATIO
Input length ratio to obtain min output length
(default: 0.0)
--ctc_weight CTC_WEIGHT
CTC weight in joint decoding (default: 0.5)
--lm_weight LM_WEIGHT
RNNLM weight (default: 1.0)
--disable_repetition_detection DISABLE_REPETITION_DETECTION
--encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT
Limit the lengths of the encoded feature to input to
the decoder. (default: 0)
--decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT
Limit the lengths of the text to input to the decoder.
(default: 0)
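The beam-search options above weight several scorers against each other. The sketch below combines per-hypothesis log-probabilities under several assumptions. The attention decoder gets weight `1 - ctc_weight`, CTC gets `ctc_weight`, and the LM gets `lm_weight`. `penalty` is added once per output token. The sketch also shows one reading of `--maxlenratio`. The function names are hypothetical, and this is not ESPnet's BeamSearch implementation.

```python
def joint_score(att: float, ctc: float, lm: float, num_tokens: int,
                ctc_weight: float = 0.5, lm_weight: float = 1.0,
                penalty: float = 0.0) -> float:
    """Weighted log-probability score of one hypothesis (assumed weighting)."""
    return ((1.0 - ctc_weight) * att
            + ctc_weight * ctc
            + lm_weight * lm
            + penalty * num_tokens)


def max_output_length(input_length: int, maxlenratio: float) -> int:
    """Upper bound on hypothesis length derived from the input length."""
    if maxlenratio == 0.0:
        # The help text says end detection decides the length; using the
        # input length as a loose cap here is an assumption of this sketch.
        return input_length
    return max(1, int(maxlenratio * input_length))
```

With `ctc_weight=1.0` the attention score drops out entirely, and with `ctc_weight=0.0` the CTC score does. The defaults above (0.5 and 1.0) match the documented defaults.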
Text converter related:
--token_type {char,bpe,None}
The token type for the ASR model. If not given, it is
taken from the training args (default: None)
--bpemodel BPEMODEL The model path of sentencepiece. If not given, it is
taken from the training args (default: None)
asr_train.py
usage: asr_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--token_list TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
[--joint_net_conf JOINT_NET_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}]
[--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--short_noise_thres SHORT_NOISE_THRES]
[--aux_ctc_tasks AUX_CTC_TASKS [AUX_CTC_TASKS ...]]
[--frontend {default,sliding_window,s3prl,fused,whisper}]
[--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--model {espnet,maskctc,pit_espnet}]
[--model_conf MODEL_CONF]
[--preencoder {sinc,linear,None}]
[--preencoder_conf PREENCODER_CONF]
[--encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}]
[--encoder_conf ENCODER_CONF]
[--postencoder {hugging_face_transformers,None}]
[--postencoder_conf POSTENCODER_CONF]
[--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}]
[--decoder_conf DECODER_CONF]
[--preprocessor {default,multi}]
[--preprocessor_conf PREPROCESSOR_CONF]
base parser
optional arguments:
--config CONFIG Give a config file in YAML format (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbosity level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of GPUs. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images used to plot the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to set find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
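With `--multiprocessing_distributed`, one process is launched per GPU on each node, and with `env://` each process reads its setup from the environment variables listed under `--dist_init_method`. The sketch below shows the usual PyTorch rank arithmetic, assuming every node has the same number of GPUs. It follows the common convention and is not necessarily ESPnet's exact code. The helper names are hypothetical.

```python
from typing import Dict


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Rank of one process across all nodes (one process per GPU)."""
    return node_rank * gpus_per_node + local_rank


def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of processes taking part in training."""
    return num_nodes * gpus_per_node


def env_init_vars(master_addr: str, master_port: int, world: int, rank: int) -> Dict[str, str]:
    """Environment variables consulted when init_method is "env://"."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world),
        "RANK": str(rank),
    }
```

For two nodes with four GPUs each, the third GPU (local rank 2) on the second node (node rank 1) becomes global rank 6 of a world of 8 processes.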
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give triples consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p of the p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
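`--best_model_criterion`, `--keep_nbest_models`, and `--patience` all rank epochs by a logged metric, using the "min" or "max" mode from the criterion triple. The sketch below ranks epochs from a plain dict that maps epoch to value. The data layout and helper names are hypothetical. The off-by-one convention for patience (stop once the best epoch is *more than* `patience` epochs old) is also an assumption, not ESPnet's documented rule.

```python
from typing import Dict, List


def nbest_epochs(values: Dict[int, float], mode: str, n: int) -> List[int]:
    """Return the n epochs with the best value; mode is "min" or "max"."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    ranked = sorted(values, key=lambda epoch: values[epoch], reverse=(mode == "max"))
    return ranked[:n]


def should_stop(values: Dict[int, float], mode: str, patience: int) -> bool:
    """True if no improvement was seen for more than `patience` epochs."""
    best_epoch = nbest_epochs(values, mode, 1)[0]
    latest_epoch = max(values)
    return latest_epoch - best_epoch > patience
```

For example, with validation losses `{1: 3.0, 2: 2.0, 3: 2.5, 4: 2.2}` and mode "min", the two best epochs are 2 and 4. These are the snapshots that `--keep_nbest_models 2` would retain under this reading.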
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
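The `--init_param` value packs up to four colon-separated fields into one string. The sketch below splits such a value into named parts. `InitParam` and `parse_init_param` are hypothetical names. Treating missing or empty fields as `None` is an assumption, and so is ignoring paths that themselves contain ':'. ESPnet's own loader may behave differently.

```python
from typing import List, NamedTuple, Optional


class InitParam(NamedTuple):
    file_path: Optional[str]
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Optional[str]


def parse_init_param(spec: str) -> InitParam:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields."""
    fields = spec.split(":")
    if len(fields) > 4:
        # Assumption: paths containing ':' are not supported by this sketch.
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    padded: List[Optional[str]] = list(fields) + [None] * (4 - len(fields))
    # Empty strings (e.g. 'model.pth::decoder') are treated as "not given".
    return InitParam(*[f if f else None for f in padded])
```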
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain as close to the same number of 'bins' as possible, where the bins are counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that contain as close to the same number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus, this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
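The length-aware samplers read a shape file with lines such as `utterance_id_a 1000,80`, and the "folded" sampler shrinks the batch as the longest sample grows. The sketch below parses a shape file and computes a folded batch size. `read_shape_file` and `folded_batch_size` are hypothetical helpers. As written, the help text's formula divides by zero whenever L < fold_length. This sketch therefore assumes a divisor of `1 + L // fold_length` and a minimum batch size of 1. Both choices are assumptions; check espnet2's sampler code for the exact rule.

```python
from typing import Dict, List


def read_shape_file(lines: List[str]) -> Dict[str, List[int]]:
    """Parse 'utt_id 1000,80' lines into {utt_id: [1000, 80]}."""
    shapes: Dict[str, List[int]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, shape = line.split(maxsplit=1)
        shapes[utt_id] = [int(x) for x in shape.split(",")]
    return shapes


def folded_batch_size(base_batch_size: int, lengths: List[int], fold_length: int) -> int:
    """Shrink the batch size as the longest sample in the batch grows.

    Assumption: the divisor is 1 + L // fold_length, so it is never zero.
    """
    longest = max(lengths)  # L in the help text
    return max(1, base_batch_size // (1 + longest // fold_length))
```

With the default `--batch_size 20` and an assumed fold length of 800, a batch whose longest sample has length 1453 would drop to 10 samples. A batch of samples no longer than 799 keeps all 20.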
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all chunk_lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
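The three `--chunk_length` forms and `--chunk_shift_ratio` can be illustrated together. The sketch below expands a chunk-length spec into candidates, picks one per sample, and lists chunk start offsets. The helper names are hypothetical. Treating the upper bound of '300-400' as inclusive is an assumption, and so is rounding the shift down to an integer of at least 1.

```python
import random
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    if "-" in spec:
        low, high = (int(x) for x in spec.split("-"))
        return list(range(low, high + 1))  # assumed inclusive range
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(spec: str, rng: random.Random) -> int:
    """Randomly select one candidate chunk length for a sample."""
    return rng.choice(parse_chunk_length(spec))


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of chunks; shift_ratio < 1 overlaps, > 1 leaves gaps."""
    if seq_len < chunk_len:
        return []  # too short for this chunk length: the sample yields no chunk
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))
```

With the defaults (500 and 0.5), a 1000-frame sequence yields chunks starting at 0, 250, and 500, and each chunk overlaps its neighbor by half.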
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words split by commas. It's used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, e.g. wav, flac, etc.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file that contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float ndarray with the shape given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shape given in the file. The lower and upper values are given in the file type name, e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
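Each `--train_data_path_and_name_and_type` value is a comma-separated triple, and several of the file types listed above are simple "utterance id + numbers" text formats. The sketch below splits the triple and parses the four numeric text types. `parse_data_spec` and `load_numeric_text` are hypothetical helpers, not ESPnet's reader classes. Using `rsplit` so that a comma inside the path survives is a choice made for this sketch.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]
NUMERIC_TYPES = ("text_int", "csv_int", "text_float", "csv_float")


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    # rsplit keeps any comma inside the path itself intact.
    path, name, file_type = spec.rsplit(",", 2)
    return path, name, file_type


def load_numeric_text(lines: List[str], file_type: str) -> Dict[str, List[Number]]:
    """Parse lines of a 'text_int', 'csv_int', 'text_float', or 'csv_float' file."""
    if file_type not in NUMERIC_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    convert = int if file_type.endswith("_int") else float
    sep = "," if file_type.startswith("csv") else None  # None: any whitespace
    data: Dict[str, List[Number]] = {}
    for line in lines:
        utt_id, values = line.split(maxsplit=1)
        data[utt_id] = [convert(v) for v in values.split(sep)]
    return data
```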
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
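The cache options take human-readable sizes such as "10MB" or "20GB", and the validation cache falls back to 5 percent of `--max_cache_size`. The sketch below shows that conversion and fallback. It assumes decimal units (1 MB = 10**6 bytes). The library ESPnet uses for parsing may follow a different unit convention. `parse_size` and `valid_cache_size` are hypothetical names.

```python
import re
from typing import Optional

# Decimal units are an assumption of this sketch.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> float:
    """Convert a size such as '10MB' or '20GB' to bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([KMGT]?B)?\s*", text.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit or ""]


def valid_cache_size(max_cache_size: str, valid_max_cache_size: Optional[str] = None) -> float:
    """Fall back to 5 percent of --max_cache_size when no value is given."""
    if valid_max_cache_size is None:
        return 0.05 * parse_size(max_cache_size)
    return parse_size(valid_max_cache_size)
```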
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related
--token_list TOKEN_LIST
A text file mapping int IDs to tokens (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--input_size INPUT_SIZE
The input dimension of the feature (default: None)
--ctc_conf CTC_CONF The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
--joint_net_conf JOINT_NET_CONF
The keyword arguments for joint network class. (default: None)
Preprocess related
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}
The text will be tokenized at the specified token level (default: bpe)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Scale the maximum amplitude to the given value. (default: None)
--rir_scp RIR_SCP The file path of rir scp file. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The file path of noise scp file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of applying noise addition. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of noise decibel level. (default: 13_15)
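The following Python sketch shows one way to read a range such as `13_15`. It assumes the level is a signal-to-noise ratio in dB, drawn uniformly per utterance, and that the noise is scaled to hit that ratio. `parse_db_range` and `noise_gain` are hypothetical names, not ESPnet functions.

```python
import math
import random
from typing import Tuple


def parse_db_range(spec: str) -> Tuple[float, float]:
    """Split a spec such as '13_15' into (low, high); a single number gives a fixed level."""
    parts = spec.split("_")
    if len(parts) == 1:
        value = float(parts[0])
        return value, value
    if len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
        if low > high:
            raise ValueError(f"Lower bound exceeds upper bound: {spec!r}")
        return low, high
    raise ValueError(f"Invalid noise_db_range: {spec!r}")


def noise_gain(speech_power: float, noise_power: float, snr_db: float) -> float:
    """Gain g such that speech_power / (g**2 * noise_power) == 10 ** (snr_db / 10)."""
    return math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))


# Draw one level per utterance from the default range, then compute the noise gain.
low, high = parse_db_range("13_15")
snr_db = random.uniform(low, high)
gain = noise_gain(speech_power=1.0, noise_power=0.5, snr_db=snr_db)
```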
--short_noise_thres SHORT_NOISE_THRES
If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
--aux_ctc_tasks AUX_CTC_TASKS [AUX_CTC_TASKS ...]
Auxiliary tasks to train using CTC loss. (default: [])
--frontend {default,sliding_window,s3prl,fused,whisper}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--model {espnet,maskctc,pit_espnet}
The model type (default: espnet)
--model_conf MODEL_CONF
The keyword arguments for model (default: {})
--preencoder {sinc,linear,None}
The preencoder type (default: None)
--preencoder_conf PREENCODER_CONF
The keyword arguments for preencoder (default: {})
--encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}
The encoder type (default: rnn)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--postencoder {hugging_face_transformers,None}
The postencoder type (default: None)
--postencoder_conf POSTENCODER_CONF
The keyword arguments for postencoder (default: {})
--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}
The decoder type (default: None)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--preprocessor {default,multi}
The preprocessor type (default: default)
--preprocessor_conf PREPROCESSOR_CONF
The keyword arguments for preprocessor (default: {})
asr_transducer_inference.py¶
usage: asr_transducer_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU]
[--seed SEED]
[--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE
[--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--asr_train_config ASR_TRAIN_CONFIG]
[--asr_model_file ASR_MODEL_FILE]
[--lm_train_config LM_TRAIN_CONFIG]
[--lm_file LM_FILE] [--model_tag MODEL_TAG]
[--batch_size BATCH_SIZE] [--nbest NBEST]
[--beam_size BEAM_SIZE]
[--lm_weight LM_WEIGHT]
[--beam_search_config BEAM_SEARCH_CONFIG]
[--token_type {char,bpe,None}]
[--bpemodel BPEMODEL]
[--quantize_asr_model QUANTIZE_ASR_MODEL]
[--quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]]
[--quantize_dtype {float16,qint8}]
[--streaming STREAMING]
[--decoding_window DECODING_WINDOW]
[--left_context LEFT_CONTEXT]
[--display_hypotheses DISPLAY_HYPOTHESES]
ASR Transducer Decoding
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--quantize_asr_model QUANTIZE_ASR_MODEL
Apply dynamic quantization to ASR model. (default:
False)
--quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]
Module names to apply dynamic quantization on. The
module names are provided as a list, where each name
is separated by a comma (e.g.: --quantize_modules=[Linear,LSTM,GRU]).
Each specified name should be an attribute of 'torch.nn', e.g.: torch.nn.Linear,
torch.nn.LSTM, torch.nn.GRU, ... (default: None)
--quantize_dtype {float16,qint8}
Dtype for dynamic quantization. (default: qint8)
--streaming STREAMING
Whether to perform chunk-by-chunk inference. (default:
False)
--decoding_window DECODING_WINDOW
Audio length (in milliseconds) to process during
decoding. (default: 640)
--left_context LEFT_CONTEXT
Number of previous frames (AFTER subsampling) the
attention module can see in the current chunk (used by
Conformer and Branchformer blocks). (default: 32)
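Here is a rough Python conversion between `--decoding_window` (milliseconds) and `--left_context` (post-subsampling frames). It assumes 16 kHz audio, a 10 ms frame shift (hop of 160 samples), and 4x encoder subsampling. These values are illustrative only; the real numbers depend on the trained model's frontend and encoder configuration, and the streaming decoder may pad or round differently.

```python
from typing import Tuple

# Assumed front-end settings for illustration: 16 kHz audio, 10 ms hop, 4x subsampling.
FS = 16000
HOP_LENGTH = 160
SUBSAMPLING = 4


def window_sizes(window_ms: int) -> Tuple[int, int, int]:
    """Return (samples, feature frames, encoder frames) covered by one decoding window."""
    samples = window_ms * FS // 1000
    feature_frames = samples // HOP_LENGTH
    encoder_frames = feature_frames // SUBSAMPLING
    return samples, feature_frames, encoder_frames


def left_context_ms(left_context: int) -> float:
    """Duration of audio that `left_context` post-subsampling frames correspond to."""
    return left_context * SUBSAMPLING * HOP_LENGTH * 1000 / FS


print(window_sizes(640))    # (10240, 64, 16)
print(left_context_ms(32))  # 1280.0
```

Under these assumptions, the default window of 640 ms spans 16 encoder frames, and the default left context of 32 frames reaches back about 1.28 s.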
--display_hypotheses DISPLAY_HYPOTHESES
Whether to display hypotheses during inference. If
streaming=True, partial hypotheses will also be shown.
(default: False)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--asr_train_config ASR_TRAIN_CONFIG
ASR training configuration (default: None)
--asr_model_file ASR_MODEL_FILE
ASR model parameter file (default: None)
--lm_train_config LM_TRAIN_CONFIG
LM training configuration (default: None)
--lm_file LM_FILE LM parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
*_train_config and *_file will be overwritten
(default: None)
Beam-search related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--nbest NBEST Output N-best hypotheses (default: 1)
--beam_size BEAM_SIZE
Beam size (default: 5)
--lm_weight LM_WEIGHT
RNNLM weight (default: 1.0)
--beam_search_config BEAM_SEARCH_CONFIG
The keyword arguments for transducer beam search.
(default: {})
Text converter related:
--token_type {char,bpe,None}
The token type for ASR model. If not given, it is inferred
from the training args (default: None)
--bpemodel BPEMODEL The model path of sentencepiece. If not given, it is inferred
from the training args (default: None)
asr_transducer_train.py¶
usage: asr_transducer_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU]
[--seed SEED] [--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK]
[--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE]
[--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN]
[--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP]
[--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB]
[--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID]
[--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--token_list TOKEN_LIST]
[--input_size INPUT_SIZE] [--init INIT]
[--model_conf MODEL_CONF]
[--encoder_conf ENCODER_CONF]
[--joint_network_conf JOINT_NETWORK_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--rir_scp RIR_SCP]
[--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--frontend {default,sliding_window}]
[--frontend_conf FRONTEND_CONF]
[--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--decoder {mega,rnn,rwkv,stateless}]
[--decoder_conf DECODER_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is only meaningful for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
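The following Python sketch illustrates n-best selection and averaging for one criterion. It is a simplification, not ESPnet's implementation: plain floats stand in for parameter tensors, and `nbest_epochs` and `average_params` are hypothetical names.

```python
from typing import Dict, List


def nbest_epochs(scores: Dict[int, float], n: int, mode: str) -> List[int]:
    """Return the n best epochs for one criterion; mode is 'min' or 'max'."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    ranked = sorted(scores, key=lambda epoch: scores[epoch], reverse=(mode == "max"))
    return ranked[:n]


def average_params(snapshots: List[Dict[str, float]]) -> Dict[str, float]:
    """Element-wise mean of parameter values across the kept snapshots."""
    return {name: sum(s[name] for s in snapshots) / len(snapshots) for name in snapshots[0]}


valid_acc = {1: 0.80, 2: 0.85, 3: 0.83, 4: 0.90}
kept = nbest_epochs(valid_acc, n=2, mode="max")  # [4, 2]
```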
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
The flag to switch to use noise injection to gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding and training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
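Here is a Python sketch of how an `--init_param` spec could be split and applied to state-dict keys. `InitParamSpec`, `parse_init_param` and `select_keys` are hypothetical. Two details are assumptions: exclude_keys is treated as a comma-separated list, and keys are matched by prefix. ESPnet's own loader may differ, for example in how it renames keys.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'; trailing fields are optional."""
    fields: List[Optional[str]] = list(spec.split(":"))
    if len(fields) > 4:
        raise ValueError(f"Too many fields in {spec!r}")
    fields += [None] * (4 - len(fields))
    file_path, src_key, dst_key, exclude = fields
    excludes = exclude.split(",") if exclude else []  # assumption: comma-separated
    return InitParamSpec(file_path, src_key or None, dst_key or None, excludes)


def select_keys(state_keys: Iterable[str], spec: InitParamSpec) -> Dict[str, str]:
    """Map source state keys to destination keys, skipping excluded prefixes."""
    mapping = {}
    for key in state_keys:
        if any(key.startswith(ex) for ex in spec.exclude_keys):
            continue
        if spec.src_key is None:
            mapping[key] = key
        elif key.startswith(spec.src_key + "."):
            rest = key[len(spec.src_key) + 1:]
            mapping[key] = f"{spec.dst_key}.{rest}" if spec.dst_key else rest
    return mapping
```

With the third example above, `decoder.embed.*` is skipped and the remaining `decoder.*` keys are kept.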
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file listing the sample names.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that each hold as close to the same number of 'bins' as possible, where bins are counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that each hold as close to the same number of 'bins' as possible, but it counts bins by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
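The "folded" rule can be sketched in Python as follows. Taken literally, `base_batch_size // (L // fold_length)` divides by zero when L < fold_length. This sketch therefore adds two assumed guards: the divisor is at least 1, and the batch size is at least 1. The actual FoldedBatchSampler may compute the size differently, and the helper names are hypothetical.

```python
from typing import Dict, List


def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), guarded against zero."""
    factor = max(1, max_length // fold_length)  # assumption: avoid dividing by zero for short L
    return max(1, base_batch_size // factor)    # assumption: never return an empty batch


def make_folded_batches(utt2len: Dict[str, int], base_batch_size: int, fold_length: int) -> List[List[str]]:
    """Group utterances longest-first; each batch is sized by its longest member."""
    keys = sorted(utt2len, key=lambda k: utt2len[k], reverse=True)
    batches, start = [], 0
    while start < len(keys):
        size = folded_batch_size(base_batch_size, utt2len[keys[start]], fold_length)
        batches.append(keys[start:start + size])
        start += size
    return batches


lengths = {"utterance_id_a": 1000, "utterance_id_b": 1453, "utterance_id_c": 1241}
print(make_folded_batches(lengths, base_batch_size=4, fold_length=500))
# [['utterance_id_b', 'utterance_id_c'], ['utterance_id_a']]
```

Sorting longest-first means the first key of each batch is its longest sample, so L can be read directly from that key.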
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all chunk lengths, the sample is discarded. (default: 500)
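The three `--chunk_length` forms and the discard rule can be sketched in Python. Two points are assumptions: the `'300-400'` range is treated as inclusive, and the random pick is made only among lengths that fit the sequence. Both helpers are hypothetical, not ESPnet functions.

```python
import random
from typing import List, Optional


def parse_chunk_lengths(spec: str) -> List[int]:
    """Parse '300', '300,400,500', or '300-400' into a list of candidate lengths."""
    spec = spec.strip()
    if "-" in spec:
        low, high = (int(v) for v in spec.split("-"))
        if low > high:
            raise ValueError(f"Invalid range: {spec!r}")
        return list(range(low, high + 1))  # assumption: the range is inclusive
    return [int(v) for v in spec.split(",")]


def choose_chunk_length(seq_len: int, choices: List[int], rng: random.Random) -> Optional[int]:
    """Pick a random candidate that fits; None means the sample is discarded."""
    fitting = [c for c in choices if c <= seq_len]
    if not fitting:
        return None
    return rng.choice(fitting)
```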
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
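Here is a Python sketch of how the shift ratio controls chunk start positions. It assumes the shift is `chunk_length * chunk_shift_ratio` frames. It ignores any random start offset the real iterator may apply, and `chunk_offsets` is a hypothetical name.

```python
from typing import List


def chunk_offsets(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start frames of fixed-length chunks; shift = chunk_length * shift_ratio."""
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


print(chunk_offsets(1000, 500, 0.5))  # [0, 250, 500]  overlapping chunks
print(chunk_offsets(1000, 300, 1.5))  # [0, 450]  gaps of 150 frames
```

A sequence shorter than the chunk length yields no offsets at all. This matches the note above that such samples are discarded.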
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
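The exclusion rule can be sketched in Python. This sketch interprets "have one of the prefixes and end with numbers" as "the rest of the key after the prefix consists only of digits". ChunkIterFactory may apply a looser rule, and `is_excluded` is a hypothetical helper.

```python
from typing import Iterable


def is_excluded(key: str, prefixes: Iterable[str]) -> bool:
    """True if key equals a prefix, or is a prefix followed only by digits."""
    for prefix in prefixes:
        if key == prefix:
            return True
        if key.startswith(prefix) and key[len(prefix):].isdigit():
            return True
    return False
```

For example, with the prefix `prompt`, the keys `prompt` and `prompt2` would skip the length consistency check, while `prompt_text` would not.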
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three comma-separated values. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, such as wav and flac.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start' 'end' 'syllable' 'midi' and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file containing arrays at the first or second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
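The argument triple and the four numeric text formats can be parsed with a short Python sketch. This is an illustration, not ESPnet's loaders. Splitting from the right, so that commas inside the path survive, is a choice made here, and the helper names are hypothetical.

```python
from typing import List, Tuple, Union

# (separator, converter) per file type; None splits on any whitespace.
_VALUE_FORMATS = {
    "text_int": (None, int),
    "csv_int": (",", int),
    "text_float": (None, float),
    "csv_float": (",", float),
}


def parse_data_triple(arg: str) -> Tuple[str, str, str]:
    """Split 'path,name,type'; splitting from the right keeps commas inside the path."""
    path, name, file_type = arg.rsplit(",", 2)
    return path, name, file_type


def parse_value_line(line: str, file_type: str) -> Tuple[str, List[Union[int, float]]]:
    """Parse one 'utterance_id values...' line of a numeric file type."""
    separator, convert = _VALUE_FORMATS[file_type]
    utt_id, payload = line.strip().split(maxsplit=1)
    return utt_id, [convert(v) for v in payload.split(separator)]
```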
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB or 20GB. If None, 5 percent of --max_cache_size is used. (default: None)
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related.
--token_list TOKEN_LIST
Integer-string mapper for tokens. (default: None)
--input_size INPUT_SIZE
The number of dimensions for input features. (default: None)
--init INIT Type of model initialization to use. (default: None)
--model_conf MODEL_CONF
The keyword arguments for the model class. (default: {'transducer_weight': 1.0, 'use_k2_pruned_loss': False, 'k2_pruned_loss_args': {}, 'warmup_steps': 25000, 'validation_nstep': 2, 'fastemit_lambda': 0.0, 'auxiliary_ctc_weight': 0.0, 'auxiliary_ctc_dropout_rate': 0.0, 'auxiliary_lm_loss_weight': 0.0, 'auxiliary_lm_loss_smoothing': 0.05, 'ignore_id': -1, 'sym_space': '<space>', 'sym_blank': '<blank>', 'report_cer': False, 'report_wer': False, 'extract_feats_in_collect_stats': True})
--encoder_conf ENCODER_CONF
The keyword arguments for the encoder class. (default: {})
--joint_network_conf JOINT_NETWORK_CONF
The keyword arguments for the joint network class. (default: {})
Preprocess related.
--use_preprocessor USE_PREPROCESSOR
Whether to apply preprocessing to input data. (default: True)
--token_type {bpe,char,word,phn}
The type of tokens to use during tokenization. (default: bpe)
--bpemodel BPEMODEL The path of the sentencepiece model. (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
The 'non_linguistic_symbols' file path. (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Text cleaner to use. (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
g2p method to use if --token_type=phn. (default: None)
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Normalization value for maximum amplitude scaling. (default: None)
--rir_scp RIR_SCP The RIR SCP file path. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The path of noise SCP file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of applying noise addition. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of the noise decibel level. (default: 13_15)
--frontend {default,sliding_window}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--decoder {mega,rnn,rwkv,stateless}
The decoder type (default: rnn)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
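The `--noise_db_range` value is a string such as `13_15`. The sketch below shows how such a range could drive noise addition. It assumes, as an illustration rather than a statement of ESPnet's implementation, that the range is a signal-to-noise ratio in dB from which one value is drawn uniformly per utterance. `parse_db_range` and `add_noise` are hypothetical helpers.

```python
from typing import Tuple

import numpy as np


def parse_db_range(spec: str) -> Tuple[float, float]:
    """Parse a 'low_high' string such as '13_15'; a single value means a fixed level."""
    parts = [float(p) for p in spec.split("_")]
    if len(parts) == 1:
        return parts[0], parts[0]
    low, high = parts
    return low, high


def add_noise(speech: np.ndarray, noise: np.ndarray, db_range: str,
              rng: np.random.Generator) -> np.ndarray:
    """Mix noise into speech at an SNR (dB) drawn uniformly from db_range.

    Hypothetical helper: assumes the noise is already cut to the speech length
    and that db_range is interpreted as an SNR range.
    """
    low, high = parse_db_range(db_range)
    snr_db = rng.uniform(low, high)
    speech_power = np.mean(speech ** 2)
    noise_power = max(np.mean(noise ** 2), 1e-10)
    # Scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = add_noise(speech, noise, "13_15", rng)
```

Drawing a fresh SNR per call is what makes a range (rather than a single value) useful as data augmentation.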
diar_inference.py¶
usage: diar_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
[--dtype {float16,float32,float64}] [--fs FS]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE] [--model_tag MODEL_TAG]
[--batch_size BATCH_SIZE]
[--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
[--show_progressbar SHOW_PROGRESSBAR]
[--num_spk NUM_SPK] [--enh_s2t_task ENH_S2T_TASK]
[--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
[--normalize_output_wav NORMALIZE_OUTPUT_WAV]
[--multiply_diar_result MULTIPLY_DIAR_RESULT]
Speaker Diarization inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--fs FS Sampling rate (default: 8000)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--train_config TRAIN_CONFIG
Diarization training configuration (default: None)
--model_file MODEL_FILE
Diarization model parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
train_config and model_file will be overwritten
(default: None)
Data loading related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
Diarize speech related:
--segment_size SEGMENT_SIZE
Segment length in seconds for segment-wise speaker
diarization (default: None)
--hop_size HOP_SIZE Hop length in seconds for segment-wise speaker
diarization (default: None)
--show_progressbar SHOW_PROGRESSBAR
Whether to show a progress bar when performing
segment-wise speaker diarization (default: False)
--num_spk NUM_SPK Predetermined number of speakers for inference
(default: None)
Enh + Diar related:
--enh_s2t_task ENH_S2T_TASK
Whether the model is an enhancement and diarization
joint model (default: False)
--normalize_segment_scale NORMALIZE_SEGMENT_SCALE
Whether to normalize the energy of the separated
streams in each segment (default: False)
--normalize_output_wav NORMALIZE_OUTPUT_WAV
Whether to normalize the predicted wav to the range
[-1, 1] (default: False)
--multiply_diar_result MULTIPLY_DIAR_RESULT
Whether to multiply the diarization results with the
separated waveforms (default: False)
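With `--segment_size` and `--hop_size`, a long recording is processed in overlapping windows. The sketch below derives the segment boundaries, in samples, from these two options and `--fs`. `segment_bounds` is a hypothetical helper, and clipping the last segment at the end of the signal is an assumption. This page does not describe how the tool pads or merges the final segment.

```python
from typing import List, Tuple


def segment_bounds(num_samples: int, fs: int, segment_size: float,
                   hop_size: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping segments covering the signal.

    segment_size and hop_size are in seconds, as for --segment_size/--hop_size.
    Hypothetical helper: the last segment is clipped at the end of the signal.
    """
    seg = int(segment_size * fs)
    hop = int(hop_size * fs)
    if seg <= 0 or hop <= 0:
        raise ValueError("segment_size and hop_size must be positive")
    if num_samples <= seg:
        return [(0, num_samples)]  # short input: a single segment
    bounds = []
    start = 0
    while True:
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += hop
    return bounds


# 2.5 s at the default --fs 8000, 1 s segments with a 0.5 s hop
print(segment_bounds(20000, 8000, 1.0, 0.5))
```

A hop shorter than the segment gives overlapping segments, so each region is seen more than once and per-segment results can be stitched together.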
diar_train.py¶
usage: diar_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF] [--num_spk NUM_SPK]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--frontend {default,sliding_window,s3prl,None}]
[--frontend_conf FRONTEND_CONF]
[--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--encoder {conformer,transformer,rnn}]
[--encoder_conf ENCODER_CONF] [--decoder {linear}]
[--decoder_conf DECODER_CONF]
[--label_aggregator {label_aggregator}]
[--label_aggregator_conf LABEL_AGGREGATOR_CONF]
[--attractor {rnn,None}]
[--attractor_conf ATTRACTOR_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images used to plot the attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable attention plotting (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, where each node has N GPUs. This is the fastest way to use PyTorch for either single-node or multi-node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model when in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase ("train" or "valid") and the criterion name. The mode ("min" or "max") can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple consisting of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give one or more triples, each consisting of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The order of the p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into the gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
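`--early_stopping_criterion` takes three values (phase, criterion name, mode), for example `valid loss min`. `--patience` sets how many epochs without improvement are tolerated. The sketch below applies that rule to a list of per-epoch values. `should_stop` is a hypothetical helper, and our reading is that training stops once more than `patience` epochs have passed since the best one. ESPnet's trainer may do the bookkeeping differently.

```python
from typing import List, Optional


def should_stop(history: List[float], mode: str, patience: Optional[int]) -> bool:
    """Return True once more than `patience` epochs have passed since the best epoch.

    history: one value per finished epoch, e.g. valid/loss.
    mode: "min" or "max", as the third value of --early_stopping_criterion.
    patience=None disables early stopping (the default of --patience).
    Hypothetical helper, not ESPnet's trainer code.
    """
    if patience is None or not history:
        return False
    epochs = range(len(history))
    if mode == "min":
        best_epoch = min(epochs, key=lambda i: history[i])
    elif mode == "max":
        best_epoch = max(epochs, key=lambda i: history[i])
    else:
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    epochs_since_best = len(history) - 1 - best_epoch
    return epochs_since_best > patience


valid_loss = [1.0, 0.8, 0.9, 0.85, 0.95]  # best at the second epoch
print(should_stop(valid_loss, "min", 2))
```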
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
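The `--init_param` format `<file_path>:<src_key>:<dst_key>:<exclude_keys>` can be illustrated on a plain dictionary standing in for a model state dict. The sketch below rests on two assumptions: empty fields mean "use everything", and `exclude_keys` is a comma-separated list of key prefixes matched against the full key. `parse_init_param` and `select_states` are hypothetical helpers, not ESPnet functions.

```python
from typing import Dict, List, Optional, Tuple


def parse_init_param(spec: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields.

    Missing or empty fields become None (or an empty exclude list).
    Note: this naive split breaks on file paths that themselves contain ':'.
    """
    fields = spec.split(":")
    fields += [""] * (4 - len(fields))
    path, src_key, dst_key, excludes = fields[:4]
    exclude_list = [e for e in excludes.split(",") if e]
    return path, src_key or None, dst_key or None, exclude_list


def select_states(states: Dict[str, object], src_key: Optional[str],
                  exclude_keys: List[str]) -> Dict[str, object]:
    """Drop excluded keys, then keep only entries under src_key with that prefix stripped."""
    kept = {k: v for k, v in states.items()
            if not any(k.startswith(e) for e in exclude_keys)}
    if src_key is None:
        return kept
    prefix = src_key + "."
    return {k[len(prefix):]: v for k, v in kept.items() if k.startswith(prefix)}


path, src, dst, excludes = parse_init_param(
    "some/where/model.pth:decoder:decoder:decoder.embed")
states = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
print(select_states(states, src, excludes))
```

Under these assumptions, the selected entries would then be loaded into the submodule named by `dst_key` (here `decoder`).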
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists the name of each sample.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins', counted as the total length of the features in the mini-batch, is as equal as possible. This sampler requires a text file which describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches with as equal a number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features:
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
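The variable batch sizes of the "folded" and "length" samplers can be sketched in a few lines. This is a simplified illustration, not ESPnet's sampler code. Taken literally, the folded formula above divides by zero whenever L < fold_length. The sketch therefore assumes a divisor of `1 + L // fold_length`, so that short samples keep the full `base_batch_size`. The length packing ignores padding and simply sums sample lengths up to `batch_bins`. `folded_batch_size` and `length_batches` are hypothetical names.

```python
from typing import Dict, List


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int,
                      min_batch_size: int = 1) -> int:
    """Batch size that shrinks as the longest sample in the batch grows.

    Assumption: divisor is 1 + max_len // fold_length (avoids division by zero).
    """
    return max(min_batch_size, base_batch_size // (1 + max_len // fold_length))


def length_batches(lengths: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Greedily pack utterances (longest first) so each batch's total length stays within batch_bins.

    An utterance longer than batch_bins still gets a batch of its own.
    """
    keys = sorted(lengths, key=lambda k: lengths[k], reverse=True)
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in keys:
        if current and total + lengths[key] > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(key)
        total += lengths[key]
    if current:
        batches.append(current)
    return batches


lengths = {"utterance_id_a": 1000, "utterance_id_b": 1453, "utterance_id_c": 1241}
print(folded_batch_size(20, 1600, 800))
print(length_batches(lengths, 2500))
```

For "numel", the same packing would count `length * dim` (e.g. 1000 * 80 for the shape `1000,80`) instead of the length alone.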
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each mini-batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, consecutive chunks overlap; if it is greater than 1, there are gaps between consecutive chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
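The `--chunk_length` syntax and `--chunk_shift_ratio` can be illustrated as follows. This is a minimal sketch under stated assumptions: a '-' range includes both ends, and the shift in frames is `int(chunk_length * chunk_shift_ratio)`. `parse_chunk_lengths` and `chunk_starts` are hypothetical helpers.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-302' -> [300, 301, 302]."""
    candidates: List[int] = []
    for part in spec.split(","):
        bounds = [int(x) for x in part.split("-")]
        if len(bounds) == 1:
            candidates.append(bounds[0])
        elif len(bounds) == 2:
            # Assumption: both ends of the range are valid choices
            candidates.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"invalid chunk length spec: {part!r}")
    return candidates


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start frames of fixed-length chunks; empty if the sequence is shorter than one chunk."""
    if seq_len < chunk_len:
        return []  # too short for this chunk length
    shift = max(1, int(chunk_len * shift_ratio))
    num_chunks = (seq_len - chunk_len) // shift + 1
    return [i * shift for i in range(num_chunks)]


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(1000, 500, 0.5))
```

With the default `--chunk_shift_ratio 0.5`, consecutive 500-frame chunks overlap by half.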
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three values separated by commas. It is used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile (wav, flac, etc.).
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data:
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of floating-point numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of floating-point numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first or second level.
>>> import h5py
>>> f = h5py.File('file.h5', 'r')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shapes given in the file. The lower and upper values are given in the file type name, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization.
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
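Each `--train_data_path_and_name_and_type` value is a comma-separated triple. The plain-text numeric types above map each utterance ID to an array. The sketch below illustrates both, covering only "text_int", "csv_int", "text_float", and "csv_float". `parse_data_spec` and `load_numeric_text` are hypothetical helpers, not ESPnet's loaders, and the choice of int64/float32 dtypes is an assumption.

```python
from typing import Dict, Iterable, Tuple

import numpy as np

# file type -> (value separator, Python converter, NumPy dtype); None splits on whitespace
_NUMERIC_TYPES = {
    "text_int": (None, int, np.int64),
    "csv_int": (",", int, np.int64),
    "text_float": (None, float, np.float32),
    "csv_float": (",", float, np.float32),
}


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    path, name, file_type = spec.rsplit(",", 2)
    return path, name, file_type


def load_numeric_text(lines: Iterable[str], file_type: str) -> Dict[str, np.ndarray]:
    """Map lines of the form 'utterance_id values...' to 1-D arrays."""
    sep, convert, dtype = _NUMERIC_TYPES[file_type]
    data: Dict[str, np.ndarray] = {}
    for line in lines:
        utt_id, _, values = line.strip().partition(" ")
        if not utt_id:
            continue  # skip blank lines
        data[utt_id] = np.array([convert(v) for v in values.strip().split(sep)], dtype=dtype)
    return data


print(parse_data_spec("some/path/a.scp,foo,sound"))
print(load_numeric_text(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"], "text_int"))
```

In practice the lines would come from the file named by the first field, e.g. `open(path)`.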
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related
--num_spk NUM_SPK The number of speakers (for each recording) used in system training (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--input_size INPUT_SIZE
The input feature dimension (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})
Preprocess related
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--frontend {default,sliding_window,s3prl,None}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--encoder {conformer,transformer,rnn}
The encoder type (default: transformer)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--decoder {linear} The decoder type (default: linear)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--label_aggregator {label_aggregator}
The label_aggregator type (default: label_aggregator)
--label_aggregator_conf LABEL_AGGREGATOR_CONF
The keyword arguments for label_aggregator (default: {})
--attractor {rnn,None}
The attractor type (default: None)
--attractor_conf ATTRACTOR_CONF
The keyword arguments for attractor (default: {})
enh_inference.py¶
usage: enh_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
[--dtype {float16,float32,float64}] [--fs FS]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--normalize_output_wav NORMALIZE_OUTPUT_WAV]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE] [--model_tag MODEL_TAG]
[--inference_config INFERENCE_CONFIG]
[--enh_s2t_task ENH_S2T_TASK]
[--batch_size BATCH_SIZE]
[--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
[--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
[--show_progressbar SHOW_PROGRESSBAR]
[--ref_channel REF_CHANNEL]
Frontend inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--fs FS Sampling rate (default: 8000)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Output data related:
--normalize_output_wav NORMALIZE_OUTPUT_WAV
Whether to normalize the predicted wav to the range
[-1, 1] (default: False)
The model configuration related:
--train_config TRAIN_CONFIG
Training configuration file (default: None)
--model_file MODEL_FILE
Model parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
train_config and model_file will be overwritten
(default: None)
--inference_config INFERENCE_CONFIG
Optional configuration file for overwriting enh model
attributes during inference (default: None)
--enh_s2t_task ENH_S2T_TASK
Whether the model is an enhancement and ASR joint model (default: False)
Data loading related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
SeparateSpeech related:
--segment_size SEGMENT_SIZE
Segment length in seconds for segment-wise speech
enhancement/separation (default: None)
--hop_size HOP_SIZE Hop length in seconds for segment-wise speech
enhancement/separation (default: None)
--normalize_segment_scale NORMALIZE_SEGMENT_SCALE
Whether to normalize the energy of the separated
streams in each segment (default: True)
--show_progressbar SHOW_PROGRESSBAR
Whether to show a progress bar when performing
segment-wise speech enhancement/separation (default:
False)
--ref_channel REF_CHANNEL
If not None, this will overwrite the ref_channel
defined in the separator module (for multi-channel
speech processing) (default: None)
enh_inference_streaming.py¶
usage: enh_inference_streaming.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU]
[--seed SEED]
[--dtype {float16,float32,float64}]
[--fs FS] [--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE
[--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE]
[--model_tag MODEL_TAG]
[--inference_config INFERENCE_CONFIG]
[--enh_s2t_task ENH_S2T_TASK]
[--batch_size BATCH_SIZE]
[--ref_channel REF_CHANNEL]
Frontend inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--fs FS Sampling rate (default: 8000)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--train_config TRAIN_CONFIG
Training configuration file (default: None)
--model_file MODEL_FILE
Model parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
train_config and model_file will be overwritten
(default: None)
--inference_config INFERENCE_CONFIG
Optional configuration file for overwriting enh model
attributes during inference (default: None)
--enh_s2t_task ENH_S2T_TASK
Whether the model is a joint enhancement and ASR model (default: False)
Data loading related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
SeparateSpeech related:
--ref_channel REF_CHANNEL
If not None, this will overwrite the ref_channel
defined in the separator module (for multi-channel
speech processing) (default: None)
enh_s2t_train.py¶
usage: enh_s2t_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB]
[--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
[--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--token_list TOKEN_LIST]
[--src_token_list SRC_TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
[--enh_criterions ENH_CRITERIONS]
[--diar_num_spk DIAR_NUM_SPK]
[--diar_input_size DIAR_INPUT_SIZE]
[--enh_model_conf ENH_MODEL_CONF]
[--asr_model_conf ASR_MODEL_CONF]
[--st_model_conf ST_MODEL_CONF]
[--diar_model_conf DIAR_MODEL_CONF]
[--subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL]
[--src_token_type {bpe,char,word,phn}]
[--src_bpemodel SRC_BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--text_name TEXT_NAME [TEXT_NAME ...]]
[--enh_encoder {stft,conv,same}]
[--enh_encoder_conf ENH_ENCODER_CONF]
[--enh_separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet}]
[--enh_separator_conf ENH_SEPARATOR_CONF]
[--enh_decoder {stft,conv,same}]
[--enh_decoder_conf ENH_DECODER_CONF]
[--enh_mask_module {multi_mask}]
[--enh_mask_module_conf ENH_MASK_MODULE_CONF]
[--frontend {default,sliding_window,s3prl,fused,whisper}]
[--frontend_conf FRONTEND_CONF]
[--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--asr_preencoder {sinc,linear,None}]
[--asr_preencoder_conf ASR_PREENCODER_CONF]
[--asr_encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}]
[--asr_encoder_conf ASR_ENCODER_CONF]
[--asr_postencoder {hugging_face_transformers,None}]
[--asr_postencoder_conf ASR_POSTENCODER_CONF]
[--asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}]
[--asr_decoder_conf ASR_DECODER_CONF]
[--st_preencoder {sinc,linear,None}]
[--st_preencoder_conf ST_PREENCODER_CONF]
[--st_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}]
[--st_encoder_conf ST_ENCODER_CONF]
[--st_postencoder {hugging_face_transformers,None}]
[--st_postencoder_conf ST_POSTENCODER_CONF]
[--st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--st_decoder_conf ST_DECODER_CONF]
[--st_extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF]
[--st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF]
[--diar_frontend {default,sliding_window,s3prl,None}]
[--diar_frontend_conf DIAR_FRONTEND_CONF]
[--diar_specaug {specaug,None}]
[--diar_specaug_conf DIAR_SPECAUG_CONF]
[--diar_normalize {global_mvn,utterance_mvn,None}]
[--diar_normalize_conf DIAR_NORMALIZE_CONF]
[--diar_encoder {conformer,transformer,rnn}]
[--diar_encoder_conf DIAR_ENCODER_CONF]
[--diar_decoder {linear}]
[--diar_decoder_conf DIAR_DECODER_CONF]
[--label_aggregator {label_aggregator}]
[--label_aggregator_conf LABEL_AGGREGATOR_CONF]
[--diar_attractor {rnn,None}]
[--diar_attractor_conf DIAR_ATTRACTOR_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is only meaningful for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase ("train" or "valid") and the criterion name. The mode, "min" or "max", can be changed via --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
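The interaction of --patience with the early-stopping mode can be illustrated with a small check over a per-epoch history of one criterion. This is a sketch of one plausible reading of these options, not ESPnet's trainer/reporter code. The helper `should_stop` is hypothetical, and the strict "more than patience epochs since the best" rule is an assumption.

```python
from typing import List


def should_stop(history: List[float], patience: int, mode: str = "min") -> bool:
    """Return True if the criterion has not improved for more than `patience` epochs.

    history[i] is the criterion value (e.g. valid loss) after epoch i + 1.
    Hypothetical helper; the '>' comparison is an assumption.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    if not history:
        return False
    best_value = min(history) if mode == "min" else max(history)
    best_epoch = history.index(best_value)  # first epoch reaching the best value
    epochs_since_best = len(history) - 1 - best_epoch
    return epochs_since_best > patience


print(should_stop([1.0, 0.8, 0.85, 0.9, 0.95], patience=2))
```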
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criteria used for judging the best model. Give triples of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
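Combining --best_model_criterion with --keep_nbest_models amounts to ranking epochs per criterion and keeping the union of the n-best sets. The sketch below assumes exactly that union rule. It ignores any snapshots that ESPnet may keep for other reasons, such as the latest checkpoint used for resuming. Both helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def nbest_epochs(scores: Dict[int, float], n: int, mode: str) -> List[int]:
    """Return the n best epochs for one criterion, best first."""
    # "max" means larger is better (e.g. accuracy); "min" means smaller is better.
    ranked = sorted(scores, key=lambda epoch: scores[epoch], reverse=(mode == "max"))
    return ranked[:n]


def epochs_to_remove(all_epochs: Iterable[int],
                     criteria: List[Tuple[Dict[int, float], str]], n: int) -> List[int]:
    """Epochs whose snapshots are outside the n-best of every criterion."""
    keep = set()
    for scores, mode in criteria:
        keep.update(nbest_epochs(scores, n, mode))
    return sorted(set(all_epochs) - keep)


valid_loss = {1: 2.0, 2: 1.5, 3: 1.4, 4: 1.6}
valid_acc = {1: 0.71, 2: 0.78, 3: 0.80, 4: 0.79}
print(epochs_to_remove([1, 2, 3, 4], [(valid_loss, "min"), (valid_acc, "max")], n=2))  # [1]
```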
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
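--grad_clip and --grad_clip_type together describe clipping by total gradient norm. The sketch below reproduces that idea on plain Python lists, modeled on the behavior of `torch.nn.utils.clip_grad_norm_`. All gradients are rescaled together when their joint p-norm exceeds the threshold. Whether ESPnet calls exactly that PyTorch function is not stated here. The helper `clip_grad_norm` is hypothetical.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float,
                   norm_type: float = 2.0) -> float:
    """Scale gradients in place so that their total p-norm is at most max_norm.

    Returns the total norm measured before clipping. norm_type may be
    float("inf"), in which case the norm is the largest absolute value.
    """
    values = [abs(g) for grad in grads for g in grad]
    if math.isinf(norm_type):
        total_norm = max(values, default=0.0)
    else:
        total_norm = sum(v ** norm_type for v in values) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)  # small epsilon avoids division by zero
    if clip_coef < 1.0:
        for grad in grads:
            for i, g in enumerate(grad):
                grad[i] = g * clip_coef
    return total_norm


grads = [[3.0], [4.0]]
print(clip_grad_norm(grads, max_norm=1.0), grads)
```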
--grad_noise GRAD_NOISE
Whether to inject noise into gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
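The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format can be sketched as a parser plus a key-selection step over a state dict (shown here as a plain dict). Both helpers are hypothetical. The following are assumptions, not ESPnet's documented behavior: missing fields mean "not given", several exclude keys are comma-separated, and exclude keys are matched as prefixes of the full source key names.

```python
from typing import Dict, List, Optional, Tuple


def parse_init_param(spec: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its parts."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':' separators in {spec!r}")
    parts += [""] * (4 - len(parts))  # pad missing trailing fields
    file_path, src_key, dst_key, excludes = parts
    exclude_keys = [key for key in excludes.split(",") if key]  # comma split: assumption
    return file_path, src_key or None, dst_key or None, exclude_keys


def select_states(states: Dict[str, object], src_key: Optional[str],
                  dst_key: Optional[str], exclude_keys: List[str]) -> Dict[str, object]:
    """Keep states under src_key, drop excluded ones, and re-prefix them with dst_key."""
    selected = {}
    for name, value in states.items():
        if any(name.startswith(prefix) for prefix in exclude_keys):
            continue  # excluded (prefix match is an assumption)
        if src_key is not None:
            if not name.startswith(src_key + "."):
                continue  # outside the requested sub-module
            name = name[len(src_key) + 1:]
        if dst_key is not None:
            name = f"{dst_key}.{name}"
        selected[name] = value
    return selected


path, src, dst, excl = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
states = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.output.weight": 3}
print(select_states(states, src, dst, excl))
```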
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
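The --batch_bins budget for batch_type='length' or 'numel' can be pictured as greedy packing over sorted samples. For 'length', bins are counted as the sum of first-dimension lengths. For 'numel', they are counted as the sum of all elements. The helper `bin_batches` is hypothetical. ESPnet's LengthBatchSampler and NumElementsBatchSampler balance batches differently in detail, for example regarding padding and how evenly the bins are spread.

```python
import math
from typing import Dict, List, Tuple


def bin_batches(shapes: Dict[str, Tuple[int, ...]], batch_bins: int,
                count_numel: bool = False) -> List[List[str]]:
    """Greedily group utterances so that each batch holds at most batch_bins bins.

    Assumption: a single utterance larger than batch_bins still forms its own batch.
    """
    def cost(shape: Tuple[int, ...]) -> int:
        # 'numel' counts every element; 'length' counts only the first dimension
        return math.prod(shape) if count_numel else shape[0]

    keys = sorted(shapes, key=lambda k: shapes[k][0])  # shortest first
    batches, current, used = [], [], 0
    for key in keys:
        c = cost(shapes[key])
        if current and used + c > batch_bins:
            batches.append(current)
            current, used = [], 0
        current.append(key)
        used += c
    if current:
        batches.append(current)
    return batches


shapes = {"utterance_id_a": (1000, 80), "utterance_id_b": (1453, 80), "utterance_id_c": (1241, 80)}
print(bin_batches(shapes, batch_bins=2500))
```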
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length for each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain as equal a number of 'bins' as possible, counting the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that contain as equal a number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
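The "sorted" and "folded" descriptions above can be combined into one sketch. It reads a shape file, keeps only the first dimension, sorts in descending order, and shrinks the batch size with batch_size = base_batch_size // (L // fold_length). Both helpers are hypothetical. The floors of 1 on the divisor and on the batch size are assumptions added to avoid division by zero and empty batches. ESPnet's FoldedBatchSampler may compute the divisor slightly differently.

```python
from typing import Dict, List


def read_shape_file(lines: List[str]) -> Dict[str, int]:
    """Parse 'utt_id 1000,80' lines into {utt_id: first dimension}."""
    lengths = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, shape = line.split(maxsplit=1)
        lengths[utt_id] = int(shape.split(",")[0])  # only the first element is used
    return lengths


def folded_batches(lengths: Dict[str, int], base_batch_size: int,
                   fold_length: int) -> List[List[str]]:
    """Build batches from longest to shortest, shrinking the batch for long samples."""
    keys = sorted(lengths, key=lambda k: lengths[k], reverse=True)
    batches, start = [], 0
    while start < len(keys):
        longest = lengths[keys[start]]  # descending order: first sample is the longest (L)
        factor = max(1, longest // fold_length)         # floor at 1: assumption
        batch_size = max(1, base_batch_size // factor)  # floor at 1: assumption
        batches.append(keys[start:start + batch_size])
        start += batch_size
    return batches


shape_lines = ["utterance_id_a 1000,80", "utterance_id_b 1453,80",
               "utterance_id_c 1241,80", "utterance_id_d 300,80"]
print(folded_batches(read_shape_file(shape_lines), base_batch_size=4, fold_length=500))
```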
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
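The three accepted --chunk_length forms can be expanded into a list of candidate lengths, and then one candidate is drawn per sample. In this sketch, the inclusive upper bound of '300-400' is an assumption. Another assumption is that only candidates no longer than the sequence are eligible. Both helpers are hypothetical and are not ESPnet's ChunkIterFactory.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-400' -> 300..400."""
    if "-" in spec:
        low, high = (int(v) for v in spec.split("-"))
        return list(range(low, high + 1))  # inclusive range: assumption
    return [int(v) for v in spec.split(",")]


def pick_chunk_length(seq_len: int, candidates: List[int],
                      rng: random.Random) -> Optional[int]:
    """Randomly choose a usable chunk length, or None if the sample is discarded."""
    usable = [c for c in candidates if c <= seq_len]
    if not usable:
        return None  # shorter than every chunk length -> discarded
    return rng.choice(usable)


rng = random.Random(0)
print(pick_chunk_length(450, parse_chunk_length("300,400,500"), rng))
```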
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
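The effect of --chunk_shift_ratio can be shown by computing the chunk offsets for a single sequence, using shift = int(chunk_length * shift_ratio). The helper `chunk_offsets` is hypothetical. It assumes that chunks start at offset 0 and that a trailing partial chunk is dropped. ESPnet may additionally randomize the first offset.

```python
from typing import List, Tuple


def chunk_offsets(seq_len: int, chunk_length: int,
                  shift_ratio: float) -> List[Tuple[int, int]]:
    """Return (start, end) of each full chunk for the given shift ratio."""
    shift = max(1, int(chunk_length * shift_ratio))  # ratio < 1 overlaps, > 1 leaves gaps
    last_start = seq_len - chunk_length              # last start that still fits a full chunk
    return [(s, s + chunk_length) for s in range(0, last_start + 1, shift)]


print(chunk_offsets(1200, 500, 0.5))  # [(0, 500), (250, 750), (500, 1000)]
```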
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
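The two exclusion conditions can be written as a small predicate. This sketch reads "end with numbers" as "the rest of the key after the prefix consists only of digits" (e.g. speech_ref1 for the prefix speech_ref). That reading is an assumption, and the helper `is_excluded` is hypothetical.

```python
from typing import Iterable


def is_excluded(key: str, prefixes: Iterable[str]) -> bool:
    """True if key equals a prefix, or is a prefix followed only by digits."""
    for prefix in prefixes:
        if key == prefix:
            return True
        rest = key[len(prefix):]
        if key.startswith(prefix) and rest.isdigit():
            return True
    return False


print([k for k in ["speech_mix", "speech_ref1", "speech_ref2"] if not is_excluded(k, ["speech_ref"])])
```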
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, e.g. wav, flac, etc.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first or second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float ndarray with the shapes given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shapes given in the file. Give the lower and upper values in the file type name, e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
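Two pieces of the dataset option can be sketched concretely: splitting a 'path,name,type' value, and reading the numeric text formats listed above (text_int, csv_int, text_float, csv_float). Both helpers are hypothetical. They return plain Python lists rather than the ndarrays a real loader would produce, and they assume the file path itself contains no commas.

```python
from typing import Dict, List, Tuple


def parse_data_option(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    path, name, file_type = value.split(",")  # assumes no commas inside the path
    return path, name, file_type


def load_numeric_lines(lines: List[str], file_type: str) -> Dict[str, list]:
    """Read text_int / csv_int / text_float / csv_float lines into lists."""
    cast = int if file_type.endswith("_int") else float
    sep = "," if file_type.startswith("csv_") else None  # None splits on any whitespace
    data = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, rest = line.split(maxsplit=1)
        data[utt_id] = [cast(v) for v in rest.split(sep)]
    return data


print(parse_data_option("some/path/a.scp,foo,sound"))
print(load_numeric_lines(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"], "text_int"))
```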
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
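Human-readable cache sizes such as 10MB or 20GB, together with the 5 percent fallback for validation, can be sketched as follows. ESPnet appears to rely on the humanfriendly package for size parsing. This stdlib-only sketch instead assumes decimal units for KB/MB/GB/TB and binary units for KiB/MiB/GiB/TiB. Both helpers are hypothetical.

```python
import math
import re
from typing import Optional

# Decimal and binary unit multipliers (assumed convention, see lead-in)
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
          "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40}


def parse_cache_size(text: str) -> float:
    """Convert strings like '10MB' or '20GB' into a number of bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return float(number) * _UNITS[unit]


def valid_cache_size(max_cache_size: float, valid_max_cache_size: Optional[float] = None) -> float:
    """Fallback for --valid_max_cache_size: 5 percent of --max_cache_size."""
    if valid_max_cache_size is None:
        return max_cache_size * 0.05
    return valid_max_cache_size


print(valid_cache_size(parse_cache_size("20GB")))
```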
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--src_token_list SRC_TOKEN_LIST
A text mapping int-id to token (for source language) (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--input_size INPUT_SIZE
The dimension of the input feature (default: None)
--ctc_conf CTC_CONF The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
--enh_criterions ENH_CRITERIONS
The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])
--diar_num_spk DIAR_NUM_SPK
The number of speakers (for each recording) for diar submodel class (default: None)
--diar_input_size DIAR_INPUT_SIZE
The dimension of the input feature (default: None)
--enh_model_conf ENH_MODEL_CONF
The keyword arguments for enh submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'extract_feats_in_collect_stats': False})
--asr_model_conf ASR_MODEL_CONF
The keyword arguments for asr submodel class. (default: {'aux_ctc': None, 'ctc_weight': 0.5, 'interctc_weight': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'transducer_multi_blank_durations': [], 'transducer_multi_blank_sigma': 0.05, 'sym_sos': '<sos/eos>', 'sym_eos': '<sos/eos>', 'extract_feats_in_collect_stats': True, 'lang_token_id': -1})
--st_model_conf ST_MODEL_CONF
The keyword arguments for st submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'extract_feats_in_collect_stats': False})
--diar_model_conf DIAR_MODEL_CONF
The keyword arguments for diar submodel class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})
--subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]
The series of subtasks in the pipeline. (default: ('enh', 'asr'))
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'calc_enh_loss': True, 'bypass_enh_prob': 0})
Preprocess related:
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: False)
--token_type {bpe,char,word,phn}
The text will be tokenized at the specified token level (default: bpe)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--src_token_type {bpe,char,word,phn}
The source text will be tokenized at the specified token level (default: bpe)
--src_bpemodel SRC_BPEMODEL
The model file of sentencepiece (for source language) (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--text_name TEXT_NAME [TEXT_NAME ...]
Specify the text_name attribute used in the preprocessor (default: ['text'])
--enh_encoder {stft,conv,same}
The enh_encoder type (default: stft)
--enh_encoder_conf ENH_ENCODER_CONF
The keyword arguments for enh_encoder (default: {})
--enh_separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet}
The enh_separator type (default: rnn)
--enh_separator_conf ENH_SEPARATOR_CONF
The keyword arguments for enh_separator (default: {})
--enh_decoder {stft,conv,same}
The enh_decoder type (default: stft)
--enh_decoder_conf ENH_DECODER_CONF
The keyword arguments for enh_decoder (default: {})
--enh_mask_module {multi_mask}
The enh_mask_module type (default: multi_mask)
--enh_mask_module_conf ENH_MASK_MODULE_CONF
The keyword arguments for enh_mask_module (default: {})
--frontend {default,sliding_window,s3prl,fused,whisper}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--asr_preencoder {sinc,linear,None}
The asr_preencoder type (default: None)
--asr_preencoder_conf ASR_PREENCODER_CONF
The keyword arguments for asr_preencoder (default: {})
--asr_encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}
The asr_encoder type (default: rnn)
--asr_encoder_conf ASR_ENCODER_CONF
The keyword arguments for asr_encoder (default: {})
--asr_postencoder {hugging_face_transformers,None}
The asr_postencoder type (default: None)
--asr_postencoder_conf ASR_POSTENCODER_CONF
The keyword arguments for asr_postencoder (default: {})
--asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}
The asr_decoder type (default: None)
--asr_decoder_conf ASR_DECODER_CONF
The keyword arguments for asr_decoder (default: {})
--st_preencoder {sinc,linear,None}
The st_preencoder type (default: None)
--st_preencoder_conf ST_PREENCODER_CONF
The keyword arguments for st_preencoder (default: {})
--st_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}
The st_encoder type (default: rnn)
--st_encoder_conf ST_ENCODER_CONF
The keyword arguments for st_encoder (default: {})
--st_postencoder {hugging_face_transformers,None}
The st_postencoder type (default: None)
--st_postencoder_conf ST_POSTENCODER_CONF
The keyword arguments for st_postencoder (default: {})
--st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The st_decoder type (default: rnn)
--st_decoder_conf ST_DECODER_CONF
The keyword arguments for st_decoder (default: {})
--st_extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The st_extra_asr_decoder type (default: rnn)
--st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF
The keyword arguments for st_extra_asr_decoder (default: {})
--st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The st_extra_mt_decoder type (default: rnn)
--st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF
The keyword arguments for st_extra_mt_decoder (default: {})
--diar_frontend {default,sliding_window,s3prl,None}
The diar_frontend type (default: default)
--diar_frontend_conf DIAR_FRONTEND_CONF
The keyword arguments for diar_frontend (default: {})
--diar_specaug {specaug,None}
The diar_specaug type (default: None)
--diar_specaug_conf DIAR_SPECAUG_CONF
The keyword arguments for diar_specaug (default: {})
--diar_normalize {global_mvn,utterance_mvn,None}
The diar_normalize type (default: utterance_mvn)
--diar_normalize_conf DIAR_NORMALIZE_CONF
The keyword arguments for diar_normalize (default: {})
--diar_encoder {conformer,transformer,rnn}
The diar_encoder type (default: transformer)
--diar_encoder_conf DIAR_ENCODER_CONF
The keyword arguments for diar_encoder (default: {})
--diar_decoder {linear}
The diar_decoder type (default: linear)
--diar_decoder_conf DIAR_DECODER_CONF
The keyword arguments for diar_decoder (default: {})
--label_aggregator {label_aggregator}
The label_aggregator type (default: label_aggregator)
--label_aggregator_conf LABEL_AGGREGATOR_CONF
The keyword arguments for label_aggregator (default: {})
--diar_attractor {rnn,None}
The diar_attractor type (default: None)
--diar_attractor_conf DIAR_ATTRACTOR_CONF
The keyword arguments for diar_attractor (default: {})
enh_scoring.py
usage: enh_scoring.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR
[--dtype {float16,float32,float64}] --ref_scp REF_SCP
--inf_scp INF_SCP [--key_file KEY_FILE]
[--ref_channel REF_CHANNEL]
[--flexible_numspk FLEXIBLE_NUMSPK] [--is_tse IS_TSE]
[--use_dnsmos USE_DNSMOS] [--dnsmos_mode {local,web}]
[--dnsmos_auth_key DNSMOS_AUTH_KEY]
[--dnsmos_primary_model DNSMOS_PRIMARY_MODEL]
[--dnsmos_p808_model DNSMOS_P808_MODEL]
[--use_pesq USE_PESQ]
Frontend inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--dtype {float16,float32,float64}
Data type (default: float32)
Input data related:
--ref_scp REF_SCP
--inf_scp INF_SCP
--key_file KEY_FILE
--ref_channel REF_CHANNEL
--flexible_numspk FLEXIBLE_NUMSPK
--is_tse IS_TSE
DNSMOS related:
--use_dnsmos USE_DNSMOS
--dnsmos_mode {local,web}
Use local DNSMOS model or web API for DNSMOS
calculation (default: local)
--dnsmos_auth_key DNSMOS_AUTH_KEY
Required if dnsmos_mode='web' (default: )
--dnsmos_primary_model DNSMOS_PRIMARY_MODEL
Path to the primary DNSMOS model. Required if
dnsmos_mode='local' (default:
./DNSMOS/sig_bak_ovr.onnx)
--dnsmos_p808_model DNSMOS_P808_MODEL
Path to the p808 model. Required if
dnsmos_mode='local' (default: ./DNSMOS/model_v8.onnx)
PESQ related:
--use_pesq USE_PESQ Before setting this to True, please make sure that you
or your institution have the license (check
https://www.itu.int/rec/T-REC-P.862-200511-I!Amd2/en)
to report PESQ (default: False)
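enh_scoring.py compares a reference list (--ref_scp) against an inference list (--inf_scp). The sketch below assumes both are Kaldi-style scp files with one `utt_id path` pair per line, as in the 'sound' data type. It shows one way to align the two lists by utterance ID before scoring. The helpers `read_scp` and `pair_ref_inf` are hypothetical and are not part of ESPnet.

```python
from typing import Dict, List, Tuple


def read_scp(lines: List[str]) -> Dict[str, str]:
    """Parse Kaldi-style 'utt_id path' lines into {utt_id: path} (hypothetical helper)."""
    table: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        utt_id, path = line.split(maxsplit=1)
        table[utt_id] = path
    return table


def pair_ref_inf(
    ref: Dict[str, str], inf: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (utt_id, ref_path, inf_path) tuples, failing on unmatched keys."""
    unmatched = sorted(set(ref) ^ set(inf))  # keys present in only one of the lists
    if unmatched:
        raise KeyError(f"keys not shared by ref_scp and inf_scp: {unmatched}")
    return [(utt_id, ref[utt_id], inf[utt_id]) for utt_id in sorted(ref)]


ref = read_scp(["utt1 ref/utt1.wav", "utt2 ref/utt2.wav"])
inf = read_scp(["utt2 enh/utt2.wav", "utt1 enh/utt1.wav"])
for utt_id, ref_path, inf_path in pair_ref_inf(ref, inf):
    print(utt_id, ref_path, inf_path)
```

The sketch fails loudly on unmatched keys instead of silently skipping them. This is a design choice: the real tool may instead restrict scoring to the utterances listed in --key_file.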
enh_train.py
usage: enh_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--model_conf MODEL_CONF] [--criterions CRITERIONS]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--short_noise_thres SHORT_NOISE_THRES]
[--use_reverberant_ref USE_REVERBERANT_REF]
[--num_spk NUM_SPK] [--num_noise_type NUM_NOISE_TYPE]
[--sample_rate SAMPLE_RATE]
[--force_single_channel FORCE_SINGLE_CHANNEL]
[--channel_reordering CHANNEL_REORDERING]
[--categories CATEGORIES [CATEGORIES ...]]
[--dynamic_mixing DYNAMIC_MIXING] [--utt2spk UTT2SPK]
[--dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB]
[--encoder {stft,conv,same}] [--encoder_conf ENCODER_CONF]
[--separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet}]
[--separator_conf SEPARATOR_CONF]
[--decoder {stft,conv,same}] [--decoder_conf DECODER_CONF]
[--mask_module {multi_mask}]
[--mask_module_conf MASK_MODULE_CONF]
[--preprocessor {dynamic_mixing,enh,None}]
[--preprocessor_conf PREPROCESSOR_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot the outputs from attention. This option makes sense only for attention-based models. The attention plot can also be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple referring to the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give a triple referring to the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p of the p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
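--patience, --early_stopping_criterion and --keep_nbest_models all operate on a per-epoch history of one criterion, such as ('valid', 'loss'). The sketch below models these rules under stated assumptions; it is not ESPnet's Trainer code.

- The first epoch reaching the best value counts as the best epoch.
- Training stops once more than `patience` epochs have passed without improvement. This matches the help text's "epochs to wait without improvement", but ESPnet's exact off-by-one handling may differ.
- Epoch numbers are reported 1-based.

All function names are hypothetical.

```python
from typing import List, Sequence


def epochs_since_best(history: Sequence[float], mode: str = "min") -> int:
    """Epochs elapsed after the first best epoch; history has one value per epoch."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best_value = min(history) if mode == "min" else max(history)
    best_index = list(history).index(best_value)  # first occurrence of the best value
    return len(history) - 1 - best_index


def should_stop(history: Sequence[float], patience: int, mode: str = "min") -> bool:
    """Stop once more than `patience` epochs have passed without improvement."""
    return len(history) > 0 and epochs_since_best(history, mode) > patience


def nbest_epochs(history: Sequence[float], n: int, mode: str = "min") -> List[int]:
    """1-based epoch numbers of the n best values, best first (cf. --keep_nbest_models)."""
    order = sorted(range(len(history)), key=lambda i: history[i], reverse=(mode == "max"))
    return [i + 1 for i in order[:n]]


valid_loss = [1.0, 0.8, 0.9, 0.85, 0.95]  # ('valid', 'loss') for epochs 1..5
print(should_stop(valid_loss, patience=2))  # -> True (3 epochs since the best, epoch 2)
print(nbest_epochs(valid_loss, n=2))  # -> [2, 4]
```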
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
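The --init_param format packs up to four colon-separated fields into one string. The sketch below splits such a string and filters a flat state dict the way the description suggests. It relies on several assumptions:

- Missing trailing fields mean "not given".
- Multiple exclude keys are separated by commas.
- Excluded keys are matched by prefix.
- Selected keys are re-rooted relative to src_key, so they could then be loaded into the dst_key attribute.

`InitParamSpec`, `parse_init_param` and `select_state` are hypothetical names, not ESPnet APIs. ESPnet's actual loader may differ in details.

```python
from typing import Dict, List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields."""
    fields = spec.split(":")
    if len(fields) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    fields += [""] * (4 - len(fields))  # pad the missing trailing fields
    file_path, src_key, dst_key, exclude = fields
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        exclude_keys=exclude.split(",") if exclude else [],  # comma separation assumed
    )


def select_state(state: Dict[str, object], spec: InitParamSpec) -> Dict[str, object]:
    """Keep the entries the spec would load, keyed relative to src_key."""
    selected: Dict[str, object] = {}
    for key, value in state.items():
        if any(key.startswith(prefix) for prefix in spec.exclude_keys):
            continue  # dropped by exclude_keys
        if spec.src_key is None:
            selected[key] = value  # load everything as-is
        elif key.startswith(spec.src_key + "."):
            selected[key[len(spec.src_key) + 1:]] = value  # strip 'src_key.'
    return selected


spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
state = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
print(spec)
print(select_state(state, spec))
```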
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no particular feature and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is referred to, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is referred to, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as nearly the same number of 'bins' as possible, counting the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is referred to, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as nearly the same number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
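The samplers above read the same shape-file format and differ mainly in how they size a batch. The sketch below parses a shape file, applies the folded formula batch_size = base_batch_size // (L // fold_length), and groups samples greedily by a bin budget. In 'length' mode the budget counts the first dimension; in 'numel' mode it counts all elements.

Read literally, the formula divides by zero when L < fold_length. The sketch clamps that divisor to 1, which is an assumption. The greedy longest-first grouping is a simplification: the real ESPnet samplers may count padding and distribute samples differently. All function names here are hypothetical.

```python
from typing import Dict, List, Tuple


def read_shape_file(lines: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Parse 'utt_id 1000,80' (or 'utt_id 1000') lines into {utt_id: shape}."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, shape = line.split(maxsplit=1)
        shapes[utt_id] = tuple(int(dim) for dim in shape.strip().split(","))
    return shapes


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """base_batch_size // (L // fold_length); divisor and result clamped to >= 1."""
    divisor = max(1, max_len // fold_length)  # clamping is an assumption
    return max(1, base_batch_size // divisor)


def bin_cost(shape: Tuple[int, ...], numel: bool) -> int:
    """'length' mode counts the first dimension; 'numel' mode counts all elements."""
    if not numel:
        return shape[0]
    total = 1
    for dim in shape:
        total *= dim
    return total


def greedy_bin_batches(
    shapes: Dict[str, Tuple[int, ...]], batch_bins: int, numel: bool = False
) -> List[List[str]]:
    """Longest-first grouping that keeps each batch's total cost <= batch_bins
    (a single sample larger than batch_bins still gets its own batch)."""
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for utt_id in sorted(shapes, key=lambda k: shapes[k][0], reverse=True):
        cost = bin_cost(shapes[utt_id], numel)
        if current and total + cost > batch_bins:
            batches.append(current)  # close the batch before it overflows
            current, total = [], 0
        current.append(utt_id)
        total += cost
    if current:
        batches.append(current)
    return batches


shapes = read_shape_file(
    ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
)
print(folded_batch_size(20, max_len=1453, fold_length=500))
print(greedy_bin_batches(shapes, batch_bins=2500))
print(greedy_bin_batches(shapes, batch_bins=200000, numel=True))
```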
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all chunk_lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between consecutive chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
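The --chunk_length and --chunk_shift_ratio descriptions can be illustrated with a small sketch. It parses the three chunk_length forms and picks a length that fits the sequence. It then lists chunk start offsets, with shift = chunk_length * chunk_shift_ratio, so a ratio below 1 gives overlap and above 1 gives gaps.

The sketch rests on a few assumptions:

- The '300-400' range is inclusive of both ends.
- Only lengths no longer than the sequence are eligible, and a sample is dropped when none fit.
- Chunks start at offset 0.

The helper names are hypothetical, not ChunkIterFactory internals.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-400' -> [300, ..., 400]."""
    if "-" in spec:
        bounds = [int(x) for x in spec.split("-")]
        if len(bounds) != 2 or bounds[0] > bounds[1]:
            raise ValueError(f"invalid chunk_length range: {spec!r}")
        return list(range(bounds[0], bounds[1] + 1))  # inclusive range assumed
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(
    seq_len: int, candidates: List[int], rng: random.Random
) -> Optional[int]:
    """Randomly choose a candidate that fits; None means the sample is discarded."""
    fitting = [c for c in candidates if c <= seq_len]
    return rng.choice(fitting) if fitting else None


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of full chunks, shifted by chunk_length * shift_ratio."""
    if shift_ratio <= 0:
        raise ValueError("shift_ratio must be positive")
    if seq_len < chunk_length:
        return []
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


rng = random.Random(0)
length = pick_chunk_length(1000, parse_chunk_length("300,400,500"), rng)
if length is not None:
    print(length, chunk_starts(1000, length, 0.5))
```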
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile: wav, flac, etc.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. The lower and upper values are given by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
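Each --train_data_path_and_name_and_type value is a comma-separated triple, and the numeric text types differ only in separator and number type. The sketch below splits a triple and parses the 'text_int', 'csv_int', 'text_float' and 'csv_float' layouts shown above into plain Python lists. ESPnet itself returns arrays, so this is an illustration of the file formats rather than of its loaders. `parse_data_triple` and `load_numeric_table` are hypothetical names.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]


def parse_data_triple(spec: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {spec!r}")
    return parts[0], parts[1], parts[2]


def load_numeric_table(lines: List[str], file_type: str) -> Dict[str, List[Number]]:
    """Parse 'utt_id v1 v2 ...' (text_*) or 'utt_id v1,v2,...' (csv_*) lines."""
    separators = {"text_int": None, "text_float": None, "csv_int": ",", "csv_float": ","}
    if file_type not in separators:
        raise ValueError(f"unsupported file type: {file_type}")
    sep = separators[file_type]  # None means split on any whitespace
    cast = int if file_type.endswith("_int") else float
    table: Dict[str, List[Number]] = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, values = line.split(maxsplit=1)
        table[utt_id] = [cast(v) for v in values.strip().split(sep)]
    return table


path, name, file_type = parse_data_triple("some/path/a.scp,foo,sound")
print(path, name, file_type)
print(load_numeric_table(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"], "text_int"))
```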
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'extract_feats_in_collect_stats': False})
--criterions CRITERIONS
The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])
Preprocess related:
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Scale the maximum amplitude to the given value or range. e.g. --speech_volume_normalize 1.0 scales it to 1.0.
--speech_volume_normalize 0.5_1.0 scales it to a random number in the range [0.5, 1.0) (default: None)
--rir_scp RIR_SCP The file path of rir scp file. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The file path of noise scp file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of adding noise. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of the signal-to-noise ratio (SNR) level in decibels. (default: 13_15)
--short_noise_thres SHORT_NOISE_THRES
If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
--use_reverberant_ref USE_REVERBERANT_REF
Whether to use reverberant speech references instead of anechoic ones (default: False)
--num_spk NUM_SPK Number of speakers in the input signal. (default: 1)
--num_noise_type NUM_NOISE_TYPE
Number of noise types. (default: 1)
--sample_rate SAMPLE_RATE
Sampling rate of the data (in Hz). (default: 8000)
--force_single_channel FORCE_SINGLE_CHANNEL
Whether to force all data to be single-channel. (default: False)
--channel_reordering CHANNEL_REORDERING
Whether to randomly reorder the channels of the multi-channel signals. (default: False)
--categories CATEGORIES [CATEGORIES ...]
The set of all possible categories in the dataset. Used to add the category information to each sample (default: [])
--dynamic_mixing DYNAMIC_MIXING
Apply dynamic mixing (default: False)
--utt2spk UTT2SPK The file path of utt2spk file. Only used in dynamic_mixing mode. (default: None)
--dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB
Random gain (in dB) for dynamic mixing sources (default: 0.0)
--encoder {stft,conv,same}
The encoder type (default: stft)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet}
The separator type (default: rnn)
--separator_conf SEPARATOR_CONF
The keyword arguments for separator (default: {})
--decoder {stft,conv,same}
The decoder type (default: stft)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--mask_module {multi_mask}
The mask_module type (default: multi_mask)
--mask_module_conf MASK_MODULE_CONF
The keyword arguments for mask_module (default: {})
--preprocessor {dynamic_mixing,enh,None}
The preprocessor type (default: None)
--preprocessor_conf PREPROCESSOR_CONF
The keyword arguments for preprocessor (default: {})
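Two preprocessing options take an underscore-separated range: --noise_db_range (e.g. 13_15) and --speech_volume_normalize (e.g. 0.5_1.0). The sketch below parses such a range and draws an SNR from it. It then scales the noise so that 10*log10(P_speech / P_noise) equals the drawn SNR, using mean power per sample, and separately rescales a signal to a target peak.

Several choices are assumptions, and ESPnet's preprocessor may compute them differently: the mean-power definition, one draw per utterance, and a single value meaning a fixed level. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def parse_range(spec: str) -> Tuple[float, float]:
    """'13_15' -> (13.0, 15.0); a single value such as '1.0' -> (1.0, 1.0)."""
    parts = [float(x) for x in spec.split("_")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) != 2 or parts[0] > parts[1]:
        raise ValueError(f"invalid range: {spec!r}")
    return parts[0], parts[1]


def mean_power(x: Sequence[float]) -> float:
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale noise so that 10*log10(P_speech / P_scaled_noise) == snr_db, then add it."""
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def volume_normalize(x: Sequence[float], peak_target: float) -> List[float]:
    """Rescale so that max(|x|) equals peak_target (silent input is returned as-is)."""
    peak = max(abs(v) for v in x)
    return [v * peak_target / peak for v in x] if peak > 0 else list(x)


rng = random.Random(0)
snr_db = rng.uniform(*parse_range("13_15"))  # one SNR draw for this utterance
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db)
peak_target = rng.uniform(*parse_range("0.5_1.0"))
normalized = volume_normalize(noisy, peak_target)
```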
enh_tse_inference.py
usage: enh_tse_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU]
[--seed SEED] [--dtype {float16,float32,float64}]
[--fs FS] [--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--normalize_output_wav NORMALIZE_OUTPUT_WAV]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE] [--model_tag MODEL_TAG]
[--inference_config INFERENCE_CONFIG]
[--batch_size BATCH_SIZE]
[--segment_size SEGMENT_SIZE]
[--hop_size HOP_SIZE]
[--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
[--show_progressbar SHOW_PROGRESSBAR]
[--ref_channel REF_CHANNEL]
Frontend inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--fs FS Sampling rate (default: 8000)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Output data related:
--normalize_output_wav NORMALIZE_OUTPUT_WAV
Whether to normalize the predicted wav to [-1~1]
(default: False)
The model configuration related:
--train_config TRAIN_CONFIG
Training configuration file (default: None)
--model_file MODEL_FILE
Model parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
train_config and model_file will be overwritten
(default: None)
--inference_config INFERENCE_CONFIG
Optional configuration file for overwriting enh model
attributes during inference (default: None)
Data loading related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
SeparateSpeech related:
--segment_size SEGMENT_SIZE
Segment length in seconds for segment-wise speech
enhancement/separation (default: None)
--hop_size HOP_SIZE Hop length in seconds for segment-wise speech
enhancement/separation (default: None)
--normalize_segment_scale NORMALIZE_SEGMENT_SCALE
Whether to normalize the energy of the separated
streams in each segment (default: False)
--show_progressbar SHOW_PROGRESSBAR
Whether to show a progress bar when performing
segment-wise speech enhancement/separation (default:
False)
--ref_channel REF_CHANNEL
If not None, this will overwrite the ref_channel
defined in the extractor module (for multi-channel
speech processing) (default: None)
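The segment-wise options above are given in seconds, whereas the waveform is processed in samples. A minimal sketch of how segment boundaries could be derived from `--fs`, `--segment_size` and `--hop_size` follows. `segment_bounds` is a hypothetical helper, not ESPnet's implementation, and right-aligning the last segment so that the tail of the signal is covered is an assumption made only for this illustration.

```python
from typing import List, Tuple


def segment_bounds(
    num_samples: int, fs: int, segment_size: float, hop_size: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping segments.

    Hypothetical helper: segment_size and hop_size are in seconds, like the
    --segment_size / --hop_size options; fs is the sampling rate in Hz.
    """
    seg = int(segment_size * fs)  # segment length in samples
    hop = int(hop_size * fs)  # hop length in samples
    if seg <= 0 or hop <= 0:
        raise ValueError("segment_size and hop_size must be positive")
    if num_samples <= seg:
        return [(0, num_samples)]  # short input: process it in one piece
    bounds = []
    start = 0
    while start + seg < num_samples:
        bounds.append((start, start + seg))
        start += hop
    # Assumption: the last segment is right-aligned so the tail is not dropped.
    bounds.append((num_samples - seg, num_samples))
    return bounds


# 2.2 s of 8 kHz audio, 1 s segments, 0.5 s hop
print(segment_bounds(17600, 8000, 1.0, 0.5))
# → [(0, 8000), (4000, 12000), (8000, 16000), (9600, 17600)]
```

Overlapping segments then have to be stitched back together (and, with `--normalize_segment_scale`, rescaled); that step is not shown here.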
enh_tse_train.py¶
usage: enh_tse_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB]
[--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
[--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--model_conf MODEL_CONF] [--criterions CRITERIONS]
[--train_spk2enroll TRAIN_SPK2ENROLL]
[--enroll_segment ENROLL_SEGMENT]
[--load_spk_embedding LOAD_SPK_EMBEDDING]
[--load_all_speakers LOAD_ALL_SPEAKERS]
[--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--short_noise_thres SHORT_NOISE_THRES]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--use_reverberant_ref USE_REVERBERANT_REF]
[--num_spk NUM_SPK] [--num_noise_type NUM_NOISE_TYPE]
[--sample_rate SAMPLE_RATE]
[--force_single_channel FORCE_SINGLE_CHANNEL]
[--channel_reordering CHANNEL_REORDERING]
[--categories CATEGORIES [CATEGORIES ...]]
[--encoder {stft,conv,same}]
[--encoder_conf ENCODER_CONF]
[--extractor {td_speakerbeam}]
[--extractor_conf EXTRACTOR_CONF]
[--decoder {stft,conv,same}]
[--decoder_conf DECODER_CONF] [--preprocessor {tse}]
[--preprocessor_conf PREPROCESSOR_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images used to plot the outputs from attention. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
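With the default `--dist_init_method env://`, PyTorch's `torch.distributed.init_process_group` reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment. The sketch below only checks that these values are available before launching. `env_init_settings` is a hypothetical helper, and letting explicit address/port values (the analogues of `--dist_master_addr` and `--dist_master_port`) override the environment is an assumption about how such options are typically combined.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def env_init_settings(
    env: Optional[Mapping[str, str]] = None,
    master_addr: Optional[str] = None,
    master_port: Optional[int] = None,
) -> Dict[str, str]:
    """Collect the variables consulted by env:// initialization.

    Hypothetical helper: explicit master_addr / master_port values take
    precedence over the environment (an assumption for this sketch).
    """
    merged = dict(os.environ if env is None else env)
    if master_addr is not None:
        merged["MASTER_ADDR"] = master_addr
    if master_port is not None:
        merged["MASTER_PORT"] = str(master_port)
    missing = [key for key in REQUIRED_ENV if key not in merged]
    if missing:
        raise KeyError("env:// initialization needs: " + ", ".join(missing))
    return {key: merged[key] for key in REQUIRED_ENV}


settings = env_init_settings(
    {"WORLD_SIZE": "2", "RANK": "0"}, master_addr="127.0.0.1", master_port=29500
)
print(settings)
```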
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Perform on "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model when "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p of the p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into the gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
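`--grad_clip` and `--grad_clip_type` describe gradient-norm clipping: all gradients are treated as one vector, its p-norm is measured, and everything is rescaled when that norm exceeds the threshold. In PyTorch this rule is implemented by `torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type)`; the dependency-free sketch below mirrors it on plain lists of floats that stand in for gradient tensors. `clip_grad_norm` is a hypothetical name, not an ESPnet function.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float, norm_type: float = 2.0) -> float:
    """Rescale gradients in place so their global p-norm is at most max_norm.

    Each inner list stands in for one parameter's gradient tensor.
    Returns the total norm measured before clipping.
    """
    flat = [abs(g) for grad in grads for g in grad]
    if math.isinf(norm_type):
        total = max(flat, default=0.0)  # inf-norm: largest absolute entry
    else:
        total = sum(x ** norm_type for x in flat) ** (1.0 / norm_type)
    clip_coef = max_norm / (total + 1e-6)  # epsilon avoids division by zero
    if clip_coef < 1.0:
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= clip_coef
    return total


grads = [[3.0], [4.0]]
print(clip_grad_norm(grads, max_norm=1.0))  # 5.0 (norm before clipping)
print(grads)  # roughly [[0.6], [0.8]]
```

Because the same coefficient scales every gradient, clipping changes the step size but not the direction of the update.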
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
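The `--init_param` format can be illustrated with a small parser that splits a specification into its four fields and selects the matching entries of a state dict. This is a sketch under stated assumptions, not ESPnet's loader: `parse_init_param` and `select_params` are hypothetical names, several exclude keys are assumed to be comma-separated, excluded keys are assumed to be matched by prefix, and a plain dict stands in for a PyTorch state dict.

```python
from typing import Dict, List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]  # None: use every key in the model file
    dst_key: Optional[str]  # None: initialize the whole model
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'; missing fields become None/[]."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        exclude_keys=[key for key in excludes.split(",") if key],
    )


def select_params(state: Dict[str, object], spec: InitParamSpec) -> Dict[str, object]:
    """Return the entries a spec would load, renamed from src_key to dst_key."""
    kept = {
        key: value
        for key, value in state.items()
        if not any(key.startswith(ex) for ex in spec.exclude_keys)
    }
    if spec.src_key is None:
        return kept
    prefix = spec.src_key + "."
    selected = {}
    for key, value in kept.items():
        if key.startswith(prefix):
            rest = key[len(prefix):]
            selected[f"{spec.dst_key}.{rest}" if spec.dst_key else rest] = value
    return selected


state = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(select_params(state, spec))  # {'decoder.out.weight': 3}
```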
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is used, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose numbers of 'bins', counted as the total length of each feature in the mini-batch, are as equal as possible. This sampler requires a text file which describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches whose numbers of 'bins' are as equal as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
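To make the length-based batching described under `--batch_type` concrete, here is a greedy sketch that reads shape-file lines (only the first dimension is used, as described above) and fills each mini-batch until the summed length would exceed `--batch_bins`. It is a simplification for illustration, not ESPnet's `LengthBatchSampler`; `read_lengths` and `length_batches` are hypothetical names, and the real sampler may count bins differently (for example, by including padding).

```python
from typing import Dict, Iterable, List


def read_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Read 'utt_id 1000' or 'utt_id 1000,80' lines; only the first dimension is kept."""
    lengths = {}
    for line in lines:
        utt_id, shape = line.split(maxsplit=1)
        lengths[utt_id] = int(shape.split(",")[0])
    return lengths


def length_batches(lengths: Dict[str, int], batch_bins: int, descending: bool = True) -> List[List[str]]:
    """Greedily group utterances so each batch's summed length stays within batch_bins.

    An utterance longer than batch_bins still gets a batch of its own.
    """
    keys = sorted(lengths, key=lambda k: lengths[k], reverse=descending)
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in keys:
        if current and total + lengths[key] > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(key)
        total += lengths[key]
    if current:
        batches.append(current)
    return batches


shape_lines = ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
print(length_batches(read_lengths(shape_lines), batch_bins=2700))
# → [['utterance_id_b', 'utterance_id_c'], ['utterance_id_a']]
```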
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between consecutive chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
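The chunk options above can be sketched as a few small helpers: expanding a `--chunk_length` specification, picking a usable length per sample, choosing chunk start offsets from `--chunk_shift_ratio`, and applying the `--chunk_excluded_key_prefixes` rules. These are hypothetical helpers rather than ESPnet's `ChunkIterFactory`. Treating the shift as `chunk_length * chunk_shift_ratio`, treating both ends of a '300-400' range as inclusive, and reading "end with numbers" as "the prefix followed only by digits" are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500' or '300-400' into candidate chunk lengths."""
    if "-" in spec:
        low, high = (int(x) for x in spec.split("-"))
        return list(range(low, high + 1))  # assumption: inclusive range
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(candidates: Sequence[int], seq_len: int, rng: random.Random) -> Optional[int]:
    """Pick a random usable chunk length, or None if the sample must be discarded."""
    usable = [c for c in candidates if c <= seq_len]
    return rng.choice(usable) if usable else None


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of chunks; shift_ratio < 1 overlaps, > 1 leaves gaps."""
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


def is_excluded(key: str, prefixes: Sequence[str]) -> bool:
    """True if key equals a prefix, or is a prefix followed only by digits."""
    return any(
        key == p or (key.startswith(p) and key[len(p):].isdigit()) for p in prefixes
    )


rng = random.Random(0)
length = pick_chunk_length(parse_chunk_length("300,400,500"), 450, rng)
print(length, chunk_starts(1000, 500, 0.5))
```

With a key list such as `["enroll_ref"]` (used here purely as an example), `enroll_ref1` and `enroll_ref` would be skipped by the length check, while `speech` would not.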
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, e.g. wav, flac, etc.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi' and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
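As a sketch of the data-file conventions listed under `--train_data_path_and_name_and_type`, the helpers below split the 'path,name,type' triple and parse one line of the 'text_int', 'csv_int', 'text_float' and 'csv_float' types. `parse_data_spec` and `parse_line` are hypothetical; splitting the triple from the right (so that a comma inside the path survives) is an assumption, not documented ESPnet behavior.

```python
from typing import List, Tuple, Union

Number = Union[int, float]
NUMERIC_TYPES = ("text_int", "csv_int", "text_float", "csv_float")


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    path, name, file_type = spec.rsplit(",", 2)
    return path, name, file_type


def parse_line(line: str, file_type: str) -> Tuple[str, List[Number]]:
    """Parse one 'utterance_id values' line of a numeric text/csv file."""
    if file_type not in NUMERIC_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    utt_id, payload = line.split(maxsplit=1)
    sep = "," if file_type.startswith("csv") else None  # None: split on whitespace
    cast = int if file_type.endswith("_int") else float
    return utt_id, [cast(v) for v in payload.split(sep)]


print(parse_data_spec("some/path/a.scp,foo,sound"))  # ('some/path/a.scp', 'foo', 'sound')
print(parse_line("utterance_id_A 12 0 1 3", "text_int"))  # ('utterance_id_A', [12, 0, 1, 3])
print(parse_line("utterance_id_B 3.,3.12,1.1", "csv_float"))  # ('utterance_id_B', [3.0, 3.12, 1.1])
```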
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
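`--exclude_weight_decay` relies on splitting the model parameters into optimizer groups with and without weight decay. The sketch below shows that grouping on parameter names and shapes only. The rule used here (names ending in '.bias' and 1-D parameters such as normalization weights get zero decay) is an assumption for illustration; ESPnet's actual logic is in `espnet2.optimizers.optim_groups.configure_optimizer`. With PyTorch, the lists would hold the parameter tensors, and the groups could be passed to an optimizer such as `torch.optim.AdamW(groups, lr=...)`, which accepts per-group options.

```python
from typing import Dict, List, Tuple


def split_weight_decay(
    param_shapes: Dict[str, Tuple[int, ...]], weight_decay: float
) -> List[dict]:
    """Return two optimizer parameter groups: with and without weight decay.

    Hypothetical rule: '*.bias' parameters and 1-D parameters (e.g. the weight
    of a LayerNorm) are excluded from weight decay.
    """
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        if name.endswith(".bias") or len(shape) == 1:
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


shapes = {"linear.weight": (4, 3), "linear.bias": (4,), "norm.weight": (3,)}
print(split_weight_decay(shapes, weight_decay=0.01))
```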
Task related:
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'num_spk': 1, 'share_encoder': True, 'extract_feats_in_collect_stats': False})
--criterions CRITERIONS
The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])
Preprocess related:
--train_spk2enroll TRAIN_SPK2ENROLL
The scp file containing the mapping from speakerID to enrollment
(This is used to sample the target-speaker enrollment signal) (default: None)
--enroll_segment ENROLL_SEGMENT
Truncate the enrollment audio to the specified length if not None (default: None)
--load_spk_embedding LOAD_SPK_EMBEDDING
Whether to load speaker embeddings instead of enrollments (default: False)
--load_all_speakers LOAD_ALL_SPEAKERS
Whether to load target-speaker for all speakers in each sample (default: False)
--rir_scp RIR_SCP The file path of rir scp file. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The file path of noise scp file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of applying noise addition. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of signal-to-noise ratio (SNR) level in decibel. (default: 13_15)
--short_noise_thres SHORT_NOISE_THRES
If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Scale the maximum amplitude to the given value or range. e.g. --speech_volume_normalize 1.0 scales it to 1.0.
--speech_volume_normalize 0.5_1.0 scales it to a random number in the range [0.5, 1.0) (default: None)
--use_reverberant_ref USE_REVERBERANT_REF
Whether to use reverberant speech references instead of anechoic ones (default: False)
--num_spk NUM_SPK Number of speakers in the input signal. (default: 1)
--num_noise_type NUM_NOISE_TYPE
Number of noise types. (default: 1)
--sample_rate SAMPLE_RATE
Sampling rate of the data (in Hz). (default: 8000)
--force_single_channel FORCE_SINGLE_CHANNEL
Whether to force all data to be single-channel. (default: False)
--channel_reordering CHANNEL_REORDERING
Whether to randomly reorder the channels of the multi-channel signals. (default: False)
--categories CATEGORIES [CATEGORIES ...]
The set of all possible categories in the dataset. Used to add the category information to each sample (default: [])
--encoder {stft,conv,same}
The encoder type (default: stft)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--extractor {td_speakerbeam}
The extractor type (default: td_speakerbeam)
--extractor_conf EXTRACTOR_CONF
The keyword arguments for extractor (default: {})
--decoder {stft,conv,same}
The decoder type (default: stft)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--preprocessor {tse} The preprocessor type (default: tse)
--preprocessor_conf PREPROCESSOR_CONF
The keyword arguments for preprocessor (default: {})
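Two of the preprocessing options above take ranges written as 'low_high': `--noise_db_range` (an SNR range in dB) and `--speech_volume_normalize` (a target peak amplitude). The sketch below parses such ranges, scales noise to a target SNR using the usual power-ratio definition SNR = 10 * log10(P_speech / P_noise), and rescales the peak amplitude. The helper names are hypothetical and plain lists stand in for waveforms; this is not ESPnet's preprocessor code.

```python
import math
import random
from typing import List, Tuple


def parse_range(spec: str) -> Tuple[float, float]:
    """'13_15' -> (13.0, 15.0); a single value such as '1.0' -> (1.0, 1.0)."""
    parts = [float(x) for x in spec.split("_")]
    if len(parts) == 1:
        return parts[0], parts[0]
    low, high = parts
    return low, high


def power(x: List[float]) -> float:
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale noise so that 10*log10(P_speech / P_noise) == snr_db, then add it.

    Assumes len(noise) >= len(speech); shorter noise would need padding or
    tiling, which is the situation --short_noise_thres warns about.
    """
    noise = noise[: len(speech)]
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def normalize_volume(x: List[float], low: float, high: float, rng: random.Random) -> List[float]:
    """Rescale x so its peak absolute amplitude is drawn uniformly from [low, high]."""
    peak = max(abs(v) for v in x)
    if peak == 0.0:
        return list(x)  # silent input: nothing to rescale
    target = low if low == high else rng.uniform(low, high)
    return [v * target / peak for v in x]


rng = random.Random(0)
low_db, high_db = parse_range("13_15")
mixed = add_noise([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, 0.5, 0.5], rng.uniform(low_db, high_db))
vol_low, vol_high = parse_range("0.5_1.0")
print(normalize_volume(mixed, vol_low, vol_high, rng))
```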
gan_svs_train.py¶
usage: gan_svs_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB]
[--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
[--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim2_conf OPTIM2_CONF]
[--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler2_conf SCHEDULER2_CONF]
[--generator_first GENERATOR_FIRST]
[--token_list TOKEN_LIST] [--odim ODIM]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--fs FS]
[--score_feats_extract {frame_score_feats,syllable_score_feats}]
[--score_feats_extract_conf SCORE_FEATS_EXTRACT_CONF]
[--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
[--feats_extract_conf FEATS_EXTRACT_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--svs {vits,joint_score2wav}] [--svs_conf SVS_CONF]
[--pitch_extract {dio,None}]
[--pitch_extract_conf PITCH_EXTRACT_CONF]
[--pitch_normalize {global_mvn,utterance_mvn,None}]
[--pitch_normalize_conf PITCH_NORMALIZE_CONF]
[--ying_extract {ying,None}]
[--ying_extract_conf YING_EXTRACT_CONF]
[--energy_extract {energy,None}]
[--energy_extract_conf ENERGY_EXTRACT_CONF]
[--energy_normalize {global_mvn,utterance_mvn,None}]
[--energy_normalize_conf ENERGY_NORMALIZE_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--generator_first GENERATOR_FIRST
Whether to update generator first. (default: False)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--fs FS sample rate (default: 24000)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase ("train" or "valid") and the criterion name. The mode, "min" or "max", can be changed via --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criteria used for judging the best model. Give one or more triples of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Keep the snapshots of the n-best scored epochs and remove the other previous snapshots (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into the gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading without model forwarding and training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every N iterations in each epoch during the training phase. If None is given, N is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
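A minimal pure-Python sketch of what --grad_clip and --grad_clip_type control, assuming the trainer follows the semantics of PyTorch's torch.nn.utils.clip_grad_norm_ (max_norm, norm_type). The helper name clip_grad_norm and the list-of-lists gradient layout are hypothetical stand-ins for real tensors.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float,
                   norm_type: float = 2.0) -> float:
    """Rescale every gradient in place so the total p-norm is <= max_norm.

    Hypothetical helper: gradients are plain lists of floats (one list per
    parameter) instead of torch tensors. Returns the norm before clipping.
    """
    magnitudes = [abs(g) for param in grads for g in param]
    if math.isinf(norm_type):
        # grad_clip_type=inf: the norm is the largest absolute entry
        total_norm = max(magnitudes, default=0.0)
    else:
        total_norm = sum(m ** norm_type for m in magnitudes) ** (1.0 / norm_type)
    # Same small epsilon that torch.nn.utils.clip_grad_norm_ adds
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for param in grads:
            for i in range(len(param)):
                param[i] *= clip_coef
    return total_norm


# Like --grad_clip 1.0 --grad_clip_type 2.0 on gradients whose L2 norm is 5
grads = [[3.0, 4.0], [0.0]]
norm = clip_grad_norm(grads, max_norm=1.0, norm_type=2.0)
print(norm, grads)
```

Because the p-norm over all parameters equals the p-norm of the per-parameter norms, flattening everything into one list gives the same total norm as clipping parameter by parameter.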
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
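The --init_param format can be illustrated with a small sketch that splits the '<file_path>:<src_key>:<dst_key>:<exclude_keys>' string and applies it to a flat dictionary of model states. The names parse_init_param, select_states, and InitParamSpec are hypothetical, and two details are assumptions rather than confirmed ESPnet behavior: multiple exclude keys are comma-separated, and excluded keys are matched against the full original state names.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields."""
    fields = spec.split(":")
    if len(fields) > 4 or not fields[0]:
        raise ValueError(f"invalid --init_param value: {spec!r}")
    fields += [""] * (4 - len(fields))  # missing trailing fields are empty
    path, src, dst, excl = fields
    return InitParamSpec(path, src or None, dst or None,
                         [k for k in excl.split(",") if k])  # assumption: ','


def select_states(states: dict, spec: InitParamSpec) -> dict:
    """Drop excluded keys, keep only src_key states, re-prefix with dst_key."""
    out = {}
    for key, value in states.items():
        if any(key == ex or key.startswith(ex + ".") for ex in spec.exclude_keys):
            continue
        if spec.src_key is not None:
            if not key.startswith(spec.src_key + "."):
                continue
            key = key[len(spec.src_key) + 1:]
        if spec.dst_key is not None:
            key = spec.dst_key + "." + key
        out[key] = value
    return out


spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(spec)
print(select_states({"encoder.w": 1, "decoder.w": 2, "decoder.embed.w": 3}, spec))
```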
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain as nearly the same number of 'bins' as possible, where the bins are counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches with as nearly the same number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus, this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
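The "folded", "length", and "numel" batch types can be sketched as follows. This is a simplified illustration, not ESPnet's sampler code: folded_batch_size applies the formula from the "folded" description with a hypothetical guard against division by zero when L < fold_length, and bin_batches is a greedy grouping that starts a new mini-batch once the summed cost would exceed batch_bins (cost is the first dimension for "length" and the product of all dimensions for "numel"). ESPnet's actual samplers may sort, pad, and count bins differently.

```python
import math
from typing import Dict, List, Tuple


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as in the help text."""
    # Hypothetical guard: the formula divides by zero when L < fold_length,
    # so this sketch keeps base_batch_size then and never returns less than 1.
    return max(1, base_batch_size // max(1, max_len // fold_length))


def bin_batches(shapes: Dict[str, Tuple[int, ...]], batch_bins: int,
                count_numel: bool = False) -> List[List[str]]:
    """Greedy sketch of 'length' (count_numel=False) and 'numel' batching."""
    def cost(key: str) -> int:
        # 'length' counts only the first dimension; 'numel' counts all elements
        return math.prod(shapes[key]) if count_numel else shapes[key][0]

    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in sorted(shapes, key=cost, reverse=True):  # longest first
        if current and total + cost(key) > batch_bins:
            batches.append(current)  # adding this sample would exceed the bins
            current, total = [], 0
        current.append(key)
        total += cost(key)
    if current:
        batches.append(current)
    return batches


shapes = {"utterance_id_a": (1000, 80), "utterance_id_b": (1453, 80),
          "utterance_id_c": (1241, 80)}
print(folded_batch_size(20, max_len=1453, fold_length=800))
print(bin_batches(shapes, batch_bins=2700))
print(bin_batches(shapes, batch_bins=200000, count_numel=True))
```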
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by their lengths. To enable this, the "shape_file" must contain length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the whole set of batches sample-wise. This is normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle among the specified number of cached chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
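The chunk options above can be sketched in a few lines: parse a --chunk_length value, pick a length that fits the sequence (discarding the sample if none fits), and lay out chunks with a shift of chunk_length * chunk_shift_ratio. The function names are hypothetical, and treating '300-400' as an inclusive range and starting the first chunk at frame 0 are assumptions of this sketch; ESPnet's ChunkIterFactory may, for example, randomize the offset.

```python
import random
from typing import List, Optional, Tuple


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300], '300,400,500' -> [300, 400, 500], '300-400' -> 300..400."""
    if "-" in spec:
        low, high = (int(x) for x in spec.split("-"))
        return list(range(low, high + 1))  # assumption: inclusive range
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(seq_len: int, candidates: List[int],
                      rng: random.Random) -> Optional[int]:
    """Randomly choose a chunk length that fits; None means discard the sample."""
    fitting = [c for c in candidates if c <= seq_len]
    return rng.choice(fitting) if fitting else None


def chunk_spans(seq_len: int, chunk_len: int,
                shift_ratio: float) -> List[Tuple[int, int]]:
    """(start, end) frames of each chunk; shift = chunk_len * shift_ratio."""
    shift = max(1, int(chunk_len * shift_ratio))
    return [(s, s + chunk_len) for s in range(0, seq_len - chunk_len + 1, shift)]


rng = random.Random(0)
length = pick_chunk_length(1000, parse_chunk_length("500"), rng)
print(length, chunk_spans(1000, length, shift_ratio=0.5))
```

With a ratio of 0.5, consecutive 500-frame chunks overlap by 250 frames; a ratio above 1 would leave gaps, matching the --chunk_shift_ratio description.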
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. This is used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile (wav, flac, etc.).
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float-ndarray with the shapes given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shapes given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization.
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
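A sketch of how a --train_data_path_and_name_and_type value and a few of the numeric file types described above could be parsed. The function names are hypothetical and this is not ESPnet's loader; using rsplit so that commas inside the path survive is a choice of this sketch, not documented behavior.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    # rsplit keeps any commas inside the path itself (a choice of this sketch)
    path, name, file_type = spec.rsplit(",", 2)
    return path, name, file_type


def load_numeric_lines(lines: List[str], file_type: str) -> Dict[str, list]:
    """Parse 'text_int', 'csv_int', 'text_float', or 'csv_float' lines."""
    cast = int if file_type.endswith("_int") else float
    sep = "," if file_type.startswith("csv_") else None  # None: any whitespace
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, values = line.strip().split(maxsplit=1)
        table[utt_id] = [cast(v) for v in values.split(sep)]
    return table


def rand_int_range(file_type: str) -> Optional[Tuple[int, int]]:
    """'rand_int_0_10' -> (0, 10); None for any other file type."""
    m = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    return (int(m.group(1)), int(m.group(2))) if m else None


print(parse_data_spec("some/path/a.scp,foo,sound"))
print(load_numeric_lines(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"], "text_int"))
print(rand_int_range("rand_int_0_10"))
```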
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim2_conf OPTIM2_CONF
The keyword arguments for optimizer (default: {})
--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler2_conf SCHEDULER2_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--odim ODIM The number of dimensions of the output feature (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {})
Preprocess related:
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The text will be tokenized at the specified token level (default: phn)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--score_feats_extract {frame_score_feats,syllable_score_feats}
The score_feats_extract type (default: frame_score_feats)
--score_feats_extract_conf SCORE_FEATS_EXTRACT_CONF
The keyword arguments for score_feats_extract (default: {})
--feats_extract {fbank,log_spectrogram,linear_spectrogram}
The feats_extract type (default: linear_spectrogram)
--feats_extract_conf FEATS_EXTRACT_CONF
The keyword arguments for feats_extract (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: None)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--svs {vits,joint_score2wav}
The svs type (default: vits)
--svs_conf SVS_CONF The keyword arguments for svs (default: {})
--pitch_extract {dio,None}
The pitch_extract type (default: None)
--pitch_extract_conf PITCH_EXTRACT_CONF
The keyword arguments for pitch_extract (default: {})
--pitch_normalize {global_mvn,utterance_mvn,None}
The pitch_normalize type (default: None)
--pitch_normalize_conf PITCH_NORMALIZE_CONF
The keyword arguments for pitch_normalize (default: {})
--ying_extract {ying,None}
The ying_extract type (default: None)
--ying_extract_conf YING_EXTRACT_CONF
The keyword arguments for ying_extract (default: {})
--energy_extract {energy,None}
The energy_extract type (default: None)
--energy_extract_conf ENERGY_EXTRACT_CONF
The keyword arguments for energy_extract (default: {})
--energy_normalize {global_mvn,utterance_mvn,None}
The energy_normalize type (default: None)
--energy_normalize_conf ENERGY_NORMALIZE_CONF
The keyword arguments for energy_normalize (default: {})
gan_tts_train.py
usage: gan_tts_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB]
[--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
[--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim2_conf OPTIM2_CONF]
[--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler2_conf SCHEDULER2_CONF]
[--generator_first GENERATOR_FIRST]
[--token_list TOKEN_LIST] [--odim ODIM]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
[--feats_extract_conf FEATS_EXTRACT_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--tts {vits,joint_text2wav,jets}]
[--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
[--pitch_extract_conf PITCH_EXTRACT_CONF]
[--pitch_normalize {global_mvn,utterance_mvn,None}]
[--pitch_normalize_conf PITCH_NORMALIZE_CONF]
[--energy_extract {energy,None}]
[--energy_extract_conf ENERGY_EXTRACT_CONF]
[--energy_normalize {global_mvn,utterance_mvn,None}]
[--energy_normalize_conf ENERGY_NORMALIZE_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--generator_first GENERATOR_FIRST
Whether to update generator first. (default: False)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
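With --dist_init_method env://, the four environment variables named above must be set before the process group is created; PyTorch's torch.distributed.init_process_group(backend="nccl", init_method="env://") reads the same variables. The sketch below only collects and validates them without importing torch; the names read_env_init and EnvInit, and the example values, are hypothetical.

```python
import os
from typing import Mapping, NamedTuple


class EnvInit(NamedTuple):
    master_addr: str
    master_port: int
    world_size: int
    rank: int


REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Mapping[str, str] = os.environ) -> EnvInit:
    """Collect the variables that init_method="env://" relies on."""
    missing = [k for k in REQUIRED if k not in env]
    if missing:
        raise KeyError("env:// initialization needs " + ", ".join(missing))
    return EnvInit(env["MASTER_ADDR"], int(env["MASTER_PORT"]),
                   int(env["WORLD_SIZE"]), int(env["RANK"]))


# Illustrative values for a 2-process run on one machine
example = {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500",
           "WORLD_SIZE": "2", "RANK": "0"}
print(read_env_init(example))
```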
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase ("train" or "valid") and the criterion name. The mode, "min" or "max", can be changed via --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criteria used for judging the best model. Give one or more triples of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Keep the snapshots of the n-best scored epochs and remove the other previous snapshots (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into the gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading without model forwarding and training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every N iterations in each epoch during the training phase. If None is given, N is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
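The interaction of --patience, --early_stopping_criterion, and --keep_nbest_models can be sketched over a per-epoch history of one criterion (for example, valid loss with mode "min"). The helpers should_stop and nbest_epochs are hypothetical, and the exact comparison is an assumption: this sketch stops once more than `patience` epochs have passed since the best epoch, and the real Trainer may count slightly differently.

```python
from typing import List, Optional


def _check_mode(mode: str) -> None:
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")


def should_stop(history: List[float], mode: str, patience: Optional[int]) -> bool:
    """history[i] is the criterion value after epoch i + 1."""
    _check_mode(mode)
    if patience is None or not history:
        return False  # --patience None disables early stopping
    best = min(history) if mode == "min" else max(history)
    best_epoch = history.index(best) + 1  # first epoch reaching the best value
    current_epoch = len(history)
    return current_epoch - best_epoch > patience


def nbest_epochs(history: List[float], mode: str, n: int) -> List[int]:
    """Epochs (1-based) whose snapshots --keep_nbest_models n would keep."""
    _check_mode(mode)
    order = sorted(range(len(history)), key=lambda i: history[i],
                   reverse=(mode == "max"))
    return sorted(i + 1 for i in order[:n])


valid_loss = [1.0, 0.8, 0.9, 0.85]
print(should_stop(valid_loss, "min", patience=1))
print(nbest_epochs(valid_loss, "min", n=2))
```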
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatches when loading a pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
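The --init_param format can be made concrete with a small sketch that splits the four fields and filters a flat state dict accordingly. It is an illustration under stated assumptions, not ESPnet's loader: the helper names are hypothetical, exclude keys are assumed to be comma-separated and matched against the source parameter names, and plain ints stand in for tensors.

```python
from typing import Dict, List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'; missing fields become None/[]."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':' separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    # Assumption: several exclude keys may be given separated by commas.
    exclude_keys = [k for k in excludes.split(",") if k]
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


def select_params(state: Dict[str, object], spec: InitParamSpec) -> Dict[str, object]:
    """Filter a flat state dict the way the spec describes (hypothetical helper)."""
    selected: Dict[str, object] = {}
    for name, value in state.items():
        # Drop anything at or under an excluded prefix (matched on source names).
        if any(name == ex or name.startswith(ex + ".") for ex in spec.exclude_keys):
            continue
        if spec.src_key is None:
            selected[name] = value
        elif name.startswith(spec.src_key + "."):
            # Strip the source prefix; the result would be loaded into model.<dst_key>.
            selected[name[len(spec.src_key) + 1:]] = value
    return selected


state = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(select_params(state, spec))  # {'out.weight': 3}
```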
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches of constant batch_size. This sampler does not require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that each hold as close as possible to the same number of 'bins', counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that each hold as close as possible to the same number of 'bins', but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the whole set of batches sample-wise. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
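The "folded" and "length" batch types can be illustrated with a short sketch. The first function applies the documented folded formula; as an assumption of this sketch, it clamps the divisor to at least 1 (it would be 0 when L < fold_length). The second is a simplified greedy packer in the spirit of "length", not LengthBatchSampler itself. Both function names are hypothetical.

```python
from typing import Dict, List


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """Documented rule: base_batch_size // (L // fold_length), with the divisor clamped to >= 1."""
    divisor = max(1, max_len // fold_length)
    return max(1, base_batch_size // divisor)


def length_batches(lengths: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Simplified greedy 'length' batching: fill each batch up to batch_bins total length."""
    keys = sorted(lengths, key=lambda k: lengths[k], reverse=True)  # longest first
    batches: List[List[str]] = []
    current: List[str] = []
    bins = 0
    for key in keys:
        # Close the current batch if adding this sample would exceed the bin budget.
        if current and bins + lengths[key] > batch_bins:
            batches.append(current)
            current, bins = [], 0
        current.append(key)
        bins += lengths[key]
    if current:
        batches.append(current)
    return batches


lengths = {"utterance_id_a": 1000, "utterance_id_b": 1453, "utterance_id_c": 1241}
print(folded_batch_size(20, max_len=1453, fold_length=500))  # 10
print(length_batches(lengths, batch_bins=2500))
# [['utterance_id_b'], ['utterance_id_c', 'utterance_id_a']]
```

For "numel", the same loop would count elements instead of lengths (e.g. 1000 * 80 for a "1000,80" entry). The real samplers handle further details that this sketch ignores.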
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple comma-separated numbers are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
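The --chunk_length and --chunk_shift_ratio options can be sketched as follows. The helper names are hypothetical, a '-' range is assumed to include both ends, and the shift is assumed to be int(chunk_length * chunk_shift_ratio); ChunkIterFactory may additionally randomize offsets, which this sketch omits.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-302' -> [300, 301, 302]."""
    if "-" in spec:
        low, high = (int(x) for x in spec.split("-"))
        if low > high:
            raise ValueError(f"invalid range {spec!r}")
        return list(range(low, high + 1))  # assumption: both ends inclusive
    return [int(x) for x in spec.split(",")]


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of fixed-length chunks; empty if the sequence is too short."""
    if seq_len < chunk_len:
        return []  # the sample would be discarded for this chunk length
    shift = max(1, int(chunk_len * shift_ratio))  # ratio < 1 -> overlapping chunks
    return list(range(0, seq_len - chunk_len + 1, shift))


print(chunk_starts(1000, 500, 0.5))  # [0, 250, 500]
```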
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three comma-separated values. Used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, determines the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile (wav, flac, etc.).
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enables a multi-column wav.scp. The following text file can be loaded as multi-channel audio data:
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first or second level:
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shape given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
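The "path,name,type" triple and the plain numeric file types above can be illustrated with a small reader. This sketch is not ESPnet's data loader: the helper names are hypothetical, it handles only text_int, csv_int, text_float, and csv_float, and splitting the triple from the right (so a comma inside the path survives) is a choice of this sketch.

```python
from typing import Dict, Iterable, List, Tuple, Union

Number = Union[int, float]


def parse_data_triple(value: str) -> Tuple[str, str, str]:
    """Split 'path,name,type' into its three parts."""
    parts = value.rsplit(",", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    path, name, file_type = parts
    return path, name, file_type


def load_numeric_lines(lines: Iterable[str], file_type: str) -> Dict[str, List[Number]]:
    """Read 'text_int', 'csv_int', 'text_float', or 'csv_float' style lines."""
    if file_type not in ("text_int", "csv_int", "text_float", "csv_float"):
        raise ValueError(f"unsupported type in this sketch: {file_type}")
    convert = int if file_type.endswith("_int") else float
    sep = "," if file_type.startswith("csv") else None  # None: split on whitespace
    data: Dict[str, List[Number]] = {}
    for line in lines:
        fields = line.split(maxsplit=1)
        if not fields:
            continue  # skip blank lines
        rest = fields[1].strip() if len(fields) > 1 else ""
        values = rest.split(sep) if rest else []
        data[fields[0]] = [convert(v) for v in values]
    return data


print(parse_data_triple("some/path/a.scp,foo,sound"))  # ('some/path/a.scp', 'foo', 'sound')
```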
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim2_conf OPTIM2_CONF
The keyword arguments for optimizer (default: {})
--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler2_conf SCHEDULER2_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--odim ODIM The number of dimensions of the output feature (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {})
Preprocess related:
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The text will be tokenized at the specified token level (default: phn)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--feats_extract {fbank,log_spectrogram,linear_spectrogram}
The feats_extract type (default: linear_spectrogram)
--feats_extract_conf FEATS_EXTRACT_CONF
The keyword arguments for feats_extract (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: None)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--tts {vits,joint_text2wav,jets}
The tts type (default: vits)
--tts_conf TTS_CONF The keyword arguments for tts (default: {})
--pitch_extract {dio,None}
The pitch_extract type (default: None)
--pitch_extract_conf PITCH_EXTRACT_CONF
The keyword arguments for pitch_extract (default: {})
--pitch_normalize {global_mvn,utterance_mvn,None}
The pitch_normalize type (default: None)
--pitch_normalize_conf PITCH_NORMALIZE_CONF
The keyword arguments for pitch_normalize (default: {})
--energy_extract {energy,None}
The energy_extract type (default: None)
--energy_extract_conf ENERGY_EXTRACT_CONF
The keyword arguments for energy_extract (default: {})
--energy_normalize {global_mvn,utterance_mvn,None}
The energy_normalize type (default: None)
--energy_normalize_conf ENERGY_NORMALIZE_CONF
The keyword arguments for energy_normalize (default: {})
hubert_train.py¶
usage: hubert_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS]
[--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP]
[--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--token_list TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--collate_fn_conf COLLATE_FN_CONF]
[--input_size INPUT_SIZE] [--num_classes NUM_CLASSES]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--pred_masked_weight PRED_MASKED_WEIGHT]
[--pred_nomask_weight PRED_NOMASK_WEIGHT]
[--loss_weights LOSS_WEIGHTS]
[--frontend {default,sliding_window}]
[--frontend_conf FRONTEND_CONF]
[--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--preencoder {sinc,None}]
[--preencoder_conf PREENCODER_CONF]
[--encoder {hubert_pretrain,torchaudio_hubert}]
[--encoder_conf ENCODER_CONF]
[--model {fairseq,torchaudio}]
[--model_conf MODEL_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--pred_masked_weight PRED_MASKED_WEIGHT
Weight of the prediction loss for masked frames (default: 1.0)
--pred_nomask_weight PRED_NOMASK_WEIGHT
Weight of the prediction loss for unmasked frames (default: 0.0)
--loss_weights LOSS_WEIGHTS
Weights for the additional loss terms (all terms except the first) (default: 0.0)
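One plausible reading of these three weights is a weighted sum of the masked prediction loss, the unmasked prediction loss, and any extra loss terms, in the way fairseq's HuBERT criterion combines them (a single loss weight is shared by all extra terms). The sketch below assumes that reading and ignores any per-sample normalization; the function and argument names are hypothetical and the scalar losses are placeholders.

```python
from typing import Sequence, Union


def hubert_total_loss(
    loss_masked: float,
    loss_unmasked: float,
    extra_losses: Sequence[float],
    pred_masked_weight: float = 1.0,
    pred_nomask_weight: float = 0.0,
    loss_weights: Union[float, Sequence[float]] = 0.0,
) -> float:
    """Weighted sum of HuBERT-style loss terms (hypothetical sketch)."""
    total = pred_masked_weight * loss_masked + pred_nomask_weight * loss_unmasked
    if isinstance(loss_weights, (int, float)):
        weights = [float(loss_weights)] * len(extra_losses)  # one weight shared by all
    else:
        weights = list(loss_weights)
        if len(weights) == 1:
            weights = weights * len(extra_losses)  # broadcast a single weight
        if len(weights) != len(extra_losses):
            raise ValueError("need one weight per additional loss term")
    total += sum(w * loss for w, loss in zip(weights, extra_losses))
    return total


print(hubert_total_loss(2.0, 4.0, [10.0]))  # 2.0 with the default weights
```

With the defaults shown above (1.0, 0.0, 0.0), only the masked-frame loss contributes.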
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of GPUs. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot for the attention outputs. This option is only meaningful for attention-based models. Setting it to 0 disables the attention plot (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, where each node has N GPUs. This is the fastest way to use PyTorch for either single-node or multi-node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
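With an init method of env://, PyTorch's torch.distributed.init_process_group reads MASTER_ADDR and MASTER_PORT from the environment, along with WORLD_SIZE and RANK unless those are passed as arguments. The standard-library sketch below only checks for and parses these four variables; the function name is hypothetical, and it does not reproduce how ESPnet fills them from --dist_master_addr and the related options.

```python
import os
from typing import Dict, Mapping, Optional

ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Collect the env:// rendezvous settings, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in ENV_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


example = {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
print(read_env_init(example))
```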
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion whose value is given to the lr scheduler. Give a pair: the phase ("train" or "valid") and the criterion name. The mode, "min" or "max", can be changed via --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for early stopping. Give a triple: the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criteria used for selecting the best model. Give one or more triples of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p-norm type used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of iterations over which gradients are accumulated before each optimizer step (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without running the model forward pass or training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires PyTorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
The interval, in iterations, at which logs are shown during each training epoch. If None is given, it is decided automatically from the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
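--grad_clip and --grad_clip_type describe norm-based gradient clipping. Assuming the same rule as PyTorch's torch.nn.utils.clip_grad_norm_ (compute the total p-norm over all gradients, then scale by max_norm / (total_norm + 1e-6) when that factor is below 1), a sketch on plain floats instead of tensors looks like this. The function name is hypothetical.

```python
import math
from typing import List, Tuple


def clip_grad_norm(grads: List[float], max_norm: float, norm_type: float = 2.0) -> Tuple[float, List[float]]:
    """Return (total_norm, clipped_grads) for a flat list of gradient values."""
    if math.isinf(norm_type):
        total_norm = max((abs(g) for g in grads), default=0.0)  # inf-norm: largest magnitude
    else:
        total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:  # only shrink, never enlarge
        grads = [g * clip_coef for g in grads]
    return total_norm, grads


total, clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(total)  # 5.0
```

Passing norm_type=float("inf") corresponds to --grad_clip_type inf.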
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatches when loading a pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches of constant batch_size. This sampler does not require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that each hold as close as possible to the same number of 'bins', counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that each hold as close as possible to the same number of 'bins', but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the whole set of batches sample-wise. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
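The key and shape files shown for --batch_type can be read with a few lines of Python. The sketch below, with hypothetical helper names, parses such lines, forms fixed-size batches from length-sorted keys in the spirit of "sorted", and then orders the samples inside each batch as --sort_in_batch describes. It is a simplification, not SortedBatchSampler itself.

```python
from typing import Dict, Iterable, List, Tuple


def read_shape_file(lines: Iterable[str]) -> Dict[str, Tuple[int, ...]]:
    """Parse 'utt_id 1000' or 'utt_id 1000,80' lines into {utt_id: shape}."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    for line in lines:
        fields = line.split(maxsplit=1)
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"no length/shape given for {fields[0]!r}")
        shapes[fields[0]] = tuple(int(x) for x in fields[1].split(","))
    return shapes


def sorted_batches(shapes: Dict[str, Tuple[int, ...]], batch_size: int,
                   sort_in_batch: str = "descending") -> List[List[str]]:
    """Group length-sorted keys into fixed-size batches, then order each batch."""
    keys = sorted(shapes, key=lambda k: shapes[k][0])  # only the first dimension is used
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    if sort_in_batch == "descending":
        return [list(reversed(batch)) for batch in batches]
    if sort_in_batch == "ascending":
        return batches
    raise ValueError(f"unknown sort_in_batch: {sort_in_batch}")


shapes = read_shape_file(["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"])
print(sorted_batches(shapes, batch_size=2))
# [['utterance_id_c', 'utterance_id_a'], ['utterance_id_b']]
```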
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple comma-separated numbers are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
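The two exclusion conditions for --chunk_excluded_key_prefixes can be expressed as one small predicate: a key is excluded if it equals a prefix, or if it is a prefix followed only by digits. The function name and the example key names ("spk_labels", "speech") are hypothetical.

```python
import re
from typing import Iterable


def is_excluded_key(key: str, prefixes: Iterable[str]) -> bool:
    """True if key equals a prefix, or is a prefix followed only by digits."""
    for prefix in prefixes:
        if key == prefix or re.fullmatch(re.escape(prefix) + r"\d+", key):
            return True
    return False


print(is_excluded_key("spk_labels2", ["spk_labels"]))  # True
print(is_excluded_key("spk_labels_a", ["spk_labels"]))  # False
```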
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three comma-separated values. Used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, determines the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile (wav, flac, etc.).
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enables a multi-column wav.scp. The following text file can be loaded as multi-channel audio data:
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of floats separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file containing arrays at the first or second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float ndarray with the shape given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shape given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 generates integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization.
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used. (default: None)
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--collate_fn_conf COLLATE_FN_CONF
The keyword arguments for collate_fn class. (default: {})
--input_size INPUT_SIZE
The input dimension of the feature (default: None)
--num_classes NUM_CLASSES
The number of classes in HuBERT (default: None)
Preprocess related:
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The text will be tokenized in the specified level token (default: bpe)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Scale the maximum amplitude to the given value. (default: None)
--rir_scp RIR_SCP The file path of rir scp file. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The file path of noise scp file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of adding noise. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of noise decibel level. (default: 13_15)
--frontend {default,sliding_window}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--preencoder {sinc,None}
The preencoder type (default: None)
--preencoder_conf PREENCODER_CONF
The keyword arguments for preencoder (default: {})
--encoder {hubert_pretrain,torchaudio_hubert}
The encoder type (default: hubert_pretrain)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--model {fairseq,torchaudio}
The model type (default: fairseq)
--model_conf MODEL_CONF
The keyword arguments for model (default: {})
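The --noise_db_range and --speech_volume_normalize options can be pictured with the following minimal pure-Python sketch. It assumes that the range is an underscore-separated low_high pair in dB and that the drawn value acts as the speech-to-noise ratio; parse_db_range, add_noise, and normalize_volume are hypothetical helpers, not ESPnet's preprocessor code.

```python
import math
import random
from typing import List, Tuple


def parse_db_range(spec: str) -> Tuple[float, float]:
    """Parse "13_15" into (13.0, 15.0); a single value "13" gives (13.0, 13.0)."""
    parts = spec.split("_")
    if len(parts) == 1:
        return float(parts[0]), float(parts[0])
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    raise ValueError(f"Format error in noise_db_range: {spec!r}")


def add_noise(speech: List[float], noise: List[float], db_range: str = "13_15") -> List[float]:
    """Mix noise into speech at a speech-to-noise ratio (dB) drawn from db_range (assumed meaning)."""
    low, high = parse_db_range(db_range)
    snr_db = random.uniform(low, high)
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = max(sum(n * n for n in noise) / len(noise), 1e-10)
    # Choose scale so that 10 * log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = 10 ** (-snr_db / 20) * math.sqrt(p_speech / p_noise)
    return [x + scale * n for x, n in zip(speech, noise)]


def normalize_volume(speech: List[float], target: float = 1.0) -> List[float]:
    """Rescale the signal so that its maximum absolute amplitude equals target."""
    peak = max(abs(x) for x in speech)
    if peak == 0:
        return list(speech)
    return [x * target / peak for x in speech]
```

Drawing the ratio per call, rather than fixing it, is what gives the augmentation its variety; --noise_apply_prob would then decide whether add_noise is called at all.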
hugging_face_export_vocabulary.py¶
usage: hugging_face_export_vocabulary.py [-h]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output OUTPUT --model_name_or_path
MODEL_NAME_OR_PATH
[--add_symbol ADD_SYMBOL]
Export Hugging Face vocabulary
optional arguments:
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output OUTPUT, -o OUTPUT
Output text. - indicates sys.stdout (default: None)
--model_name_or_path MODEL_NAME_OR_PATH
Hugging Face model name or path (default: None)
--add_symbol ADD_SYMBOL
Append symbol e.g. --add_symbol '<blank>:0'
--add_symbol '<unk>:1' (default: [])
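A minimal sketch of how '<symbol>:<index>' values such as '<blank>:0' and '<unk>:1' can be applied to an exported vocabulary list, in the order given. add_symbols is a hypothetical helper; treating a negative index as counting from the end (so -1 appends last) is an assumption of this sketch.

```python
from typing import List


def add_symbols(vocab: List[str], specs: List[str]) -> List[str]:
    """Insert symbols given as "<symbol>:<index>" into a copy of vocab."""
    words = list(vocab)
    for spec in specs:
        symbol, _, idx_str = spec.rpartition(":")
        if not symbol:
            raise ValueError(f"Format error, expected e.g. '<blank>:0': {spec!r}")
        idx = int(idx_str)
        if idx < 0:
            # Assumed convention: -1 appends as the last symbol.
            idx = len(words) + 1 + idx
        words.insert(idx, symbol.strip())
    return words
```

Because the symbols are inserted one after another, '<unk>:1' lands after '<blank>' once '<blank>:0' has been applied.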
launch.py¶
usage: launch.py [-h] [--cmd CMD] [--log LOG]
[--max_num_log_files MAX_NUM_LOG_FILES] [--ngpu NGPU]
[--num_nodes NUM_NODES | --host HOST] [--envfile ENVFILE]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
[--init_file_prefix INIT_FILE_PREFIX]
args [args ...]
Launch distributed process with appropriate options.
positional arguments:
args
optional arguments:
--cmd CMD The path of the Kaldi cmd script: run.pl, queue.pl, or
slurm.pl (default: utils/run.pl)
--log LOG The path of log file used by cmd (default: run.log)
--max_num_log_files MAX_NUM_LOG_FILES
The maximum number of log-files to be kept (default:
1000)
--ngpu NGPU The number of GPUs per node (default: 1)
--num_nodes NUM_NODES
The number of nodes (default: 1)
--host HOST Directly specify the host names. The jobs are submitted
via SSH. Multiple host names can be specified,
separated by commas, e.g. host1,host2. You can also specify
device ids after the host name with ':', e.g.
host1:0:2:3,host2:0:2. If the device ids are specified
in this way, the value of --ngpu is ignored. (default:
None)
--envfile ENVFILE Source the shell script before executing the command. This
option is used when --host is specified. (default:
path.sh)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Whether to use the multiprocessing distributed method in
single-node mode. (default: True)
--master_port MASTER_PORT
Specify the port number of the master. The master is the host
machine that runs the RANK 0 process. (default: None)
--master_addr MASTER_ADDR
Specify the address of the master. The master is the host
machine that runs the RANK 0 process. (default: None)
--init_file_prefix INIT_FILE_PREFIX
The file name prefix for init_file, which is used for
'Shared-file system initialization'. This option is
used when --master_port is not specified (default:
.dist_init_)
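A minimal sketch of how a --host value such as host1:0:2:3,host2:0:2 can be read as a mapping from host to GPU device ids. parse_hosts is a hypothetical helper, not launch.py's own code; falling back to range(ngpu) for hosts without explicit ids is an assumption based on the description of --ngpu above.

```python
from typing import Dict, List


def parse_hosts(spec: str, ngpu: int) -> Dict[str, List[int]]:
    """Map each comma-separated host to its device ids.

    Explicit ids after ':' override ngpu; otherwise range(ngpu) is assumed.
    """
    hosts: Dict[str, List[int]] = {}
    for entry in spec.split(","):
        name, *ids = entry.split(":")
        hosts[name] = [int(i) for i in ids] if ids else list(range(ngpu))
    return hosts
```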
lm_calc_perplexity.py¶
usage: lm_calc_perplexity.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU]
[--seed SEED] [--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS]
[--batch_size BATCH_SIZE] [--log_base LOG_BASE]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE]
Calculate perplexity
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--log_base LOG_BASE The base of the logarithm for perplexity. If None,
Napier's constant e is used. (default: None)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--train_config TRAIN_CONFIG
--model_file MODEL_FILE
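The --log_base option changes the unit of the reported log-likelihoods, but perplexity itself does not depend on the base. A minimal sketch, assuming per-token negative log-likelihoods are available in nats; perplexity is a hypothetical helper, not lm_calc_perplexity.py's code.

```python
import math
from typing import List, Optional, Tuple


def perplexity(token_nlls: List[float], log_base: Optional[float] = None) -> Tuple[float, float]:
    """Return (total NLL in the chosen base, perplexity).

    token_nlls are natural-log values; log_base=None keeps Napier's constant e.
    """
    total = sum(token_nlls)
    n = len(token_nlls)
    if log_base is None:
        return total, math.exp(total / n)
    # Change of base: log_b(x) = ln(x) / ln(b)
    total_b = total / math.log(log_base)
    return total_b, log_base ** (total_b / n)
```

For three tokens that each have probability 1/4, the total NLL is 3 ln 4 nats or 6 bits, while the perplexity is 4 in both cases.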
lm_train.py¶
usage: lm_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
[--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word}] [--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--lm {seq_rnn,transformer}] [--lm_conf LM_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images used to plot the attention outputs. This option is only meaningful for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion whose value is given to the lr scheduler. Give a pair of the phase ("train" or "valid") and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give triples of the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p of the p-norm used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
Whether to inject noise into gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every this number of iterations in each epoch of the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
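--grad_clip and --grad_clip_type presumably map to the max_norm and norm_type of PyTorch's torch.nn.utils.clip_grad_norm_. The pure-Python sketch below shows the idea on plain lists: measure the joint p-norm of all gradients and rescale them only when it exceeds the threshold. clip_grad_norm is a hypothetical stand-in; the 1e-6 epsilon mirrors PyTorch's behavior as we understand it.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float, norm_type: float = 2.0) -> float:
    """Rescale gradients in place so their joint p-norm is at most max_norm.

    Returns the total norm measured before clipping.
    """
    flat = [g for grad in grads for g in grad]
    if math.isinf(norm_type):
        total = max(abs(g) for g in flat)  # inf-norm: largest magnitude
    else:
        total = sum(abs(g) ** norm_type for g in flat) ** (1.0 / norm_type)
    coef = max_norm / (total + 1e-6)
    if coef < 1.0:
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= coef
    return total
```

Clipping the joint norm, rather than each tensor separately, preserves the direction of the overall update.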
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
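A minimal sketch of how an --init_param value splits into its four fields and how src_key and exclude_keys might filter a flat state dict before the result is loaded into the model attribute named by dst_key. parse_init_param and select_states are hypothetical helpers. Allowing several comma-separated exclude keys, and matching them against the full keys in the source file, are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple


def parse_init_param(spec: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'; missing fields become None or []."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':' in {spec!r}")
    parts += [""] * (4 - len(parts))
    path, src, dst, excl = parts
    return path, src or None, dst or None, [k for k in excl.split(",") if k]


def select_states(state: Dict[str, object], src: Optional[str], excludes: List[str]) -> Dict[str, object]:
    """Keep entries under src (with the src prefix stripped), dropping excluded keys."""
    out: Dict[str, object] = {}
    for key, value in state.items():
        if any(key == e or key.startswith(e + ".") for e in excludes):
            continue
        if src is None:
            out[key] = value
        elif key.startswith(src + "."):
            out[key[len(src) + 1:]] = value
    return out
```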
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no particular features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose numbers of 'bins' are as equal as possible, counting bins as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches whose numbers of 'bins' are as equal as possible, but it counts bins as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
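The "folded", "length", and "numel" batch types above can be illustrated with the following simplified sketch. folded_batch_size, numel, and pack_by_bins are hypothetical helpers, and ESPnet's actual samplers differ in details. Note that the folded formula, taken literally, divides by zero when L < fold_length; the sketch therefore divides by 1 + L // fold_length, which is an assumption on our part.

```python
from typing import Dict, List


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int,
                      min_batch_size: int = 1) -> int:
    """Shrink the batch size as the longest sample in the mini-batch grows (assumed divisor)."""
    return max(min_batch_size, base_batch_size // (1 + max_len // fold_length))


def numel(shape: str) -> int:
    """Number of elements for a shape such as "1000,80" (length x feature dim)."""
    n = 1
    for dim in shape.split(","):
        n *= int(dim)
    return n


def pack_by_bins(sizes: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Greedily pack samples, largest first, keeping each batch's total within batch_bins.

    Pass lengths for "length"-style packing or numel() counts for "numel"-style packing.
    A single sample larger than batch_bins still forms its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in sorted(sizes, key=lambda k: sizes[k], reverse=True):
        if current and total + sizes[key] > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(key)
        total += sizes[key]
    if current:
        batches.append(current)
    return batches
```

With the lengths from the examples above (1000, 1453, and 1241) and batch_bins=2700, the two longest utterances share one batch and utterance_id_a gets its own.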
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
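A rough sketch of the chunk iterator options above: parsing a --chunk_length value, computing chunk start offsets from --chunk_shift_ratio, and a literal reading of the --chunk_excluded_key_prefixes conditions. All three helpers are hypothetical. Treating '300-400' as an inclusive range, omitting any random start offset, and reading "end with numbers" as "last character is a digit" are assumptions, not ESPnet's confirmed behavior.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-302' -> [300, 301, 302]."""
    choices: List[int] = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            choices.append(bounds[0])
        elif len(bounds) == 2:
            choices.extend(range(bounds[0], bounds[1] + 1))  # inclusive range (assumed)
        else:
            raise ValueError(f"Wrong format for chunk_length: {spec!r}")
    return choices


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of fixed-length chunks; an empty list means the sample is discarded."""
    if seq_len < chunk_len:
        return []
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


def is_excluded_key(key: str, prefixes: List[str]) -> bool:
    """Exact match with a prefix, or starts with a prefix and ends with a digit."""
    return any(key == p or (key.startswith(p) and key[-1].isdigit()) for p in prefixes)
```

With seq_len=1000, chunk_len=500, and the default ratio 0.5, chunks start at 0, 250, and 500, so neighboring chunks overlap by half.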
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile (wav, flac, etc.).
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enables a multi-column wav.scp. The following text file can be loaded as multi-channel audio data:
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of floats separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of floats separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file containing arrays at the first or second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float ndarray with the shape given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shape given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 generates integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization.
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used. (default: None)
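A minimal sketch of the data option format: splitting a 'path,name,type' value and reading "text_int" or "csv_int" style files keyed by utterance id. parse_data_triple and load_int_table are hypothetical helpers that return plain Python lists; ESPnet's own loaders are different code and return arrays.

```python
from typing import Dict, List, Optional, Tuple


def parse_data_triple(value: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    path, name, file_type = value.rsplit(",", 2)
    return path, name, file_type


def load_int_table(lines: List[str], sep: Optional[str] = None) -> Dict[str, List[int]]:
    """Parse "text_int" lines (sep=None, any whitespace) or "csv_int" lines (sep=",")."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        utt_id = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        table[utt_id] = [int(v) for v in rest.split(sep) if v]
    return table
```

Splitting the triple from the right keeps a file path intact even if it happens to contain a comma.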
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'ignore_id': 0})
Preprocess related
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word}
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--lm {seq_rnn,transformer}
The lm type (default: seq_rnn)
--lm_conf LM_CONF The keyword arguments for lm (default: {})
mt_inference.py
usage: mt_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
[--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--mt_train_config MT_TRAIN_CONFIG]
[--mt_model_file MT_MODEL_FILE]
[--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
[--word_lm_train_config WORD_LM_TRAIN_CONFIG]
[--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
[--model_tag MODEL_TAG] [--batch_size BATCH_SIZE]
[--nbest NBEST] [--beam_size BEAM_SIZE]
[--penalty PENALTY] [--maxlenratio MAXLENRATIO]
[--minlenratio MINLENRATIO] [--ctc_weight CTC_WEIGHT]
[--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
[--token_type {char,bpe,None}] [--bpemodel BPEMODEL]
MT Decoding
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--mt_train_config MT_TRAIN_CONFIG
MT training configuration (default: None)
--mt_model_file MT_MODEL_FILE
MT model parameter file (default: None)
--lm_train_config LM_TRAIN_CONFIG
LM training configuration (default: None)
--lm_file LM_FILE LM parameter file (default: None)
--word_lm_train_config WORD_LM_TRAIN_CONFIG
Word LM training configuration (default: None)
--word_lm_file WORD_LM_FILE
Word LM parameter file (default: None)
--ngram_file NGRAM_FILE
N-gram parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
*_train_config and *_file will be overwritten
(default: None)
Beam-search related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--nbest NBEST Output N-best hypotheses (default: 1)
--beam_size BEAM_SIZE
Beam size (default: 20)
--penalty PENALTY Insertion penalty (default: 0.0)
--maxlenratio MAXLENRATIO
Input length ratio to obtain max output length. If
maxlenratio=0.0 (default), it uses an end-detect
function to automatically find maximum hypothesis
lengths. If maxlenratio<0.0, its absolute value is
interpreted as a constant max output length (default:
0.0)
--minlenratio MINLENRATIO
Input length ratio to obtain min output length
(default: 0.0)
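The three maxlenratio cases and minlenratio can be summarized as a small function. This is a sketch of the rules described above, not ESPnet's beam-search code; in particular, using the input length as the upper limit when maxlenratio is 0.0 (with end detection stopping the search earlier) is an assumption.

```python
def output_length_bounds(input_len, maxlenratio=0.0, minlenratio=0.0):
    """Return (maxlen, minlen) following --maxlenratio / --minlenratio.

    Sketch only. For maxlenratio == 0.0 it assumes the input length is used
    as the upper limit while end detection stops the search earlier.
    """
    if maxlenratio == 0.0:
        maxlen = input_len
    elif maxlenratio < 0.0:
        # Negative ratio: its absolute value is a constant max length.
        maxlen = int(-maxlenratio)
    else:
        maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return maxlen, minlen


print(output_length_bounds(50, maxlenratio=0.5))    # (25, 0)
print(output_length_bounds(50, maxlenratio=-30.0))  # (30, 0)
```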
--ctc_weight CTC_WEIGHT
CTC weight in joint decoding (default: 0.0)
--lm_weight LM_WEIGHT
RNNLM weight (default: 1.0)
--ngram_weight NGRAM_WEIGHT
ngram weight (default: 0.9)
Text converter related:
--token_type {char,bpe,None}
The token type for MT model. If not given, it is taken
from the training args (default: None)
--bpemodel BPEMODEL The model path of sentencepiece. If not given, it is
taken from the training args (default: None)
mt_train.py
usage: mt_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
[--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
[--src_token_list SRC_TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--src_token_type {bpe,char,word,phn}]
[--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--tokenizer_encode_conf TOKENIZER_ENCODE_CONF]
[--src_tokenizer_encode_conf SRC_TOKENIZER_ENCODE_CONF]
[--frontend {embed}] [--frontend_conf FRONTEND_CONF]
[--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
[--preencoder {sinc,linear,None}]
[--preencoder_conf PREENCODER_CONF]
[--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer,e_branchformer}]
[--encoder_conf ENCODER_CONF]
[--postencoder {hugging_face_transformers,None}]
[--postencoder_conf POSTENCODER_CONF]
[--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--decoder_conf DECODER_CONF] [--model {mt,discrete_asr}]
[--model_conf MODEL_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--tokenizer_encode_conf TOKENIZER_ENCODE_CONF
Tokenization encoder conf, e.g. BPE dropout: enable_sampling=True, alpha=0.1, nbest_size=-1 (default: None)
--src_tokenizer_encode_conf SRC_TOKENIZER_ENCODE_CONF
Src tokenization encoder conf, e.g. BPE dropout: enable_sampling=True, alpha=0.1, nbest_size=-1 (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is only meaningful for attention-based models. Set it to 0 to disable the attention plot (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
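How a (phase, name, mode) criterion picks the best epochs, for example when combined with --keep_nbest_models, can be sketched as follows. The per-epoch statistics layout is a hypothetical nested dict, not ESPnet's internal reporter, and the comma-separated spelling of a criterion is an assumption.

```python
def parse_criterion(text):
    """Turn 'valid,acc,max' into ('valid', 'acc', 'max')."""
    phase, name, mode = (part.strip() for part in text.split(","))
    if phase not in ("train", "valid") or mode not in ("min", "max"):
        raise ValueError(f"bad criterion: {text!r}")
    return phase, name, mode


def nbest_epochs(stats, criterion, n):
    """Return the n best epochs for a (phase, name, mode) criterion.

    stats: hypothetical dict epoch -> {phase: {criterion_name: value}}.
    """
    phase, name, mode = criterion
    scored = [
        (values[phase][name], epoch)
        for epoch, values in stats.items()
        if name in values.get(phase, {})
    ]
    # "max" criteria (e.g. acc) rank high values first, "min" (e.g. loss) low.
    scored.sort(reverse=(mode == "max"))
    return [epoch for _, epoch in scored[:n]]


stats = {
    1: {"valid": {"loss": 3.0, "acc": 0.50}},
    2: {"valid": {"loss": 2.0, "acc": 0.70}},
    3: {"valid": {"loss": 2.5, "acc": 0.80}},
}
print(nbest_epochs(stats, parse_criterion("valid,loss,min"), 2))  # [2, 3]
print(nbest_epochs(stats, parse_criterion("valid,acc,max"), 1))   # [3]
```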
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
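What --grad_clip and --grad_clip_type control can be shown with a pure-Python sketch over a flat list of gradient values. In PyTorch code the usual equivalent is torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type); this sketch copies its epsilon form but is not ESPnet's trainer code.

```python
import math


def clip_grad_norm(grads, max_norm=5.0, norm_type=2.0):
    """Rescale a flat list of gradient values so their p-norm is <= max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    if norm_type == math.inf:
        total_norm = max(abs(g) for g in grads)
    else:
        total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    # Same form as PyTorch: the epsilon avoids dividing by zero.
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        grads = [g * clip_coef for g in grads]
    return grads, total_norm


clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)  # 5.0
```

Gradients whose norm is already below the threshold are returned unchanged, so clipping only affects unusually large updates.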
--grad_noise GRAD_NOISE
The flag to switch to use noise injection to gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding and training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
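The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format can be made concrete with a sketch that splits the string and filters a flat state dict. ESPnet's actual loader works on the model object and its attributes; this only reproduces the string format and key filtering on a plain dict. The function names are hypothetical, and splitting several exclude keys on commas is an assumption.

```python
def parse_init_param(spec):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'.

    Missing trailing fields become None; exclude_keys is split on commas
    (an assumption about how several keys are given).
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec!r}")
    path, src_key, dst_key, exclude = parts + [None] * (4 - len(parts))
    excludes = exclude.split(",") if exclude else []
    return path, src_key or None, dst_key or None, excludes


def select_states(states, src_key=None, dst_key=None, excludes=()):
    """Pick and rename entries of a flat {name: tensor} state dict.

    Keys starting with an exclude prefix are dropped first; keys under
    src_key are then re-rooted under dst_key.
    """
    selected = {}
    for key, value in states.items():
        if any(key.startswith(ex) for ex in excludes):
            continue
        if src_key is not None:
            if not key.startswith(src_key + "."):
                continue
            key = key[len(src_key) + 1:]
        if dst_key is not None:
            key = f"{dst_key}.{key}"
        selected[key] = value
    return selected


spec = "some/where/model.pth:decoder:decoder:decoder.embed"
path, src, dst, excludes = parse_init_param(spec)
states = {"encoder.w": 1, "decoder.embed.w": 2, "decoder.out.w": 3}
print(select_states(states, src, dst, excludes))  # {'decoder.out.w': 3}
```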
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is referred to, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample:
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is referred to, so 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain as close to the same number of 'bins' as possible, where the bins are counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is referred to, so 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that contain as close to the same number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
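The folded batch-size formula and the bin counting used by "length" and "numel" can be sketched as below. The max(1, ...) guards in the folded formula are assumptions added so that L < fold_length does not divide by zero, and the greedy packing is a simplification: the real LengthBatchSampler and NumElementsBatchSampler may order samples differently or count padded bins.

```python
import math


def folded_batch_size(base_batch_size, max_len, fold_length):
    """batch_size = base_batch_size // (L // fold_length), L = longest sample.

    The max(1, ...) guards are assumptions added here so that L < fold_length
    does not divide by zero and the batch never becomes empty.
    """
    factor = max(1, max_len // fold_length)
    return max(1, base_batch_size // factor)


def shape_to_bins(shape_text, batch_type):
    """Bins for one shape-file entry such as '1000,80'."""
    dims = [int(x) for x in shape_text.split(",")]
    if batch_type == "length":
        return dims[0]  # only the first dimension (length) counts
    if batch_type == "numel":
        return math.prod(dims)  # all elements of the feature count
    raise ValueError(f"unsupported batch_type: {batch_type!r}")


def pack_by_bins(bins, batch_bins):
    """Greedy sketch: sort by bins, start a new batch when batch_bins is exceeded."""
    batches, current, total = [], [], 0
    for utt, size in sorted(bins.items(), key=lambda kv: kv[1]):
        if current and total + size > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(utt)
        total += size
    if current:
        batches.append(current)
    return batches


shapes = {"utterance_id_a": "1000,80", "utterance_id_b": "1453,80",
          "utterance_id_c": "1241,80"}
bins = {utt: shape_to_bins(s, "length") for utt, s in shapes.items()}
print(pack_by_bins(bins, 2500))
# [['utterance_id_a', 'utterance_id_c'], ['utterance_id_b']]
```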
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
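The three --chunk_length spellings can be parsed as in the sketch below. Two details are assumptions not stated in the option text: the '-' range is treated as inclusive of both ends, and the random choice is restricted to lengths that fit the sample.

```python
import random


def parse_chunk_lengths(spec):
    """Parse --chunk_length values such as '300', '300,400,500' or '300-400'.

    Assumption: a '-' range includes both end points.
    """
    if "-" in spec:
        start, end = (int(x) for x in spec.split("-"))
        if start > end:
            raise ValueError(f"invalid range: {spec!r}")
        return list(range(start, end + 1))
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(chunk_lengths, seq_len, rng=random):
    """Randomly pick a chunk length that fits, or None to discard the sample.

    Assumption: only lengths not exceeding seq_len are candidates.
    """
    usable = [n for n in chunk_lengths if n <= seq_len]
    if not usable:
        return None
    return rng.choice(usable)


print(parse_chunk_lengths("300,400,500"))  # [300, 400, 500]
print(pick_chunk_length([300, 400], 250))  # None: shorter than all lengths
```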
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks, relative to the chunk length. If it's less than 1, chunks overlap; if it's greater than 1, there are gaps between consecutive chunks. (default: 0.5)
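The effect of --chunk_shift_ratio on where chunks start can be worked out with a short sketch. It assumes the shift is int(chunk_length * chunk_shift_ratio) and that chunks are taken from offset 0 with no random start offset; the actual ChunkIterFactory may add a random offset when shuffling.

```python
def chunk_offsets(seq_len, chunk_length, chunk_shift_ratio=0.5):
    """Start offsets of fixed-length chunks cut from one sequence.

    Sketch assuming shift = int(chunk_length * chunk_shift_ratio):
    ratio < 1 gives overlapping chunks, ratio > 1 leaves gaps.
    """
    shift = int(chunk_length * chunk_shift_ratio)
    if shift <= 0:
        raise ValueError("chunk_length * chunk_shift_ratio must be >= 1")
    if seq_len < chunk_length:
        return []  # too short for even one chunk
    num_chunks = (seq_len - chunk_length) // shift + 1
    return [i * shift for i in range(num_chunks)]


print(chunk_offsets(1000, 500, 0.5))  # [0, 250, 500]: 50% overlap
print(chunk_offsets(1200, 300, 2.0))  # [0, 600]: a 300-frame gap between chunks
```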
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
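The two exclusion rules can be expressed as a small predicate. This sketch follows the wording literally (the key equals a prefix, or it starts with a prefix and ends with digits); ESPnet's own check may be stricter, for example requiring only digits after the prefix. The prefix "text_spk" is a hypothetical example.

```python
import re


def is_excluded(key, prefixes):
    """True if `key` is skipped by the length-consistency check.

    Follows the two rules literally: the key equals a prefix, or it starts
    with a prefix and ends with digits.
    """
    for prefix in prefixes:
        if key == prefix:
            return True
        if key.startswith(prefix) and re.search(r"\d+$", key):
            return True
    return False


prefixes = ["text_spk"]  # hypothetical key prefix
print([k for k in ["text_spk", "text_spk2", "text_spk_a", "speech"]
       if is_excluded(k, prefixes)])  # ['text_spk', 'text_spk2']
```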
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, e.g. wav, flac.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi' and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first or second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supporting speaker diarization.
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
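The "path,name,type" triple and the four numeric text formats above can be parsed with a few lines of plain Python. The helper names are hypothetical and this is not ESPnet's dataset loader; it only shows how the example lines map to values.

```python
def parse_data_triple(value):
    """Split 'some/path/a.scp,foo,sound' into (path, name, file_type)."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    return tuple(parts)


def parse_numeric_line(line, file_type):
    """Parse one line of a text_int, csv_int, text_float or csv_float file."""
    utt_id, rest = line.split(maxsplit=1)
    # csv_* types separate values by commas, text_* types by whitespace.
    values = rest.split(",") if file_type.startswith("csv_") else rest.split()
    cast = int if file_type.endswith("_int") else float
    return utt_id, [cast(v) for v in values]


print(parse_data_triple("some/path/a.scp,foo,sound"))
# ('some/path/a.scp', 'foo', 'sound')
print(parse_numeric_line("utterance_id_A 12 0 1 3", "text_int"))
# ('utterance_id_A', [12, 0, 1, 3])
```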
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
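The size strings such as 10MB or 20GB, and the 5 percent fallback for the validation cache, can be sketched as follows. Decimal (SI) units are an assumption here; the real option may rely on a library parser with its own unit rules, and the function names are hypothetical.

```python
import re

# Assumption: decimal (SI) units.
_UNITS = {None: 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(value):
    """Parse '10MB', '20GB' or a plain number into a byte count."""
    if isinstance(value, (int, float)):
        return int(value)
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]B|B)?\s*", value.upper())
    if m is None:
        raise ValueError(f"cannot parse size: {value!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def valid_cache_size(max_cache_size, valid_max_cache_size=None):
    """--valid_max_cache_size falls back to 5 percent of --max_cache_size."""
    if valid_max_cache_size is not None:
        return parse_size(valid_max_cache_size)
    return parse_size(max_cache_size) * 5 // 100


print(valid_cache_size("20GB"))  # 1000000000
```

With the default --max_cache_size of 0.0, the fallback is also 0, i.e. no validation cache.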
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related
--token_list TOKEN_LIST
A text mapping int-id to token (for target language) (default: None)
--src_token_list SRC_TOKEN_LIST
A text mapping int-id to token (for source language) (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--input_size INPUT_SIZE
The input feature dimension (default: None)
--ctc_conf CTC_CONF The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
Preprocess related
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The target text will be tokenized at the specified token level (default: bpe)
--src_token_type {bpe,char,word,phn}
The source text will be tokenized at the specified token level (default: bpe)
--bpemodel BPEMODEL The model file of sentencepiece (for target language) (default: None)
--src_bpemodel SRC_BPEMODEL
The model file of sentencepiece (for source language) (default: None)
--frontend {embed} The frontend type (default: embed)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--preencoder {sinc,linear,None}
The preencoder type (default: None)
--preencoder_conf PREENCODER_CONF
The keyword arguments for preencoder (default: {})
--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer,e_branchformer}
The encoder type (default: rnn)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--postencoder {hugging_face_transformers,None}
The postencoder type (default: None)
--postencoder_conf POSTENCODER_CONF
The keyword arguments for postencoder (default: {})
--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The decoder type (default: rnn)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--model {mt,discrete_asr}
The model type (default: mt)
--model_conf MODEL_CONF
The keyword arguments for model (default: {})
pack.py
usage: pack.py [-h] {asr,st,tts,enh,diar,svs,enh_s2t,ssl} ...
Pack input files to archive format
positional arguments:
{asr,st,tts,enh,diar,svs,enh_s2t,ssl}
optional arguments:
spk_train.py
usage: spk_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--use_preprocessor USE_PREPROCESSOR]
[--input_size INPUT_SIZE]
[--target_duration TARGET_DURATION] [--spk2utt SPK2UTT]
[--sample_rate SAMPLE_RATE] [--num_eval NUM_EVAL]
[--rir_scp RIR_SCP] [--model_conf MODEL_CONF]
[--frontend {default,sliding_window,raw,None}]
[--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF] [--encoder {rawnet3}]
[--encoder_conf ENCODER_CONF] [--pooling {chn_attn_stat}]
[--pooling_conf POOLING_CONF] [--projector {rawnet3}]
[--projector_conf PROJECTOR_CONF]
[--preprocessor {common,spk}]
[--preprocessor_conf PREPROCESSOR_CONF]
[--loss {aamsoftmax}] [--loss_conf LOSS_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option only applies to attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are read. (default: env://)
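A minimal sketch of the env:// convention: before PyTorch's `torch.distributed.init_process_group(backend=..., init_method="env://")` runs, each process needs the four variables named above. The helper `read_env_init` is hypothetical; it only checks that the variables exist and does not create a process group.

```python
import os
from typing import Dict, Mapping

# Variables consulted when init_method="env://" (see the help text above).
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the env:// settings, or raise if any variable is missing."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"env:// initialization needs {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Example for a single-node job with two processes; this process is rank 0.
settings = read_env_init(
    {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
)
print(settings["WORLD_SIZE"])  # 2
```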
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, where each node has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
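A minimal sketch of how `--patience` and a ("valid", "loss", "min") criterion could interact, assuming training stops once the best value is `patience` or more epochs old. The function `should_stop` is hypothetical and is not ESPnet's trainer code.

```python
from typing import List, Optional


def should_stop(history: List[float], patience: Optional[int], mode: str = "min") -> bool:
    """Return True if the best value was reached `patience` or more epochs ago.

    `history` holds one criterion value per finished epoch, e.g. the
    validation loss. patience=None disables early stopping.
    """
    if patience is None or not history:
        return False
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    best_value = pick(history)
    # 0-based index of the first epoch that reached the best value.
    best_epoch = history.index(best_value)
    epochs_since_best = len(history) - 1 - best_epoch
    return epochs_since_best >= patience


# Validation loss stops improving after the second epoch.
losses = [2.0, 1.5, 1.6, 1.7, 1.8]
print(should_stop(losses, patience=3))  # True: 3 epochs without improvement
print(should_stop(losses, patience=4))  # False
```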
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give a triple of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots, keeping only the n-best scored epochs (default: [10])
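For a single criterion such as ("valid", "acc", "max"), choosing which epochs to keep reduces to ranking epochs by score. A minimal sketch, assuming one score per epoch; `nbest_epochs` is a hypothetical helper and ignores how ESPnet combines several criteria or averages the kept models.

```python
from typing import Dict, List


def nbest_epochs(scores: Dict[int, float], n: int, mode: str = "max") -> List[int]:
    """Return the `n` best epochs for one criterion, best first.

    `scores` maps epoch number -> criterion value (e.g. valid acc).
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    ranked = sorted(scores, key=scores.get, reverse=(mode == "max"))
    return ranked[:n]


valid_acc = {1: 0.71, 2: 0.78, 3: 0.80, 4: 0.79, 5: 0.77}
keep = nbest_epochs(valid_acc, n=2, mode="max")
print(keep)  # [3, 4]
# Snapshots of the remaining epochs would be candidates for removal.
print(sorted(set(valid_acc) - set(keep)))  # [1, 2, 5]
```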
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
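In PyTorch, norm-based clipping is provided by `torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type)`; ESPnet presumably passes `--grad_clip` as `max_norm` and `--grad_clip_type` as `norm_type`. The pure-Python sketch below reproduces the arithmetic on plain lists so it is easy to follow; `clip_by_total_norm` is a hypothetical name.

```python
import math
from typing import List


def clip_by_total_norm(grads: List[List[float]], max_norm: float, norm_type: float = 2.0) -> float:
    """Scale `grads` in place so their total p-norm is at most `max_norm`.

    Returns the total norm measured before clipping. Use math.inf as
    norm_type for the max-abs norm.
    """
    flat = [abs(g) for tensor in grads for g in tensor]
    if not flat:
        return 0.0
    if math.isinf(norm_type):
        total = max(flat)
    else:
        total = sum(g ** norm_type for g in flat) ** (1.0 / norm_type)
    # Same small epsilon guard as torch.nn.utils.clip_grad_norm_.
    coef = max_norm / (total + 1e-6)
    if coef < 1.0:
        for tensor in grads:
            for i in range(len(tensor)):
                tensor[i] *= coef
    return total


grads = [[3.0, 4.0], [0.0]]  # total L2 norm = 5.0
norm = clip_by_total_norm(grads, max_norm=1.0)
print(norm)   # 5.0
print(grads)  # roughly [[0.6, 0.8], [0.0]]
```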
--grad_noise GRAD_NOISE
Whether to inject noise into the gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding and training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every N iterations in each epoch during the training phase, where N is this value. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization, e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
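A minimal sketch of how the '<file_path>:<src_key>:<dst_key>:<exclude_keys>' spec could be split and applied to a flat state dict. The names `InitParamSpec`, `parse_init_param`, and `select_states` are hypothetical, and treating exclude_keys as comma-separated is an assumption; ESPnet's own loader may differ (for example, paths containing ':' would break this sketch).

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split the spec; missing trailing fields mean "use everything"."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        exclude_keys=[k for k in exclude.split(",") if k],  # assumed separator
    )


def select_states(state: dict, spec: InitParamSpec) -> dict:
    """Return the entries of `state` that the spec would load into dst_key."""
    out = {}
    for key, value in state.items():
        if any(key.startswith(ex) for ex in spec.exclude_keys):
            continue
        if spec.src_key is None:
            out[key] = value
        elif key.startswith(spec.src_key + "."):
            # Drop the source prefix; the destination module supplies its own.
            out[key[len(spec.src_key) + 1:]] = value
    return out


spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
state = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
print(select_states(state, spec))  # {'out.weight': 3}
```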
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file that lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that each contain as close to the same number of 'bins' as possible, counting the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that each contain as close to the same number of 'bins' as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
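The "folded" formula and the "length" bin counting described above can be sketched as follows. The `max(1, ...)` guards are our own addition (they avoid dividing by zero when L < fold_length), and `length_batches` is a simplified greedy grouping that counts plain summed lengths; ESPnet's actual samplers may count bins differently (e.g. with padding) and handle several input keys. Both function names are hypothetical.

```python
from typing import Dict, List


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), with guards."""
    factor = max(1, max_len // fold_length)
    return max(1, base_batch_size // factor)


def length_batches(lengths: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Sort by length, then start a new mini-batch whenever adding the next
    sample would push the summed length above `batch_bins`."""
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for utt in sorted(lengths, key=lengths.get):
        if current and total + lengths[utt] > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(utt)
        total += lengths[utt]
    if current:
        batches.append(current)
    return batches


print(folded_batch_size(20, max_len=1500, fold_length=500))  # 6
lengths = {"utterance_id_a": 1000, "utterance_id_b": 1453, "utterance_id_c": 1241}
print(length_batches(lengths, batch_bins=2500))
# [['utterance_id_a', 'utterance_id_c'], ['utterance_id_b']]
```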
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify the chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple comma-separated numbers are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
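A minimal sketch of the chunking options, assuming that '300-400' means every integer from 300 to 400 inclusive, that the shift in frames is chunk_length * chunk_shift_ratio, and that a chunk length is drawn only from the choices that fit the sample. All function names are hypothetical; this is not ESPnet's ChunkIterFactory.

```python
import random
from typing import List, Optional


def parse_chunk_lengths(spec: str) -> List[int]:
    """'300' -> [300]; '300,400,500' -> [300, 400, 500]; '300-400' -> 300..400."""
    if "-" in spec:
        low, high = (int(v) for v in spec.split("-"))
        return list(range(low, high + 1))  # inclusive range (our reading)
    return [int(v) for v in spec.split(",")]


def pick_chunk_length(seq_len: int, choices: List[int], rng: random.Random) -> Optional[int]:
    """Pick one fitting chunk length, or None if the sample is discarded."""
    fitting = [c for c in choices if c <= seq_len]
    return rng.choice(fitting) if fitting else None


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of full chunks; a ratio < 1 overlaps, > 1 leaves gaps."""
    if seq_len < chunk_len:
        return []
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(chunk_starts(1000, 500, 0.5))  # [0, 250, 500]
print(chunk_starts(1000, 300, 1.5))  # [0, 450]
print(pick_chunk_length(350, parse_chunk_lengths("300,400,500"), random.Random(0)))  # 300
```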
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
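The exclusion rule above can be read as "the key is a prefix followed only by digits (possibly none)". A minimal sketch under that reading; `is_excluded` and the prefix `text_spk` are hypothetical examples, not ESPnet code.

```python
import re
from typing import Iterable


def is_excluded(key: str, prefixes: Iterable[str]) -> bool:
    """True if `key` equals a prefix, or is a prefix followed only by digits."""
    return any(re.fullmatch(re.escape(p) + r"\d*", key) for p in prefixes)


keys = ["speech", "text_spk", "text_spk1", "text_spk2", "text_spk_ref"]
# Keys that would still take part in the length consistency check:
print([k for k in keys if not is_excluded(k, ["text_spk"])])  # ['speech', 'text_spk_ref']
```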
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three comma-separated values. This option is used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, such as wav and flac.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enables a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of floats separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of floats separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shapes given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
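A minimal sketch of reading the "path,name,type" triple and one line of the numeric text formats listed above. Splitting from the right (so a comma inside the path survives) is an assumption, and `parse_data_triple`, `parse_numeric_line`, and `parse_rand_int_type` are hypothetical helpers, not ESPnet's loaders.

```python
import re
from typing import List, Tuple, Union

Number = Union[int, float]


def parse_data_triple(arg: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    path, name, file_type = arg.rsplit(",", 2)
    return path, name, file_type


def parse_numeric_line(line: str, file_type: str) -> Tuple[str, List[Number]]:
    """Parse one line of a text_int, csv_int, text_float or csv_float file."""
    utt_id, payload = line.strip().split(maxsplit=1)
    sep = "," if file_type.startswith("csv_") else None  # None: any whitespace
    cast = int if file_type.endswith("_int") else float
    return utt_id, [cast(v) for v in payload.split(sep)]


def parse_rand_int_type(file_type: str) -> Tuple[int, int]:
    """'rand_int_0_10' -> (0, 10): lower and upper value from the type name."""
    m = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if m is None:
        raise ValueError(f"Not a rand_int type: {file_type!r}")
    return int(m.group(1)), int(m.group(2))


print(parse_data_triple("some/path/a.scp,foo,sound"))
# ('some/path/a.scp', 'foo', 'sound')
print(parse_numeric_line("utterance_id_A 12 0 1 3", "text_int"))
# ('utterance_id_A', [12, 0, 1, 3])
print(parse_numeric_line("utterance_id_A 12.,3.1,3.4,4.4", "csv_float"))
# ('utterance_id_A', [12.0, 3.1, 3.4, 4.4])
```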
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for the data loader, e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
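A minimal sketch of turning sizes such as "10MB" into bytes and deriving the validation default (5 percent of --max_cache_size). Decimal units (1 MB = 10**6 bytes) are an assumption of this sketch, and `parse_size` / `valid_cache_size` are hypothetical names; ESPnet's own size parser may treat units differently.

```python
import re
from typing import Optional

# Decimal units are an assumption of this sketch.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> int:
    """'10MB' -> 10_000_000, '20GB' -> 20_000_000_000, '0.0' -> 0."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([KMGT]?B)?\s*", text.upper())
    if m is None:
        raise ValueError(f"Cannot parse size: {text!r}")
    number, unit = m.groups()
    return int(float(number) * _UNITS[unit or ""])


def valid_cache_size(max_cache_size: str, valid_max_cache_size: Optional[str] = None) -> int:
    """Explicit validation size if given, else 5 percent of max_cache_size."""
    if valid_max_cache_size is not None:
        return parse_size(valid_max_cache_size)
    return parse_size(max_cache_size) * 5 // 100  # integer math avoids rounding


print(valid_cache_size("10MB"))  # 500000
```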
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--input_size INPUT_SIZE
The input dimension of the feature (default: None)
--target_duration TARGET_DURATION
Duration (in seconds) of samples in a minibatch (default: 3.0)
--spk2utt SPK2UTT Directory of spk2utt file to be used in label mapping (default: )
--sample_rate SAMPLE_RATE
Sampling rate (default: 16000)
--num_eval NUM_EVAL Number of segments to make from one utterance in the inference phase (default: 10)
--rir_scp RIR_SCP Directory of the rir data to be augmented (default: )
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {})
--frontend {default,sliding_window,raw,None}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: None)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--encoder {rawnet3} The encoder type (default: rawnet3)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--pooling {chn_attn_stat}
The pooling type (default: chn_attn_stat)
--pooling_conf POOLING_CONF
The keyword arguments for pooling (default: {})
--projector {rawnet3}
The projector type (default: rawnet3)
--projector_conf PROJECTOR_CONF
The keyword arguments for projector (default: {})
--preprocessor {common,spk}
The preprocessor type (default: spk)
--preprocessor_conf PREPROCESSOR_CONF
The keyword arguments for preprocessor (default: {})
--loss {aamsoftmax} The loss type (default: aam)
--loss_conf LOSS_CONF
The keyword arguments for loss (default: {})
split_scps.py¶
usage: split_scps.py [-h]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--scps SCPS [SCPS ...] [--names NAMES [NAMES ...]]
[--num_splits NUM_SPLITS] --output_dir OUTPUT_DIR
Split scp files
optional arguments:
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--scps SCPS [SCPS ...]
Input texts (default: None)
--names NAMES [NAMES ...]
Output names for each file (default: None)
--num_splits NUM_SPLITS
Number of splits (default: None)
--output_dir OUTPUT_DIR
Output directory (default: None)
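The help text does not describe how split_scps.py divides the files. The sketch below assumes contiguous, nearly equal parts and that every file passed to --scps shares the same utterance IDs, so each part stays aligned across files. `split_evenly` and `split_aligned` are hypothetical helpers.

```python
from typing import Dict, List


def split_evenly(items: List[str], num_splits: int) -> List[List[str]]:
    """Split into `num_splits` contiguous parts whose sizes differ by at most 1."""
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    n = len(items)
    bounds = [n * i // num_splits for i in range(num_splits + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_splits)]


def split_aligned(scps: Dict[str, List[str]], num_splits: int) -> List[Dict[str, List[str]]]:
    """Split several scp files so each part holds the same utterance IDs.

    Utterance order comes from the first file; a KeyError is raised if
    another file lacks one of its IDs.
    """
    names = list(scps)
    keys = [line.split(maxsplit=1)[0] for line in scps[names[0]]]
    by_key = {
        name: {line.split(maxsplit=1)[0]: line for line in lines}
        for name, lines in scps.items()
    }
    return [
        {name: [by_key[name][k] for k in part] for name in names}
        for part in split_evenly(keys, num_splits)
    ]


parts = split_aligned(
    {"wav.scp": ["a a.wav", "b b.wav", "c c.wav"], "text": ["c foo", "a hello", "b bar"]},
    num_splits=2,
)
print(parts[0])  # {'wav.scp': ['a a.wav'], 'text': ['a hello']}
```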
st_inference.py¶
usage: st_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
[--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--st_train_config ST_TRAIN_CONFIG]
[--st_model_file ST_MODEL_FILE]
[--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
[--word_lm_train_config WORD_LM_TRAIN_CONFIG]
[--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
[--model_tag MODEL_TAG] [--enh_s2t_task ENH_S2T_TASK]
[--batch_size BATCH_SIZE] [--nbest NBEST]
[--beam_size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
[--token_type {char,bpe,None}] [--bpemodel BPEMODEL]
ST Decoding
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--st_train_config ST_TRAIN_CONFIG
ST training configuration (default: None)
--st_model_file ST_MODEL_FILE
ST model parameter file (default: None)
--lm_train_config LM_TRAIN_CONFIG
LM training configuration (default: None)
--lm_file LM_FILE LM parameter file (default: None)
--word_lm_train_config WORD_LM_TRAIN_CONFIG
Word LM training configuration (default: None)
--word_lm_file WORD_LM_FILE
Word LM parameter file (default: None)
--ngram_file NGRAM_FILE
N-gram parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
*_train_config and *_file will be overwritten
(default: None)
--enh_s2t_task ENH_S2T_TASK
Enhancement and ASR joint model (default: False)
Beam-search related:
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--nbest NBEST Output N-best hypotheses (default: 1)
--beam_size BEAM_SIZE
Beam size (default: 20)
--penalty PENALTY Insertion penalty (default: 0.0)
--maxlenratio MAXLENRATIO
Input length ratio to obtain max output length. If
maxlenratio=0.0 (default), it uses an end-detect
function to automatically find maximum hypothesis
lengths. If maxlenratio<0.0, its absolute value is
interpreted as a constant max output length (default:
0.0)
--minlenratio MINLENRATIO
Input length ratio to obtain min output length
(default: 0.0)
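The --maxlenratio / --minlenratio rules can be written out as a small function. This is a sketch of the rules as described above; for maxlenratio=0.0 it assumes the input length serves as the upper bound while end detection decides when to stop, which is our understanding of ESPnet's beam search rather than a quotation of it. `output_length_limits` is a hypothetical name, and `input_len` is the length the ratios are applied to.

```python
from typing import Tuple


def output_length_limits(input_len: int, maxlenratio: float, minlenratio: float) -> Tuple[int, int]:
    """Return (maxlen, minlen) for decoding."""
    if maxlenratio == 0.0:
        # Bound by the input length; end detection stops hypotheses earlier.
        maxlen = input_len
    elif maxlenratio < 0.0:
        # Negative: its absolute value is a constant max output length.
        maxlen = int(-maxlenratio)
    else:
        maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return maxlen, minlen


print(output_length_limits(200, 0.0, 0.0))    # (200, 0)
print(output_length_limits(200, -50.0, 0.0))  # (50, 0)
print(output_length_limits(200, 0.5, 0.25))   # (100, 50)
```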
--lm_weight LM_WEIGHT
RNNLM weight (default: 1.0)
--ngram_weight NGRAM_WEIGHT
ngram weight (default: 0.9)
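As we understand ESPnet's beam search, each candidate token is ranked by a weighted sum of log-probability scores from several scorers; here --lm_weight and --ngram_weight would weight the LM and n-gram scorers, and --penalty would weight a per-token length bonus. The scorer names in the dictionaries below are hypothetical labels for illustration.

```python
from typing import Dict


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-token scores; scorers without a weight add nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


# Defaults from the help text: lm_weight=1.0, ngram_weight=0.9, penalty=0.0.
weights = {"decoder": 1.0, "lm": 1.0, "ngram": 0.9, "length_bonus": 0.0}
token_scores = {"decoder": -0.5, "lm": -1.2, "ngram": -2.0, "length_bonus": 1.0}
print(combined_score(token_scores, weights))  # approximately -3.5
```

A positive --penalty raises the score of every emitted token, which under this reading favors longer hypotheses.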
Text converter related:
--token_type {char,bpe,None}
The token type for ST model. If not given, it is taken
from the training args (default: None)
--bpemodel BPEMODEL The model path of sentencepiece. If not given, it is
taken from the training args (default: None)
st_train.py¶
usage: st_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
[--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
[--src_token_list SRC_TOKEN_LIST]
[--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
[--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}]
[--src_token_type {bpe,char,word,phn,none}]
[--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
[--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
[--noise_scp NOISE_SCP]
[--noise_apply_prob NOISE_APPLY_PROB]
[--noise_db_range NOISE_DB_RANGE]
[--short_noise_thres SHORT_NOISE_THRES]
[--frontend {default,sliding_window,s3prl}]
[--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
[--specaug_conf SPECAUG_CONF]
[--normalize {global_mvn,utterance_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--preencoder {sinc,linear,None}]
[--preencoder_conf PREENCODER_CONF]
[--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}]
[--encoder_conf ENCODER_CONF]
[--postencoder {hugging_face_transformers,None}]
[--postencoder_conf POSTENCODER_CONF]
[--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--decoder_conf DECODER_CONF]
[--extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF]
[--extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
[--extra_mt_decoder_conf EXTRA_MT_DECODER_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Setting it to 0 disables the attention plot. (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used for judging early stopping. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criterion used for judging the best model. Give one or more triples, each consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "valid,acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots excluding the n-best scored epochs (default: [10])
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p-norm type used for gradient clipping. Can be inf (default: 2.0)
--grad_noise GRAD_NOISE
The flag to switch to use noise injection to gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Resume training if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show logs every this number of iterations in each epoch during training. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
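How --patience, --early_stopping_criterion, --best_model_criterion and --keep_nbest_models interact can be sketched in plain Python. This is a simplified illustration, not ESPnet's trainer or reporter code: the function names are hypothetical, the history is assumed to map epoch numbers to the value of one criterion (e.g. valid/loss), and the exact off-by-one convention for patience is an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical helpers: `history` maps epoch -> value of one criterion,
# e.g. the ("valid", "loss") values collected by the trainer.


def best_epoch(history: Dict[int, float], mode: str) -> int:
    """Return the epoch with the best value; mode is "min" or "max"."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    # min()/max() return the first tied epoch in insertion order.
    return pick(history, key=lambda epoch: history[epoch])


def should_stop(history: Dict[int, float], mode: str, patience: Optional[int]) -> bool:
    """True once `patience` epochs passed without a new best value (assumed convention)."""
    if patience is None:  # --patience defaults to None: never stop early
        return False
    return max(history) - best_epoch(history, mode) >= patience


def nbest_epochs(history: Dict[int, float], mode: str, n: int) -> List[int]:
    """Epochs whose snapshots would survive --keep_nbest_models n for this criterion."""
    ranked = sorted(history, key=lambda epoch: history[epoch], reverse=(mode == "max"))
    return ranked[:n]


if __name__ == "__main__":
    valid_loss = {1: 1.20, 2: 0.95, 3: 0.97, 4: 0.99}
    print(best_epoch(valid_loss, "min"))       # 2
    print(should_stop(valid_loss, "min", 2))   # True: epochs 3 and 4 did not improve
    print(nbest_epochs(valid_loss, "min", 2))  # [2, 3]
```

Because --best_model_criterion accepts several triples, each criterion would get its own ranking, and a snapshot is presumably kept if any of them ranks it in the n-best.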
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
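The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be illustrated with a small sketch that parses the spec and remaps a flat state dict (a plain dict standing in for a PyTorch state_dict). It is not ESPnet's loader; `parse_init_param` and `remap_state` are hypothetical names, and treating exclude_keys as a comma-separated list of key prefixes is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def parse_init_param(spec: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'; trailing fields may be omitted."""
    fields = spec.split(":")
    if len(fields) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    fields += [""] * (4 - len(fields))
    path, src_key, dst_key, exclude = fields
    # Assumption: several exclude keys would be separated by commas.
    exclude_keys = [key for key in exclude.split(",") if key]
    return path, src_key or None, dst_key or None, exclude_keys


def remap_state(
    state: Dict[str, object],
    src_key: Optional[str],
    dst_key: Optional[str],
    exclude_keys: List[str],
) -> Dict[str, object]:
    """Keep src_key.* entries, drop excluded prefixes, and rename them to dst_key.*."""
    out: Dict[str, object] = {}
    for name, value in state.items():
        if any(name == key or name.startswith(key + ".") for key in exclude_keys):
            continue
        if src_key is None:
            rel = name  # no src_key: take every entry
        elif name.startswith(src_key + "."):
            rel = name[len(src_key) + 1 :]
        else:
            continue
        out[rel if dst_key is None else f"{dst_key}.{rel}"] = value
    return out


if __name__ == "__main__":
    state = {"encoder.w": 1, "decoder.embed.weight": 2, "decoder.out.weight": 3}
    _, src, dst, excl = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
    print(remap_state(state, src, dst, excl))  # {'decoder.out.weight': 3}
```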
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file listing each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape_file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
where L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain as close to the same number of 'bins' as possible, where bins are counted as the total length of the features in the mini-batch. This sampler requires a text file which describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that contain as close to the same number of 'bins' as possible, but bins are counted as the total number of elements of the features instead of their lengths. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
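The batch types above can be sketched in plain Python: parsing a shape file, the "folded" batch-size formula, and greedy "length"/"numel" bin packing. This is a simplified illustration, not ESPnet's samplers, which may, for example, count padded sizes rather than raw sums or distribute samples differently. The `max(1, ...)` guards are assumptions, since the folded formula as written divides by zero when L < fold_length.

```python
import math
from typing import Dict, Iterable, List, Tuple


def read_shape_file(lines: Iterable[str]) -> Dict[str, Tuple[int, ...]]:
    """Parse 'utt_id 1000,80' (or 'utt_id 1000') lines into {utt_id: (1000, 80)}."""
    shapes = {}
    for line in lines:
        utt_id, shape = line.split(maxsplit=1)
        shapes[utt_id] = tuple(int(x) for x in shape.split(","))
    return shapes


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), with assumed guards against 0."""
    divisor = max(1, max_len // fold_length)  # L < fold_length would otherwise divide by 0
    return max(1, base_batch_size // divisor)


def bin_batches(
    shapes: Dict[str, Tuple[int, ...]], batch_bins: int, mode: str = "length"
) -> List[List[str]]:
    """Greedily pack length-sorted utterances; 'length' counts L, 'numel' counts L * dim."""

    def bins(shape: Tuple[int, ...]) -> int:
        return shape[0] if mode == "length" else math.prod(shape)

    keys = sorted(shapes, key=lambda k: shapes[k][0], reverse=True)  # longest first
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for key in keys:
        cost = bins(shapes[key])
        if current and used + cost > batch_bins:  # close the batch before overflowing
            batches.append(current)
            current, used = [], 0
        current.append(key)
        used += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    shapes = read_shape_file(
        ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
    )
    print(folded_batch_size(20, max_len=1453, fold_length=500))  # 20 // 2 = 10
    print(bin_batches(shapes, batch_bins=2500))
```

The greedy loop only shows the idea of "roughly equal bins per batch"; a real sampler also has to respect minimum batch sizes and shuffling.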
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between consecutive chunks. (default: 0.5)
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
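The chunk iterator options can be sketched as three small helpers: expanding a --chunk_length spec into candidate lengths, computing chunk start offsets from --chunk_shift_ratio, and applying the --chunk_excluded_key_prefixes rule. This is not ESPnet's ChunkIterFactory; treating '300-400' as inclusive of both ends and computing the shift as int(chunk_length * chunk_shift_ratio) are assumptions.

```python
import re
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """'300' -> [300]; '300,400' -> [300, 400]; '300-302' -> [300, 301, 302] (inclusive, assumed)."""
    candidates: List[int] = []
    for part in spec.split(","):
        bounds = [int(x) for x in part.split("-")]
        if len(bounds) == 1:
            candidates.append(bounds[0])
        elif len(bounds) == 2:
            candidates.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"bad chunk_length spec: {spec!r}")
    return candidates


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of full chunks; a sequence shorter than chunk_length yields []."""
    if seq_len < chunk_length:
        return []  # the sample is discarded
    shift = max(1, int(chunk_length * shift_ratio))  # ratio < 1 overlaps, > 1 leaves gaps
    return list(range(0, seq_len - chunk_length + 1, shift))


def is_excluded(key: str, prefixes: List[str]) -> bool:
    """True if key equals a prefix, or is a prefix followed only by digits."""
    return any(key == p or re.fullmatch(re.escape(p) + r"\d+", key) for p in prefixes)


if __name__ == "__main__":
    print(parse_chunk_length("300,400-402"))  # [300, 400, 401, 402]
    print(chunk_starts(1000, 500, 0.5))       # [0, 250, 500]
    print(is_excluded("text2", ["text"]))     # True
```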
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three words separated by commas. It's used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile: wav, flac, etc.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Enables a multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integer numbers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integer numbers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of float numbers separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of float numbers separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first level or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate random float-ndarray which has the given shapes in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int-ndarray which has the given shapes in the file. Give the lower and upper values in the file type name, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader, e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
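The path/name/type triple and a few of the plain-text file types listed above can be parsed with the following sketch. The helper names are hypothetical and this is not ESPnet's dataset code; splitting the triple from the right so that a comma inside the path survives is an assumption about edge cases.

```python
import re
from typing import List, Tuple, Union

Number = Union[int, float]


def parse_data_triple(arg: str) -> Tuple[str, str, str]:
    """'some/path/a.scp,foo,sound' -> ('some/path/a.scp', 'foo', 'sound')."""
    # rsplit keeps any comma inside the path itself (assumed edge-case handling).
    path, name, file_type = arg.rsplit(",", 2)
    return path, name, file_type


def parse_rand_int(file_type: str) -> Tuple[int, int]:
    """'rand_int_0_10' -> (0, 10): the lower and upper values encoded in the type name."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        raise ValueError(f"not a rand_int type: {file_type!r}")
    return int(match.group(1)), int(match.group(2))


def parse_numeric_line(line: str, file_type: str) -> Tuple[str, List[Number]]:
    """Parse one line of a text_int, csv_int, text_float, or csv_float file."""
    utt_id, rest = line.split(maxsplit=1)
    sep = "," if file_type.startswith("csv_") else None  # None: split on whitespace
    cast = int if file_type.endswith("_int") else float
    return utt_id, [cast(x) for x in rest.split(sep)]


if __name__ == "__main__":
    print(parse_data_triple("some/path/a.scp,foo,sound"))
    print(parse_numeric_line("utterance_id_B 3.,3.12,1.1", "csv_float"))
```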
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related
--token_list TOKEN_LIST
A text mapping int-id to token (for target language) (default: None)
--src_token_list SRC_TOKEN_LIST
A text mapping int-id to token (for source language) (default: None)
--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
The initialization method (default: None)
--input_size INPUT_SIZE
The number of input dimension of the feature (default: None)
--ctc_conf CTC_CONF The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {'asr_weight': 0.0, 'mt_weight': 0.0, 'mtlalpha': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True})
Preprocess related
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The target text will be tokenized at the specified token level (default: bpe)
--src_token_type {bpe,char,word,phn,none}
The source text will be tokenized at the specified token level (default: bpe)
--bpemodel BPEMODEL The model file of sentencepiece (for target language) (default: None)
--src_bpemodel SRC_BPEMODEL
The model file of sentencepiece (for source language) (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
--speech_volume_normalize SPEECH_VOLUME_NORMALIZE
Scale the maximum amplitude to the given value. (default: None)
--rir_scp RIR_SCP The file path of rir scp file. (default: None)
--rir_apply_prob RIR_APPLY_PROB
The probability of applying RIR convolution. (default: 1.0)
--noise_scp NOISE_SCP
The file path of noise scp file. (default: None)
--noise_apply_prob NOISE_APPLY_PROB
The probability of applying noise addition. (default: 1.0)
--noise_db_range NOISE_DB_RANGE
The range of the noise level in decibels. (default: 13_15)
--short_noise_thres SHORT_NOISE_THRES
If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
--frontend {default,sliding_window,s3prl}
The frontend type (default: default)
--frontend_conf FRONTEND_CONF
The keyword arguments for frontend (default: {})
--specaug {specaug,None}
The specaug type (default: None)
--specaug_conf SPECAUG_CONF
The keyword arguments for specaug (default: {})
--normalize {global_mvn,utterance_mvn,None}
The normalize type (default: utterance_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--preencoder {sinc,linear,None}
The preencoder type (default: None)
--preencoder_conf PREENCODER_CONF
The keyword arguments for preencoder (default: {})
--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}
The encoder type (default: rnn)
--encoder_conf ENCODER_CONF
The keyword arguments for encoder (default: {})
--postencoder {hugging_face_transformers,None}
The postencoder type (default: None)
--postencoder_conf POSTENCODER_CONF
The keyword arguments for postencoder (default: {})
--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The decoder type (default: rnn)
--decoder_conf DECODER_CONF
The keyword arguments for decoder (default: {})
--extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The extra_asr_decoder type (default: rnn)
--extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF
The keyword arguments for extra_asr_decoder (default: {})
--extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
The extra_mt_decoder type (default: rnn)
--extra_mt_decoder_conf EXTRA_MT_DECODER_CONF
The keyword arguments for extra_mt_decoder (default: {})
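A sketch of how --noise_db_range and --speech_volume_normalize could be applied to a waveform, using plain lists instead of arrays. It assumes that the range is read as a speech-to-noise ratio in dB, that one value is drawn uniformly per utterance, and that the noise has already been cropped or tiled to the speech length; none of these details is confirmed by the help text, so check espnet2's preprocessor before relying on them.

```python
import math
import random
from typing import List, Sequence, Tuple


def parse_noise_db_range(spec: str) -> Tuple[float, float]:
    """'13_15' -> (13.0, 15.0); a single value such as '13' -> (13.0, 13.0) (assumed)."""
    parts = [float(x) for x in spec.split("_")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"bad noise_db_range: {spec!r}")


def power(x: Sequence[float]) -> float:
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale noise so that the speech-to-noise ratio is snr_db decibels, then add it."""
    if len(speech) != len(noise):
        raise ValueError("noise must already match the speech length")
    scale = math.sqrt(power(speech) / (max(power(noise), 1e-10) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def normalize_volume(x: Sequence[float], peak: float) -> List[float]:
    """Scale x so that its maximum absolute amplitude equals peak."""
    max_abs = max(abs(v) for v in x)
    return list(x) if max_abs == 0 else [v * peak / max_abs for v in x]


if __name__ == "__main__":
    low, high = parse_noise_db_range("13_15")
    snr_db = random.uniform(low, high)  # assumption: one draw per utterance
    speech = [0.5, -0.5, 0.5, -0.5]
    noise = [1.0, 1.0, -1.0, -1.0]
    print(normalize_volume(add_noise(speech, noise, snr_db), 1.0))
```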
tts_inference.py¶
usage: tts_inference.py [-h] [--config CONFIG]
[--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
--output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
[--dtype {float16,float32,float64}]
[--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
--data_path_and_name_and_type
DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--train_config TRAIN_CONFIG]
[--model_file MODEL_FILE] [--model_tag MODEL_TAG]
[--maxlenratio MAXLENRATIO]
[--minlenratio MINLENRATIO] [--threshold THRESHOLD]
[--use_att_constraint USE_ATT_CONSTRAINT]
[--backward_window BACKWARD_WINDOW]
[--forward_window FORWARD_WINDOW]
[--use_teacher_forcing USE_TEACHER_FORCING]
[--speed_control_alpha SPEED_CONTROL_ALPHA]
[--noise_scale NOISE_SCALE]
[--noise_scale_dur NOISE_SCALE_DUR]
[--always_fix_seed ALWAYS_FIX_SEED]
[--vocoder_config VOCODER_CONFIG]
[--vocoder_file VOCODER_FILE]
[--vocoder_tag VOCODER_TAG]
TTS inference
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--output_dir OUTPUT_DIR
The path of output directory (default: None)
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--dtype {float16,float32,float64}
Data type (default: float32)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--batch_size BATCH_SIZE
The batch size for inference (default: 1)
--speed_control_alpha SPEED_CONTROL_ALPHA
Alpha in FastSpeech to change the speed of generated
speech (default: 1.0)
--noise_scale NOISE_SCALE
Noise scale parameter for the flow in vits (default:
0.667)
--noise_scale_dur NOISE_SCALE_DUR
Noise scale parameter for the stochastic duration
predictor in vits (default: 0.8)
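In FastSpeech-style models, --speed_control_alpha is usually applied in the length regulator by scaling the predicted per-token durations: alpha > 1 lengthens them (slower speech) and alpha < 1 shortens them (faster speech). The sketch below assumes that behavior and uses strings in place of hidden vectors; the function name and rounding are assumptions, not ESPnet's code.

```python
from typing import List, Sequence


def regulate_length(features: Sequence[str], durations: Sequence[int], alpha: float = 1.0) -> List[str]:
    """Repeat each token-level feature round(duration * alpha) times."""
    if len(features) != len(durations):
        raise ValueError("features and durations must have the same length")
    frames: List[str] = []
    for feat, dur in zip(features, durations):
        # Python's round() rounds halves to even, e.g. round(1.5) == 2, round(2.5) == 2.
        frames.extend([feat] * int(round(dur * alpha)))
    return frames


if __name__ == "__main__":
    print(regulate_length(["a", "b"], [2, 3]))            # ['a', 'a', 'b', 'b', 'b']
    print(len(regulate_length(["a", "b"], [2, 3], 2.0)))  # 10 frames: twice as slow
```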
Input data related:
--data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
--key_file KEY_FILE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
The model configuration related:
--train_config TRAIN_CONFIG
Training configuration file (default: None)
--model_file MODEL_FILE
Model parameter file (default: None)
--model_tag MODEL_TAG
Pretrained model tag. If this option is specified,
train_config and model_file will be overwritten
(default: None)
Decoding related:
--maxlenratio MAXLENRATIO
Maximum length ratio in decoding (default: 10.0)
--minlenratio MINLENRATIO
Minimum length ratio in decoding (default: 0.0)
--threshold THRESHOLD
Threshold value in decoding (default: 0.5)
--use_att_constraint USE_ATT_CONSTRAINT
Whether to use attention constraint (default: False)
--backward_window BACKWARD_WINDOW
Backward window value in attention constraint
(default: 1)
--forward_window FORWARD_WINDOW
Forward window value in attention constraint (default:
3)
--use_teacher_forcing USE_TEACHER_FORCING
Whether to use teacher forcing (default: False)
--always_fix_seed ALWAYS_FIX_SEED
Whether to always fix seed (default: False)
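For autoregressive models such as Tacotron 2, --maxlenratio, --minlenratio and --threshold typically combine into a stopping rule: the output length is bounded relative to the input length, and decoding ends once the predicted stop-token probability reaches the threshold. The sketch below replays a precomputed list of stop probabilities; it is a simplified assumption, and ESPnet's implementation may additionally account for the reduction factor.

```python
from typing import Sequence


def stop_step(
    stop_probs: Sequence[float],
    input_len: int,
    maxlenratio: float,
    minlenratio: float,
    threshold: float,
) -> int:
    """Return the number of decoding steps taken before stopping."""
    maxlen = int(input_len * maxlenratio)
    minlen = int(input_len * minlenratio)
    for step, prob in enumerate(stop_probs, start=1):
        # Stopping is only allowed once minlen steps have been generated.
        if step >= minlen and (prob >= threshold or step >= maxlen):
            return step
    return len(stop_probs)  # the replayed probabilities ran out first


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.6, 0.9]
    print(stop_step(probs, input_len=10, maxlenratio=10.0, minlenratio=0.0, threshold=0.5))  # 3
```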
Vocoder related:
--vocoder_config VOCODER_CONFIG
Vocoder configuration file (default: None)
--vocoder_file VOCODER_FILE
Vocoder parameter file (default: None)
--vocoder_tag VOCODER_TAG
Pretrained vocoder tag. If this option is specified,
vocoder_config and vocoder_file will be overwritten
(default: None)
tts_train.py¶
usage: tts_train.py [-h] [--config CONFIG] [--print_config]
[--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
[--dry_run DRY_RUN]
[--iterator_type {sequence,chunk,task,none}]
[--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
[--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
[--dist_backend DIST_BACKEND]
[--dist_init_method DIST_INIT_METHOD]
[--dist_world_size DIST_WORLD_SIZE]
[--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
[--dist_master_addr DIST_MASTER_ADDR]
[--dist_master_port DIST_MASTER_PORT]
[--dist_launcher {slurm,mpi,None}]
[--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
[--unused_parameters UNUSED_PARAMETERS]
[--sharded_ddp SHARDED_DDP]
[--cudnn_enabled CUDNN_ENABLED]
[--cudnn_benchmark CUDNN_BENCHMARK]
[--cudnn_deterministic CUDNN_DETERMINISTIC]
[--collect_stats COLLECT_STATS]
[--write_collected_feats WRITE_COLLECTED_FEATS]
[--max_epoch MAX_EPOCH] [--patience PATIENCE]
[--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
[--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
[--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
[--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
[--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
[--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
[--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
[--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
[--train_dtype {float16,float32,float64}]
[--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
[--use_matplotlib USE_MATPLOTLIB]
[--use_tensorboard USE_TENSORBOARD]
[--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
[--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
[--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
[--wandb_name WANDB_NAME]
[--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
[--detect_anomaly DETECT_ANOMALY]
[--pretrain_path PRETRAIN_PATH]
[--init_param [INIT_PARAM [INIT_PARAM ...]]]
[--ignore_init_mismatch IGNORE_INIT_MISMATCH]
[--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
[--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
[--batch_size BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE]
[--batch_bins BATCH_BINS]
[--valid_batch_bins VALID_BATCH_BINS]
[--train_shape_file TRAIN_SHAPE_FILE]
[--valid_shape_file VALID_SHAPE_FILE]
[--batch_type {unsorted,sorted,folded,length,numel}]
[--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
[--fold_length FOLD_LENGTH]
[--sort_in_batch {descending,ascending}]
[--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
[--sort_batch {descending,ascending}]
[--multiple_iterator MULTIPLE_ITERATOR]
[--chunk_length CHUNK_LENGTH]
[--chunk_shift_ratio CHUNK_SHIFT_RATIO]
[--num_cache_chunks NUM_CACHE_CHUNKS]
[--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
[--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
[--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
[--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
[--max_cache_size MAX_CACHE_SIZE]
[--max_cache_fd MAX_CACHE_FD]
[--valid_max_cache_size VALID_MAX_CACHE_SIZE]
[--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
[--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
[--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
[--optim_conf OPTIM_CONF]
[--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
[--scheduler_conf SCHEDULER_CONF]
[--token_list TOKEN_LIST] [--odim ODIM]
[--model_conf MODEL_CONF]
[--use_preprocessor USE_PREPROCESSOR]
[--token_type {bpe,char,word,phn}] [--bpemodel BPEMODEL]
[--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
[--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
[--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
[--feats_extract {fbank,spectrogram,linear_spectrogram}]
[--feats_extract_conf FEATS_EXTRACT_CONF]
[--normalize {global_mvn,None}]
[--normalize_conf NORMALIZE_CONF]
[--tts {tacotron2,transformer,fastspeech,fastspeech2,prodiff,vits,joint_text2wav,jets}]
[--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
[--pitch_extract_conf PITCH_EXTRACT_CONF]
[--pitch_normalize {global_mvn,None}]
[--pitch_normalize_conf PITCH_NORMALIZE_CONF]
[--energy_extract {energy,None}]
[--energy_extract_conf ENERGY_EXTRACT_CONF]
[--energy_normalize {global_mvn,None}]
[--energy_normalize_conf ENERGY_NORMALIZE_CONF]
base parser
optional arguments:
--config CONFIG Give config file in yaml format (default: None)
--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
non_linguistic_symbols file path (default: None)
--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
Apply text cleaning (default: None)
--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
Specify g2p method if --token_type=phn (default: None)
Common configuration:
--print_config Print the config file and exit (default: False)
--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
The verbose level of logging (default: INFO)
--dry_run DRY_RUN Perform process without training (default: False)
--iterator_type {sequence,chunk,task,none}
Specify iterator type (default: sequence)
--output_dir OUTPUT_DIR
--ngpu NGPU The number of gpus. 0 indicates CPU mode (default: 0)
--seed SEED Random seed (default: 0)
--num_workers NUM_WORKERS
The number of workers used for DataLoader (default: 1)
--num_att_plot NUM_ATT_PLOT
The number of images to plot from the attention outputs. This option only makes sense for attention-based models. Set it to 0 to disable the attention plot (default: 3)
distributed training related:
--dist_backend DIST_BACKEND
distributed backend (default: nccl)
--dist_init_method DIST_INIT_METHOD
If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are read. (default: env://)
--dist_world_size DIST_WORLD_SIZE
number of nodes for distributed training (default: None)
--dist_rank DIST_RANK
node rank for distributed training (default: None)
--local_rank LOCAL_RANK
local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
--dist_master_addr DIST_MASTER_ADDR
The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_master_port DIST_MASTER_PORT
The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
--dist_launcher {slurm,mpi,None}
The launcher type for distributed training (default: None)
--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
--unused_parameters UNUSED_PARAMETERS
Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
--sharded_ddp SHARDED_DDP
Enable sharded training provided by fairscale (default: False)
cudnn mode related:
--cudnn_enabled CUDNN_ENABLED
Enable CUDNN (default: True)
--cudnn_benchmark CUDNN_BENCHMARK
Enable cudnn-benchmark mode (default: False)
--cudnn_deterministic CUDNN_DETERMINISTIC
Enable cudnn-deterministic mode (default: True)
collect stats mode related:
--collect_stats COLLECT_STATS
Run in "collect stats" mode (default: False)
--write_collected_feats WRITE_COLLECTED_FEATS
Write the output features from the model in "collect stats" mode (default: False)
Trainer related:
--max_epoch MAX_EPOCH
The maximum number of epochs to train (default: 40)
--patience PATIENCE Number of epochs to wait without improvement before stopping the training (default: None)
--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
The criterion whose value is given to the lr scheduler. Give a pair: the phase ("train" or "valid") and the criterion name. The mode ("min" or "max") can be changed via --scheduler_conf (default: ('valid', 'loss'))
--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
The criterion used to judge early stopping. Give a triple: the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "valid acc max". (default: ('valid', 'loss', 'min'))
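How --patience and --early_stopping_criterion interact can be pictured with a small sketch. It assumes training stops once the best epoch for the chosen (phase, criterion, mode) triple lies more than `patience` epochs behind the latest epoch; `best_epoch`, `should_stop`, and the dict-based history are hypothetical stand-ins, not ESPnet internals.

```python
from typing import Dict, Optional


def best_epoch(history: Dict[int, float], mode: str) -> int:
    """Return the epoch whose value is best under mode ("min" or "max")."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    return pick(history, key=history.__getitem__)


def should_stop(history: Dict[int, float], mode: str, patience: Optional[int]) -> bool:
    """Hypothetical check: stop once the best epoch is more than `patience` epochs old.

    history maps epoch number -> value of one (phase, criterion) pair,
    e.g. the validation loss of every finished epoch.
    patience=None disables early stopping, like the --patience default.
    """
    if patience is None or not history:
        return False
    latest = max(history)
    return latest - best_epoch(history, mode) > patience


# Roughly: --early_stopping_criterion valid loss min --patience 2
valid_loss = {1: 3.2, 2: 2.9, 3: 2.95, 4: 3.0, 5: 3.1}
print(should_stop(valid_loss, "min", patience=2))  # True: best epoch 2, latest epoch 5
```

Whether the real trainer compares with `>` or `>=` is not stated above, so treat the boundary as an assumption.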
--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
The criteria used to judge the best model. Each criterion gives the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
Remove previous snapshots except those of the n-best scored epochs (default: [10])
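For a single criterion, selecting which snapshots survive --keep_nbest_models can be sketched as below. `nbest_epochs` and `epochs_to_remove` are hypothetical helpers; how the tool combines several --best_model_criterion entries is not modeled here.

```python
from typing import Dict, List


def nbest_epochs(history: Dict[int, float], mode: str, n: int) -> List[int]:
    """Return the n best epochs for one criterion, best first (hypothetical helper)."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    # Ascending for "min", descending for "max"; sorted() is stable, so ties keep insertion order.
    ranked = sorted(history, key=history.__getitem__, reverse=(mode == "max"))
    return ranked[:n]


def epochs_to_remove(history: Dict[int, float], mode: str, n: int) -> List[int]:
    """Snapshots outside the n-best set are the candidates for deletion."""
    keep = set(nbest_epochs(history, mode, n))
    return sorted(e for e in history if e not in keep)


valid_acc = {1: 0.71, 2: 0.78, 3: 0.80, 4: 0.79, 5: 0.77}
print(nbest_epochs(valid_acc, "max", 2))      # [3, 4]
print(epochs_to_remove(valid_acc, "max", 2))  # [1, 2, 5]
```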
--nbest_averaging_interval NBEST_AVERAGING_INTERVAL
The epoch interval to apply model averaging and save nbest models (default: 0)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (default: 5.0)
--grad_clip_type GRAD_CLIP_TYPE
The p of the p-norm used for gradient clipping. Can be inf (default: 2.0)
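--grad_clip and --grad_clip_type presumably feed a global-norm clip in the style of `torch.nn.utils.clip_grad_norm_`. The pure-Python sketch below, with gradients as plain lists, shows what the two values mean; `clip_grad_norm` here is our own illustration, not the tool's code.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float, norm_type: float = 2.0) -> float:
    """Rescale gradients in place so their global p-norm is at most max_norm.

    One norm is computed over all gradient "tensors" together, then every entry
    is multiplied by the same factor. Returns the norm measured before clipping.
    """
    values = [abs(g) for tensor in grads for g in tensor]
    if not values:
        return 0.0
    if math.isinf(norm_type):
        total_norm = max(values)  # the inf-norm is the largest absolute entry
    else:
        total_norm = sum(v ** norm_type for v in values) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)  # epsilon avoids division by zero
    if clip_coef < 1.0:
        for tensor in grads:
            for i in range(len(tensor)):
                tensor[i] *= clip_coef
    return total_norm


grads = [[3.0, 0.0], [4.0]]  # global L2 norm is 5
print(clip_grad_norm(grads, max_norm=1.0))  # 5.0; grads now have an L2 norm of about 1
```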
--grad_noise GRAD_NOISE
The flag to switch to use noise injection to gradients during training (default: False)
--accum_grad ACCUM_GRAD
The number of gradient accumulation steps (default: 1)
--no_forward_run NO_FORWARD_RUN
Only iterate over data loading, without model forwarding or training (default: False)
--resume RESUME Enable resuming if a checkpoint exists (default: False)
--train_dtype {float16,float32,float64}
Data type for training. (default: float32)
--use_amp USE_AMP Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
--log_interval LOG_INTERVAL
Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
--use_matplotlib USE_MATPLOTLIB
Enable matplotlib logging (default: True)
--use_tensorboard USE_TENSORBOARD
Enable tensorboard logging (default: True)
--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
Whether to create graph in tensorboard (default: False)
--use_wandb USE_WANDB
Enable wandb logging (default: False)
--wandb_project WANDB_PROJECT
Specify wandb project (default: None)
--wandb_id WANDB_ID Specify wandb id (default: None)
--wandb_entity WANDB_ENTITY
Specify wandb entity (default: None)
--wandb_name WANDB_NAME
Specify wandb run name (default: None)
--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
Set the model log period (default: -1)
--detect_anomaly DETECT_ANOMALY
Set torch.autograd.set_detect_anomaly (default: False)
Pretraining model related:
--pretrain_path PRETRAIN_PATH
This option is obsolete (default: None)
--init_param [INIT_PARAM [INIT_PARAM ...]]
Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
# Load all parameters
--init_param some/where/model.pth
# Load only decoder parameters
--init_param some/where/model.pth:decoder:decoder
# Load only decoder parameters excluding decoder.embed
--init_param some/where/model.pth:decoder:decoder:decoder.embed
(default: [])
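The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format can be illustrated on a plain dict standing in for a model state dict. This is a sketch under our own assumptions: fields split on ':', exclude_keys is comma-separated, and `parse_init_param` and `select_states` are hypothetical names rather than ESPnet functions.

```python
from typing import Dict, List, Optional, Tuple


def parse_init_param(spec: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields.

    Assumptions (ours): fields are separated by ':', so file_path itself must not
    contain ':'; omitted fields become None or []; exclude_keys is comma-separated.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    path, src_key, dst_key, excludes = parts + [""] * (4 - len(parts))
    exclude_keys = [k for k in excludes.split(",") if k]
    return path, src_key or None, dst_key or None, exclude_keys


def select_states(
    state: Dict[str, object], src_key: Optional[str], exclude_keys: List[str]
) -> Dict[str, object]:
    """Keep the entries of a state dict that would be copied into the dst_key module."""
    selected = {}
    for key, value in state.items():
        # Drop keys under an excluded prefix such as 'decoder.embed'.
        if any(key == ex or key.startswith(ex + ".") for ex in exclude_keys):
            continue
        if src_key is None:
            selected[key] = value  # no src_key: load everything as-is
        elif key.startswith(src_key + "."):
            # Strip the src_key prefix (e.g. 'decoder.') so names match the dst_key module.
            selected[key[len(src_key) + 1:]] = value
    return selected


state = {"encoder.layer.weight": 1, "decoder.embed.weight": 2, "decoder.output.weight": 3}
path, src, dst, excl = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(path, src, dst, excl)  # some/where/model.pth decoder decoder ['decoder.embed']
print(select_states(state, src, excl))  # {'output.weight': 3}
```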
--ignore_init_mismatch IGNORE_INIT_MISMATCH
Ignore size mismatch when loading pre-trained model (default: False)
--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
Freeze parameters (default: [])
BatchSampler related:
--num_iters_per_epoch NUM_ITERS_PER_EPOCH
Restrict the number of iterations for training per epoch (default: None)
--batch_size BATCH_SIZE
The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
--valid_batch_size VALID_BATCH_SIZE
If not given, the value of --batch_size is used (default: None)
--batch_bins BATCH_BINS
The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
--valid_batch_bins VALID_BATCH_BINS
If not given, the value of --batch_bins is used (default: None)
--train_shape_file TRAIN_SHAPE_FILE
--valid_shape_file VALID_SHAPE_FILE
Sequence iterator related:
--batch_type {unsorted,sorted,folded,length,numel}
"unsorted":
UnsortedBatchSampler has no special features; it simply creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.
utterance_id_a
utterance_id_b
utterance_id_c
Only the first column is read, so a 'shape file' can be used, too.
utterance_id_a 100,80
utterance_id_b 400,80
utterance_id_c 512,80
"sorted":
SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"folded":
FoldedBatchSampler supports variable batch_size. The batch_size is decided by
batch_size = base_batch_size // (L // fold_length)
L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler
"length":
LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that each hold roughly the same number of 'bins', counted as the total length of the features in the mini-batch. This sampler requires a text file which describes the length of each sample.
utterance_id_a 1000
utterance_id_b 1453
utterance_id_c 1241
Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
"numel":
NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches that each hold roughly the same number of 'bins', but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.
utterance_id_a 1000,80
utterance_id_b 1453,80
utterance_id_c 1241,80
(default: folded)
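The batch types described above can be approximated in a few lines. The sketch follows the help text literally, sorts ascending (the tool's --sort_batch default is descending), and ignores details the real samplers may handle such as padding; every function name is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]


def read_shapes(lines: List[str]) -> Shapes:
    """Parse shape-file lines such as 'utterance_id_a 1000,80' (or 'utterance_id_a 1000')."""
    shapes: Shapes = {}
    for line in lines:
        utt, dims = line.split(maxsplit=1)
        shapes[utt] = tuple(int(d) for d in dims.split(","))
    return shapes


def sorted_batches(shapes: Shapes, batch_size: int) -> List[List[str]]:
    """The 'sorted' type: order samples by length (first dimension), cut fixed-size batches."""
    keys = sorted(shapes, key=lambda k: shapes[k][0])
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def folded_batch_size(base_batch_size: int, max_len: int, fold_length: int) -> int:
    """The 'folded' type: base_batch_size // (L // fold_length), as in the help text.

    The two max(1, ...) guards are our own additions: they avoid dividing by zero
    when L < fold_length and keep the batch size at least 1.
    """
    return max(1, base_batch_size // max(1, max_len // fold_length))


def bins_batches(shapes: Shapes, batch_bins: int, use_numel: bool = False) -> List[List[str]]:
    """The 'length' / 'numel' types: fill a batch until the next sample would exceed batch_bins.

    'length' counts the first dimension of each sample; 'numel' counts all of its
    elements (the product of its dimensions). An oversized sample gets its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in sorted(shapes, key=lambda k: shapes[k][0]):
        size = math.prod(shapes[key]) if use_numel else shapes[key][0]
        if current and total + size > batch_bins:
            batches.append(current)
            current, total = [], 0
        current.append(key)
        total += size
    if current:
        batches.append(current)
    return batches


shapes = read_shapes(["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"])
print(sorted_batches(shapes, batch_size=2))  # [['utterance_id_a', 'utterance_id_c'], ['utterance_id_b']]
print(folded_batch_size(20, max_len=1453, fold_length=500))  # 10
print(bins_batches(shapes, batch_bins=2500))
print(bins_batches(shapes, batch_bins=150000, use_numel=True))
```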
--valid_batch_type {unsorted,sorted,folded,length,numel,None}
If not given, the value of --batch_type is used (default: None)
--fold_length FOLD_LENGTH
--sort_in_batch {descending,ascending}
Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
--shuffle_within_batch SHUFFLE_WITHIN_BATCH
Shuffle the samples within each batch. Normally required for classification tasks. (default: False)
--sort_batch {descending,ascending}
Sort mini-batches by the sample lengths (default: descending)
--multiple_iterator MULTIPLE_ITERATOR
Use multiple iterator mode (default: False)
Chunk iterator related:
--chunk_length CHUNK_LENGTH
Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple comma-separated numbers are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
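The three accepted --chunk_length forms can be expanded as follows. This is a sketch, assuming the '-' range is inclusive and that only candidates no longer than the sequence are eligible; `parse_chunk_length` and `pick_chunk_length` are hypothetical names.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate chunk lengths."""
    if "-" in spec:
        low, high = (int(x) for x in spec.split("-"))
        return list(range(low, high + 1))  # inclusive upper bound (our assumption)
    return [int(x) for x in spec.split(",")]


def pick_chunk_length(spec: str, seq_len: int, rng: random.Random) -> Optional[int]:
    """Choose a chunk length for one sample, or None if every candidate is too long."""
    candidates = [c for c in parse_chunk_length(spec) if c <= seq_len]
    if not candidates:
        return None  # the sample would be discarded
    return rng.choice(candidates)


rng = random.Random(0)
print(parse_chunk_length("300,400,500"))  # [300, 400, 500]
print(pick_chunk_length("300,400,500", seq_len=350, rng=rng))  # 300 (the only candidate that fits)
print(pick_chunk_length("300-400", seq_len=250, rng=rng))  # None
```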
--chunk_shift_ratio CHUNK_SHIFT_RATIO
Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
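The effect of --chunk_shift_ratio on chunk positions can be checked with simple arithmetic. The sketch assumes the hop is int(chunk_len * ratio) and applies no random start offset; `chunk_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_windows(seq_len: int, chunk_len: int, shift_ratio: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of fixed-size chunks over one sequence.

    The hop is int(chunk_len * shift_ratio): a ratio below 1 makes chunks overlap,
    a ratio above 1 leaves gaps. The max(1, ...) guard on the hop is our addition.
    """
    if seq_len < chunk_len:
        return []
    shift = max(1, int(chunk_len * shift_ratio))
    num_chunks = (seq_len - chunk_len) // shift + 1
    return [(i * shift, i * shift + chunk_len) for i in range(num_chunks)]


print(chunk_windows(1000, 500, 0.5))  # [(0, 500), (250, 750), (500, 1000)]
print(chunk_windows(1000, 300, 1.5))  # [(0, 300), (450, 750)]
```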
--num_cache_chunks NUM_CACHE_CHUNKS
Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
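Why a larger --num_cache_chunks gives more randomness is easiest to see with a shuffle buffer. This is a generic sketch of the idea, not ChunkIterFactory's code; `buffered_batches` and its remainder handling are our own choices.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_batches(
    chunks: Iterable[T], cache_size: int, batch_size: int, seed: int = 0
) -> Iterator[List[T]]:
    """Shuffle chunks inside a cache of cache_size items and yield mini-batches.

    Full batches are emitted whenever the cache fills up; leftovers are carried
    over, and whatever remains at the end is flushed as a final (possibly short) batch.
    """
    if cache_size < batch_size:
        raise ValueError("cache_size should be at least batch_size")
    rng = random.Random(seed)
    cache: List[T] = []
    for chunk in chunks:
        cache.append(chunk)
        if len(cache) >= cache_size:
            rng.shuffle(cache)
            n_full = len(cache) // batch_size * batch_size
            for i in range(0, n_full, batch_size):
                yield cache[i:i + batch_size]
            cache = cache[n_full:]  # carry the remainder into the next round
    rng.shuffle(cache)
    for i in range(0, len(cache), batch_size):
        yield cache[i:i + batch_size]


batches = list(buffered_batches(range(10), cache_size=4, batch_size=2))
print(batches)  # five batches of two chunks each; the order depends on the seed
```

A chunk can only be mixed with chunks that share its cache, so a bigger cache mixes chunks from more utterances.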
--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
- exactly match one of the prefixes in `chunk_excluded_key_prefixes`
- have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
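A literal reading of the two exclusion conditions above can be written as a small predicate. `is_excluded` and the key names in the usage lines are hypothetical; the real matching rule may be stricter, for example requiring only digits after the prefix.

```python
import re
from typing import Iterable


def is_excluded(key: str, prefixes: Iterable[str]) -> bool:
    """True if key equals a prefix, or starts with a prefix and ends with digits."""
    for prefix in prefixes:
        if key == prefix:
            return True
        if key.startswith(prefix) and re.search(r"\d+$", key):
            return True
    return False


print(is_excluded("spk_embed2", ["spk_embed"]))   # True
print(is_excluded("spk_embed_x", ["spk_embed"]))  # False
```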
Dataset related:
--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
Give three comma-separated values. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:
"sound":
Audio formats supported by sndfile, e.g. wav, flac.
utterance_id_a a.wav
utterance_id_b b.wav
...
"multi_columns_sound":
Multi-column wav.scp. The following text file can be loaded as multi-channel audio data
utterance_id_a a.wav a2.wav
utterance_id_b b.wav b2.wav
...
"score":
Return text as is. The text contains tempo and note info.
For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.
utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
...
"duration":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
...
"kaldi_ark":
Kaldi-ark file type.
utterance_id_A /some/where/a.ark:123
utterance_id_B /some/where/a.ark:456
...
"npy":
Npy file format.
utterance_id_A /some/where/a.npy
utterance_id_B /some/where/b.npy
...
"text_int":
A text file containing a sequence of integers separated by spaces.
utterance_id_A 12 0 1 3
utterance_id_B 3 3 1
...
"csv_int":
A text file containing a sequence of integers separated by commas.
utterance_id_A 100,80
utterance_id_B 143,80
...
"text_float":
A text file containing a sequence of floats separated by spaces.
utterance_id_A 12. 3.1 3.4 4.4
utterance_id_B 3. 3.12 1.1
...
"csv_float":
A text file containing a sequence of floats separated by commas.
utterance_id_A 12.,3.1,3.4,4.4
utterance_id_B 3.,3.12,1.1
...
"text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
utterance_id_A hello world
utterance_id_B foo bar
...
"random_text":
Return text as is. The text must be converted to ndarray by 'preprocess'.
hello world
foo bar
...
"hdf5":
An HDF5 file which contains arrays at the first or the second level.
>>> f = h5py.File('file.h5')
>>> array1 = f['utterance_id_A']
>>> array2 = f['utterance_id_B']
"rand_float":
Generate a random float ndarray with the shapes given in the file.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rand_int_\d+_\d+":
e.g. 'rand_int_0_10'. Generate a random int ndarray with the shapes given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 -> generate integers from 0 to 10.
utterance_id_A 3,4
utterance_id_B 10,4
...
"rttm":
RTTM file loader, currently supported for speaker diarization
SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
END file1 <NA> 4023 <NA> <NA> <NA> <NA>
...
(default: [])
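The 'path,name,type' triple and the four numeric text formats listed above can be read with plain string handling. This sketch returns Python lists rather than ndarrays, assumes the path contains no comma, and uses hypothetical names (`parse_data_triple`, `read_value_file`).

```python
from typing import Dict, List, Tuple

VALUE_TYPES = {"text_int", "csv_int", "text_float", "csv_float"}


def parse_data_triple(arg: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (file path, key name, file type)."""
    parts = arg.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {arg!r}")
    return parts[0], parts[1], parts[2]


def read_value_file(lines: List[str], file_type: str) -> Dict[str, list]:
    """Parse text_int / csv_int / text_float / csv_float lines keyed by utterance ID."""
    if file_type not in VALUE_TYPES:
        raise ValueError(f"unsupported file type in this sketch: {file_type}")
    sep = "," if file_type.startswith("csv_") else None  # None splits on whitespace
    cast = int if file_type.endswith("_int") else float
    table: Dict[str, list] = {}
    for line in lines:
        utt, rest = line.split(maxsplit=1)
        table[utt] = [cast(v) for v in rest.split(sep)]
    return table


path, name, file_type = parse_data_triple("some/path/a.scp,foo,sound")
print(path, name, file_type)  # some/path/a.scp foo sound
print(read_value_file(["utterance_id_A 12. 3.1 3.4 4.4", "utterance_id_B 3. 3.12 1.1"], "text_float"))
# {'utterance_id_A': [12.0, 3.1, 3.4, 4.4], 'utterance_id_B': [3.0, 3.12, 1.1]}
```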
--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
Allow arbitrary keys in mini-batches, ignoring the task requirements (default: False)
--max_cache_size MAX_CACHE_SIZE
The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
--max_cache_fd MAX_CACHE_FD
The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
--valid_max_cache_size VALID_MAX_CACHE_SIZE
The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
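The size strings and the 5 percent fallback can be illustrated as follows. The parser is our own minimal stand-in and assumes decimal units (1 MB = 10**6 bytes); the size parser the tool actually uses may accept more spellings, such as binary suffixes.

```python
import re
from typing import Optional

# Decimal units are our assumption; binary suffixes such as MiB are not handled here.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> float:
    """Convert strings such as '10MB', '20GB', or '0.0' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)?\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit or ""]


def valid_cache_size(max_cache_size: str, valid_max_cache_size: Optional[str] = None) -> float:
    """Explicit validation budget if given, otherwise 5 percent of the training budget."""
    if valid_max_cache_size is not None:
        return parse_size(valid_max_cache_size)
    return parse_size(max_cache_size) * 0.05


print(parse_size("10MB"))        # 10000000.0
print(valid_cache_size("20GB"))  # about 1e9 bytes
```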
Optimizer related:
--exclude_weight_decay EXCLUDE_WEIGHT_DECAY
Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
The optimizer type (default: adadelta)
--optim_conf OPTIM_CONF
The keyword arguments for optimizer (default: {})
--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
The lr scheduler type (default: None)
--scheduler_conf SCHEDULER_CONF
The keyword arguments for lr scheduler (default: {})
Task related:
--token_list TOKEN_LIST
A text mapping int-id to token (default: None)
--odim ODIM The number of dimensions of the output feature (default: None)
--model_conf MODEL_CONF
The keyword arguments for model class. (default: {})
Preprocess related:
--use_preprocessor USE_PREPROCESSOR
Apply preprocessing to data or not (default: True)
--token_type {bpe,char,word,phn}
The text will be tokenized at the specified token level (default: phn)
--bpemodel BPEMODEL The model file of sentencepiece (default: None)
--feats_extract {fbank,spectrogram,linear_spectrogram}
The feats_extract type (default: fbank)
--feats_extract_conf FEATS_EXTRACT_CONF
The keyword arguments for feats_extract (default: {})
--normalize {global_mvn,None}
The normalize type (default: global_mvn)
--normalize_conf NORMALIZE_CONF
The keyword arguments for normalize (default: {})
--tts {tacotron2,transformer,fastspeech,fastspeech2,prodiff,vits,joint_text2wav,jets}
The tts type (default: tacotron2)
--tts_conf TTS_CONF The keyword arguments for tts (default: {})
--pitch_extract {dio,None}
The pitch_extract type (default: None)
--pitch_extract_conf PITCH_EXTRACT_CONF
The keyword arguments for pitch_extract (default: {})
--pitch_normalize {global_mvn,None}
The pitch_normalize type (default: None)
--pitch_normalize_conf PITCH_NORMALIZE_CONF
The keyword arguments for pitch_normalize (default: {})
--energy_extract {energy,None}
The energy_extract type (default: None)
--energy_extract_conf ENERGY_EXTRACT_CONF
The keyword arguments for energy_extract (default: {})
--energy_normalize {global_mvn,None}
The energy_normalize type (default: None)
--energy_normalize_conf ENERGY_NORMALIZE_CONF
The keyword arguments for energy_normalize (default: {})