espnet2.bin package

espnet2.bin.asvspoof_inference

class espnet2.bin.asvspoof_inference.SpeechAntiSpoof(asvspoof_train_config: Union[pathlib.Path, str] = None, asvspoof_model_file: Union[pathlib.Path, str] = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32')[source]

Bases: object

SpeechAntiSpoof class

Examples

>>> import soundfile
>>> speech_anti_spoof = SpeechAntiSpoof("asvspoof_config.yml", "asvspoof.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech_anti_spoof(audio)
prediction_result (int)
espnet2.bin.asvspoof_inference.get_parser()[source]
espnet2.bin.asvspoof_inference.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asvspoof_train_config: Optional[str], asvspoof_model_file: Optional[str], allow_variable_data_keys: bool)[source]
espnet2.bin.asvspoof_inference.main(cmd=None)[source]

espnet2.bin.uasr_inference_k2

espnet2.bin.uasr_inference_k2.get_parser()[source]
espnet2.bin.uasr_inference_k2.indices_to_split_size(indices, total_elements: int = None)[source]

Convert indices to split_size for torch.split.

During decoding, the API torch.tensor_split should be used. However, torch.tensor_split is only available with pytorch >= 1.8.0, so torch.split is used to pass CI with pytorch < 1.8.0. This function is used to prepare the input for torch.split.
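
A minimal sketch of the intended conversion (split indices as used by torch.tensor_split become chunk sizes usable with torch.split); the tensor, the indices, and the expected sizes are illustrative:

>>> import torch
>>> from espnet2.bin.uasr_inference_k2 import indices_to_split_size
>>> x = torch.arange(10)
>>> indices = [3, 7]  # split points, as accepted by torch.tensor_split
>>> split_size = indices_to_split_size(indices, total_elements=x.numel())
>>> torch.split(x, split_size)  # expected to reproduce torch.tensor_split(x, indices)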

espnet2.bin.uasr_inference_k2.inference(output_dir: str, decoding_graph: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], word_token_list: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, streaming: bool, is_ctc_decoding: bool, use_nbest_rescoring: bool, num_paths: int, nbest_batch_size: int, nll_batch_size: int, k2_config: Optional[str])[source]
class espnet2.bin.uasr_inference_k2.k2Speech2Text(uasr_train_config: Union[pathlib.Path, str], decoding_graph: str, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 8, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, search_beam_size: int = 20, output_beam_size: int = 20, min_active_states: int = 14000, max_active_states: int = 56000, blank_bias: float = 0.0, lattice_weight: float = 1.0, is_ctc_decoding: bool = True, lang_dir: Optional[str] = None, token_list_file: Optional[str] = None, use_fgram_rescoring: bool = False, use_nbest_rescoring: bool = False, am_weight: float = 0.5, decoder_weight: float = 0.5, nnlm_weight: float = 1.0, num_paths: int = 1000, nbest_batch_size: int = 500, nll_batch_size: int = 100)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> import numpy as np
>>> speech2text = k2Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech = np.expand_dims(audio, 0) # shape: [batch_size, speech_length]
>>> speech_lengths = np.array([audio.shape[0]]) # shape: [batch_size]
>>> batch = {"speech": speech, "speech_lengths": speech_lengths}
>>> speech2text(batch)
[(text, token, token_int, score), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build k2Speech2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Text instance.

Return type:

Speech2Text

espnet2.bin.uasr_inference_k2.main(cmd=None)[source]

espnet2.bin.asr_transducer_train

espnet2.bin.asr_transducer_train.get_parser()[source]

Get parser for ASR Transducer task.

espnet2.bin.asr_transducer_train.main(cmd=None)[source]

ASR Transducer training.

Example

% python asr_transducer_train.py asr --print_config --optim adadelta > conf/train_asr.yaml

% python asr_transducer_train.py --config conf/tuning/transducer/train_rnn_transducer.yaml

espnet2.bin.uasr_train

espnet2.bin.uasr_train.get_parser()[source]
espnet2.bin.uasr_train.main(cmd=None)[source]

UASR training.

Example

% python uasr_train.py uasr --print_config --optim adadelta > conf/train_uasr.yaml

% python uasr_train.py --config conf/train_uasr.yaml

espnet2.bin.asr_train

espnet2.bin.asr_train.get_parser()[source]
espnet2.bin.asr_train.main(cmd=None)[source]

ASR training.

Example

% python asr_train.py asr --print_config --optim adadelta > conf/train_asr.yaml

% python asr_train.py --config conf/train_asr.yaml

espnet2.bin.tts_train

espnet2.bin.tts_train.get_parser()[source]
espnet2.bin.tts_train.main(cmd=None)[source]

TTS training

Example

% python tts_train.py asr --print_config --optim adadelta

% python tts_train.py --config conf/train_asr.yaml

espnet2.bin.slu_train

espnet2.bin.slu_train.get_parser()[source]
espnet2.bin.slu_train.main(cmd=None)[source]

SLU training.

Example

% python slu_train.py slu --print_config --optim adadelta > conf/train_slu.yaml

% python slu_train.py --config conf/train_slu.yaml

espnet2.bin.mt_train

espnet2.bin.mt_train.get_parser()[source]
espnet2.bin.mt_train.main(cmd=None)[source]

MT training.

Example

% python mt_train.py st --print_config --optim adadelta > conf/train_mt.yaml

% python mt_train.py --config conf/train_mt.yaml

espnet2.bin.lm_inference

class espnet2.bin.lm_inference.GenerateText(lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlen: int = 100, minlen: int = 0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ngram_weight: float = 0.0, penalty: float = 0.0, nbest: int = 1, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]

Bases: object

GenerateText class

Examples

>>> generatetext = GenerateText(
        lm_train_config="lm_config.yaml",
        lm_file="lm.pth",
        token_type="bpe",
        bpemodel="bpe.model",
    )
>>> prompt = "I have travelled to many "
>>> generatetext(prompt)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build GenerateText instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

GenerateText instance.

Return type:

GenerateText

espnet2.bin.lm_inference.get_parser()[source]
espnet2.bin.lm_inference.inference(output_dir: str, maxlen: int, minlen: int, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]
espnet2.bin.lm_inference.main(cmd=None)[source]

espnet2.bin.asr_inference

class espnet2.bin.asr_inference.Speech2Text(asr_train_config: Union[pathlib.Path, str] = None, asr_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, enh_s2t_task: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8', hugging_face_decoder: bool = False, hugging_face_decoder_max_length: int = 256, time_sync: bool = False, multi_asr: bool = False)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Text instance.

Return type:

Speech2Text
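
As a usage sketch (the model tag below is a placeholder for any espnet_model_zoo tag; extra keyword arguments are forwarded to the Speech2Text constructor):

>>> import soundfile
>>> speech2text = Speech2Text.from_pretrained(
>>>     "your_espnet_model_zoo_tag",  # placeholder tag
>>>     beam_size=10,                 # forwarded to Speech2Text(**kwargs)
>>> )
>>> audio, rate = soundfile.read("speech.wav")
>>> text, tokens, token_ints, hyp = speech2text(audio)[0]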

espnet2.bin.asr_inference.get_parser()[source]
espnet2.bin.asr_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: Optional[str], asr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, enh_s2t_task: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str, hugging_face_decoder: bool, hugging_face_decoder_max_length: int, time_sync: bool, multi_asr: bool)[source]
espnet2.bin.asr_inference.main(cmd=None)[source]

espnet2.bin.spk_train

espnet2.bin.spk_train.get_parser()[source]
espnet2.bin.spk_train.main(cmd=None)[source]
Speaker embedding extractor training.

The trained model can be used for speaker verification, open-set speaker identification, and as an embedding extractor for various other tasks, including speaker diarization.

Example

% python spk_train.py --print_config --optim adadelta > conf/train_spk.yaml

% python spk_train.py --config conf/train_spk.yaml

espnet2.bin.enh_scoring

espnet2.bin.enh_scoring.get_parser()[source]
espnet2.bin.enh_scoring.get_readers(scps: List[str], dtype: str)[source]
espnet2.bin.enh_scoring.main(cmd=None)[source]
espnet2.bin.enh_scoring.read_audio(reader, key, audio_format='sound')[source]
espnet2.bin.enh_scoring.scoring(output_dir: str, dtype: str, log_level: Union[int, str], key_file: str, ref_scp: List[str], inf_scp: List[str], ref_channel: int, flexible_numspk: bool, is_tse: bool, use_dnsmos: bool, dnsmos_args: Dict, use_pesq: bool)[source]

espnet2.bin.pack

class espnet2.bin.pack.ASRPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['asr_model_file', 'lm_file']
yaml_files = ['asr_train_config', 'lm_train_config']
class espnet2.bin.pack.DiarPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.EnhPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.EnhS2TPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['enh_s2t_model_file', 'lm_file']
yaml_files = ['enh_s2t_train_config', 'lm_train_config']
class espnet2.bin.pack.PackedContents[source]

Bases: object

files = []
yaml_files = []
class espnet2.bin.pack.SSLPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.STPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['st_model_file']
yaml_files = ['st_train_config']
class espnet2.bin.pack.SVSPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.TTSPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
espnet2.bin.pack.add_arguments(parser: argparse.ArgumentParser, contents: Type[espnet2.bin.pack.PackedContents])[source]
espnet2.bin.pack.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.pack.main(cmd=None)[source]
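
The *PackedContents classes above simply declare which model files and which yaml configuration files are bundled when packing. A minimal sketch of a hypothetical subclass following the same pattern (the class and entry names are illustrative, not part of espnet2.bin.pack):

>>> from espnet2.bin.pack import PackedContents
>>> class MyTaskPackedContents(PackedContents):  # hypothetical example
>>>     files = ["model_file"]
>>>     yaml_files = ["train_config"]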

espnet2.bin.asr_inference_streaming

class espnet2.bin.asr_inference_streaming.Speech2TextStreaming(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]

Bases: object

Speech2TextStreaming class

Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)

Examples

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
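
For chunk-by-chunk decoding, a minimal sketch (the chunk size is illustrative, and passing is_final on the call is an assumption based on the streaming inference() helper below):

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> chunk = int(rate * 0.64)  # illustrative chunk length in samples
>>> for i in range(0, len(audio), chunk):
>>>     results = speech2text(audio[i:i + chunk], is_final=(i + chunk >= len(audio)))
>>> results
[(text, token, token_int, hypothesis object), ...]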
apply_frontend(speech: torch.Tensor, prev_states=None, is_final: bool = False)[source]
assemble_hyps(hyps)[source]
reset()[source]
espnet2.bin.asr_inference_streaming.get_parser()[source]
espnet2.bin.asr_inference_streaming.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]
espnet2.bin.asr_inference_streaming.main(cmd=None)[source]

espnet2.bin.gan_svs_train

espnet2.bin.gan_svs_train.get_parser()[source]
espnet2.bin.gan_svs_train.main(cmd=None)[source]

GAN-based SVS training

Example

% python gan_svs_train.py --print_config --optim1 adadelta

% python gan_svs_train.py --config conf/train.yaml

espnet2.bin.hubert_train

espnet2.bin.hubert_train.get_parser()[source]
espnet2.bin.hubert_train.main(cmd=None)[source]

Hubert pretraining.

Example

% python hubert_train.py asr --print_config --optim adadelta > conf/hubert_asr.yaml

% python hubert_train.py --config conf/train_asr.yaml

espnet2.bin.enh_inference_streaming

class espnet2.bin.enh_inference_streaming.SeparateSpeechStreaming(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, ref_channel: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]

Bases: object

SeparateSpeechStreaming class. Separates small audio chunks in streaming mode.

Examples

>>> import soundfile
>>> import torch
>>> separate_speech = SeparateSpeechStreaming("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> lengths = torch.LongTensor([audio.shape[-1]])
>>> speech_sim_chunks = separate_speech.frame(audio)
>>> output_chunks = [[] for ii in range(separate_speech.num_spk)]
>>>
>>> for chunk in speech_sim_chunks:
>>>     output = separate_speech(chunk)
>>>     for spk in range(separate_speech.num_spk):
>>>         output_chunks[spk].append(output[spk])
>>>
>>> separate_speech.reset()
>>> waves = [
>>>     separate_speech.merge(chunks, lengths)
>>>     for chunks in output_chunks ]
frame(audio)[source]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build SeparateSpeechStreaming instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

SeparateSpeechStreaming instance.

Return type:

SeparateSpeechStreaming

merge(chunks, ilens=None)[source]
reset()[source]
espnet2.bin.enh_inference_streaming.get_parser()[source]
espnet2.bin.enh_inference_streaming.humanfriendly_or_none(value: str)[source]
espnet2.bin.enh_inference_streaming.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, ref_channel: Optional[int], enh_s2t_task: bool)[source]
espnet2.bin.enh_inference_streaming.main(cmd=None)[source]

espnet2.bin.st_train

espnet2.bin.st_train.get_parser()[source]
espnet2.bin.st_train.main(cmd=None)[source]

ST training.

Example

% python st_train.py st --print_config --optim adadelta > conf/train_st.yaml

% python st_train.py --config conf/train_st.yaml

espnet2.bin.tts_inference

Script to run the inference of a text-to-speech model.

class espnet2.bin.tts_inference.Text2Speech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_file: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]

Bases: object

Text2Speech class.

Examples

>>> from espnet2.bin.tts_inference import Text2Speech
>>> # Case 1: Load the local model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>> )
>>> # Case 2: Load the local model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>>     vocoder_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 3: Load the pretrained model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 4: Load the pretrained model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>>     vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
>>> )
>>> # Run inference and save as wav file
>>> import soundfile as sf
>>> wav = text2speech("Hello, World")["wav"]
>>> sf.write("out.wav", wav.numpy(), text2speech.fs, "PCM_16")

Initialize Text2Speech module.

static from_pretrained(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]

Build Text2Speech instance from the pretrained model.

Parameters:
  • model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

  • vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.

Returns:

Text2Speech instance.

Return type:

Text2Speech

property fs

Return sampling rate.

property use_lids

Return whether lid is needed in the inference.

property use_sids

Return whether sid is needed in the inference.

property use_speech

Return whether speech is needed in the inference.

property use_spembs

Return whether spemb is needed in the inference.
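
These flags can be used to assemble the optional inputs of a multi-speaker model before calling the instance; a minimal sketch reusing the text2speech instance from the examples above (the keyword names sids/spembs and the embedding size are assumptions, shown as placeholders):

>>> import numpy as np
>>> kwargs = {}
>>> if text2speech.use_sids:
>>>     kwargs["sids"] = np.array([0])  # illustrative speaker id
>>> if text2speech.use_spembs:
>>>     kwargs["spembs"] = np.zeros(192, dtype=np.float32)  # placeholder embedding
>>> wav = text2speech("Hello, World", **kwargs)["wav"]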

espnet2.bin.tts_inference.get_parser()[source]

Get argument parser.

espnet2.bin.tts_inference.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], threshold: float, minlenratio: float, maxlenratio: float, use_teacher_forcing: bool, use_att_constraint: bool, backward_window: int, forward_window: int, speed_control_alpha: float, noise_scale: float, noise_scale_dur: float, always_fix_seed: bool, allow_variable_data_keys: bool, vocoder_config: Optional[str], vocoder_file: Optional[str], vocoder_tag: Optional[str])[source]

Run text-to-speech inference.

espnet2.bin.tts_inference.main(cmd=None)[source]

Run TTS model inference.

espnet2.bin.asvspoof_train

espnet2.bin.asvspoof_train.get_parser()[source]
espnet2.bin.asvspoof_train.main(cmd=None)[source]

ASVSpoof training.

Example

% python asvspoof_train.py asr --print_config --optim adadelta > conf/train_asvspoof.yaml

% python asvspoof_train.py --config conf/train_asvspoof.yaml

espnet2.bin.diar_inference

class espnet2.bin.diar_inference.DiarizeSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, normalize_output_wav: bool = False, num_spk: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False, multiply_diar_result: bool = False)[source]

Bases: object

DiarizeSpeech class

Examples

>>> import soundfile
>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)
[(spk_id, start, end), (spk_id2, start2, end2)]
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]

Calculate the permutation between separated streams in two adjacent segments.

Parameters:
  • ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • criterion (str) – one of (“si_snr”, “mse”, “corr”)

Returns:

permutation for enh_wavs (Batch, num_spk)

Return type:

perm (torch.Tensor)

decode(encoder_out, encoder_out_lens)[source]
encode(speech, lengths)[source]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build DiarizeSpeech instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

DiarizeSpeech instance.

Return type:

DiarizeSpeech

permute_diar(waves, spk_prediction)[source]
espnet2.bin.diar_inference.get_parser()[source]
espnet2.bin.diar_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, num_spk: Optional[int], normalize_output_wav: bool, multiply_diar_result: bool, enh_s2t_task: bool)[source]
espnet2.bin.diar_inference.main(cmd=None)[source]

espnet2.bin.mt_inference

class espnet2.bin.mt_inference.Text2Text(mt_train_config: Union[pathlib.Path, str] = None, mt_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]

Bases: object

Text2Text class

Examples

>>> text2text = Text2Text("mt_config.yml", "mt.pth")
>>> text2text(src_text)  # src_text: tokenized source-language text (token ID array)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Text2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Text2Text instance.

Return type:

Text2Text

espnet2.bin.mt_inference.get_parser()[source]
espnet2.bin.mt_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], mt_train_config: Optional[str], mt_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]
espnet2.bin.mt_inference.main(cmd=None)[source]

espnet2.bin.split_scps

espnet2.bin.split_scps.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.split_scps.main(cmd=None)[source]
espnet2.bin.split_scps.split_scps(scps: List[str], num_splits: int, names: Optional[List[str]], output_dir: str, log_level: str)[source]
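
A minimal usage sketch of split_scps (the file paths are illustrative; the keyword arguments mirror the signature above):

>>> from espnet2.bin.split_scps import split_scps
>>> split_scps(
>>>     scps=["data/train/wav.scp", "data/train/text"],
>>>     num_splits=4,
>>>     names=None,  # or explicit names, one per scp file
>>>     output_dir="exp/split4",
>>>     log_level="INFO",
>>> )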

espnet2.bin.asr_inference_k2

espnet2.bin.lm_calc_perplexity

espnet2.bin.lm_calc_perplexity.calc_perplexity(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], log_base: Optional[float], allow_variable_data_keys: bool)[source]
espnet2.bin.lm_calc_perplexity.get_parser()[source]
espnet2.bin.lm_calc_perplexity.main(cmd=None)[source]

espnet2.bin.svs_train

espnet2.bin.svs_train.get_parser()[source]
espnet2.bin.svs_train.main(cmd=None)[source]

SVS training

Example

% python svs_train.py svs --print_config --optim adadelta

% python svs_train.py --config conf/train_svs.yaml

espnet2.bin.asr_inference_maskctc

class espnet2.bin.asr_inference_maskctc.Speech2Text(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', maskctc_n_iterations: int = 10, maskctc_threshold_probability: float = 0.99)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Text instance.

Return type:

Speech2Text

espnet2.bin.asr_inference_maskctc.get_parser()[source]
espnet2.bin.asr_inference_maskctc.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, maskctc_n_iterations: int, maskctc_threshold_probability: float)[source]
espnet2.bin.asr_inference_maskctc.main(cmd=None)[source]

espnet2.bin.enh_s2t_train

espnet2.bin.enh_s2t_train.get_parser()[source]
espnet2.bin.enh_s2t_train.main(cmd=None)[source]

EnhS2T training.

Example

% python enh_s2t_train.py enh_s2t --print_config --optim adadelta > conf/train_enh_s2t.yaml

% python enh_s2t_train.py --config conf/train_enh_s2t.yaml

espnet2.bin.whisper_export_vocabulary

espnet2.bin.whisper_export_vocabulary.export_vocabulary(output: str, whisper_model: str, log_level: str)[source]
espnet2.bin.whisper_export_vocabulary.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.whisper_export_vocabulary.main(cmd=None)[source]

espnet2.bin.uasr_inference

class espnet2.bin.uasr_inference.Speech2Text(uasr_train_config: Union[pathlib.Path, str] = None, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, nbest: int = 1, quantize_uasr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]

Bases: object

Speech2Text class for unsupervised ASR

Examples

>>> import soundfile
>>> speech2text = Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis_object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Text instance.

Return type:

Speech2Text

espnet2.bin.uasr_inference.get_parser()[source]
espnet2.bin.uasr_inference.inference(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_uasr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]
espnet2.bin.uasr_inference.main(cmd=None)[source]

espnet2.bin.diar_train

espnet2.bin.diar_train.get_parser()[source]
espnet2.bin.diar_train.main(cmd=None)[source]

Speaker diarization training.

Example

% python diar_train.py diar --print_config --optim adadelta > conf/train_diar.yaml

% python diar_train.py --config conf/train_diar.yaml

espnet2.bin.asr_align

Perform CTC segmentation to align utterances within audio files.

class espnet2.bin.asr_align.CTCSegmentation(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, fs: int = 16000, ngpu: int = 0, batch_size: int = 1, dtype: str = 'float32', kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Bases: object

Align text to audio using CTC segmentation.

Usage:

Initialize with a given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as a function to align text within an audio file.

Example

>>> # example file included in the ESPnet repository
>>> import soundfile
>>> speech, fs = soundfile.read("test_utils/ctc_align_test.wav")
>>> # load an ASR model
>>> from espnet_model_zoo.downloader import ModelDownloader
>>> d = ModelDownloader()
>>> wsjmodel = d.download_and_unpack( "kamo-naoyuki/wsj" )
>>> # Apply CTC segmentation
>>> aligner = CTCSegmentation( **wsjmodel )
>>> text=["utt1 THE SALE OF THE HOTELS", "utt2 ON PROPERTY MANAGEMENT"]
>>> aligner.set_config( gratis_blank=True )
>>> segments = aligner( speech, text, fs=fs )
>>> print( segments )
utt1 utt 0.27 1.72 -0.1663 THE SALE OF THE HOTELS
utt2 utt 4.54 6.10 -4.9646 ON PROPERTY MANAGEMENT
On multiprocessing:

To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
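
A minimal sketch of this three-step split, reusing aligner, speech and text from the example above; combining the alignment result back via task.set(**result) is an assumption consistent with the get_segments() description below:

>>> # (1) expensive ASR inference: obtain CTC log posteriors
>>> lpz = aligner.get_lpz(speech)
>>> # (2) cheap preprocessing: bundle text, lpz and timing config into a picklable task
>>> task = aligner.prepare_segmentation_task(text, lpz, name="utt", speech_len=speech.shape[0])
>>> # (3) static CTC segmentation step, e.g. executed in a worker process
>>> result = CTCSegmentation.get_segments(task)
>>> task.set(**result)
>>> print(task)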

References

CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127

More parameters are described in https://github.com/lumaku/ctc-segmentation

Initialize the CTCSegmentation module.

Parameters:
  • asr_train_config – ASR model config file (yaml).

  • asr_model_file – ASR model file (pth).

  • fs – Sample rate of audio file.

  • ngpu – Number of GPUs. Set 0 for processing on CPU, set to 1 for processing on GPU. Multi-GPU aligning is currently not implemented. Default: 0.

  • batch_size – Currently, only batch size == 1 is implemented.

  • dtype – Data type used for inference. Set dtype according to the ASR model.

  • kaldi_style_text – A kaldi-style text file includes the name of the utterance at the start of each line. If True, the utterance name is expected as the first word of each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.

  • text_converter – How CTC segmentation handles text. “tokenize”: Use ESPnet 2 preprocessing to tokenize the text. “classic”: The text is preprocessed as in ESPnet 1 which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.

  • time_stamps – Method used to calculate the time stamps. Both “fixed” and “auto” use the sample rate; the ratio of samples per encoded frame is either determined automatically for each inference (“auto”) or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio (“fixed”). Recommended for longer audio files: “auto”.

  • **ctc_segmentation_args – Parameters for CTC segmentation.

choices_text_converter = ['tokenize', 'classic']
choices_time_stamps = ['auto', 'fixed']
config = CtcSegmentationParameters( )
estimate_samples_to_frames_ratio(speech_len=215040)[source]

Determine the ratio of encoded frames to sample points.

This method helps to determine the time that a single encoded frame occupies. As the sample rate is already known, only the ratio of samples per encoded CTC frame is needed. This function estimates it by performing a single inference, which is only needed once.

Parameters:

speech_len – Length of randomly generated speech vector for single inference. Default: 215040.

Returns:

Estimated ratio.

Return type:

samples_to_frames_ratio

fs = 16000
get_lpz(speech: Union[torch.Tensor, numpy.ndarray])[source]

Obtain CTC posterior log probabilities for given speech data.

Parameters:

speech – Speech audio input.

Returns:

Numpy vector with CTC log posterior probabilities.

Return type:

lpz

static get_segments(task: espnet2.bin.asr_align.CTCSegmentationTask)[source]

Obtain segments for given utterance texts and CTC log posteriors.

Parameters:

task – CTCSegmentationTask object that contains ground truth and CTC posterior probabilities.

Returns:

Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.

Return type:

result

get_timing_config(speech_len=None, lpz_len=None)[source]

Obtain parameters to determine time stamps.

prepare_segmentation_task(text, lpz, name=None, speech_len=None)[source]

Preprocess text, and gather text and lpz into a task object.

Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed to a multiprocessing computation.

A minimal amount of text processing is done, i.e., splitting the utterances in text into a list and applying text_cleaner. It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalent words, remove special characters, and convert UTF-8 characters to those that correspond to your ASR model dictionary.

The text is tokenized based on the text_converter setting:

The “tokenize” method is more efficient and the easiest for models based on Latin or Cyrillic script that only contain the main chars ([“a”, “b”, …]), or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.

The “classic” method improves the accuracy of the alignments for models that contain longer tokens, but at the cost of higher computational complexity. The function scans for partial tokens, which may improve time resolution. For example, the word “▁really” will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.

Parameters:
  • text – List or multiline-string with utterance ground truths.

  • lpz – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).

  • name – Audio file name. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.

  • speech_len – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.

Returns:

CTCSegmentationTask object that can be passed to get_segments() in order to obtain alignments.

Return type:

task

samples_to_frames_ratio = None
set_config(**kwargs)[source]

Set CTC segmentation parameters.

Parameters for timing:
  • time_stamps – Select the method for estimating the CTC index duration, and thus how the time stamps are calculated.

  • fs – Sample rate.

  • samples_to_frames_ratio – If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to “fixed”. Note: If you want to calculate the time stamps as in ESPnet 1, set this parameter to: subsampling_factor * frame_duration / 1000.

Parameters for text preparation:
  • set_blank – Index of blank in the token list. Default: 0.

  • replace_spaces_with_blanks – Insert blanks between words, which is useful for handling long pauses between words. Only used in the text_converter="classic" preprocessing mode. Default: False.

  • kaldi_style_text – Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.

  • text_converter – How CTC segmentation handles text. Set at module initialization.

Parameters for alignment:
  • min_window_size – Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.

  • max_window_size – Maximum window size. It should not be necessary to change this value.

  • gratis_blank – If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.

Parameters for calculation of confidence score:
  • scoring_length – Block length used to calculate the confidence score. The default value of 30 should be OK in most cases.

text_converter = 'tokenize'
time_stamps = 'auto'
warned_about_misconfiguration = False
class espnet2.bin.asr_align.CTCSegmentationTask(**kwargs)[source]

Bases: object

Task object for CTC segmentation.

When formatted with str(·), this object returns results in kaldi-style segments file format. The human-readable output can be configured with the printing options.

Properties:
  • text – Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).

  • ground_truth_mat – Ground truth matrix (CTC segmentation).

  • utt_begin_indices – Utterance separators for the ground truth matrix.

  • timings – Time marks of the corresponding chars.

  • state_list – Estimated alignment of chars/tokens.

  • segments – Calculated segments as: (start, end, confidence score).

  • config – CTC segmentation configuration object.

  • name – Name of the aligned audio file (Optional). If given, the name is considered when generating the text.

  • utt_ids – The list of utterance names (Optional). This list should have the same length as the number of utterances.

  • lpz – CTC posterior log probabilities (Optional).

Properties for printing:
  • print_confidence_score – Includes the confidence score.

  • print_utterance_text – Includes utterance text.

Initialize the module.

char_probs = None
config = None
done = False
ground_truth_mat = None
lpz = None
name = 'utt'
print_confidence_score = True
print_utterance_text = True
segments = None
set(**kwargs)[source]

Update properties.

Parameters:

**kwargs – Key-value dict that contains all properties with their new values. Unknown properties are ignored.

state_list = None
text = None
timings = None
utt_begin_indices = None
utt_ids = None
espnet2.bin.asr_align.ctc_align(log_level: Union[int, str], asr_train_config: str, asr_model_file: str, audio: pathlib.Path, text: TextIO, output: TextIO, print_utt_text: bool = True, print_utt_score: bool = True, **kwargs)[source]

Provide the scripting interface to align text to audio.

espnet2.bin.asr_align.get_parser()[source]

Obtain an argument-parser for the script interface.

espnet2.bin.asr_align.main(cmd=None)[source]

Parse arguments and start the alignment in ctc_align(·).

espnet2.bin.slu_inference

class espnet2.bin.slu_inference.Speech2Understand(slu_train_config: Union[pathlib.Path, str] = None, slu_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]

Bases: object

Speech2Understand class

Examples

>>> import soundfile
>>> speech2understand = Speech2Understand("slu_config.yml", "slu.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2understand(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Understand instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Understand instance.

Return type:

Speech2Understand

espnet2.bin.slu_inference.get_parser()[source]
espnet2.bin.slu_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], slu_train_config: Optional[str], slu_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]
espnet2.bin.slu_inference.main(cmd=None)[source]

espnet2.bin.st_inference_streaming

class espnet2.bin.st_inference_streaming.Speech2TextStreaming(st_train_config: Union[pathlib.Path, str], st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]

Bases: object

Speech2TextStreaming class

Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)

Examples

>>> import soundfile
>>> speech2text = Speech2TextStreaming("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
apply_frontend(speech: torch.Tensor, prev_states=None, is_final: bool = False)[source]
assemble_hyps(hyps)[source]
reset()[source]
espnet2.bin.st_inference_streaming.get_parser()[source]
espnet2.bin.st_inference_streaming.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: str, st_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]
espnet2.bin.st_inference_streaming.main(cmd=None)[source]

espnet2.bin.enh_tse_train

espnet2.bin.enh_tse_train.get_parser()[source]
espnet2.bin.enh_tse_train.main(cmd=None)[source]

Target Speaker Extraction model training.

Example

% python enh_tse_train.py asr --print_config --optim adadelta > conf/train_enh.yaml

% python enh_tse_train.py --config conf/train_enh.yaml

espnet2.bin.st_inference

class espnet2.bin.st_inference.Speech2Text(st_train_config: Union[pathlib.Path, str] = None, st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, enh_s2t_task: bool = False)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

Speech2Text instance.

Return type:

Speech2Text

espnet2.bin.st_inference.get_parser()[source]
espnet2.bin.st_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: Optional[str], st_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, enh_s2t_task: bool)[source]
espnet2.bin.st_inference.main(cmd=None)[source]

espnet2.bin.lm_train

espnet2.bin.lm_train.get_parser()[source]
espnet2.bin.lm_train.main(cmd=None)[source]

LM training.

Example

% python lm_train.py asr --print_config --optim adadelta

% python lm_train.py --config conf/train_asr.yaml

espnet2.bin.enh_tse_inference

class espnet2.bin.enh_tse_inference.SeparateSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32')[source]

Bases: object

SeparateSpeech class

Examples

>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]

Calculate the permutation between separated streams in two adjacent segments.

Parameters:
  • ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • criterion (str) – one of (“si_snr”, “mse”, “corr”)

Returns:

permutation for enh_wavs (Batch, num_spk)

Return type:

perm (torch.Tensor)

static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build SeparateSpeech instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

SeparateSpeech instance.

Return type:

SeparateSpeech

espnet2.bin.enh_tse_inference.build_model_from_args_and_file(task, args, model_file, device)[source]
espnet2.bin.enh_tse_inference.get_parser()[source]
espnet2.bin.enh_tse_inference.get_train_config(train_config, model_file=None)[source]
espnet2.bin.enh_tse_inference.humanfriendly_or_none(value: str)[source]
espnet2.bin.enh_tse_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool)[source]
espnet2.bin.enh_tse_inference.main(cmd=None)[source]
espnet2.bin.enh_tse_inference.recursive_dict_update(dict_org, dict_patch, verbose=False, log_prefix='')[source]

Update dict_org with dict_patch in-place recursively.
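
A minimal sketch of the in-place, recursive update (the dictionaries and the shown result are illustrative of the stated semantics: nested keys from dict_patch overwrite those in dict_org, untouched keys are preserved):

>>> from espnet2.bin.enh_tse_inference import recursive_dict_update
>>> dict_org = {"encoder_conf": {"layers": 4, "hidden": 256}, "lr": 0.001}
>>> dict_patch = {"encoder_conf": {"hidden": 512}}
>>> recursive_dict_update(dict_org, dict_patch)
>>> dict_org
{'encoder_conf': {'layers': 4, 'hidden': 512}, 'lr': 0.001}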

espnet2.bin.asr_transducer_inference

Inference class definition for Transducer models.

class espnet2.bin.asr_transducer_inference.Speech2Text(asr_train_config: Union[pathlib.Path, str, None] = None, asr_model_file: Union[pathlib.Path, str, None] = None, beam_search_config: Optional[Dict[str, Any]] = None, lm_train_config: Union[pathlib.Path, str, None] = None, lm_file: Union[pathlib.Path, str, None] = None, token_type: Optional[str] = None, bpemodel: Optional[str] = None, device: str = 'cpu', beam_size: int = 5, dtype: str = 'float32', lm_weight: float = 1.0, quantize_asr_model: bool = False, quantize_modules: Optional[List[str]] = None, quantize_dtype: str = 'qint8', nbest: int = 1, streaming: bool = False, decoding_window: int = 640, left_context: int = 32)[source]

Bases: object

Speech2Text class for Transducer models.

Parameters:
  • asr_train_config – ASR model training config path.

  • asr_model_file – ASR model path.

  • beam_search_config – Beam search config path.

  • lm_train_config – Language Model training config path.

  • lm_file – Language Model path.

  • token_type – Type of token units.

  • bpemodel – BPE model path.

  • device – Device to use for inference.

  • beam_size – Size of beam during search.

  • dtype – Data type.

  • lm_weight – Language model weight.

  • quantize_asr_model – Whether to apply dynamic quantization to ASR model.

  • quantize_modules – List of module names to apply dynamic quantization on.

  • quantize_dtype – Dynamic quantization data type.

  • nbest – Number of final hypotheses.

  • streaming – Whether to perform chunk-by-chunk inference.

  • decoding_window – Size of the decoding window (in milliseconds).

  • left_context – Number of previous frames the attention module can see in the current chunk (used by Conformer and Branchformer blocks).

Construct a Speech2Text object.

static from_pretrained(model_tag: Optional[str] = None, **kwargs) → espnet2.bin.asr_transducer_inference.Speech2Text[source]

Build Speech2Text instance from the pretrained model.

Parameters:

model_tag – Model tag of the pretrained models.

Returns:

Speech2Text instance.

hypotheses_to_results(nbest_hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[Any][source]

Build partial or final results from the hypotheses.

Parameters:

nbest_hyps – N-best hypothesis.

Returns:

Results containing different representation for the hypothesis.

Return type:

results

reset_streaming_cache() → None[source]

Reset Speech2Text parameters.

streaming_decode(speech: Union[torch.Tensor, numpy.ndarray], is_final: bool = False) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Speech2Text streaming call.

Parameters:
  • speech – Chunk of speech data. (S)

  • is_final – Whether speech corresponds to the final chunk of data.

Returns:

N-best hypothesis.

Return type:

nbest_hypothesis
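
For streaming use, the methods above can be combined as in the following sketch (the chunk size, file paths, and the mapping from decoding_window milliseconds to samples are illustrative assumptions):

>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth", streaming=True, decoding_window=640)
>>> audio, rate = soundfile.read("speech.wav")
>>> chunk = int(rate * 640 / 1000)  # samples per decoding window (assumption)
>>> for i in range(0, len(audio), chunk):
>>>     hyps = speech2text.streaming_decode(audio[i:i + chunk], is_final=(i + chunk >= len(audio)))
>>> results = speech2text.hypotheses_to_results(hyps)
>>> speech2text.reset_streaming_cache()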

espnet2.bin.asr_transducer_inference.get_parser()[source]

Get Transducer model inference parser.

espnet2.bin.asr_transducer_inference.inference(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], asr_train_config: Optional[str], asr_model_file: Optional[str], beam_search_config: Optional[dict], lm_train_config: Optional[str], lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], key_file: Optional[str], allow_variable_data_keys: bool, quantize_asr_model: Optional[bool], quantize_modules: Optional[List[str]], quantize_dtype: Optional[str], streaming: bool, decoding_window: int, left_context: int, display_hypotheses: bool) → None[source]

Transducer model inference.

Parameters:
  • output_dir – Output directory path.

  • batch_size – Batch decoding size.

  • dtype – Data type.

  • beam_size – Beam size.

  • ngpu – Number of GPUs.

  • seed – Random number generator seed.

  • lm_weight – Weight of language model.

  • nbest – Number of final hypotheses.

  • num_workers – Number of workers.

  • log_level – Verbosity level for logs.

  • data_path_and_name_and_type

  • asr_train_config – ASR model training config path.

  • asr_model_file – ASR model path.

  • beam_search_config – Beam search config path.

  • lm_train_config – Language Model training config path.

  • lm_file – Language Model path.

  • model_tag – Model tag.

  • token_type – Type of token units.

  • bpemodel – BPE model path.

  • key_file – Key file path.

  • allow_variable_data_keys – Whether to allow variable data keys.

  • quantize_asr_model – Whether to apply dynamic quantization to ASR model.

  • quantize_modules – List of module names to apply dynamic quantization on.

  • quantize_dtype – Dynamic quantization data type.

  • streaming – Whether to perform chunk-by-chunk inference.

  • decoding_window – Audio length (in milliseconds) to process during decoding.

  • left_context – Number of previous frames the attention module can see in the current chunk (used by Conformer and Branchformer blocks).

  • display_hypotheses – Whether to display (partial and full) hypotheses.

espnet2.bin.asr_transducer_inference.main(cmd=None)[source]
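The inference entry point is normally driven through main. The sketch below passes the options as a cmd list, assuming (as is usual for espnet2 bin scripts) that the flag names mirror the parameter names documented above; all paths are placeholders.

>>> from espnet2.bin.asr_transducer_inference import main
>>> main(cmd=[
...     "--output_dir", "exp/decode",                                    # placeholder paths
...     "--asr_train_config", "exp/asr_train/config.yaml",
...     "--asr_model_file", "exp/asr_train/valid.loss.best.pth",
...     "--data_path_and_name_and_type", "dump/raw/test/wav.scp,speech,sound",
...     "--ngpu", "0",
... ])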

espnet2.bin.__init__

espnet2.bin.hugging_face_export_vocabulary

espnet2.bin.hugging_face_export_vocabulary.export_vocabulary(output: str, model_name_or_path: str, log_level: str, add_symbol: List[str])[source]
espnet2.bin.hugging_face_export_vocabulary.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.hugging_face_export_vocabulary.main(cmd=None)[source]

espnet2.bin.gan_tts_train

espnet2.bin.gan_tts_train.get_parser()[source]
espnet2.bin.gan_tts_train.main(cmd=None)[source]

GAN-based TTS training

Example

% python gan_tts_train.py --print_config --optim1 adadelta
% python gan_tts_train.py --config conf/train.yaml

espnet2.bin.uasr_extract_feature

espnet2.bin.uasr_extract_feature.extract_feature(uasr_train_config: Optional[str], uasr_model_file: Optional[str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], batch_size: int, dtype: str, num_workers: int, allow_variable_data_keys: bool, ngpu: int, output_dir: str, dset: str, log_level: Union[int, str])[source]
espnet2.bin.uasr_extract_feature.get_parser()[source]
espnet2.bin.uasr_extract_feature.main(cmd=None)[source]

espnet2.bin.svs_inference

Script to run inference of the singing-voice-synthesis model.

class espnet2.bin.svs_inference.SingingGenerate(train_config: Union[pathlib.Path, str, None], model_file: Union[pathlib.Path, str, None] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 2, forward_window: int = 4, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_checkpoint: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]

Bases: object

SingingGenerate class

Examples

>>> import soundfile
>>> svs = SingingGenerate("config.yml", "model.pth")
>>> wav = svs("Hello World")[0]
>>> soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")

Initialize SingingGenerate module.

static from_pretrained(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]

Build SingingGenerate instance from the pretrained model.

Parameters:
  • model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

  • vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.

Returns:

SingingGenerate instance.

Return type:

SingingGenerate
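A hedged usage sketch; both tags are placeholders, and the vocoder tag follows the documented “parallel_wavegan/” prefix convention:

>>> svs = SingingGenerate.from_pretrained(
...     model_tag="<espnet_model_zoo_tag>",            # placeholder tag
...     vocoder_tag="parallel_wavegan/<vocoder_tag>",  # placeholder tag
... )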

property fs

Return sampling rate.

property use_lids

Return whether lid is needed in the inference.

property use_sids

Return whether sid is needed in the inference.

property use_speech

Return whether speech is needed in the inference.

property use_spembs

Return whether spemb is needed in the inference.

espnet2.bin.svs_inference.get_parser()[source]

Get argument parser.

espnet2.bin.svs_inference.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], use_teacher_forcing: bool, noise_scale: float, noise_scale_dur: float, allow_variable_data_keys: bool, vocoder_config: Optional[str] = None, vocoder_checkpoint: Optional[str] = None, vocoder_tag: Optional[str] = None)[source]

Perform SVS model decoding.

espnet2.bin.svs_inference.main(cmd=None)[source]

Run SVS model decoding.

espnet2.bin.enh_inference

class espnet2.bin.enh_inference.SeparateSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]

Bases: object

SeparateSpeech class

Examples

>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]

Calculate the permutation between separated streams in two adjacent segments.

Parameters:
  • ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • criterion (str) – one of (“si_snr”, “mse”, “corr”)

Returns:

permutation for enh_wavs (Batch, num_spk)

Return type:

perm (torch.Tensor)
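A shape-only sketch of the call, using random tensors in place of real separated streams; the per-stream list layout is inferred from the documented [(Batch, Nsamples)] annotation and the paths are placeholders:

>>> import torch
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")  # placeholder paths
>>> ref_wavs = [torch.randn(1, 16000), torch.randn(1, 16000)]      # one (Batch, Nsamples) tensor per stream
>>> enh_wavs = [torch.randn(1, 16000), torch.randn(1, 16000)]
>>> perm = separate_speech.cal_permumation(ref_wavs, enh_wavs, criterion="si_snr")
>>> perm  # expected shape: (Batch, num_spk), per the documentation above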

static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build SeparateSpeech instance from the pretrained model.

Parameters:

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns:

SeparateSpeech instance.

Return type:

SeparateSpeech
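A hedged usage sketch with a placeholder model tag, mirroring the class example above:

>>> import soundfile
>>> separate_speech = SeparateSpeech.from_pretrained(
...     model_tag="<espnet_model_zoo_tag>",  # placeholder tag
... )
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]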

espnet2.bin.enh_inference.build_model_from_args_and_file(task, args, model_file, device)[source]
espnet2.bin.enh_inference.get_parser()[source]
espnet2.bin.enh_inference.get_train_config(train_config, model_file=None)[source]
espnet2.bin.enh_inference.humanfriendly_or_none(value: str)[source]
espnet2.bin.enh_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool, enh_s2t_task: bool)[source]
espnet2.bin.enh_inference.main(cmd=None)[source]
espnet2.bin.enh_inference.recursive_dict_update(dict_org, dict_patch, verbose=False, log_prefix='')[source]

Update dict_org with dict_patch in-place recursively.
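An illustrative sketch of the in-place update; the exact merge semantics (nested keys merged rather than replaced) are an assumption drawn from the word “recursively”:

>>> from espnet2.bin.enh_inference import recursive_dict_update
>>> base = {"encoder": {"dim": 256, "layers": 4}, "lr": 0.001}
>>> patch = {"encoder": {"dim": 512}}
>>> recursive_dict_update(base, patch)
>>> base  # assumed result: nested keys merged in place
{'encoder': {'dim': 512, 'layers': 4}, 'lr': 0.001}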

espnet2.bin.launch

espnet2.bin.launch.get_parser()[source]
espnet2.bin.launch.main(cmd=None)[source]

espnet2.bin.enh_train

espnet2.bin.enh_train.get_parser()[source]
espnet2.bin.enh_train.main(cmd=None)[source]

Enhancement frontend training.

Example

% python enh_train.py enh --print_config --optim adadelta > conf/train_enh.yaml
% python enh_train.py --config conf/train_enh.yaml

espnet2.bin.aggregate_stats_dirs

espnet2.bin.aggregate_stats_dirs.aggregate_stats_dirs(input_dir: Iterable[Union[str, pathlib.Path]], output_dir: Union[str, pathlib.Path], log_level: str, skip_sum_stats: bool)[source]
espnet2.bin.aggregate_stats_dirs.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.aggregate_stats_dirs.main(cmd=None)[source]

espnet2.bin.tokenize_text

espnet2.bin.tokenize_text.field2slice(field: Optional[str]) → slice[source]

Convert field string to slice.

Note that the field string accepts 1-based integers.

Examples

>>> field2slice("1-")
slice(0, None, None)
>>> field2slice("1-3")
slice(0, 3, None)
>>> field2slice("-3")
slice(None, 3, None)
espnet2.bin.tokenize_text.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.tokenize_text.main(cmd=None)[source]
espnet2.bin.tokenize_text.tokenize(input: str, output: str, field: Optional[str], delimiter: Optional[str], token_type: str, space_symbol: str, non_linguistic_symbols: Optional[str], bpemodel: Optional[str], log_level: str, write_vocabulary: bool, vocabulary_size: int, remove_non_linguistic_symbols: bool, cutoff: int, add_symbol: List[str], cleaner: Optional[str], g2p: Optional[str], add_nonsplit_symbol: List[str])[source]