espnet2.bin package¶
espnet2.bin.asvspoof_inference¶
-
class
espnet2.bin.asvspoof_inference.
SpeechAntiSpoof
(asvspoof_train_config: Union[pathlib.Path, str] = None, asvspoof_model_file: Union[pathlib.Path, str] = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32')[source]¶ Bases:
object
SpeechAntiSpoof class
Examples
>>> import soundfile
>>> speech_anti_spoof = SpeechAntiSpoof("asvspoof_config.yml", "asvspoof.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech_anti_spoof(audio)
prediction_result (int)
-
espnet2.bin.asvspoof_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asvspoof_train_config: Optional[str], asvspoof_model_file: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.uasr_inference_k2¶
-
espnet2.bin.uasr_inference_k2.
indices_to_split_size
(indices, total_elements: int = None)[source]¶ Convert indices to split sizes.
During decoding, the torch.tensor_split API should be used. However, torch.tensor_split is only available in PyTorch >= 1.8.0, so torch.split is used instead to keep CI passing with PyTorch < 1.8.0. This function prepares the split-size input for torch.split.
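A minimal sketch of the conversion (not the ESPnet implementation), assuming indices is a sorted list of cut points as accepted by torch.tensor_split:
>>> import torch
>>> def indices_to_split_size(indices, total_elements=None):
>>>     # e.g. indices [2, 5] over 8 elements -> split sizes [2, 3, 3]
>>>     sizes = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
>>>     if total_elements is not None:
>>>         sizes.append(total_elements - indices[-1])
>>>     return sizes
>>> x = torch.arange(8)
>>> [t.tolist() for t in torch.split(x, indices_to_split_size([2, 5], total_elements=8))]
[[0, 1], [2, 3, 4], [5, 6, 7]]
>>> [t.tolist() for t in torch.tensor_split(x, [2, 5])]  # same result on pytorch >= 1.8.0
[[0, 1], [2, 3, 4], [5, 6, 7]]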
-
espnet2.bin.uasr_inference_k2.
inference
(output_dir: str, decoding_graph: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], word_token_list: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, streaming: bool, is_ctc_decoding: bool, use_nbest_rescoring: bool, num_paths: int, nbest_batch_size: int, nll_batch_size: int, k2_config: Optional[str])[source]¶
-
class
espnet2.bin.uasr_inference_k2.
k2Speech2Text
(uasr_train_config: Union[pathlib.Path, str], decoding_graph: str, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 8, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, search_beam_size: int = 20, output_beam_size: int = 20, min_active_states: int = 14000, max_active_states: int = 56000, blank_bias: float = 0.0, lattice_weight: float = 1.0, is_ctc_decoding: bool = True, lang_dir: Optional[str] = None, token_list_file: Optional[str] = None, use_fgram_rescoring: bool = False, use_nbest_rescoring: bool = False, am_weight: float = 0.5, decoder_weight: float = 0.5, nnlm_weight: float = 1.0, num_paths: int = 1000, nbest_batch_size: int = 500, nll_batch_size: int = 100)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import numpy as np
>>> import soundfile
>>> speech2text = k2Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech = np.expand_dims(audio, 0)  # shape: [batch_size, speech_length]
>>> speech_lengths = np.array([audio.shape[0]])  # shape: [batch_size]
>>> batch = {"speech": speech, "speech_lengths": speech_lengths}
>>> speech2text(batch)
[(text, token, token_int, score), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build k2Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
espnet2.bin.asr_transducer_train¶
espnet2.bin.uasr_train¶
espnet2.bin.asr_train¶
espnet2.bin.tts_train¶
espnet2.bin.slu_train¶
espnet2.bin.mt_train¶
espnet2.bin.lm_inference¶
-
class
espnet2.bin.lm_inference.
GenerateText
(lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlen: int = 100, minlen: int = 0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ngram_weight: float = 0.0, penalty: float = 0.0, nbest: int = 1, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
GenerateText class
Examples
>>> generatetext = GenerateText(
>>>     lm_train_config="lm_config.yaml",
>>>     lm_file="lm.pth",
>>>     token_type="bpe",
>>>     bpemodel="bpe.model",
>>> )
>>> prompt = "I have travelled to many "
>>> generatetext(prompt)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build GenerateText instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
GenerateText instance.
- Return type:
-
espnet2.bin.lm_inference.
inference
(output_dir: str, maxlen: int, minlen: int, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.asr_inference¶
-
class
espnet2.bin.asr_inference.
Speech2Text
(asr_train_config: Union[pathlib.Path, str] = None, asr_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, enh_s2t_task: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8', hugging_face_decoder: bool = False, hugging_face_decoder_max_length: int = 256, time_sync: bool = False, multi_asr: bool = False)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.asr_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: Optional[str], asr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, enh_s2t_task: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str, hugging_face_decoder: bool, hugging_face_decoder_max_length: int, time_sync: bool, multi_asr: bool)[source]¶
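For models published on the ESPnet model zoo, the recognizer can also be built via Speech2Text.from_pretrained. A hedged sketch (the model tag below is a placeholder, and the extra keyword arguments are simply forwarded to the constructor):
>>> import soundfile
>>> speech2text = Speech2Text.from_pretrained("espnet/<model_tag>", beam_size=10)
>>> audio, rate = soundfile.read("speech.wav")
>>> text, tokens, token_ints, hyp = speech2text(audio)[0]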
espnet2.bin.spk_train¶
-
espnet2.bin.spk_train.
main
(cmd=None)[source]¶ Speaker embedding extractor training.
The trained model can be used for speaker verification, open-set speaker identification, and also as an embedding extractor for various other tasks, including speaker diarization.
Example
% python spk_train.py --print_config --optim adadelta > conf/train_spk.yaml
% python spk_train.py --config conf/train_spk.yaml
espnet2.bin.enh_scoring¶
espnet2.bin.pack¶
-
class
espnet2.bin.pack.
ASRPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['asr_model_file', 'lm_file']¶
-
yaml_files
= ['asr_train_config', 'lm_train_config']¶
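The *PackedContents classes enumerate which model files and yaml configs the packing utility bundles into a single archive. A hedged command-line sketch for the ASR case (flag names are assumed to mirror the attribute names above, and the paths are placeholders):
% python -m espnet2.bin.pack asr \
    --asr_train_config exp/asr_train/config.yaml \
    --asr_model_file exp/asr_train/valid.acc.ave.pth \
    --outpath asr_packed.zip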
-
class
espnet2.bin.pack.
DiarPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
EnhPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
EnhS2TPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['enh_s2t_model_file', 'lm_file']¶
-
yaml_files
= ['enh_s2t_train_config', 'lm_train_config']¶
-
class
espnet2.bin.pack.
SSLPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
STPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['st_model_file']¶
-
yaml_files
= ['st_train_config']¶
-
class
espnet2.bin.pack.
SVSPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
TTSPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
espnet2.bin.asr_inference_streaming¶
-
class
espnet2.bin.asr_inference_streaming.
Speech2TextStreaming
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]¶ Bases:
object
Speech2TextStreaming class
Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)
Examples
>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.asr_inference_streaming.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]¶
espnet2.bin.gan_svs_train¶
espnet2.bin.hubert_train¶
espnet2.bin.enh_inference_streaming¶
-
class
espnet2.bin.enh_inference_streaming.
SeparateSpeechStreaming
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, ref_channel: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]¶ Bases:
object
SeparateSpeechStreaming class. Separates speech chunk by chunk in streaming mode.
Examples
>>> import soundfile
>>> import torch
>>> separate_speech = SeparateSpeechStreaming("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> lengths = torch.LongTensor([audio.shape[-1]])
>>> speech_sim_chunks = separate_speech.frame(audio)
>>> output_chunks = [[] for ii in range(separate_speech.num_spk)]
>>> for chunk in speech_sim_chunks:
>>>     output = separate_speech(chunk)
>>>     for spk in range(separate_speech.num_spk):
>>>         output_chunks[spk].append(output[spk])
>>> separate_speech.reset()
>>> waves = [
>>>     separate_speech.merge(chunks, lengths)
>>>     for chunks in output_chunks
>>> ]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_inference_streaming.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, ref_channel: Optional[int], enh_s2t_task: bool)[source]¶
espnet2.bin.st_train¶
espnet2.bin.tts_inference¶
Script to run inference of a text-to-speech model.
-
class
espnet2.bin.tts_inference.
Text2Speech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_file: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]¶ Bases:
object
Text2Speech class.
Examples
>>> from espnet2.bin.tts_inference import Text2Speech
>>> # Case 1: Load the local model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>> )
>>> # Case 2: Load the local model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>>     vocoder_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 3: Load the pretrained model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 4: Load the pretrained model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>>     vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
>>> )
>>> # Run inference and save as wav file
>>> import soundfile as sf
>>> wav = text2speech("Hello, World")["wav"]
>>> sf.write("out.wav", wav.numpy(), text2speech.fs, "PCM_16")
Initialize Text2Speech module.
-
static
from_pretrained
(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]¶ Build Text2Speech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.
- Returns:
Text2Speech instance.
- Return type:
-
property
fs
¶ Return sampling rate.
-
property
use_lids
¶ Return whether lid (language ID) is needed for inference.
-
property
use_sids
¶ Return whether sid (speaker ID) is needed for inference.
-
property
use_speech
¶ Return whether speech is needed for inference.
-
property
use_spembs
¶ Return whether spembs (speaker embeddings) are needed for inference.
-
espnet2.bin.tts_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], threshold: float, minlenratio: float, maxlenratio: float, use_teacher_forcing: bool, use_att_constraint: bool, backward_window: int, forward_window: int, speed_control_alpha: float, noise_scale: float, noise_scale_dur: float, always_fix_seed: bool, allow_variable_data_keys: bool, vocoder_config: Optional[str], vocoder_file: Optional[str], vocoder_tag: Optional[str])[source]¶ Run text-to-speech inference.
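A hedged command-line sketch of this entry point (the flags are assumed to mirror the keyword arguments above; paths are placeholders, and the data specification follows ESPnet's "path,name,type" convention):
% python -m espnet2.bin.tts_inference \
    --output_dir exp/tts_decode \
    --train_config exp/tts_train/config.yaml \
    --model_file exp/tts_train/train.loss.best.pth \
    --data_path_and_name_and_type dump/raw/eval1/text,text,text \
    --ngpu 0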
espnet2.bin.asvspoof_train¶
espnet2.bin.diar_inference¶
-
class
espnet2.bin.diar_inference.
DiarizeSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, normalize_output_wav: bool = False, num_spk: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False, multiply_diar_result: bool = False)[source]¶ Bases:
object
DiarizeSpeech class
Examples
>>> import soundfile
>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)
[(spk_id, start, end), (spk_id2, start2, end2)]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
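As an illustration of what this alignment does, a conceptual sketch (not ESPnet's implementation) that picks the speaker permutation maximizing a similarity criterion between the reference and enhanced streams; per-sample batch handling and the si_snr/mse criteria are omitted:
>>> from itertools import permutations
>>> import torch.nn.functional as F
>>> def best_permutation(ref_wavs, enh_wavs):
>>>     # ref_wavs / enh_wavs: lists of (Batch, Nsamples) tensors, one entry per speaker
>>>     best_perm, best_score = None, -float("inf")
>>>     for perm in permutations(range(len(ref_wavs))):
>>>         score = float(sum(
>>>             F.cosine_similarity(ref_wavs[i], enh_wavs[p], dim=-1).mean()
>>>             for i, p in enumerate(perm)
>>>         ))
>>>         if score > best_score:
>>>             best_perm, best_score = perm, score
>>>     return best_perm  # reorder enh_wavs with this permutation before stitching segments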
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build DiarizeSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
DiarizeSpeech instance.
- Return type:
-
espnet2.bin.diar_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, num_spk: Optional[int], normalize_output_wav: bool, multiply_diar_result: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.mt_inference¶
-
class
espnet2.bin.mt_inference.
Text2Text
(mt_train_config: Union[pathlib.Path, str] = None, mt_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]¶ Bases:
object
Text2Text class
Examples
>>> text2text = Text2Text("mt_config.yml", "mt.pth")
>>> text2text(src_text)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.mt_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], mt_train_config: Optional[str], mt_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.split_scps¶
espnet2.bin.asr_inference_k2¶
espnet2.bin.lm_calc_perplexity¶
-
espnet2.bin.lm_calc_perplexity.
calc_perplexity
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], log_base: Optional[float], allow_variable_data_keys: bool)[source]¶
espnet2.bin.svs_train¶
espnet2.bin.asr_inference_maskctc¶
-
class
espnet2.bin.asr_inference_maskctc.
Speech2Text
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', maskctc_n_iterations: int = 10, maskctc_threshold_probability: float = 0.99)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.asr_inference_maskctc.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, maskctc_n_iterations: int, maskctc_threshold_probability: float)[source]¶
espnet2.bin.enh_s2t_train¶
espnet2.bin.whisper_export_vocabulary¶
espnet2.bin.uasr_inference¶
-
class
espnet2.bin.uasr_inference.
Speech2Text
(uasr_train_config: Union[pathlib.Path, str] = None, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, nbest: int = 1, quantize_uasr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
Speech2Text class for unsupervised ASR
Examples
>>> import soundfile
>>> speech2text = Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis_object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.uasr_inference.
inference
(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_uasr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.diar_train¶
espnet2.bin.asr_align¶
Perform CTC segmentation to align utterances within audio files.
-
class
espnet2.bin.asr_align.
CTCSegmentation
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, fs: int = 16000, ngpu: int = 0, batch_size: int = 1, dtype: str = 'float32', kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]¶ Bases:
object
Align text to audio using CTC segmentation.
- Usage:
Initialize with a given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as a function to align text within an audio file.
Example
>>> # example file included in the ESPnet repository
>>> import soundfile
>>> speech, fs = soundfile.read("test_utils/ctc_align_test.wav")
>>> # load an ASR model
>>> from espnet_model_zoo.downloader import ModelDownloader
>>> d = ModelDownloader()
>>> wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
>>> # Apply CTC segmentation
>>> aligner = CTCSegmentation(**wsjmodel)
>>> text = ["utt1 THE SALE OF THE HOTELS", "utt2 ON PROPERTY MANAGEMENT"]
>>> aligner.set_config(gratis_blank=True)
>>> segments = aligner(speech, text, fs=fs)
>>> print(segments)
utt1 utt 0.27 1.72 -0.1663 THE SALE OF THE HOTELS
utt2 utt 4.54 6.10 -4.9646 ON PROPERTY MANAGEMENT
- On multiprocessing:
To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object (see the sketch below).
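A sketch of that three-step split, assuming items is a list of (speech, text, name) tuples prepared by the caller; the Pool wiring is illustrative and not part of the ESPnet API:
>>> from multiprocessing import Pool
>>> aligner = CTCSegmentation(asr_train_config="config.yaml", asr_model_file="asr.pth")
>>> tasks = []
>>> for speech, text, name in items:
>>>     lpz = aligner.get_lpz(speech)  # (1) CTC log posteriors (model inference)
>>>     task = aligner.prepare_segmentation_task(
>>>         text, lpz, name=name, speech_len=speech.shape[0]
>>>     )  # (2) serializable task object
>>>     tasks.append(task)
>>> with Pool(processes=4) as pool:
>>>     results = pool.map(CTCSegmentation.get_segments, tasks)  # (3) parallel alignment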
References
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127
More parameters are described in https://github.com/lumaku/ctc-segmentation
Initialize the CTCSegmentation module.
- Parameters:
asr_train_config – ASR model config file (yaml).
asr_model_file – ASR model file (pth).
fs – Sample rate of audio file.
ngpu – Number of GPUs. Set 0 for processing on CPU, set to 1 for processing on GPU. Multi-GPU aligning is currently not implemented. Default: 0.
batch_size – Currently, only batch size == 1 is implemented.
dtype – Data type used for inference. Set dtype according to the ASR model.
kaldi_style_text – A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as the first word of each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.
text_converter – How CTC segmentation handles text. “tokenize”: Use ESPnet 2 preprocessing to tokenize the text. “classic”: The text is preprocessed as in ESPnet 1 which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.
time_stamps – Choose the method by which the time stamps are calculated. While both "fixed" and "auto" use the sample rate, the ratio of samples to one frame is either automatically determined for each inference ("auto") or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio ("fixed"). Recommended for longer audio files: "auto".
**ctc_segmentation_args – Parameters for CTC segmentation.
-
choices_text_converter
= ['tokenize', 'classic']¶
-
choices_time_stamps
= ['auto', 'fixed']¶
-
config
= CtcSegmentationParameters()¶
-
estimate_samples_to_frames_ratio
(speech_len=215040)[source]¶ Determine the ratio of encoded frames to sample points.
This method helps to determine the time a single encoded frame occupies. As the sample rate is already known, only the ratio of samples per encoded CTC frame is needed. This function estimates it by running one inference, which only needs to be done once.
- Parameters:
speech_len – Length of randomly generated speech vector for single inference. Default: 215040.
- Returns:
Estimated ratio.
- Return type:
samples_to_frames_ratio
-
fs
= 16000¶
-
get_lpz
(speech: Union[torch.Tensor, numpy.ndarray])[source]¶ Obtain CTC posterior log probabilities for given speech data.
- Parameters:
speech – Speech audio input.
- Returns:
Numpy vector with CTC log posterior probabilities.
- Return type:
lpz
-
static
get_segments
(task: espnet2.bin.asr_align.CTCSegmentationTask)[source]¶ Obtain segments for given utterance texts and CTC log posteriors.
- Parameters:
task – CTCSegmentationTask object that contains ground truth and CTC posterior probabilities.
- Returns:
Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.
- Return type:
result
-
get_timing_config
(speech_len=None, lpz_len=None)[source]¶ Obtain parameters to determine time stamps.
-
prepare_segmentation_task
(text, lpz, name=None, speech_len=None)[source]¶ Preprocess text, and gather text and lpz into a task object.
Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed to a multiprocessing computation.
A minimal amount of text processing is done, i.e., splitting the utterances in text into a list and applying text_cleaner. It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.
The text is tokenized based on the text_converter setting:
The "tokenize" method is more efficient and the easiest for models based on Latin or Cyrillic script that only contain the main chars, ["a", "b", ...], or for Japanese or Chinese ASR models with ~3000 short Kanji/Hanzi tokens.
The "classic" method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word "▁really" will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
- Parameters:
text – List or multiline-string with utterance ground truths.
lpz – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).
name – Audio file name. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.
speech_len – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of the lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.
- Returns:
CTCSegmentationTask object that can be passed to get_segments() in order to obtain alignments.
- Return type:
task
-
samples_to_frames_ratio
= None¶
-
set_config
(**kwargs)[source]¶ Set CTC segmentation parameters.
- Parameters for timing:
time_stamps: Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.
fs: Sample rate.
samples_to_frames_ratio: If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to "fixed". Note: if you want to calculate the time stamps as in ESPnet 1, set this parameter to subsampling_factor * frame_duration / 1000.
- Parameters for text preparation:
set_blank: Index of blank in the token list. Default: 0.
replace_spaces_with_blanks: Inserts blanks between words, which is useful for handling long pauses between words. Only used in text_converter="classic" preprocessing mode. Default: False.
kaldi_style_text: Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.
text_converter: How CTC segmentation handles text. Set at module initialization.
- Parameters for alignment:
min_window_size: Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.
max_window_size: Maximum window size. It should not be necessary to change this value.
gratis_blank: If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.
- Parameters for calculation of confidence score:
scoring_length: Block length used to calculate the confidence score. The default value of 30 should be OK in most cases.
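For example, to allow free blank transitions and enlarge the minimum search window, the parameters above can be passed directly (the values are illustrative):
>>> aligner.set_config(gratis_blank=True, min_window_size=12000, scoring_length=30)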
-
text_converter
= 'tokenize'¶
-
time_stamps
= 'auto'¶
-
warned_about_misconfiguration
= False¶
-
class
espnet2.bin.asr_align.
CTCSegmentationTask
(**kwargs)[source]¶ Bases:
object
Task object for CTC segmentation.
When formatted with str(·), this object returns results in a kaldi-style segments file format. The human-readable output can be configured with the printing options.
- Properties:
text: Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).
ground_truth_mat: Ground truth matrix (CTC segmentation).
utt_begin_indices: Utterance separators for the ground truth matrix.
timings: Time marks of the corresponding chars.
state_list: Estimated alignment of chars/tokens.
segments: Calculated segments as (start, end, confidence score).
config: CTC segmentation configuration object.
name: Name of the aligned audio file (Optional). If given, the name is considered when generating the text.
utt_ids: The list of utterance names (Optional). This list should have the same length as the number of utterances.
lpz: CTC posterior log probabilities (Optional).
- Properties for printing:
print_confidence_score: Includes the confidence score.
print_utterance_text: Includes the utterance text.
Initialize the module.
-
char_probs
= None¶
-
config
= None¶
-
done
= False¶
-
ground_truth_mat
= None¶
-
lpz
= None¶
-
name
= 'utt'¶
-
print_confidence_score
= True¶
-
print_utterance_text
= True¶
-
segments
= None¶
-
set
(**kwargs)[source]¶ Update properties.
- Parameters:
**kwargs – Key-value dict that contains all properties with their new values. Unknown properties are ignored.
-
state_list
= None¶
-
text
= None¶
-
timings
= None¶
-
utt_begin_indices
= None¶
-
utt_ids
= None¶
espnet2.bin.slu_inference¶
-
class
espnet2.bin.slu_inference.
Speech2Understand
(slu_train_config: Union[pathlib.Path, str] = None, slu_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
Speech2Understand class
Examples
>>> import soundfile
>>> speech2understand = Speech2Understand("slu_config.yml", "slu.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2understand(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Understand instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Understand instance.
- Return type:
-
espnet2.bin.slu_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], slu_train_config: Optional[str], slu_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.st_inference_streaming¶
-
class
espnet2.bin.st_inference_streaming.
Speech2TextStreaming
(st_train_config: Union[pathlib.Path, str], st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]¶ Bases:
object
Speech2TextStreaming class
Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)
Examples
>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.st_inference_streaming.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: str, st_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]¶
espnet2.bin.enh_tse_train¶
espnet2.bin.st_inference¶
-
class
espnet2.bin.st_inference.
Speech2Text
(st_train_config: Union[pathlib.Path, str] = None, st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, enh_s2t_task: bool = False)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.st_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: Optional[str], st_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.lm_train¶
espnet2.bin.enh_tse_inference¶
-
class
espnet2.bin.enh_tse_inference.
SeparateSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32')[source]¶ Bases:
object
SeparateSpeech class
Examples
>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_tse_inference.
build_model_from_args_and_file
(task, args, model_file, device)[source]¶
-
espnet2.bin.enh_tse_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool)[source]¶
espnet2.bin.asr_transducer_inference¶
Inference class definition for Transducer models.
-
class
espnet2.bin.asr_transducer_inference.
Speech2Text
(asr_train_config: Union[pathlib.Path, str, None] = None, asr_model_file: Union[pathlib.Path, str, None] = None, beam_search_config: Optional[Dict[str, Any]] = None, lm_train_config: Union[pathlib.Path, str, None] = None, lm_file: Union[pathlib.Path, str, None] = None, token_type: Optional[str] = None, bpemodel: Optional[str] = None, device: str = 'cpu', beam_size: int = 5, dtype: str = 'float32', lm_weight: float = 1.0, quantize_asr_model: bool = False, quantize_modules: Optional[List[str]] = None, quantize_dtype: str = 'qint8', nbest: int = 1, streaming: bool = False, decoding_window: int = 640, left_context: int = 32)[source]¶ Bases:
object
Speech2Text class for Transducer models.
- Parameters:
asr_train_config – ASR model training config path.
asr_model_file – ASR model path.
beam_search_config – Beam search config path.
lm_train_config – Language Model training config path.
lm_file – Language Model path.
token_type – Type of token units.
bpemodel – BPE model path.
device – Device to use for inference.
beam_size – Size of beam during search.
dtype – Data type.
lm_weight – Language model weight.
quantize_asr_model – Whether to apply dynamic quantization to ASR model.
quantize_modules – List of module names to apply dynamic quantization on.
quantize_dtype – Dynamic quantization data type.
nbest – Number of final hypotheses.
streaming – Whether to perform chunk-by-chunk inference.
decoding_window – Size of the decoding window (in milliseconds).
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
Construct a Speech2Text object.
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs) → espnet2.bin.asr_transducer_inference.Speech2Text[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag – Model tag of the pretrained models.
- Returns:
Speech2Text instance.
-
hypotheses_to_results
(nbest_hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[Any][source]¶ Build partial or final results from the hypotheses.
- Parameters:
nbest_hyps – N-best hypothesis.
- Returns:
Results containing different representation for the hypothesis.
- Return type:
results
-
streaming_decode
(speech: Union[torch.Tensor, numpy.ndarray], is_final: bool = False) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Speech2Text streaming call.
- Parameters:
speech – Chunk of speech data. (S)
is_final – Whether speech corresponds to the final chunk of data.
- Returns:
N-best hypothesis.
- Return type:
nbest_hypothesis
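A hedged sketch of chunk-by-chunk decoding with streaming_decode, assuming a 1-D 16 kHz waveform audio and a model built with streaming=True; the chunk size simply matches the default 640 ms decoding window and is an assumption:
>>> s2t = Speech2Text(
>>>     asr_train_config="asr_config.yml",
>>>     asr_model_file="asr.pth",
>>>     streaming=True,
>>>     decoding_window=640,
>>> )
>>> chunk = 640 * 16  # samples per 640 ms window at 16 kHz
>>> for start in range(0, len(audio), chunk):
>>>     is_final = start + chunk >= len(audio)
>>>     hyps = s2t.streaming_decode(audio[start:start + chunk], is_final=is_final)
>>> results = s2t.hypotheses_to_results(hyps)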
-
espnet2.bin.asr_transducer_inference.
inference
(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], asr_train_config: Optional[str], asr_model_file: Optional[str], beam_search_config: Optional[dict], lm_train_config: Optional[str], lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], key_file: Optional[str], allow_variable_data_keys: bool, quantize_asr_model: Optional[bool], quantize_modules: Optional[List[str]], quantize_dtype: Optional[str], streaming: bool, decoding_window: int, left_context: int, display_hypotheses: bool) → None[source]¶ Transducer model inference.
- Parameters:
output_dir – Output directory path.
batch_size – Batch decoding size.
dtype – Data type.
beam_size – Beam size.
ngpu – Number of GPUs.
seed – Random number generator seed.
lm_weight – Weight of language model.
nbest – Number of final hypotheses.
num_workers – Number of workers.
log_level – Level of verbose for logs.
data_path_and_name_and_type –
asr_train_config – ASR model training config path.
asr_model_file – ASR model path.
beam_search_config – Beam search config path.
lm_train_config – Language Model training config path.
lm_file – Language Model path.
model_tag – Model tag.
token_type – Type of token units.
bpemodel – BPE model path.
key_file – File key.
allow_variable_data_keys – Whether to allow variable data keys.
quantize_asr_model – Whether to apply dynamic quantization to ASR model.
quantize_modules – List of module names to apply dynamic quantization on.
quantize_dtype – Dynamic quantization data type.
streaming – Whether to perform chunk-by-chunk inference.
decoding_window – Audio length (in milliseconds) to process during decoding.
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
display_hypotheses – Whether to display (partial and full) hypotheses.
espnet2.bin.__init__¶
espnet2.bin.hugging_face_export_vocabulary¶
espnet2.bin.gan_tts_train¶
espnet2.bin.uasr_extract_feature¶
-
espnet2.bin.uasr_extract_feature.
extract_feature
(uasr_train_config: Optional[str], uasr_model_file: Optional[str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], batch_size: int, dtype: str, num_workers: int, allow_variable_data_keys: bool, ngpu: int, output_dir: str, dset: str, log_level: Union[int, str])[source]¶
espnet2.bin.svs_inference¶
Script to run inference of a singing-voice-synthesis model.
-
class
espnet2.bin.svs_inference.
SingingGenerate
(train_config: Union[pathlib.Path, str, None], model_file: Union[pathlib.Path, str, None] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 2, forward_window: int = 4, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_checkpoint: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]¶ Bases:
object
SingingGenerate class
Examples
>>> import soundfile
>>> svs = SingingGenerate("config.yml", "model.pth")
>>> wav = svs("Hello World")[0]
>>> soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")
Initialize SingingGenerate module.
-
static
from_pretrained
(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]¶ Build SingingGenerate instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.
- Returns:
SingingGenerate instance.
- Return type:
-
property
fs
¶ Return sampling rate.
-
property
use_lids
¶ Return whether lid (language ID) is needed for inference.
-
property
use_sids
¶ Return whether sid (speaker ID) is needed for inference.
-
property
use_speech
¶ Return whether speech is needed for inference.
-
property
use_spembs
¶ Return whether spembs (speaker embeddings) are needed for inference.
-
espnet2.bin.svs_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], use_teacher_forcing: bool, noise_scale: float, noise_scale_dur: float, allow_variable_data_keys: bool, vocoder_config: Optional[str] = None, vocoder_checkpoint: Optional[str] = None, vocoder_tag: Optional[str] = None)[source]¶ Perform SVS model decoding.
espnet2.bin.enh_inference¶
-
class
espnet2.bin.enh_inference.
SeparateSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]¶ Bases:
object
SeparateSpeech class
Examples
>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.launch¶
espnet2.bin.enh_train¶
espnet2.bin.aggregate_stats_dirs¶
espnet2.bin.tokenize_text¶
-
espnet2.bin.tokenize_text.
field2slice
(field: Optional[str]) → slice[source]¶ Convert field string to slice.
Note that the field string accepts 1-based integers.
Examples
>>> field2slice("1-")
slice(0, None, None)
>>> field2slice("1-3")
slice(0, 3, None)
>>> field2slice("-3")
slice(None, 3, None)
-
espnet2.bin.tokenize_text.
tokenize
(input: str, output: str, field: Optional[str], delimiter: Optional[str], token_type: str, space_symbol: str, non_linguistic_symbols: Optional[str], bpemodel: Optional[str], log_level: str, write_vocabulary: bool, vocabulary_size: int, remove_non_linguistic_symbols: bool, cutoff: int, add_symbol: List[str], cleaner: Optional[str], g2p: Optional[str], add_nonsplit_symbol: List[str])[source]¶
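A hedged command-line sketch (flag names are assumed to mirror the keyword arguments above; paths are placeholders). Here --field 2- skips the utterance-ID column of a kaldi-style text file, following the 1-based field convention of field2slice:
% python -m espnet2.bin.tokenize_text \
    --input data/train/text --output data/train/tokens.txt \
    --field 2- --token_type char \
    --write_vocabulary true --vocabulary_size 5000 \
    --add_symbol "<blank>:0" --add_symbol "<unk>:1"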