espnet2.bin package¶
espnet2.bin.asvspoof_inference¶
-
class
espnet2.bin.asvspoof_inference.
SpeechAntiSpoof
(asvspoof_train_config: Union[pathlib.Path, str] = None, asvspoof_model_file: Union[pathlib.Path, str] = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32')[source]¶ Bases:
object
SpeechAntiSpoof class
Examples
>>> import soundfile
>>> speech_anti_spoof = SpeechAntiSpoof("asvspoof_config.yml", "asvspoof.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech_anti_spoof(audio)
prediction_result (int)
-
espnet2.bin.asvspoof_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asvspoof_train_config: Optional[str], asvspoof_model_file: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.uasr_inference_k2¶
-
espnet2.bin.uasr_inference_k2.
indices_to_split_size
(indices, total_elements: int = None)[source]¶ Convert indices to split sizes.
During decoding, the torch.tensor_split API should be used. However, torch.tensor_split is only available in PyTorch >= 1.8.0, so torch.split is used instead to keep CI passing with PyTorch < 1.8.0. This function prepares the split-size input for torch.split.
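A minimal sketch of the conversion (not the ESPnet implementation), assuming indices is a sorted list of cut points as accepted by torch.tensor_split:
>>> import torch
>>> def indices_to_split_size(indices, total_elements=None):
>>>     # e.g. indices [2, 5] over 8 elements -> split sizes [2, 3, 3]
>>>     sizes = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
>>>     if total_elements is not None:
>>>         sizes.append(total_elements - indices[-1])
>>>     return sizes
>>> x = torch.arange(8)
>>> [t.tolist() for t in torch.split(x, indices_to_split_size([2, 5], total_elements=8))]
[[0, 1], [2, 3, 4], [5, 6, 7]]
>>> [t.tolist() for t in torch.tensor_split(x, [2, 5])]  # same result on pytorch >= 1.8.0
[[0, 1], [2, 3, 4], [5, 6, 7]]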
-
espnet2.bin.uasr_inference_k2.
inference
(output_dir: str, decoding_graph: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], word_token_list: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, streaming: bool, is_ctc_decoding: bool, use_nbest_rescoring: bool, num_paths: int, nbest_batch_size: int, nll_batch_size: int, k2_config: Optional[str])[source]¶
-
class
espnet2.bin.uasr_inference_k2.
k2Speech2Text
(uasr_train_config: Union[pathlib.Path, str], decoding_graph: str, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 8, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, search_beam_size: int = 20, output_beam_size: int = 20, min_active_states: int = 14000, max_active_states: int = 56000, blank_bias: float = 0.0, lattice_weight: float = 1.0, is_ctc_decoding: bool = True, lang_dir: Optional[str] = None, token_list_file: Optional[str] = None, use_fgram_rescoring: bool = False, use_nbest_rescoring: bool = False, am_weight: float = 0.5, decoder_weight: float = 0.5, nnlm_weight: float = 1.0, num_paths: int = 1000, nbest_batch_size: int = 500, nll_batch_size: int = 100)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import numpy as np
>>> import soundfile
>>> speech2text = k2Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech = np.expand_dims(audio, 0)  # shape: [batch_size, speech_length]
>>> speech_lengths = np.array([audio.shape[0]])  # shape: [batch_size]
>>> batch = {"speech": speech, "speech_lengths": speech_lengths}
>>> speech2text(batch)
[(text, token, token_int, score), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build k2Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
espnet2.bin.asr_transducer_train¶
espnet2.bin.uasr_train¶
espnet2.bin.asr_train¶
espnet2.bin.tts_train¶
espnet2.bin.slu_train¶
espnet2.bin.mt_train¶
espnet2.bin.lm_inference¶
-
class
espnet2.bin.lm_inference.
GenerateText
(lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlen: int = 100, minlen: int = 0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ngram_weight: float = 0.0, penalty: float = 0.0, nbest: int = 1, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
GenerateText class
Examples
>>> generatetext = GenerateText(
>>>     lm_train_config="lm_config.yaml",
>>>     lm_file="lm.pth",
>>>     token_type="bpe",
>>>     bpemodel="bpe.model",
>>> )
>>> prompt = "I have travelled to many "
>>> generatetext(prompt)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build GenerateText instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
GenerateText instance.
- Return type:
-
espnet2.bin.lm_inference.
inference
(output_dir: str, maxlen: int, minlen: int, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.asr_inference¶
-
class
espnet2.bin.asr_inference.
Speech2Text
(asr_train_config: Union[pathlib.Path, str] = None, asr_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, enh_s2t_task: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8', hugging_face_decoder: bool = False, hugging_face_decoder_max_length: int = 256, time_sync: bool = False, multi_asr: bool = False)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.asr_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: Optional[str], asr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, enh_s2t_task: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str, hugging_face_decoder: bool, hugging_face_decoder_max_length: int, time_sync: bool, multi_asr: bool)[source]¶
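For models published on the ESPnet model zoo, the recognizer can also be built via Speech2Text.from_pretrained. A hedged sketch (the model tag below is a placeholder, and the extra keyword arguments are simply forwarded to the constructor):
>>> import soundfile
>>> speech2text = Speech2Text.from_pretrained("espnet/<model_tag>", beam_size=10)
>>> audio, rate = soundfile.read("speech.wav")
>>> text, tokens, token_ints, hyp = speech2text(audio)[0]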
espnet2.bin.spk_train¶
-
espnet2.bin.spk_train.
main
(cmd=None)[source]¶ Speaker embedding extractor training.
The trained model can be used for speaker verification, open-set speaker identification, and also as an embedding extractor for various other tasks, including speaker diarization.
Example
% python spk_train.py --print_config --optim adadelta > conf/train_spk.yaml
% python spk_train.py --config conf/train_spk.yaml
espnet2.bin.enh_scoring¶
espnet2.bin.pack¶
-
class
espnet2.bin.pack.
ASRPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['asr_model_file', 'lm_file']¶
-
yaml_files
= ['asr_train_config', 'lm_train_config']¶
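The *PackedContents classes enumerate which model files and yaml configs the packing utility bundles into a single archive. A hedged command-line sketch for the ASR case (flag names are assumed to mirror the attribute names above, and the paths are placeholders):
% python -m espnet2.bin.pack asr \
    --asr_train_config exp/asr_train/config.yaml \
    --asr_model_file exp/asr_train/valid.acc.ave.pth \
    --outpath asr_packed.zip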
-
class
espnet2.bin.pack.
DiarPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
EnhPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
EnhS2TPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['enh_s2t_model_file', 'lm_file']¶
-
yaml_files
= ['enh_s2t_train_config', 'lm_train_config']¶
-
class
espnet2.bin.pack.
SSLPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
STPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['st_model_file']¶
-
yaml_files
= ['st_train_config']¶
-
class
espnet2.bin.pack.
SVSPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
-
class
espnet2.bin.pack.
TTSPackedContents
[source]¶ Bases:
espnet2.bin.pack.PackedContents
-
files
= ['model_file']¶
-
yaml_files
= ['train_config']¶
espnet2.bin.asr_inference_streaming¶
-
class
espnet2.bin.asr_inference_streaming.
Speech2TextStreaming
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]¶ Bases:
object
Speech2TextStreaming class
Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)
Examples
>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.asr_inference_streaming.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]¶
espnet2.bin.gan_svs_train¶
espnet2.bin.hubert_train¶
espnet2.bin.enh_inference_streaming¶
-
class
espnet2.bin.enh_inference_streaming.
SeparateSpeechStreaming
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, ref_channel: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]¶ Bases:
object
SeparateSpeechStreaming class. Separates speech chunk by chunk in streaming mode.
Examples
>>> import soundfile
>>> import torch
>>> separate_speech = SeparateSpeechStreaming("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> lengths = torch.LongTensor([audio.shape[-1]])
>>> speech_sim_chunks = separate_speech.frame(audio)
>>> output_chunks = [[] for ii in range(separate_speech.num_spk)]
>>> for chunk in speech_sim_chunks:
>>>     output = separate_speech(chunk)
>>>     for spk in range(separate_speech.num_spk):
>>>         output_chunks[spk].append(output[spk])
>>> separate_speech.reset()
>>> waves = [
>>>     separate_speech.merge(chunks, lengths)
>>>     for chunks in output_chunks
>>> ]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_inference_streaming.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, ref_channel: Optional[int], enh_s2t_task: bool)[source]¶
espnet2.bin.st_train¶
espnet2.bin.tts_inference¶
Script to run inference of a text-to-speech model.
-
class
espnet2.bin.tts_inference.
Text2Speech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_file: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]¶ Bases:
object
Text2Speech class.
Examples
>>> from espnet2.bin.tts_inference import Text2Speech
>>> # Case 1: Load the local model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>> )
>>> # Case 2: Load the local model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>>     vocoder_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 3: Load the pretrained model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 4: Load the pretrained model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>>     vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
>>> )
>>> # Run inference and save as wav file
>>> import soundfile as sf
>>> wav = text2speech("Hello, World")["wav"]
>>> sf.write("out.wav", wav.numpy(), text2speech.fs, "PCM_16")
Initialize Text2Speech module.
-
static
from_pretrained
(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]¶ Build Text2Speech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.
- Returns:
Text2Speech instance.
- Return type:
-
property
fs
¶ Return sampling rate.
-
property
use_lids
¶ Return whether lid (language ID) is needed for inference.
-
property
use_sids
¶ Return whether sid (speaker ID) is needed for inference.
-
property
use_speech
¶ Return whether speech is needed for inference.
-
property
use_spembs
¶ Return whether spembs (speaker embeddings) are needed for inference.
-
espnet2.bin.tts_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], threshold: float, minlenratio: float, maxlenratio: float, use_teacher_forcing: bool, use_att_constraint: bool, backward_window: int, forward_window: int, speed_control_alpha: float, noise_scale: float, noise_scale_dur: float, always_fix_seed: bool, allow_variable_data_keys: bool, vocoder_config: Optional[str], vocoder_file: Optional[str], vocoder_tag: Optional[str])[source]¶ Run text-to-speech inference.
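A hedged command-line sketch of this entry point (the flags are assumed to mirror the keyword arguments above; paths are placeholders, and the data specification follows ESPnet's "path,name,type" convention):
% python -m espnet2.bin.tts_inference \
    --output_dir exp/tts_decode \
    --train_config exp/tts_train/config.yaml \
    --model_file exp/tts_train/train.loss.best.pth \
    --data_path_and_name_and_type dump/raw/eval1/text,text,text \
    --ngpu 0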
espnet2.bin.asvspoof_train¶
espnet2.bin.diar_inference¶
-
class
espnet2.bin.diar_inference.
DiarizeSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, normalize_output_wav: bool = False, num_spk: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False, multiply_diar_result: bool = False)[source]¶ Bases:
object
DiarizeSpeech class
Examples
>>> import soundfile
>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)
[(spk_id, start, end), (spk_id2, start2, end2)]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
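As an illustration of what this alignment does, a conceptual sketch (not ESPnet's implementation) that picks the speaker permutation maximizing a similarity criterion between the reference and enhanced streams; per-sample batch handling and the si_snr/mse criteria are omitted:
>>> from itertools import permutations
>>> import torch.nn.functional as F
>>> def best_permutation(ref_wavs, enh_wavs):
>>>     # ref_wavs / enh_wavs: lists of (Batch, Nsamples) tensors, one entry per speaker
>>>     best_perm, best_score = None, -float("inf")
>>>     for perm in permutations(range(len(ref_wavs))):
>>>         score = float(sum(
>>>             F.cosine_similarity(ref_wavs[i], enh_wavs[p], dim=-1).mean()
>>>             for i, p in enumerate(perm)
>>>         ))
>>>         if score > best_score:
>>>             best_perm, best_score = perm, score
>>>     return best_perm  # reorder enh_wavs with this permutation before stitching segments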
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build DiarizeSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
DiarizeSpeech instance.
- Return type:
-
espnet2.bin.diar_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, num_spk: Optional[int], normalize_output_wav: bool, multiply_diar_result: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.mt_inference¶
-
class
espnet2.bin.mt_inference.
Text2Text
(mt_train_config: Union[pathlib.Path, str] = None, mt_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]¶ Bases:
object
Text2Text class
Examples
>>> text2text = Text2Text("mt_config.yml", "mt.pth")
>>> text2text(src_text)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.mt_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], mt_train_config: Optional[str], mt_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.split_scps¶
espnet2.bin.asr_inference_k2¶
espnet2.bin.lm_calc_perplexity¶
-
espnet2.bin.lm_calc_perplexity.
calc_perplexity
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], log_base: Optional[float], allow_variable_data_keys: bool)[source]¶
espnet2.bin.svs_train¶
espnet2.bin.asr_inference_maskctc¶
-
class
espnet2.bin.asr_inference_maskctc.
Speech2Text
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', maskctc_n_iterations: int = 10, maskctc_threshold_probability: float = 0.99)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.asr_inference_maskctc.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, maskctc_n_iterations: int, maskctc_threshold_probability: float)[source]¶
espnet2.bin.enh_s2t_train¶
espnet2.bin.whisper_export_vocabulary¶
espnet2.bin.uasr_inference¶
-
class
espnet2.bin.uasr_inference.
Speech2Text
(uasr_train_config: Union[pathlib.Path, str] = None, uasr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, nbest: int = 1, quantize_uasr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
Speech2Text class for unsupervised ASR
Examples
>>> import soundfile
>>> speech2text = Speech2Text("uasr_config.yml", "uasr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis_object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.uasr_inference.
inference
(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], uasr_train_config: Optional[str], uasr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, quantize_uasr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.diar_train¶
espnet2.bin.asr_align¶
Perform CTC segmentation to align utterances within audio files.
-
class
espnet2.bin.asr_align.
CTCSegmentation
(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, fs: int = 16000, ngpu: int = 0, batch_size: int = 1, dtype: str = 'float32', kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]¶ Bases:
object
Align text to audio using CTC segmentation.
- Usage:
Initialize with a given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as a function to align text within an audio file.
Example
>>> # example file included in the ESPnet repository
>>> import soundfile
>>> speech, fs = soundfile.read("test_utils/ctc_align_test.wav")
>>> # load an ASR model
>>> from espnet_model_zoo.downloader import ModelDownloader
>>> d = ModelDownloader()
>>> wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
>>> # Apply CTC segmentation
>>> aligner = CTCSegmentation(**wsjmodel)
>>> text = ["utt1 THE SALE OF THE HOTELS", "utt2 ON PROPERTY MANAGEMENT"]
>>> aligner.set_config(gratis_blank=True)
>>> segments = aligner(speech, text, fs=fs)
>>> print(segments)
utt1 utt 0.27 1.72 -0.1663 THE SALE OF THE HOTELS
utt2 utt 4.54 6.10 -4.9646 ON PROPERTY MANAGEMENT
- On multiprocessing:
To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object (see the sketch below).
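A sketch of that three-step split, assuming items is a list of (speech, text, name) tuples prepared by the caller; the Pool wiring is illustrative and not part of the ESPnet API:
>>> from multiprocessing import Pool
>>> aligner = CTCSegmentation(asr_train_config="config.yaml", asr_model_file="asr.pth")
>>> tasks = []
>>> for speech, text, name in items:
>>>     lpz = aligner.get_lpz(speech)  # (1) CTC log posteriors (model inference)
>>>     task = aligner.prepare_segmentation_task(
>>>         text, lpz, name=name, speech_len=speech.shape[0]
>>>     )  # (2) serializable task object
>>>     tasks.append(task)
>>> with Pool(processes=4) as pool:
>>>     results = pool.map(CTCSegmentation.get_segments, tasks)  # (3) parallel alignment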
References
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127
More parameters are described in https://github.com/lumaku/ctc-segmentation
Initialize the CTCSegmentation module.
- Parameters:
asr_train_config – ASR model config file (yaml).
asr_model_file – ASR model file (pth).
fs – Sample rate of audio file.
ngpu – Number of GPUs. Set 0 for processing on CPU, set to 1 for processing on GPU. Multi-GPU aligning is currently not implemented. Default: 0.
batch_size – Currently, only batch size == 1 is implemented.
dtype – Data type used for inference. Set dtype according to the ASR model.
kaldi_style_text – A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as the first word of each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.
text_converter – How CTC segmentation handles text. “tokenize”: Use ESPnet 2 preprocessing to tokenize the text. “classic”: The text is preprocessed as in ESPnet 1 which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.
time_stamps – Choose the method by which the time stamps are calculated. While both "fixed" and "auto" use the sample rate, the ratio of samples to one frame is either automatically determined for each inference ("auto") or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio ("fixed"). Recommended for longer audio files: "auto".
**ctc_segmentation_args – Parameters for CTC segmentation.
-
choices_text_converter
= ['tokenize', 'classic']¶
-
choices_time_stamps
= ['auto', 'fixed']¶
-
config
= CtcSegmentationParameters()¶
-
estimate_samples_to_frames_ratio
(speech_len=215040)[source]¶ Determine the ratio of encoded frames to sample points.
This method helps to determine the time a single encoded frame occupies. As the sample rate is already known, only the ratio of samples per encoded CTC frame is needed. This function estimates it by running one inference, which only needs to be done once.
- Parameters:
speech_len – Length of randomly generated speech vector for single inference. Default: 215040.
- Returns:
Estimated ratio.
- Return type:
samples_to_frames_ratio
-
fs
= 16000¶
-
get_lpz
(speech: Union[torch.Tensor, numpy.ndarray])[source]¶ Obtain CTC posterior log probabilities for given speech data.
- Parameters:
speech – Speech audio input.
- Returns:
Numpy vector with CTC log posterior probabilities.
- Return type:
lpz
-
static
get_segments
(task: espnet2.bin.asr_align.CTCSegmentationTask)[source]¶ Obtain segments for given utterance texts and CTC log posteriors.
- Parameters:
task – CTCSegmentationTask object that contains ground truth and CTC posterior probabilities.
- Returns:
Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.
- Return type:
result
-
get_timing_config
(speech_len=None, lpz_len=None)[source]¶ Obtain parameters to determine time stamps.
-
prepare_segmentation_task
(text, lpz, name=None, speech_len=None)[source]¶ Preprocess text, and gather text and lpz into a task object.
Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed to a multiprocessing computation.
A minimal amount of text processing is done, i.e., splitting the utterances in text into a list and applying text_cleaner. It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.
The text is tokenized based on the text_converter setting:
The "tokenize" method is more efficient and the easiest for models based on Latin or Cyrillic script that only contain the main chars, ["a", "b", ...], or for Japanese or Chinese ASR models with ~3000 short Kanji/Hanzi tokens.
The "classic" method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word "▁really" will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
- Parameters:
text – List or multiline-string with utterance ground truths.
lpz – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).
name – Audio file name. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.
speech_len – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of the lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.
- Returns:
CTCSegmentationTask object that can be passed to get_segments() in order to obtain alignments.
- Return type:
task
-
samples_to_frames_ratio
= None¶
-
set_config
(**kwargs)[source]¶ Set CTC segmentation parameters.
- Parameters for timing:
time_stamps: Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.
fs: Sample rate.
samples_to_frames_ratio: If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to "fixed". Note: if you want to calculate the time stamps as in ESPnet 1, set this parameter to subsampling_factor * frame_duration / 1000.
- Parameters for text preparation:
set_blank: Index of blank in the token list. Default: 0.
replace_spaces_with_blanks: Inserts blanks between words, which is useful for handling long pauses between words. Only used in text_converter="classic" preprocessing mode. Default: False.
kaldi_style_text: Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.
text_converter: How CTC segmentation handles text. Set at module initialization.
- Parameters for alignment:
min_window_size: Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.
max_window_size: Maximum window size. It should not be necessary to change this value.
gratis_blank: If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.
- Parameters for calculation of confidence score:
scoring_length: Block length used to calculate the confidence score. The default value of 30 should be OK in most cases.
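For example, to allow free blank transitions and enlarge the minimum search window, the parameters above can be passed directly (the values are illustrative):
>>> aligner.set_config(gratis_blank=True, min_window_size=12000, scoring_length=30)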
-
text_converter
= 'tokenize'¶
-
time_stamps
= 'auto'¶
-
warned_about_misconfiguration
= False¶
-
class
espnet2.bin.asr_align.
CTCSegmentationTask
(**kwargs)[source]¶ Bases:
object
Task object for CTC segmentation.
When formatted with str(·), this object returns results in a kaldi-style segments file format. The human-readable output can be configured with the printing options.
- Properties:
text: Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).
ground_truth_mat: Ground truth matrix (CTC segmentation).
utt_begin_indices: Utterance separators for the ground truth matrix.
timings: Time marks of the corresponding chars.
state_list: Estimated alignment of chars/tokens.
segments: Calculated segments as (start, end, confidence score).
config: CTC segmentation configuration object.
name: Name of the aligned audio file (Optional). If given, the name is considered when generating the text.
utt_ids: The list of utterance names (Optional). This list should have the same length as the number of utterances.
lpz: CTC posterior log probabilities (Optional).
- Properties for printing:
print_confidence_score: Includes the confidence score.
print_utterance_text: Includes the utterance text.
Initialize the module.
-
char_probs
= None¶
-
config
= None¶
-
done
= False¶
-
ground_truth_mat
= None¶
-
lpz
= None¶
-
name
= 'utt'¶
-
print_confidence_score
= True¶
-
print_utterance_text
= True¶
-
segments
= None¶
-
set
(**kwargs)[source]¶ Update properties.
- Parameters:
**kwargs – Key-value dict that contains all properties with their new values. Unknown properties are ignored.
-
state_list
= None¶
-
text
= None¶
-
timings
= None¶
-
utt_begin_indices
= None¶
-
utt_ids
= None¶
espnet2.bin.slu_inference¶
-
class
espnet2.bin.slu_inference.
Speech2Understand
(slu_train_config: Union[pathlib.Path, str] = None, slu_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]¶ Bases:
object
Speech2Understand class
Examples
>>> import soundfile
>>> speech2understand = Speech2Understand("slu_config.yml", "slu.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2understand(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Understand instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Understand instance.
- Return type:
-
espnet2.bin.slu_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], slu_train_config: Optional[str], slu_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]¶
espnet2.bin.st_inference_streaming¶
-
class
espnet2.bin.st_inference_streaming.
Speech2TextStreaming
(st_train_config: Union[pathlib.Path, str], st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]¶ Bases:
object
Speech2TextStreaming class
Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)
Examples
>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
espnet2.bin.st_inference_streaming.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: str, st_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]¶
espnet2.bin.enh_tse_train¶
espnet2.bin.st_inference¶
-
class
espnet2.bin.st_inference.
Speech2Text
(st_train_config: Union[pathlib.Path, str] = None, st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, enh_s2t_task: bool = False)[source]¶ Bases:
object
Speech2Text class
Examples
>>> import soundfile
>>> speech2text = Speech2Text("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
Speech2Text instance.
- Return type:
-
espnet2.bin.st_inference.
inference
(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: Optional[str], st_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.lm_train¶
espnet2.bin.enh_tse_inference¶
-
class
espnet2.bin.enh_tse_inference.
SeparateSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32')[source]¶ Bases:
object
SeparateSpeech class
Examples
>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_tse_inference.
build_model_from_args_and_file
(task, args, model_file, device)[source]¶
-
espnet2.bin.enh_tse_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool)[source]¶
espnet2.bin.asr_transducer_inference¶
Inference class definition for Transducer models.
-
class
espnet2.bin.asr_transducer_inference.
Speech2Text
(asr_train_config: Union[pathlib.Path, str, None] = None, asr_model_file: Union[pathlib.Path, str, None] = None, beam_search_config: Optional[Dict[str, Any]] = None, lm_train_config: Union[pathlib.Path, str, None] = None, lm_file: Union[pathlib.Path, str, None] = None, token_type: Optional[str] = None, bpemodel: Optional[str] = None, device: str = 'cpu', beam_size: int = 5, dtype: str = 'float32', lm_weight: float = 1.0, quantize_asr_model: bool = False, quantize_modules: Optional[List[str]] = None, quantize_dtype: str = 'qint8', nbest: int = 1, streaming: bool = False, decoding_window: int = 640, left_context: int = 32)[source]¶ Bases:
object
Speech2Text class for Transducer models.
- Parameters:
asr_train_config – ASR model training config path.
asr_model_file – ASR model path.
beam_search_config – Beam search config path.
lm_train_config – Language Model training config path.
lm_file – Language Model path.
token_type – Type of token units.
bpemodel – BPE model path.
device – Device to use for inference.
beam_size – Size of beam during search.
dtype – Data type.
lm_weight – Language model weight.
quantize_asr_model – Whether to apply dynamic quantization to ASR model.
quantize_modules – List of module names to apply dynamic quantization on.
quantize_dtype – Dynamic quantization data type.
nbest – Number of final hypotheses.
streaming – Whether to perform chunk-by-chunk inference.
decoding_window – Size of the decoding window (in milliseconds).
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
Construct a Speech2Text object.
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs) → espnet2.bin.asr_transducer_inference.Speech2Text[source]¶ Build Speech2Text instance from the pretrained model.
- Parameters:
model_tag – Model tag of the pretrained models.
- Returns:
Speech2Text instance.
-
hypotheses_to_results
(nbest_hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[Any][source]¶ Build partial or final results from the hypotheses.
- Parameters:
nbest_hyps – N-best hypothesis.
- Returns:
Results containing different representation for the hypothesis.
- Return type:
results
-
streaming_decode
(speech: Union[torch.Tensor, numpy.ndarray], is_final: bool = False) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]¶ Speech2Text streaming call.
- Parameters:
speech – Chunk of speech data. (S)
is_final – Whether speech corresponds to the final chunk of data.
- Returns:
N-best hypothesis.
- Return type:
nbest_hypothesis
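A hedged sketch of chunk-by-chunk decoding with streaming_decode, assuming a 1-D 16 kHz waveform audio and a model built with streaming=True; the chunk size simply matches the default 640 ms decoding window and is an assumption:
>>> s2t = Speech2Text(
>>>     asr_train_config="asr_config.yml",
>>>     asr_model_file="asr.pth",
>>>     streaming=True,
>>>     decoding_window=640,
>>> )
>>> chunk = 640 * 16  # samples per 640 ms window at 16 kHz
>>> for start in range(0, len(audio), chunk):
>>>     is_final = start + chunk >= len(audio)
>>>     hyps = s2t.streaming_decode(audio[start:start + chunk], is_final=is_final)
>>> results = s2t.hypotheses_to_results(hyps)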
-
espnet2.bin.asr_transducer_inference.
inference
(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], asr_train_config: Optional[str], asr_model_file: Optional[str], beam_search_config: Optional[dict], lm_train_config: Optional[str], lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], key_file: Optional[str], allow_variable_data_keys: bool, quantize_asr_model: Optional[bool], quantize_modules: Optional[List[str]], quantize_dtype: Optional[str], streaming: bool, decoding_window: int, left_context: int, display_hypotheses: bool) → None[source]¶ Transducer model inference.
- Parameters:
output_dir – Output directory path.
batch_size – Batch decoding size.
dtype – Data type.
beam_size – Beam size.
ngpu – Number of GPUs.
seed – Random number generator seed.
lm_weight – Weight of language model.
nbest – Number of final hypotheses.
num_workers – Number of workers.
log_level – Level of verbose for logs.
data_path_and_name_and_type –
asr_train_config – ASR model training config path.
asr_model_file – ASR model path.
beam_search_config – Beam search config path.
lm_train_config – Language Model training config path.
lm_file – Language Model path.
model_tag – Model tag.
token_type – Type of token units.
bpemodel – BPE model path.
key_file – File key.
allow_variable_data_keys – Whether to allow variable data keys.
quantize_asr_model – Whether to apply dynamic quantization to ASR model.
quantize_modules – List of module names to apply dynamic quantization on.
quantize_dtype – Dynamic quantization data type.
streaming – Whether to perform chunk-by-chunk inference.
decoding_window – Audio length (in milliseconds) to process during decoding.
left_context – Number of previous frames the attention module can see in current chunk (used by Conformer and Branchformer block).
display_hypotheses – Whether to display (partial and full) hypotheses.
espnet2.bin.__init__¶
espnet2.bin.hugging_face_export_vocabulary¶
espnet2.bin.gan_tts_train¶
espnet2.bin.uasr_extract_feature¶
-
espnet2.bin.uasr_extract_feature.
extract_feature
(uasr_train_config: Optional[str], uasr_model_file: Optional[str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], batch_size: int, dtype: str, num_workers: int, allow_variable_data_keys: bool, ngpu: int, output_dir: str, dset: str, log_level: Union[int, str])[source]¶
espnet2.bin.svs_inference¶
Script to run inference of a singing-voice-synthesis model.
-
class
espnet2.bin.svs_inference.
SingingGenerate
(train_config: Union[pathlib.Path, str, None], model_file: Union[pathlib.Path, str, None] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 2, forward_window: int = 4, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_checkpoint: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]¶ Bases:
object
SingingGenerate class
Examples
>>> import soundfile
>>> svs = SingingGenerate("config.yml", "model.pth")
>>> wav = svs("Hello World")[0]
>>> soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")
Initialize SingingGenerate module.
-
static
from_pretrained
(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]¶ Build SingingGenerate instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.
- Returns:
SingingGenerate instance.
- Return type:
-
property
fs
¶ Return sampling rate.
-
property
use_lids
¶ Return whether lid (language ID) is needed for inference.
-
property
use_sids
¶ Return whether sid (speaker ID) is needed for inference.
-
property
use_speech
¶ Return whether speech is needed for inference.
-
property
use_spembs
¶ Return whether spembs (speaker embeddings) are needed for inference.
-
espnet2.bin.svs_inference.
inference
(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], use_teacher_forcing: bool, noise_scale: float, noise_scale_dur: float, allow_variable_data_keys: bool, vocoder_config: Optional[str] = None, vocoder_checkpoint: Optional[str] = None, vocoder_tag: Optional[str] = None)[source]¶ Perform SVS model decoding.
espnet2.bin.enh_inference¶
-
class
espnet2.bin.enh_inference.
SeparateSpeech
(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]¶ Bases:
object
SeparateSpeech class
Examples
>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
-
cal_permumation
(ref_wavs, enh_wavs, criterion='si_snr')[source]¶ Calculate the permutation between separated streams in two adjacent segments.
- Parameters:
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of ("si_snr", "mse", "corr")
- Returns:
permutation for enh_wavs (Batch, num_spk)
- Return type:
perm (torch.Tensor)
-
static
from_pretrained
(model_tag: Optional[str] = None, **kwargs)[source]¶ Build SeparateSpeech instance from the pretrained model.
- Parameters:
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns:
SeparateSpeech instance.
- Return type:
-
espnet2.bin.enh_inference.
inference
(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool, enh_s2t_task: bool)[source]¶
espnet2.bin.launch¶
espnet2.bin.enh_train¶
espnet2.bin.aggregate_stats_dirs¶
espnet2.bin.tokenize_text¶
-
espnet2.bin.tokenize_text.
field2slice
(field: Optional[str]) → slice[source]¶ Convert field string to slice.
Note that the field string accepts 1-based integers.
Examples
>>> field2slice("1-")
slice(0, None, None)
>>> field2slice("1-3")
slice(0, 3, None)
>>> field2slice("-3")
slice(None, 3, None)
-
espnet2.bin.tokenize_text.
tokenize
(input: str, output: str, field: Optional[str], delimiter: Optional[str], token_type: str, space_symbol: str, non_linguistic_symbols: Optional[str], bpemodel: Optional[str], log_level: str, write_vocabulary: bool, vocabulary_size: int, remove_non_linguistic_symbols: bool, cutoff: int, add_symbol: List[str], cleaner: Optional[str], g2p: Optional[str], add_nonsplit_symbol: List[str])[source]¶
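A hedged command-line sketch (flag names are assumed to mirror the keyword arguments above; paths are placeholders). Here --field 2- skips the utterance-ID column of a kaldi-style text file, following the 1-based field convention of field2slice:
% python -m espnet2.bin.tokenize_text \
    --input data/train/text --output data/train/tokens.txt \
    --field 2- --token_type char \
    --write_vocabulary true --vocabulary_size 5000 \
    --add_symbol "<blank>:0" --add_symbol "<unk>:1"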