espnet_onnx demonstration¶
This notebook provides a demonstration of how to export your trained model into the ONNX format. Currently, only ASR is supported.
See also:
- ESPnet: https://github.com/espnet/espnet
- espnet_onnx: https://github.com/Masao-Someki/espnet_onnx
Author: Masao Someki
Table of Contents¶
Install Dependency
Export your model
Inference with onnx
Using streaming model
Install Dependency¶
To run this demo, you need to install the following packages:
- espnet_onnx
- torch >= 1.11.0 (already installed in Colab)
- espnet
- espnet_model_zoo
- onnx

torch, espnet, espnet_model_zoo, and onnx are required to run the export demo.
[ ]:
!pip install -U espnet_onnx espnet espnet_model_zoo onnx
# in this demo, we need to update scipy to avoid an error
!pip install -U scipy
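After installation, you can quickly confirm that the environment matches the requirement above (a minimal check; torch is already installed in Colab):
[ ]:
# the export demo expects torch >= 1.11.0
import torch
print(torch.__version__)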
Export your model¶
Export model from espnet_model_zoo¶
The easiest way to export a model is to use espnet_model_zoo. You can download, unpack, and export a pretrained model with the export_from_pretrained method. espnet_onnx saves the ONNX models into a cache directory, which is ${HOME}/.cache/espnet_onnx by default.
[ ]:
# export the model.
from espnet_onnx.export import ModelExport
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
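export_from_pretrained can likely also produce a quantized model, mirroring the quantize flag of the export call shown later in this notebook; this is an assumption, so check the espnet_onnx documentation for the exact options.
[ ]:
# optionally export a quantized model as well
# (the quantize flag here is assumed, mirroring the export() call used below)
m.export_from_pretrained(tag_name, quantize=True)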
Export from custom model¶
espnet_onnx can also export your own trained model with the export method. To export your model, prepare an espnet2.bin.asr_inference.Speech2Text instance and pass it to export. You can also export from a zipped file by using the export_from_zip function; a hedged sketch of that call is shown after the next cell. In this demo we use the from_pretrained method to load parameters, but you can load your own model.
[ ]:
# prepare the espnet2.bin.asr_inference.Speech2Text instance.
from espnet2.bin.asr_inference import Speech2Text
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
speech2text = Speech2Text.from_pretrained(tag_name)
# export model
from espnet_onnx.export import ModelExport
sample_model_tag = 'demo/sample_model_1'
m = ModelExport()
m.export(
    speech2text,
    sample_model_tag,
    quantize=False
)
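The export_from_zip function mentioned above can be used when your model is packed into a zip file (for example, one produced by the espnet2 packing tool). The path and tag below are placeholders, and the keyword arguments are an assumption modeled on the export call above, so check the espnet_onnx README for the exact signature.
[ ]:
# export from a zipped model file (sketch; path and tag are placeholders)
from espnet_onnx.export import ModelExport

m = ModelExport()
m.export_from_zip(
    'path/to/model.zip',             # zipped espnet2 model (placeholder path)
    tag_name='demo/sample_model_2',  # hypothetical tag to refer to the exported model later
    quantize=False
)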
Inference with onnx¶
Now, let’s use the exported models for inference.
[1]:
# please provide the tag_name to specify the exported model.
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
# upload a wav file and run inference!
import librosa
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
# Use the exported onnx model for inference.
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name)
nbest = speech2text(y)
print(nbest[0][0])
Saving LJ050-0030.wav to LJ050-0030 (1).wav
/usr/local/lib/python3.7/dist-packages/espnet_onnx/asr/scorer/interface.py:96: UserWarning: RNNDecoder batch score is implemented through for loop not parallelized
self.__class__.__name__
ih n uw n ih sh ih z ah v aa l z ow r eh sil t er m ey z z
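If you keep the exported files somewhere other than the default cache, it should also be possible to point espnet_onnx at the model directory directly instead of using a tag name. The model_dir argument below is an assumption, so check the espnet_onnx README if it does not match your installed version.
[ ]:
# load the exported model from an explicit directory instead of a tag name
# (model_dir is assumed; by default models live under ${HOME}/.cache/espnet_onnx)
from espnet_onnx import Speech2Text

speech2text = Speech2Text(model_dir='path/to/exported/model/directory')
nbest = speech2text(y)
print(nbest[0][0])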
Using streaming model¶
Model export is exactly the same as for the non-streaming model. You can follow the Export your model chapter.
For streaming, you can additionally specify the following configuration. Usually, these values should be the same as in the training configuration.
- block_size
- hop_size
- look_ahead
The length of the speech fed at each step should be the same as streaming_model.hop_size. This value is calculated from the frontend (STFT) and encoder subsampling configuration.
For example, the length of the speech is 8704 with the following configuration (a quick check of this number is sketched below).
- block_size = 40
- hop_size = 16
- look_ahead = 16
- encoder.subsample = 4
- stft.n_fft = 512
- stft.hop_length = 128
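The exact formula is not reproduced here, but the example numbers above are consistent with the arithmetic sketched below. This relation is inferred from the example only; the definitive value is streaming_model.hop_size, which is used in the demonstration that follows.
[ ]:
# arithmetic consistent with the example above (assumed relation, not an official formula)
stft_hop_length = 128
encoder_subsample = 4
hop_size = 16
n_fft = 512
print(stft_hop_length * encoder_subsample * hop_size + n_fft)  # -> 8704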
Now, let’s demonstrate the streaming inference.
[ ]:
# Export the streaming model.
# Note that the following model is very large
from espnet_onnx.export import ModelExport
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
[3]:
# In this tutorial, we will use a recorded wav file to simulate streaming.
import librosa
from espnet_onnx import StreamingSpeech2Text
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
streaming_model = StreamingSpeech2Text(tag_name)
# upload wav file
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
num_process = len(y) // streaming_model.hop_size + 1
print(f"I will split your audio file into {num_process} blocks.")
# simulate streaming.
streaming_model.start()
for i in range(num_process):
    # prepare wav file
    start = i * streaming_model.hop_size
    end = (i + 1) * streaming_model.hop_size
    wav_streaming = y[start : end]
    # apply padding if len(wav_streaming) < streaming_model.hop_size
    wav_streaming = streaming_model.pad(wav_streaming)
    # compute asr
    nbest = streaming_model(wav_streaming)
    print(f'Result at position {i} : {nbest[0][0]}')
final_nbest = streaming_model.end()
print(f'Final result : {final_nbest[0][0]}')
Saving LJ050-0030.wav to LJ050-0030 (2).wav
I will split your audio file into 4 blocks.
Result at position 0 :
Result at position 1 : the commis
Result at position 2 : and the commiss
Result at position 3 : the commission also recommen
Final result : the commission also recommends