espnet_onnx demonstration¶
This notebook provides a demonstration of how to export your trained model into the ONNX format. Currently, only ASR is supported.
See also:
- ESPnet: https://github.com/espnet/espnet
- espnet_onnx: https://github.com/Masao-Someki/espnet_onnx
Author: Masao Someki
Table of Contents¶
Install Dependency
Export your model
Inference with onnx
Using streaming model
Install Dependency¶
To run this demo, you need to install the following packages:
- espnet_onnx
- torch >= 1.11.0 (already installed in Colab)
- espnet
- espnet_model_zoo
- onnx

torch, espnet, espnet_model_zoo, and onnx are required to run the export demo.
[ ]:
!pip install -U espnet_onnx espnet espnet_model_zoo onnx
# in this demo, we need to update scipy to avoid an error
!pip install -U scipy
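After installation, you can quickly confirm that the environment matches the requirement above (a minimal check; torch is already installed in Colab):
[ ]:
# the export demo expects torch >= 1.11.0
import torch
print(torch.__version__)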
Export your model¶
Export model from espnet_model_zoo¶
The easiest way to export a model is to use espnet_model_zoo. You can download, unpack, and export a pretrained model with the export_from_pretrained method. espnet_onnx saves the ONNX models into a cache directory, which is ${HOME}/.cache/espnet_onnx by default.
[ ]:
# export the model.
from espnet_onnx.export import ModelExport
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
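export_from_pretrained can likely also produce a quantized model, mirroring the quantize flag of the export call shown later in this notebook; this is an assumption, so check the espnet_onnx documentation for the exact options.
[ ]:
# optionally export a quantized model as well
# (the quantize flag here is assumed, mirroring the export() call used below)
m.export_from_pretrained(tag_name, quantize=True)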
Export from custom model¶
espnet_onnx can also export your own trained model with the export method. To export your model, prepare an espnet2.bin.asr_inference.Speech2Text instance and pass it to export. You can also export from a zipped file by using the export_from_zip function; a hedged sketch of that call is shown after the next cell. In this demo we use the from_pretrained method to load parameters, but you can load your own model.
[ ]:
# prepare the espnet2.bin.asr_inference.Speech2Text instance.
from espnet2.bin.asr_inference import Speech2Text
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
speech2text = Speech2Text.from_pretrained(tag_name)
# export model
from espnet_onnx.export import ModelExport
sample_model_tag = 'demo/sample_model_1'
m = ModelExport()
m.export(
    speech2text,
    sample_model_tag,
    quantize=False
)
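The export_from_zip function mentioned above can be used when your model is packed into a zip file (for example, one produced by the espnet2 packing tool). The path and tag below are placeholders, and the keyword arguments are an assumption modeled on the export call above, so check the espnet_onnx README for the exact signature.
[ ]:
# export from a zipped model file (sketch; path and tag are placeholders)
from espnet_onnx.export import ModelExport

m = ModelExport()
m.export_from_zip(
    'path/to/model.zip',             # zipped espnet2 model (placeholder path)
    tag_name='demo/sample_model_2',  # hypothetical tag to refer to the exported model later
    quantize=False
)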
Inference with onnx¶
Now, let’s use the exported models for inference.
[1]:
# please provide the tag_name to specify the exported model.
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
# upload a wav file and run inference!
import librosa
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
# Use the exported onnx model for inference.
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name)
nbest = speech2text(y)
print(nbest[0][0])
Saving LJ050-0030.wav to LJ050-0030 (1).wav
/usr/local/lib/python3.7/dist-packages/espnet_onnx/asr/scorer/interface.py:96: UserWarning: RNNDecoder batch score is implemented through for loop not parallelized
self.__class__.__name__
ih n uw n ih sh ih z ah v aa l z ow r eh sil t er m ey z z
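If you keep the exported files somewhere other than the default cache, it should also be possible to point espnet_onnx at the model directory directly instead of using a tag name. The model_dir argument below is an assumption, so check the espnet_onnx README if it does not match your installed version.
[ ]:
# load the exported model from an explicit directory instead of a tag name
# (model_dir is assumed; by default models live under ${HOME}/.cache/espnet_onnx)
from espnet_onnx import Speech2Text

speech2text = Speech2Text(model_dir='path/to/exported/model/directory')
nbest = speech2text(y)
print(nbest[0][0])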
Using streaming model¶
Model export is exactly the same as for the non-streaming model. You can follow the Export your model chapter.
For streaming, you can additionally specify the following configuration. Usually, these values should be the same as in the training configuration.
- block_size
- hop_size
- look_ahead
The length of the speech fed at each step should be the same as streaming_model.hop_size. This value is calculated from the frontend (STFT) and encoder subsampling configuration.
For example, the length of the speech is 8704 with the following configuration (a quick check of this number is sketched below).
- block_size = 40
- hop_size = 16
- look_ahead = 16
- encoder.subsample = 4
- stft.n_fft = 512
- stft.hop_length = 128
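The exact formula is not reproduced here, but the example numbers above are consistent with the arithmetic sketched below. This relation is inferred from the example only; the definitive value is streaming_model.hop_size, which is used in the demonstration that follows.
[ ]:
# arithmetic consistent with the example above (assumed relation, not an official formula)
stft_hop_length = 128
encoder_subsample = 4
hop_size = 16
n_fft = 512
print(stft_hop_length * encoder_subsample * hop_size + n_fft)  # -> 8704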
Now, let’s demonstrate the streaming inference.
[ ]:
# Export the streaming model.
# Note that the following model is very large
from espnet_onnx.export import ModelExport
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
[3]:
# In this tutorial, we will use a recorded wav file to simulate streaming.
import librosa
from espnet_onnx import StreamingSpeech2Text
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
streaming_model = StreamingSpeech2Text(tag_name)
# upload wav file
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
num_process = len(y) // streaming_model.hop_size + 1
print(f"I will split your audio file into {num_process} blocks.")
# simulate streaming.
streaming_model.start()
for i in range(num_process):
    # prepare wav file
    start = i * streaming_model.hop_size
    end = (i + 1) * streaming_model.hop_size
    wav_streaming = y[start : end]
    # apply padding if len(wav_streaming) < streaming_model.hop_size
    wav_streaming = streaming_model.pad(wav_streaming)
    # compute asr
    nbest = streaming_model(wav_streaming)
    print(f'Result at position {i} : {nbest[0][0]}')
final_nbest = streaming_model.end()
print(f'Final result : {final_nbest[0][0]}')
Saving LJ050-0030.wav to LJ050-0030 (2).wav
I will split your audio file into 4 blocks.
Result at position 0 :
Result at position 1 : the commis
Result at position 2 : and the commiss
Result at position 3 : the commission also recommen
Final result : the commission also recommends