After covering Azure TTS in the previous article (link), this time I will look at the reverse technology: speech recognition. Microsoft of course offers an API for this as well, but for speech recognition there is also Facebook's SeamlessM4T model, which works well, so this model alone is enough for our purposes.
Facebook SeamlessM4T actually bundles several technologies:
- S2ST (Speech-to-Speech Translation): directly translating speech in one language into speech in another, such as English speech into Chinese speech.
- S2TT (Speech-to-Text Translation): directly translating speech in one language into text in another, such as English speech into Chinese text.
- T2ST (Text-to-Speech Translation): converting text in one language directly into speech in another.
- T2TT (Text-to-Text Translation): translating text between different languages.
- ASR (Automatic Speech Recognition): converting speech to text in the same language.
The functionality is quite comprehensive. This time we will only use the ASR function; the others may come in handy in the future. If you want to try it first, there is also an online demo available for you to check out.
Basic environment installation
Some basic environment setup, such as Anaconda and shared scripts, has already been covered in the article "Common Operations". Please read that first to ensure that all of the commands below can run correctly.
Create a Conda environment
Because each project has different dependencies, we will create a separate environment for each case.
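A minimal sketch of creating the environment; the environment name `m4t` and the Python version are my own choices here, not something the project mandates:

```shell
# Create a dedicated Conda environment for this project and activate it.
conda create -n m4t python=3.10 -y
conda activate m4t
```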
Download project and model
Here we will download the models "seamless-m4t-vocoder" and "seamless-m4t-large". If they are not downloaded in advance, they will be fetched on first use and stored under "~/.cache/torch/hub/". If you just want a quick trial, you can skip this part, as modifying the model paths is somewhat involved.
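As a sketch, the project can be cloned from GitHub and the two model repositories fetched from Hugging Face; the Hugging Face repo names below are assumptions, and the downloads require git-lfs and run to several gigabytes:

```shell
# Clone the project source.
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication

# Fetch the two model repos from Hugging Face (assumed repo names; requires git-lfs).
git lfs install
git clone https://huggingface.co/facebook/seamless-m4t-large
git clone https://huggingface.co/facebook/seamless-m4t-vocoder
```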
Since what we cloned is the source code of a Python package, some modifications are needed to point it at the local model paths. To load the models from a specified location, modify the following three files located in src/seamless_communication/assets/cards/:
- seamlessM4T_large.yaml: multitask_unity_large.pt
- vocoder_36langs.yaml: vocoder_36langs.pt
- unity_nllb-100.yaml: tokenizer.model
Replace the checkpoint and tokenizer entries with the paths to the downloaded models:
- seamlessM4T_large.yaml: file://home/ubuntu/m4t/seamless_communication/seamless-m4t-large/multitask_unity_large.pt
- vocoder_36langs.yaml: file://home/ubuntu/m4t/seamless_communication/seamless-m4t-vocoder/vocoder_36langs.pt
- unity_nllb-100.yaml: file://home/ubuntu/m4t/seamless_communication/seamless-m4t-large/tokenizer.model
Note that the file scheme here should have only two slashes, not three: using three slashes (file:///) will result in an error.
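The edit itself can also be scripted. The sketch below demonstrates it on a throwaway copy of a card file; the `checkpoint:` key name and card layout are assumptions, so check the actual yaml files first, as they may differ by version:

```shell
# Demonstrate the edit on a throwaway copy of a card file.
# (In practice, edit the real files under src/seamless_communication/assets/cards/.)
cat > /tmp/seamlessM4T_large.yaml <<'EOF'
name: seamlessM4T_large
checkpoint: "https://example.com/multitask_unity_large.pt"
EOF

# Point the checkpoint at the locally downloaded file (note: two slashes).
sed -i 's|^checkpoint:.*|checkpoint: "file://home/ubuntu/m4t/seamless_communication/seamless-m4t-large/multitask_unity_large.pt"|' /tmp/seamlessM4T_large.yaml
grep '^checkpoint:' /tmp/seamlessM4T_large.yaml
```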
Next, we will install the seamless_communication package. This process downloads many dependencies, so please be patient.
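A sketch of the install step, assuming you are in the directory where the repository was cloned:

```shell
# Install the package from the cloned source tree;
# this pulls in the project's dependencies as well.
cd seamless_communication
pip install .
```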
Executing an example
After installation, you can run the example program. It performs speech recognition on the "out.wav" file and prints the result. If a model path was set incorrectly earlier, an error will occur here; correct the path, re-run the pip install step, and try again.
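For reference, a sketch of what such an example program can look like with the v1 Python API. The import path and the `predict` signature are assumptions that may differ between versions, so check the repository's own examples before relying on this:

```python
import torch
from seamless_communication.models.inference import Translator

# Load the multitask model and vocoder; the names match the asset cards
# we pointed at local files earlier.
translator = Translator(
    "seamlessM4T_large",
    "vocoder_36langs",
    torch.device("cuda:0"),
    torch.float16,
)

# ASR task: transcribe out.wav into English text (same-language speech-to-text).
transcribed_text, _, _ = translator.predict("out.wav", "asr", "eng")
print(transcribed_text)
```

Running this requires a CUDA-capable GPU; on a CPU-only machine, pass `torch.device("cpu")` and `torch.float32` instead.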
As time goes by, the AI toolbox keeps filling out: large language models, translation, TTS, speech recognition. The more of these technologies we understand, the more diverse combinations we can create.