Adding Speech Recognition support

Speech Recognition is quickly becoming a very robust technology, and I think it would be a great addition to Tryton because it can be useful in several use cases, such as:

  • Commanding Sao (say “Next Tab”, “New Record”, etc. without the need for mouse clicks)
  • Filling Text fields (such as notes) to enter long texts. Very useful not only from a PC but also from a tablet or a smartphone
  • Giving commands such as “Show me the last 20 invoices” or “Show me a line chart of sales of Product X in the last year”

As an example of how far the technology has come, take a look at https://cloud.google.com/speech-to-text/, where you’ll see that it is already possible to recognize more than 100 languages, produce punctuated transcriptions, or even perform speaker diarization.

In order to add Speech Recognition we need two things:

  1. A Speech Recognition engine to convert speech to text
  2. A means to “understand” the text and turn it into effective commands

Speech Recognition

The Speech Recognition part is the easiest thanks to existing engines. For example, browsers already implement the Web Speech API (which includes Speech Synthesis too).

You can do a quick test using Chrome/Chromium by saving the following JavaScript code in custom.js in sao:

var script = document.createElement('script');

script.onload = function () {
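    // annyang is a global exposed by the script loaded below.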
    if (annyang) {
        var commands = {
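            // '*action' captures everything said after the word "tryton".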
            'tryton *action': function(action) {
                jQuery('#global-search-entry').focus();
                jQuery('#global-search-entry').val(action);
            }
        };
        annyang.addCommands(commands);
        annyang.setLanguage('en-US');
        annyang.start();
    }
};
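// Load annyang from a CDN; the onload handler above runs once it is ready.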
script.src = '//cdnjs.cloudflare.com/ajax/libs/annyang/2.6.0/annyang.min.js';
document.head.appendChild(script);

You just need to say “Tryton Customer Invoices” and it will fill the global search widget with the text “Customer Invoices”. The popup does not open, but you get the idea of what is easily achievable.

Interpreting the transcript

As stated above, there are several use cases which would need to be addressed in different ways:

  • Commanding sao should be relatively simple using, for example, Annyang (the library used in the example above). We should simply define the commands required for moving around sao, make them translatable, and implement their actions (see the sketch after this list). I already gave some examples: “New Record” could trigger the “New Record” action of the currently opened form, “Next Tab” could change to the next tab, etc.

  • Filling text fields could be an extension of commanding sao. For example, saying “Input <field> <text>” could fill the given text into the given field. So “Input Description The customer does not want us to deliver the goods yet” would save the text “The customer does not want us to deliver the goods yet” to the “Description” field.

  • Other interesting interpretations such as “Show me the last 20 invoices” could be handled on the server. The existing global search widget could be a good place to handle that and we could provide a means for modules to extend it.
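
As a rough sketch of what that command table could look like, building on the annyang example above. The phrases are just examples, and trigger_new_record() and fill_field() are hypothetical helpers that sao would still need to provide:

var commands = {
    // "new record" would trigger the New action of the current form;
    // trigger_new_record() is a hypothetical helper, not existing sao API.
    'new record': function() {
        trigger_new_record();
    },
    // ':field' matches a single spoken word and '*text' captures the rest,
    // so "input description some long note" fills the Description field.
    // fill_field() is likewise hypothetical.
    'input :field *text': function(field, text) {
        fill_field(field, text);
    }
};
annyang.addCommands(commands);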

For me, that feature should simply allow us to execute existing Tryton actions (window actions would be the most common), but we could consider adding a new “Text” action (ir.action.text) which simply returns a text to give the user an answer (Speech Synthesis could be used to read the answer aloud). For example, the user could say “Tryton, tell me the current balance of account 572” and it would simply answer “20.000€”, with no need to open a new tab to get that information.
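
If such a hypothetical ir.action.text were added, the client side could use the Speech Synthesis half of the Web Speech API to read the answer aloud. A minimal sketch, assuming the new action type simply carries a text attribute:

function speak_answer(action) {
    // SpeechSynthesisUtterance and speechSynthesis are standard Web Speech API.
    var utterance = new SpeechSynthesisUtterance(action.text);
    utterance.lang = 'en-US';  // should follow the user's language
    window.speechSynthesis.speak(utterance);
}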

For handling user-supplied sentences, something similar to ALICE could be implemented in trytond. You can find more information on how it works at https://www.pandorabots.com/docs/aiml-fundamentals/.

The idea is that the ir module could handle sentences such as “Show me the last <number> <records>”, but each module could add its own Categories (in AIML terminology). So the sale module could add the category to interpret “Show me a line chart of sales of <product> in the last <period>”.
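
A minimal sketch of such a module-extensible pattern table, shown in JavaScript for illustration although, as said above, it would live in trytond. The patterns are examples and open_tab() and open_chart() are hypothetical:

var categories = [
    // Registered by the ir module.
    {pattern: /^show me the last (\d+) (\w+)$/i,
     action: function(match) { open_tab(match[2], Number(match[1])); }},
    // Registered by the sale module.
    {pattern: /^show me a line chart of sales of (.+) in the last (.+)$/i,
     action: function(match) { open_chart(match[1], match[2]); }}
];

function interpret(sentence) {
    // Run the first category whose pattern matches the transcript.
    for (var i = 0; i < categories.length; i++) {
        var match = sentence.match(categories[i].pattern);
        if (match) {
            return categories[i].action(match);
        }
    }
}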

But it is only supported by Google (see the support tables at https://caniuse.com).

There are also privacy and security concerns, as the recorded audio is sent to Google. For some businesses it may be unacceptable to have such recordings, which may be sensitive, sent to a foreign company.

For me this looks more like a bot answering questions. The questions are transcribed from voice, but that is a detail. I think it would make more sense to create a specific bot with more advanced language interpretation than just pattern matching. This does not need to be inside Tryton; it could be a side service with just RPC access.
The benefit of a bot is that it can be plugged into many channels like IRC, XMPP rooms, Slack, Twitter, etc.

According to https://hacks.mozilla.org/2016/01/firefox-and-the-web-speech-api/, Firefox also supports the SpeechRecognition API, but it needs to be activated in about:config.

I tried it on my computer and although the parameter is available, it does not seem to have any effect. Maybe it works on other operating systems…

The bot is only useful for text-type answers. In the other cases, such as “Show me the latest invoice”, it should simply open a tab (so it is an action).

For the case of entering text into a Text widget, we could consider adding a small icon that the user can press to start the SpeechRecognition API and fill in the field, as sketched below.
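
A minimal sketch of that idea, assuming a hypothetical mic icon rendered next to each textarea. SpeechRecognition is the standard API; Chrome still exposes it as webkitSpeechRecognition:

var Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

// '.mic-icon' is a hypothetical class for the icon next to each Text widget.
jQuery(document).on('click', '.mic-icon', function() {
    var textarea = jQuery(this).siblings('textarea');
    var recognition = new Recognition();
    recognition.lang = 'en-US';  // should follow the user's language
    recognition.onresult = function(event) {
        // Append the first transcript alternative to the field.
        textarea.val(textarea.val() + event.results[0][0].transcript);
    };
    recognition.start();  // the browser asks for microphone permission
});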

Well, the bot could answer with a URL in some cases. That would be more powerful than having a limited set of predefined actions.

Why not, as long as the user has to give permission to use the microphone.

I tested vosk-api library and it’s amazing:

  • It’s open source
  • Installs with a simple “pip3 install vosk”
  • Does speech recognition offline (no worries about data privacy)
  • Supports 18 languages
  • Works extremely well for the languages I could test: Catalan, English and Spanish
  • Models can be downloaded individually and take just between 50 and 100 MB once uncompressed