Usable speech recognition

Robert · November 2016

Annemarie pointed out this topic to me. My main task at Athom is to make Homey perform the correct action based on a text/speech command, so I'll try to give some background as to why Homey sometimes cannot "hear" you correctly or seems to not have heard anything at all.

To explain, we need to look at what happens between the moment your voice command leaves your lips and the point where Homey executes the command. First, the sound waves you create travel through the room and get picked up by the microphones. The signals picked up by the microphones are combined and sent to external partners, which turn the audio sample into a text string. The text string is returned to your Homey, and Homey then determines what action should be performed based on the words that were found.
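As a rough illustration of that chain (not Homey's actual code), the steps can be sketched in Python. Every name here is hypothetical: `combine_channels`, `choose_action`, `handle_command`, the `fake_stt` stand-in for the external partner, and the `ACTIONS` table are all assumptions made for the example, and averaging the channels is only one simple way microphone signals could be combined.

```python
from typing import Callable, Dict, List


def combine_channels(channels: List[List[float]]) -> List[float]:
    """Average the samples of several microphones into one signal (assumed method)."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]


def choose_action(text: str, actions: Dict[str, str]) -> str:
    """Return the first action whose trigger words all occur in the text."""
    words = set(text.lower().split())
    for trigger, action in actions.items():
        if set(trigger.split()) <= words:
            return action
    return "no_match"


def handle_command(
    channels: List[List[float]],
    speech_to_text: Callable[[List[float]], str],
    actions: Dict[str, str],
) -> str:
    """Microphones -> combined audio -> external STT -> text -> action."""
    audio = combine_channels(channels)
    text = speech_to_text(audio)
    return choose_action(text, actions)


def fake_stt(audio: List[float]) -> str:
    """Hypothetical stand-in for the external speech-to-text partner."""
    return "turn on the lights"


# Hypothetical trigger words mapped to hypothetical action names.
ACTIONS = {"lights on": "lights_on", "lights off": "lights_off"}

result = handle_command([[0.0, 1.0], [1.0, 1.0]], fake_stt, ACTIONS)
print(result)
```

The point of the sketch is that the action step only ever sees the text string, so any mistake made in the speech-to-text step carries straight through to the chosen action.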

Ok, so that is what's happening. Now let's look at the steps which are causing the inaccuracies you are experiencing. The point where it almost always goes wrong is the Speech To Text (STT) conversion performed by our partners. These companies have typically trained their models with close-mic samples, which are speech samples where the speaker had the mic right by their mouth. This produces a clean signal with little noise and echo. However, in Homey's case the audio has had to travel through a room where it has bounced off the walls. This reduces the quality and makes the audio less similar to the samples which the STT model was trained with. To make things worse, this is happening in an environment which often introduces additional distortion, for example from the TV or a truck driving by outside.

This means that Homey's audio sample does not match the model as well, so the STT engine is not confident that it has detected one correct word and instead ends up with several possibilities. It then picks the most likely possibility by looking at which words have been detected around the "doubtful" word, and chooses the combination of words which most often appear together. Determining which words appear together most often is done by reading large bodies of text and keeping track of how often one word is followed by some other word. These bodies of text are often things like several years of newspaper articles or books.

In summary, noise forces the model to guess, and it guesses wrong because commands said to Homey are often not word combinations typically found in books or newspapers.
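To make the "which words appear together" idea concrete, here is a minimal sketch of counting word pairs in a tiny made-up corpus and using those counts to choose between candidates for a doubtful word. This is an assumption-laden toy, not the partners' actual model: the corpus, the function names `count_bigrams` and `pick_candidate`, and the scoring rule (add the counts with the previous and the following word) are all invented for illustration.

```python
from collections import Counter


def count_bigrams(corpus):
    """Count how often each word is directly followed by another word."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_candidate(previous, candidates, following, counts):
    """Pick the candidate seen most often between its two neighbouring words."""
    def score(word):
        return counts[(previous, word)] + counts[(word, following)]
    return max(candidates, key=score)


# Tiny stand-in for "several years of newspaper articles or books".
corpus = [
    "she turned on the lights",
    "please turn on the radio",
    "turn in your homework",
    "he did turn on the heating",
]
counts = count_bigrams(corpus)

# The engine is unsure about the word between "turn" and "the".
best = pick_candidate("turn", ["on", "in", "and"], "the", counts)
print(best)
```

With context on both sides the counts clearly favour one candidate. When a command uses a word pair the corpus never contains, every candidate scores zero and the choice is effectively arbitrary, which is the guessing problem described above.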

There are definitely solutions to this problem, all of which are unfortunately very resource-intensive. The reason the Echo is working better right now is that they have literally had hundreds of people working on it. At Athom, it's just me. I am doing my best every day to make Homey understand you better, but it simply takes a lot of hard work to get improvements.

Now on the bright side! I think there are some simple steps you can take to get Homey to hear you better. I control Homey almost exclusively using voice, and it works >90% of the time for me. Here's what might help you:

- Use short, grammatically correct phrases instead of just words. If you are saying a phrase which is more likely to occur in a book, you are improving your odds of the STT understanding it correctly. So try saying "turn on the lights" instead of "lights on".

- Don't put a word that is key to your command at the very beginning or end of your phrase. So "turn on the light" is better than "turn the light on" because the word "on" is key to this command, and in the second example there are fewer words surrounding it, so there is less context to help determine the right word.

- Use the mobile app. A mobile phone's recording is far more similar to the close-mic samples that the STT engine was trained with, so it will work better.

- Place Homey near the spot where you are most likely to be when you give a voice command. For example, I have placed my Homeys on my desk, next to my bed and next to the couch.

- Don't put Homey near a source of noise. (Obviously)

I hope these tips will help you become a little more satisfied with Homey's voice recognition. Thank you for being active on the forum and for your continued support. If you have any more questions about how to get the speech recognition working better for you, please let me know in this topic!

Fire69 · November 2016

Robert said:

Annemarie pointed out this topic to me. My main task at Athom is to make Homey perform the correct action based on a text/speech command, so I'll try to give some background as to why Homey will sometimes not "hear" you correctly or seems to not have heard anything at all.

Thanks for the very elaborate info!

I have 1 question/remark though:

Robert said:

The point where it almost always goes wrong is with the Speech To Text (STT) conversion performed by our partners.

Robert said:
- Place Homey near the spot where you are most likely to be when you give a voice command. For example, I have placed my Homeys on my desk, next to my bed and next to the couch.

These 2 points are obviously linked to each other.
But first of all, most users only have 1 Homey

And because of that, we can't place it somewhere close to where we will be giving the voice commands.

And that's what I'm seeing when I give a command up close or farther away. Up close the orange ring fluctuates very visibly. But when you're further than 1 or 2 meters away, it hardly fluctuates at all, like it's not hearing anything.
So that's why I was thinking it's not because of noise or bouncing voices, but just simply the volume that's too low?

Robert · November 2016

Fire69 said: But first of all, most users only have 1 Homey
And because of that, we can't place it somewhere close to where we will be giving the voice commands.

Definitely. But in that case you can still try placing your Homey near the one place where you think you will give most speech commands. Even though that will not cover everything, it may still be an improvement.

Fire69 said: Up close the orange ring fluctuates very visibly. But when you're further than 1 or 2 meters away, it hardly fluctuates at all, like it's not hearing anything.
So that's why I was thinking it's not because of noise or bouncing voices, but just simply the volume that's too low?

This has to do with signal-to-noise ratio. The microphones are always receiving some noise, regardless of where you are standing relative to Homey. As you move further away, the signal from your voice received by the mic drops off rapidly (roughly with the square of the distance), since you do not produce a straight beam of sound; instead your sound waves spread out in all directions. If we were to boost that recording we would also be boosting the noise, so there is no net gain. I have extensively tested noise reduction algorithms so that we would actually be able to improve accuracy by boosting the signal, but unfortunately the noise pattern proved too complex to filter out in a way that would actually make the audio better resemble the samples the STT models were trained with.
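The "no net gain" argument can be checked with a small worked example. It assumes idealised free-field conditions in which the voice power at the microphone falls with the square of the distance (a real room with reflections will behave differently) and a constant noise level. The numbers `NOISE_POWER` and `GAIN` and the function names are assumptions picked for illustration, not measurements from Homey.

```python
import math


def voice_power_at(distance_m, power_at_1m=1.0):
    """Voice power at the mic, assuming ideal inverse-square falloff."""
    return power_at_1m / distance_m ** 2


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(signal_power / noise_power)


NOISE_POWER = 0.01  # assumed constant room noise at the microphone

for distance_m in (0.5, 1.0, 2.0, 4.0):
    signal = voice_power_at(distance_m)
    print(f"{distance_m:>4} m: SNR = {snr_db(signal, NOISE_POWER):5.1f} dB")

# Boosting the recording multiplies voice and noise by the same gain,
# so the ratio between them (the SNR) does not change.
GAIN = 16.0
far_signal = voice_power_at(4.0)
snr_before = snr_db(far_signal, NOISE_POWER)
snr_after = snr_db(far_signal * GAIN, NOISE_POWER * GAIN)
print(f"4 m before boost: {snr_before:.2f} dB, after boost: {snr_after:.2f} dB")
```

Under these assumptions every doubling of the distance costs about 6 dB of signal-to-noise ratio, and amplification gives none of it back.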

Pils · November 2016

Thanks @robert for this useful explanation.

It raises a few questions or reactions, hope you can comment on that.

I use Dutch. So in many normal sentences one of the key words is at the end, such as "zet de lampen aan/uit" ("turn the lamps on/off") or "zet de lampen op rood" ("set the lamps to red").

The sentences should be normal, like you talk to people; otherwise we are adjusting to Homey instead of the other way around. As @emile said, it must be usable by his or my mother.

Using the phone app for speech is not a real option, I think. Getting the phone when at the dinner table, opening the app and saying the command is not a real improvement compared to getting my remote and switching on the lights.

Furthermore, to control the TV with Homey it must be in front of the TV. Putting him in the middle of the room could be better for voice, but not for IR.

Also I found out something which I think is also one of the problems.
I have a pre-listening sound and post-listening sound.

I noticed that when I'm further away he stops listening in the middle of a sentence. Closer by, that problem occurs less often.

Could it be that he listens too briefly, and that we could extend that by a few seconds?

Mathijs · November 2016

A 90% success rate would indeed make it very usable. However, and do not take offence, I have never seen or heard of any user who got anything even close to that. The most positive reports seem to hover around 50%, most far below that. Would it be possible for you to make a short video? Not to prove it works, but so we get an idea of how your setup works and how you actually speak?

Personally I got perhaps a 10% success rate even though the Homey is only 30 cm away.

Rocodamelshe · November 2016

90% is close to my experience too.
I took a short video to show you. It's a bit dark, but you can see that Homey is approx. 10 cm away from my Playbar, which is playing music at the time of recording. I am sitting exactly 5 meters away from Homey (measured!) and facing Homey when I speak.

Fire69 · November 2016

Rocodamelshe said:

90% is close to my experience too.
I took a short video to show you. It's a bit dark, but you can see that Homey is approx. 10 cm away from my Playbar, which is playing music at the time of recording. I am sitting exactly 5 meters away from Homey (measured!) and facing Homey when I speak.

That's just... wow...

My Homey is in about the same position as yours, underneath the tv, but when I'm in my seat about 4m away it doesn't fluctuate like that at all... And it doesn't understand a word I say (/speech-input doesn't even show any results). And that's with the TV off.

Joolee · November 2016

Hey Robert, now that someone from Athom is actually responding to these questions: at the very start, Athom communicated that the signal processing chip was unused at the time. A bit later (I can't find that any more) they also told us that an external company was busy recording samples and writing firmware for the processing chip.

After that, I asked a few times on Slack, but nothing has been communicated on this subject. So now the questions:
- Is the chip being utilised right now? (I hope not; that would mean the recognition probably won't get much better and we'd better give up on it.)
- What about that external company?

//Edit:
Just thought up another question; I like how Athom has you (someone working full-time on this). But why don't we ever see any updates about the voice recognition in the weekly status reports? As stated earlier, Athom has been dead-quiet about the voice issues so I personally had already given up on it.

honey · November 2016

REALLY?
Just to recap:
Our homes are noisy and we have too much echo.
Athom does not have the resources to tackle the issue.
There is no solid plan how to fix it.
Be positive it works for some people.
We need to learn how to speak.

Sell this on the open market.

Homey has two microphones that could help to reduce the noise.
I have dead silence in my room, so nothing is wrong with the environment here.
I took Homey into an almost echo-free room (kids' bedroom, lots of things all around, fabrics, carpet, tons of soft toys), yet no improvement. Anyway, echo should not be the issue: every home has echo, so Athom should work around it, not the user or the interior designer.
Tested Google Now in the living room from 5 meters away and it works. So please don't hang on to the explanation that mobile devices work because of the small distance.

"These companies have typically trained their models with close mic samples". Of course that is how you do any kind of base sampling. Very disappointing.

G4nd41f · November 2016

@Robert
Have you been trying the Google speech recognition for Homey and if yes how were the results? I think Emilie mentioned that somewhere. It works really well on my phone, even if I am 3-4m away.

RobinVanKekem · November 2016

Joolee · November 2016

RobinVanKekem said:
[video]

Can't we send the speech-to-text output of Echo to Homey? :P

EdTst · November 2016

I did a simple test (in Dutch) and took the examples from Athom's website. I made sure the room was quiet and I spoke in a normal speaking voice at a distance of 1.5 m from Homey.

To be sure it wasn't me, I retried with someone else.

I say: Noem een getal
Homey heard: een getal
2nd try: een getal

I say: Geef een nummer groter dan 6
Homey heard: rotterdam 6
2nd try: wolven die 6

I say: Kop of munt
Homey heard: Kop of munt
2nd try: Kop of munt

I say: Wat is je ip adres
Homey heard: wat is jouw e-mailadres
2nd try: Wait is je ip adres

I say: Zet de led ring aan
Homey heard: Herinnering aan
2nd try: ze met ring

When the room isn't quiet or when I am more than 3m away, Homey understands almost nothing.

Gerjan · November 2016

Is there a list with all the voice commands Homey supports?

EdTst · November 2016

Gerjan said:

Is there a list with all the voice commands Homey supports?

Yes: https://www.athom.com/en/support/KB000037/

RobinVanKekem · November 2016

Even the large companies can't do it right the first time:

https://plus.google.com/110558071969009568835/posts/Hjscz7h5MvF

jjtbsomhorst · November 2016

So never ever compare one sales video with another sales video. Of course in those videos the company shows you what it could be doing, not the pain that goes with it.

Robert · November 2016

Pils said:
I noticed that when I'm further away he stops listening in the middle of a sentence. Closer by, that problem occurs less often.

Could it be that he listens too briefly, and that we could extend that by a few seconds?

That's interesting. When to stop listening is determined by detecting the presence of sounds associated with voice, followed by a certain period without voice sounds. In the near future I plan to take another look at the thresholds and see if I can get the listening window to better match when we are actually saying something to Homey in different scenarios.
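A simplified sketch of this kind of end-of-speech detection is shown below. It is not Homey's implementation: it assumes that "sounds associated with voice" can be approximated by a per-frame energy above a fixed threshold, and the function name, the threshold values and the sample energies are all made up for the example.

```python
def find_end_of_speech(frame_energies, voice_threshold, silence_frames_needed):
    """Return the index of the frame at which listening stops, or None.

    Listening stops once voice has been heard and is then followed by
    `silence_frames_needed` consecutive frames below `voice_threshold`.
    """
    heard_voice = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= voice_threshold:
            heard_voice = True
            silent_run = 0
        elif heard_voice:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return index
    return None


# Made-up frame energies for the same sentence spoken nearby and far away.
near = [0.0, 0.9, 0.8, 0.1, 0.9, 0.7, 0.0, 0.0, 0.0]
far = [0.0, 0.6, 0.55, 0.05, 0.3, 0.2, 0.6, 0.5, 0.0]

near_stop = find_end_of_speech(near, voice_threshold=0.5, silence_frames_needed=3)
far_stop = find_end_of_speech(far, voice_threshold=0.5, silence_frames_needed=3)
print(near_stop, far_stop)
```

In the made-up far-away recording the quieter words in the middle fall below the threshold, so the detector sees a long enough "silence" and stops while the speaker is still talking. That would match the cut-off sentences reported above, and it is why both the threshold and the required silence length matter.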

Mathijs said:
A 90% success rate would indeed make it very usable. However, and do not take offence, I have never seen or heard of any user who got anything even close to that. The most positive reports seem to hover around 50%, most far below that. Would it be possible for you to make a short video? Not to prove it works, but so we get an idea of how your setup works and how you actually speak?

Personally I got perhaps a 10% success rate even though the Homey is only 30 cm away.

Sure! If I get around to it I will post a video later this week.

G4nd41f said:
@Robert
Have you been trying the Google speech recognition for Homey and if yes how were the results? I think Emilie mentioned that somewhere. It works really well on my phone, even if I am 3-4m away.

We did test the Google speech api, but we came to the conclusion that the extra time and costs involved mean it is not the right choice to implement Google STT at this time.

Pils · November 2016

@Robert
Maybe it is an idea to make the threshold adjustable for geeks, so we can experiment with that.
In the meantime I changed some of my short commands, like "eten" ("dinner"), to "we gaan eten" ("we're going to eat") and "eten graag" ("dinner, please"), but with no better results. The problem with the short listening time gets bigger when I use long sentences.

Maybe we can add the feature to let him say "sorry, I did not hear what you were saying", after which he listens again for another chance to say the sentence, instead of us having to say "ok homey" again.

I have also spent some time reading about the Amazon Echo (Alexa), and after reading that I'm afraid that Homey won't be able to understand 90% in the future. Why? Because the Echo has 7 microphones in 7 directions. That must be for a reason. Maybe we could get better results if we knew where the mics are in Homey, so we can turn it a little so that the mics face the living room.

Also, if Homey won't be able to listen better in the near future, a good solution would be for it to work with the Echo (Dot). Then we have to pay 80 euro and we can use their speech engine. Maybe Athom can sell them as a package deal with some other stuff.

Although I bought Homey mainly for its voice support, it is also capable of talking many device languages. So a good integration with third-party voice recognition is a real option. For your bedroom, at close range, Homey can listen, but in a living room that does not seem to be the case.

ZperX · November 2016

Robert said:
We did test the Google speech api, but we came to the conclusion that the extra time and costs involved mean it is not the right choice to implement Google STT at this time.

That is the only STT that works well with a wide range of microphones (including single microphones) and noisy environments. Are you taking this issue seriously? Based on the improvement since introduction... Hmm.

If we want speech recognition, buy an Echo.
If we want to use IR, buy a Logitech hub.
+ a Z-Wave extender, if Homey's 4 m range is not enough.

€300 + €140 + €80 + €40

Nice.

ArenBreur · November 2016

Pils said:

Maybe we can add the feature to let him say "sorry, I did not hear what you were saying" and then he listens again for another chance to say the sentence. Instead of again saying "ok homey".

I think that this is a really good idea,
because getting Homey to listen with "ok homey" does not have a 100% hit ratio either.

honey · November 2016

While Homey speaks it can't listen, meaning such voice feedback would delay the moment you could retry. A short ping or some audio effect would probably be more suitable.

Jerryvdv · November 2016

Pils said:

Maybe we can add the feature to let him say "sorry, I did not hear what you were saying"

@Pils : Probably not the answer you are looking for but I posted a flow for the feedback in the Library / Flows that work topic: https://forum.athom.com/discussion/936/positive-flows-that-work/p3

Cheers!

Rocodamelshe · November 2016

Someone mentioned that even the sound of the "afzuigkap" (extractor hood) is disturbing Homey. I just took a short video with an iPhone placed 15 cm away from Homey. Homey itself is standing approx. 10 cm away from my Sonos Playbar, which is playing Qmusic at the time of recording. I am sitting 5 meters away from Homey (measured!) and facing Homey when I speak.

honey · November 2016

Yeah, we can see that Dutch works. English please.

Fire69 · November 2016

honey said:

Yeah, we can see that Dutch works. English please.

I don't really think there's a difference between English and Dutch.
Either it works from a distance or it doesn't. If you talk to it from really close, it might work ok...

EdTst · November 2016

Rocodamelshe said:

Someone mentioned that even the sound of the "afzuigkap" (extractor hood) is disturbing Homey. I just took a short video with an iPhone placed 15 cm away from Homey. Homey itself is standing approx. 10 cm away from my Sonos Playbar, which is playing Qmusic at the time of recording. I am sitting 5 meters away from Homey (measured!) and facing Homey when I speak.

This is interesting, because here it doesn't even work at 2 m in a silent room. Could it be that there is different hardware in circulation? I have one of the first devices.

MarcelTimmermans · November 2016

Mmh, you would almost think that not every Homey has the same quality of mics or casing or so.

jjtbsomhorst · November 2016

Maybe everyone who has issues with it should bring their homey to the meetup today and see what happens over there

HansieNL · November 2016

When I got my first Homey my voice was not recognized very well, but it got much better after a few weeks. I'm mostly approx. 4 meters away.
I got a brand new Homey a few days ago and recognition is still very good.
