Re: Speech recognition system for Home automation.

Subject: Re: Speech recognition system for Home automation.
From: "Robert Green" <ROBERT_GREEN1963@xxxxxxxxx>
Date: Wed, 19 Sep 2007 21:22:09 -0400
Newsgroups: comp.home.automation
References: <1189340124.126635.118080@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <tmEHi.11654$No2.9810@trndny07> <J4KdnWH-M9OQtHLbnZ2dnUVZ_uKpnZ2d@xxxxxxx> <1190110519.759332.33910@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <icGdnZr9SckmUXLbnZ2dnUVZ_uuqnZ2d@xxxxxxx> <1190234821.116993.262260@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>

"Soren" <soren.skou.nielsen@xxxxxxxxx> wrote in message
news:1190234821.116993.262260@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> On Sep 18, 3:03 pm, "Robert Green" <ROBERT_GREEN1...@xxxxxxxxx> wrote:
>
> > Lab results for SR or VR have to be taken with a grain of salt.  In the real
> > world, people slur their speech far more than they ever realize.  Probably a
> > lot more than researchers do.  I doubt these are double-blind tests, either.
> > The researchers are often on the development team and have a bit of a bias.
>
> I've actually worked on some SR when I did my thesis. Usually you have
> large databases with many people saying the same words, some
> researchers make their own, others download the available ones on the
> net. I've read a few where they'd only use a single database, but the
> serious ones use utterances from different databases to prove the
> robustness of the algorithm. Slurring of speech is really common and
> it can really be a problem. I've listened to many utterances of
> different sentences, and if you did not know the exact words they
> spoke, you could actually be in doubt yourself. Especially if the
> sentence was just random words. That's a huge problem with SR: when we
> as humans hear a mumbled word, we might not notice it at all, since
> the brain perfectly understands the context of what was said, and can
> "guess" the correct word, even though it actually sounds like 5-10
> similar words. It's a bit like the old famous "you dnot hvae to wirte
> the ltteres in the crroect oderr, the bairn sees the wrod as a wolhe".
> Teaching a computer to understand context is an enormous task. Maybe
> possible in a short sentence, but in an entire conversation!? Not
> today. That's the next step :)

You've hit on the key:  context.  The later versions of Dragon Dictate did
have a fairly good understanding of context.  You could actually see it
making contextual corrections on screen, with the fourth or fifth word in a
sentence causing the program to "change" what it had originally "heard" and
displayed in the edit box for the first few words.  The addition of context
sensitivity made for remarkable improvements in recognition and it really
brought home to me how the human brain works.  You can see it when you start
talking to someone about a subject totally different from the previous one.
There's a moment when they "catch up" to the subject change and the "ah ha"
experience takes place.

> As for bias, yes it certainly happens. One should always make an
> effort to spot weaknesses in articles, but it can be very tricky
> unless you've actually done the work. I've been fooled a couple of
> times. :)

There are all sorts of traps to look out for in studies.  This week the LA
Times had a remarkable article about epidemiology and a study that "proved"
<g> that Canadians born under the star sign Sagittarius are 38% more likely
to break a leg than people of other star signs:

< http://www.latimes.com/features/health/la-he-epidemiology17sep17,1,1542211,full.story?coll=la-headlines-health >

( aka http://tinyurl.com/34bva3)

"SAGITTARIANS are 38% more likely to break a leg than people of other star
signs -- and Leos are 15% more likely to suffer from internal bleeding. So
says a 2006 Canadian study that looked at the reasons residents of Ontario
province had unplanned stays in the hospital."

Of course, that's probably not really true - although it might be - because
of quirks in the way the numbers are crunched.  A friend suggested a
possible chain of causation.  It's the time of year when the weather first
turns cold in Canada and people celebrating their birthdays out drinking
might find themselves walking or driving on ice.  Later in the year, people
are more likely to wear boots or shoes with non-slip soles.  What troubled
me most about the article is that they were able to insert magical
corrections to smooth out the numbers.  If you can "fix" outcomes you don't
like, how valid is the entire process?
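
One common reading of results like the star sign one is the multiple
comparisons problem.  That reading is an assumption here, as is the usual
0.05 significance threshold.  As a rough sketch: if you run many independent
tests on pure noise, the chance that at least one looks "significant" grows
quickly.  The function name below is made up for illustration.

```python
def chance_of_false_alarm(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    looks 'significant' purely by chance at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 12 star signs checked against a growing list of diagnoses
    for num_tests in (1, 12, 24, 100):
        print(num_tests, round(chance_of_false_alarm(num_tests), 3))
```

With twelve signs and even a handful of diagnoses, a spurious "hit" somewhere
is closer to the expected outcome than to a surprise.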

> > I've used at least 5 different incarnations of Dragon Dictate and other
> > similar programs over the years and the improvement has been phenomenal, but
> > I'm also aware that I speak a very different way to the dictation mike than
> > I do to another human being.  The - spacing - between - each - word - is -
> > very - pronounced - and - my - wife - certainly - is - not - pleased -
> > when - I - use - the - same - halting - speech - on - her.  Why do I talk to
> > DD that way?  Well, while I was training DD, it was also training me!  I
> > learned that the best way to avoid misrecognition was to - speak - like -
> > this - and - enunciate - very - clearly.
>
> I've read somewhere (sorry, no references) that some of the newest SR
> software is able to detect words in fluent speech. I bought the HM2007
> chip, which is old and discontinued, so I will definitely have to speak
> very clearly and with "large" spacing if I want to say full
> sentences, 1-2 secs in between words I believe the datasheet said. But
> what I am after in the beginning is really just a robust on/off SR
> switch. The HM2007 has a 40 word memory. You could train "on" and
> "off" 20 times each. The more the better, as is also shown in the
> article: the more utterances they average over, the better the
> error rate becomes. They averaged over 300 utterances and got
> 99% accuracy in some cases.

I wonder what happens if you use all the slots to train the words ON and OFF
under all the likely noise conditions you'll encounter?  I'd love to be able
to control at least some of the lights by voice, especially the switches I
am likely to encounter with my hands full of tools or laundry or dogs.  But
the reality is that a large paddle switch that I can operate with my elbow
will probably be more reliable, overall.
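
A minimal sketch of the "all slots for ON and OFF" idea, assuming the
recognizer simply reports which trained slot (1 to 40) matched.  The slot
layout and the function names are hypothetical, not taken from the HM2007
datasheet.

```python
NUM_SLOTS = 40  # word memory size quoted in the thread


def build_slot_map(commands, num_slots=NUM_SLOTS):
    """Split slots 1..num_slots evenly among the commands, in order.

    With two commands, slots 1-20 hold trainings of the first command
    and slots 21-40 hold trainings of the second.
    """
    per_command = num_slots // len(commands)
    slot_map = {}
    slot = 1
    for command in commands:
        for _ in range(per_command):
            slot_map[slot] = command
            slot += 1
    return slot_map


def decode(slot, slot_map):
    """Return the command trained in this slot, or None if the slot is unknown."""
    return slot_map.get(slot)


SLOT_MAP = build_slot_map(["ON", "OFF"])

if __name__ == "__main__":
    for reported_slot in (1, 20, 21, 40, 99):
        print(reported_slot, decode(reported_slot, SLOT_MAP))
```

Each slot could then be trained in a different noise condition (TV on,
dishwasher running, and so on), and any of the 20 matches counts as the
same command.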

I was considering using my cordless phone system as an input to the speech
control but the dilemma was obvious.  If my hands were so full that I
couldn't operate a wall light switch, picking up a cordless phone, punching
some keys and THEN speaking the command wouldn't make sense.  It's like the
$3.76 wall clock I bought from Walmart today.  The box says "for warranty
service send the unit back, prepaid along with a $5 check."  Uh huh.  (-:
They should also request a certificate of stupidity!

> > For home automation, I don't think I'd accept every 5th command going
> > unheard, or worse, misheard and the wrong action taken.
>
> Exactly, for a robust system that people would actually want to use in
> their daily lives you'd almost need 99% accuracy. Not there yet, but
> getting there :) False triggering can be avoided somewhat by using
> special sequences of words that are not too similar. That'll be my
> initial approach. Something along the lines of a trigger word, and then
> the command.

That sounds like a good approach.  I thought about using whistles, clicks,
or a simple loud "Hey!" which seems to work well on the dog, even in noisy
conditions.
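
The trigger-then-command idea could be sketched roughly like this.  The
class name, the "HEY" trigger, and the 3 second window are all assumptions
for illustration; the recognizer is assumed to hand over one recognized
word at a time along with a timestamp in seconds.

```python
class TriggerGate:
    """Only pass a command through if it directly follows the trigger word."""

    def __init__(self, trigger="HEY", commands=("ON", "OFF"), window_s=3.0):
        self.trigger = trigger
        self.commands = set(commands)
        self.window_s = window_s
        self._armed_at = None  # time the trigger was last heard, if armed

    def hear(self, word, now):
        """Feed one recognized word; return a command to act on, or None."""
        if word == self.trigger:
            self._armed_at = now
            return None
        armed = (self._armed_at is not None
                 and (now - self._armed_at) <= self.window_s)
        self._armed_at = None  # any other word disarms the gate
        if armed and word in self.commands:
            return word
        return None


if __name__ == "__main__":
    gate = TriggerGate()
    for word, t in [("ON", 0.0), ("HEY", 1.0), ("ON", 2.0), ("OFF", 2.5)]:
        print(word, t, gate.hear(word, t))
```

A stray "ON" from the TV does nothing unless the trigger was heard just
before it, which is the point of the two-step sequence.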

> > I use a number of SR-based services, and some of them are quite good at
> > natural speech processing.  But most of them, like my pharmacy refill
> > system, are restricted to a very, very narrow set of commands, usually 0
> > through 9, the pound and the star key, and sometimes "yes" or "no."  Whether
> > that's because these were simply fast ports from touch tone customer
> > response systems or because recognizing only those commands boosts
> > reliability tremendously, I can't say.
>
> Recognizing only a very limited set of commands does boost
> reliability, and keeps costs of stand-alone systems to a minimum.

Agreed.  I've hung up on some of the more ambitious systems because they got
so far out of whack compared to what I was trying to say that I couldn't get
back to the beginning.  The error recovery process was not very good.   That
particular company switched back to a numerically-based system, but they now
use so many options per tier that it's still very hard to use.  Not many
people can remember the first few choices when a machine spits out ten
different options in a row.

> > I'd be more surprised than you to find such systems working well outside the
> > lab or without the kind of computing horsepower that puts the cost or size
> > or complexity outside the reach of your requirements.
>
> I believe that any really robust system would probably be very
> complex and expensive. I don't know how well the mastervoice
> (butler-in-a-box) works, but at a price tag of $3000, I hope it works very
> well. It's probably also a full PC with some attachments.

Modern multi-core CPUs have incredible processing horsepower compared to
the 300MHz PCs that I started doing SR with.  I use a 600MHz Pentium-class
machine without too much time lag for voice dictation, which I use when my
bad hand tendons act up.  It works well until I get a sore throat from
speaking so much!

> A cheap
> robust system? Robustness is a highly valued quality in SR or VR, and
> people would be willing to pay big bucks for that extra 5-15%
> improvement in error rate. My hope is that you could construct a simple
> system with very few features that works reasonably well. I'll share
> my error rate when I get there :) A bit like the "clapper" on/off
> switch that was so popular in the 80's, only more robust :)

You've touched on something interesting.  The clapper's success was probably
due, in part, to the loud clapping of hands being a fairly distinct aural
event, even in a noisy environment.  The trick is to figure out how to
create a similarly unique sound with your voice.  A yodel, a wolf howl,
something very distinct from normal human speech yet not so weird that your
neighbors will call the local insane asylum might work.

> > What does this have to do with sound recognition?  It's still not mature
> > enough to reach the point where the "stink" of early failures doesn't follow
> > it.  Yes, you can make it work but you have to "want to" - it's not going to
> > be bulletproof out of the box without some adjustment on the part of the
> > user.
>
> I think that's a good comparison, and I couldn't agree more.

I've been ambushed by that phenomenon more than once and probably will be
again!

> > I hope you find what you're looking for and if it works, share the results
> > with us.  The time for cheap, reliable and standalone SR is fast
> > approaching.
>
> Thanks :)
>
> > LCD technology has mostly cleared the hurdles that plagued early products.
> > I'm hoping home SR will get there, too, without requiring the use of tiny
> > tracking shotgun microphones in every room or a permanent Star Trek
> > communicator badge.  Ironically, though, I think the resolution of the
> > problem *will* be the badge because so many other technologies are
> > converging on the endpoint of wireless connection between the electronic
> > world and a person's eyes, ears and mouth.
>
> Interesting point.

I'm pretty sure that's where we are going to end up.  Phones and ear pieces
keep getting smaller and smaller.  Eventually they'll be implants although
some recent studies have implied that implanted micro-electronic devices may
increase the risk of cancer.  Sorry, I'm too tired to look that up, but I
believe it was in the NY Times Science section for anyone remotely
interested in following up.

> As a matter of fact, I'm still not really decided on LCD technology
> yet. Ok, I switched from my old CRT to a 22" Wide LCD for my computer,
> and it looks great! But I'm kind of wary of buying a large LCD TV. I
> still see old "LCD effects" in rapid camera movement, even on some of
> the new LCDs. And of course, to make matters worse, HD TV has not
> reached Danish broadcasters yet. I think SEDs look promising, but
> LCDs can probably improve tremendously in the time it takes for SED
> TVs to reach today's LCD prices.

I did almost exactly the same thing.  I got a 22" wide LCD and it looked
so good compared to my 4-year-old laptop that I decided that LCDs had
indeed improved tremendously in the last few years, as Lewis G. had
suggested.  $400 later and I own a no-name 32" LCD TV that looks remarkable
with no noticeable ghosting and very vivid colors and contrast.  And since
it has a VGA input (and 6 others, including HDMI, component, composite and
CATV),  I can pipe anything I own that produces video into it.  Programs
like Stargate Atlantis, which are filmed in HD, look outrageously good at
1080.  One thing's for sure about HDTV.  Prop masters and makeup artists are
going to have to work a lot harder now that everything is under a magnifying
glass.

--
Bobby G.
