Fun with Speech

Things you can do with HTML5 Natural Language APIs





John Dimm

http://www.johndimm.com/

Feb, 2014

business

  1. find a need
  2. create a solution


toys

  1. find some interesting technology
  2. do something stupid with it

  • Speech Input for HTML Forms
  • Web Speech API
  • AlchemyAPI
  • Bing Search
  • flickr.photos.search
  • WebRTC
  • Bing Translator API
  • Google Translate Text-to-Speech

minimal api demos

toy browser "apps"

Talkshow: shows images of the things you are talking about

Translating Telephone: multilingual video conferencing

two kinds of apps


  1. apps that change the world
  2. apps that have to wait for the world to change around them 


For these two toy apps (Talkshow and Translating Telephone):

  • underlying technology needs a few more quantum leaps forward
  • until then, users must acquire special skills

speech


  • natural
  • hands free
  • no screen needed
  • no keyboard
  • magic -- action at a distance

talk is cheap


  • It’s easy, almost effortless
  • Our preferred way of communicating
  • High bandwidth
  • In fact, we love to do it
  • All the time
  • Some of us can’t stop
  • The easiest form of work
  • One of the first things you learn
  • and the last thing to go

ease of use


ergonomics -- reduce number of clicks and keystrokes

  • one-click is nice
  • voice allows zero-click interface
  • effortless 
  • No clicking, no typing, just say the magic words

voice commands

 
  • Very hard to do well
  • Frustrating for users
  • To reduce errors, we have to limit vocabulary 
  • The user has to know or guess the available commands 
  • Big penalty for getting a command wrong 
  • Errors in speech recognition cannot be avoided
  • User errors are also unavoidable

    error correction


    command line

    1. up arrow
    2. arrow left
    3. type over the error
    4. submit


    voice

    1. repeat entire command until it is recognized

     



    how to use speech




    • voice commands
    • transcription
    • something completely different: react to overheard conversation







    Can a computer make itself useful by listening to my conversations?

    speech input for HTML forms






    Deficiencies: 
     
    • Speech recognition stops at the first pause 
    • No feedback during recognition
    • You have to click on the microphone icon to speak 

      continuous speech recognition



      • In 2013, Google Chrome got the Web Speech API
      • Continuous ASR sessions, lasting several minutes
      • Now we can process and respond to conversational speech
      • Speech meant for other humans, not computers


      interim results
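
A minimal sketch of how interim results might be handled, not the code behind the demos. With the Web Speech API, setting `continuous` and `interimResults` to true on the recognizer makes it deliver result events in which every result carries an `isFinal` flag and a best-guess `transcript`. The interfaces below are local stand-ins that mirror only those fields, and the names `applyResultEvent` and `Transcript` are made up for this example.

```typescript
// Local stand-ins for the fields of a Web Speech API result event used here.
// In a browser these would come from the recognizer's result event.
interface Alternative { transcript: string; }
interface RecognitionResult { isFinal: boolean; 0: Alternative; }
interface ResultEvent {
  resultIndex: number;                    // lowest result index that changed
  results: ArrayLike<RecognitionResult>;
}

// Hypothetical shape for what the page displays.
interface Transcript { finalText: string; interimText: string; }

// Fold one result event into the running transcript.
// Final text accumulates; interim text is rebuilt from scratch on every event.
function applyResultEvent(previousFinal: string, event: ResultEvent): Transcript {
  let finalText = previousFinal;
  let interimText = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const text = result[0].transcript;    // first alternative is the best guess
    if (result.isFinal) {
      finalText += text;
    } else {
      interimText += text;
    }
  }
  return { finalText, interimText };
}
```

In a page, this would be called from the recognizer's `onresult` handler, with `interimText` shown in a lighter color until it is replaced by final text.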




      text analytics

      We can analyze conversational speech as sentences of text
      • Speech is different from written text, but let’s worry about that later

      We can (mis)apply standard text analytics to speech

      • Named Entity Recognition
      • Machine Translation
      • Sentiment Analysis 
      • Domain Classification 
      • Fact Finding 
      • Summarization 
      • Extract semantic frames
      • Normalization
      • Segmentation
      • Simplification

      named entity recognition


      AlchemyAPI provides online NER for 8 European languages

      Free for up to:
      • 1,000 daily transactions
      • 5 concurrent requests

      Install their PHP library on your server and call it with Ajax


      start with some cool technology...


      We have this input: 

      • text from continuous speech recognition
      • list of names that were mentioned


      What can we do with it?



      good / evil


      Clearly this information is interesting to the intelligence community.
        
      • Metadata:  create a graph of the people you talk to
      • NER on content: superimpose a graph of the people you talk about

      But can it be used for good, to give something of value back to the speaker or listener?


      everybody gets an inset


      Have you ever struggled to describe something or someone in words?
      …when a picture would explain everything?
      And yet it’s not worth interrupting your conversation to do an internet search?

      You need Computer Aided Conversation!

      • The computer eavesdrops on your conversation, listening for names
      • When it hears a name, it searches for images of that thing
      • And displays the images on a nearby screen
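
The three steps above can be sketched as a small pipeline. This is an illustration under assumptions, not Talkshow's actual code: `findNames` and `searchImages` are hypothetical stand-ins for a named entity service and an image search service, `show` stands in for whatever draws on the screen, and skipping names that were already shown is a design choice made for this example.

```typescript
// Hypothetical service signatures; real NER and image search APIs will differ.
type FindNames = (text: string) => Promise<string[]>;
type SearchImages = (name: string) => Promise<string[]>;  // resolves to image URLs
type Show = (name: string, urls: string[]) => void;

// Return the names not seen before and remember them (case-insensitive).
function pickNewNames(seen: Set<string>, names: string[]): string[] {
  const fresh: string[] = [];
  for (const name of names) {
    const trimmed = name.trim();
    const key = trimmed.toLowerCase();
    if (key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    fresh.push(trimmed);
  }
  return fresh;
}

// Build a handler to call once per recognized sentence.
function makeTalkshow(findNames: FindNames, searchImages: SearchImages, show: Show) {
  const seen = new Set<string>();
  return async function onSentence(sentence: string): Promise<void> {
    const names = await findNames(sentence);          // 1. listen for names
    for (const name of pickNewNames(seen, names)) {
      const urls = await searchImages(name);          // 2. search for images
      if (urls.length > 0) show(name, urls);          // 3. display them
    }
  };
}
```

Passing the services in as functions keeps the sketch independent of any particular provider, so the same handler could sit on top of different search back ends.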


      finding related images


      Microsoft Bing Search API
      • Free for up to 5,000 transactions per month


      flickr.photos.search API
      • Free



      the disappearing user


      That could have been me talking to a friend.  
      I wasn’t really using the computer.  
      It was just there, listening, acting when it had something to offer to the conversation.  
      Like an attentive servant.

        pictures and meaning


        • Picture Theory of Meaning -- Wittgenstein

        • A statement is meaningful if it pictures a state of the world
        • Let's take that literally -- a statement is like a picture
        • A proper noun "pictures" the thing it names
        • Grand project: can we turn speech into pictures on the fly?
        • Deep Learning, recursive neural nets, semantic grounding

        what we talk about


        connect to a private database of pictures of friends, vacation spots

        add last year's sales figures so they will pop up on a screen in the hallway at work during a water cooler conversation

        using pictures to avoid miscommunication

        the trouble with screens



        The value may be small, but the effort is smaller. Close to null.

        But there's a setup problem...

        We need a screen that both of us can see.  If only there were a shared screen nearby…






        ambient computing

        processing that happens when you are busy doing something else




        At first it will be creepy that computers are listening to us


        But we will get over it


        caveat lector


        Previous predictions

        • Supermarket floors will become wall-to-wall advertisements
        • Robo-calls will grow until everyone's phone is constantly ringing 

        translating telephone



        • WebRTC
        • Machine Translation
        • Text-to-Speech

        web real time communication

        • Real-time, peer-to-peer voice and video communications through a browser without plug-ins
        • Click-to-call from any web page
        • No phone number, navigate to the same page to connect
        • Video conferencing
        • Data Channels
        • JavaScript control of the screen each participant sees


        machine translation


        • Voice-challenged
        • Trained on human-translated bilingual corpora
        • Therefore well-formed sentences
        • High bar for grammaticality
        • Not speech
        • Little training data for speech
        • Disfluencies -- um, huh, ah, er, duh
        • Stop/start
        • Backtracking
        • Parenthetic speech
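
One naive way to narrow the gap between speech and the well-formed text that translators expect is to drop filler words before translating. The sketch below is an assumption-laden illustration, not something the deck's demos are known to do: it handles only the English fillers listed above, it works token by token, and it does nothing about restarts, backtracking, or parenthetic speech. The name `stripDisfluencies` is made up for this example.

```typescript
// The filler words from the list above; English only.
const FILLERS = new Set<string>(["um", "huh", "ah", "er", "duh"]);

// Remove standalone filler tokens from a recognized utterance.
// A token is compared after lowercasing and stripping punctuation,
// so "Um," and "er..." are both removed.
function stripDisfluencies(utterance: string): string {
  return utterance
    .split(/\s+/)
    .filter((token) => {
      if (token.length === 0) return false;
      const bare = token.toLowerCase().replace(/[^a-z']/g, "");
      return !FILLERS.has(bare);
    })
    .join(" ");
}
```

A filter this blunt will sometimes delete a word that carried meaning, for example a deliberate "huh", so it is a starting point at best.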

        translation api


        Bing Translator API
        • Free for up to 2,000,000 characters a month


        speech synthesis



        • Google Translate’s unofficial API
        • 100 characters at a time
        • use https
        • Web Speech is supposed to do synthesis too; it’s coming
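
Because of the 100-character limit, longer translations have to be cut into pieces before they are sent for synthesis. A minimal sketch of one way to do that, breaking on whitespace so words stay intact; `chunkForSpeech` is a name made up for this example, and how each chunk is then requested and played is left out.

```typescript
// Split text into chunks of at most maxLength characters, breaking on whitespace.
// A single word longer than maxLength is hard-split as a last resort.
function chunkForSpeech(text: string, maxLength: number = 100): string[] {
  const chunks: string[] = [];
  let current = "";
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    let rest = word;
    while (rest.length > maxLength) {
      if (current.length > 0) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLength));
      rest = rest.slice(maxLength);
    }
    const candidate = current.length > 0 ? current + " " + rest : rest;
    if (candidate.length <= maxLength) {
      current = candidate;          // the word still fits in the current chunk
    } else {
      chunks.push(current);         // close the chunk and start a new one
      current = rest;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Breaking at sentence or clause boundaries would probably sound better than breaking at arbitrary word boundaries, at the cost of a little more code.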



        translating telephone



        • Multilingual Video Conferencing
        • Everyone broadcasts in their own language
        • Foreign language is translated locally to mine
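
The "translate locally" idea can be sketched as follows. This assumes each participant broadcasts recognized text tagged with its language, for instance over a WebRTC data channel; the deck does not say exactly what is sent, so the `Caption` shape, the `Translate` signature, and both function names are hypothetical.

```typescript
// Hypothetical message each participant broadcasts in their own language.
interface Caption { speaker: string; lang: string; text: string; }

// Hypothetical translator signature; a real translation API will differ.
type Translate = (text: string, from: string, to: string) => Promise<string>;

// Compare primary language subtags, so "en-US" and "en" count as the same language.
function needsTranslation(caption: Caption, myLang: string): boolean {
  const primary = (tag: string) => tag.toLowerCase().split("-")[0];
  return primary(caption.lang) !== primary(myLang);
}

// Runs on the receiving side: foreign captions are translated into my language,
// captions already in my language pass through untouched.
async function localizeCaption(
  caption: Caption,
  myLang: string,
  translate: Translate
): Promise<Caption> {
  if (!needsTranslation(caption, myLang)) return caption;
  const text = await translate(caption.text, caption.lang, myLang);
  return { speaker: caption.speaker, lang: myLang, text };
}
```

Translating on the receiving side means a sender never needs to know which languages the listeners speak, and each listener pays only for the translations they actually need.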
         

        demo


        questions?





        John Dimm
        jdimm@yahoo.com
        http://www.johndimm.com/
        https://github.com/johndimm/FunWithSpeech
