US20100312555A1 - Local and remote aggregation of feedback data for speech recognition - Google Patents
- Publication number: US20100312555A1
- Authority: US (United States)
- Prior art keywords: data, speech recognition, user, feedback, local
- Prior art date: 2009-06-09
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
      - G10L15/00—Speech recognition
        - G10L15/01—Assessment or evaluation of speech recognition systems
        - G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
          - G10L2015/025—Phonemes, fenemes or fenones being the recognition units
        - G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
          - G10L15/063—Training
            - G10L2015/0631—Creating reference templates; Clustering
            - G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
          - G10L15/065—Adaptation
        - G10L15/08—Speech classification or search
          - G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
          - G10L15/18—Speech classification or search using natural language modelling
            - G10L15/183—using context dependencies, e.g. language models
            - G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
        - G10L15/28—Constructional details of speech recognition systems
          - G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
Description
- One of the forefronts of computing technology is speech recognition, because people often find speech to be a familiar and convenient way to communicate information. With computerized applications controlling many aspects of daily activities from word processing to controlling appliances, providing speech recognition based interfaces for such applications is a high priority of research and development for many companies. Web site operators and other content providers are deploying voice driven interfaces to allow users to browse their content. One of the more visible implementations of speech recognition is Interactive Voice Response (IVR) systems, where a caller can interact with a computer application through natural speech as opposed to pressing telephone keys or other mechanical methods.
- In a traditional speech recognition system, the audio from a phone call may be recorded, transcribed, and then used to directly train a new speech recognition system as part of a feedback loop for a datacenter. System developers may also purchase sampled recordings from data consolidators in order to generate or enhance their training models. In a local application environment, where the speech recognition system is installed, operated, and maintained by a party (e.g. a user) independent from the system developer, the user has little incentive, and significant privacy concerns, regarding providing the exact audio of what they said to the system developer. This may disadvantage the system developer's efforts to enhance and update the speech recognition product with accurate data.
- This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to exclusively identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.
- Embodiments are directed to providing a local feedback mechanism for customizing training models based on user data and directed user feedback in speech recognition applications, where feedback data may be filtered to address privacy concerns and further provided to a system developer for enhancement of generic training models. According to some embodiments, locally employed data may also be filtered to address privacy concerns at a different level than the data to be submitted to the system developer. Moreover, collection of data once considered potentially private but later identified as not private may be enabled.
- These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory and do not restrict aspects as claimed.
- FIG. 1 is a conceptual high level diagram of a speech recognition system;
- FIG. 2 is a diagram illustrating remote feedback loop mechanisms in an example speech recognition system;
- FIG. 3 is a block diagram illustrating major components of a speech recognition system employing local and remote feedback loops according to embodiments;
- FIG. 4 is a networked environment, where a system according to embodiments may be implemented;
- FIG. 5 is a block diagram of an example computing operating environment, where embodiments may be implemented; and
- FIG. 6 illustrates a logic flow diagram for implementing a local and remote feedback looped speech recognition system.
- As briefly described above, a single framework for local and remote feedback loops for enhancing training models in speech recognition systems may incentivize users to submit valuable data to system developers, enabling them to improve the accuracy of the systems. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the spirit or scope of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
- While the embodiments will be described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that aspects may also be implemented in combination with other program modules.
- Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that embodiments may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and comparable computing devices. Embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
- Embodiments may be implemented as a computer-implemented process (method), a computing system, or as an article of manufacture, such as a computer program product or computer readable media. The computer program product may be a computer storage medium readable by a computer system and encoding a computer program that comprises instructions for causing a computer or computing system to perform example process(es). The computer-readable storage medium can for example be implemented via one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, or a compact disk, and comparable media.
- Throughout this specification, the term “server” generally refers to a computing device executing one or more software programs, typically in a networked environment. However, a server may also be implemented as a virtual server (software programs) executed on one or more computing devices viewed as a server on the network. More detail on these technologies and example operations is provided below. Also, the term “engine” (as in speech recognition engine) is used to refer to a self-contained software application that has input(s) and output(s).
- FIG. 1 is a block diagram illustrating top level components in a speech recognition system. Speech recognition system 112 begins the process of speech recognition by receiving speech input 102. The audio signal is provided to acoustic analysis module 104 and linguistic analysis 106, followed by generation of textual data 110 by text generation module 108. Speech recognition system 112 recognizes words, phrases, and the like, based on customized language and acoustic models (114). The consumption of the recognized audio and the recognition process may be interactive, according to some embodiments, where user feedback for selection or correction of a recognized portion of the received speech is received before the entire utterance is recognized.
- As mentioned before, the speech recognition process takes in audio input and provides textual or other output. For example, the output may include commands or other control input for different applications without the intermediate step of constructing textual data. In recognizing utterances, a speech recognition system may utilize a language model and an acoustic model. The language model may be generated and/or adapted through statistical modeling of words, phrases, fragments, etc. that form a user's profile. Statistical data from user language model statistics and a generic language model may be used in generating the adapted language model customized for the particular user's profile.
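- The patent does not prescribe a particular adaptation formula. One common technique for combining user language model statistics with a generic model is linear interpolation; the following minimal Python sketch rests on that assumption, and every name in it is hypothetical rather than taken from the patent.

```python
# Sketch: blending user statistics with a generic language model by linear
# interpolation -- an assumed technique, not the patent's mandated method.

def adapted_probability(trigram: tuple, user_lm: dict, generic_lm: dict,
                        weight: float = 0.3) -> float:
    """P_adapted = weight * P_user + (1 - weight) * P_generic."""
    p_user = user_lm.get(trigram, 0.0)        # estimated from harvested user text
    p_generic = generic_lm.get(trigram, 0.0)  # shipped with the recognizer
    return weight * p_user + (1.0 - weight) * p_generic
```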
- The acoustic model may be based on live or stored audio recordings by the users, which are used for generating statistics data to adapt a generic acoustic model to the customized acoustic model. The acoustic and language models are then used by the speech recognition process to generate textual data for processing by other applications.
- Components of the speech recognizing system may be loaded into a server, executed over a distributed network, executed in a client device, and the like. Furthermore, the components described above are for illustration purposes only, and do not constitute a limitation on the embodiments. A speech recognizing system may be implemented using fewer or additional components in various orders such as additional models (e.g. confidence models). Individual components may be separate applications, or part of a single application. The speech recognition system or its components may include individually or collectively a user interface such as a web service, a Graphical User Interface (GUI), and the like.
- FIG. 2 is a diagram illustrating remote feedback loop mechanisms in an example speech recognition system. By providing a process for both local adaptation and remote feedback within a single framework, a system according to embodiments incentivizes users to provide valuable training data to system developers. By taking data from particular steps of the local adaptation process, privacy concerns about the data sent back to the developer can be reduced.
- In a local feedback loop (also called ‘adaptation’), data from the user is collected to make the speech recognition models match the user better. In a remote feedback loop, data from many users is collected to make the speech recognition models better in general, or better match the scenario. Two main models used in speech recognition are typically modified through this process: acoustic models, which model what phonemes sound like, and language models, which model how words fit together to make sentences.
- As shown in diagram 200, system developer 224 may include application(s) executed on one or more servers 222 that are arranged to receive feedback from a plurality of users (238, 248, etc.), perform statistical analysis on the received feedback, and update existing acoustic and language models based on the analysis results. For example, words not included in original models but popularized since then (e.g. Zune®) may be determined from frequent use by many users and be added to the models. Updated models and other data may be stored in the system developer's data stores 226.
- Users 238, 248 may utilize speech recognition applications executed on computers 232, 242 of client systems 234, 244. Local data, such as locally customized models, may be stored in the client systems' data stores 236, 246. Client systems 234, 244 may communicate with system developer 224 over one or more networks 220 and provide feedback data as discussed above. Client systems 234, 244 may also receive updated generic models from system developer 224 over networks 220.
- The feedback data and the updated models may be exchanged through various methods including, but not limited to, email, file transfer, access through a hosted service, and comparable ones.
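- The patent leaves the exchange mechanism and format open. As an illustration only, the intermediate feedback file uploaded to the developer might look like the following sketch; every field name here is an assumption, not a format defined by the patent.

```python
import json

# Hypothetical intermediate feedback file (illustrative assumption only).
payload = {
    "base_model_version": "generic-lm-1.0",       # which shipped model the counts extend
    "trigram_counts": {"review the project": 2},  # aggregated, privacy-scrubbed counts
    "new_words": ["contoso"],                     # locally learned words, without context
    "user_opted_in": True,                        # explicit permission flag
}

with open("feedback_upload.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, indent=2)  # the file could then travel by email, upload, etc.
```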
- FIG. 3 is a block diagram illustrating major components of a speech recognition system employing local and remote feedback loops according to embodiments. Generally, in creating an updated model for local adaptation, there is a local aggregation step, because the models do not contain specific examples but instead contain statistics about what is likely. According to one embodiment, the local adaptation framework is leveraged for the remote feedback loop. Rather than sending the raw data back for the remote feedback loop, the local adaptation framework is used to generate an intermediate file, which is forwarded to the system developer. This way, privacy concerns are reduced.
- According to another embodiment, directed user feedback is integrated into the feedback loop framework. Directed user feedback is the process where the speech recognition system learns from the user's correction of an utterance, such as a response to a yes/no question. This goes above and beyond generally learning what the user sounds like. By integrating local and remote feedback loops, the directed user feedback mechanism may be added to the remote feedback loop as well. Without the local feedback loop, a user typically does not have an incentive to provide this data.
- Moreover, having a closed loop enables the system to collect data whose privacy status may be uncertain at the beginning. The feedback loop may resolve the uncertainty and classify the collected data as not private, in which case the data may be forwarded to the system developer.
- Speech recognition service 342 of diagram 300 may collect user data (in audio and text format) from a variety of sources. For example, documents, emails, and the like in the user's computer may be harvested to determine words or word sequences frequently used by the user. For dictation systems, the standard language modeling technology is generally called “n-grams”; if n is 3, they are called “trigrams”. A trigram is the probability of a particular three-word sequence occurring. In the local feedback loop, data from the contents of emails, word processing documents, spreadsheets, and the like is collected by the adaptation aggregator 346. For every detected trigram (or other sequence), the number of times the trigram is used is determined. This is an intermediate form.
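- As a concrete illustration of this intermediate form, the sketch below reduces harvested text to trigram counts. It is a minimal assumed implementation; the function and variable names are not from the patent.

```python
import re
from collections import Counter

def aggregate_trigrams(documents):
    """Reduce harvested text to trigram counts -- the 'intermediate form'
    that stands in for the raw documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[\w']+", text.lower())
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

# Two short harvested "documents":
counts = aggregate_trigrams(["please review the project plan",
                             "review the project plan today"])
# counts[("review", "the", "project")] == 2
```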
- In a system according to embodiments, directed feedback by user 344 may also be provided to adaptation aggregator 346 and added to the collected data, resulting in intermediate aggregated data 348. Intermediate aggregated data 348 is not as verbose as the original text, and therefore raises significantly fewer privacy concerns. It is also aggregated over some body of content. The privacy concerns may be further reduced by removing specific contents from the intermediate data 348 at privacy scrubber 350. For example, any trigrams containing digits may be removed, since many sources of privacy concern include digits (phone numbers, social security numbers, credit card numbers, and similar ones).
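- The digit rule just described can be sketched as a simple filter over the trigram counts; this is an assumed implementation, shown for illustration only.

```python
def scrub_digits(trigram_counts):
    """Drop any trigram containing a digit, since phone numbers, social
    security numbers, and credit card numbers all surface as digits."""
    return {trigram: count for trigram, count in trigram_counts.items()
            if not any(ch.isdigit() for word in trigram for ch in word)}
```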
- According to one embodiment, the intermediate aggregated data 348 is provided to local adaptation module 354, which generates updated models 356 for immediate use by the speech recognition service 342. According to another embodiment, the intermediate aggregated data 348 is also provided to system developer 352 after the filtering by privacy scrubber 350. System developer 352 provides updated or new models to speech recognition service 342, further enhancing the quality of the recognition process.
- Speech recognition systems are provided with a particular vocabulary. This is typically generic vocabulary that is expected to broadly cover the English (or other) language. It may not contain most proper names or project names. These can either be directly added by the user, or the system may automatically learn them through the document harvesting feature. For sending data back to the system developer, the trigrams submitted may be those containing only words shipped with the system, along with the list of added words (without context). Any trigrams containing words added on the user's machine may be assumed to have some element of privacy concern, and are thus removed in the process of preparing to send data back to the system developer.
- The new words are sent to the system developer without context. For example, uncommon last names may be sent back as new words. This is not concerning from a privacy point of view, because such last names may be transmitted for anyone communicating with the user, and would not identify a particular user. Once these are aggregated, the system developer may identify words that should be added to updated models and release them to all interested users, enhancing the generic models in the base vocabulary. With a layered approach to privacy, a conservative stance may be taken and some non-private information initially removed. This removed data may be identified later, through another part of the feedback loop, and the process automatically updated to begin collecting the data now positively identified as non-private. As an added measure, the user may also be provided with the option of giving permission before any data is sent to the system developer.
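- The vocabulary-based filtering described in the preceding two paragraphs might be sketched as follows; base_vocabulary stands in for the shipped word list, and all names here are illustrative assumptions rather than the patent's own.

```python
def prepare_upload(trigram_counts, base_vocabulary):
    """Keep only trigrams built entirely from shipped-vocabulary words, and
    report locally added words separately, stripped of all context."""
    shared_trigrams = {tri: n for tri, n in trigram_counts.items()
                       if all(word in base_vocabulary for word in tri)}
    new_words = sorted({word for tri in trigram_counts for word in tri
                        if word not in base_vocabulary})
    return shared_trigrams, new_words
```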
- For acoustic models, a conventional remote feedback mechanism is to save the audio and use that to train a new model. Much like with language models, the intermediate and aggregated data 348 is provided to the system developer for acoustic models as well in a speech recognition system according to embodiments, but with different details. Additionally, conventional speech recognition systems do not take advantage of directed user feedback for the remote feedback loop. By providing direct feedback, this learning mechanism is faster and more efficient than other mechanisms for the system. System developers can benefit from this direct method by receiving the intermediate and aggregated data, which contains the directed user feedback as well.
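- The patent does not detail the acoustic intermediate form. One plausible analogue, sketched below purely as an assumption, is to ship aggregated per-phoneme feature statistics rather than raw audio, since such statistics support adaptation but cannot be played back.

```python
from collections import defaultdict

def aggregate_acoustic_stats(labeled_frames):
    """Collapse (phoneme, feature_vector) pairs into per-phoneme mean feature
    vectors -- adaptation statistics, not recoverable audio. Assumes all
    feature vectors for a phoneme have the same length."""
    sums = defaultdict(list)
    counts = defaultdict(int)
    for phoneme, features in labeled_frames:
        if not sums[phoneme]:
            sums[phoneme] = list(features)
        else:
            sums[phoneme] = [s + f for s, f in zip(sums[phoneme], features)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in counts}
```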
-
FIG. 4 is an example environment, where embodiments may be implemented. A local and remote feedback looped speech recognition system may be implemented via software executed over one ormore servers 418 such as a hosted service. The platform may communicate with client applications on individual computing devices such as acellular phone 413, alaptop computer 412, and desktop computer 411 (‘client devices’) through network(s) 410. - As discussed previously, client devices 411-413 are used to facilitate communications employing a variety of modes between users of the speech recognition system. Locally or in a distributed manner executed speech recognition applications may generate local training data based on collected user data and/or directed user feedback. The locally collected data may be filtered at different levels for local storage and for submittal to a system developer (e.g. servers 418) through a remote feedback loop. Such data, as well as training models, and other speech recognition related data may be stored in one or more data stores (e.g. data store 416), which may be managed by any one of the
servers 418 or bydatabase server 414. - Network(s) 410 may comprise any topology of servers, clients, Internet service providers, and communication media. A system according to embodiments may have a static or dynamic topology. Network(s) 410 may include a secure network such as an enterprise network, an unsecure network such as a wireless open network, or the Internet. Network(s) 410 may also coordinate communication over other networks such as PSTN or cellular networks. Network(s) 410 provides communication between the nodes described herein. By way of example, and not limitation, network(s) 410 may include wireless media such as acoustic, RF, infrared and other wireless media.
- Many other configurations of computing devices, applications, data sources, and data distribution systems may be employed to implement a local and remote feedback looped speech recognition system. Furthermore, the networked environments discussed in
FIG. 4 are for illustration purposes only. Embodiments are not limited to the example applications, modules, or processes. -
- FIG. 5 and the associated discussion are intended to provide a brief, general description of a suitable computing environment in which embodiments may be implemented. With reference to FIG. 5, a block diagram of an example computing operating environment for an application according to embodiments is illustrated, such as computing device 500. In a basic configuration, computing device 500 may be a client device executing a speech recognition application and include at least one processing unit 502 and system memory 504. Computing device 500 may also include a plurality of processing units that cooperate in executing programs. Depending on the exact configuration and type of computing device, the system memory 504 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.), or some combination of the two. System memory 504 typically includes an operating system 505 suitable for controlling the operation of the platform, such as the WINDOWS® operating systems from MICROSOFT CORPORATION of Redmond, Wash. The system memory 504 may also include one or more software applications such as program modules 506, speech recognition application 522, and adaptation module 524.
- Speech recognition application 522 may be any application that performs speech recognition as part of a service as discussed previously. Adaptation module 524 may be an integral part of speech recognition application 522 or a separate application. Adaptation module 524 may collect user data and/or directed user feedback associated with recognized speech, and provide feedback to a speech recognition engine for customization of acoustic and language training models. Adaptation module 524 may further provide locally collected data, after filtering to address privacy concerns, to a system developer for enhancement of generic training models. This basic configuration is illustrated in FIG. 5 by those components within dashed line 508.
- Computing device 500 may have additional features or functionality. For example, the computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 5 by removable storage 509 and non-removable storage 510. Computer readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. System memory 504, removable storage 509, and non-removable storage 510 are all examples of computer readable storage media. Computer readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer readable storage media may be part of computing device 500. Computing device 500 may also have input device(s) 512 such as keyboard, mouse, pen, voice input device, touch input device, and comparable input devices. Output device(s) 514 such as a display, speakers, printer, and other types of output devices may also be included. These devices are well known in the art and need not be discussed at length here.
- Computing device 500 may also contain communication connections 516 that allow the device to communicate with other devices 518, such as over a wireless network in a distributed computing environment, a satellite link, a cellular link, and comparable mechanisms. Other devices 518 may include computer device(s) that execute communication applications, other directory or presence servers, and comparable devices. Communication connection(s) 516 is one example of communication media. Communication media can include therein computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
- Another optional way is for one or more of the individual operations of the methods to be performed in conjunction with one or more human operators performing some. These human operators need not be collocated with each other, but each can be only with a machine that performs a portion of the program.
-
- FIG. 6 illustrates a logic flow diagram for process 600 of implementing a local and remote feedback looped speech recognition system according to embodiments. Process 600 may be implemented in any speech recognition application.
- Process 600 begins with operation 610, where user data associated with speech recognition is collected. Collection of user data may include harvesting of data from user documents, emails, and the like through methods like n-gram modeling. Other methods may include user provided samples (read passages, etc.). At optional operation 620, directed feedback may be received from the user. Directed feedback includes user responses to system generated prompts, such as yes/no answers that gauge the accuracy of recognized utterances (e.g. “Did you say you want Technical Assistance?”).
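- A minimal sketch of collecting such a confirmation is shown below; the prompt text comes from the example above, while the function and the shape of the logged record are assumptions made for illustration.

```python
def record_directed_feedback(prompt, hypothesis, feedback_log):
    """Ask the user to confirm a recognized utterance and keep the verdict
    for the adaptation aggregator."""
    answer = input(f"{prompt} (yes/no): ").strip().lower()
    confirmed = answer.startswith("y")
    feedback_log.append({"hypothesis": hypothesis, "confirmed": confirmed})
    return confirmed

# Example use:
# log = []
# record_directed_feedback("Did you say you want Technical Assistance?",
#                          "i want technical assistance", log)
```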
- At operation 630, the collected data and, optionally, the directed user feedback are provided as local feedback (adaptation) to the speech recognition engine for customization and enhancement of local training models. As discussed above, the locally collected data, or even the directed user feedback, may include private information. At operation 640, the data is filtered to address privacy concerns, such as removing numbers, addresses, and other personal information. Some of the filtering may be performed on locally stored data, such that the user is protected from having their private data exposed through their local machine. This is illustrated in process 600 by the loop from operation 640 to operation 610.
- Some of the locally collected data may be subjected to a different level of filtering in order to ensure protection of the user's privacy prior to submittal of the data to a system developer.
At optional operation 650, another layer of privacy protection may be applied by explicitly requesting the user's permission before submitting the data to the system developer.
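Operation 650 amounts to a consent gate in front of any upload. The sketch below is hypothetical: the prompt wording and the upload and ask_user callables are assumed for the example.

```python
def submit_with_consent(filtered_records, upload, ask_user):
    """Transmit filtered feedback data only if the user explicitly
    grants permission (optional operation 650)."""
    prompt = ("May anonymized speech recognition feedback data be sent "
              "to the system developer to improve the generic models?")
    if ask_user(prompt):
        upload(filtered_records)
        return True
    return False

# Example wiring with a console prompt (send_to_developer is hypothetical):
# submit_with_consent(records, upload=send_to_developer,
#                     ask_user=lambda q: input(q + " [y/N] ") == "y")
```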
At operation 660, the filtered data is provided to the system developer for enhancement and update of the generic training models, as discussed previously. This may be followed by optional operation 670, where updates to the training models, or upgrade replacements of the same, are received from the system developer.
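One plausible realization of operations 660 and 670 is a simple exchange with a developer-side service: upload the filtered records, then poll for refreshed generic models. The endpoints and JSON shapes below are assumptions for illustration; the disclosure does not specify a transport or API.

```python
import json
import urllib.request

# Hypothetical endpoints; not part of the disclosure.
FEEDBACK_URL = "https://asr.example.com/feedback"
MODELS_URL = "https://asr.example.com/models/latest"

def submit_feedback(filtered_records):
    """Operation 660: send privacy-filtered adaptation data upstream."""
    body = json.dumps({"records": filtered_records}).encode("utf-8")
    req = urllib.request.Request(
        FEEDBACK_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200

def fetch_model_update(current_version):
    """Optional operation 670: retrieve updated or replacement
    generic training models from the system developer."""
    with urllib.request.urlopen(f"{MODELS_URL}?since={current_version}") as resp:
        return json.load(resp)
```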
The operations included in process 600 are for illustration purposes. Improving speech recognition systems with local and remote feedback loops may be implemented by similar processes with fewer or additional steps, as well as with a different order of operations, using the principles described herein.

The above specification, examples, and data provide a complete description of the manufacture and use of the composition of the embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims and embodiments.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/481,439 US9111540B2 (en) | 2009-06-09 | 2009-06-09 | Local and remote aggregation of feedback data for speech recognition |
US14/799,533 US10157609B2 (en) | 2009-06-09 | 2015-07-14 | Local and remote aggregation of feedback data for speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/481,439 US9111540B2 (en) | 2009-06-09 | 2009-06-09 | Local and remote aggregation of feedback data for speech recognition |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/799,533 Continuation US10157609B2 (en) | 2009-06-09 | 2015-07-14 | Local and remote aggregation of feedback data for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
US20100312555A1 (en) | 2010-12-09
US9111540B2 (en) | 2015-08-18
Family
ID=43301371
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/481,439 Active 2033-09-24 US9111540B2 (en) | 2009-06-09 | 2009-06-09 | Local and remote aggregation of feedback data for speech recognition |
US14/799,533 Active 2029-10-14 US10157609B2 (en) | 2009-06-09 | 2015-07-14 | Local and remote aggregation of feedback data for speech recognition |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/799,533 Active 2029-10-14 US10157609B2 (en) | 2009-06-09 | 2015-07-14 | Local and remote aggregation of feedback data for speech recognition |
Country Status (1)
Country | Link |
---|---|
US (2) | US9111540B2 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102281178B1 (en) * | 2014-07-09 | 2021-07-23 | 삼성전자주식회사 | Method and apparatus for recognizing multi-level speech |
US10867609B2 (en) * | 2018-05-18 | 2020-12-15 | Sorenson Ip Holdings, Llc | Transcription generation technique selection |
US10825446B2 (en) | 2018-11-14 | 2020-11-03 | International Business Machines Corporation | Training artificial intelligence to respond to user utterances |
EP4026121A4 (en) * | 2019-09-04 | 2023-08-16 | Telepathy Labs, Inc. | Speech recognition systems and methods |
US20230395063A1 (en) * | 2022-06-03 | 2023-12-07 | Nuance Communications, Inc. | System and Method for Secure Transcription Generation |
US12265788B1 (en) * | 2024-06-04 | 2025-04-01 | Curio XR | Systems and methods for connected natural language models |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6511324B1 (en) | 1998-10-07 | 2003-01-28 | Cognitive Concepts, Inc. | Phonological awareness, phonological processing, and reading skill training system and method |
US7895039B2 (en) | 2005-02-04 | 2011-02-22 | Vocollect, Inc. | Methods and systems for optimizing model adaptation for a speech recognition system |
US8244545B2 (en) | 2006-03-30 | 2012-08-14 | Microsoft Corporation | Dialog repair based on discrepancies between user model predictions and speech recognition results |
US8219406B2 (en) | 2007-03-15 | 2012-07-10 | Microsoft Corporation | Speech-centric multimodal user interface design in mobile technology |
US8275615B2 (en) | 2007-07-13 | 2012-09-25 | International Business Machines Corporation | Model weighting, selection and hypotheses combination for automatic speech recognition and machine translation |
- 2009
  - 2009-06-09 US US12/481,439 patent/US9111540B2/en active Active
- 2015
  - 2015-07-14 US US14/799,533 patent/US10157609B2/en active Active
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6418410B1 (en) * | 1999-09-27 | 2002-07-09 | International Business Machines Corporation | Smart correction of dictated speech |
US6691089B1 (en) * | 1999-09-30 | 2004-02-10 | Mindspeed Technologies Inc. | User configurable levels of security for a speaker verification system |
US7315818B2 (en) * | 2000-05-02 | 2008-01-01 | Nuance Communications, Inc. | Error correction in speech recognition |
US20020116194A1 (en) * | 2001-02-21 | 2002-08-22 | International Business Machines Corporation | Method for preserving contextual accuracy in an extendible speech recognition language model |
US20020138274A1 (en) * | 2001-03-26 | 2002-09-26 | Sharma Sangita R. | Server based adaption of acoustic models for client-based speech systems |
US7266495B1 (en) * | 2003-09-12 | 2007-09-04 | Nuance Communications, Inc. | Method and system for learning linguistically valid word pronunciations from acoustic data |
US8473451B1 (en) * | 2004-07-30 | 2013-06-25 | At&T Intellectual Property I, L.P. | Preserving privacy in natural language databases |
US20060155539A1 (en) * | 2005-01-13 | 2006-07-13 | Yen-Fu Chen | System for compiling word usage frequencies |
US7873523B2 (en) * | 2005-06-30 | 2011-01-18 | Microsoft Corporation | Computer implemented method of analyzing recognition results between a user and an interactive application utilizing inferred values instead of transcribed speech |
US20070016419A1 (en) * | 2005-07-13 | 2007-01-18 | Hyperquality, Llc | Selective security masking within recorded speech utilizing speech recognition techniques |
US20070081428A1 (en) * | 2005-09-29 | 2007-04-12 | Spryance, Inc. | Transcribing dictation containing private information |
US20080037719A1 (en) * | 2006-06-28 | 2008-02-14 | Hyperquality, Inc. | Selective security masking within recorded speech |
US7437924B2 (en) * | 2006-09-06 | 2008-10-21 | Fu-Liang Chen | Wind vane device |
US7937270B2 (en) * | 2007-01-16 | 2011-05-03 | Mitsubishi Electric Research Laboratories, Inc. | System and method for recognizing speech securely using a secure multi-party computation protocol |
US20080208579A1 (en) * | 2007-02-27 | 2008-08-28 | Verint Systems Ltd. | Session recording and playback with selective information masking |
US7383183B1 (en) * | 2007-09-25 | 2008-06-03 | Medquist Inc. | Methods and systems for protecting private information during transcription |
US20100082342A1 (en) * | 2008-09-28 | 2010-04-01 | Avaya Inc. | Method of Retaining a Media Stream without Its Private Audio Content |
Cited By (74)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9218807B2 (en) * | 2010-01-08 | 2015-12-22 | Nuance Communications, Inc. | Calibration of a speech recognition engine using validated text |
US8868428B2 (en) * | 2010-01-26 | 2014-10-21 | Google Inc. | Integration of embedded and network speech recognizers |
US8412532B2 (en) * | 2010-01-26 | 2013-04-02 | Google Inc. | Integration of embedded and network speech recognizers |
US20110184740A1 (en) * | 2010-01-26 | 2011-07-28 | Google Inc. | Integration of Embedded and Network Speech Recognizers |
US20120084079A1 (en) * | 2010-01-26 | 2012-04-05 | Google Inc. | Integration of Embedded and Network Speech Recognizers |
US9858917B1 (en) | 2010-07-13 | 2018-01-02 | Google Inc. | Adapting enhanced acoustic models |
US8185392B1 (en) * | 2010-07-13 | 2012-05-22 | Google Inc. | Adapting enhanced acoustic models |
US9263034B1 (en) | 2010-07-13 | 2016-02-16 | Google Inc. | Adapting enhanced acoustic models |
US9060062B1 (en) | 2011-07-06 | 2015-06-16 | Google Inc. | Clustering and classification of recent customer support inquiries |
US20130054238A1 (en) * | 2011-08-29 | 2013-02-28 | Microsoft Corporation | Using Multiple Modality Input to Feedback Context for Natural Language Understanding |
US10332514B2 (en) * | 2011-08-29 | 2019-06-25 | Microsoft Technology Licensing, Llc | Using multiple modality input to feedback context for natural language understanding |
US20170169824A1 (en) * | 2011-08-29 | 2017-06-15 | Microsoft Technology Licensing, Llc | Using multiple modality input to feedback context for natural language understanding |
US9576573B2 (en) * | 2011-08-29 | 2017-02-21 | Microsoft Technology Licensing, Llc | Using multiple modality input to feedback context for natural language understanding |
WO2013062797A1 (en) | 2011-10-28 | 2013-05-02 | Microsoft Corporation | Distributed user input to text generated by a speech to text transcription service |
US8930189B2 (en) | 2011-10-28 | 2015-01-06 | Microsoft Corporation | Distributed user input to text generated by a speech to text transcription service |
EP2771879A4 (en) * | 2011-10-28 | 2015-07-15 | Microsoft Technology Licensing Llc | DISTRIBUTED USER ENTRY OF LANGUAGE RECOGNIZED TEXT FOR A TEXT TRANSCRIPTION SERVICE |
US20220208197A1 (en) * | 2012-06-01 | 2022-06-30 | Google Llc | Providing Answers To Voice Queries Using User Feedback |
US12094471B2 (en) * | 2012-06-01 | 2024-09-17 | Google Llc | Providing answers to voice queries using user feedback |
US20140039893A1 (en) * | 2012-07-31 | 2014-02-06 | Sri International | Personalized Voice-Driven User Interfaces for Remote Multi-User Services |
US9786281B1 (en) * | 2012-08-02 | 2017-10-10 | Amazon Technologies, Inc. | Household agent learning |
KR101693653B1 (en) | 2012-10-17 | 2017-01-06 | 뉘앙스 커뮤니케이션즈, 인코포레이티드 | Multiple device intelligent language model synchronization |
CN104885071A (en) * | 2012-10-17 | 2015-09-02 | 纽昂斯通信有限公司 | Multiple device intelligent language model synchronization |
KR20150074075A (en) * | 2012-10-17 | 2015-07-01 | 뉘앙스 커뮤니케이션즈, 인코포레이티드 | Multiple device intelligent language model synchronization |
EP2909736A4 (en) * | 2012-10-17 | 2016-05-11 | Nuance Communications Inc | Multiple device intelligent language model synchronization |
WO2014062851A1 (en) | 2012-10-17 | 2014-04-24 | Nuance Communications, Inc. | Multiple device intelligent language model synchronization |
US10176803B2 (en) * | 2013-04-18 | 2019-01-08 | Nuance Communications, Inc. | Updating population language models based on changes made by user clusters |
US9672818B2 (en) * | 2013-04-18 | 2017-06-06 | Nuance Communications, Inc. | Updating population language models based on changes made by user clusters |
US20170365253A1 (en) * | 2013-04-18 | 2017-12-21 | Nuance Communications, Inc. | Updating population language models based on changes made by user clusters |
US20140316784A1 (en) * | 2013-04-18 | 2014-10-23 | Nuance Communications, Inc. | Updating population language models based on changes made by user clusters |
EP2804113A2 (en) * | 2013-05-13 | 2014-11-19 | Facebook, Inc. | Hybrid, offline/online speech translation system |
US9773498B2 (en) | 2013-10-28 | 2017-09-26 | At&T Intellectual Property I, L.P. | System and method for managing models for embedded speech and language processing |
US9530416B2 (en) | 2013-10-28 | 2016-12-27 | At&T Intellectual Property I, L.P. | System and method for managing models for embedded speech and language processing |
US20150120288A1 (en) * | 2013-10-29 | 2015-04-30 | At&T Intellectual Property I, L.P. | System and method of performing automatic speech recognition using local private data |
US9666188B2 (en) * | 2013-10-29 | 2017-05-30 | Nuance Communications, Inc. | System and method of performing automatic speech recognition using local private data |
US9905228B2 (en) | 2013-10-29 | 2018-02-27 | Nuance Communications, Inc. | System and method of performing automatic speech recognition using local private data |
WO2015164116A1 (en) * | 2014-04-25 | 2015-10-29 | Nuance Communications, Inc | Learning language models from scratch based on crowd-sourced user text input |
CN106233375A (en) * | 2014-04-25 | 2016-12-14 | 纽昂斯通信有限公司 | User version based on mass-rent input starts anew to learn language model |
US20160365088A1 (en) * | 2015-06-10 | 2016-12-15 | Synapse.Ai Inc. | Voice command response accuracy |
US9881613B2 (en) * | 2015-06-29 | 2018-01-30 | Google Llc | Privacy-preserving training corpus selection |
US20160379639A1 (en) * | 2015-06-29 | 2016-12-29 | Google Inc. | Privacy-preserving training corpus selection |
JP2018506081A (en) * | 2015-06-29 | 2018-03-01 | グーグル エルエルシー | Training corpus selection for privacy protection |
US9990925B2 (en) * | 2015-06-29 | 2018-06-05 | Google Llc | Privacy-preserving training corpus selection |
WO2017054122A1 (en) * | 2015-09-29 | 2017-04-06 | 深圳市全圣时代科技有限公司 | Speech recognition system and method, client device and cloud server |
EP3460792A4 (en) * | 2016-06-23 | 2019-06-12 | Huawei Technologies Co., Ltd. | Optimization method and apparatus suitable for model of pattern recognition, and terminal device |
US10825447B2 (en) | 2016-06-23 | 2020-11-03 | Huawei Technologies Co., Ltd. | Method and apparatus for optimizing model applicable to pattern recognition, and terminal device |
US11210583B2 (en) | 2016-07-20 | 2021-12-28 | Apple Inc. | Using proxies to enable on-device machine learning |
WO2018017423A1 (en) * | 2016-07-20 | 2018-01-25 | Apple Inc. | Using proxies to enable on-device machine learning |
US10650055B2 (en) * | 2016-10-13 | 2020-05-12 | Viesoft, Inc. | Data processing for continuous monitoring of sound data and advanced life arc presentation analysis |
US10304445B2 (en) * | 2016-10-13 | 2019-05-28 | Viesoft, Inc. | Wearable device for speech training |
US20190171671A1 (en) * | 2016-10-13 | 2019-06-06 | Viesoft, Inc. | Data processing for continuous monitoring of sound data and advanced life arc presentation analysis |
US10770065B2 (en) | 2016-12-19 | 2020-09-08 | Samsung Electronics Co., Ltd. | Speech recognition method and apparatus |
CN110088833A (en) * | 2016-12-19 | 2019-08-02 | 三星电子株式会社 | Audio recognition method and device |
KR20180070970A (en) * | 2016-12-19 | 2018-06-27 | 삼성전자주식회사 | Method and Apparatus for Voice Recognition |
KR102691541B1 (en) | 2016-12-19 | 2024-08-02 | 삼성전자주식회사 | Method and Apparatus for Voice Recognition |
WO2018117532A1 (en) * | 2016-12-19 | 2018-06-28 | Samsung Electronics Co., Ltd. | Speech recognition method and apparatus |
US11183173B2 (en) * | 2017-04-21 | 2021-11-23 | Lg Electronics Inc. | Artificial intelligence voice recognition apparatus and voice recognition system |
WO2019094092A1 (en) * | 2017-11-07 | 2019-05-16 | Google Llc | Incognito mode for personalized machine-learned models |
US11216745B2 (en) | 2017-11-07 | 2022-01-04 | Google Llc | Incognito mode for personalized machine-learned models |
US11983613B2 (en) | 2017-11-07 | 2024-05-14 | Google Llc | Incognito mode for personalized machine-learned models |
US10679620B2 (en) * | 2018-03-06 | 2020-06-09 | GM Global Technology Operations LLC | Speech recognition arbitration logic |
US20190279620A1 (en) * | 2018-03-06 | 2019-09-12 | GM Global Technology Operations LLC | Speech recognition arbitration logic |
US11367448B2 (en) | 2018-06-01 | 2022-06-21 | Soundhound, Inc. | Providing a platform for configuring device-specific speech recognition and using a platform for configuring device-specific speech recognition |
US11011162B2 (en) * | 2018-06-01 | 2021-05-18 | Soundhound, Inc. | Custom acoustic models |
US20190371311A1 (en) * | 2018-06-01 | 2019-12-05 | Soundhound, Inc. | Custom acoustic models |
US11830472B2 (en) | 2018-06-01 | 2023-11-28 | Soundhound Ai Ip, Llc | Training a device specific acoustic model |
WO2020018212A1 (en) * | 2018-07-16 | 2020-01-23 | Microsoft Technology Licensing, Llc | Eyes-off training for automatic speech recognition |
US10679610B2 (en) | 2018-07-16 | 2020-06-09 | Microsoft Technology Licensing, Llc | Eyes-off training for automatic speech recognition |
US12062373B2 (en) * | 2019-12-23 | 2024-08-13 | Descript, Inc. | Automated generation of transcripts through independent transcription |
US12136423B2 (en) | 2019-12-23 | 2024-11-05 | Descript, Inc. | Transcript correction through programmatic comparison of independently generated transcripts |
US20210193147A1 (en) * | 2019-12-23 | 2021-06-24 | Descript, Inc. | Automated generation of transcripts through independent transcription |
CN113113002A (en) * | 2019-12-25 | 2021-07-13 | 斑马智行网络(香港)有限公司 | Vehicle voice interaction method and system and voice updating system |
US20220084521A1 (en) * | 2021-11-23 | 2022-03-17 | Raju Arvind | Automatic personal identifiable information removal from audio |
CN114974221A (en) * | 2022-04-29 | 2022-08-30 | 中移互联网有限公司 | Speech recognition model training method and device and computer readable storage medium |
CN114942985A (en) * | 2022-06-08 | 2022-08-26 | 广州芸荟数字软件有限公司 | Artificial intelligence data processing method and system based on machine learning |
Also Published As
Publication number | Publication date |
---|---|
US20160012817A1 (en) | 2016-01-14 |
US9111540B2 (en) | 2015-08-18 |
US10157609B2 (en) | 2018-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10157609B2 (en) | Local and remote aggregation of feedback data for speech recognition | |
US10163440B2 (en) | Generic virtual personal assistant platform | |
US8170866B2 (en) | System and method for increasing accuracy of searches based on communication network | |
US11250876B1 (en) | Method and system for confidential sentiment analysis | |
JP5796496B2 (en) | Input support system, method, and program | |
CN116324792A (en) | Systems and methods related to robotic authoring by mining intent from natural language conversations | |
US11947872B1 (en) | Natural language processing platform for automated event analysis, translation, and transcription verification | |
US11711469B2 (en) | Contextualized speech to text conversion | |
US11615110B2 (en) | Systems and methods for unifying formats and adaptively automating processing of business records data | |
US20230026945A1 (en) | Virtual Conversational Agent | |
EP2588237A2 (en) | System and method of analyzing business process events | |
EP4193292A1 (en) | Entity resolution for chatbot conversations | |
US11586816B2 (en) | Content tailoring for diverse audiences | |
US20070156406A1 (en) | Voice user interface authoring tool | |
US20230188643A1 (en) | Ai-based real-time natural language processing system and method thereof | |
US8990088B2 (en) | Tool and framework for creating consistent normalization maps and grammars | |
US20230134796A1 (en) | Named entity recognition system for sentiment labeling | |
Jiang et al. | A GDPR-compliant ecosystem for speech recognition with transfer, federated, and evolutionary learning | |
US8862609B2 (en) | Expanding high level queries | |
Nakatumba‐Nabende et al. | Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons | |
US12265565B2 (en) | Methods, apparatuses and computer program products for intent-driven query processing | |
US11810558B2 (en) | Explaining anomalous phonetic translations | |
US20250045529A1 (en) | Transcription using a corpus of reference | |
CN119229878A (en) | Voiceprint recognition method, device, computer equipment and medium based on artificial intelligence | |
CN118410162A (en) | Digest extraction method, digest extraction device, digest extraction apparatus, digest extraction medium, and digest extraction program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: MICROSOFT CORPORATION, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PLUMPE, MICHAEL D.;ODELL, JULIAN;HAMAKER, JON;AND OTHERS;REEL/FRAME:023223/0381. Effective date: 20090605 |
| AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034564/0001. Effective date: 20141014 |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |