US20170004824A1 - Speech recognition apparatus, speech recognition method, and electronic device - Google Patents
- Publication number
- US20170004824A1 (U.S. patent application Ser. No. 15/139,926)
- Authority
- US
- United States
- Prior art keywords
- probabilities
- phoneme
- candidate
- calculated
- candidate set
- Prior art date: 2015-06-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G06F17/2818—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
Description
- This application claims the benefit under 35 USC 119(a) of Korean Patent Application No. 10-2015-0093653 filed on Jun. 30, 2015, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.
- 1. Field
- This application relates to speech recognition technology.
- 2. Description of Related Art
- When speech recognition systems are embedded in TV sets, set-top boxes, home appliances, and other devices, a drawback is that the devices may not have sufficient computing resources for the embedded speech recognition systems. However, this drawback is negligible because, in the embedded environment, speech recognition is performed for only a limited number of commands: whereas a decoder in a general speech recognition environment uses many computing resources to recognize all of the words, and combinations of words, that people may use, an embedded environment only needs to recognize a given set of commands ranging from several words to thousands of words.
- In a general speech recognition system, after an acoustic model acquires phonetic probabilities from an audio signal, a Hidden Markov Model (HMM) decoder combines these probabilities and converts them into a sequence of words. However, the HMM decoder requires substantial computing resources and a large number of operations, and the Viterbi decoding method used in the HMM decoder may result in a significant loss of information.
- This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
- In one general aspect, a speech recognition apparatus includes a probability calculator configured to calculate phoneme probabilities of an audio signal using an acoustic model; a candidate set extractor configured to extract a candidate set from a recognition target list of target sequences; and a result returner configured to return a recognition result of the audio signal based on the calculated phoneme probabilities and the extracted candidate set.
- The acoustic model may be trained using a learning algorithm including Connectionist Temporal Classification (CTC).
- The result returner may be further configured to calculate probabilities of generating each target sequence included in the candidate set based on the calculated phoneme probabilities, and return a candidate target sequence having a highest probability among the calculated probabilities of generating each target sequence as the recognition result.
- The apparatus may further include a sequence acquirer configured to acquire a phoneme sequence based on the calculated phoneme probabilities.
- The candidate set extractor may be further configured to calculate similarities between the acquired phoneme sequence and each target sequence included in the recognition target list, and extract the candidate set based on the calculated similarities.
- The candidate set extractor may be further configured to calculate the similarities using a similarity algorithm including an edit distance algorithm.
- The sequence acquirer may be further configured to acquire the phoneme sequence based on the calculated phoneme probabilities using a best path decoding algorithm or a prefix search decoding algorithm.
- In another general aspect, a speech recognition method includes calculating phoneme probabilities of an audio signal using an acoustic model; extracting a candidate set from a recognition target list of target sequences; and returning a recognition result of the audio signal based on the calculated phoneme probabilities and the extracted candidate set.
- The acoustic model may be trained using a learning algorithm including Connectionist Temporal Classification (CTC).
- The returning of the recognition result may include calculating probabilities of generating each target sequence included in the candidate set based on the calculated phoneme probabilities; and returning a candidate target sequence having a highest probability among the calculated probabilities of generating each target sequence as the recognition result.
- The method may further include acquiring a phoneme sequence based on the calculated phoneme probabilities.
- The extracting of the candidate set may include calculating similarities between the acquired phoneme sequence and each target sequence included in the recognition target list; and extracting the candidate set based on the calculated similarities.
- The calculating of the similarities may include calculating the similarities using a similarity algorithm including an edit distance algorithm.
- The acquiring of the phoneme sequence may include acquiring the phoneme sequence based on the calculated phoneme probabilities using a best path decoding algorithm or a prefix search decoding algorithm.
- In another general aspect, an electronic device includes a speech receiver configured to receive an audio signal of a user; a speech recognizer configured to calculate phoneme probabilities of the received audio signal using an acoustic model, and based on the calculated phoneme probabilities, return any one of target sequences included in a recognition target list as a recognition result; and a processor configured to perform a specific operation based on the returned recognition result.
- The speech recognizer may be further configured to extract a candidate set from the recognition target list, calculate probabilities of generating each candidate target sequence included in the candidate set based on the calculated phoneme probabilities, and return a candidate target sequence having a highest probability among the calculated probabilities of generating each target sequence as the recognition result.
- The speech recognizer may be further configured to acquire a phoneme sequence by decoding the phoneme probabilities, and extract the candidate set based on similarities between the acquired phoneme sequence and each target sequence included in the recognition target list.
- The processor may be further configured to output the recognition result in a voice from a speaker, or in a text format on a display.
- The processor may be further configured to translate the recognition result into another language, and output the translated result in the voice from the speaker, or in the text format on the display.
- The processor may be further configured to process commands including one or more of a power on/off command, a volume control command, a channel change command, and a destination search command in response to the recognition result.
- In another general aspect, a speech recognition method includes calculating probabilities that portions of an audio signal correspond to speech units; obtaining a set of candidate sequences of speech units from a list of sequences of speech units; and recognizing one of the candidate sequences of speech units as corresponding to the audio signal based on the probabilities.
- The calculating of the probabilities may include calculating the probabilities using an acoustic model.
- The speech units may be phonemes.
- The candidate sequences of speech units may be phrases.
- The phrases may be commands to control an electronic device.
- The recognizing of the one of the candidate sequences of speech units may include calculating probabilities of generating each of the candidate sequences of speech units based on the probabilities that portions of the audio signal correspond to the speech units; and recognizing one of the candidate sequences of speech units having a highest probability among the probabilities of generating each of the candidate sequences of speech units as corresponding to the audio signal.
- Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
- FIG. 1 is a block diagram illustrating an example of a speech recognition apparatus.
- FIG. 2 is a block diagram illustrating another example of a speech recognition apparatus.
- FIG. 3 is a flowchart illustrating an example of a speech recognition method.
- FIG. 4 is a flowchart illustrating another example of a speech recognition method.
- FIG. 5 is a block diagram illustrating an example of an electronic device.
- FIG. 6 is a flowchart illustrating an example of a speech recognition method in the electronic device.
- Throughout the drawings and the detailed description, the same drawing reference numerals refer to the same elements. The relative size, proportions, and depiction of these elements may be exaggerated for clarity, illustration, and convenience.
- The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent to one of ordinary skill in the art. The sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent to one of ordinary skill in the art, with the exception of operations necessarily occurring in a certain order. Also, descriptions of functions and constructions that are well known to one of ordinary skill in the art may be omitted for increased clarity and conciseness.
- The features described herein may be embodied in different forms, and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided so that this disclosure will be thorough and complete, and will convey the full scope of the disclosure to one of ordinary skill in the art.
- FIG. 1 is a block diagram illustrating an example of a speech recognition apparatus.
- Referring to FIG. 1, the speech recognition apparatus 100 includes a probability calculator 110, a candidate set extractor 120, and a result returner 130.
- The probability calculator 110 calculates probabilities of each phoneme of an audio signal using an acoustic model. A phoneme is the smallest unit of sound that is significant in a language.
- In one example, the audio signal is converted into audio frames by a preprocessing process that extracts features, and the audio frames are input to the acoustic model. The acoustic model divides each audio frame into phonemes and outputs probabilities of each phoneme.
- A general acoustic model based on a Gaussian Mixture Model (GMM), a Deep Neural Network (DNN), or a Recurrent Neural Network (RNN) is trained in a manner that maximizes the probability of phonemes of each frame that are output as an answer.
- However, since it is difficult to construct an HMM decoder that can operate in an embedded environment, the acoustic model in this example is built using a Recurrent Neural Network (RNN) and Connectionist Temporal Classification (CTC). In this case, the acoustic model is trained in a manner that maximizes probabilities of phonemes of each audio frame, with respect to all the combinations of phonemes that may make up an answer sequence, using various learning algorithms such as a CTC learning algorithm. Hereinafter, for convenience of explanation, examples will be described using an acoustic model trained using the CTC learning algorithm, i.e., an acoustic model based on a CTC network.
- The following Equation 1 is an example of an algorithm for training an acoustic model based on GMM, DNN, or RNN.
- p(z|x) = ∏_k y^k_{z_k}   (Equation 1)
- In Equation 1, x represents an input audio signal, y represents probabilities of each phoneme calculated for an audio frame k using an acoustic model, and z represents an answer for the audio frame k.
- As described above, a general acoustic model is trained in a manner that maximizes probabilities of phonemes of each audio frame output as an answer.
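- As a concrete illustration of the framewise objective, the following minimal Python sketch (not from the patent; the phoneme inventory, frame probabilities, and labels are hypothetical) computes the Equation 1 probability of a framewise answer z:

```python
import math

# Hypothetical frame-level phoneme probabilities y[k][phoneme] from an
# acoustic model, over a toy inventory {"ae", "p", "l"}.
y = [
    {"ae": 0.7, "p": 0.2, "l": 0.1},
    {"ae": 0.6, "p": 0.3, "l": 0.1},
    {"ae": 0.1, "p": 0.8, "l": 0.1},
    {"ae": 0.1, "p": 0.2, "l": 0.7},
]

# Framewise answer z: the labeled phoneme for each audio frame k.
z = ["ae", "ae", "p", "l"]

# Equation 1: p(z|x) is the product over frames k of y[k][z[k]], the
# probability the model assigns to that frame's answer phoneme.
# Framewise training maximizes this (i.e., minimizes -log p(z|x)).
log_p = sum(math.log(y[k][z[k]]) for k in range(len(z)))
print(math.exp(log_p))  # 0.7 * 0.6 * 0.8 * 0.7 = 0.2352
```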
- By contrast, the following Equations 2 and 3 are examples of algorithms for training an acoustic model according to an example of this application.
- p(π|x) = ∏_{t=1}^{T} y^t_{π_t}   (Equation 2)
- p(l|x) = ∑_{π∈B^{-1}(l)} p(π|x)   (Equation 3)
- In the above Equations 2 and 3, l denotes a phoneme sequence, i.e., a series of phonemes, that is an answer, and π denotes any one phoneme sequence that may be an answer. B(π) is a many-to-one function that converts an output sequence π of the neural network to a phoneme sequence. For example, if a user says "apple" in 1 second (sec), pronouncing a phoneme /ae/ from 0 to 0.5 sec, a phoneme /p/ from 0.5 to 0.8 sec, and a phoneme /l/ from 0.8 to 1 sec, this will produce an output sequence π in frame units (commonly 0.01 sec) of "ae ae ae ae . . . p p p p . . . l l l l" in which the phonemes are repeated. B(π) is the function that removes the repeated phonemes from the output sequence π and maps it to the phoneme sequence /ae p l/.
- Acoustic model training is performed in such a manner that a probability p(π|x) of generating any one phoneme sequence π is calculated according to Equation 2 using a phoneme probability y for an audio frame t calculated using the acoustic model, and a probability of generating the answer l is calculated according to Equation 3 by combining probabilities p(π|x) calculated according to Equation 2. In this case, the acoustic model training is performed using a back propagation learning method.
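- Equations 2 and 3 can be evaluated without enumerating every output sequence π by using a dynamic-programming recursion over prefixes of l. The following minimal Python sketch (hypothetical, not from the patent) does so; note that standard CTC also introduces a blank label, which is omitted here because the description above collapses only repeated phonemes:

```python
def sequence_probability(y, l):
    """p(l|x) per Equations 2 and 3: the sum, over all frame-level paths
    pi that collapse to l when repeats are removed, of prod_t y[t][pi[t]]."""
    T, S = len(y), len(l)
    # alpha[s]: total probability of all paths over the frames processed
    # so far whose collapsed form is the prefix l[:s+1].
    alpha = [0.0] * S
    alpha[0] = y[0][l[0]]
    for t in range(1, T):
        for s in range(min(t, S - 1), -1, -1):  # descend for in-place update
            stay = alpha[s]                           # repeat phoneme l[s]
            advance = alpha[s - 1] if s > 0 else 0.0  # advance to l[s]
            alpha[s] = (stay + advance) * y[t][l[s]]
    return alpha[S - 1]

# Example with 4 frames of hypothetical probabilities over {"ae", "p", "l"}:
y = [
    {"ae": 0.7, "p": 0.2, "l": 0.1},
    {"ae": 0.6, "p": 0.3, "l": 0.1},
    {"ae": 0.1, "p": 0.8, "l": 0.1},
    {"ae": 0.1, "p": 0.2, "l": 0.7},
]
print(sequence_probability(y, ["ae", "p", "l"]))  # 0.3675
```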
- The candidate set extractor 120 extracts a candidate set from a recognition target list 140. The recognition target list 140 includes a plurality of words or phrases composed of phoneme sequences, and is predefined according to the type of device that includes the speech recognition apparatus 100. For example, in a case in which the speech recognition apparatus 100 is mounted in a TV, the recognition target list 140 includes various commands to operate the TV, such as a power on/off command, a volume control command, a channel change command, and names of specific programs to be executed.
- The candidate set extractor 120 generates a candidate set by extracting one or more target sequences from the recognition target list 140 according to the device to be operated by the user.
- The result returner 130 calculates probabilities of generating each candidate target sequence in the candidate set using the phoneme probabilities calculated using the acoustic model, and returns the candidate target sequence having the highest probability as the recognition result of the input audio signal.
- The result returner 130 calculates the probabilities of each candidate target sequence in the candidate set by applying Equations 2 and 3 above, the same algorithms used to train the acoustic model.
- In this example, since the candidate target sequences that may be an answer are already known, it is possible to calculate the probability of generating each candidate target sequence directly from the phoneme probabilities calculated using the acoustic model. That is, since there is no need to decode the phoneme probabilities using a general decoding algorithm, such as a CTC decoding algorithm, a loss of information occurring in the decoding process is minimized. By contrast, since the candidate target sequences that may be an answer are not known in a general speech recognition environment, it is necessary to perform a decoding process using Equation 1, thereby resulting in a loss of information in the speech recognition process.
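- Given such a scoring function, the result returner reduces to an argmax over the candidate set. A hypothetical sketch (the command list and phoneme transcriptions are made up, and sequence_probability is the helper from the sketch above):

```python
# Hypothetical recognition target list for a TV: each command is stored
# with the phoneme sequence of its pronunciation.
candidate_set = {
    "power on":   ["p", "aw", "er", "aa", "n"],
    "power off":  ["p", "aw", "er", "ao", "f"],
    "channel up": ["ch", "ae", "n", "ah", "l", "ah", "p"],
}

def return_result(y, candidate_set):
    """Score every candidate target sequence directly against the frame
    probabilities (Equations 2 and 3) and return the most probable one;
    no intermediate decoding step is needed."""
    scores = {command: sequence_probability(y, phonemes)
              for command, phonemes in candidate_set.items()}
    return max(scores, key=scores.get)
```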
-
FIG. 2 is a block diagram illustrating another example of a speech recognition apparatus. - Referring to
FIG. 2 , aspeech recognition apparatus 200 includes aprobability calculator 210, asequence acquirer 220, a candidate setextractor 230, and aresult returner 240. - The
probability calculator 210 calculates probabilities of each phoneme of an audio signal using an acoustic model. As described above, the acoustic model is trained in a manner that maximizes probabilities of phonemes for each audio frame, with respect to all the combinations of phonemes that may make up an answer sequence, using RNN and CTC learning algorithms. - The
- The sequence acquirer 220 acquires a phoneme sequence that is a series of phonemes based on the phoneme probabilities calculated by the probability calculator 210. In this case, the sequence acquirer 220 acquires one or more phoneme sequences by decoding the calculated probabilities of phonemes using a decoding algorithm, such as a best path decoding algorithm or a prefix search decoding algorithm. However, the decoding algorithm is not limited to these examples.
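- Of the decoding algorithms named above, best path decoding is the simplest to sketch: take the most probable phoneme at every frame, then merge repeats. Prefix search decoding, by contrast, expands the most promising prefixes; the sketch below covers only the best path variant, under the same illustrative y[t] convention as above:

```python
from itertools import groupby

def best_path_decode(y):
    frame_wise = [max(y_t, key=y_t.get) for y_t in y]  # argmax phoneme per frame
    return [k for k, _ in groupby(frame_wise)]         # merge repeated phonemes
```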
- The candidate set extractor 230 generates a candidate set by extracting one or more candidate target sequences from a recognition target list 250 based on the phoneme sequence. As described above, the recognition target list 250 includes target sequences, such as words/phrases/commands, that are predefined according to the types of electronic devices including the speech recognition apparatus 200. Further, the recognition target list 250 may include information associated with usage rankings (e.g., a usage frequency, a usage probability, etc.) of the target sequences. - In one example, the candidate set
extractor 230 extracts all or some of the target sequences as a candidate set depending on the number of target sequences included in the recognition target list 250. In this case, a specific number of target sequences may be extracted as a candidate set based on the information associated with the usage rankings of the target sequences. - In another example, the candidate set
extractor 230 calculates similarities by comparing one or more phoneme sequences acquired by the sequence acquirer 220 with each target sequence included in the recognition target list 250, and based on the similarities, extracts a specific number of target sequences as candidate target sequences. In one example, the candidate set extractor 230 calculates similarities between phoneme sequences and target sequences using a similarity calculation algorithm including an edit distance algorithm, and based on the similarities, extracts a specific number of target sequences (e.g., the top 20 sequences) as candidate target sequences in order of similarity.
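- A sketch of this similarity-based extraction, assuming the targets are phoneme sequences and using the classic Levenshtein edit distance (the helper names and the top-20 default mirror the example above but are otherwise illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, yj in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # delete x
                                     dp[j - 1] + 1,     # insert yj
                                     prev + (x != yj))  # substitute (or match)
    return dp[-1]

def extract_candidates(phoneme_seq, target_list, k=20):
    """Keep the k target sequences most similar to the decoded sequence."""
    return sorted(target_list, key=lambda t: edit_distance(phoneme_seq, t))[:k]
```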
- In this manner, by controlling the number of candidate target sequences to be included in a candidate set with a similarity algorithm, the result returner 240 calculates the probability of generating each candidate target sequence with reduced time, thereby enabling rapid return of a final recognition result. - The
result returner 240 returns, as a recognition result of an audio signal, at least one candidate target sequence in a candidate set using phoneme probabilities calculated using the acoustic model. - In one example, the
result returner 240 calculates similarities between one or more acquired phoneme sequences and each candidate target sequence in a candidate set using a similarity calculation algorithm including an edit distance algorithm, and returns a candidate target sequence having the highest similarity as a recognition result. - In another example, the
result returner 240 calculates probabilities of generating each candidate target sequence in a candidate set by applying phoneme probabilities calculated by the probability calculator 210 to probability calculation algorithms, such as Equations 2 and 3, and returns a candidate target sequence having the highest probability as a final recognition result. -
FIG. 3 is a flowchart illustrating an example of a speech recognition method. -
FIG. 3 is an example of a speech recognition method performed by the speech recognition apparatus illustrated in FIG. 1. - Referring to
FIG. 3, the speech recognition apparatus 100 calculates probabilities of phonemes of an audio signal using an acoustic model in 310. In this case, the audio signal is converted into audio frames by a preprocessing process, and the audio frames are input to the acoustic model. The acoustic model divides each audio frame into phonemes, and outputs probabilities of each phoneme. As described above, the acoustic model is trained by combining a Recurrent Neural Network (RNN) and Connectionist Temporal Classification (CTC), using the algorithms of Equations 2 and 3 above. - Subsequently, a candidate set that includes one or more candidate target sequences is extracted from a recognition target list in 320. The recognition target list includes target sequences, such as words or phrases, that are predefined according to various devices. For example, in TVs, the target sequences may include commands for controlling the TV, such as a power on/off command, a volume control command, and a channel change command. Further, in navigation devices, the target sequences may include commands for controlling the navigation device, such as a power on/off command, a volume control command, and a destination search command. In addition, the target sequences may include commands to control various electronic devices mounted in a vehicle. However, the target sequences are not limited to these examples, and may be applied to any electronic device controlled by a user and including speech recognition technology.
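- As a rough illustration only, such a recognition target list might be organized per device type as below; the command entries are placeholders, not values from the disclosure:

```python
# One way to organize a recognition target list keyed by device type.
RECOGNITION_TARGETS = {
    "tv":         ["power on", "power off", "volume up", "volume down",
                   "channel up", "channel down"],
    "navigation": ["power on", "power off", "volume up", "volume down",
                   "search destination"],
}

def targets_for(device_type):
    # fall back to an empty candidate pool for unknown device types
    return RECOGNITION_TARGETS.get(device_type, [])
```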
- Then, a recognition result of an input audio signal is returned based on the calculated phoneme probabilities and the extracted candidate set in 330. In one example, probabilities of generating each candidate target sequence are calculated based on the phoneme probabilities calculated using an acoustic model and algorithms of Equations 2 and 3 above. Further, a candidate target sequence having the highest probability is returned as a final recognition result.
-
FIG. 4 is a flowchart illustrating another example of a speech recognition method. - Referring to
FIG. 4, probabilities of phonemes of an audio signal are calculated using an acoustic model in 410. The acoustic model is trained in a manner that maximizes probabilities of phonemes for each audio frame, with respect to all the combinations of phonemes that may make up a phoneme sequence that is an answer, using various learning algorithms, e.g., a CTC learning algorithm. - Subsequently, a phoneme sequence, which is a series of phonemes, is acquired based on the calculated phoneme probabilities in 420. For example, one or more phoneme sequences are acquired using a decoding algorithm, such as a best path decoding algorithm or a prefix search decoding algorithm.
- Then, a candidate set is generated by extracting one or more candidate target sequences from the recognition target list based on the phoneme sequence in 430. The recognition target list is predefined according to types of electronic devices including speech recognition technology. In this case, the recognition target list may further include information associated with usage rankings (e.g., a usage frequency, a usage probability, etc.) of each target sequence.
- In one example, the speech recognition apparatus extracts all or some of the target sequences as a candidate set depending on the total number of target sequences included in the recognition target list. In the case where there is information associated with usage rankings of target sequences, a predefined number of target sequences may be extracted as a candidate set based on the information.
- In another example, the speech recognition apparatus calculates similarities by comparing one or more phoneme sequences acquired by the
sequence acquirer 220 with each target sequence included in the recognition target list, and based on the similarities, extracts a specific number of target sequences as candidate target sequences. For example, the speech recognition apparatus calculates similarities between phoneme sequences and target sequences using a similarity calculation algorithm including an edit distance algorithm, and based on the similarities, extracts a specific number of target sequences (e.g., the top 20 sequences) as candidate target sequences in order of similarity. - Then, a recognition result of an audio signal is returned based on the phoneme probabilities calculated using an acoustic model and the candidate set in 440.
- In one example, the speech recognition apparatus calculates similarities between one or more acquired phoneme sequences and each candidate target sequence in a candidate set using a similarity calculation algorithm including an edit distance algorithm, and returns a candidate target sequence having the highest similarity as a recognition result.
- In another example, the speech recognition apparatus calculates probabilities of generating each candidate target sequence in a candidate set by applying the calculated phoneme probabilities to probability calculation algorithms, such as Equations 2 and 3 above, and returns a candidate target sequence having the highest probability as a final recognition result.
-
FIG. 5 is a block diagram illustrating an example of an electronic device. - The
electronic device 500 includes the speech recognition apparatus 100 or 200. The electronic device 500 may be a TV set, a set-top box, a desktop computer, a laptop computer, an electronic translator, a smartphone, a tablet PC, an electronic control device of a vehicle, or any other device that is controlled by a user, and processes a user's various commands by embedded speech recognition technology. However, the electronic device 500 is not limited to these examples, and may be any electronic device that is controlled by a user and includes speech recognition technology. - Referring to
FIG. 5, the electronic device 500 includes a speech receiver 510, a speech recognizer 520, and a processor 530. The speech recognizer 520 is the speech recognition apparatus 100 in FIG. 1 or 200 in FIG. 2, manufactured as hardware to be implemented in the electronic device 500. - The
speech receiver 510 receives a user's audio signal input through a microphone of the electronic device 500. As illustrated in FIG. 5, the user's audio signal may be phrases to be translated into another language, or may be commands for controlling a TV set, driving a vehicle, or controlling any other device that is controlled by a user. - In one example, the
speech receiver 510 performs a preprocessing process in which an analog audio signal input by a user is converted into a digital signal, the signal is divided into a plurality of audio frames, and the audio frames are transmitted to the speech recognizer 520.
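- The framing step of this preprocessing can be sketched as follows; the 25 ms window and 10 ms hop are conventional values assumed for illustration, not parameters fixed by the description:

```python
def frame_signal(samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a digitized signal into short, overlapping audio frames."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + win]
            for i in range(0, max(len(samples) - win, 0) + 1, hop)]
```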
- The speech recognizer 520 inputs an audio signal, e.g., audio frames, to an acoustic model, and calculates probabilities of phonemes of each audio frame. Once the phoneme probabilities of the audio frame are calculated, the speech recognizer 520 extracts a candidate set from a recognition target list based on the calculated phoneme probabilities, and returns a final recognition result based on the calculated phoneme probabilities and the extracted candidate set. The acoustic model is a network based on a Recurrent Neural Network (RNN) or a Deep Neural Network (DNN), and is trained in a manner that maximizes probabilities of phonemes of each audio frame with respect to all the combinations of phonemes that may make up an answer sequence using a CTC learning algorithm. - The recognition target list is predefined according to the types and purposes of the
electronic device 500 that includes speech recognition technology. For example, in a case in which the voice recognition apparatus 100 is mounted in a TV set, various words or commands, such as a power on/off command, a volume control command, and a channel change command, that are frequently used for TVs are defined in the recognition target list. Further, in a case in which the electronic device 500 is a navigation device mounted in a vehicle, various commands, such as a power on/off command, a volume control command, and a destination search command, that are used to control the navigation device are defined in the recognition target list. - The
speech recognizer 520 acquires phoneme sequences based on phoneme probabilities using a general decoding algorithm (e.g., CTC) for speech recognition, and extracts a candidate set by comparing the acquired phoneme sequences with the recognition target list. In this case, the speech recognizer 520 calculates similarities between the acquired phoneme sequences and each target sequence included in the recognition target list using a similarity calculation algorithm including an edit distance algorithm, and based on the similarities, generates a candidate set by extracting a specific number of target sequences as candidate target sequences in order of similarity. - The
speech recognizer 520 returns, as a final recognition result, one candidate target sequence in the candidate set extracted based on the calculated phoneme probabilities. In this case, the speech recognizer 520 returns, as a final recognition result, a candidate target sequence having the highest probability among the probabilities of generating each candidate target sequence in a candidate set. In one example, the speech recognizer 520 outputs the final recognition result in a text format. - The
processor 530 performs an operation in response to the final recognition result. For example, the processor 530 outputs the recognition result of speech input by a user in voice from a speaker, headphones, or any other audio output device, or provides the recognition result in a text format on a display. Further, the processor 530 performs operations to process commands (e.g., a power on/off command, a volume control command, etc.) to control TVs, set-top boxes, home appliances, electronic control devices of a vehicle, or any other devices that are controlled by a user. - Further, in the case of translating the final recognition result into another language, the
processor 530 translates the final recognition result output in a text format into another language, and outputs the translated result in voice or in a text format. However, the processor 530 is not limited to these examples, and may be used in various applications. -
FIG. 6 is a flowchart illustrating an example of a speech recognition method in the electronic device. - The
electronic device 500 receives, through a microphone or any other audio input device, a user's audio signal containing phrases to be translated into another language, or commands for controlling TVs or driving a vehicle, in 610. Further, once the user's audio signal is received, the electronic device 500 converts the analog audio signal into a digital signal, and performs a preprocessing process of dividing the digital signal into a plurality of audio frames. - Then, the
electronic device 500 returns a final recognition result of the input audio signal based on the pre-stored acoustic model and a predefined recognition target list in 620. - For example, the
electronic device 500 inputs an audio frame to an acoustic model to calculate probabilities of phonemes of audio frames. Further, once the probabilities of phonemes of audio frames have been calculated, the electronic device 500 extracts a candidate set from the recognition target list based on the calculated probabilities of phonemes, and returns a final recognition result based on the calculated phoneme probabilities and the extracted candidate set. The acoustic model is a network based on a Recurrent Neural Network (RNN) or a Deep Neural Network (DNN), and is trained using a CTC learning algorithm. The recognition target list is predefined according to the types and purposes of the electronic device 500 that includes speech recognition technology. - In one example, the
electronic device 500 acquires phoneme sequences from the calculated phoneme probabilities, and extracts a candidate set by comparing the acquired phoneme sequences with the recognition target list. In this case, the electronic device 500 calculates similarities between the acquired phoneme sequences and each target sequence included in the recognition target list using a similarity calculation algorithm including an edit distance algorithm, and based on the similarities, generates a candidate set by extracting a specific number of target sequences as candidate target sequences in order of similarity. - The
electronic device 500 calculates probabilities of generating each candidate target sequence using Equations 2 and 3 above, and returns a candidate target sequence having the highest probability as a final recognition result, which may be converted into a text format by the electronic device 500. - Subsequently, the
electronic device 500 performs an operation in response to the returned final recognition result in 630. - For example, the
electronic device 500 may output the recognition result of speech input by a user in voice from a speaker, headphones, or any other audio output device, or provide the recognition result in a text format on a display. Further, the electronic device 500 may perform operations to process commands to control TVs, set-top boxes, home appliances, electronic control devices of a vehicle, and any other devices that are controlled by a user. In addition, the electronic device 500 may translate the final recognition result output in a text format into another language, and may output the translated result in voice or in a text format. However, the electronic device 500 is not limited to these examples, and may be used in various applications. - The
speech recognition apparatus 100, the probability calculator 110, the candidate set extractor 120, and the result returner 130 illustrated in FIG. 1, the speech recognition apparatus 200, the probability calculator 210, the sequence acquirer 220, the candidate set extractor 230, and the result returner 240 illustrated in FIG. 2, and the electronic device 500, the speech receiver 510, the speech recognizer 520, and the processor 530 illustrated in FIG. 5 that perform the operations described herein with respect to FIGS. 1-6 are implemented by hardware components. Examples of hardware components include controllers, sensors, generators, drivers, memories, comparators, arithmetic logic units, adders, subtractors, multipliers, dividers, integrators, and any other electronic components known to one of ordinary skill in the art. In one example, the hardware components are implemented by computing hardware, for example, by one or more processors or computers. A processor or computer is implemented by one or more processing elements, such as an array of logic gates, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a programmable logic controller, a field-programmable gate array, a programmable logic array, a microprocessor, or any other device or combination of devices known to one of ordinary skill in the art that is capable of responding to and executing instructions in a defined manner to achieve a desired result. In one example, a processor or computer includes, or is connected to, one or more memories storing instructions or software that are executed by the processor or computer. Hardware components implemented by a processor or computer execute instructions or software, such as an operating system (OS) and one or more software applications that run on the OS, to perform the operations described herein with respect to FIGS. 1-6. The hardware components also access, manipulate, process, create, and store data in response to execution of the instructions or software. For simplicity, the singular term “processor” or “computer” may be used in the description of the examples described herein, but in other examples multiple processors or computers are used, or a processor or computer includes multiple processing elements, or multiple types of processing elements, or both. In one example, a hardware component includes multiple processors, and in another example, a hardware component includes a processor and a controller. A hardware component has any one or more of different processing configurations, examples of which include a single processor, independent processors, parallel processors, single-instruction single-data (SISD) multiprocessing, single-instruction multiple-data (SIMD) multiprocessing, multiple-instruction single-data (MISD) multiprocessing, and multiple-instruction multiple-data (MIMD) multiprocessing.
FIGS. 3, 4, and 6 that perform the operations described herein with respect to FIGS. 1-6 are performed by computing hardware, for example, by one or more processors or computers, as described above, executing instructions or software to perform the operations described herein. - Instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above are written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the processor or computer to operate as a machine or special-purpose computer to perform the operations performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the processor or computer, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the processor or computer using an interpreter. Programmers of ordinary skill in the art can readily write the instructions or software based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations performed by the hardware components and the methods as described above.
- The instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, are recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access memory (RAM), flash memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMS, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any device known to one of ordinary skill in the art that is capable of storing the instructions or software and any associated data, data files, and data structures in a non-transitory manner and providing the instructions or software and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the processor or computer.
- While this disclosure includes specific examples, it will be apparent to one of ordinary skill in the art that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.
Claims (26)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/322,965 US20210272551A1 (en) | 2015-06-30 | 2021-05-18 | Speech recognition apparatus, speech recognition method, and electronic device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2015-0093653 | 2015-06-30 | ||
KR1020150093653A KR102371188B1 (en) | 2015-06-30 | 2015-06-30 | Apparatus and method for speech recognition, and electronic device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/322,965 Continuation US20210272551A1 (en) | 2015-06-30 | 2021-05-18 | Speech recognition apparatus, speech recognition method, and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170004824A1 true US20170004824A1 (en) | 2017-01-05 |
Family
ID=56134254
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/139,926 Abandoned US20170004824A1 (en) | 2015-06-30 | 2016-04-27 | Speech recognition apparatus, speech recognition method, and electronic device |
US17/322,965 Abandoned US20210272551A1 (en) | 2015-06-30 | 2021-05-18 | Speech recognition apparatus, speech recognition method, and electronic device |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/322,965 Abandoned US20210272551A1 (en) | 2015-06-30 | 2021-05-18 | Speech recognition apparatus, speech recognition method, and electronic device |
Country Status (5)
Country | Link |
---|---|
US (2) | US20170004824A1 (en) |
EP (1) | EP3113176B1 (en) |
JP (1) | JP6637848B2 (en) |
KR (1) | KR102371188B1 (en) |
CN (1) | CN106328127B (en) |
Cited By (153)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106782513A (en) * | 2017-01-25 | 2017-05-31 | 上海交通大学 | Speech recognition realization method and system based on confidence level |
US20180061439A1 (en) * | 2016-08-31 | 2018-03-01 | Gregory Frederick Diamos | Automatic audio captioning |
US9972313B2 (en) | 2016-03-01 | 2018-05-15 | Intel Corporation | Intermediate scoring and rejection loopback for improved key phrase detection |
US10043521B2 (en) * | 2016-07-01 | 2018-08-07 | Intel IP Corporation | User defined key phrase detection by user dependent sequence modeling |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10229672B1 (en) * | 2015-12-31 | 2019-03-12 | Google Llc | Training acoustic models using connectionist temporal classification |
US10269345B2 (en) * | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US10325594B2 (en) | 2015-11-24 | 2019-06-18 | Intel IP Corporation | Low resource key phrase detection for wake on voice |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10423727B1 (en) | 2018-01-11 | 2019-09-24 | Wells Fargo Bank, N.A. | Systems and methods for processing nuances in natural language |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US10580432B2 (en) * | 2018-02-28 | 2020-03-03 | Microsoft Technology Licensing, Llc | Speech recognition using connectionist temporal classification |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10607602B2 (en) | 2015-05-22 | 2020-03-31 | National Institute Of Information And Communications Technology | Speech recognition device and computer program |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
CN111090630A (en) * | 2019-12-16 | 2020-05-01 | 中科宇图科技股份有限公司 | Data fusion processing method based on multi-source spatial point data |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10650807B2 (en) | 2018-09-18 | 2020-05-12 | Intel Corporation | Method and system of neural network keyphrase detection |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US10714122B2 (en) | 2018-06-06 | 2020-07-14 | Intel Corporation | Speech classification of audio for wake on voice |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
CN111681661A (en) * | 2020-06-08 | 2020-09-18 | 北京有竹居网络技术有限公司 | Method, device, electronic equipment and computer readable medium for voice recognition |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
CN111862943A (en) * | 2019-04-30 | 2020-10-30 | 北京地平线机器人技术研发有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10909976B2 (en) | 2016-06-09 | 2021-02-02 | National Institute Of Information And Communications Technology | Speech recognition device and computer program |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
CN112735394A (en) * | 2020-12-16 | 2021-04-30 | 青岛海尔科技有限公司 | Semantic parsing method and device for voice |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US20210272551A1 (en) * | 2015-06-30 | 2021-09-02 | Samsung Electronics Co., Ltd. | Speech recognition apparatus, speech recognition method, and electronic device |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US11127394B2 (en) | 2019-03-29 | 2021-09-21 | Intel Corporation | Method and system of high accuracy keyphrase detection for low resource devices |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
CN113488029A (en) * | 2021-06-23 | 2021-10-08 | 中科极限元(杭州)智能科技股份有限公司 | Non-autoregressive speech recognition training decoding method and system based on parameter sharing |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
CN114333821A (en) * | 2021-12-30 | 2022-04-12 | 山东声智物联科技有限公司 | Elevator control method, device, electronic equipment, storage medium and product |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US20220189462A1 (en) * | 2020-12-10 | 2022-06-16 | National Cheng Kung University | Method of training a speech recognition model of an extended language by speech in a source language |
US11373639B2 (en) * | 2019-12-12 | 2022-06-28 | Mitsubishi Electric Research Laboratories, Inc. | System and method for streaming end-to-end speech recognition with asynchronous decoders pruning prefixes using a joint label and frame information in transcribing technique |
CN114694641A (en) * | 2020-12-31 | 2022-07-01 | 华为技术有限公司 | Voice recognition method and electronic equipment |
CN114694637A (en) * | 2020-12-30 | 2022-07-01 | 北大方正集团有限公司 | Hybrid speech recognition method, device, electronic device and storage medium |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
CN114783419A (en) * | 2022-06-21 | 2022-07-22 | 深圳市友杰智新科技有限公司 | Text recognition method and device combined with priori knowledge and computer equipment |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11450312B2 (en) | 2018-03-22 | 2022-09-20 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method, apparatus, and device, and storage medium |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11530930B2 (en) | 2017-09-19 | 2022-12-20 | Volkswagen Aktiengesellschaft | Transportation vehicle control with phoneme generation |
US20230046924A1 (en) * | 2016-11-03 | 2023-02-16 | Google Llc | Focus session at a voice interface device |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11928604B2 (en) | 2005-09-08 | 2024-03-12 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US12197817B2 (en) | 2016-06-11 | 2025-01-14 | Apple Inc. | Intelligent device arbitration and control |
US12223282B2 (en) | 2016-06-09 | 2025-02-11 | Apple Inc. | Intelligent automated assistant in a home environment |
US12301635B2 (en) | 2020-05-11 | 2025-05-13 | Apple Inc. | Digital assistant hardware abstraction |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10229685B2 (en) * | 2017-01-18 | 2019-03-12 | International Business Machines Corporation | Symbol sequence estimation in speech |
CN109313892B (en) * | 2017-05-17 | 2023-02-21 | 北京嘀嘀无限科技发展有限公司 | Robust speech recognition method and system |
KR102339716B1 (en) * | 2017-06-30 | 2021-12-14 | 삼성에스디에스 주식회사 | Method for recognizing speech and Apparatus thereof |
KR102441066B1 (en) * | 2017-10-12 | 2022-09-06 | 현대자동차주식회사 | Vehicle voice generation system and method |
CN107729321A (en) * | 2017-10-23 | 2018-02-23 | 上海百芝龙网络科技有限公司 | A kind of method for correcting error of voice identification result |
CN109542514B (en) * | 2017-10-30 | 2021-01-05 | 安徽寒武纪信息科技有限公司 | Method for implementing operation instruction and related product |
CN107992812A (en) * | 2017-11-27 | 2018-05-04 | 北京搜狗科技发展有限公司 | A kind of lip reading recognition methods and device |
CN108766414B (en) * | 2018-06-29 | 2021-01-15 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer-readable storage medium for speech translation |
CN109121057B (en) * | 2018-08-30 | 2020-11-06 | 北京聆通科技有限公司 | Intelligent hearing aid method and system |
US11996105B2 (en) | 2018-09-13 | 2024-05-28 | Shanghai Cambricon Information Technology Co., Ltd. | Information processing method and terminal device |
KR102651413B1 (en) * | 2018-10-17 | 2024-03-27 | 삼성전자주식회사 | Electronic device and controlling method of electronic device |
KR20200056001A (en) * | 2018-11-14 | 2020-05-22 | 삼성전자주식회사 | A decoding method in an artificial neural network and an apparatus thereof |
CN111862961A (en) * | 2019-04-29 | 2020-10-30 | 京东数字科技控股有限公司 | Method and device for recognizing voice |
CN110852324A (en) * | 2019-08-23 | 2020-02-28 | 上海撬动网络科技有限公司 | Deep neural network-based container number detection method |
CN110503956B (en) * | 2019-09-17 | 2023-05-12 | 平安科技(深圳)有限公司 | Voice recognition method, device, medium and electronic equipment |
KR102577589B1 (en) * | 2019-10-22 | 2023-09-12 | 삼성전자주식회사 | Voice recognizing method and voice recognizing appratus |
KR20210060897A (en) * | 2019-11-19 | 2021-05-27 | 삼성전자주식회사 | Method and apparatus for processing speech |
CN112837401B (en) * | 2021-01-27 | 2024-04-09 | 网易(杭州)网络有限公司 | Information processing method, device, computer equipment and storage medium |
US11682413B2 (en) * | 2021-10-28 | 2023-06-20 | Lenovo (Singapore) Pte. Ltd | Method and system to modify speech impaired messages utilizing neural network audio filters |
CN113889083B (en) * | 2021-11-03 | 2022-12-02 | 广州博冠信息科技有限公司 | Voice recognition method and device, storage medium and electronic equipment |
CN117524263A (en) * | 2022-07-26 | 2024-02-06 | 北京三星通信技术研究有限公司 | Data processing method, device wake-up method, electronic device and storage medium |
CN115329785B (en) * | 2022-10-15 | 2023-01-20 | 小语智能信息科技(云南)有限公司 | English-Tai-old multi-language neural machine translation method and device integrated with phoneme characteristics |
CN116580701B (en) * | 2023-05-19 | 2023-11-24 | 国网物资有限公司 | Alarm audio frequency identification method, device, electronic equipment and computer medium |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS62118398A (en) * | 1985-11-19 | 1987-05-29 | 松下電器産業株式会社 | Word recognition equipment |
JP3741156B2 (en) * | 1995-04-07 | 2006-02-01 | ソニー株式会社 | Speech recognition apparatus, speech recognition method, and speech translation apparatus |
JP2000029486A (en) * | 1998-07-09 | 2000-01-28 | Hitachi Ltd | Speech recognition system and method |
JP3782943B2 (en) * | 2001-02-20 | 2006-06-07 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Speech recognition apparatus, computer system, speech recognition method, program, and recording medium |
KR100438838B1 (en) * | 2002-01-29 | 2004-07-05 | 삼성전자주식회사 | A voice command interpreter with dialogue focus tracking function and method thereof |
JP4511274B2 (en) * | 2004-07-29 | 2010-07-28 | 三菱電機株式会社 | Voice data retrieval device |
JP4734155B2 (en) * | 2006-03-24 | 2011-07-27 | 株式会社東芝 | Speech recognition apparatus, speech recognition method, and speech recognition program |
KR20090065102A (en) * | 2007-12-17 | 2009-06-22 | 한국전자통신연구원 | Vocabulary decoding method and apparatus |
WO2011037587A1 (en) * | 2009-09-28 | 2011-03-31 | Nuance Communications, Inc. | Downsampling schemes in a hierarchical neural network structure for phoneme recognition |
JP5753769B2 (en) * | 2011-11-18 | 2015-07-22 | 株式会社日立製作所 | Voice data retrieval system and program therefor |
US8682678B2 (en) * | 2012-03-14 | 2014-03-25 | International Business Machines Corporation | Automatic realtime speech impairment correction |
KR20140028174A (en) * | 2012-07-13 | 2014-03-10 | 삼성전자주식회사 | Method for recognizing speech and electronic device thereof |
CN103854643B (en) * | 2012-11-29 | 2017-03-01 | 株式会社东芝 | Method and apparatus for synthesizing voice |
US20150228277A1 (en) * | 2014-02-11 | 2015-08-13 | Malaspina Labs (Barbados), Inc. | Voiced Sound Pattern Detection |
JP6011565B2 (en) * | 2014-03-05 | 2016-10-19 | カシオ計算機株式会社 | Voice search device, voice search method and program |
KR102371188B1 (en) * | 2015-06-30 | 2022-03-04 | 삼성전자주식회사 | Apparatus and method for speech recognition, and electronic device |
CN114503194A (en) * | 2019-12-17 | 2022-05-13 | 谷歌有限责任公司 | Machine learning for interpretation |
-
2015
- 2015-06-30 KR KR1020150093653A patent/KR102371188B1/en active Active
-
2016
- 2016-04-27 US US15/139,926 patent/US20170004824A1/en not_active Abandoned
- 2016-06-17 EP EP16175048.4A patent/EP3113176B1/en not_active Not-in-force
- 2016-06-29 JP JP2016128918A patent/JP6637848B2/en active Active
- 2016-06-30 CN CN201610510741.3A patent/CN106328127B/en active Active
-
2021
- 2021-05-18 US US17/322,965 patent/US20210272551A1/en not_active Abandoned
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7219123B1 (en) * | 1999-10-08 | 2007-05-15 | At Road, Inc. | Portable browser device with adaptive personalization capability |
US20100004932A1 (en) * | 2007-03-20 | 2010-01-07 | Fujitsu Limited | Speech recognition system, speech recognition program, and speech recognition method |
US20100217596A1 (en) * | 2009-02-24 | 2010-08-26 | Nexidia Inc. | Word spotting false alarm phrases |
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
US20150340034A1 (en) * | 2014-05-22 | 2015-11-26 | Google Inc. | Recognizing speech using neural networks |
US20160027438A1 (en) * | 2014-07-27 | 2016-01-28 | Malaspina Labs (Barbados), Inc. | Concurrent Segmentation of Multiple Similar Vocalizations |
US20160125878A1 (en) * | 2014-11-05 | 2016-05-05 | Hyundai Motor Company | Vehicle and head unit having voice recognition function, and method for voice recognizing thereof |
US20160171974A1 (en) * | 2014-12-15 | 2016-06-16 | Baidu Usa Llc | Systems and methods for speech transcription |
US20160260430A1 (en) * | 2015-03-06 | 2016-09-08 | Dell Products L.P. | Voice-based input using natural language processing for interfacing with one or more devices |
US20160351188A1 (en) * | 2015-05-26 | 2016-12-01 | Google Inc. | Learning pronunciations from acoustic sequences |
Cited By (255)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11928604B2 (en) | 2005-09-08 | 2024-03-12 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US12165635B2 (en) | 2010-01-18 | 2024-12-10 | Apple Inc. | Intelligent automated assistant |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11321116B2 (en) | 2012-05-15 | 2022-05-03 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11636869B2 (en) | 2013-02-07 | 2023-04-25 | Apple Inc. | Voice trigger for a digital assistant |
US12277954B2 (en) | 2013-02-07 | 2025-04-15 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US12009007B2 (en) | 2013-02-07 | 2024-06-11 | Apple Inc. | Voice trigger for a digital assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US12073147B2 (en) | 2013-06-09 | 2024-08-27 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US11670289B2 (en) | 2014-05-30 | 2023-06-06 | Apple Inc. | Multi-command single utterance input method |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11810562B2 (en) | 2014-05-30 | 2023-11-07 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US12067990B2 (en) | 2014-05-30 | 2024-08-20 | Apple Inc. | Intelligent assistant for home automation |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US10657966B2 (en) | 2014-05-30 | 2020-05-19 | Apple Inc. | Better resolution when referencing to concepts |
US10714095B2 (en) | 2014-05-30 | 2020-07-14 | Apple Inc. | Intelligent assistant for home automation |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US12200297B2 (en) | 2014-06-30 | 2025-01-14 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US11842734B2 (en) | 2015-03-08 | 2023-12-12 | Apple Inc. | Virtual assistant activation |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US12236952B2 (en) | 2015-03-08 | 2025-02-25 | Apple Inc. | Virtual assistant activation |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US12154016B2 (en) | 2015-05-15 | 2024-11-26 | Apple Inc. | Virtual assistant in a communication session |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US10607602B2 (en) | 2015-05-22 | 2020-03-31 | National Institute Of Information And Communications Technology | Speech recognition device and computer program |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US10681212B2 (en) | 2015-06-05 | 2020-06-09 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US20210272551A1 (en) * | 2015-06-30 | 2021-09-02 | Samsung Electronics Co., Ltd. | Speech recognition apparatus, speech recognition method, and electronic device |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11550542B2 (en) | 2015-09-08 | 2023-01-10 | Apple Inc. | Zero latency digital assistant |
US12204932B2 (en) | 2015-09-08 | 2025-01-21 | Apple Inc. | Distributed personal assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10325594B2 (en) | 2015-11-24 | 2019-06-18 | Intel IP Corporation | Low resource key phrase detection for wake on voice |
US10937426B2 (en) | 2015-11-24 | 2021-03-02 | Intel IP Corporation | Low resource key phrase detection for wake on voice |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11341958B2 (en) | 2015-12-31 | 2022-05-24 | Google Llc | Training acoustic models using connectionist temporal classification |
US11769493B2 (en) | 2015-12-31 | 2023-09-26 | Google Llc | Training acoustic models using connectionist temporal classification |
US10229672B1 (en) * | 2015-12-31 | 2019-03-12 | Google Llc | Training acoustic models using connectionist temporal classification |
US10803855B1 (en) | 2015-12-31 | 2020-10-13 | Google Llc | Training acoustic models using connectionist temporal classification |
US9972313B2 (en) | 2016-03-01 | 2018-05-15 | Intel Corporation | Intermediate scoring and rejection loopback for improved key phrase detection |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US12223282B2 (en) | 2016-06-09 | 2025-02-11 | Apple Inc. | Intelligent automated assistant in a home environment |
US10909976B2 (en) | 2016-06-09 | 2021-02-02 | National Institute Of Information And Communications Technology | Speech recognition device and computer program |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US12175977B2 (en) | 2016-06-10 | 2024-12-24 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11657820B2 (en) | 2016-06-10 | 2023-05-23 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US12197817B2 (en) | 2016-06-11 | 2025-01-14 | Apple Inc. | Intelligent device arbitration and control |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US11749275B2 (en) | 2016-06-11 | 2023-09-05 | Apple Inc. | Application integration with a digital assistant |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US10269345B2 (en) * | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US12293763B2 (en) | 2016-06-11 | 2025-05-06 | Apple Inc. | Application integration with a digital assistant |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US10043521B2 (en) * | 2016-07-01 | 2018-08-07 | Intel IP Corporation | User defined key phrase detection by user dependent sequence modeling |
US10679643B2 (en) * | 2016-08-31 | 2020-06-09 | Gregory Frederick Diamos | Automatic audio captioning |
US20180061439A1 (en) * | 2016-08-31 | 2018-03-01 | Gregory Frederick Diamos | Automatic audio captioning |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10553215B2 (en) | 2016-09-23 | 2020-02-04 | Apple Inc. | Intelligent automated assistant |
US20230046924A1 (en) * | 2016-11-03 | 2023-02-16 | Google Llc | Focus session at a voice interface device |
US11990128B2 (en) * | 2016-11-03 | 2024-05-21 | Google Llc | Focus session at a voice interface device |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US12260234B2 (en) | 2017-01-09 | 2025-03-25 | Apple Inc. | Application integration with a digital assistant |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
CN106782513A (en) * | 2017-01-25 | 2017-05-31 | 上海交通大学 | Speech recognition realization method and system based on confidence level |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US10847142B2 (en) | 2017-05-11 | 2020-11-24 | Apple Inc. | Maintaining privacy of personal information |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US12254887B2 (en) | 2017-05-16 | 2025-03-18 | Apple Inc. | Far-field extension of digital assistant services for providing a notification of an event to a user |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US10909171B2 (en) | 2017-05-16 | 2021-02-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US11530930B2 (en) | 2017-09-19 | 2022-12-20 | Volkswagen Aktiengesellschaft | Transportation vehicle control with phoneme generation |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10423727B1 (en) | 2018-01-11 | 2019-09-24 | Wells Fargo Bank, N.A. | Systems and methods for processing nuances in natural language |
US12001806B1 (en) | 2018-01-11 | 2024-06-04 | Wells Fargo Bank, N.A. | Systems and methods for processing nuances in natural language |
US11244120B1 (en) | 2018-01-11 | 2022-02-08 | Wells Fargo Bank, N.A. | Systems and methods for processing nuances in natural language |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10580432B2 (en) * | 2018-02-28 | 2020-03-03 | Microsoft Technology Licensing, Llc | Speech recognition using connectionist temporal classification |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US11450312B2 (en) | 2018-03-22 | 2022-09-20 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method, apparatus, and device, and storage medium |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US12211502B2 (en) | 2018-03-26 | 2025-01-28 | Apple Inc. | Natural assistant interaction |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11487364B2 (en) | 2018-05-07 | 2022-11-01 | Apple Inc. | Raise to speak |
US11900923B2 (en) | 2018-05-07 | 2024-02-13 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US12061752B2 (en) | 2018-06-01 | 2024-08-13 | Apple Inc. | Attention aware virtual assistant dismissal |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US12080287B2 (en) | 2018-06-01 | 2024-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11360577B2 (en) | 2018-06-01 | 2022-06-14 | Apple Inc. | Attention aware virtual assistant dismissal |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10944859B2 (en) | 2018-06-03 | 2021-03-09 | Apple Inc. | Accelerated task performance |
US10714122B2 (en) | 2018-06-06 | 2020-07-14 | Intel Corporation | Speech classification of audio for wake on voice |
US10650807B2 (en) | 2018-09-18 | 2020-05-12 | Intel Corporation | Method and system of neural network keyphrase detection |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US12136419B2 (en) | 2019-03-18 | 2024-11-05 | Apple Inc. | Multimodality in digital assistant systems |
US11127394B2 (en) | 2019-03-29 | 2021-09-21 | Intel Corporation | Method and system of high accuracy keyphrase detection for low resource devices |
CN111862943A (en) * | 2019-04-30 | 2020-10-30 | 北京地平线机器人技术研发有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US12216894B2 (en) | 2019-05-06 | 2025-02-04 | Apple Inc. | User configurable task triggers |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US12154571B2 (en) | 2019-05-06 | 2024-11-26 | Apple Inc. | Spoken notifications |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11888791B2 (en) | 2019-05-21 | 2024-01-30 | Apple Inc. | Providing message response suggestions |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11373639B2 (en) * | 2019-12-12 | 2022-06-28 | Mitsubishi Electric Research Laboratories, Inc. | System and method for streaming end-to-end speech recognition with asynchronous decoders pruning prefixes using a joint label and frame information in transcribing technique |
CN111090630A (en) * | 2019-12-16 | 2020-05-01 | 中科宇图科技股份有限公司 | Data fusion processing method based on multi-source spatial point data |
US12301635B2 (en) | 2020-05-11 | 2025-05-13 | Apple Inc. | Digital assistant hardware abstraction |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US12197712B2 (en) | 2020-05-11 | 2025-01-14 | Apple Inc. | Providing relevant data items based on context |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11924254B2 (en) | 2020-05-11 | 2024-03-05 | Apple Inc. | Digital assistant hardware abstraction |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
CN111681661A (en) * | 2020-06-08 | 2020-09-18 | 北京有竹居网络技术有限公司 | Method, device, electronic equipment and computer readable medium for voice recognition |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US12219314B2 (en) | 2020-07-21 | 2025-02-04 | Apple Inc. | User identification using headphones |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
US20220189462A1 (en) * | 2020-12-10 | 2022-06-16 | National Cheng Kung University | Method of training a speech recognition model of an extended language by speech in a source language |
CN112735394A (en) * | 2020-12-16 | 2021-04-30 | 青岛海尔科技有限公司 | Semantic parsing method and device for voice |
CN114694637A (en) * | 2020-12-30 | 2022-07-01 | 北大方正集团有限公司 | Hybrid speech recognition method, device, electronic device and storage medium |
CN114694641A (en) * | 2020-12-31 | 2022-07-01 | 华为技术有限公司 | Voice recognition method and electronic equipment |
CN113488029A (en) * | 2021-06-23 | 2021-10-08 | 中科极限元(杭州)智能科技股份有限公司 | Non-autoregressive speech recognition training decoding method and system based on parameter sharing |
CN114333821A (en) * | 2021-12-30 | 2022-04-12 | 山东声智物联科技有限公司 | Elevator control method, device, electronic equipment, storage medium and product |
CN114783419A (en) * | 2022-06-21 | 2022-07-22 | 深圳市友杰智新科技有限公司 | Text recognition method and device combined with priori knowledge and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
JP6637848B2 (en) | 2020-01-29 |
KR102371188B1 (en) | 2022-03-04 |
CN106328127A (en) | 2017-01-11 |
JP2017016131A (en) | 2017-01-19 |
CN106328127B (en) | 2021-12-28 |
EP3113176A1 (en) | 2017-01-04 |
EP3113176B1 (en) | 2019-04-03 |
US20210272551A1 (en) | 2021-09-02 |
KR20170003246A (en) | 2017-01-09 |
Similar Documents
Publication | Title |
---|---|
US20210272551A1 (en) | Speech recognition apparatus, speech recognition method, and electronic device |
US10388284B2 (en) | Speech recognition apparatus and method |
US10468030B2 (en) | Speech recognition method and apparatus |
US10902216B2 (en) | Parallel processing-based translation method and apparatus |
CN107590135B (en) | Automatic translation method, device and system |
US11037552B2 (en) | Method and apparatus with a personalized speech recognition model |
US10957309B2 (en) | Neural network method and apparatus |
US10714077B2 (en) | Apparatus and method of acoustic score calculation and speech recognition using deep neural networks |
US10529319B2 (en) | User adaptive speech recognition method and apparatus |
EP3355303A1 (en) | Speech recognition method and apparatus |
CN106560891A (en) | Speech Recognition Apparatus And Method With Acoustic Modelling |
US20170084268A1 (en) | Apparatus and method for speech recognition, and apparatus and method for training transformation parameter |
EP3826007B1 (en) | Method and apparatus with speech processing |
US9972305B2 (en) | Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus |
US10599784B2 (en) | Automated interpretation method and apparatus, and machine translation method |
US12073825B2 (en) | Method and apparatus for speech recognition |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: SAMSUNG ELECTRONICS CO., LTD., KOREA, REPUBLIC OF. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YOO, SANG HYUN;CHOI, HEE YOUL;REEL/FRAME:038395/0883. Effective date: 20160425 |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |