George Saon Phones & Addresses

  • Wilton, CT
  • Stamford, CT
  • Old Greenwich, CT
  • Yorktown Heights, NY
  • 10 Brookside Park, Greenwich, CT 06830 (203) 422-0570
  • Putnam Valley, NY
  • Yorktown Hts, NY
  • 10 Brookside Dr APT LD, Greenwich, CT 06830

Work

Company: IBM (Oct 1998) Position: Research Scientist

Education

Degree: High school graduate or higher

Industries

Information Technology And Services

Resumes

Research Scientist

Location:
New York, NY
Industry:
Information Technology And Services
Work:
IBM
Research Scientist

Publications

Us Patents

Methods And Apparatus For Performing Heteroscedastic Discriminant Analysis In Pattern Recognition Systems

US Patent:
6609093, Aug 19, 2003
Filed:
Jun 1, 2000
Appl. No.:
09/584871
Inventors:
Ramesh Ambat Gopinath - Millwood NY
Mukund Padmanabhan - White Plains NY
George Andrei Saon - Putnam Valley NY
Assignee:
International Business Machines Corporation - Armonk NY
International Classification:
G10L 15/08
US Classification:
704236, 704243, 382190
Abstract:
The present invention provides a new approach to heteroscedastic linear discriminant analysis (HDA) by defining an objective function which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions. Moreover, we present a link between discrimination and the likelihood of the projected samples and show that HDA can be viewed as a constrained maximum likelihood (ML) projection for a full covariance Gaussian model, the constraint being given by the maximization of the projected between-class scatter volume. The present invention also provides that, under diagonal covariance Gaussian modeling constraints, applying a diagonalizing linear transformation (e.g., MLLT, maximum likelihood linear transformation) to the HDA space results in an increased classification accuracy. In another embodiment, the heteroscedastic discriminant objective function assumes that models associated with the function have diagonal covariances, thereby resulting in a diagonal heteroscedastic discriminant objective function.
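The objective the abstract describes, projected between-class scatter weighed against each class's own projected covariance, can be sketched numerically. The following is a minimal illustration under common HDA formulations, not the patented implementation; `hda_objective` is a hypothetical helper name.

```python
import numpy as np

def hda_objective(theta, X, y):
    """Heteroscedastic discriminant objective (illustrative sketch).

    theta: (p, d) projection matrix, p < d.
    Scores theta by the log-determinant of the projected between-class
    scatter minus the log-determinant of each class's own projected
    covariance, weighted by class counts (classes keep their own
    covariances, hence "heteroscedastic").
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    # between-class scatter: count-weighted outer products of mean offsets
    B = sum(np.sum(y == c) * np.outer(X[y == c].mean(0) - mu,
                                      X[y == c].mean(0) - mu)
            for c in classes)
    obj = 0.0
    for c in classes:
        Xc = X[y == c]
        Wc = np.cov(Xc.T, bias=True)   # class-specific covariance
        Nc = len(Xc)
        obj += Nc * (np.linalg.slogdet(theta @ B @ theta.T)[1]
                     - np.linalg.slogdet(theta @ Wc @ theta.T)[1])
    return obj
```

A projection maximizing this score could then be found by gradient ascent over `theta`; the patent's further step applies a diagonalizing transform (MLLT) in the resulting subspace.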

Lattice-Based Unsupervised Maximum Likelihood Linear Regression For Speaker Adaptation

US Patent:
7216077, May 8, 2007
Filed:
Sep 26, 2000
Appl. No.:
09/670251
Inventors:
Mukund Padmanabhan - White Plains NY, US
George A. Saon - Putnam Valley NY, US
Geoffrey G. Zweig - Greenwich CT, US
Assignee:
International Business Machines Corporation - Armonk NY
International Classification:
G10L 15/06
G10L 15/14
US Classification:
704240, 704244, 704246
Abstract:
Methods and arrangements using lattice-based information for unsupervised speaker adaptation. By performing adaptation against a word lattice, correct models are more likely to be used in estimating a transform. Further, a particular type of lattice proposed herein enables the use of a natural confidence measure given by the posterior occupancy probability of a state, that is, the statistics of a particular state will be updated with the current frame only if the a posteriori probability of the state at that particular time is greater than a predetermined threshold.
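The confidence gate in the last sentence, updating a state's statistics only when its posterior occupancy exceeds a threshold, can be sketched as follows. This is an illustrative sketch, not the patented method; the lattice forward-backward pass that produces the `posteriors` array is assumed, not shown, and `accumulate_stats` is a hypothetical helper name.

```python
import numpy as np

def accumulate_stats(frames, posteriors, num_states, threshold=0.5):
    """Accumulate per-state adaptation statistics, skipping frames whose
    posterior state occupancy falls below a confidence threshold.

    frames: (T, d) acoustic feature vectors.
    posteriors: (T, num_states) occupancy probabilities from a lattice.
    Returns per-state weighted feature sums and occupancy counts, the
    raw material for estimating an adaptation transform.
    """
    d = frames.shape[1]
    sums = np.zeros((num_states, d))
    counts = np.zeros(num_states)
    for t, x in enumerate(frames):
        for s in range(num_states):
            gamma = posteriors[t, s]
            if gamma > threshold:       # only confident frames contribute
                sums[s] += gamma * x
                counts[s] += gamma
    return sums, counts
```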

Speech Recognition Utilizing Multitude Of Speech Features

US Patent:
7464031, Dec 9, 2008
Filed:
Nov 28, 2003
Appl. No.:
10/724536
Inventors:
Scott E. Axelrod - Mount Kisco NY, US
Sreeram Viswanath Balakrishnan - Los Altos CA, US
Stanley F. Chen - Yorktown Heights NY, US
Yuqing Gao - Mount Kisco NY, US
Ramesh A. Gopinath - Millwood NY, US
Benoit Maison - White Plains NY, US
David Nahamoo - White Plains NY, US
Michael Alan Picheny - White Plains NY, US
George A. Saon - Old Greenwich CT, US
Geoffrey G. Zweig - Ridgefield CT, US
Assignee:
International Business Machines Corporation - Armonk NY
International Classification:
G10L 15/00
G10L 15/20
US Classification:
704236, 704240, 704251
Abstract:
In a speech recognition system, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances. The speech recognition system models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. The posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. Log-linear models are used with features derived from sparse or incomplete data. The speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features. Not all features used in training need to appear in testing/recognition.
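The log-linear posterior model the abstract describes has a standard form: the posterior of a linguistic unit is an exponentiated weighted feature score, normalized over all units. A minimal sketch (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def log_linear_posterior(features, weights):
    """Posterior over linguistic units under a log-linear model:
    p(u | x) = exp(w_u . f(x)) / sum_v exp(w_v . f(x)).

    features: (d,) feature vector f(x); the entries may come from
    asynchronous, overlapping, statistically non-independent streams,
    since the normalization does not require feature independence.
    weights: (num_units, d), one weight row per linguistic unit.
    """
    scores = weights @ features
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

Because each feature only contributes through its weight, a feature seen in training but absent at recognition time can simply be zeroed out, consistent with the abstract's note that not all training features need appear in testing.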

Speech Recognition Utilizing Multitude Of Speech Features

US Patent:
20080312921, Dec 18, 2008
Filed:
Aug 20, 2008
Appl. No.:
12/195123
Inventors:
Scott E. Axelrod - Mount Kisco NY, US
Sreeram Viswanath Balakrishnan - Los Altos CA, US
Stanley F. Chen - Yorktown Heights NY, US
Yuqing Gao - Mount Kisco NY, US
Benoit Maison - White Plains NY, US
David Nahamoo - White Plains NY, US
Michael Alan Picheny - White Plains NY, US
George A. Saon - Old Greenwich CT, US
Geoffrey G. Zweig - Ridgefield CT, US
International Classification:
G10L 15/00
G10L 15/04
US Classification:
704240, 704251, 704E15001, 704E15004
Abstract:
In a speech recognition system, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances. The speech recognition system models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. The posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. Log-linear models are used with features derived from sparse or incomplete data. The speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features. Not all features used in training need to appear in testing/recognition.

Methods And Apparatus For Forming Compound Words For Use In A Continuous Speech Recognition System

US Patent:
6385579, May 7, 2002
Filed:
Apr 29, 1999
Appl. No.:
09/302032
Inventors:
Mukund Padmanabhan - White Plains NY
George Andrei Saon - Putnam Valley NY
Assignee:
International Business Machines Corporation - Armonk NY
International Classification:
G10L 15/06
US Classification:
704243, 704256
Abstract:
A method of forming an augmented textual training corpus with compound words for use with a speech recognition system includes computing a measure for a consecutive word pair in the training corpus. The measure is then compared to a threshold value. The consecutive word pair is replaced in the training corpus with a corresponding compound word depending on the result of the comparison between the measure and the threshold value. One or more measures may be employed. A first measure is an average of a direct bigram probability value and a reverse bigram probability value. A second measure is based on mutual information between the words in the pair. A third measure is based on a comparison of the number of times a co-articulated baseform for the pair is preferred over a concatenation of non-co-articulated individual baseforms of the words forming the pair. A fourth measure is based on a difference between an average phone recognition score for a particular compound word and a sum of respective average phone recognition scores of the words of the pair.
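The first measure, averaging the direct bigram probability P(w2|w1) with the reverse bigram probability P(w1|w2) and comparing against a threshold, can be sketched directly from counts. This is an unsmoothed toy version, not the patented implementation, and `compound_word_pairs` is a hypothetical name:

```python
from collections import Counter

def compound_word_pairs(corpus_tokens, threshold=0.5):
    """Find consecutive word pairs whose averaged direct/reverse bigram
    probability exceeds a threshold; such pairs are candidates for
    replacement by a single compound word in the training corpus.
    (Sketch only: maximum-likelihood estimates, no smoothing.)"""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    candidates = {}
    for (w1, w2), n12 in bigrams.items():
        direct = n12 / unigrams[w1]    # P(w2 | w1)
        reverse = n12 / unigrams[w2]   # P(w1 | w2)
        measure = 0.5 * (direct + reverse)
        if measure > threshold:
            candidates[(w1, w2)] = measure
    return candidates
```

A high score means each word of the pair strongly predicts the other in both directions, which is exactly the behavior expected of a lexicalized unit like "new_york".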

End To End Spoken Language Understanding Model

US Patent:
20220319494, Oct 6, 2022
Filed:
Mar 31, 2021
Appl. No.:
17/218618
Inventors:
- Armonk NY, US
George Andrei Saon - Stamford CT, US
Zoltan Tueske - White Plains NY, US
Brian E. D. Kingsbury - Cortlandt Manor NY, US
International Classification:
G10L 15/06
G06K 9/62
G10L 13/02
Abstract:
An approach to training an end-to-end spoken language understanding model may be provided. A pre-trained general automatic speech recognition model may be adapted to a domain-specific spoken language understanding model. The pre-trained general automatic speech recognition model may be a recurrent neural network transducer model. The adaptation may use transcription data annotated with spoken language understanding labels. During adaptation, audio data may also be provided in addition to verbatim transcripts annotated with spoken language understanding labels. The spoken language understanding labels may be entity and/or intent based, with values associated with each label.

Chunking And Overlap Decoding Strategy For Streaming Rnn Transducers For Speech Recognition

US Patent:
20220277734, Sep 1, 2022
Filed:
Feb 26, 2021
Appl. No.:
17/186167
Inventors:
- Armonk NY, US
George Andrei Saon - Stamford CT, US
International Classification:
G10L 15/16
G06N 3/04
G06N 3/08
Abstract:
A computer-implemented method is provided for improving recognition accuracy of digital speech. The method includes receiving the digital speech. The method further includes splitting the digital speech into overlapping chunks. The method also includes computing a bidirectional encoder embedding of each of the overlapping chunks to obtain bidirectional encoder embeddings. The method additionally includes combining the bidirectional encoder embeddings. The method further includes interpreting, by a speech recognition system, the digital speech using the combined bidirectional encoder embeddings.
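The chunk-split-encode-combine pipeline in the abstract can be sketched as below. This is an illustrative sketch under one plausible combination rule (averaging embeddings where chunks overlap); the patent does not specify this exact scheme, and `chunked_embeddings` is a hypothetical helper that takes any stand-in encoder function.

```python
import numpy as np

def chunked_embeddings(frames, encoder, chunk_len=8, overlap=2):
    """Overlap-chunked decoding sketch: split a frame sequence into
    overlapping chunks, run a (stand-in) bidirectional encoder on each
    chunk, then combine per-frame embeddings by averaging wherever
    chunks overlap. `encoder` maps (n, d) frames -> (n, e) embeddings.
    Requires overlap < chunk_len."""
    T, _ = frames.shape
    step = chunk_len - overlap
    emb_dim = encoder(frames[:1]).shape[1]
    out = np.zeros((T, emb_dim))
    weight = np.zeros(T)
    start = 0
    while start < T:
        end = min(start + chunk_len, T)
        emb = encoder(frames[start:end])   # bidirectional within the chunk
        out[start:end] += emb
        weight[start:end] += 1.0
        if end == T:
            break
        start += step
    return out / weight[:, None]           # average the overlapped regions
```

Chunking caps the lookahead at one chunk, which is what makes a bidirectional encoder usable in a streaming RNN transducer setting.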

Customization Of Recurrent Neural Network Transducers For Speech Recognition

US Patent:
20220208179, Jun 30, 2022
Filed:
Dec 29, 2020
Appl. No.:
17/136439
Inventors:
- Armonk NY, US
George Andrei Saon - Stamford CT, US
Brian E. D. Kingsbury - Cortlandt Manor NY, US
International Classification:
G10L 15/16
G10L 25/30
G10L 13/02
G06N 3/08
Abstract:
A computer-implemented method for customizing a recurrent neural network transducer (RNN-T) is provided. The computer-implemented method includes synthesizing first domain audio data from first domain text data, and feeding the synthesized first domain audio data into a trained encoder of the recurrent neural network transducer (RNN-T) having an initial condition, wherein the encoder is updated using the synthesized first domain audio data and the first domain text data. The computer-implemented method further includes synthesizing second domain audio data from second domain text data, and feeding the synthesized second domain audio data into the updated encoder of the recurrent neural network transducer (RNN-T), wherein the prediction network is updated using the synthesized second domain audio data and the second domain text data. The computer-implemented method further includes restoring the updated encoder to the initial condition.
George A Saon from Wilton, CT, age ~54