JSR-043: JTAPI-1.4

javax.telephony.media
Interface ASR

All Superinterfaces:
ASRConstants, MediaConstants, Resource, ResourceConstants

public interface ASR
extends Resource, ASRConstants

Automatic Speech Recognition API.

The Automatic Speech Recognition (ASR) resource performs recognition and/or training. An ASR resource that performs recognition has associated with it a recognition algorithm which uses a context to recognize words from utterances presented in an input media stream, returning recognition results to the application.

An ASR resource that performs training creates or updates words in a context for use by the recognition algorithm.

An ASR resource configured in a group contains a default context, containing a set of words that the ASR's recognition algorithm can potentially match in an input media stream. The context may contain a grammar which constrains the search space searched by the recognition algorithm. Contexts have a large number of associated parameters, including the language and language variant of its words, the list of words in the context (if the context contains words), and other information required for ASR. The recognition algorithm returns recognition results. The recognition results are a sequence of strings associated with the recognized words, along with some details describing the results.

A client may train a word within a context, by collecting utterances and presenting them to the resource's training algorithm. The ASR resource interacts with the application to collect a sufficient number of utterances to train the word correctly.

A client may load a context from a container into an ASR resource, and may store a context from an ASR resource into a container.

When supported, the ASR resource may be used for speaker identification and speaker verification, e.g., by training a context with utterances from a particular speaker.

When supported, the ASR resource may be used for language identification.

ASR resources are described by a large set of attributes, which indicate the various types of ASR capabilities that may be supported by a resource, and are controlled by a large set of parameters which govern the operation of the ASR resource.

Automatic Speech Recognition is a rapidly evolving technology, with new advances appearing regularly. The JTAPI framework is general enough to accommodate a wide range of technologies expected to enter common usage over the next several years. Most vendors' ASR resources will provide a subset of the features available in the framework.

Operation

This section describes the operation of the ASR resource for recognition and training.

Recognition State Machine

The following figure is the ASR resource state diagram for recognition.

The defined states are v_Idle, v_RecognitionPaused, v_Recognizing, and v_ResultsAvailable.

Idle State: v_Idle

In the v_Idle state, the ASR resource is quiescent, performing no recognition operations. It is in this state after a group configuration, a handoff, or an idleASR() command.

The resource transits from this state to the v_Recognizing or v_RecognitionPaused state when a client application requests a recognition operation. Upon entering this state, all recognition results stored in the resource are lost.

Recognition Paused State: v_RecognitionPaused

In the v_RecognitionPaused state, the ASR Resource has been prepared with all parameter settings, but has not yet started to execute the recognition algorithm on its input media stream. This state is entered from the v_Idle state by issuing the command recognize() with the optional parameter p_StartPaused set to TRUE.

Recognition begins when the RTC action ASR.rtca_Start is received or the command startASR() is executed (the Recognizer transits to the v_Recognizing state).

Recognizing State: v_Recognizing

In the v_Recognizing state, the resource executes its recognition algorithm on the input media stream, using the current set of active contexts and active words.

While in this state, the Recognizer may create intermediate results, which can be reported to the application via the unsolicited event ev_IntermediateResultsReady. The application can retrieve intermediate results via the command getResults(). Intermediate recognition results are cumulative: since prior results of recognition can change as more of the utterance is processed, the complete result string since the beginning of recognition is returned each time. (Final results, as described in the state v_ResultsAvailable, are not affected by the retrieval of intermediate results.)

The Recognizer transits to v_ResultsAvailable when the ASR resource's current recognition task is completed, when the command stopASR() is issued by the application, or when the RTC ASR.rtca_Stop is received.

Results Available State: v_ResultsAvailable

The v_ResultsAvailable state is reached when a recognition operation completes, when the application issues the command stopASR(), or when the RTC ASR.rtca_Stop is received. In the v_ResultsAvailable state, final ASR results are available and can be retrieved with the getFinalResults() command. Retrieving the recognition results clears the results from the resource and causes the Recognizer to transit to the v_Idle state.

Using the Recognition State Machine

Recognition is a process which transforms the incoming media stream, assumed to contain utterances of a speaker, into text that can be processed by a client.

This section discusses how to use a Recognizer which obeys the recognition state machine to perform common ASR tasks.

Typical Recognition Operation
A typical recognition scenario is when a prompt is played, followed by recognition. To accomplish this, the application uses delayed start of recognition (see section Immediate Start and Delayed Start of Recognition), and plays a prompt using a Player resource. The end of the prompt triggers a run-time control to the Recognizer (see section Runtime Control), which then starts recognition. After recognition is over, the application retrieves the recognition results, interprets them, and starts a new round of recognition.
Immediate Start and Delayed Start of Recognition
Recognition begins immediately if the Recognizer transits from v_Idle to v_Recognizing. Delayed start occurs after a transition from the v_Idle to the v_RecognitionPaused state: recognition starts after the transition to the v_Recognizing state. Delayed start is used for synchronization, for example, to synchronize the beginning of recognition with the end of a prompt.
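A minimal sketch of delayed start, assuming an object 'asr' that implements ASR and a Player 'player' on the same media stream; the prompt name and the threading arrangement are illustrative, not defined by this specification:

    // Ask the Recognizer to start in the paused state.
    java.util.Dictionary optargs = new java.util.Hashtable();
    optargs.put(ASR.p_StartPaused, Boolean.TRUE);

    // recognize() blocks until recognition completes, so in practice it runs
    // on its own thread (omitted here):
    //     ASREvent done = asr.recognize(new RTC[0], optargs);

    // Meanwhile, play the prompt; when play() returns, un-pause the Recognizer:
    player.play("welcome-prompt", 0, new RTC[0], null);  // illustrative prompt name
    asr.startASR();  // v_RecognitionPaused -> v_Recognizing
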
Stopping Recognition: Resource Control and Application Control
Once recognition has started, there are two main programming styles used to stop recognition. Resource-controlled stop is obtained by pre-defining, to the Recognizer, the stop conditions to be used by the Recognizer. These conditions can be internal to the Recognizer and/or based on RTC Conditions from other resources. For example, a typical application will define a maximum recognition time window and some RTC conditions from the Signal Detector resource. If the utterance exceeds the maximum time window, or if the Signal Detector detects a DTMF tone, recognition stops.

Application-controlled stop, the other main programming style used to stop recognition, means that the application program stops recognition directly; to make this style workable, the application program must have access to intermediate results. For example, the application program starts recognition without imposing time limits on the recognition and then monitors intermediate results. The application stops the recognizer after the application receives a specific keyword. Another possible use of this application programming style is when the Recognizer can send only raw results to a smart application.
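A sketch of this style, assuming the Recognizer reports intermediate results; the keyword, the event-driven loop, and the getSequenceLength() accessor name are illustrative assumptions:

    // After an ev_IntermediateResultsReady event, inspect the cumulative result
    // and stop the Recognizer if the keyword has been spoken.
    ASREvent ev = asr.getResults();
    ASREvent.TokenSequence seq = ev.getTokenSequence(0);  // best hypothesis so far
    for (int o = 0; o < seq.getSequenceLength(); o++) {   // assumed accessor name
        if ("goodbye".equals(seq.getToken(o))) {          // illustrative keyword
            asr.stopASR();                                // transits to v_ResultsAvailable
            break;
        }
    }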

Note that these two programming styles may be mixed. For example, even if the stop is application-controlled, the application may set up a resource-controlled stop to occur if the telephone call is disconnected (although the disconnect-to-stop behavior is enabled by default for all S.410 resources).

Intermediate Results and Final Results
If reporting of intermediate results is supported by a particular resource, then intermediate results can be read, after the unsolicited event ev_IntermediateResultsReady, by using the getResults() function. The function may be used both in the v_Recognizing and in the v_ResultsAvailable states without altering the state of the Recognizer. Final results are read by using the getFinalResults() function; this function may only be used in the v_ResultsAvailable state and causes a transition to the v_Idle state.

If the Recognizer supports recognition of sequences of words, the Recognizer is also likely to support the retrieval of intermediate results while recognition is active. (As noted above, the Recognizer must support the retrieval of intermediate results if the application uses the programming style of application-controlled stop.) Please note that not all Recognizers support intermediate results, and therefore some of the scenarios described below will not be available with all Recognizers.

When a Recognizer stops, it changes state from the v_Recognizing state to the v_ResultsAvailable state. The Recognizer provides a summary result along with the event associated with the end of recognition; more detailed information about the results may be obtained by subsequently using getFinalResults(). getFinalResults() causes the Recognizer to transit from the v_ResultsAvailable state to the v_Idle state. The difference between summary results obtained automatically at the end of recognition and detailed results obtained via the getFinalResults() function are detailed below in section Recognition Results.

It is possible to transit from the v_ResultsAvailable state to the v_Recognizing state by using the recognize() function (e.g., to speed up the recognition of a sequence of isolated words). In this case, at the end of the recognition of each utterance, the summary result is in the corresponding event, while detailed results can be read with getResults(). After the final utterance of the sequence is recognized, getFinalResults() may be used, detailed results read, and the Recognizer will transit from the v_ResultsAvailable state to the v_Idle state.

A transition to v_Idle state resets the Recognizer and deletes all recognition results. The function idleASR() (and the corresponding RTC Action rtca_Idle) may be used to force the end of a recognition, but will also result in a loss of any recognition results. The function stopASR() (and the corresponding RTC Action rtca_Stop) forces the termination of recognition of an utterance, but the Recognizer enters the state v_ResultsAvailable, and the recognition results are still accessible via getFinalResults().

Parameters may only be set while the Recognizer is in the v_Idle or v_RecognitionPaused state. Functions that read recognition parameters are legal in any Recognizer state.

Training State Machine

The following figure is the ASR resource state diagram when performing training operations. The defined states are v_Idle, v_TrainingPaused, v_Training, and v_WordTrained.

Idle State, v_Idle

In the v_Idle state, the ASR resource is quiescent, performing no training operations. It is in this state after a group configuration, a handoff, an idleASR() command, or a wordCommit() command.

The ASR resource transits to the v_Training or v_TrainingPaused state when the application requests a training operation via wordTrain().

When the ASR resource enters v_Idle, all training information that has not been committed to the Context is lost.

Training Paused State, v_TrainingPaused

In the v_TrainingPaused state, the ASR resource has been prepared with all parameter settings, but has not yet started to execute the training algorithm on its input media stream.

This state is entered from the v_Idle state by issuing the command wordTrain() with the optional parameter p_StartPaused set to true. Training begins when the RTC action rtca_Start or the command startASR() is received. (The resource transits to the v_Training state.)

Training, v_Training

In the v_Training state, the resource executes its training algorithm for a particular word and a particular context on the input media stream. The resource transits to the v_WordTrained state when the training operation is completed, or when the command stopASR() or the RTC rtca_Stop is received.

Word Trained State, v_WordTrained

In the v_WordTrained state, a training result has been reached and a temporary word model of the trained word has been updated. The results of the training operation can be discarded via the command wordDeleteLastUtterance(), further updated via the command wordTrain(), or committed to the context's model for the word via the command wordCommit(). The command wordDeleteLastUtterance() causes the state to transit to v_WordTrained again. wordTrain() causes the resource to transit to the v_Training state for further training. wordCommit() commits the training to the word, deletes the temporary word model, and the ASR resource transits to v_Idle.

Using the Training State Machine

Training is the process by which the ASR resource collects audio information (or other information), and associates this information with Words in an active Context on the ASR resource, so that the Word may be used by a Recognizer to perform recognition. When training is over, the ASR resource issues a single completion event. The parameter p_NumberRepetitions, available in some Recognizers, may be set to dictate how many training utterances the Recognizer should use in AutomaticTraining mode.

Applications will often use wordTrain() and wordCommit() in a pair of loops. The outer loop will prompt the user to determine if any training is required. The inner loop will prompt the user for suitable utterances and use wordTrain() to perform training. The outer loop will then use wordCommit() to modify the Context permanently.
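A minimal sketch of these two loops, assuming 'asr' implements ASR; the context and word names, the prompting helpers, and the getReadiness() accessor on the completion event are illustrative assumptions:

    while (userWantsToTrain()) {                       // outer loop (hypothetical helper)
        asr.wordCreate("Spanish/Bank", "tempWord");    // temporary word model
        Symbol readiness = ASR.v_NotReady;
        while (ASR.v_NotReady.equals(readiness)) {     // inner loop: collect utterances
            promptUserForUtterance();                  // hypothetical helper
            ASREvent ev = asr.wordTrain("Spanish/Bank", "tempWord");
            if (ASR.q_Warning.equals(ev.getQualifier())) {
                // The latest utterance was problematical; discard it.
                asr.wordDeleteLastUtterance("Spanish/Bank", "tempWord");
            }
            readiness = ev.getReadiness();             // assumed accessor; see section Training Results
        }
        asr.wordCommit("Spanish/Bank", "tempWord", "transferencia");
    }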

Immediate Start and Delayed Start of Training

Training begins immediately when the Recognizer transits from v_Idle to v_Training. Delayed start occurs with a transition from the v_Idle to the v_TrainingPaused state: actual training starts after the transition to the v_Training state. Delayed start is used for synchronization, for example, to synchronize the beginning of training with the end of a prompt.

Internal States of the Word Models

Training states of a word model are not defined by this specification. Instead, the ASR resource returns as part of its training results a Symbol whose value indicates the readiness, with the possible values of v_Ready, v_NotReady, or v_Complete.

The criteria by which the ASR resource determines readiness are controlled by a set of parameters defined in Table 5-12. The determination can be made autonomously by the ASR resource (automatically, on the basis of decision-making by the resource's training algorithm), or can be set to be a fixed number of training iterations on a single word.

Automatic Training

Some Recognizers may perform training automatically, issuing prompts and collecting utterances until a sufficient number of good utterances have been collected. Such Recognizers can run in automatic mode when the parameter p_AutomaticTraining is set to true. These ASR resources remain in the state v_Training until training is complete.

Contexts

A Context is an opaque object that an ASR resource's recognition algorithm uses to recognize utterances. A Context may contain a set of algorithm-dependent models representing words, or a grammar used to constrain the algorithm's search space, or other information needed by the ASR resource. Contexts have a large set of associated parameters used by the application to control the operation of the ASR resource. When stored as a data object in a container, some of these parameters are accessible as ContainerInfo information elements; when loaded into an ASR resource, they are accessible as resource parameters.

The Java Speech Grammar Format (JSGF) is supported by this specification, along with commands and parameters to control the grammars. In addition, this specification supports commands to facilitate speaker-dependent systems: to create word models (i.e., "train" the words), remove word models from contexts, and load/store contexts.

A Context is stored as a container data object with media type "file". When stored as a data object, some Context parameters may be read as ContainerInfo information elements of the data object.

A recommended convention is to use a context for domain-specific (e.g., banking, directions) or speaker-specific information, and to store multiple contexts representing information in the same language or for the same speaker in a common container. This permits the use of the term "context name" to refer to the language/domain pair and to refer to the container and data object in which the context is stored. For example, a context for the banking domain in Spanish would have the data object name "Spanish/Bank" (Container/Object); the context's context name would be Spanish/Bank.

Context Configuration

An ASR resource performs its recognition and training operations only with respect to contexts that are loaded into the resource; a Context that is loaded onto the resource is referred to as a loaded context. When configured into a group, an ASR resource has a default context established by the administration function. A client may load an additional context via the command contextCopy(), and remove a context via the command contextRemove().

An ASR resource performs its recognition operations with respect to active contexts and active words. The ASR parameter p_ActiveContexts is a list of active contexts. The ASR parameter p_ActiveWords is a list of active words. Only word models from contexts in the ActiveContexts list whose names are in the ActiveWords list are used by the recognition algorithm.

These parameters may be set non-persistently as an argument to the command recognize(), or persistently via the command setParameters(), in those Recognizers that have settable context lists and/or word lists.
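For example, a sketch of both styles; 'ms' stands for the object providing the setParameters() command referenced above, and the context and word names are illustrative:

    // Persistent style: set the active lists on the resource.
    java.util.Dictionary params = new java.util.Hashtable();
    params.put(ASR.p_ActiveContexts, new String[] { "Spanish/Bank" });
    params.put(ASR.p_ActiveWords, new String[] { "saldo", "transferencia" });
    ms.setParameters(params);

    // Non-persistent style: pass the same keys as optional arguments to recognize().
    java.util.Dictionary optargs = new java.util.Hashtable();
    optargs.put(ASR.p_ActiveContexts, new String[] { "Spanish/Bank" });
    ASREvent result = asr.recognize(new RTC[0], optargs);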

Grammars

The ECTF supports the Java Speech Grammar Format (JSGF), a grammar specification language described by the Java consortium (http://java.sun.com/products/java-media/speech). The JSGF should be consulted for a complete description of the grammar specification language.

Grammars are contained within Contexts. A Context can contain at most one grammar; Contexts which contain grammars are referred to as Grammar Contexts.

Grammars contain public rules and private rules. Public rules may be imported by other grammars and can be made active or inactive.

When a Grammar Context is loaded using the contextCopy() function, the grammar is inactive, and all its rules are inactive. To activate a Grammar Context, set the parameter p_ActiveContexts to include the Grammar Context. When the Grammar Context becomes active, all the rules described in the Grammar Context attribute a_DefaultRules become active. The active rules can be modified by setting the parameter p_ActiveRules.
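A sketch of activating a Grammar Context and then narrowing its active rules; the context and rule names are illustrative, the direction Symbol for contextCopy() is assumed to be v_ToResource for a Container-to-Resource copy, and 'ms' provides the setParameters() command:

    // Load the Grammar Context from its container, then make it active:
    asr.contextCopy("directions", "English/Directions", ASR.v_ToResource);
    java.util.Dictionary params = new java.util.Hashtable();
    params.put(ASR.p_ActiveContexts, new String[] { "directions" });
    ms.setParameters(params);  // the rules named in a_DefaultRules become active

    // Later, restrict recognition to a single public rule:
    params = new java.util.Hashtable();
    params.put(ASR.p_ActiveRules, new String[] { "cityName" });  // illustrative rule name
    ms.setParameters(params);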

Grammar Context Attributes and Parameters

This section lists attributes and parameters specific to Grammar Contexts. The following attributes of Grammar Contexts may be queried by the application using standard commands.

Attribute       Description
a_DefaultRules  public rule names that become active when the Grammar Context becomes active
a_Rules         rules (public and private) in the grammar
a_PrivateRules  private rules in the grammar
a_PublicRules   public rules in the grammar

The parameters of a Grammar Context may be examined and set (if appropriate) using the contextGetParameters() and contextSetParameters() functions. The parameter p_ActiveRules contains the current complete list of active public rules and may be used to control which public rules are active. When the Grammar Context first becomes active, the contents of p_ActiveRules will be identical to those of a_DefaultRules.

Parameter       Description
p_Rules         grammar rules within a loaded grammar context
p_PrivateRules  private grammar rules within a loaded grammar context
p_PublicRules   public grammar rules within a loaded grammar context
p_ActiveRules   active grammar rules within a loaded grammar context

Grammar Context Functions

A public or private rule has a name and a rule expansion. The rule expansion may contain a series of words to be recognized as part of the rule, the names of other rules, and other information as specified in the JSGF.

Applications may modify rule expansions. For example, if a rule expansion contains a list of street names, the application may change the list of street names during the course of the application to reflect the choice of city name made by the user.

The application may retrieve a rule expansion by using the function getRuleExpansion(), and set a rule expansion by using the function setRuleExpansion().
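A sketch of updating a rule expansion at run time; the context name, rule name, and JSGF expansion text are illustrative:

    // Read the current expansion of a rule:
    String expansion = asr.getRuleExpansion("directions", "streetName");

    // Replace it to reflect the city the caller selected:
    asr.setRuleExpansion("directions", "streetName",
        "main street | elm street | oak avenue");  // illustrative JSGF alternatives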

Grammars and Barge-In

Grammars may be used to support barge-in. See the discussion of barge-in in section Barge-In.

Grammars and Results

The JSGF grammar specification language supports grammar tags to attach application-specific information to parts of rule definitions. The grammar tags do not affect recognition; instead, they are a mechanism to provide specific information to the application about which part of the grammar was traversed.

The grammar tags are reported as part of the results of recognition. See section Recognition Results for a description of how grammar tags are reported in results.

Language and Variant Labels

For the purposes of ASR and TTS, languages are identified by a language identifier (e.g., English, German) and a variant identifier (e.g., US English). These identifiers are used as ContainerInfo information elements on TVMs, grammars, and contexts, to identify the represented, encoded, or recognized language(s) and language variant(s). They are used as attributes for ASR and TTS-Player resources to identify the languages that can be recognized or decoded. In language identification applications within ASR, they are also used as parameters to indicate the language that a recognizer attributes to a recognized utterance.

Language and variant identifiers are denoted by symbols. The item field of the symbol is a token adapted from the descriptions of ISO 639-2, with punctuation and whitespace removed and individual words in the description capitalized, conformant with the symbol nomenclature followed by S.410. For example, "Italian" in ISO 639-2 becomes the Symbol v_Italian.

ECTF S.100-R2 Volume 4 contains the currently-defined language and variant identifiers. Not all languages defined in ISO 639-2 are represented in this table, since Recognizers do not exist for many of these languages, and indeed may never exist for some of them. An ASR vendor who wishes to support a language or variant not in this table may use a vendor-defined symbol for the language or variant identifier, and may contribute the identifiers so defined to the ECTF for incorporation within the ECTF symbol name space (the identifiers should follow the conventions of ISO 639-2, as modified).

Runtime Control

Actions

The following table lists the actions supported by the Recognizer, and the events that the Recognizer generates after performing the actions.

Action                 Definition                                          Commands Affected         Next State                         Event Generated
rtca_Stop              stop recognizing                                    recognize(), wordTrain()  v_ResultsAvailable, v_WordTrained  ev_Recognize or ev_WordTrained
rtca_Idle              stop recognizing and return to the v_Idle state    all                       v_Idle                             ev_Idle
rtca_UpdateParameters  update parameters given in p_UpdateParametersList  recognize(), wordTrain()  v_Recognizing, v_Training          (none)
rtca_Start             start recognizing                                   recognize(), wordTrain()  v_Recognizing, v_Training          ev_Start

Conditions

This table lists the internal conditions identified by an ASR resource that are usable for triggering run-time controls of another resource, e.g., for starting or stopping a Player resource.

Condition                 Definition
rtcc_RecognizeStarted     Recognizer has started operating
rtcc_TrainStarted         Recognizer has started training
rtcc_SpeechDetected       Speech has been detected
rtcc_SpecificUtterance    A specifically defined utterance was detected. The parameter p_SpecificUtterance (see Table 5-12) is used to set the utterances that trigger this condition
rtcc_Recognize            Recognition has terminated normally (this does not signify that a valid utterance was found)
rtcc_WordTrain            Word training has terminated normally (this does not signify that training is complete)
rtcc_WordTrained          Word training is complete
rtcc_ValidUtteranceFound  A valid utterance has been detected, even though recognition may not be over
rtcc_ValidUtteranceFinal  A valid utterance has been detected and recognition has completed
rtcc_Silence              Silence has been detected
rtcc_InvalidUtterance     An invalid utterance has been detected
rtcc_GrammarBargeIn       The grammar has detected barge-in (it has found the label v_BargeInHere). See section Grammar-Based Barge-In

Parameters

Introduction

Parameters control various aspects of the Recognizer. This section describes the officially-supported parameters.

The descriptions are phrased to describe the true case. For example, if the result for p_SpeechRecognition is false, then speech recognition is not supported.

ASR Resource Parameter Categories

The Context control parameters of Table 5-5 indicate the loaded words and contexts, and indicate and control the active words and contexts that may be used in a recognition or training operation (see section Contexts).

Name              Data Type  State when settable  Default Value        Possible Values  Definition
p_ActiveWords     String[]   v_Idle               All Loaded Words                      List of active Words
p_LoadedWords     String[]   Read-Only                                                  List of loaded Words
p_ActiveContexts  String[]   v_Idle               All Loaded Contexts                   List of active Contexts
p_LoadedContexts  String[]   Read-Only                                                  List of loaded Contexts

The Context Information Parameters of Table 5-6 specify information about a Context. If the Context is loaded on a Resource, this information may be accessed via the contextGetParameters() and contextSetParameters() functions. Several Contexts may be loaded on the resource, each with its own set of parameters.

Note that any attempt to set a parameter of the Context will be checked against the Resource's capabilities. If an attempt is made to set a parameter to indicate support for a capability that the Resource does not actually support, the attempt will fail.

If the Context is not loaded on a resource, but is in a container, then these context parameters are accessible via the Context Parameter's ContainerInfo information element.

Context Parameter  Data Type  State when Settable  Possible Values  Description

p_Trainable      CTbool    WordTrained  true, false  Whether this context can be trained
p_SpeakerType    Symbol    WordTrained  v_Independent, v_Dependent, v_Verification, v_Identification  The type of recognition possible
p_Label          String    WordTrained               An abstract string identifying a set of words and/or phrases. This may be preloaded
p_Language       Symbol[]  Read only    See Language list of Volume 4  The language(s) recognized
p_Variant        Symbol[]  Read only    See Variant list of Volume 4   The variant(s) recognized
p_DetectionType  Symbol    WordTrained  Continuous, Discrete
p_Spotting       Symbol    WordTrained  Available, NotAvailable
p_UtteranceType  Symbol    WordTrained  Phonetic, Speech, Text  The manner in which the utterance data is described to the recognizer for training
p_Size                     Read only                 A vendor-defined heuristic describing the space requirements of the context when resident on the resource. Size is implicitly adjusted when training is performed on the resource
p_ContextType    Symbol    WordTrained  v_JSGF       The actual type of the context to be loaded. The only ECTF-defined symbol is one for Context Grammars that support the Java Speech Grammar Format. Otherwise, this is a vendor-defined symbol that is used by the vendor to distinguish between specific data formats used internally by the vendor's Resource.

The Speech input control parameters of Table 5-7 specify parameters for determining speech boundaries (e.g., timeout thresholds for end of speech). They also enable or disable echo cancellation and audio prompt generation and cancellation.

Name Data Type State when settable Default Value Possible Values Definition
p_Duration Int v_Idle v_Forever 0 to ? and v_Forever maximum time window (in ms). At the end of that time the recognition terminates
p_InitialTimeout Int v_Idle v_Forever 0 to 1000000 or v_Forever Initial silence timeout in ms. If no utterance is detected in this period, recognition is terminated and the Recognizer will notify the application that silence has been detected.
p_FinalTimeout Int v_Idle v_Forever 0 to 1000000 or v_Forever Silence time in ms after utterance to indicate completion of the recognition
p_EchoCancellation Boolean v_Idle false true, false Echo cancellation active
p_GrammarBargeInThreshold Int v_Idle 500 0-1000 Minimum score at a barge-in node required to raise the rtcc_GrammarBargeIn RTC condition. The score is the same "Score" described in section Scores.

The state of the Recognizer can be determined by reading the following parameter.

Name Data Type State when settable Default Value Possible Values Definition
p_State Symbol Read Only v_Idle v_Idle, v_RecognitionPaused, v_Recognizing, v_ResultsAvailable, v_TrainingPaused, v_Training, v_WordTrained State of the Recognizer

The Output control parameters of Table 5-9 specify the format in which recognition results are provided. The number of alternative recognition hypotheses, the presentation in text or in phonetic spelling, and the inclusion of warning tokens are selected by these parameters.

Name Data Type State when settable Default Value Possible Values Definition
p_NumberWordsRequested Int v_Idle 1   Number of words to return (e.g., in a connected digit ASR, the number of digits to return)
p_OutputType Symbol v_Idle v_Text v_Text, v_Phonetic The type of output result provided
p_TopChoices Int v_Idle 1 Any positive number Number of alternative results returned
p_ResultType Symbol v_Idle v_Final v_Final, v_Intermediate Whether results returned by getResults() are final summaries or intermediate results
p_ResultPresentation Symbol v_Idle v_TypeII v_TypeI, v_TypeII Method of presentation of ASR results

The Training control parameters of Table 5-10 are used to control the means by which the end of training for a word is reported. A client may select automatic training (in which the resource decides when enough training has been done), may set the number of repetitions, and may enable or disable unsolicited events related to training.

Name Data Type State when settable Default Value Possible Values Definition
p_AutomaticTraining Boolean v_Idle false true, false Automatic training active
p_NumberRepetitions Int v_Idle 1 Vendor specific Number of repetitions to perform in a training loop when automatic training is enabled
p_EnabledEvents Dictionary v_Idle Empty Any unsolicited event The elements in the KVSet form the list of which unsolicited events are enabled; only events on the list are reported by the Resource. (Only the Keys in this KVset are used; the Values are ignored)
p_TrainingType Symbol v_Idle vendor specific v_Speech, v_Phonetic, v_Text The type of input used to perform training

The Speech buffer control parameters of Table 5-11 enable a mode in which a previously-detected utterance is used as input.

Name Data Type State when settable Default Value Possible Values Definition
p_StoreInput Boolean v_Idle false true, false Acquired utterance is saved in a resource internal buffer for reuse
p_PlayInput Boolean v_Idle false true, false ASR internal buffer is reused as a new input

The RTC control parameters in Table 5-12 refine the behavior of the RTC conditions defined for the ASR resource.

Name Data Type State when settable Default Value Possible Values Definition
p_StartPaused Boolean v_Idle false true, false The Recognizer starts in Paused mode, transiting to the RecognitionPaused or TrainingPaused state. The Recognizer remains in the Paused state until it receives a Start RTC (or function). This parameter is valid for the recognize() and wordTrain() functions only.
p_SpecificUtterance String[] v_Idle     Each value in the array represents an utterance. When any of these utterances is recognized, the rtcc_SpecificUtterance condition will be triggered.

Additional Features

Echo Cancellation

When a prompt is played, an echo is often returned on the input media stream; this echo can be detected by a Recognizer and can cause errors in recognition. Echo cancellation removes the echo from the input media stream. Echo cancellation provided by a Recognizer is controlled by the parameter p_EchoCancellation, which when set to true causes echo cancellation to run continuously on the input media stream. Echo cancellation is stopped by setting p_EchoCancellation to false. The attribute a_EchoCancellation indicates whether the Recognizer can provide echo cancellation.
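For example, a brief sketch that enables continuous echo cancellation; 'ms' stands for the object providing the setParameters() command referenced earlier:

    java.util.Dictionary params = new java.util.Hashtable();
    params.put(ASR.p_EchoCancellation, Boolean.TRUE);  // run echo cancellation continuously
    ms.setParameters(params);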

Barge-In

Barge-in (or "cut-through") allows a user to speak while a prompt is playing, have the prompt stop playing, and be recognized. Barge-in generally requires echo cancellation (so that the echo of the prompt is not confused with the utterance); and cooperation with the Player resource, so that the prompt terminates when an utterance is either detected or recognized.

By way of illustration, here are two possible RTCs, each of which would accomplish barge-in:
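A sketch of the two pairings, assuming the two-argument RTC(trigger, action) constructor of javax.telephony.media.RTC; the variable names are illustrative:

    // 1. Stop the prompt as soon as any speech is detected:
    RTC stopOnSpeech = new RTC(ASR.rtcc_SpeechDetected, Player.rtca_Stop);

    // 2. Stop the prompt only when a valid (recognized) utterance is found:
    RTC stopOnValidUtterance = new RTC(ASR.rtcc_ValidUtteranceFound, Player.rtca_Stop);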

The first of these two RTCs causes the player to stop when a speaker begins to speak, i.e., when the Recognizer detects that some speech has occurred. The second RTC causes the player to stop only when a meaningful utterance has been received, to guard against stopping a prompt because of extraneous noise (e.g., traffic, throat-clearing).

Grammar-Based Barge-In

As described above in section Grammars and Results, the JSGF grammar specification language supports grammar tags, which attach application-specific information to parts of rule definitions without affecting recognition; they are a mechanism to tell the application which part of the grammar was traversed.

The ECTF has extended the JSGF grammar specification language to include a special tag to support barge-in, v_BargeInHere. Unlike other grammar tags, which are merely reported to the application as part of the result, this grammar tag affects the operation of the ASR resource. When the ASR resource encounters v_BargeInHere in the grammar, the ASR resource raises the RTC condition rtcc_GrammarBargeIn. In turn, this RTC condition may be used (as illustrated above) to control the operation of a Player resource to effect barge-in.

The parameter p_GrammarBargeInThreshold is used to determine the confidence level at which the ASR resource raises the RTC condition rtcc_GrammarBargeIn.

Parameter Updates During Barge-In

Often a different set of speech recognition parameters (e.g., timeout windows) must be used when a prompt is playing. The Dictionary p_UpdateParametersList may be used to update parameters while speech recognition is active. p_UpdateParametersList consists of keys that are the names of ASR parameters, and values appropriate for those keys. When the function updateParameters() is called, or the RTC action rtca_UpdateParameters is received, the Recognizer will update each parameter in p_UpdateParametersList with its corresponding value. Not all parameters may be included on this list: parameters that control timeouts can be altered, but parameters that control the nature of recognition results, the way recognition starts and stops, etc., cannot.

For example, when recognition with barge-in first begins, the parameter p_InitialTimeout will be set to the value v_Forever to signify that there is no initial timeout while the prompt is playing. However, after play has completed, the UpdateParameters RTC action can be used to set p_InitialTimeout to some reasonable value.
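A sketch of that scenario; 'ms' provides the setParameters() command, and the 4000 ms value is illustrative:

    // While the prompt plays, disable the initial-silence timeout, and queue
    // the value to apply once the prompt completes:
    java.util.Dictionary updates = new java.util.Hashtable();
    updates.put(ASR.p_InitialTimeout, new Integer(4000));  // illustrative post-prompt timeout

    java.util.Dictionary params = new java.util.Hashtable();
    params.put(ASR.p_InitialTimeout, ASR.v_Forever);
    params.put(ASR.p_UpdateParametersList, updates);
    ms.setParameters(params);

    // ... start recognition and play the prompt ...

    // When play completes, apply the queued values (or attach the RTC action
    // rtca_UpdateParameters to a suitable condition instead):
    asr.updateParameters();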

Recognition Results: ASREvent.TokenSequence

Overview

The token is the fundamental unit of recognition: it may be a phrase, a word or a sub-word unit (such as a phoneme), depending on the Recognizer. Tokens appear in a sequence; a sequence represents a hypothesis for a transcription of the utterance. The result of a recognition operation consists of one or more sequences of recognition tokens. Additional information elements may be associated with individual tokens of a sequence, with a sequence itself, or with the recognition results as a whole.

Recognition results are delivered in an ASREvent, the return value of recognize(), getResults(), or getFinalResults(). Results are accessed using getTokenSequence(int row); the result accessors are defined in ASREvent.

The following is a literal copy from ECTF S.100-R2, included here for reference:

Recognition results are returned as a collection of one-dimensional and two-dimensional matrices. Among the two-dimensional matrices, there are: a token matrix T, a qualifier matrix Q, a grammar tag matrix G, and a token score matrix S. The one-dimensional matrices will be discussed below. In this chapter, these matrices are referred to as the recognition results matrices.

Each matrix element t[r][o] of T is a token. The row index r of token t[r][o] indicates its ranking, that is, its likelihood of being a correct recognition result. The column index o of t[r][o] indicates its order within a sequence with respect to other tokens; o=0 is the first element of the sequence.

A row of matrix T is referred to as a sequence; in a Type II recognizer (see below), a sequence represents a hypothesis for a transcription of the utterance. For any matrix M with elements m[r][o], we represent row i by the notation M[i].

The most likely transcription of an utterance is the top row of matrix T, T[0], {t[0][0], t[0][1], ... , t[0][N-1]}, where N is the maximum number of tokens in the sequence.

The other matrices Q, G, and S have the same dimensions as T; their matrix elements contain information about the corresponding matrix elements of T. For example, each element s[i][j] of S indicates the likelihood score (see section Scores) of t[i][j].

One-dimensional matrices m[r] provide information about the individual sequences T[r] of matrix T. For example, each member s'[r] of the matrix SequenceScore (see Table 5-15) represents the overall confidence of the Recognizer in the sequence T[r].

The recognition results matrices are returned as an ordered set of TokenSequence objects. Each TokenSequence[r] contains all of the information pertaining to row T[r], plus the parallel rows from the other matrices: Q[r], S[r], G[r], s'[r], etc. That is, for a two-dimensional matrix M, TokenSequence[i] contains { M[i][0], M[i][1], ..., M[i][N-1] }.

TokenSequence[i] contains:

 int      SequenceLength
 String[] Token[i][j]            j=0..SequenceLength
 Symbol[] TokenQualifier[i][j]   j=0..SequenceLength
 int[]    TokenScore[i][j]       j=0..SequenceLength
 String[] GrammarTag[i][j]       j=0..SequenceLength
 String   ContextName[i]
 Symbol   Language[i]
 Symbol   LanguageVariant[i]
 String   SequenceQualifier[i]
 int      SequenceScore[i]

ASR resource output control parameters affect the sequence(s) of tokens available to the application. For example, p_OutputType determines whether the tokens represent phonetic symbols or text.

Recognition results are returned in the completion event of various functions as well as in some unsolicited events, and may represent either intermediate or complete results, depending on the event in question.

Scores

Scores indicating a quantitative level of confidence in any token or sequence may be optionally provided by the Recognizer. They may be associated either with a recognized sequence, or with the individual tokens of the sequence. Scores cover the numerical range 0 to 1000, according to the descriptive quality bands given in Table 5-13.

Score Description
801-1000 Excellent - there is no reason to reject this transcription
601-800 Good - this transcription should be accepted unless the action which results is deemed to have dangerous or irreversible consequences
401-600 Acceptable - this transcription is probably correct, but other alternatives may be worth considering if they have a similar score
201-400 Questionable - this transcription should be verified by further prompting, or discarded
0-200 Unacceptable
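As an illustration, a sketch that acts on these bands; the thresholds come from Table 5-13, the getSequenceScore() accessor name is assumed from the TokenSequence listing above, and the helper methods are hypothetical:

    ASREvent ev = asr.getFinalResults();
    ASREvent.TokenSequence seq = ev.getTokenSequence(0);  // top-ranked transcription
    int score = seq.getSequenceScore();                   // assumed accessor; range 0-1000
    if (score > 600) {
        acceptTranscription(seq);    // Good or Excellent: accept
    } else if (score > 400) {
        considerAlternatives(ev);    // Acceptable: weigh similar-scoring rows
    } else if (score > 200) {
        repromptCaller();            // Questionable: verify by further prompting
    } else {
        discardResult();             // Unacceptable
    }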

Type I and Type II Recognition Results

Many recognizers are capable of returning a result which includes multiple ("N-best") alternative sequences. These alternatives can be presented in one of two forms, depending on whether the tokens returned for each individual hypothesis are independent of each other or dependent on the other tokens of the sequence.

Some recognizers (isolated word recognizers, for example) return a set of sequences in which each token in a sequence is independent of the other tokens in the sequence. Such recognizers are referred to as Type I recognizers. Their token matrix t[r][o], therefore, represents a large number of possible individual sequences; for each value o, the index r can be selected independently from the respective alternative choices. For a matrix T with N rows and M columns, each combination of the form t[i0][0], t[i1][1], ..., t[iM-1][M-1] is a valid sequence (where i0, i1, ..., iM-1 are integers in the range {0, ..., N-1}).

Most connected word and continuous speech systems return recognition results in which an individual token t[r][o] depends on the values of the other tokens within the sequence T[r]. Such recognizers are referred to as Type II recognizers. For a Type II recognizer, only the rows of T, that is, token sequences of the form t[r][0], t[r][1], ..., t[r][N-1], are valid recognition sequences.

Recognition Results and ASREvent methods

Recognition results returned from recognize(), getResults(), or getFinalResults() are obtained from an ASREvent (with eventID equal to, respectively, ev_Recognize, ev_GetResults, or ev_GetFinalResults) using the following accessors:

ASREvent.TokenSequence and its accessors

ASREvent.getTokenSequence(int r) returns a TokenSequence object for row r. An ASREvent.TokenSequence object contains information associated with a single row of the recognition result matrices.

The accessor methods for TokenSequence are defined in ASREvent.TokenSequence

Note: the ASREvent returned from recognize() contains only one TokenSequence, representing the most likely result.

Some of the TokenSequence accessors are described briefly in the following subsections.

TokenSequence.getToken(int o)

getTokenSequence(r).getToken(o) returns the String that identifies the token that corresponds to the matrix element t[r][o].

TokenSequence.getTokenQualifier(int o)

getTokenSequence(r).getTokenQualifier(o) returns the Symbol that corresponds to the matrix element q[r][o].

A token qualifier reports whether the corresponding token is "normal" (i.e., represents a token string), "grammar tag" (i.e., the token value is NULL, and a grammar tag is provided instead), "garbage," or "silence."

This value is optional; if not supplied by the ASR Resource, the returned value is null for all values of {r, o}.

getGrammarTag(int o)

getTokenSequence(r).getGrammarTag(o) returns the String that corresponds to the matrix element g[r][o]. If getTokenSequence(r).getTokenQualifier(o) is v_GrammarTag, then g[r][o] provides the grammar tag for the token t[r][o]. The grammar tag will include the braces "{}" as defined in the JSGF grammar specification.

This is an optional KVpair, defined for Type II ASR resources. If not supplied, the return value is null.

getTokenScore(int o)

getTokenSequence(r).getTokenScore(o) returns the int that corresponds to the matrix element s[r][o]. Matrix element s[r][o] provides the score for the token t[r][o].

This value is optional; if not supplied by the ASR Resource, the returned value is -1 for all values of {r, o}.
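
Putting the accessors together, a minimal sketch that prints the N-best transcriptions; nBest and the getSequenceLength()/getSequenceScore() accessor names are assumptions based on p_TopChoices and the TokenSequence listing above:

    int nBest = 3;                              // illustrative; matches p_TopChoices
    ASREvent ev = asr.getFinalResults();        // Recognizer transits to v_Idle
    for (int r = 0; r < nBest; r++) {
        ASREvent.TokenSequence seq = ev.getTokenSequence(r);
        StringBuffer text = new StringBuffer();
        for (int o = 0; o < seq.getSequenceLength(); o++) {
            text.append(seq.getToken(o)).append(' ');
        }
        System.out.println("rank " + r + ", score " + seq.getSequenceScore() + ": " + text);
    }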

Training Results

Overview

The Word is the fundamental unit of training. Words are trained through several possible methods, including utterances, transcriptions of phonemes, or through ordinary text. The Word is contained in a Context. In most cases, the goal of training is to create a Context which may be used in later acts of recognition.

The result of training is a Word contained in a Context. Depending on the ASR resource, training can create models for freshly created Words, add to the training of Words that already exist (to update their training), or replace previous training.

Training a word using speech is a complex interaction. If utterances (either of a particular speaker or a group of speakers) are used to train, several utterances may be required. The application, through training results, is informed of whether the training session was successful, whether more utterances will be required, or whether training is complete.

Training Results

Several different data items make up the training results. They are included in the completion event of the function wordTrain(). Readiness, available from the method TrainEvent.getReadiness(), is a description of the Word being trained. It is used to determine whether more training is required before the Word can be committed; this value describes the overall state of the Word being trained. For example, a training utterance may fail, but the Word might already have sufficient training to be committed. The possible values of Readiness are:

v_Ready     The Word has sufficient training available.
v_NotReady  The Word does not have sufficient training available.
v_Complete  The Word has sufficient training available, and additional training will be ignored.

Another value of interest to training results is the Qualifier of the event, available using getQualifier(). The Qualifier reveals the success or failure of the latest utterance used to train the Word. Note the difference between the "latest utterance" and the Word itself: the Word might have sufficient utterances to be committed, even though the latest utterance was problematical; a wise developer will check the latest utterance, and if it is problematical, delete it with the command wordDeleteLastUtterance(). (If wordDeleteLastUtterance() is not supported by the resource, a problematical utterance will fail instead of producing a warning.) The values for the Qualifier are:

q_Success The latest utterance was successfully added to the training set.
q_RTC The training was stopped by RTC
q_Stop The training was stopped by a stopASR or other Stop command.
q_Duration The training failed because the maximum duration of the training window (p_Duration) expired.
q_InitialTimeout The training failed because the initial timeout timer expired.
q_Failure The latest utterance was not successful, and was not added to the training set.
q_Warning Although the latest utterance was added to the training set, the Recognizer has detected potential problems. Check the TrainingResult for details.

For example, if the training was unsuccessful because the user was silent, the value of the Qualifier will be either q_Duration (if p_Duration expires) or q_InitialTimeout (if that timer expires first).

Warnings are included as Values of TrainingResult and are accessed with getTrainingResult().

Values of TrainingResult are:
v_Success The training succeeded.
v_Collision The training is too close to the training of another Word in an active Context on the Recognizer.
v_Inconsistent The training is inconsistent with previous training for this Word.
v_Noisy The background for this training was too noisy.

For example, if the training is suspect because the resource cannot find any similarities between two training passes, and the resource supports wordDeleteLastUtterance(), then it will return the qualifier q_Warning and the TrainingResult v_Inconsistent. The training remains part of the Word unless wordDeleteLastUtterance() is called. If the resource does not support wordDeleteLastUtterance(), however, then the qualifier q_Failure with the TrainingResult set to v_Inconsistent is returned, and the utterance does not become part of the Word.

To summarize, training can imply two loops: an inner loop in which the application collects utterances and calls wordTrain() until the Word is ready, and an outer loop in which the application decides whether further training is required and commits the result with wordCommit(), as sketched in section Using the Training State Machine.

Since:
JTAPI-1.4

Inner Class Summary
static class ASR.NoContextException
          Thrown when an operation is attempted using a Context that has not been created on the ASR Resource.
 
Fields inherited from interface javax.telephony.media.ResourceConstants
e_Disconnected, FOREVER, q_Disconnected, q_RTC, rtcc_Disconnected, rtcc_TriggerRTC, v_Forever
 
Fields inherited from interface javax.telephony.media.MediaConstants
e_OK, q_Duration, q_Standard, q_Stop
 
Fields inherited from interface javax.telephony.media.ASRConstants
a_AutomaticTraining, a_ContextType, a_Date, a_DefaultRules, a_DetectionType, a_EchoCancellation, a_IntermediateResults, a_Label, a_Language, a_LoadableContext, a_LoadedContext, a_NumberRepetitions, a_NumberWordsRequested, a_OutputType, a_PrivateRules, a_PublicRules, a_RepeatRecognition, a_ResourceName, a_ResourceVendor, a_ResourceVersion, a_ResultPresentation, a_Rules, a_Size, a_SpeakerType, a_SpecificUtterance, a_Spotting, a_Trainable, a_Training, a_TrainingModifiable, a_TrainingType, a_UtteranceType, a_UtteranceValidation, a_Variant, e_BadContainerName, e_BadContext, e_Exists, e_IncorrectContext, e_NoExists, e_NotSupported, ev_ContextCopy, ev_ContextCreate, ev_ContextGetParameters, ev_ContextRemove, ev_ContextSetParameters, ev_GetFinalResults, ev_GetResults, ev_GetRuleExpansion, ev_Idle, ev_IntermediateResultsReady, ev_InvalidUtterance, ev_RecognitionStarted, ev_Recognize, ev_RetrieveRecognition, ev_SetRuleExpansion, ev_Start, ev_Stop, ev_UpdateParameters, ev_ValidUtterance, ev_WordCommit, ev_WordCreate, ev_WordDeleteLastUtterance, ev_WordDeleteTraining, ev_WordDestroy, ev_WordTrain, p_ActiveContexts, p_ActiveRules, p_ActiveWords, p_AutomaticTraining, p_ContextType, p_DetectionType, p_Duration, p_EchoCancellation, p_EnabledEvents, p_FinalTimeout, p_GrammarBargeInThreshold, p_InitialTimeout, p_Label, p_Language, p_LoadedContexts, p_LoadedWords, p_NumberRepetitions, p_NumberWordsRequested, p_OutputType, p_PlayInput, p_PrivateRules, p_PublicRules, p_ResultPresentation, p_ResultType, p_Rules, p_Size, p_SpeakerType, p_SpecificUtterance, p_Spotting, p_StartPaused, p_State, p_StoreInput, p_TopChoices, p_Trainable, p_TrainingType, p_UpdateParametersList, p_UtteranceType, p_Variant, q_Complete, q_Failure, q_InitialTimeout, q_Rejected, q_Silence, q_Stop, q_Success, q_Unsuccessful, q_Warning, rtca_Idle, rtca_Start, rtca_Stop, rtca_UpdateParameters, rtcc_GrammarBargeIn, rtcc_InvalidUtterance, rtcc_Recognize, rtcc_RecognizeStarted, rtcc_Silence, rtcc_SpecificUtterance, rtcc_SpeechDetected, rtcc_TrainStarted, rtcc_ValidUtteranceFinal, rtcc_ValidUtteranceFound, rtcc_WordTrain, rtcc_WordTrained, v_AddTraining, v_Class, v_Collision, v_Complete, v_Continuous, v_Dependent, v_Discrete, v_Final, v_FromResource, v_Garbage, v_GrammarTag, v_Identification, v_Idle, v_Immutable, v_Inconsistent, v_Independent, v_Intermediate, v_Mutable, v_Noisy, v_Normal, v_NotModifiable, v_NotReady, v_Phonetic, v_Ready, v_RecognitionPaused, v_Recognizing, v_ResultsAvailable, v_Retrain, v_Silence, v_Speech, v_Success, v_Text, v_ToResource, v_ToResourceTraining, v_Training, v_TrainingPaused, v_TypeI, v_TypeII, v_Verification, v_WordTrained
 
Method Summary
 void contextCopy(java.lang.String resourceContext, java.lang.String containerContext, Symbol direction)
          Copy a Context from a Resource to a Container, or from a Container to a Resource.
 void contextCreate(java.lang.String resourceContext, Symbol trainingType, java.util.Dictionary contextParams)
          Creates a new context on the ASR Resource.
 java.util.Dictionary contextGetParameters(java.lang.String resourceContext, Symbol[] keys)
          Retrieve context parameter values for a Context loaded in this resource.
 void contextRemove(java.lang.String resourceContext)
          Remove an existing Context from an ASR Resource.
 void contextSetParameters(java.lang.String resourceContext, java.util.Dictionary contextParams)
          This function sets the parameters associated with a particular Context that is loaded on a Resource.
 ASREvent getFinalResults()
          Retrieve recognition results and reset the recognizer.
 ASREvent getResults()
          Retrieve intermediate recognition results and do not reset the recognizer.
 java.lang.String getRuleExpansion(java.lang.String grammarContext, java.lang.String ruleName)
          Retrieve the expansion of a rule from a Grammar Context.
 ASREvent idleASR()
          Forces a Recognizer into the Idle state.
 ASREvent recognize(RTC[] rtcs, java.util.Dictionary optargs)
          Initiate speech recognition on the ASR resource.
 ASREvent setRuleExpansion(java.lang.String grammarContext, java.lang.String ruleName, java.lang.String ruleExpansion)
          Set the expansion of a rule in a Grammar Context.
 ASREvent startASR()
          Starts a recognizer that is in a paused state.
 ASREvent stopASR()
          Stops the Recognizer when it is in the "Recognizing" or "Training" states.
 ASREvent updateParameters()
          Set resource parameters according to p_UpdateParametersList.
 ASREvent wordCommit(java.lang.String wordContext, java.lang.String wordTemp, java.lang.String wordString)
          Commits a Word, as trained, into a Context.
 ASREvent wordCreate(java.lang.String wordContext, java.lang.String wordTemp)
          Create a new Word within a loaded Context.
 ASREvent wordDeleteLastUtterance(java.lang.String wordContext, java.lang.String wordTemp)
          Prevent the previous utterance from contributing to training.
 ASREvent wordDeleteTraining(java.lang.String wordContext, java.lang.String wordString)
          Delete all training associated with a Word in the specified Context.
 ASREvent wordDestroy(java.lang.String wordContext, java.lang.String wordString)
          Remove a word from a loaded Context.
 ASREvent wordTrain(java.lang.String wordContext, java.lang.String wordString)
          Train a Word in a Context.
 

Method Detail

recognize

public ASREvent recognize(RTC[] rtcs,
                          java.util.Dictionary optargs)
                   throws MediaResourceException
Initiate speech recognition on the ASR resource. The returned event contains the highest-probability result; that is, it contains only one TokenSequence. To get the full set of results, use getFinalResults().
Parameters:
rtcs - a RTC[] containing RTCs to be in effect during this call.
optargs - a Dictionary containing optional arguments.
Returns:
an ASREvent, when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.
DisconnectedException - if Terminal is disconnected.
See Also:
ASREvent

getResults

public ASREvent getResults()
                    throws MediaResourceException
Retrieve intermediate recognition results and do not reset the recognizer. This method is valid whenever partial results are ready: the recognizer can be in either the v_Recognizing or v_ResultsAvailable state.
Returns:
an ASREvent containing one or more TokenSequences.
Throws:
MediaResourceException - if requested operation fails.

getFinalResults

public ASREvent getFinalResults()
                         throws MediaResourceException
Retrieve recognition results and reset the recognizer.

Pre-conditions

Recognizer must be in the ResultsAvailable state.

Post-conditions

Recognizer is in the Idle state.

This is roughly equivalent to getResults() followed by idleASR().

Returns:
an ASREvent containing one or more TokenSequences.
Throws:
MediaResourceException - if requested operation fails.

startASR

public ASREvent startASR()
                  throws MediaResourceException
Starts a recognizer that is in a paused state. The Paused states are v_RecognitionPaused and v_TrainingPaused; the Recognizer moves to the states v_Recognizing and v_Training, respectively.

Corresponds to the ASR.rtca_Start action.

Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

stopASR

public ASREvent stopASR()
                 throws MediaResourceException
Stops the Recognizer when it is in the "Recognizing" or "Training" states. The Recognizer transitions to the next state as though the operation had completed execution.

For example, the ASR resource would move from v_Recognizing to v_ResultsAvailable.

Corresponds to the ASR.rtca_Stop action.

Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

idleASR

public ASREvent idleASR()
                 throws MediaResourceException
Forces a Recognizer into the Idle state.

The function is valid at any time, and is commonly used to move a Recognizer from the v_ResultsAvailable state into the v_Idle state without retrieving results from the Recognizer.

Note: on entry to the Idle state, any recognition results or uncommitted training data is lost.

Corresponds to the ASR.rtca_Idle action.

Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

updateParameters

public ASREvent updateParameters()
                          throws MediaResourceException
Set resource parameters according to p_UpdateParametersList.

This function (also available as the RTC action ASR.rtca_updateParameters) is useful in barge-in scenarios, where timeout values must change after the prompt has finished playing.

If p_UpdateParametersList is not set or empty, then nothing is done and this method returns normally.

Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.
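
The following non-normative sketch illustrates that barge-in pattern. That p_UpdateParametersList may be supplied through the optional arguments of recognize(), and that its value is a Dictionary of deferred settings, are assumptions here; p_SomeTimeout stands in for whatever timeout parameter the resource actually supports:

    import javax.telephony.media.*;

    class BargeInSketch {
        static void run(ASR asr, Symbol p_SomeTimeout)
                throws MediaResourceException {
            // Deferred settings to apply once the prompt has finished.
            java.util.Hashtable deferred = new java.util.Hashtable();
            deferred.put(p_SomeTimeout, new Integer(3000)); // hypothetical key
            java.util.Hashtable optargs = new java.util.Hashtable();
            optargs.put(ASRConstants.p_UpdateParametersList, deferred);
            // ... recognize(new RTC[0], optargs) is issued on a worker
            // thread, as in the paused-start sketch above ...
            // When the prompt completes, apply the deferred settings:
            asr.updateParameters();
        }
    }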

getRuleExpansion

public java.lang.String getRuleExpansion(java.lang.String grammarContext,
                                         java.lang.String ruleName)
                                  throws MediaResourceException
Retrieve the expansion of a rule from a Grammar Context.

For a Grammar Context loaded on a resource, this method retrieves the current expansion of the rule ruleName in the Grammar Context grammarContext. The rule expansion is returned as part of the completion event for this function and is available using ASREvent.getRuleExpansion().

Parameters:
grammarContext - the name of a loaded grammar Context.
ruleName - a String that names the rule to be expanded.
Returns:
a String containing the rule expansion.
Throws:
MediaResourceException - if requested operation fails.
See Also:
setRuleExpansion(java.lang.String, java.lang.String, java.lang.String)

setRuleExpansion

public ASREvent setRuleExpansion(java.lang.String grammarContext,
                                 java.lang.String ruleName,
                                 java.lang.String ruleExpansion)
                          throws MediaResourceException
Set the expansion of a rule in a Grammar Context.

For a loaded Grammar Context, this function sets the rule expansion of the rule ruleName to the rule expansion given in the ruleExpansion argument. The syntax of ruleExpansion is defined by the JSGF (Java Speech Grammar Format) specification.

Parameters:
grammarContext - the name of a loaded Grammar Context.
ruleName - a String that names the rule being defined.
ruleExpansion - a String containing the rule expansion.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.
See Also:
getRuleExpansion(java.lang.String, java.lang.String)
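
A non-normative sketch, using an illustrative Context name; the alternation syntax follows the JSGF specification:

    import javax.telephony.media.*;

    class RuleExpansionSketch {
        static String run(ASR asr) throws MediaResourceException {
            // Define the <digit> rule as a set of JSGF alternatives...
            asr.setRuleExpansion("digitsCtx", "digit",
                                 "one | two | three | four | five");
            // ...and read the current expansion back.
            return asr.getRuleExpansion("digitsCtx", "digit");
        }
    }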

contextCopy

public void contextCopy(java.lang.String resourceContext,
                        java.lang.String containerContext,
                        Symbol direction)
                 throws MediaResourceException
Copy a Context from a Resource to a Container, or from a Container to a Resource. This method can either store a loaded Context into a Container or load a context into the Resource.

The method contextCreate() must be used to create the Context on the resource before using this method; the behavior when that Context has not been created is left unspecified by this release.

The direction argument Symbol specifies whether the copy is
from a resource to a container (v_FromResource) or
from a container to a resource (v_ToResource).

To copy a Context from a Container to the Resource for the purpose of training, use the symbol v_ToResourceTraining.

When copying a Context to the Resource, the argument resourceContext assigns a name to the Context. This name is used to identify the Context as loaded in the Resource. It does not need to be the same name used to identify the Context in a Container.

This is a non-destructive copy; the source copy of the Context is unaffected. If a Context of the same name already exists at the destination, it is overwritten and lost. If the destination is a Container, a new Context object is created to accommodate the Context if necessary. If the destination is an ASR Resource that already holds a Context, and the Resource either does not support multiple Contexts or has no room for this Context, the copy fails. In such cases the application must take corrective action; for example, free up room on the ASR Resource by using contextRemove().

A Context copied into the Resource is inactive; it joins the set of inactive Contexts.

Parameters:
resourceContext - a Context within the ASR Resource.
containerContext - a String containing the name of the container context.
direction - a Symbol indicating the direction and type of copy. Must be one of ASRConstants.v_ToResource, ASRConstants.v_FromResource, or ASRConstants.v_ToResourceTraining.
Throws:
MediaResourceException - if requested operation fails.

contextCreate

public void contextCreate(java.lang.String resourceContext,
                          Symbol trainingType,
                          java.util.Dictionary contextParams)
                   throws MediaResourceException
Creates a new context on the ASR Resource. The new Context is associated with the supplied name, and that name should be used in all further operations on the Context.

The ASR Resource implementation must support Context modification.

The trainingType argument indicates the type of input used to perform training. The possible values are v_Speech, v_Phonetic, and v_Text.

Any context parameter, including those that are normally read-only and cannot be written by the application, may be set at creation time by including the appropriate Dictionary entry in contextParams.

Parameters:
resourceContext - a Context to be created in the Resource.
trainingType - a Symbol that identifies the type of input to be used for training this Context.
contextParams - a Dictionary containing the context parameters.
Throws:
MediaResourceException - if requested operation fails.
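
The following non-normative sketch ties contextCreate() and contextCopy() together. The container name "cabinet/yesno" is illustrative, and passing an empty Dictionary for "no extra parameters" is an assumption:

    import javax.telephony.media.*;

    class ContextRoundTripSketch {
        static void run(ASR asr) throws MediaResourceException {
            // Create the resource-side Context first, as required above.
            asr.contextCreate("yesno", ASRConstants.v_Speech,
                              new java.util.Hashtable());
            // Load the stored Context from the container...
            asr.contextCopy("yesno", "cabinet/yesno",
                            ASRConstants.v_ToResource);
            // ...and, after modification, store it back; the copy is
            // non-destructive with respect to the source.
            asr.contextCopy("yesno", "cabinet/yesno",
                            ASRConstants.v_FromResource);
        }
    }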

contextGetParameters

public java.util.Dictionary contextGetParameters(java.lang.String resourceContext,
                                                 Symbol[] keys)
                                          throws MediaResourceException
Retrieve context parameter values for a Context loaded in this resource.

If a key refers to a Parameter that is not present, if the Context associates no meaning with a particular Symbol, or if that parameter cannot be returned, then the key is ignored, no error is generated, and that key does not appear in the returned Dictionary.

If there is no ASR Resource configured in this MediaService, if the ASR Resource cannot return parameters for the given Context, or if the Context is not loaded in the ASR Resource, then null is returned.

Parameters:
resourceContext - the Context from which parameters are retrieved.
keys - an array of Symbols identifying the requested parameters.
Returns:
a Dictionary of parameter Symbols and their values.
Throws:
MediaResourceException - if requested operation fails.

contextRemove

public void contextRemove(java.lang.String resourceContext)
                   throws MediaResourceException
Remove an existing Context from an ASR Resource. The Context which is removed is lost.

To preserve a Context, copy it to a Container using the contextCopy() method.

Parameters:
resourceContext - the Context to be removed.
Throws:
MediaResourceException - if requested operation fails.

contextSetParameters

public void contextSetParameters(java.lang.String resourceContext,
                                 java.util.Dictionary contextParams)
                          throws MediaResourceException
This function sets the parameters associated with a particular Context that is loaded on a Resource.
Parameters:
resourceContext - the Context on which parameters are to be set.
contextParams - a Dictionary of parameters to set.
Throws:
MediaResourceException - if requested operation fails.
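
A non-normative sketch of querying and updating parameters on a loaded Context; treating the p_WordList value as a String[] is an assumption:

    import javax.telephony.media.*;

    class ContextParamsSketch {
        static void run(ASR asr) throws MediaResourceException {
            // Ask for the Context's word list (documented above).
            Symbol[] keys = { ASRConstants.p_WordList };
            java.util.Dictionary params =
                asr.contextGetParameters("yesno", keys);
            if (params != null) {
                String[] words =
                    (String[]) params.get(ASRConstants.p_WordList);
                // inspect words ...
            }
            // Push updated settings back to the loaded Context; which
            // keys to set is application-dependent, so the Dictionary
            // is left empty in this sketch.
            java.util.Hashtable updates = new java.util.Hashtable();
            asr.contextSetParameters("yesno", updates);
        }
    }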

wordCommit

public ASREvent wordCommit(java.lang.String wordContext,
                           java.lang.String wordTemp,
                           java.lang.String wordString)
                    throws MediaResourceException
Commits a Word, as trained, into a Context. Invoked when sufficient information has been collected to indicate that the Context may be permanently modified based on the updated training information.

The argument wordTemp identifies the Word to be committed. When an utterance corresponding to that Word is detected, the recognizer will return wordString as the result. wordString is not the return value of this method; it is the value the Recognizer will return as an answer in the Token field of the result object. wordString is permanent: once set, the association between the Word and wordString cannot be changed. wordString is not necessarily an exact transcription of the utterance; in most cases, an arbitrary String is used.

For example, in a scenario where arbitrary voice labels are associated with telephone numbers, the actual transcription is unknown. wordString can be an arbitrary index, and that index can be used by the application to find the correct telephone number.

If this function is used to commit additional training to an already-existing Word, then wordTemp should be the permanent name for the Word; in that case, the value of wordString is ignored.

Parameters:
wordContext - the Context to which this word is committed.
wordTemp - a String that identifies the word to commit.
wordString - the permanent String to identify this word.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.
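
A non-normative sketch of that voice-label scenario, assuming training of the temporary Word (see wordTrain()) has already completed; all names are illustrative:

    import javax.telephony.media.*;

    class VoiceLabelCommitSketch {
        static void run(ASR asr, java.util.Hashtable phoneBook)
                throws MediaResourceException {
            // Commit the trained Word under an arbitrary, permanent token.
            String label = "label-0042";
            asr.wordCommit("voiceDial", "tmpWord", label);
            // Map the token to the data it stands for; a later
            // recognition in "voiceDial" returns "label-0042" in the
            // Token field, which the application resolves here.
            phoneBook.put(label, "+1-555-0123");
        }
    }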

wordCreate

public ASREvent wordCreate(java.lang.String wordContext,
                           java.lang.String wordTemp)
                    throws MediaResourceException
Create a new Word within a loaded Context. The Word is accessed by using wordTemp, a temporary identifier. This method can only be used with Recognizers that support Context modification. Creating a Word adds it to the Context's p_WordList parameter.
Parameters:
wordContext - the Context in which Word is created.
wordTemp - a String containing the temporary name of the word.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

wordDeleteLastUtterance

public ASREvent wordDeleteLastUtterance(java.lang.String wordContext,
                                        java.lang.String wordTemp)
                                 throws MediaResourceException
Prevent the previous utterance from contributing to training. The most recent utterance collected by wordTrain() is not used in the training of the Word.

This method must be issued before any new training occurs and before the training is committed to the Context; that is, before the next wordTrain() or wordCommit() call, either of which makes the utterance part of the permanent training of the Word.

Parameters:
wordContext - the Context in which the word is being trained.
wordTemp - the String that identifies the word being trained.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

wordDeleteTraining

public ASREvent wordDeleteTraining(java.lang.String wordContext,
                                   java.lang.String wordString)
                            throws MediaResourceException
Delete all training associated with a Word in the specified Context.

If the Word has been committed, then wordString must be the permanent name of the Word. If the Word has not been committed, then wordString must be the temporary name.

Parameters:
wordContext - the Context from which the training is deleted.
wordString - the String that identifies the Word whose training is to be deleted.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

wordDestroy

public ASREvent wordDestroy(java.lang.String wordContext,
                            java.lang.String wordString)
                     throws MediaResourceException
Remove a word from a loaded Context. The ASR Resource must support Context modification. Destroying a Word removes it from the Context's p_WordList parameter.

If the Word has been committed, then wordString must be the permanent name of the Word. If the Word has not been committed, then wordString must be the temporary name.

Parameters:
wordContext - the Context from which the Word is removed.
wordString - the String that identifies the word.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.

wordTrain

public ASREvent wordTrain(java.lang.String wordContext,
                          java.lang.String wordString)
                   throws MediaResourceException
Train a Word in a Context. The Recognizer collects audio information (or other information; see p_TrainingType), and associates this information with the Word.

The Word might be one that has been trained before, or one that has just been created. If the Word has been committed, then wordString must be the permanent name of the Word. If the Word has not been committed, then wordString must be the temporary name.

Not all Recognizers can add additional training to a Word that has already been committed to a Context. The attribute a_TrainingModifiable may be queried to determine the Recognizer's abilities:
v_NotModifiable means that training cannot be modified.
v_Retrain means that retraining the Word is possible, but only by performing wordDeleteTraining() and retraining the Word entirely.
v_AddTraining means that additional training may be added to a Word by using the wordTrain() method.

Some Recognizers may perform training automatically, issuing prompts and collecting utterances until a sufficient number of good utterances have been collected. Such Recognizers, which have the attribute a_AutomaticTraining set to true, will run in automatic mode when the parameter p_AutomaticTraining is set to true. When training is over, they will issue a single completion event. The parameter p_NumberRepetitions, available in some Recognizers, may be set to dictate how many training utterances the Recognizer should use in AutomaticTraining mode.

Recognizers have different requirements for the minimum number of utterances necessary to train a new Word; if the application is collecting utterances, it must provide at least that many utterances to the Recognizer. The Recognizer may also have an upper limit on the number of utterances it can use for training. The range of the p_NumberRepetitions parameter may be queried to determine both numbers: the minimum of the range is the number of utterances required, and the maximum is the largest number the Recognizer can use.

Different types of training may be available on Recognizers; the a_TrainingType attribute indicates which types are available. The parameter p_TrainingType is used to set the type when more than one type is supported: p_TrainingType takes on the values v_Speech, v_Text, and v_Phonetic to describe the training input. Recognizers that accept training of type v_Phonetic use IPA as their text input for training. If p_TrainingType is set to v_Text or v_Phonetic, the dictionary entry p_TrainingInfo must also be present. This array of Strings contains either the text or the phonetic representation of the training material, as appropriate.

Parameters:
wordContext - the Context in which the word is trained.
wordString - a String that identifies the word to be trained.
Returns:
an ASREvent when the operation is complete.
Throws:
MediaResourceException - if requested operation fails.
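
Pulling the word-level methods together, the following non-normative sketch runs a manual (non-automatic) training session. The assumption that three utterances satisfy the minimum should be checked against the range of p_NumberRepetitions in practice; all names are illustrative:

    import javax.telephony.media.*;

    class TrainAndCommitSketch {
        static void run(ASR asr) throws MediaResourceException {
            // Create the Word under a temporary name.
            asr.wordCreate("voiceDial", "tmpWord");
            for (int i = 0; i < 3; i++) {
                // Collect one training utterance for the Word.
                asr.wordTrain("voiceDial", "tmpWord");
                // If this utterance was unusable (noise, wrong speaker),
                // discard it before the next wordTrain()/wordCommit():
                // asr.wordDeleteLastUtterance("voiceDial", "tmpWord");
            }
            // Make the training permanent under the token "label-0042".
            asr.wordCommit("voiceDial", "tmpWord", "label-0042");
        }
    }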

Send comments to: JSR-43@ectf.org