Resolution Rules¶

WIP

The resolution component is a work in progress and has not been released yet. In addition, be aware that the functionality has not yet been tested extensively across different scenarios, so it cannot be guaranteed to work smoothly in your case.

In many dialogs, users refer to objects in the environment, e.g. by saying something like "Pick up the rightmost box". To resolve such references and identify the object that is referenced, the resolution functionality of stepDP can be used. Currently, only the resolution of spatial references, i.e. references that describe an object by its spatial features, is supported. So, references that refer to object attributes like "the red box" cannot yet be resolved, but references like "the box on the left", "the third box from the right", "the box to the left of the cube" or "the box in the second shelf" are covered.

Limitations:

the reference needs to be wrapped, for example, by an intent, so it cannot be at the top level of the token
the implementation was only tested with at most one reference per token
the world coordinate system and the position of the speaker need to satisfy some requirements that are described in the "Knowledge Base" section
hypotheses are not supported yet, i.e. the resolution rule simply selects the most likely referent and does not create separate tokens for different hypotheses when there are several likely referents
the tokens added from the language model do not respect the semantic tree, so the functionality does not work when the type check is enabled

How to use the reference resolution functionality¶

To use the reference resolution functionality, three things need to be considered:

the knowledge base needs to provide information about the potential referent objects in a certain format, including object type, 3D position and rotation, and 3D bounding box
the required rule(s) need to be added to the blackboard
the tokens containing the reference need to match a certain format, such that the resolution rule can process them. These tokens can be created by the stepDP NLU extension in combination with Cerence Studio. A Cerence Studio NLU model that yields an interpretation of the required format already exists and you can adapt it to your scenario.

Knowledge Base¶

The types of knowledge base objects that should be considered as possible targets for a reference need to inherit from a certain semantic tree type defined by the constant RRTypes.SPAT_REF_TARGET. So make sure to add the following inheritance when defining the corresponding semantic tree types:

Type xy = new Type("xy", this.getKB());
xy.addInheritance(this.getKB().getType(RRTypes.SPAT_REF_TARGET));

In addition, information about the 3D position, rotation and bounding box of each object needs to be provided. The requirements for the world coordinate system and the speaker position are illustrated in the figure below. S is the speaker and O1 and O2 are objects that are in the speaker's view. The implementation is based on the coordinate system used by Unity, i.e. a left-handed, Y-up coordinate system. The regions and spatial relations, e.g. "left" and "right", are based on the speaker's perspective, not the listener's perspective. In addition, it is assumed that the speaker looks in the positive z direction of the world coordinate system, which is the default camera setting in Unity.

[Figure: CoordinateSystem]
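To make the coordinate convention concrete, here is a minimal, self-contained sketch of how references such as "the rightmost box" or "the third box from the right" could be ordered under the assumptions above (left-handed, Y-up, speaker looking along positive z, so positive x lies to the speaker's right). The class `SpatialOrderSketch`, the type `Obj` and the example positions are hypothetical and are not part of stepDP; the actual resolution rule may work differently, for example by taking the speaker's position and rotation into account.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the stepDP implementation.
final class SpatialOrderSketch {

    /** Hypothetical stand-in for a knowledge base object: a name plus a world position. */
    static final class Obj {
        final String name;
        final double x, y, z;

        Obj(String name, double x, double y, double z) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Orders objects from the speaker's right to the speaker's left (largest x first). */
    static List<Obj> fromRight(List<Obj> objects) {
        List<Obj> sorted = new ArrayList<>(objects);
        sorted.sort(Comparator.comparingDouble((Obj o) -> o.x).reversed());
        return sorted;
    }

    /** "The n-th object from the right", with n starting at 1. */
    static Obj nthFromRight(List<Obj> objects, int n) {
        return fromRight(objects).get(n - 1);
    }

    /** "The n-th object from the left", with n starting at 1. */
    static Obj nthFromLeft(List<Obj> objects, int n) {
        List<Obj> sorted = fromRight(objects);
        return sorted.get(sorted.size() - n);
    }

    public static void main(String[] args) {
        // Assumed example positions; the speaker looks along +z, so +x is to the right.
        List<Obj> boxes = List.of(
                new Obj("box_a", -0.5, 1.0, 2.0),
                new Obj("box_b", 0.1, 1.0, 2.0),
                new Obj("box_c", 0.8, 1.0, 2.0));

        System.out.println(nthFromRight(boxes, 1).name); // prints "box_c" (rightmost)
        System.out.println(nthFromRight(boxes, 3).name); // prints "box_a"
    }
}
```

Sorting only by x is valid here because the speaker is assumed to look along the positive z axis; for a rotated speaker, the positions would first have to be transformed into the speaker's local frame.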

The following JSON shows an example of the information that is required for the reference resolution, in the required format. The visibility information is optional. Only objects above a visibility threshold are considered for the reference resolution.

{"type":"Ketchup",
 "name":"Ketchup_14",
 "transform":{
    "position":{"x":-0.5531328,"y":1.8194195,"z":-0.097749114},
    "rotation":{"x":270.0,"y":180.0,"z":0.0}
    },
    "visibility":0.08477666,
    "boundingBox":{
        "center":{"x":-0.5531328,"y":1.84930778,"z":-0.09774916},
        "extents":{"x":0.0619111545,"y":0.160330042,"z":0.0391443856}
    }
}
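As a rough illustration of how these fields can be interpreted, the sketch below derives the corners of the bounding box from "center" and "extents" and applies a visibility filter. It assumes that "extents" follows the Unity convention of holding half of the box size along each axis, and it uses a made-up threshold of 0.05; the real threshold and the way stepDP reads these values are not specified here. `BoundsSketch` and `Vec3` are hypothetical names.

```java
// Hypothetical sketch, not the stepDP implementation.
final class BoundsSketch {

    /** Simple immutable 3D vector. */
    static final class Vec3 {
        final double x, y, z;

        Vec3(double x, double y, double z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Minimum corner of an axis-aligned box, assuming extents are half-sizes. */
    static Vec3 minCorner(Vec3 center, Vec3 extents) {
        return new Vec3(center.x - extents.x, center.y - extents.y, center.z - extents.z);
    }

    /** Maximum corner of an axis-aligned box, assuming extents are half-sizes. */
    static Vec3 maxCorner(Vec3 center, Vec3 extents) {
        return new Vec3(center.x + extents.x, center.y + extents.y, center.z + extents.z);
    }

    /** Keeps only objects whose visibility lies above the (assumed) threshold. */
    static boolean isCandidate(double visibility, double threshold) {
        return visibility > threshold;
    }

    public static void main(String[] args) {
        // Values taken from the JSON example above.
        Vec3 center = new Vec3(-0.5531328, 1.84930778, -0.09774916);
        Vec3 extents = new Vec3(0.0619111545, 0.160330042, 0.0391443856);

        Vec3 min = minCorner(center, extents);
        Vec3 max = maxCorner(center, extents);
        System.out.println("x range: " + min.x + " .. " + max.x);
        System.out.println("y range: " + min.y + " .. " + max.y);

        double assumedThreshold = 0.05; // hypothetical value for illustration only
        System.out.println("candidate: " + isCandidate(0.08477666, assumedThreshold));
    }
}
```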

Cerence Studio¶

The NLU model setup for the reference resolution component differs a little from the usual setup. The idea is that there are two separate models, one for the scenario that does the intent classification and entity extraction, but only extracts the reference as text, and a second NLU model that is dedicated to the references. You can use the following NLU model as the second model if you import and deploy it to your Cerence Studio account:

RRNLU.trsx

To adapt it to your scenario, exchange the example literals for the entity "refType" with the types that are relevant for your use case. If the performance is not sufficient, you might want to add some training samples that contain your "refType" literals, or generate them by replacing the literals in the provided training samples.

You can design the first model however your use case requires. The only requirement with regard to the reference resolution is that the entities that can contain references inherit from a concept of type Freeform that is called "LMSpatialReferenceText". So, for example, if you have a "PickUpIntent" with an associated entity "objects" that should contain a reference, you need to create "objects" as an entity of type "Relationship" and create an "isA" relationship to the Freeform concept "LMSpatialReferenceText". Then you can set up the stepDP NLU extension to use your scenario model as the main model and the reference resolution model as the submodel. The process is similar to the usual process for setting up stepDP with Cerence Studio NLU, except for the initialization (replace the dummy values with your credentials):

// Credentials for the reference resolution model (the one that you imported above)
ICerenceCredentials subCreds = new BasicCerenceCredentialController(
        "ServerURI",
        "AppID",
        "AppKey",
        "ContextTag", "Language");
cerence = new NLUController(this);
// Credentials for your scenario model, subCreds from above and name of the reference freeform concept
cerence.InitCerenceMix(
        "ServerURI",
        "AppID",
        "AppKey",
        "ContextTag", "Language", subCreds, "LMSpatialReferenceText");
cerence.StartAudioServer("0.0.0.0", 50002);

Rules¶

Finally, you need to add the relevant rules to stepDP:

// you can set some config parameters for the reference resolution, e.g. the default direction for "the second object"
RRConfigParameters config = new RRConfigParameters();
config.DEFAULT_DIR = SpatialRegion.left;

// Preprocessing rule: needed to convert Token from Cerence Studio to required format for reference resolution rule
// speaker is the IKBObject that represents the person uttering the reference (speaker position and rotation is needed for perspective-dependent computations)
this.getBlackboard().addRule(new SpatialReferencePreprocessingRule(this.getKB(), speaker, config));

// Fusion rule: optional, add it if you want to include pointing gestures in the reference resolution
Type t1 = this.getKB().getType(RRTypes.SPAT_REF);
Type t2 = this.getKB().getType(RRTypes.POINTING_C);
long interval = Duration.ofSeconds(10).toMillis();
long maxAge = Duration.ofSeconds(10).toMillis();
Rule r2 = new DeclarativeTypeBasedFusionRule(t1, t2, "constraints", interval, maxAge, this.getKB());
r2.setPriority(2999);
this.getBlackboard().addRule(r2);

// The actual Reference Resolution rule: this one is required in any case
// The minTokenAge of 1 second is used because the user might provide the pointing gesture shortly after the speech;
// without the delay, the resolution would start immediately when the speech is received
this.getBlackboard().addRule(new SpatialReferenceResolutionRule(this.getKB(), Duration.ofSeconds(1).toMillis(), config));
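The effect of the minimum token age can be pictured with the following self-contained sketch. It only illustrates the idea of delaying a rule until a token has reached a certain age, so that a pointing gesture arriving shortly after the speech can still be fused first. The class `TokenAgeSketch`, the method `isOldEnough` and the inclusive comparison are assumptions for illustration and do not reflect how stepDP evaluates the age internally.

```java
import java.time.Duration;

// Hypothetical sketch, not the stepDP implementation.
final class TokenAgeSketch {

    /** True once at least minTokenAgeMillis have passed since the token was created. */
    static boolean isOldEnough(long createdAtMillis, long nowMillis, long minTokenAgeMillis) {
        return nowMillis - createdAtMillis >= minTokenAgeMillis;
    }

    public static void main(String[] args) {
        long minTokenAge = Duration.ofSeconds(1).toMillis();
        long speechReceivedAt = 0L; // assumed timestamp of the speech token

        // 300 ms after the speech: too early, a pointing gesture may still arrive.
        System.out.println(isOldEnough(speechReceivedAt, 300L, minTokenAge));  // prints "false"
        // 1200 ms after the speech: the resolution may start.
        System.out.println(isOldEnough(speechReceivedAt, 1200L, minTokenAge)); // prints "true"
    }
}
```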

The following code snippet gives an example of how to use the resolved token in an example rule and how to create a fallback rule for the case that the resolution was not successful.

// example for a dialog rule that works with the resolved token
Rule pickUpRule = new SimpleRule(tokens -> {
    IToken t = tokens[0];
    IKBObject[] objs = t.getResolvedReferenceArray("objects");
    // objs is an array, so it is streamed via Arrays.stream (requires java.util.Arrays and java.util.stream.Collectors)
    String namesStr = Arrays.stream(objs).map(o -> o.getName()).collect(Collectors.joining(" "));
    System.out.println("Picking up " + namesStr);
    // set ignore rule tags to avoid that origin / result tokens trigger rules a second time
    List<String> ignoreTags = List.of("PickUp", "fusion");
    t.setIgnoreRuleTags(new LinkedList<String>(ignoreTags));
    t.addIgnoreTagsToAllOrigins(ignoreTags);
    t.addIgnoreTagsToAllResults(ignoreTags);
}, "PickUpRule");
// pattern for "objects" property ensures that the rule triggers only for tokens with resolved references
// you could replace RRTypes.SPAT_REF_TARGET also with a more specific type that inherits from it
Pattern p = new PatternBuilder("PickUp", this.getKB()).addPatternForProperty("objects").hasType(RRTypes.SPAT_REF_TARGET)
        .endPropertyPattern().build();
pickUpRule.setCondition(new PatternCondition(p));
pickUpRule.setPriority(10001);
pickUpRule.getTags().add("PickUp");
this.getBlackboard().addRule(pickUpRule);

// if the token cannot be resolved, the pickUpRule will not trigger
// => after a delay, the fallback rule triggers
Rule fallbackRule = new SimpleRule(tokens -> {
    IToken t = tokens[0];
    System.out.println("I don't know which object you mean.");
    // set ignore rule tags to avoid that origin / result tokens trigger rules a second time
    List<String> ignoreTags = List.of("PickUp", "fusion");
    t.setIgnoreRuleTags(new LinkedList<String>(ignoreTags));
    t.addIgnoreTagsToAllOrigins(ignoreTags);
    t.addIgnoreTagsToAllResults(ignoreTags);
}, "FallbackRule");
// this one matches any PickUp intent, including those with an unresolved reference
Pattern p1 = new PatternBuilder("PickUp", this.getKB()).build();
fallbackRule.setCondition(new PatternCondition(p1));
fallbackRule.setPriority(10002);
// MinTokenAge = delay before fallback rule triggers
fallbackRule.getCondition().setMinTokenAge(Duration.ofSeconds(5).toMillis());
fallbackRule.getTags().add("PickUp");
this.getBlackboard().addRule(fallbackRule);

For Developers: Overview of Reference Resolution Implementation¶

This section describes how the spatial reference resolution rule works internally. You don't need to read it if you only want to use the rule; the sections above are sufficient for that purpose. The figure below gives an overview of the classes involved in the reference resolution and how they work together.

[Figure: RRClasses]