Yuan et al.
The QAit (Question Answering with Interactive Text) task proposes a novel text-based question answering problem in which an agent must interact with a partially observable text-based environment to gather the declarative knowledge required to answer questions. QAit poses questions about the location, existence, and attributes of objects distributed throughout the environment. The QAit authors produced and evaluated a set of baseline models on a constructed test set of unseen environments and questions. This test set is intended as a benchmark for future research to evaluate an agent's ability to comprehend language and generalise its action policy.
QAit aims to test an agent's language comprehension abilities using tasks that require an understanding of location, existence, and attributes. All environments are generated using TextWorld by sampling from the world setting distribution (tabulated below), with configurations divided into fixed map and random map categories. Fixed map environments always contain 6 unique rooms. In contrast, random map games draw the number of rooms from a uniform distribution.
|                 | Fixed Map            | Random Map           |
|-----------------|----------------------|----------------------|
| # Locations, Nr | 6                    | Uniform(2, 12)       |
| # Entities, Ne  | Uniform(3*Nr, 6*Nr)  | Uniform(3*Nr, 6*Nr)  |
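The sampling scheme in the table can be sketched as follows. This is an illustrative reconstruction, not the actual TextWorld generation API; the function and dictionary key names are hypothetical.

```python
import random

def sample_world_setting(map_type):
    """Sample a world configuration from the QAit setting distribution.

    map_type: "fixed" or "random". Names here are illustrative only.
    """
    if map_type == "fixed":
        n_rooms = 6                              # fixed map: always 6 rooms
    else:
        n_rooms = random.randint(2, 12)          # random map: Uniform(2, 12)
    # Entity count scales with the number of rooms: Uniform(3*Nr, 6*Nr)
    n_entities = random.randint(3 * n_rooms, 6 * n_rooms)
    return {"n_rooms": n_rooms, "n_entities": n_entities}
```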
Questions based on each environment are created on the fly as an agent plays a game, but the number of different games an agent trains on is set as an experimental parameter. Agents are trained on datasets consisting of 1, 2, 10, 100, and 500 created environments, as well as an unlimited setting where a new environment is created for each question, so that an agent theoretically never sees the same environment–question pair twice. In this setting, more than 10^40 different games can be created, making it unlikely that an agent ever encounters the same game again.
Since language generation can become intractable in a reinforcement learning setting, all text commands are triplets of the form {action, modifier, object} (e.g., open metallic gate). When there is no ambiguity (for instance, only one key in the room rather than two different keys), the environment understands commands without modifiers: pick key will pick up the "copper key" provided it is the only key in the room. At each game step, three lexicons divide the vocabulary into actions, modifiers, and objects. This reduces the size of the action space for each word in the command triplet compared to a sequential, free-form setting. The wait command indicates that the agent wants to stop interacting and answer the question.
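The effect of the triplet structure on the action space can be illustrated with a toy example. The lexicons below are hypothetical and much smaller than QAit's actual vocabularies; the point is that the command space is the product of three small lexicons rather than vocabulary_size ** sequence_length as in free-form generation.

```python
# Hypothetical, heavily truncated lexicons for illustration only.
actions   = ["open", "take", "go", "wait"]
modifiers = ["", "metallic", "copper", "wooden"]
objects   = ["", "gate", "key", "north"]

def command_space_size(actions, modifiers, objects):
    # One (action, modifier, object) triplet per step, so the space is
    # the product of the three lexicon sizes.
    return len(actions) * len(modifiers) * len(objects)

def to_command(action, modifier, obj):
    # Empty modifier/object slots are dropped when rendering the command,
    # mirroring modifier-free commands like "pick key".
    return " ".join(w for w in (action, modifier, obj) if w)
```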
The QAit test set provides 500 held-out games for each map type and each of the three question types. This test set is used to benchmark the generalisation abilities of agents across all experimental configurations. It allows models to be assessed in a reproducible manner and is analogous to a supervised learning test set.
Accuracy refers to the proportion of correctly answered questions and is deemed the most important metric since, ultimately, the goal of IQA is to answer a question.
Sufficient information is a metric that evaluates the amount of information gathered by the agent and whether that information was sufficient to answer the question; it is also used as part of the reward function. It measures the performance of the navigation and interaction required to answer a given question.
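The two metrics can be sketched as below. The accuracy computation is standard; the sufficient-information check is an illustrative simplification (the paper defines it per question type), here shown for a location-style question by testing whether any observation the agent collected mentions the answer string.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def sufficient_information(observations, answer):
    """Illustrative check, not the paper's exact definition: information
    is deemed sufficient if any collected observation mentions the answer."""
    return any(answer in obs for obs in observations)
```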
QAit provides five baselines: human, random, and three popular value-based reinforcement learning methods. The human baseline consists of results achieved by 21 human participants. The random baseline performs no interaction with the environment and simply samples answers from the potential answer pool: yes or no for existence-type questions, and all possible object names for location-type questions. The reinforcement learning baselines are DQN, DDQN, and Rainbow.
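The random baseline's answer sampling can be sketched as follows; the function name and the candidate_objects parameter are hypothetical, standing in for the vocabulary of possible answer names.

```python
import random

def random_baseline(question_type, candidate_objects):
    """Answer a question without interacting with the environment,
    sampling uniformly from the relevant answer pool."""
    if question_type == "existence":
        return random.choice(["yes", "no"])
    if question_type == "location":
        return random.choice(candidate_objects)
    raise ValueError(f"unknown question type: {question_type}")
```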