Gregory Furman
Prior work has focused exclusively on IQA as a reinforcement learning problem; such methods suffer from low sample efficiency and poor accuracy under zero-shot evaluation.
By framing IQA as an offline sequence modelling problem, we investigate the applicability of the novel Decision Transformer (DT) architecture, where a sequence consists of states, actions, and rewards, each corresponding to an episodic timestep.
We used a causally masked GPT-2 Transformer for action generation, along with an answer prediction head for QA. Additionally, we trained a BERT model for question-answering to be used in tandem with the DT's command generation heads (DT-BERT).
Please read about the QAit task before continuing in order to understand the experimental settings, results, metrics, and overall context of this study.
Proposed by Chen et al. (2021), the Decision Transformer aims to autoregressively model a trajectory of actions, states, and rewards using a Transformer architecture, specifically GPT-2.
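For reference, Chen et al. (2021) represent a trajectory as an interleaved sequence of returns-to-go, states, and actions, where the return-to-go at timestep t is the sum of rewards from t to the end of the episode:

```latex
\tau = \left(\hat{R}_1, s_1, a_1,\ \hat{R}_2, s_2, a_2,\ \dots,\ \hat{R}_T, s_T, a_T\right),
\qquad
\hat{R}_t = \sum_{t'=t}^{T} r_{t'}
```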
The Decision Transformer is fed the last K timesteps as input, comprising K returns-to-go, state, and action tokens (one token per modality per timestep, for a total of 3K tokens). The parameter K is a context window determining how many previous timesteps the Transformer can draw upon to inform its decision-making. We set the maximum length of a QAit episode to 50 episodic timesteps as well as K=50, allowing a full trajectory of up to 50 timesteps to be fed to the Decision Transformer at a time.
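As a concrete illustration, below is a minimal PyTorch sketch of this interleaving. It is not the original implementation: the tensor names, embedding dimension, and random inputs are placeholders, and the per-timestep embeddings are assumed to have already been produced as described in the following paragraph.

```python
# Minimal sketch of interleaving the last K timesteps into the 3K-token
# sequence fed to the GPT-2 backbone (names/shapes are illustrative).
import torch

K = 50           # context window, matching the maximum QAit episode length
EMBED_DIM = 128  # placeholder embedding dimension

def build_input_sequence(rtg_embs, state_embs, action_embs):
    """Interleave per-timestep embeddings into a single (3K, embed_dim) sequence.

    Each argument has shape (K, embed_dim) and is already embedded/encoded.
    """
    # Stack as (K, 3, embed_dim): per-timestep ordering is (R_t, s_t, a_t),
    # then flatten to (3K, embed_dim).
    tokens = torch.stack([rtg_embs, state_embs, action_embs], dim=1)
    return tokens.reshape(-1, tokens.shape[-1])

# Example usage with random placeholder embeddings:
sequence = build_input_sequence(
    torch.randn(K, EMBED_DIM), torch.randn(K, EMBED_DIM), torch.randn(K, EMBED_DIM)
)
assert sequence.shape == (3 * K, EMBED_DIM)
```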
Token embeddings for states and actions are obtained using a single embedding layer, wherein the raw token inputs are projected to an embedding dimension. Since states and actions can be of variable length, these embeddings are fed to a GRU encoder, with the final hidden state representing the entire encoded state or action. An embedding for the return-to-go is also learnt and projected to the embedding dimension. Finally, an embedding for each episodic timestep is concatenated to each embedded token. The embedded and positionally encoded action, state, and return-to-go inputs are then fed into the GPT model.
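A minimal sketch of such an encoder is shown below, assuming a PyTorch implementation; the module name, vocabulary size, and dimensions are illustrative placeholders rather than the authors' code.

```python
# Illustrative PyTorch sketch of the state/action encoder described above.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Embed a variable-length state or action and summarise it with a GRU."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)  # single embedding layer
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of word indices for one state or action
        embedded = self.token_emb(token_ids)   # (batch, seq_len, embed_dim)
        _, final_hidden = self.gru(embedded)   # (1, batch, embed_dim)
        return final_hidden.squeeze(0)         # (batch, embed_dim): the encoded sequence

# The episodic timestep embedding would then be concatenated to each encoded token,
# e.g. torch.cat([encoded, timestep_embedding(t)], dim=-1), before entering GPT-2.
```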
For any given timestep, the Decision Transformer output is fed into four linear decoders. Three correspond to a command's action, modifier, and object components, with the fourth serving to predict the answer to the question at each timestep. While not acting as the primary QA mechanism in the architecture, the answer decoder allows the Decision Transformer to learn a primitive level of question-answering alongside command generation.
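Sketched below is one way these four heads could look in PyTorch; the module name and vocabulary sizes are placeholders, not the original implementation.

```python
# A minimal sketch of the four linear decoders over the Decision Transformer's
# hidden state for a timestep (vocabulary sizes are placeholders).
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    """Predict the command's action, modifier, and object, plus an answer."""

    def __init__(self, hidden_dim, n_actions, n_modifiers, n_objects, n_answers):
        super().__init__()
        self.action_head = nn.Linear(hidden_dim, n_actions)
        self.modifier_head = nn.Linear(hidden_dim, n_modifiers)
        self.object_head = nn.Linear(hidden_dim, n_objects)
        self.answer_head = nn.Linear(hidden_dim, n_answers)  # answer to the question

    def forward(self, hidden):  # hidden: (batch, hidden_dim) DT output for a timestep
        return (
            self.action_head(hidden),
            self.modifier_head(hidden),
            self.object_head(hidden),
            self.answer_head(hidden),
        )
```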
We also limited the size of input states to a length of 180 tokens.
The BERT QA module consists of a classification layer sitting atop a BERT encoder. A benefit of using a BERT model is that multiple state strings can be used as context when answering a question. First, the states are joined into a single long sequence of tokens with the question appended at the end. This concatenated state-and-question string is tokenised by a pretrained bert-base-uncased tokeniser, which pads or truncates the input to a 512-token representation that is then fed to the BERT encoder. The tokeniser also separates the prompt and question with a [SEP] tag and adds [CLS] and [SEP] tokens to the beginning and end of the string, respectively.
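One plausible way to construct this input with the Hugging Face tokeniser is sketched below; the example states and question are invented for illustration.

```python
# Sketch of building the BERT input described above with the Hugging Face tokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

observed_states = [
    "You are in the kitchen. You see a fridge and a counter.",
    "You open the fridge. Inside is an apple.",
]
question = "where is the apple"

encoded = tokenizer(
    " ".join(observed_states),  # states joined into one long context string
    question,                   # question appended as the second segment
    padding="max_length",       # pad shorter inputs up to 512 tokens...
    truncation=True,            # ...or truncate longer inputs down to 512 tokens
    max_length=512,
    return_tensors="pt",
)
# encoded["input_ids"] has the form: [CLS] states ... [SEP] question [SEP] [PAD] ...
```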
Finally, we pass BERT's pooled output to a linear layer that predicts a word from the vocabulary to answer the question, in a manner akin to a classification task. For attribute and existence questions, the classifier has two output nodes, one each for "yes" and "no". For location type questions, the classifier has a node for each token in the vocabulary, making it identical to the answer prediction mechanism of the Decision Transformer.
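A minimal sketch of this classifier, assuming a standard Hugging Face BertModel; the class name and arguments are our own placeholders.

```python
# Sketch of the classification layer on BERT's pooled output; the number of
# answer nodes depends on the question type, as described above.
import torch.nn as nn
from transformers import BertModel

class BertQAClassifier(nn.Module):
    def __init__(self, num_answers: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_answers)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)  # logits over candidate answers

# num_answers = 2 ("yes"/"no") for existence and attribute questions,
# or the vocabulary size for location questions.
```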
Selected results are presented. For a full set of results and discussion, please refer to the paper.
QA accuracy (sufficient information score in parentheses) for models trained in the 500 games setting:

| Model | Location (Fixed) | Location (Random) | Existence (Fixed) | Existence (Random) | Attribute (Fixed) | Attribute (Random) |
| --- | --- | --- | --- | --- | --- | --- |
| DQN | 0.224 (0.244) | 0.204 (0.216) | 0.674 (0.279) | 0.678 (0.214) | 0.534 (0.014) | 0.530 (0.017) |
| DDQN | 0.218 (0.228) | 0.222 (0.246) | 0.626 (0.213) | 0.656 (0.188) | 0.508 (0.026) | 0.486 (0.023) |
| Rainbow | 0.190 (0.196) | 0.172 (0.178) | 0.656 (0.207) | 0.678 (0.191) | 0.496 (0.029) | 0.494 (0.017) |
| DT | 0.168 (0.232) | 0.104 (0.264) | 0.668 (0.254) | 0.722 (0.277) | 0.504 (0.057) | 0.526 (0.058) |
| DT-BERT | 0.232 (0.232) | 0.270 (0.264) | 0.626 (0.258) | 0.654 (0.277) | 0.524 (0.058) | 0.538 (0.060) |
On the test set, we found the DT's answer prediction head to have a QA accuracy of 0.104 on random map games and 0.168 on fixed map games. Thus, while outperforming the random baseline, the DT's answer prediction head did not surpass previous RL methods in the 500 games setting for either the fixed or random map type. However, the information-gathering capacity of the DT surpassed all prior approaches to location type questions in the random map setting, scoring a sufficient information of 0.264, while scoring 0.232 in the fixed map setting.
The DT's question-answering capacity was seemingly decoupled from its knowledge-gathering abilities. Findings by Yuan et al. showed that an agent's ability to gather information was closely associated with its ability to answer location questions, with sufficient information scores tracking question-answering accuracy. This disconnect between the DT's high sufficient information scores and its relatively poor question-answering abilities was likely a result of the answer prediction head underfitting the training data.
The BERT QA model achieved a QA score of 0.270 with a sufficient information score of 0.264 in the random map setting. In the fixed map setting, the model scored a QA accuracy of 0.232 with a sufficient information score of 0.232. Thus, the BERT model outperformed the QA accuracy of all models trained in the 500 games setting on both fixed and random maps. The QA accuracy mirroring the sufficient information score indicates perfect performance on questions about the location of objects. For fixed map questions, the BERT model's accuracy is seemingly limited by whether or not the agent manages to navigate into the correct state, explaining the equal scores. For the random map setting, however, the BERT model learnt to use additional context to achieve a higher QA accuracy than its sufficient information score. This decoupling of the BERT QA model from the DT means that a question can still be correctly answered even if the DT stops in an incorrect state. This high accuracy mirrors the results on the held-out validation set used when training the BERT QA model.
The DT scored a QA accuracy of 0.722 and a sufficient information score of 0.277, outperforming all previous sufficient information and QA accuracy baselines in the 500 games random map setting. However, while outperforming the DDQN and Rainbow, the DT failed to outperform the DQN in the fixed map setting, achieving an accuracy of 0.668 and a sufficient information score of 0.254. For the BERT model, zero-shot evaluation on the test set resulted in a QA accuracy of 0.654 for the random map and 0.626 for the fixed map setting. Thus, despite having shown promise on the hold-out set during training, the BERT QA model could not outperform the DT's answer prediction head or the previous QA baselines in either the fixed or random map setting. While it still scored well above the random baseline, this underperformance is likely a result of overfitting the training set.
The Decision Transformer scored higher sufficient information on the test set than all previous RL baselines, achieving 0.058 in the random map and 0.057 in the fixed map setting. Additionally, the QA accuracy of the answer prediction head surpassed the DDQN and Rainbow in the random map 500 games setting, with a QA accuracy of 0.526, but failed to outperform the DQN. For the fixed map, similar results were observed: the DT outperformed the DDQN and Rainbow, scoring 0.504, but failed to beat the DQN. Despite the superior knowledge-gathering ability of the DT, its question-answering capacity could not outperform the prior QA baselines' maximum accuracies of 0.530 and 0.534 for the 500 games random and fixed map settings, respectively. During zero-shot evaluation on the test set, the BERT QA model scored similarly to the baselines set by Yuan et al., beating the random map baselines but failing to outperform the fixed map results in the 500 games setting. While BERT outperformed Rainbow and the DDQN in the fixed map setting, scoring 0.524, it was unable to surpass the QA accuracy of the DQN. On the other hand, the BERT model outperformed all baselines in the 500 games random map setting, achieving the highest sufficient information score on the attribute test set and a QA accuracy of 0.538. In both settings, we attribute the success of DT-BERT to the BERT model's leveraging of pre-trained language embeddings.
This paper shows that the current reinforcement learning baselines set out in the QAit task can be matched and improved upon by framing IQA as a supervised sequence modelling problem and using a Transformer architecture for action generation and answer prediction. Additionally, we showed improved sample efficiency during training, using a smaller training set than reinforcement learning necessitates, consisting of suboptimal data generated via random rollouts. Moreover, when fine-tuning a BERT model on this same dataset for question-answering and allowing it to work in tandem with the Decision Transformer for action generation, several QAit QA baselines were outperformed.
For future work, we propose testing the limits of the Decision Transformer's sample efficiency and generalisability. This would require the DT to be trained on extremely data-deprived samples consisting of fewer than 10K trajectories. Furthermore, we suggest an approach that makes little use of exploration rewards, instead relying purely on the Decision Transformer navigating to the correct state. This would help to better understand the DT's ability to assign long-term credit in sparse-reward settings, testing the extent to which its reward agnosticism can be utilised. Moreover, we propose using a Longformer architecture, which allows for a far greater context window of up to 4096 tokens, to test the extent to which extra information can aid interactive question-answering. By increasing the context length, we believe QA performance would drastically improve. Lastly, we suggest fine-tuning a BERT model, or equivalent, to embed the states and actions passed to the Decision Transformer's prediction heads, in the hope of utilising pre-existing language understanding to better generate actions and predict answers.