To train agents to interact well with humans, we need to be able to measure progress. But human interaction is complex, and measuring progress is difficult. In this work, we developed a method, called the Standardised Test Suite (STS), for evaluating agents in temporally extended, multi-modal interactions. We examined interactions that consist of human participants…
