# Dialogue Evaluation with Offline Reinforcement Learning

Code for the paper "Dialogue Evaluation with Offline Reinforcement Learning".
## The code will be released in September 2022
<p align="center">
  <img width="700" src="all2.pdf">
</p>
In this paper, we propose the use of offline reinforcement learning for dialogue evaluation based on static data. Such an evaluator is typically called a critic and utilized for policy optimization. We go one step further and show that offline RL critics can be trained for any dialogue system as external evaluators, allowing dialogue performance comparisons across various types of systems. This approach has the benefit of being corpus- and model-independent, while attaining strong correlation with human judgements, which we confirm via an interactive user trial.
## Data
data.zip includes the following data:
- Pre-processed MultiWOZ 2.0 and 2.1
- Generated responses from AuGPT, HDSA with gold action labels, and HDSA with predicted action labels
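
If you want to inspect the provided files before training, the minimal sketch below unzips the archive and loads one of the response files. The exact JSON schema is not documented here, so the access pattern shown is an assumption to check against the actual data.

```python
import json
import zipfile

# Extract the archive into the working directory (the paths inside the zip
# are assumed to mirror the data/ layout referenced elsewhere in this README).
with zipfile.ZipFile("data.zip") as archive:
    archive.extractall(".")

# Load one of the provided system-response files; the structure printed below
# should be inspected before writing files for your own system.
with open("data/augpt/test-predictions.json") as f:
    predictions = json.load(f)

print(type(predictions))
# Peek at the first entry to see the actual schema (list vs. dict is assumed).
first = predictions[0] if isinstance(predictions, list) else next(iter(predictions.items()))
print(first)
```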
## Structure
The model implementations, together with the training and evaluation scripts, are under **latent_dialog**.
The scripts for running the experiments are under **experiment_woz**. The trained models and evaluation results are under **experiment_woz/sys_config_log_model**.
## Policy Optimization
To use the critic for policy optimization, training is done in two steps:
### Step 1: SL pre-training with shared response generation and VAE objectives
```bash
python mt_gauss.py
```
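
For orientation, the sketch below shows how the two objectives named in this step are typically combined: a token-level cross-entropy loss for response generation plus a KL term for the Gaussian latent variable. All function and tensor names are illustrative assumptions, not the repository's actual API; the real implementation lives in mt_gauss.py and **latent_dialog**.

```python
import torch
import torch.nn.functional as F

def sl_pretraining_loss(logits, target_tokens, mu, logvar, kl_weight=1.0, pad_id=0):
    """Illustrative combination of the two SL objectives described above.

    logits:        (batch, seq_len, vocab) decoder outputs for response generation
    target_tokens: (batch, seq_len) gold response token ids
    mu, logvar:    (batch, latent_dim) parameters of the Gaussian latent posterior
    Names and shapes here are assumptions for illustration only.
    """
    # Response-generation objective: token-level cross entropy over the vocabulary.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=pad_id,
    )
    # VAE objective: KL divergence between the Gaussian posterior and a
    # standard normal prior, averaged over the batch.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return recon + kl_weight * kl
```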
### Step 2: Offline RL in the latent action space
```bash
python plas_gauss.py
```
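
As a rough illustration of what offline RL in the latent action space involves, the snippet below sketches a single TD-style critic update over (state, latent action) pairs drawn from the static data. The batch layout, module interfaces, and function names are assumptions made for this sketch; see plas_gauss.py for the actual training procedure.

```python
import torch
import torch.nn.functional as F

def critic_td_update(q_net, target_q_net, policy, batch, optimizer, gamma=0.99):
    """One hypothetical TD update for a Q-critic over latent actions.

    `batch` is assumed to hold tensors: state, latent_action, reward,
    next_state, done. These names are illustrative, not the codebase's API.
    """
    state, action, reward, next_state, done = (
        batch["state"], batch["latent_action"], batch["reward"],
        batch["next_state"], batch["done"],
    )
    with torch.no_grad():
        # The policy proposes the next latent action; the target network
        # provides a bootstrapped value estimate for the Bellman backup.
        next_action = policy(next_state)
        target = reward + gamma * (1.0 - done) * target_q_net(next_state, next_action)

    # Regress the critic's estimate toward the bootstrapped target.
    td_loss = F.mse_loss(q_net(state, action), target)
    optimizer.zero_grad()
    td_loss.backward()
    optimizer.step()
    return td_loss.item()
```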
## Policy Evaluation
To train a critic after the fact as an evaluator for a fixed policy, first extract responses from the policy for the MultiWOZ training, validation, and test sets. The responses should be in the same JSON format as data/augpt/test-predictions.json.
In this codebase, responses from AuGPT, HDSA with gold action labels, and HDSA with predicted action labels are provided in data.zip.
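
To evaluate your own system, you will need to dump its responses in the same format. The sketch below assumes a simple list-of-records layout purely for illustration; the field names and output path are hypothetical, so match the actual schema of data/augpt/test-predictions.json before training the critic on your outputs.

```python
import json

# Hypothetical record layout: one entry per dialogue turn with the dialogue id
# and the system's generated response. Verify against the provided AuGPT file.
responses = [
    {"dialogue_id": "PMUL0001.json", "turn": 0, "response": "There are several restaurants in the centre."},
    # ... one entry per turn for the training, validation, and test splits
]

# Hypothetical output path for your own system's predictions.
with open("my_system/test-predictions.json", "w") as f:
    json.dump(responses, f, indent=2)
```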