# RECORD - public
This is the code repository for our work **Learning with an Open Horizon in Ever-Changing Dialogue Circumstances**.

This work proposes using the lifetime return and meta-learning of hyperparameters to enhance continual reinforcement learning training. We optimize the state-of-the-art architecture for continual RL of dialogue policies called DDPT (see https://aclanthology.org/2022.coling-1.21/).

As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.
## Installation
The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:
https://github.com/ConvLab/ConvLab-3
In addition, meta-learning and evaluation require the `higher` and `rliable` libraries, which can be installed with
```
pip install higher
pip install -U rliable
```
## Training
The code for training models can be found in the folders `ppo_DPT` and `vtrace_DPT`. We explain the usage with `vtrace_DPT`; it works analogously for `ppo_DPT`.
### Train with rule-based simulator
The rule-based simulator has different configurations that output either only few or many actions per turn. We train with them using the following two configurations:
```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json
```
### Train with transformer-based simulator TUS
```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json
```
### Train with all three simulators
We can leverage all simulators during learning with the following command:
```
python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0
```
We can run the training with various seeds to obtain multiple runs. The results are stored in the folder `experiments` and moved to `finished_experiments` once training is done.
### Leveraging Lifetime Return and Meta Learning
In the config file `convlab/policy/vtrace_DPT/config.json`, we can specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both (see the example snippet after this list):
- `lifetime_weight`: number between 0 and 1; 0 means no lifetime return, 1 means using the lifetime return
- `only_lifetime`: true or false; true means only the lifetime return is used, false means both the lifetime return and the episodic return are used
- `meta`: true or false; true means meta-learning is used, false means no meta-learning is used
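For illustration, the relevant entries might look as follows. This is a minimal sketch that assumes the three keys sit at the top level of `config.json`; the values are examples only (here, combining episodic and lifetime return with a weight of 0.5 and enabling meta-learning), and the actual file contains further options.
```
{
    "lifetime_weight": 0.5,
    "only_lifetime": false,
    "meta": true
}
```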
### Specifying the Timeline
We provide the timelines used for the paper in `convlab/policy/ocl_utils/timelines`. You specify the following:
- `timeline`: a dictionary whose keys are domains; the values determine after how many dialogues the domain is introduced
- `num_domain_probs`: for every integer n, the probability of using n domains in a user goal
- `domain_probs`: for every domain, the probability of using the domain in a user goal
- `new_domain_probs`: probability that the newly introduced domain is part of the user goal
- `num_dialogues_stationary`: number of dialogues before the user demand changes
- `std_deviation`: specifies the variation of the user demand changes

During training, you specify the `timeline_path` to use in the config, e.g. in `semantic_level_config_ocl.json`.
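For illustration, a timeline file could look roughly like the sketch below. All domain names, dialogue counts, and probabilities here are hypothetical; consult the files in `convlab/policy/ocl_utils/timelines` for the exact schema and the values used in the paper.
```
{
    "timeline": {"hotel": 0, "restaurant": 2000, "train": 4000},
    "num_domain_probs": {"1": 0.6, "2": 0.3, "3": 0.1},
    "domain_probs": {"hotel": 0.4, "restaurant": 0.3, "train": 0.3},
    "new_domain_probs": 0.5,
    "num_dialogues_stationary": 1000,
    "std_deviation": 100
}
```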
## Evaluation
Let us assume we have run two experiments, one with meta-learning and one baseline, each with 5 different seeds.
We create the folders `meta` and `baseline` for the two experiments, each containing the seed folders, and assume both lie in the folder `meta-experiments`:
meta-experiments
- meta
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
- baseline
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
We can evaluate the experiments using the following command
```
python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments
```
More generally, you pass a list of experiment names and the folder they are saved in. The script then creates the different plots shown in the paper and saves them in the folder `meta-experiments`.