From 31fcd637b719b7590136cd6895041124a53d658b Mon Sep 17 00:00:00 2001
From: Christian <christian.geishauser@hhu.de>
Date: Mon, 5 Feb 2024 15:58:58 +0100
Subject: [PATCH] updated readme

---
 README.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 95 insertions(+), 2 deletions(-)

# RECORD - public

This is the code repository for our work **Learning with an Open Horizon in Ever-Changing Dialogue Circumstances**.

This work proposes using the lifetime return and meta-learning of hyperparameters to enhance continual reinforcement learning training. We optimized DDPT, the state-of-the-art architecture for continual RL of dialogue policies (see https://aclanthology.org/2022.coling-1.21/).

As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user-simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.

## Installation

The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:

https://github.com/ConvLab/ConvLab-3

In addition, meta-learning and evaluation require the higher and rliable libraries:

```
pip install higher
pip install -U rliable
```

## Training

The code for training models can be found in the folders ppo_DPT and vtrace_DPT. We explain the usage with vtrace_DPT; it works analogously for ppo_DPT.

### Train with rule-based simulator

The rule-based simulator has different configurations that output either only a few or several actions per turn. We train with them using the following two configurations:

```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json
```

### Train with transformer-based simulator TUS

```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json
```

### Train with all three simulators

We can leverage all simulators during learning with the following command:

```
python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0
```

Training can be run with various seeds to obtain multiple runs. The results are stored in the folder `experiments` and moved to `finished_experiments` once a run has finished.

### Leveraging Lifetime Return and Meta Learning

In the config file `convlab/policy/vtrace_DPT/config.json`, we specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both:

- lifetime_weight: a number between 0 and 1; 0 means no lifetime return, 1 means using the lifetime return
- only_lifetime: true or false; true means only the lifetime return is used, false means both the lifetime return and the episodic return are used
- meta: true or false; true means meta-learning is used, false means no meta-learning is used
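For orientation, the corresponding entries in `config.json` could look roughly like the sketch below. Only the three field names are taken from the list above; the surrounding layout of the file and the concrete values are illustrative assumptions, not the shipped defaults.

```
{
    "lifetime_weight": 0.5,
    "only_lifetime": false,
    "meta": true
}
```

With such a setting, both the episodic and the lifetime return would contribute to training (mixed via the assumed weight of 0.5) and meta-learning of the hyperparameters would be enabled.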
### Specifying the Timeline

We provide the timelines used for the paper in `convlab/policy/ocl_utils/timelines`. A timeline specifies the following:

- timeline: a dictionary whose keys are domains; the values determine after how many dialogues a domain is introduced
- num_domain_probs: for every integer n, the probability of using n domains in a user goal
- domain_probs: for every domain, the probability of using that domain in a user goal
- new_domain_probs: the probability that the newly introduced domain is part of the user goal
- num_dialogues_stationary: the number of dialogues before the user demand changes
- std_deviation: specifies the variation of the user demand changes

During training, you specify the timeline_path to use in the config, e.g. in `semantic_level_config_ocl.json`. An illustrative sketch of a timeline is given at the end of this README.

## Evaluation

Let us assume we have run two experiments, one with meta-learning and one baseline, each with 5 different seeds.
We create the folders meta and baseline for the two experiments, each containing the individual seed folders, and we assume that meta and baseline lie in the folder meta-experiments:

meta-experiments
- meta
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
- baseline
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4

We can evaluate the experiments using the following command:

```
python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments
```

More generally, you pass a list of experiment names and the folder in which they are saved. The script then creates the different plots shown in the paper and saves them in the folder meta-experiments.
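For illustration, a timeline with the fields described in the section "Specifying the Timeline" might look roughly like the sketch below. The field names follow that list; the JSON layout, the domain names, and all numbers are illustrative assumptions and do not reproduce the timelines shipped in `convlab/policy/ocl_utils/timelines`.

```
{
    "timeline": {"restaurant": 0, "hotel": 4000, "train": 8000},
    "num_domain_probs": {"1": 0.5, "2": 0.3, "3": 0.2},
    "domain_probs": {"restaurant": 0.4, "hotel": 0.3, "train": 0.3},
    "new_domain_probs": 0.6,
    "num_dialogues_stationary": 1000,
    "std_deviation": 0.1
}
```

Read with the field descriptions above, restaurant would be available from the start, hotel after 4000 dialogues, and train after 8000; a newly introduced domain would appear in a user goal with probability 0.6, and the user demand would change every 1000 dialogues with a spread governed by std_deviation.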