# RECORD - public
This is the code repository for our work **Learning with an Open Horizon in Ever-Changing Dialogue Circumstances**.

This work proposes using the lifetime return and meta-learning of hyperparameters to enhance continual reinforcement learning training. We optimize the state-of-the-art architecture for continual RL of dialogue policies called DDPT (see https://aclanthology.org/2022.coling-1.21/).

As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.
## Installation
The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:
https://github.com/ConvLab/ConvLab-3
In addition, meta-learning and evaluation require the `higher` and `rliable` libraries, which can be installed with
```
pip install higher
pip install -U rliable
```
## Training
The code for training models can be found in the folders `ppo_DPT` and `vtrace_DPT`. We explain the usage with `vtrace_DPT`; it works analogously for `ppo_DPT`.
### Train with rule-based simulator
The rule-based simulator has different configurations that output either only few or many actions per turn. We train with them using the following two configurations:
```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json
```
### Train with transformer-based simulator TUS
```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json
```
### Train with all three simulators
We can leverage all simulators during learning with the following command:
```
python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0
```
We can run the training with various seeds to obtain multiple runs. The results are stored in the folder `experiments` and moved to `finished_experiments` once training is done.
### Leveraging Lifetime Return and Meta Learning
In the config file `convlab/policy/vtrace_DPT/config.json`, we can specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both (see the example snippet after this list):
- `lifetime_weight`: number between 0 and 1; 0 means no lifetime return, 1 means using the lifetime return
- `only_lifetime`: true or false; true means only the lifetime return is used, false means both the lifetime return and the episodic return are used
- `meta`: true or false; true means meta-learning is used, false means no meta-learning is used
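For illustration, the relevant entries might look as follows. This is a minimal sketch that assumes the three keys sit at the top level of `config.json`; the values are examples only (here, combining episodic and lifetime return with a weight of 0.5 and enabling meta-learning), and the actual file contains further options.
```
{
    "lifetime_weight": 0.5,
    "only_lifetime": false,
    "meta": true
}
```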
### Specifying the Timeline
We provide the timelines used for the paper in `convlab/policy/ocl_utils/timelines`. You specify the following:
- `timeline`: a dictionary whose keys are domains; the values determine after how many dialogues the domain is introduced
- `num_domain_probs`: for every integer n, the probability of using n domains in a user goal
- `domain_probs`: for every domain, the probability of using the domain in a user goal
- `new_domain_probs`: probability that the newly introduced domain is part of the user goal
- `num_dialogues_stationary`: number of dialogues before the user demand changes
- `std_deviation`: specifies the variation of the user demand changes

During training, you specify the `timeline_path` to use in the config, e.g. in `semantic_level_config_ocl.json`.
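For illustration, a timeline file could look roughly like the sketch below. All domain names, dialogue counts, and probabilities here are hypothetical; consult the files in `convlab/policy/ocl_utils/timelines` for the exact schema and the values used in the paper.
```
{
    "timeline": {"hotel": 0, "restaurant": 2000, "train": 4000},
    "num_domain_probs": {"1": 0.6, "2": 0.3, "3": 0.1},
    "domain_probs": {"hotel": 0.4, "restaurant": 0.3, "train": 0.3},
    "new_domain_probs": 0.5,
    "num_dialogues_stationary": 1000,
    "std_deviation": 100
}
```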
## Evaluation
Let us assume we have run two experiments, one with meta-learning and one baseline, each with 5 different seeds.
We create the folders `meta` and `baseline` for the two experiments, each containing the seed folders, and assume both lie in the folder `meta-experiments`:
meta-experiments
- meta
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
- baseline
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
We can evaluate the experiments using the following command
```
python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments
```
More generally, you pass a list of experiment names and the folder they are saved in. The script then creates the different plots shown in the paper and saves them in the folder `meta-experiments`.