From 31fcd637b719b7590136cd6895041124a53d658b Mon Sep 17 00:00:00 2001
From: Christian <christian.geishauser@hhu.de>
Date: Mon, 5 Feb 2024 15:58:58 +0100
Subject: [PATCH] updated readme

---
 README.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 95 insertions(+), 2 deletions(-)

# RECORD - public

This is the code repository for our work **Learning with an Open Horizon in Ever-Changing Dialogue Circumstances**.

This work proposes using the lifetime return and meta-learning of hyperparameters to enhance continual reinforcement learning training. We optimized DDPT, the state-of-the-art architecture for continual RL of dialogue policies (see https://aclanthology.org/2022.coling-1.21/).

As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user-simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.

## Installation

The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:

https://github.com/ConvLab/ConvLab-3

In addition, meta-learning and evaluation require the higher and rliable libraries:

```
pip install higher
pip install -U rliable
```

## Training

The code for training models can be found in the folders ppo_DPT and vtrace_DPT. We explain the usage with vtrace_DPT; it works analogously for ppo_DPT.

### Train with rule-based simulator

The rule-based simulator has different configurations that output either only a few or several actions per turn. We train with them using the following two configurations:

```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json
```

### Train with transformer-based simulator TUS

```
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json
```

### Train with all three simulators

We can leverage all simulators during learning with the following command:

```
python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0
```

Training can be run with various seeds to obtain multiple runs. The results are stored in the folder `experiments` and moved to `finished_experiments` once a run has finished.

### Leveraging Lifetime Return and Meta Learning

In the config file `convlab/policy/vtrace_DPT/config.json`, we specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both:

- lifetime_weight: a number between 0 and 1; 0 means no lifetime return, 1 means using the lifetime return
- only_lifetime: true or false; true means only the lifetime return is used, false means both the lifetime return and the episodic return are used
- meta: true or false; true means meta-learning is used, false means no meta-learning is used
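For orientation, the corresponding entries in `config.json` could look roughly like the sketch below. Only the three field names are taken from the list above; the surrounding layout of the file and the concrete values are illustrative assumptions, not the shipped defaults.

```
{
    "lifetime_weight": 0.5,
    "only_lifetime": false,
    "meta": true
}
```

With such a setting, both the episodic and the lifetime return would contribute to training (mixed via the assumed weight of 0.5) and meta-learning of the hyperparameters would be enabled.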
### Specifying the Timeline

We provide the timelines used for the paper in `convlab/policy/ocl_utils/timelines`. A timeline specifies the following:

- timeline: a dictionary whose keys are domains; the values determine after how many dialogues a domain is introduced
- num_domain_probs: for every integer n, the probability of using n domains in a user goal
- domain_probs: for every domain, the probability of using that domain in a user goal
- new_domain_probs: the probability that the newly introduced domain is part of the user goal
- num_dialogues_stationary: the number of dialogues before the user demand changes
- std_deviation: specifies the variation of the user demand changes

During training, you specify the timeline_path to use in the config, e.g. in `semantic_level_config_ocl.json`. An illustrative sketch of a timeline is given at the end of this README.

## Evaluation

Let us assume we have run two experiments, one with meta-learning and one baseline, each with 5 different seeds.
We create the folders meta and baseline for the two experiments, each containing the individual seed folders, and we assume that meta and baseline lie in the folder meta-experiments:

meta-experiments
- meta
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4
- baseline
  - seed_0
  - seed_1
  - seed_2
  - seed_3
  - seed_4

We can evaluate the experiments using the following command:

```
python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments
```

More generally, you pass a list of experiment names and the folder in which they are saved. The script then creates the different plots shown in the paper and saves them in the folder meta-experiments.
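For illustration, a timeline with the fields described in the section "Specifying the Timeline" might look roughly like the sketch below. The field names follow that list; the JSON layout, the domain names, and all numbers are illustrative assumptions and do not reproduce the timelines shipped in `convlab/policy/ocl_utils/timelines`.

```
{
    "timeline": {"restaurant": 0, "hotel": 4000, "train": 8000},
    "num_domain_probs": {"1": 0.5, "2": 0.3, "3": 0.2},
    "domain_probs": {"restaurant": 0.4, "hotel": 0.3, "train": 0.3},
    "new_domain_probs": 0.6,
    "num_dialogues_stationary": 1000,
    "std_deviation": 0.1
}
```

Read with the field descriptions above, restaurant would be available from the start, hotel after 4000 dialogues, and train after 8000; a newly introduced domain would appear in a user goal with probability 0.6, and the user demand would change every 1000 dialogues with a spread governed by std_deviation.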