diff --git a/data/unified_datasets/README.md b/data/unified_datasets/README.md
index 6a3bb3db37f3dd523ae123cfe006c899dbc93151..7406169db869929dda862b04794a7e8412e3a227 100644
--- a/data/unified_datasets/README.md
+++ b/data/unified_datasets/README.md
@@ -1,11 +1,7 @@
 # Unified data format
 ## Overview
-We transform different datasets into a unified format under `data/unified_datasets` directory. We also upload processed datasets to Hugging Face's `Datasets`, which can be loaded by:
-```python
-from datasets import load_dataset
-dataset = load_dataset('ConvLab/$dataset')
-```
+We transform different datasets into a unified format under the `data/unified_datasets` directory.

 Each dataset contains at least these files:

@@ -24,8 +20,6 @@ if __name__ == '__main__':
 - `dialogues.json`: a list of all dialogues in the dataset.
 - other necessary files such as databases.
 - `dummy_data.json`: a list of 10 dialogues from `dialogues.json` for illustration.
-- `$dataset.py`: dataset loading script for Hugging Face's `Datasets`.
-- `dataset_infos.json`: dataset metadata for Hugging Face's `Datasets`.

 Datasets that require database interaction should also include the following file:
 - `database.py`: load the database and define the query function:
@@ -44,7 +38,6 @@ We first introduce the unified format of `ontology` and `dialogues`. To transfor
 2. Write `preprocess.py` to transform the original dataset into the unified format, producing `data.zip` and `dummy_data.json`.
 3. Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validation of processed dataset and get data statistics.
 4. Write `README.md` to describe the data following [How to create dataset README](#how-to-create-dataset-readme).
-5. Add `$dataset.py` and `dataset_info.json` following this [instruction](https://huggingface.co/docs/datasets/dataset_script.html) (Here no need to generate dummy data). Upload the dataset directory to Hugging Face's `Datasets` following this [instruction](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) (set `--organization` to `ConvLab`).

 ### Ontology
@@ -104,9 +97,10 @@ Other attributes are optional.
 > **Necessary**: Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validation of processed dataset and get data statistics in `data/unified_datasets/$dataset/stat.txt`.

 ### How to create dataset README
-Each dataset has a README.md to describe the original and transformed data. Follow the Hugging Face's [dataset card creation](https://huggingface.co/docs/datasets/dataset_card.html) to export `README.md`. Make sure that you:
-- include your name and email in the **Urls->Point of Contact** section.
-- include the following additional information in the **Dataset Description->Dataset Summary** section:
-  - How to get the transformed data from original data and what are the main changes.
+Each dataset has a README.md to describe the original and transformed data. Please follow `README_TEMPLATE.md` and make sure that you:
+- include your name and email in the **Who transforms the dataset** field.
+- include the following additional information in the **Dataset Summary** section:
+  - How to get the transformed data from original data.
+  - Main changes of the transformation.
   - Annotations: whether has user goal, dialogue acts, state, db results, etc.
-- include the data statistics given by `check.py` (in `data/unified_datasets/$dataset/stat.txt`) in the **Dataset Structure->Data Splits** section.
+- include the data statistics given by `check.py` (in `data/unified_datasets/$dataset/stat.txt`) in the **Data Splits** section.
diff --git a/data/unified_datasets/README_TEMPLATE.md b/data/unified_datasets/README_TEMPLATE.md
new file mode 100644
index 0000000000000000000000000000000000000000..ab49414e83cbf7b00725450fde896fc0cde443d6
--- /dev/null
+++ b/data/unified_datasets/README_TEMPLATE.md
@@ -0,0 +1,33 @@
+## Dataset Card for [dataset name]
+
+- **Repository:** data link
+- **Paper:** paper link
+- **Leaderboard:** leaderboard link if any
+- **Who transforms the dataset:** Name (email)
+
+### Dataset Summary
+
+Describe the dataset.
+
+- **How to get the transformed data from original data:**
+  - TODO
+- **Main changes of the transformation:**
+  - TODO
+- **Annotations:**
+  - TODO
+
+### Supported Tasks and Leaderboards
+
+TODO
+
+### Languages
+
+TODO
+
+### Data Splits
+
+Please copy the statistics in `stat.txt` generated by `check.py` and paste them here.
+
+### Licensing Information
+
+TODO
\ No newline at end of file
diff --git a/data/unified_datasets/check.py b/data/unified_datasets/check.py
index 40af072d7d01fa830b8a6a5b534f8cf1516353e1..47e75e602b5009e0a6b9bbf53b7a1291491522a6 100644
--- a/data/unified_datasets/check.py
+++ b/data/unified_datasets/check.py
@@ -342,7 +342,7 @@ if __name__ == '__main__':
         stat = check_dialogues(name, dialogues, ontology)
         print('pass')
-        print(f'Please copy-and-paste the statistics in {name}/stat.txt to dataset README.md->Data Splits section\n')
+        print(f'Please copy and paste the statistics in {name}/stat.txt to dataset README.md->Data Splits section\n')
         with open(f'{name}/stat.txt', 'w') as f:
             print(stat, file=f)
             print('', file=f)
diff --git a/data/unified_datasets/multiwoz21/README.md b/data/unified_datasets/multiwoz21/README.md
index 803ebf3ad3ee6c4b6aaf710a739f9518bf5d5321..ae2af6b1b32d4edcceda3613ee2d7f0b83c08288 100644
--- a/data/unified_datasets/multiwoz21/README.md
+++ b/data/unified_datasets/multiwoz21/README.md
@@ -1,31 +1,46 @@
-# README
+## Dataset Card for MultiWOZ 2.1

-## Features
+- **Repository:** https://github.com/budzianowski/multiwoz
+- **Paper:** https://aclanthology.org/2020.lrec-1.53
+- **Leaderboard:** https://github.com/budzianowski/multiwoz
+- **Who transforms the dataset:** [Qi Zhu](mailto:zhuq96@gmail.com)

-- Annotations: dialogue act, character-level span for non-categorical slots. state and state updates.
+### Dataset Summary

-Statistics:
+MultiWOZ 2.1 fixed the noise in state annotations and dialogue utterances. It also includes user dialogue acts from ConvLab (Lee et al., 2019) as well as multiple slot descriptions per dialogue state slot.

-|       | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains |
-| ----- | ------------ | ------------- | ---------- | ----------- | ---------- |
-| train | 8434         | 105066        | 12.46      | 17.27       | 7          |
-| dev   | 999          | 13731         | 13.74      | 17.72       | 7          |
-| train | 1000         | 13744         | 13.74      | 17.67       | 7          |
+- **How to get the transformed data from original data:**
+  - Download [MultiWOZ_2.1.zip](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip).
+  - Run `python preprocess.py` in the current directory.
+- **Main changes of the transformation:**
+  - Create a new ontology in the unified format, taking slot descriptions from MultiWOZ 2.2.
+  - Correct some grammar errors in the text, mainly following `tokenization.md` in MultiWOZ_2.1.
+  - Normalize slot names and values. See the `normalize_domain_slot_value` function in `preprocess.py`.
+  - Correct some non-categorical slots' values and provide character-level span annotations.
+- **Annotations:**
+  - user goal, dialogue acts, state.

+### Supported Tasks and Leaderboards

-## Main changes
+NLU, DST, Policy, NLG, E2E, User simulator

-- only keep 5 domains in state annotations and dialog acts.
-- `pricerange`, `area`, `day`, `internet`, `parking`, `stars` are considered categorical slots.
-- punctuation marks are split from their previous tokens. e.g `I want to find a hotel. ->
-  I want to find a hotel .`
+### Languages

-Run `evaluate.py`:
+English

-da values match rate: 97.944
-state values match rate: 66.017
+### Data Splits

-### original data
+| split      | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
+|------------|-----------|------------|---------|------------|-------------|-----------------------|----------------------|------------------------------|---------------------------------|
+| train      | 8438      | 113556     | 13.46   | 13.23      | 3.39        | 98.84                 | 99.48                | 86.39                        | 98.22                           |
+| validation | 1000      | 14748      | 14.75   | 13.5       | 3.64        | 98.84                 | 99.46                | 86.59                        | 98.17                           |
+| test       | 1000      | 14744      | 14.74   | 13.5       | 3.59        | 99.21                 | 99.32                | 85.83                        | 98.58                           |
+| all        | 10438     | 143048     | 13.7    | 13.28      | 3.44        | 98.88                 | 99.47                | 86.36                        | 98.25                           |

-- from [multiwoz](https://github.com/budzianowski/multiwoz) repo.
+9 domains: ['attraction', 'hotel', 'taxi', 'restaurant', 'train', 'police', 'hospital', 'booking', 'general']
+- **cat slot match**: how many values of categorical slots are in the possible values of the ontology.
+- **non-cat slot span**: how many values of non-categorical slots have span annotations.

+### Licensing Information
+
+Apache License, Version 2.0
\ No newline at end of file
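A note on consuming the transformed files: since this changeset keeps `dialogues.json` inside `data.zip` (with `dummy_data.json` left unpacked for quick inspection), downstream code can read the dialogue list straight from the archive. The sketch below assumes the JSON files sit under a `data/` prefix inside the zip; that member path, and the `load_dialogues` helper itself, are illustrative assumptions rather than part of this diff.

```python
import json
import zipfile


def load_dialogues(dataset_dir: str) -> list:
    """Read the full dialogue list out of a unified dataset's data.zip."""
    with zipfile.ZipFile(f'{dataset_dir}/data.zip') as archive:
        # 'data/dialogues.json' is an assumed member path inside the archive.
        with archive.open('data/dialogues.json') as f:
            return json.load(f)


if __name__ == '__main__':
    dialogues = load_dialogues('multiwoz21')
    print(len(dialogues), 'dialogues; keys of the first one:', sorted(dialogues[0]))
```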
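The `database.py` requirement in the README diff ("load the database and define the query function") is easiest to see as code. Below is a minimal sketch under assumptions the diff does not pin down: the `BaseDatabase`/`Database` class names, the `query(domain, state, topk)` signature, and the `'database/{domain}_db.json'` archive layout are all illustrative.

```python
import json
import zipfile
from abc import ABC, abstractmethod


class BaseDatabase(ABC):
    """Assumed interface that each dataset's database.py implements."""

    @abstractmethod
    def query(self, domain: str, state: dict, topk: int) -> list:
        """Return up to `topk` entities (slot-value dicts) for `domain` matching `state`."""


class Database(BaseDatabase):
    def __init__(self):
        # Load per-domain entity lists shipped inside the dataset's data.zip.
        # The member path 'database/{domain}_db.json' is an assumption.
        archive = zipfile.ZipFile('data.zip')
        self.dbs = {}
        for domain in ['restaurant', 'hotel']:
            with archive.open(f'database/{domain}_db.json') as f:
                self.dbs[domain] = json.load(f)

    def query(self, domain: str, state: dict, topk: int) -> list:
        # Naive exact-match filtering on the informed slot-value constraints.
        results = []
        for entity in self.dbs.get(domain, []):
            if all(entity.get(slot) == value for slot, value in state.items() if value):
                results.append(entity)
                if len(results) == topk:
                    break
        return results
```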
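Step 2 of the workflow, writing `preprocess.py`, ends the same way for every dataset: serialize the unified ontology and dialogue list into `data.zip`, then dump the first 10 dialogues as `dummy_data.json`. A skeleton of that tail end follows; `build_ontology`, `transform_dialogue`, and the raw-file name are hypothetical placeholders for dataset-specific logic, and the `data/` member prefix matches the assumption above.

```python
import json
import zipfile


def build_ontology() -> dict:
    """Dataset-specific: domains, slots, and possible values in the unified format."""
    raise NotImplementedError


def transform_dialogue(raw: dict) -> dict:
    """Dataset-specific: map one original dialogue onto the unified schema."""
    raise NotImplementedError


def preprocess():
    # 'original_data.json' stands in for whatever the raw download is called.
    with open('original_data.json') as f:
        raw_dialogues = json.load(f)

    ontology = build_ontology()
    dialogues = [transform_dialogue(d) for d in raw_dialogues]

    # Package the transformed data; member paths mirror the assumed layout.
    with zipfile.ZipFile('data.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.writestr('data/ontology.json', json.dumps(ontology, indent=2))
        archive.writestr('data/dialogues.json', json.dumps(dialogues, indent=2))

    # dummy_data.json: the first 10 dialogues, kept outside the zip for illustration.
    with open('dummy_data.json', 'w') as f:
        json.dump(dialogues[:10], f, indent=2)


if __name__ == '__main__':
    preprocess()
```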