Commit b5df0874 authored by zqwerty
update unified datasets README: no longer use HF `Datasets`; add README for unified MultiWOZ 2.1

# Unified data format
## Overview
We transform different datasets into a unified format under the `data/unified_datasets` directory.
Each dataset contains at least these files:
- `dialogues.json`: a list of all dialogues in the dataset (see the loading sketch after this list).
- other necessary files such as databases.
- `dummy_data.json`: a list of 10 dialogues from `dialogues.json` for illustration.
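Since the processed datasets are no longer uploaded to Hugging Face's `Datasets`, you read `data.zip` directly. Below is a minimal loading sketch; the dataset name `multiwoz21` and the in-archive path `data/dialogues.json` are assumptions for illustration:

```python
import json
from zipfile import ZipFile

# Read the unified dialogues straight out of the processed archive.
# The dataset name and in-archive path below are illustrative; adjust
# them to the dataset you are actually loading.
with ZipFile('data/unified_datasets/multiwoz21/data.zip') as zf:
    dialogues = json.loads(zf.read('data/dialogues.json'))

print(f'{len(dialogues)} dialogues loaded')
```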
Datasets that require database interaction should also include the following file:
- `database.py`: load the database and define the query function:
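Each dataset ships its own `database.py`, so the snippet below is only a sketch of the shape such a class might take; the class name, constructor, storage layout, and `query` signature are assumptions, not the repo's actual interface:

```python
import json
from zipfile import ZipFile

class Database:
    def __init__(self):
        # Assumed layout: per-domain entity tables packed inside data.zip.
        with ZipFile('data.zip') as zf:
            self.db = json.loads(zf.read('data/database.json'))

    def query(self, domain, constraints, topk=None):
        """Return entities of `domain` matching (slot, value) constraints."""
        results = [
            entity for entity in self.db.get(domain, [])
            if all(entity.get(slot) == value for slot, value in constraints)
        ]
        return results if topk is None else results[:topk]
```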
We first introduce the unified format of `ontology` and `dialogues`. To transform a new dataset into the unified format:
2. Write `preprocess.py` to transform the original dataset into the unified format, producing `data.zip` and `dummy_data.json` (a skeleton sketch follows this list).
3. Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validity of the processed dataset and get data statistics.
4. Write `README.md` to describe the data following [How to create dataset README](#how-to-create-dataset-readme).
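To make step 2 concrete, here is a hypothetical skeleton of `preprocess.py`; the ontology keys and in-archive paths are illustrative assumptions, not a required interface:

```python
import json
from zipfile import ZipFile, ZIP_DEFLATED

def preprocess():
    # Build the unified ontology and dialogue list from the original data.
    ontology = {'domains': {}, 'intents': {}, 'state': {}}  # assumed keys
    dialogues = []  # one dict per dialogue in the unified format
    # ... transform the original annotations here ...

    # Pack the transformed files into data.zip for check.py to validate.
    with ZipFile('data.zip', 'w', ZIP_DEFLATED) as zf:
        zf.writestr('data/ontology.json', json.dumps(ontology, indent=2))
        zf.writestr('data/dialogues.json', json.dumps(dialogues, indent=2))

    # Keep the first 10 dialogues as dummy_data.json for illustration.
    with open('dummy_data.json', 'w') as f:
        json.dump(dialogues[:10], f, indent=2)

if __name__ == '__main__':
    preprocess()
```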
### Ontology
Other attributes are optional.
> **Necessary**: Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validity of the processed dataset and get data statistics in `data/unified_datasets/$dataset/stat.txt`.
### How to create dataset README
Each dataset has a `README.md` to describe the original and transformed data. Please follow `README_TEMPLATE.md` and make sure that you:
- include your name and email in **Who transforms the dataset**.
- include the following additional information in the **Dataset Summary** section:
  - How to get the transformed data from original data.
  - Main changes of the transformation.
  - Annotations: whether it has user goal, dialogue acts, state, db results, etc.
- include the data statistics given by `check.py` (in `data/unified_datasets/$dataset/stat.txt`) in the **Data Splits** section.
## Dataset Card for [dataset name]
- **Repository:** data link
- **Paper:** paper link
- **Leaderboard:** leaderboard link if any
- **Who transforms the dataset:** Name (email)
### Dataset Summary
Describe the dataset.
- **How to get the transformed data from original data:**
- TODO
- **Main changes of the transformation:**
- TODO
- **Annotations:**
- TODO
### Supported Tasks and Leaderboards
TODO
### Languages
TODO
### Data Splits
Please copy the statistics in `stat.txt` generated by `check.py` and paste them here.
### Licensing Information
TODO
The relevant excerpt from `check.py`:

```python
if __name__ == '__main__':
    ...
    stat = check_dialogues(name, dialogues, ontology)
    print('pass')
    print(f'Please copy and paste the statistics in {name}/stat.txt to dataset README.md->Data Splits section\n')
    with open(f'{name}/stat.txt', 'w') as f:
        print(stat, file=f)
        print('', file=f)
```
## Dataset Card for MultiWOZ 2.1
- **Repository:** https://github.com/budzianowski/multiwoz
- **Paper:** https://aclanthology.org/2020.lrec-1.53
- **Leaderboard:** https://github.com/budzianowski/multiwoz
- **Who transforms the dataset:** [Qi Zhu](mailto:zhuq96@gmail.com)
### Dataset Summary
MultiWOZ 2.1 fixed the noise in state annotations and dialogue utterances. It also includes user dialogue acts from ConvLab (Lee et al., 2019) as well as multiple slot descriptions per dialogue state slot.
- **How to get the transformed data from original data:**
  - Download [MultiWOZ_2.1.zip](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip).
  - Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
  - Create a new ontology in the unified format, taking slot descriptions from MultiWOZ 2.2.
  - Correct some grammar errors in the text, mainly following `tokenization.md` in MultiWOZ_2.1.
  - Normalize slot names and values; see the `normalize_domain_slot_value` function in `preprocess.py` (an illustrative sketch follows this list).
  - Correct some non-categorical slots' values and provide character-level span annotations.
- **Annotations:**
  - user goal, dialogue acts, state.
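As a purely illustrative sketch of the kind of cleanup `normalize_domain_slot_value` might perform (the mappings below are common MultiWOZ fixes, not the actual implementation in `preprocess.py`):

```python
# Hypothetical normalization rules; the real ones live in preprocess.py.
SLOT_NAME_MAP = {
    'pricerange': 'price range',
    'arriveby': 'arrive by',
    'leaveat': 'leave at',
}

def normalize_domain_slot_value(domain, slot, value):
    domain = domain.lower()
    slot = SLOT_NAME_MAP.get(slot.lower(), slot.lower())
    value = value.strip().lower()
    if value in ("don't care", 'dont care'):
        value = 'dontcare'  # unify the "don't care" variants
    return domain, slot, value
```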
### Supported Tasks and Leaderboards
NLU, DST, Policy, NLG, E2E, User simulator
### Languages
English
### Data Splits
| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
|------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
| train | 8438 | 113556 | 13.46 | 13.23 | 3.39 | 98.84 | 99.48 | 86.39 | 98.22 |
| validation | 1000 | 14748 | 14.75 | 13.5 | 3.64 | 98.84 | 99.46 | 86.59 | 98.17 |
| test | 1000 | 14744 | 14.74 | 13.5 | 3.59 | 99.21 | 99.32 | 85.83 | 98.58 |
| all | 10438 | 143048 | 13.7 | 13.28 | 3.44 | 98.88 | 99.47 | 86.36 | 98.25 |
- **cat slot match**: percentage of categorical slot values that appear among the possible values in the ontology.
- **non-cat slot span**: percentage of non-categorical slot values that have character-level span annotations.
### Licensing Information
Apache License, Version 2.0