Commit b5df0874 authored by zqwerty
update unified datasets README: no longer use HF `Datasets`; add README for unified MultiWOZ 2.1

# Unified data format
## Overview
We transform different datasets into a unified format under the `data/unified_datasets` directory.
Each dataset contains at least these files:
- `dialogues.json`: a list of all dialogues in the dataset (see the loading sketch after this list).
- other necessary files such as databases.
- `dummy_data.json`: a list of 10 dialogues from `dialogues.json` for illustration.
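Since the processed datasets are no longer uploaded to Hugging Face's `Datasets`, you read `data.zip` directly. Below is a minimal loading sketch; the dataset name `multiwoz21` and the in-archive path `data/dialogues.json` are assumptions for illustration:

```python
import json
from zipfile import ZipFile

# Read the unified dialogues straight out of the processed archive.
# The dataset name and in-archive path below are illustrative; adjust
# them to the dataset you are actually loading.
with ZipFile('data/unified_datasets/multiwoz21/data.zip') as zf:
    dialogues = json.loads(zf.read('data/dialogues.json'))

print(f'{len(dialogues)} dialogues loaded')
```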
Datasets that require database interaction should also include the following file:
- `database.py`: load the database and define the query function:
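Each dataset ships its own `database.py`, so the snippet below is only a sketch of the shape such a class might take; the class name, constructor, storage layout, and `query` signature are assumptions, not the repo's actual interface:

```python
import json
from zipfile import ZipFile

class Database:
    def __init__(self):
        # Assumed layout: per-domain entity tables packed inside data.zip.
        with ZipFile('data.zip') as zf:
            self.db = json.loads(zf.read('data/database.json'))

    def query(self, domain, constraints, topk=None):
        """Return entities of `domain` matching (slot, value) constraints."""
        results = [
            entity for entity in self.db.get(domain, [])
            if all(entity.get(slot) == value for slot, value in constraints)
        ]
        return results if topk is None else results[:topk]
```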
We first introduce the unified format of `ontology` and `dialogues`. To transform a new dataset into the unified format:
2. Write `preprocess.py` to transform the original dataset into the unified format, producing `data.zip` and `dummy_data.json` (a skeleton sketch follows this list).
3. Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validity of the processed dataset and get data statistics.
4. Write `README.md` to describe the data following [How to create dataset README](#how-to-create-dataset-readme).
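To make step 2 concrete, here is a hypothetical skeleton of `preprocess.py`; the ontology keys and in-archive paths are illustrative assumptions, not a required interface:

```python
import json
from zipfile import ZipFile, ZIP_DEFLATED

def preprocess():
    # Build the unified ontology and dialogue list from the original data.
    ontology = {'domains': {}, 'intents': {}, 'state': {}}  # assumed keys
    dialogues = []  # one dict per dialogue in the unified format
    # ... transform the original annotations here ...

    # Pack the transformed files into data.zip for check.py to validate.
    with ZipFile('data.zip', 'w', ZIP_DEFLATED) as zf:
        zf.writestr('data/ontology.json', json.dumps(ontology, indent=2))
        zf.writestr('data/dialogues.json', json.dumps(dialogues, indent=2))

    # Keep the first 10 dialogues as dummy_data.json for illustration.
    with open('dummy_data.json', 'w') as f:
        json.dump(dialogues[:10], f, indent=2)

if __name__ == '__main__':
    preprocess()
```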
### Ontology
Other attributes are optional.
> **Necessary**: Run `python check.py $dataset` in the `data/unified_datasets` directory to check the validity of the processed dataset and get data statistics in `data/unified_datasets/$dataset/stat.txt`.
### How to create dataset README
Each dataset has a `README.md` to describe the original and transformed data. Please follow `README_TEMPLATE.md` and make sure that you:
- include your name and email in **Who transforms the dataset**.
- include the following additional information in the **Dataset Summary** section:
  - How to get the transformed data from original data.
  - Main changes of the transformation.
  - Annotations: whether it has user goal, dialogue acts, state, db results, etc.
- include the data statistics given by `check.py` (in `data/unified_datasets/$dataset/stat.txt`) in the **Data Splits** section.
## Dataset Card for [dataset name]
- **Repository:** data link
- **Paper:** paper link
- **Leaderboard:** leaderboard link if any
- **Who transforms the dataset:** Name (email)
### Dataset Summary
Describe the dataset.
- **How to get the transformed data from original data:**
- TODO
- **Main changes of the transformation:**
- TODO
- **Annotations:**
- TODO
### Supported Tasks and Leaderboards
TODO
### Languages
TODO
### Data Splits
Please copy the statistics in `stat.txt` generated by `check.py` and paste them here.
### Licensing Information
TODO
The relevant excerpt from `check.py`:

```python
if __name__ == '__main__':
    ...
    stat = check_dialogues(name, dialogues, ontology)
    print('pass')
    print(f'Please copy and paste the statistics in {name}/stat.txt to dataset README.md->Data Splits section\n')
    with open(f'{name}/stat.txt', 'w') as f:
        print(stat, file=f)
        print('', file=f)
```
## Dataset Card for MultiWOZ 2.1
- **Repository:** https://github.com/budzianowski/multiwoz
- **Paper:** https://aclanthology.org/2020.lrec-1.53
- **Leaderboard:** https://github.com/budzianowski/multiwoz
- **Who transforms the dataset:** [Qi Zhu](mailto:zhuq96@gmail.com)
### Dataset Summary
MultiWOZ 2.1 fixed the noise in state annotations and dialogue utterances. It also includes user dialogue acts from ConvLab (Lee et al., 2019) as well as multiple slot descriptions per dialogue state slot.
- **How to get the transformed data from original data:**
  - Download [MultiWOZ_2.1.zip](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip).
  - Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
  - Create a new ontology in the unified format, taking slot descriptions from MultiWOZ 2.2.
  - Correct some grammar errors in the text, mainly following `tokenization.md` in MultiWOZ_2.1.
  - Normalize slot names and values; see the `normalize_domain_slot_value` function in `preprocess.py` (an illustrative sketch follows this list).
  - Correct some non-categorical slots' values and provide character-level span annotations.
- **Annotations:**
  - user goal, dialogue acts, state.
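As a purely illustrative sketch of the kind of cleanup `normalize_domain_slot_value` might perform (the mappings below are common MultiWOZ fixes, not the actual implementation in `preprocess.py`):

```python
# Hypothetical normalization rules; the real ones live in preprocess.py.
SLOT_NAME_MAP = {
    'pricerange': 'price range',
    'arriveby': 'arrive by',
    'leaveat': 'leave at',
}

def normalize_domain_slot_value(domain, slot, value):
    domain = domain.lower()
    slot = SLOT_NAME_MAP.get(slot.lower(), slot.lower())
    value = value.strip().lower()
    if value in ("don't care", 'dont care'):
        value = 'dontcare'  # unify the "don't care" variants
    return domain, slot, value
```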
### Supported Tasks and Leaderboards
NLU, DST, Policy, NLG, E2E, User simulator
### Languages
English
### Data Splits
| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
|------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
| train | 8438 | 113556 | 13.46 | 13.23 | 3.39 | 98.84 | 99.48 | 86.39 | 98.22 |
| validation | 1000 | 14748 | 14.75 | 13.5 | 3.64 | 98.84 | 99.46 | 86.59 | 98.17 |
| test | 1000 | 14744 | 14.74 | 13.5 | 3.59 | 99.21 | 99.32 | 85.83 | 98.58 |
| all | 10438 | 143048 | 13.7 | 13.28 | 3.44 | 98.88 | 99.47 | 86.36 | 98.25 |
- **cat slot match**: percentage of categorical slot values that appear among the possible values in the ontology.
- **non-cat slot span**: percentage of non-categorical slot values that have character-level span annotations.
### Licensing Information
Apache License, Version 2.0