Unverified Commit 37af9ecf authored by zhuqi's avatar zhuqi Committed by GitHub

Merge pull request #46 from ConvLab/unified_dataset

add metalwoz dataset in unified format
parents dd3ae1e2 280101a0
# README
# Dataset Card for MetaLWOZ
## Features
- **Repository:** https://www.microsoft.com/en-us/research/project/metalwoz/
- **Paper:** https://www.microsoft.com/en-us/research/publication/results-of-the-multi-domain-task-completion-dialog-challenge/
- **Leaderboard:** None
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)
No sentence-level annotation; only the domain is annotated.
### Dataset Summary
Statistics:
This large dataset was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. Domains include bus schedules, apartment search, alarm setting, banking, and event reservation. Each dialog was grounded in a scenario with roles, pairing a person acting as the bot and a person acting as the user. (This is the Wizard of Oz reference: people behind the curtain act as the machine.) Each pair was given a domain and a task, and instructed to converse for 10 turns to satisfy the user's queries. For example, if a user asked whether a bus stop was operational, the bot would respond that the bus stop had been moved two blocks north, which starts a conversation that addresses the user's actual need.
| | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains |
| ----- | ------------ | ------------- | ---------- | ----------- | ---------- |
| train | 37884 | 362450 | 9.57 | 7.66 | - |
| test | 2319 | 21949 | 9.46 | 8.23 | - |
- **How to get the transformed data from original data:**
- Download [metalwoz-v1.zip](https://www.microsoft.com/en-us/download/58389) and [metalwoz-test-v1.zip](https://www.microsoft.com/en-us/download/100639).
- Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
- `CITY_INFO`, `HOME_BOT`, `NAME_SUGGESTER`, and `TIME_ZONE` are randomly selected as the validation domains.
- Remove the first utterance by the system, since it is "Hello how may I help you?" in most cases.
- Add goal description according to the original task description: user_role+user_prompt+system_role+system_prompt.
- **Annotations:**
- domain, goal
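
The transformation steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's `preprocess.py`: the helper names `build_goal_description`, `convert_turns`, and `assign_split` and the sample records are hypothetical, and the sketch uses a local `random.Random(42)` instead of seeding the global generator. The field names (`user_role`, `user_prompt`, `bot_role`, `bot_prompt`, `turns`) are the ones the preprocessing script reads from the original task and dialogue records.

```python
import random


def build_goal_description(task):
    """Join the four task fields into one goal description string."""
    return (
        "user role: {}. user prompt: {}. system role: {}. system prompt: {}."
    ).format(
        task["user_role"], task["user_prompt"], task["bot_role"], task["bot_prompt"]
    )


def convert_turns(original_turns):
    """Drop the opening system greeting, then alternate speakers starting with the user."""
    turns = []
    for utt_idx, utterance in enumerate(original_turns[1:]):
        speaker = "user" if utt_idx % 2 == 0 else "system"
        turns.append({"utt_idx": utt_idx, "speaker": speaker, "utterance": utterance})
    return turns


def assign_split(rng):
    """Choose a split for one training file: 'validation' with probability 1/10."""
    return rng.choice(["train"] * 9 + ["validation"])


# Hypothetical sample records, for illustration only.
task = {
    "user_role": "You are interacting with a bus schedule bot",
    "user_prompt": "Ask if a bus stop is operational",
    "bot_role": "You are a bot that manages public transit schedules",
    "bot_prompt": "Tell the user the stop has moved two blocks north",
}
dialog = {
    "turns": [
        "Hello how may I help you?",
        "Is the bus stop on Main Street open?",
        "That stop has been moved two blocks north.",
    ]
}

goal = build_goal_description(task)
turns = convert_turns(dialog["turns"])
split = assign_split(random.Random(42))
```

Because the split is drawn once per dialogue file rather than per dialogue, every dialogue in a file lands in the same split, which is how whole domains end up in the validation set.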
### Supported Tasks and Leaderboards
## Original data
RG, User simulator
- https://www.microsoft.com/en-us/research/project/metalwoz/
### Languages
English
### Data Splits
| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
|------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
| train | 34261 | 357092 | 10.42 | 7.48 | 1 | - | - | - | - |
| validation | 3623 | 37060 | 10.23 | 6.59 | 1 | - | - | - | - |
| test | 2319 | 23882 | 10.3 | 7.96 | 1 | - | - | - | - |
| all | 40203 | 418034 | 10.4 | 7.43 | 1 | - | - | - | - |
51 domains: ['AGREEMENT_BOT', 'ALARM_SET', 'APARTMENT_FINDER', 'APPOINTMENT_REMINDER', 'AUTO_SORT', 'BANK_BOT', 'BUS_SCHEDULE_BOT', 'CATALOGUE_BOT', 'CHECK_STATUS', 'CITY_INFO', 'CONTACT_MANAGER', 'DECIDER_BOT', 'EDIT_PLAYLIST', 'EVENT_RESERVE', 'GAME_RULES', 'GEOGRAPHY', 'GUINESS_CHECK', 'HOME_BOT', 'HOW_TO_BASIC', 'INSURANCE', 'LIBRARY_REQUEST', 'LOOK_UP_INFO', 'MAKE_RESTAURANT_RESERVATIONS', 'MOVIE_LISTINGS', 'MUSIC_SUGGESTER', 'NAME_SUGGESTER', 'ORDER_PIZZA', 'PET_ADVICE', 'PHONE_PLAN_BOT', 'PHONE_SETTINGS', 'PLAY_TIMES', 'POLICY_BOT', 'PRESENT_IDEAS', 'PROMPT_GENERATOR', 'QUOTE_OF_THE_DAY_BOT', 'RESTAURANT_PICKER', 'SCAM_LOOKUP', 'SHOPPING', 'SKI_BOT', 'SPORTS_INFO', 'STORE_DETAILS', 'TIME_ZONE', 'UPDATE_CALENDAR', 'UPDATE_CONTACT', 'WEATHER_CHECK', 'WEDDING_PLANNER', 'WHAT_IS_IT', 'BOOKING_FLIGHT', 'HOTEL_RESERVE', 'TOURISM', 'VACATION_IDEAS']
- **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
- **non-cat slot span**: how many values of non-categorical slots have span annotation in percentage.
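
As a quick sanity check on the split table, the `all` row and the `avg_utt` column can be recomputed from the per-split counts. The sketch below assumes `avg_utt` is simply utterances divided by dialogues, rounded to two decimals; the variable names are illustrative.

```python
# (dialogues, utterances) per split, copied from the table above.
splits = {
    "train": (34261, 357092),
    "validation": (3623, 37060),
    "test": (2319, 23882),
}

total_dialogues = sum(d for d, _ in splits.values())
total_utterances = sum(u for _, u in splits.values())

# Average utterances per dialogue, assuming a plain ratio rounded to 2 decimals.
avg_utt = {name: round(u / d, 2) for name, (d, u) in splits.items()}
avg_utt["all"] = round(total_utterances / total_dialogues, 2)

print(total_dialogues, total_utterances, avg_utt)
```

Under that assumption the recomputed totals and averages agree with the table, and train plus validation add up to the 37,884 dialogues in the original training archive.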
### Citation
```
@inproceedings{li2020results,
author = {Li, Jinchao and Peng, Baolin and Lee, Sungjin and Gao, Jianfeng and Takanobu, Ryuichi and Zhu, Qi and Minlie Huang and Schulz, Hannes and Atkinson, Adam and Adada, Mahmoud},
title = {Results of the Multi-Domain Task-Completion Dialog Challenge},
booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence, Eighth Dialog System Technology Challenge Workshop},
year = {2020},
month = {February},
url = {https://www.microsoft.com/en-us/research/publication/results-of-the-multi-domain-task-completion-dialog-challenge/},
}
```
### Licensing Information
[Microsoft Research Data License Agreement](https://msropendata-web-api.azurewebsites.net/licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/view)
File deleted
File deleted
{
"domains": {
"AGREEMENT_BOT": {
"description": "",
"slots": {}
},
"ALARM_SET": {
"description": "",
"slots": {}
},
"APARTMENT_FINDER": {
"description": "",
"slots": {}
},
"APPOINTMENT_REMINDER": {
"description": "",
"slots": {}
},
"AUTO_SORT": {
"description": "",
"slots": {}
},
"BANK_BOT": {
"description": "",
"slots": {}
},
"BUS_SCHEDULE_BOT": {
"description": "",
"slots": {}
},
"CATALOGUE_BOT": {
"description": "",
"slots": {}
},
"CHECK_STATUS": {
"description": "",
"slots": {}
},
"CITY_INFO": {
"description": "",
"slots": {}
},
"CONTACT_MANAGER": {
"description": "",
"slots": {}
},
"DECIDER_BOT": {
"description": "",
"slots": {}
},
"EDIT_PLAYLIST": {
"description": "",
"slots": {}
},
"EVENT_RESERVE": {
"description": "",
"slots": {}
},
"GAME_RULES": {
"description": "",
"slots": {}
},
"GEOGRAPHY": {
"description": "",
"slots": {}
},
"GUINESS_CHECK": {
"description": "",
"slots": {}
},
"HOME_BOT": {
"description": "",
"slots": {}
},
"HOW_TO_BASIC": {
"description": "",
"slots": {}
},
"INSURANCE": {
"description": "",
"slots": {}
},
"LIBRARY_REQUEST": {
"description": "",
"slots": {}
},
"LOOK_UP_INFO": {
"description": "",
"slots": {}
},
"MAKE_RESTAURANT_RESERVATIONS": {
"description": "",
"slots": {}
},
"MOVIE_LISTINGS": {
"description": "",
"slots": {}
},
"MUSIC_SUGGESTER": {
"description": "",
"slots": {}
},
"NAME_SUGGESTER": {
"description": "",
"slots": {}
},
"ORDER_PIZZA": {
"description": "",
"slots": {}
},
"PET_ADVICE": {
"description": "",
"slots": {}
},
"PHONE_PLAN_BOT": {
"description": "",
"slots": {}
},
"PHONE_SETTINGS": {
"description": "",
"slots": {}
},
"PLAY_TIMES": {
"description": "",
"slots": {}
},
"POLICY_BOT": {
"description": "",
"slots": {}
},
"PRESENT_IDEAS": {
"description": "",
"slots": {}
},
"PROMPT_GENERATOR": {
"description": "",
"slots": {}
},
"QUOTE_OF_THE_DAY_BOT": {
"description": "",
"slots": {}
},
"RESTAURANT_PICKER": {
"description": "",
"slots": {}
},
"SCAM_LOOKUP": {
"description": "",
"slots": {}
},
"SHOPPING": {
"description": "",
"slots": {}
},
"SKI_BOT": {
"description": "",
"slots": {}
},
"SPORTS_INFO": {
"description": "",
"slots": {}
},
"STORE_DETAILS": {
"description": "",
"slots": {}
},
"TIME_ZONE": {
"description": "",
"slots": {}
},
"UPDATE_CALENDAR": {
"description": "",
"slots": {}
},
"UPDATE_CONTACT": {
"description": "",
"slots": {}
},
"WEATHER_CHECK": {
"description": "",
"slots": {}
},
"WEDDING_PLANNER": {
"description": "",
"slots": {}
},
"WHAT_IS_IT": {
"description": "",
"slots": {}
},
"BOOKING_FLIGHT": {
"description": "",
"slots": {}
},
"HOTEL_RESERVE": {
"description": "",
"slots": {}
},
"TOURISM": {
"description": "",
"slots": {}
},
"VACATION_IDEAS": {
"description": "",
"slots": {}
}
},
"intents": {},
"binary_dialogue_act": [],
"state": {}
}
\ No newline at end of file
import json
import os
from zipfile import ZipFile, ZIP_DEFLATED
import random
import json_lines
dataset = 'metalwoz'
self_dir = os.path.dirname(os.path.abspath(__file__))
DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(self_dir)), 'data')
# origin_data_dir = os.path.join(DATA_PATH, dataset)
origin_data_dir = self_dir
from collections import Counter
from shutil import rmtree
def preprocess():
random.seed(42)
ontology = {
'domains': {},
'intents': {},
'binary_dialogue_act': [],
'state': {}
'state': {},
"dialogue_acts": {
"categorical": {},
"non-categorical": {},
"binary": {}
}
}
def process_dialog(ori_dialog, split, dialog_id):
    dataset = 'metalwoz'
    splits = ['train', 'validation', 'test']
    dialogues_by_split = {split: [] for split in splits}
    ZipFile('metalwoz-test-v1.zip').extract('dstc8_metalwoz_heldout.zip')
    cnt = Counter()
    for filename in ['metalwoz-v1.zip', 'dstc8_metalwoz_heldout.zip']:
        with ZipFile(filename) as zipfile:
            task_id2description = {x['task_id']: x for x in json_lines.reader(zipfile.open('tasks.txt'))}
            for path in zipfile.namelist():
                if path.startswith('dialogues'):
                    if filename == 'metalwoz-v1.zip':
                        split = random.choice(['train']*9+['validation'])
                    else:
                        split = 'test'
                    if split == 'validation':
                        print(path, split)
                    for ori_dialog in json_lines.reader(zipfile.open(path)):
                        dialogue_id = f'{dataset}-{split}-{len(dialogues_by_split[split])}'
                        domain = ori_dialog['domain']
                        task_des = task_id2description[ori_dialog['task_id']]
                        goal = {
                            'description': "user role: {}. user prompt: {}. system role: {}. system prompt: {}.".format(
                                task_des['user_role'], task_des['user_prompt'], task_des['bot_role'], task_des['bot_prompt']),
                            'inform': {},
                            'request': {}
                        }
                        dialogue = {
                            'dataset': dataset,
                            'data_split': split,
                            'dialogue_id': dialogue_id,
                            'original_id': ori_dialog['id'],
                            'domains': [domain], # will be updated by dialog_acts and state
                            'goal': goal,
                            'turns': []
                        }
ontology['domains'][domain] = {
'description': "",
'description': task_des['bot_role'],
'slots': {}
}
dialog = {
"dataset": dataset,
"data_split": split,
"dialogue_id": f'{dataset}_{dialog_id}',
"original_id": ori_dialog['id'],
"domains": [domain],
}
turns = []
# starts with system
cnt[ori_dialog['turns'][0]] += 1
# assert ori_dialog['turns'][0] == "how may I help you?", print(ori_dialog['turns'])
for utt_idx, utt in enumerate(ori_dialog['turns'][1:]):
speaker = 'user' if utt_idx % 2 == 0 else 'system'
turn = {
'utt_idx': utt_idx,
'speaker': speaker,
'utterance': utt,
'dialogue_act': {
'utt_idx': utt_idx,
'dialogue_acts': {
'categorical': [],
'non-categorical': [],
'binary': [],
},
}
if utt_idx % 2 == 0:
turn['speaker'] = 'user'
turn['state'] = {}
turn['state_update'] = {
'categorical': [],
'non-categorical': [],
}
if speaker == 'system':
turn['db_results'] = {}
else:
turn['speaker'] = 'system'
turns.append(turn)
if turns[-1]['speaker'] == 'system':
turns.pop()
dialog['turns'] = turns
return dialog
turn['state'] = {}
dialogue['turns'].append(turn)
dialog_id = 0
data = []
with ZipFile(os.path.join(origin_data_dir, 'metalwoz-v1.zip')) as zipfile:
for path in zipfile.namelist():
if path.startswith('dialogues'):
for dialog in json_lines.reader(zipfile.open(path)):
data.append(process_dialog(dialog, 'train', dialog_id))
dialog_id += 1
dialogues_by_split[split].append(dialogue)
ZipFile(os.path.join(origin_data_dir, 'metalwoz-test-v1.zip')).extract('dstc8_metalwoz_heldout.zip')
with ZipFile(os.path.join('dstc8_metalwoz_heldout.zip')) as zipfile:
for path in zipfile.namelist():
if path.startswith('dialogues'):
for dialog in json_lines.reader(zipfile.open(path)):
data.append(process_dialog(dialog, 'test', dialog_id))
dialog_id += 1
os.remove('dstc8_metalwoz_heldout.zip')
json.dump(ontology, open(os.path.join(self_dir, 'ontology.json'), 'w'))
json.dump(data, open('data.json', 'w'), indent=4)
ZipFile(os.path.join(self_dir, 'data.zip'), 'w', ZIP_DEFLATED).write('data.json')
os.remove('data.json')
    new_data_dir = 'data'
    os.makedirs(new_data_dir, exist_ok=True)
    dialogues = []
    for split in splits:
        dialogues += dialogues_by_split[split]
    json.dump(dialogues[:10], open(f'dummy_data.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
    json.dump(ontology, open(f'{new_data_dir}/ontology.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
    json.dump(dialogues, open(f'{new_data_dir}/dialogues.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
    with ZipFile('data.zip', 'w', ZIP_DEFLATED) as zf:
        for filename in os.listdir(new_data_dir):
            zf.write(f'{new_data_dir}/{filename}')
    rmtree(new_data_dir)
    return dialogues, ontology
if __name__ == '__main__':
......