Merge pull request #46 from ConvLab/unified_dataset

add metalwoz dataset in unified format

Unverified Commit 37af9ecf authored 3 years ago by zhuqi Committed by GitHub 3 years ago
--- a/data/unified_datasets/metalwoz/README.md
+++ b/data/unified_datasets/metalwoz/README.md
-# README
+# Dataset Card for MetaLWOZ

-## Features
+- **Repository:** https://www.microsoft.com/en-us/research/project/metalwoz/
+- **Paper:** https://www.microsoft.com/en-us/research/publication/results-of-the-multi-domain-task-completion-dialog-challenge/
+- **Leaderboard:** None
+- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)

-No sentence-level annotation. Only annotate domain.
+### Dataset Summary

-Statistics: 
+This large dataset was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. Domains include bus schedules, apartment search, alarm setting, banking, and event reservation. Each dialog was grounded in a scenario with roles, pairing a person acting as the bot and a person acting as the user. (This is the Wizard of Oz reference: people behind the curtain act as the machine.) Each pair was given a domain and a task, and instructed to converse for 10 turns to satisfy the user’s queries. For example, if a user asked if a bus stop was operational, the bot would respond that the bus stop had been moved two blocks north, which starts a conversation that addresses the user’s actual need.

-|       | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains |
-| ----- | ------------ | ------------- | ---------- | ----------- | ---------- |
-| train | 37884         | 362450       | 9.57    | 7.66       | -          |
-| test | 2319        | 21949         | 9.46       | 8.23       | -          |
+- **How to get the transformed data from original data:** 
+  - Download [metalwoz-v1.zip](https://www.microsoft.com/en-us/download/58389) and [metalwoz-test-v1.zip](https://www.microsoft.com/en-us/download/100639).
+  - Run `python preprocess.py` in the current directory.
+- **Main changes of the transformation:**
+  - `CITY_INFO`, `HOME_BOT`, `NAME_SUGGESTER`, and `TIME_ZONE` are randomly selected as the validation domains.
+  - Remove the first utterance by the system since it is "Hello how may I help you?" in most cases.
+  - Add goal description according to the original task description: user_role+user_prompt+system_role+system_prompt.
+- **Annotations:**
+  - domain, goal
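
The transformation steps listed above can be sketched as follows. This is a minimal illustration rather than the commit's `preprocess.py`: the function names `build_goal_description` and `to_unified_turns` and the sample record are hypothetical, while the field names (`user_role`, `user_prompt`, `bot_role`, `bot_prompt`) and the description format follow the script in this commit.

```python
def build_goal_description(task):
    """Join the four task fields into a single goal description string."""
    return "user role: {}. user prompt: {}. system role: {}. system prompt: {}.".format(
        task["user_role"], task["user_prompt"], task["bot_role"], task["bot_prompt"])


def to_unified_turns(original_turns):
    """Drop the opening system greeting and label the remaining turns.

    Once the greeting is removed the user speaks first, so even
    indices are user turns and odd indices are system turns.
    """
    turns = []
    for utt_idx, utterance in enumerate(original_turns[1:]):
        speaker = "user" if utt_idx % 2 == 0 else "system"
        turns.append({"speaker": speaker, "utterance": utterance, "utt_idx": utt_idx})
    return turns


# Hypothetical sample record, made up for illustration only.
sample_task = {
    "user_role": "You are interacting with a bus schedule bot",
    "user_prompt": "Ask if a bus stop is operational",
    "bot_role": "You are a bot that manages public transit schedules",
    "bot_prompt": "Tell the user the stop has moved two blocks north",
}
sample_turns = [
    "Hello how may I help you?",
    "Is the bus stop on Main Street open?",
    "It has been moved two blocks north.",
]

unified = to_unified_turns(sample_turns)
goal = build_goal_description(sample_task)
```

Indexing the turns after the greeting is dropped keeps `utt_idx` starting at 0 with a user turn, which is what the speaker assignment relies on.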

+### Supported Tasks and Leaderboards

-## Original data
+Response generation (RG), user simulator

- https://www.microsoft.com/en-us/research/project/metalwoz/
+### Languages
+
+English
+
+### Data Splits
+
+| split      |   dialogues |   utterances |   avg_utt |   avg_tokens |   avg_domains | cat slot match(state)   | cat slot match(goal)   | cat slot match(dialogue act)   | non-cat slot span(dialogue act)   |
+|------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
+| train      |       34261 |       357092 |     10.42 |         7.48 |             1 | -                       | -                      | -                              | -                                 |
+| validation |        3623 |        37060 |     10.23 |         6.59 |             1 | -                       | -                      | -                              | -                                 |
+| test       |        2319 |        23882 |     10.3  |         7.96 |             1 | -                       | -                      | -                              | -                                 |
+| all        |       40203 |       418034 |     10.4  |         7.43 |             1 | -                       | -                      | -                              | -                                 |
+
+51 domains: ['AGREEMENT_BOT', 'ALARM_SET', 'APARTMENT_FINDER', 'APPOINTMENT_REMINDER', 'AUTO_SORT', 'BANK_BOT', 'BUS_SCHEDULE_BOT', 'CATALOGUE_BOT', 'CHECK_STATUS', 'CITY_INFO', 'CONTACT_MANAGER', 'DECIDER_BOT', 'EDIT_PLAYLIST', 'EVENT_RESERVE', 'GAME_RULES', 'GEOGRAPHY', 'GUINESS_CHECK', 'HOME_BOT', 'HOW_TO_BASIC', 'INSURANCE', 'LIBRARY_REQUEST', 'LOOK_UP_INFO', 'MAKE_RESTAURANT_RESERVATIONS', 'MOVIE_LISTINGS', 'MUSIC_SUGGESTER', 'NAME_SUGGESTER', 'ORDER_PIZZA', 'PET_ADVICE', 'PHONE_PLAN_BOT', 'PHONE_SETTINGS', 'PLAY_TIMES', 'POLICY_BOT', 'PRESENT_IDEAS', 'PROMPT_GENERATOR', 'QUOTE_OF_THE_DAY_BOT', 'RESTAURANT_PICKER', 'SCAM_LOOKUP', 'SHOPPING', 'SKI_BOT', 'SPORTS_INFO', 'STORE_DETAILS', 'TIME_ZONE', 'UPDATE_CALENDAR', 'UPDATE_CONTACT', 'WEATHER_CHECK', 'WEDDING_PLANNER', 'WHAT_IS_IT', 'BOOKING_FLIGHT', 'HOTEL_RESERVE', 'TOURISM', 'VACATION_IDEAS']
+- **cat slot match**: the percentage of categorical slot values that appear among the possible values listed in the ontology.
+- **non-cat slot span**: the percentage of non-categorical slot values that have a span annotation.
+
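
As a quick consistency check on the Data Splits table above, the per-split counts can be summed and the `avg_utt` column recomputed. The snippet below is only a sketch: the numbers are copied from the table, and the variable names are chosen here for illustration.

```python
# Per-split counts copied from the Data Splits table.
splits = {
    "train": {"dialogues": 34261, "utterances": 357092},
    "validation": {"dialogues": 3623, "utterances": 37060},
    "test": {"dialogues": 2319, "utterances": 23882},
}

total_dialogues = sum(s["dialogues"] for s in splits.values())
total_utterances = sum(s["utterances"] for s in splits.values())

# Average utterances per dialogue, rounded to two decimals as in the table.
avg_utt = {name: round(s["utterances"] / s["dialogues"], 2) for name, s in splits.items()}
overall_avg_utt = round(total_utterances / total_dialogues, 2)

# train + validation should equal the 37,884 dialogs of the original release.
original_release = splits["train"]["dialogues"] + splits["validation"]["dialogues"]

print(total_dialogues)    # 40203
print(total_utterances)   # 418034
print(avg_utt)            # {'train': 10.42, 'validation': 10.23, 'test': 10.3}
print(overall_avg_utt)    # 10.4
print(original_release)   # 37884
```

The totals match the `all` row, and train plus validation recovers the 37,884 dialogs quoted in the Dataset Summary.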
+### Citation
+
+```
+@inproceedings{li2020results,
+    author = {Li, Jinchao and Peng, Baolin and Lee, Sungjin and Gao, Jianfeng and Takanobu, Ryuichi and Zhu, Qi and Huang, Minlie and Schulz, Hannes and Atkinson, Adam and Adada, Mahmoud},
+    title = {Results of the Multi-Domain Task-Completion Dialog Challenge},
+    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence, Eighth Dialog System Technology Challenge Workshop},
+    year = {2020},
+    month = {February},
+    url = {https://www.microsoft.com/en-us/research/publication/results-of-the-multi-domain-task-completion-dialog-challenge/},
+}
+```
+
+### Licensing Information
+
+[Microsoft Research Data License Agreement](https://msropendata-web-api.azurewebsites.net/licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/view)
--- a/data/unified_datasets/metalwoz/data.zip
+++ b/data/unified_datasets/metalwoz/data.zip
--- a/data/unified_datasets/metalwoz/dummy_data.json
+++ b/data/unified_datasets/metalwoz/dummy_data.json
--- a/data/unified_datasets/metalwoz/metalwoz-test-v1.zip
+++ b/data/unified_datasets/metalwoz/metalwoz-test-v1.zip
--- a/data/unified_datasets/metalwoz/metalwoz-v1.zip
+++ b/data/unified_datasets/metalwoz/metalwoz-v1.zip
--- a/data/unified_datasets/metalwoz/ontology.json
+++ b/data/unified_datasets/metalwoz/ontology.json
-{
-    "domains": {
-        "AGREEMENT_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "ALARM_SET": {
-            "description": "",
-            "slots": {}
-        },
-        "APARTMENT_FINDER": {
-            "description": "",
-            "slots": {}
-        },
-        "APPOINTMENT_REMINDER": {
-            "description": "",
-            "slots": {}
-        },
-        "AUTO_SORT": {
-            "description": "",
-            "slots": {}
-        },
-        "BANK_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "BUS_SCHEDULE_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "CATALOGUE_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "CHECK_STATUS": {
-            "description": "",
-            "slots": {}
-        },
-        "CITY_INFO": {
-            "description": "",
-            "slots": {}
-        },
-        "CONTACT_MANAGER": {
-            "description": "",
-            "slots": {}
-        },
-        "DECIDER_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "EDIT_PLAYLIST": {
-            "description": "",
-            "slots": {}
-        },
-        "EVENT_RESERVE": {
-            "description": "",
-            "slots": {}
-        },
-        "GAME_RULES": {
-            "description": "",
-            "slots": {}
-        },
-        "GEOGRAPHY": {
-            "description": "",
-            "slots": {}
-        },
-        "GUINESS_CHECK": {
-            "description": "",
-            "slots": {}
-        },
-        "HOME_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "HOW_TO_BASIC": {
-            "description": "",
-            "slots": {}
-        },
-        "INSURANCE": {
-            "description": "",
-            "slots": {}
-        },
-        "LIBRARY_REQUEST": {
-            "description": "",
-            "slots": {}
-        },
-        "LOOK_UP_INFO": {
-            "description": "",
-            "slots": {}
-        },
-        "MAKE_RESTAURANT_RESERVATIONS": {
-            "description": "",
-            "slots": {}
-        },
-        "MOVIE_LISTINGS": {
-            "description": "",
-            "slots": {}
-        },
-        "MUSIC_SUGGESTER": {
-            "description": "",
-            "slots": {}
-        },
-        "NAME_SUGGESTER": {
-            "description": "",
-            "slots": {}
-        },
-        "ORDER_PIZZA": {
-            "description": "",
-            "slots": {}
-        },
-        "PET_ADVICE": {
-            "description": "",
-            "slots": {}
-        },
-        "PHONE_PLAN_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "PHONE_SETTINGS": {
-            "description": "",
-            "slots": {}
-        },
-        "PLAY_TIMES": {
-            "description": "",
-            "slots": {}
-        },
-        "POLICY_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "PRESENT_IDEAS": {
-            "description": "",
-            "slots": {}
-        },
-        "PROMPT_GENERATOR": {
-            "description": "",
-            "slots": {}
-        },
-        "QUOTE_OF_THE_DAY_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "RESTAURANT_PICKER": {
-            "description": "",
-            "slots": {}
-        },
-        "SCAM_LOOKUP": {
-            "description": "",
-            "slots": {}
-        },
-        "SHOPPING": {
-            "description": "",
-            "slots": {}
-        },
-        "SKI_BOT": {
-            "description": "",
-            "slots": {}
-        },
-        "SPORTS_INFO": {
-            "description": "",
-            "slots": {}
-        },
-        "STORE_DETAILS": {
-            "description": "",
-            "slots": {}
-        },
-        "TIME_ZONE": {
-            "description": "",
-            "slots": {}
-        },
-        "UPDATE_CALENDAR": {
-            "description": "",
-            "slots": {}
-        },
-        "UPDATE_CONTACT": {
-            "description": "",
-            "slots": {}
-        },
-        "WEATHER_CHECK": {
-            "description": "",
-            "slots": {}
-        },
-        "WEDDING_PLANNER": {
-            "description": "",
-            "slots": {}
-        },
-        "WHAT_IS_IT": {
-            "description": "",
-            "slots": {}
-        },
-        "BOOKING_FLIGHT": {
-            "description": "",
-            "slots": {}
-        },
-        "HOTEL_RESERVE": {
-            "description": "",
-            "slots": {}
-        },
-        "TOURISM": {
-            "description": "",
-            "slots": {}
-        },
-        "VACATION_IDEAS": {
-            "description": "",
-            "slots": {}
-        }
-    },
-    "intents": {},
-    "binary_dialogue_act": [],
-    "state": {}
-}
\ No newline at end of file
--- a/data/unified_datasets/metalwoz/preprocess.py
+++ b/data/unified_datasets/metalwoz/preprocess.py
 import json
 import os
 from zipfile import ZipFile, ZIP_DEFLATED
-
+import random
 import json_lines
-
-
-dataset = 'metalwoz'
-self_dir = os.path.dirname(os.path.abspath(__file__))
-DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(self_dir)), 'data')
-# origin_data_dir = os.path.join(DATA_PATH, dataset)
-origin_data_dir = self_dir
+from collections import Counter
+from shutil import rmtree


 def preprocess():
+    random.seed(42)
+
    ontology = {
        'domains': {},
        'intents': {},
-        'binary_dialogue_act': [],
-        'state': {}
+        'state': {},
+        "dialogue_acts": {
+            "categorical": {},
+            "non-categorical": {},
+            "binary": {}
+        }
    }

-    def process_dialog(ori_dialog, split, dialog_id):
+    dataset = 'metalwoz'
+    splits = ['train', 'validation', 'test']
+    dialogues_by_split = {split: [] for split in splits}
+    ZipFile('metalwoz-test-v1.zip').extract('dstc8_metalwoz_heldout.zip')
+    cnt = Counter()
+    for filename in ['metalwoz-v1.zip', 'dstc8_metalwoz_heldout.zip']:
+        with ZipFile(filename) as zipfile:
+            task_id2description = {x['task_id']: x for x in json_lines.reader(zipfile.open('tasks.txt'))}
+            for path in zipfile.namelist():
+                if path.startswith('dialogues'):
+                    if filename == 'metalwoz-v1.zip':
+                        split = random.choice(['train']*9+['validation'])
+                    else:
+                        split = 'test'
+                    if split == 'validation':
+                        print(path, split)
+                    for ori_dialog in json_lines.reader(zipfile.open(path)):
+                        dialogue_id = f'{dataset}-{split}-{len(dialogues_by_split[split])}'
                        domain = ori_dialog['domain']
+
+                        task_des = task_id2description[ori_dialog['task_id']]
+
+                        goal = {
+                            'description': "user role: {}. user prompt: {}. system role: {}. system prompt: {}.".format(
+                                task_des['user_role'], task_des['user_prompt'], task_des['bot_role'], task_des['bot_prompt']),
+                            'inform': {},
+                            'request': {}
+                        }
+
+                        dialogue = {
+                            'dataset': dataset,
+                            'data_split': split,
+                            'dialogue_id': dialogue_id,
+                            'original_id': ori_dialog['id'],
+                            'domains': [domain],  # will be updated by dialog_acts and state
+                            'goal': goal,
+                            'turns': []
+                        }
+
                        ontology['domains'][domain] = {
-            'description': "",
+                            'description': task_des['bot_role'],
                            'slots': {}
                        }
-        dialog = {
-            "dataset": dataset,
-            "data_split": split,
-            "dialogue_id": f'{dataset}_{dialog_id}',
-            "original_id": ori_dialog['id'],
-            "domains": [domain],
-        }
-        turns = []
-        # starts with system
+                        cnt[ori_dialog['turns'][0]] += 1
+                        # assert ori_dialog['turns'][0] == "how may I help you?", print(ori_dialog['turns'])
                        for utt_idx, utt in enumerate(ori_dialog['turns'][1:]):
+                            speaker = 'user' if utt_idx % 2 == 0 else 'system'
                            turn = {
-                'utt_idx': utt_idx,
+                                'speaker': speaker,
                                'utterance': utt,
-                'dialogue_act': {
+                                'utt_idx': utt_idx,
+                                'dialogue_acts': {
                                    'categorical': [],
                                    'non-categorical': [],
                                    'binary': [],
-                },
                                }
-            if utt_idx % 2 == 0:
-                turn['speaker'] = 'user'
-                turn['state'] = {}
-                turn['state_update'] = {
-                    'categorical': [],
-                    'non-categorical': [],
                            }
+                            if speaker == 'system':
+                                turn['db_results'] = {}
                            else:
-                turn['speaker'] = 'system'
-            turns.append(turn)
-        if turns[-1]['speaker'] == 'system':
-            turns.pop()
-
-        dialog['turns'] = turns
-        return dialog
+                                turn['state'] = {}
+                            dialogue['turns'].append(turn)

-    dialog_id = 0
-    data = []
-    with ZipFile(os.path.join(origin_data_dir, 'metalwoz-v1.zip')) as zipfile:
-        for path in zipfile.namelist():
-            if path.startswith('dialogues'):
-                for dialog in json_lines.reader(zipfile.open(path)):
-                    data.append(process_dialog(dialog, 'train', dialog_id))
-                    dialog_id += 1
+                        dialogues_by_split[split].append(dialogue)

-    ZipFile(os.path.join(origin_data_dir, 'metalwoz-test-v1.zip')).extract('dstc8_metalwoz_heldout.zip')
-    with ZipFile(os.path.join('dstc8_metalwoz_heldout.zip')) as zipfile:
-        for path in zipfile.namelist():
-            if path.startswith('dialogues'):
-                for dialog in json_lines.reader(zipfile.open(path)):
-                    data.append(process_dialog(dialog, 'test', dialog_id))
-                    dialog_id += 1
    os.remove('dstc8_metalwoz_heldout.zip')
-
-    json.dump(ontology, open(os.path.join(self_dir, 'ontology.json'), 'w'))
-    json.dump(data, open('data.json', 'w'), indent=4)
-    ZipFile(os.path.join(self_dir, 'data.zip'), 'w', ZIP_DEFLATED).write('data.json')
-    os.remove('data.json')
+    new_data_dir = 'data'
+    os.makedirs(new_data_dir, exist_ok=True)
+    dialogues = []
+    for split in splits:
+        dialogues += dialogues_by_split[split]
+    json.dump(dialogues[:10], open('dummy_data.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
+    json.dump(ontology, open(f'{new_data_dir}/ontology.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
+    json.dump(dialogues, open(f'{new_data_dir}/dialogues.json', 'w', encoding='utf-8'), indent=2, ensure_ascii=False)
+    with ZipFile('data.zip', 'w', ZIP_DEFLATED) as zf:
+        for filename in os.listdir(new_data_dir):
+            zf.write(f'{new_data_dir}/{filename}')
+    rmtree(new_data_dir)
+    return dialogues, ontology


 if __name__ == '__main__':
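
For readers who want to consume the output, here is a hedged sketch of reading an archive laid out the way the script above writes `data.zip`, assuming the member names are `data/ontology.json` and `data/dialogues.json` (which is what `zf.write(f'{new_data_dir}/{filename}')` produces on a POSIX system). The helper `load_unified` and the tiny in-memory archive are hypothetical and exist only to keep the example self-contained.

```python
import io
import json
from zipfile import ZipFile, ZIP_DEFLATED


def load_unified(zip_source):
    """Read dialogues and ontology from an archive laid out like data.zip."""
    with ZipFile(zip_source) as zf:
        ontology = json.loads(zf.read("data/ontology.json").decode("utf-8"))
        dialogues = json.loads(zf.read("data/dialogues.json").decode("utf-8"))
    return dialogues, ontology


# Build a tiny in-memory stand-in for data.zip (made-up content).
buffer = io.BytesIO()
with ZipFile(buffer, "w", ZIP_DEFLATED) as zf:
    zf.writestr("data/ontology.json", json.dumps(
        {"domains": {"ALARM_SET": {"description": "", "slots": {}}}}))
    zf.writestr("data/dialogues.json", json.dumps(
        [{"dialogue_id": "metalwoz-train-0", "data_split": "train", "turns": []}]))
buffer.seek(0)

dialogues, ontology = load_unified(buffer)
```

With the real file, `load_unified('data.zip')` would be the equivalent call, since `ZipFile` accepts either a path or a file-like object.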