update dailydialog: add space after full-stop

Commit 4a5a0d4b authored 3 years ago by zqwerty
--- a/data/unified_datasets/dailydialog/README.md
+++ b/data/unified_datasets/dailydialog/README.md
@@ -18,6 +18,7 @@ DailyDialog is a high-quality multi-turn dialog dataset. It is intriguing in sev
  - Retain emotion annotation in the `emotion` field of each turn.
  - Use nltk to remove space before punctuation: `utt = ' '.join([detokenizer.detokenize(word_tokenize(s)) for s in sent_tokenize(utt)])`.
  - Replace `" ’ "` with `"'"`: `utt = utt.replace(' ’ ', "'")`.
+  - Add a space after a full stop that is directly followed by a word character (except before `com`).
 - **Annotations:**
  - intent, emotion
@@ -33,10 +34,10 @@ English
 | split      |   dialogues |   utterances |   avg_utt |   avg_tokens |   avg_domains | cat slot match(state)   | cat slot match(goal)   | cat slot match(dialogue act)   | non-cat slot span(dialogue act)   |
 |------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
-| train      |       11118 |        87170 |      7.84 |        11.18 |             1 | -                       | -                      | -                              | -                                 |
+| train      |       11118 |        87170 |      7.84 |        11.22 |             1 | -                       | -                      | -                              | -                                 |
-| validation |        1000 |         8069 |      8.07 |        11.14 |             1 | -                       | -                      | -                              | -                                 |
+| validation |        1000 |         8069 |      8.07 |        11.16 |             1 | -                       | -                      | -                              | -                                 |
-| test       |        1000 |         7740 |      7.74 |        11.33 |             1 | -                       | -                      | -                              | -                                 |
+| test       |        1000 |         7740 |      7.74 |        11.36 |             1 | -                       | -                      | -                              | -                                 |
-| all        |       13118 |       102979 |      7.85 |        11.19 |             1 | -                       | -                      | -                              | -                                 |
+| all        |       13118 |       102979 |      7.85 |        11.22 |             1 | -                       | -                      | -                              | -                                 |
 10 domains: ['Ordinary Life', 'School Life', 'Culture & Education', 'Attitude & Emotion', 'Relationship', 'Tourism', 'Health', 'Work', 'Politics', 'Finance']
 - **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
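
The "add space after full-stop" step noted in the README corresponds to a regular-expression substitution in `preprocess.py`. The snippet below is a minimal sketch of how that substitution behaves on a few made-up utterances; the helper name `add_space_after_full_stop` and the example strings are hypothetical and do not appear in the commit. It also prints whitespace-separated token counts before and after, which is one plausible reason `avg_tokens` rises slightly in the table above, assuming tokens are counted by splitting on whitespace (the counting code is not shown in this diff).

```python
import re

# Same pattern as the one added in preprocess.py, written as a raw string
# so that "\." and "\w" are not treated as string escape sequences.
FULL_STOP = re.compile(r'\.(?!com)(\w)')


def add_space_after_full_stop(utt: str) -> str:
    """Insert a space after a full stop directly followed by a word character.

    Hypothetical helper for illustration; the commit applies re.sub inline.
    """
    return FULL_STOP.sub(lambda m: '. ' + m.group(1), utt)


# Made-up utterances, not taken from DailyDialog.
examples = [
    "I see.Thank you.",        # missing space between two sentences
    "Visit example.com now.",  # ".com" is skipped by the negative lookahead
    "It costs 3.5 dollars.",   # caveat: decimal numbers are split as well
]

for utt in examples:
    fixed = add_space_after_full_stop(utt)
    # Whitespace token counts: an assumption about how avg_tokens is measured.
    print(f"{utt!r} -> {fixed!r} ({len(utt.split())} -> {len(fixed.split())} tokens)")
```

The third example shows a side effect worth keeping in mind: because `\w` also matches digits, a decimal such as `3.5` becomes `3. 5`. Whether that matters depends on how often such strings occur in the data.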

--- a/data/unified_datasets/dailydialog/data.zip
+++ b/data/unified_datasets/dailydialog/data.zip
--- a/data/unified_datasets/dailydialog/dummy_data.json
+++ b/data/unified_datasets/dailydialog/dummy_data.json
--- a/data/unified_datasets/dailydialog/preprocess.py
+++ b/data/unified_datasets/dailydialog/preprocess.py
@@ -7,6 +7,7 @@ from collections import Counter
 from pprint import pprint
 from nltk.tokenize import sent_tokenize, word_tokenize
 from nltk.tokenize.treebank import TreebankWordDetokenizer
+import re
 topic_map = {
    1: "Ordinary Life", 
@@ -110,8 +111,12 @@ def preprocess():
                    speaker = 'user' if len(dialogue['turns']) % 2 == 0 else 'system'
                    intent = act_map[int(act)]
                    emotion = emotion_map[int(emotion)]
+                    # re-tokenize
                    utt = ' '.join([detokenizer.detokenize(word_tokenize(s)) for s in sent_tokenize(utt)])
+                    # replace with common apostrophe
                    utt = utt.replace(' ’ ', "'")
+                    # add space after full-stop
+                    utt = re.sub(r'\.(?!com)(\w)', lambda x: '. '+x.group(1), utt)
                    dialogue['turns'].append({
                        'speaker': speaker,