Commit 4a5a0d4b authored by zqwerty

update dailydialog: add space after full-stop

parent 5ead12a1
@@ -18,6 +18,7 @@ DailyDialog is a high-quality multi-turn dialog dataset. It is intriguing in sev
 - Retain emotion annotation in the `emotion` field of each turn.
 - Use nltk to remove space before punctuation: `utt = ' '.join([detokenizer.detokenize(word_tokenize(s)) for s in sent_tokenize(utt)])`.
 - Replace `" ’ "` with `"'"`: `utt = utt.replace(' ’ ', "'")`.
+- Add space after full-stop
 - **Annotations:**
 - intent, emotion
@@ -33,10 +34,10 @@ English
 | split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
 |------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
-| train | 11118 | 87170 | 7.84 | 11.18 | 1 | - | - | - | - |
-| validation | 1000 | 8069 | 8.07 | 11.14 | 1 | - | - | - | - |
-| test | 1000 | 7740 | 7.74 | 11.33 | 1 | - | - | - | - |
-| all | 13118 | 102979 | 7.85 | 11.19 | 1 | - | - | - | - |
+| train | 11118 | 87170 | 7.84 | 11.22 | 1 | - | - | - | - |
+| validation | 1000 | 8069 | 8.07 | 11.16 | 1 | - | - | - | - |
+| test | 1000 | 7740 | 7.74 | 11.36 | 1 | - | - | - | - |
+| all | 13118 | 102979 | 7.85 | 11.22 | 1 | - | - | - | - |
 10 domains: ['Ordinary Life', 'School Life', 'Culture & Education', 'Attitude & Emotion', 'Relationship', 'Tourism', 'Health', 'Work', 'Politics', 'Finance']
 - **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
......
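The two string-level steps from the README list (apostrophe replacement and the newly added full-stop spacing) can be sketched without nltk. This is a minimal illustration, not the repository's script: the function name `normalize` and the sample utterance are assumptions, the nltk re-tokenization step is left out, and the pattern is the one from this commit written as a raw string.

```python
import re


def normalize(utt: str) -> str:
    """Apply the apostrophe and full-stop steps described in the README (sketch)."""
    # Replace the spaced typographic apostrophe " ’ " with a plain "'".
    utt = utt.replace(' ’ ', "'")
    # Insert a space after a full stop that is directly followed by a word
    # character, unless the following characters are "com".
    utt = re.sub(r'\.(?!com)(\w)', lambda m: '. ' + m.group(1), utt)
    return utt


if __name__ == "__main__":
    # Hypothetical utterance, not taken from the dataset.
    print(normalize("I don ’ t know.Do you?"))
```

Separating a word that was glued to a preceding full stop produces one more whitespace-delimited token, which would be consistent with the small rise in `avg_tokens` in the table, assuming tokens are counted in a comparable way (the statistics script is not part of this diff).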
@@ -7,6 +7,7 @@ from collections import Counter
 from pprint import pprint
 from nltk.tokenize import sent_tokenize, word_tokenize
 from nltk.tokenize.treebank import TreebankWordDetokenizer
+import re
 topic_map = {
     1: "Ordinary Life",
@@ -110,8 +111,12 @@ def preprocess():
 speaker = 'user' if len(dialogue['turns']) % 2 == 0 else 'system'
 intent = act_map[int(act)]
 emotion = emotion_map[int(emotion)]
+# re-tokenize
 utt = ' '.join([detokenizer.detokenize(word_tokenize(s)) for s in sent_tokenize(utt)])
+# replace with common apostrophe
 utt = utt.replace(' ’ ', "'")
+# add space after full-stop
+utt = re.sub('\.(?!com)(\w)', lambda x: '. '+x.group(1), utt)
 dialogue['turns'].append({
     'speaker': speaker,
......
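To see what the added `re.sub` call does on a few inputs, the sketch below applies the same pattern (written as a raw string to avoid invalid-escape warnings) to some hypothetical utterances. The helper name `add_space_after_full_stop` and the example strings are assumptions made for illustration; only the pattern itself comes from the commit.

```python
import re

# Pattern from the commit: a full stop, not followed by "com",
# directly followed by one word character (letter, digit, or underscore).
FULL_STOP = re.compile(r'\.(?!com)(\w)')


def add_space_after_full_stop(utt: str) -> str:
    """Insert a space between a full stop and the word character after it."""
    return FULL_STOP.sub(lambda m: '. ' + m.group(1), utt)


if __name__ == "__main__":
    # Hypothetical inputs chosen to exercise different cases.
    examples = [
        "I see.Thanks",             # sentence boundary without a space
        "visit example.com today",  # ".com" is excluded by the lookahead
        "It costs 3.5 dollars",     # digits count as word characters
        "Wait...what",              # only the last dot precedes a word character
    ]
    for utt in examples:
        fixed = add_space_after_full_stop(utt)
        print(f"{utt!r} -> {fixed!r} "
              f"({len(utt.split())} -> {len(fixed.split())} whitespace tokens)")
```

Because `\w` also matches digits, a decimal such as `3.5` is split into `3. 5` by this pattern, and domains other than `.com` (for example `.org`) are split as well. Whether that matters depends on how often such strings occur in the data, which this diff does not show.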