Test Report

Model Name: BERTNLU-RuleDST-MLEPolicy-TemplateNLG

Dataset: multiwoz

Time: 2020-04-27 02:37:46

Overall Results

Success Rate: 48.4 %

(Precision, Recall, F1) : (0.663, 0.727, 0.660)

Average Dialog Turn (Succ): 12.455

Average Dialog Turn (All): 26.270
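
A quick consistency check on the three numbers above: the reported F1 (0.660) is lower than the harmonic mean of the reported precision and recall. One plausible reading, stated here as an assumption rather than something the report confirms, is that precision, recall and F1 are each computed per dialog and then averaged separately, in which case the averaged F1 need not equal the F1 of the averaged precision and recall. The helper name `harmonic_f1` below is illustrative and not part of any toolkit.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of a single precision/recall pair."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall averages copied from this report.
reported_precision, reported_recall, reported_f1 = 0.663, 0.727, 0.660

# F1 recomputed from the two averaged values (not from per-dialog scores).
f1_from_averages = harmonic_f1(reported_precision, reported_recall)

print(f"F1 from averaged P/R: {f1_from_averages:.3f}")
print(f"Reported F1:          {reported_f1:.3f}")
```

The recomputed value comes out near 0.694, so the gap to 0.660 is more than a rounding effect.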

Metric

Domain      Total Num  Succ Rate  Precision  Recall  F1     Dialog Loop Failed Rate  Dialog Turn (Succ)  Dialog Turn (All)
hotel       375        0.432      0.422      0.503   0.434  0.469                    8.074               19.781
restaurant  394        0.609      0.492      0.722   0.558  0.386                    10.183              20.365
attraction  289        0.875      0.770      0.916   0.812  0.107                    6.427               9.723
taxi        99         0.677      0.747      0.712   0.724  0.323                    7.313               13.192
train       327        0.795      0.867      0.868   0.862  0.202                    7.646               13.125
police      23         1.000      0.935      1.000   0.957  0.000                    2.000               2.000
hospital    30         0.933      0.933      0.933   0.933  0.067                    4.000               6.400
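
The per-domain rows can be combined into a count-weighted success rate, which helps when comparing them with the overall figure. The sketch below copies Total Num and Succ Rate from the table; the variable names are illustrative. The weighted per-domain rate is not expected to match the overall 48.4 %. A likely explanation, offered as an assumption, is that a dialog can span several domains and counts as an overall success only if every one of its goals is met.

```python
# (total dialogs, success rate) per domain, copied from the table in this report.
domain_results = {
    "hotel": (375, 0.432),
    "restaurant": (394, 0.609),
    "attraction": (289, 0.875),
    "taxi": (99, 0.677),
    "train": (327, 0.795),
    "police": (23, 1.000),
    "hospital": (30, 0.933),
}

# Sum of per-domain dialog counts; a multi-domain dialog is counted once per domain.
total_domain_dialogs = sum(n for n, _ in domain_results.values())

# Success rate weighted by the number of dialogs in each domain.
weighted_success = (
    sum(n * rate for n, rate in domain_results.values()) / total_domain_dialogs
)

print(f"Domain-level dialog count: {total_domain_dialogs}")
print(f"Count-weighted per-domain success rate: {weighted_success:.3f}")
```

With these inputs the domain counts sum to 1537 and the weighted rate is about 0.672, well above the overall success rate. hotel and restaurant pull the figure down most, since they have the largest counts and the lowest success rates.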

Domain hotel

Overall Results

Success Rate: 43.2 %

(Precision, Recall, F1) : (0.422, 0.503, 0.434)

Average Dialog Turn (Succ): 8.074

Average Dialog Turn (All): 19.781

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Domain restaurant

Overall Results

Success Rate: 60.9 %

(Precision, Recall, F1) : (0.492, 0.722, 0.558)

Average Dialog Turn (Succ): 10.183

Average Dialog Turn (All): 20.365

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Domain attraction

Overall Results

Success Rate: 87.5 %

(Precision, Recall, F1) : (0.770, 0.916, 0.812)

Average Dialog Turn (Succ): 6.427

Average Dialog Turn (All): 9.723

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Domain taxi

Overall Results

Success Rate: 67.7 %

(Precision, Recall, F1) : (0.747, 0.712, 0.724)

Average Dialog Turn (Succ): 7.313

Average Dialog Turn (All): 13.192

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Nothing

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Nothing

Domain train

Overall Results

Success Rate: 79.5 %

(Precision, Recall, F1) : (0.867, 0.868, 0.862)

Average Dialog Turn (Succ): 7.646

Average Dialog Turn (All): 13.125

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Nothing

Domain police

Overall Results

Success Rate: 100.0 %

(Precision, Recall, F1) : (0.935, 1.000, 0.957)

Average Dialog Turn (Succ): 2.000

Average Dialog Turn (All): 2.000

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Nothing

Bad Inform Dialog Act

Nothing

Request But Not Inform Dialog Act

Nothing

Inform But Not Request Dialog Act

Domain hospital

Overall Results

Success Rate: 93.3 %

(Precision, Recall, F1) : (0.933, 0.933, 0.933)

Average Dialog Turn (Succ): 4.000

Average Dialog Turn (All): 6.400

System NLU Failed Dialog Act:

User NLU Failed Dialog Act:

Dialog Loop

Bad Inform Dialog Act

Nothing

Request But Not Inform Dialog Act

Inform But Not Request Dialog Act

Nothing