Better report for Rasa chatbot model analysis
Rasa is a very powerful AI framework for building contextual assistants, but you have to know how to set up your NLU and improve your intent recognition.
You can use the rasa test command to generate a report on how your model performs. In the results folder, intent_errors.json shows where intents are getting mixed up. What it doesn't show are intents that are nearly being confused, where a small change to your NLU examples could make them start failing.
Every one of your examples influences, in one way or another, the neural network that classifies what the user typed. Imagine a huge spider web connecting all the examples to the intents: if you "move" one example, every intent classification moves too, although some only imperceptibly.
I'll show you how to analyze intent classification and work out where to improve your model, taking you step by step through writing a Python script that helps reveal what's happening.
So how do I improve my model?
First of all, let's remove randomness from your training. If you train a model and test it, then train it again and test it again, you will get slightly different results. Why? When training starts, the model is initialized with random values, which the training process then adjusts. Start from a different set of initial values and you end up with a slightly different model. In my experience the difference is only a couple of percent in accuracy, but let's eliminate this variation anyway.
So, to make sure that training the same model twice gives the same results, open your config file and take a look at your pipeline. The Rasa documentation is very good and lists all the parameters of every pipeline component.
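The effect of a fixed seed is easy to see outside Rasa. In this standalone sketch (not part of the report script), init_weights is a hypothetical stand-in for a classifier's random initialization, built on Python's standard random module: with the same seed, two runs get identical starting weights; without one, they almost certainly don't.

```python
import random


def init_weights(n, seed=None):
    """Return n starting weights drawn uniformly from [-1, 1].

    A hypothetical stand-in for a classifier's random initialization.
    With seed=None the generator is seeded from the OS, so every call differs.
    """
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Same seed: identical starting point for training
print(init_weights(3, seed=42) == init_weights(3, seed=42))
# No seed: a different starting point on (almost) every run
print(init_weights(3) == init_weights(3))
```

A seed only pins down the starting point; it is what makes two training runs on the same data comparable.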
For example, my pipeline uses DIETClassifier. If you look through all the parameters for this component (you need to click on "show"), you'll find a random_seed parameter, which is None by default.
This looks promising. To use it, add
random_seed: 42
right after the name: DIETClassifier line in your config, and make sure the indentation is correct. It should look something like this:
...
  - name: DIETClassifier
    random_seed: 42
    epochs: 150
    entity_recognition: False
...
Why 42? Because it's the answer to everything... Seriously, though, you can use any whole number. We just want to get consistent results.
Check all the other components and make sure that any random seeds are set to a specific value.
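If you'd rather not eyeball every component, here's a standalone sketch (not part of the report script) that scans a parsed config.yml for components left without a seed. It assumes you've already loaded the YAML into a dict with whatever YAML library you have. The SEEDABLE set is my assumption about which components take random_seed (DIETClassifier, ResponseSelector and TEDPolicy); check the documentation for your Rasa version.

```python
# Assumption: components known to accept a random_seed parameter.
# Extend this set to match the components documented for your Rasa version.
SEEDABLE = {"DIETClassifier", "ResponseSelector", "TEDPolicy"}


def unseeded_components(config):
    """Return names of seedable components in a parsed config.yml that lack random_seed."""
    missing = []
    for section in ("pipeline", "policies"):
        for component in config.get(section) or []:
            if not isinstance(component, dict):  # skip template-name pipelines
                continue
            name = component.get("name")
            if name in SEEDABLE and component.get("random_seed") is None:
                missing.append(name)
    return missing


# Example: a config already loaded into a dict
config = {
    "language": "en",
    "pipeline": [
        {"name": "WhitespaceTokenizer"},
        {"name": "DIETClassifier", "random_seed": 42, "epochs": 150},
        {"name": "ResponseSelector", "epochs": 100},
    ],
    "policies": [{"name": "TEDPolicy", "max_history": 5}],
}
print(unseeded_components(config))  # ['ResponseSelector', 'TEDPolicy']
```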
Getting ready
I'm assuming a few things here:
- You have a reasonable understanding of programming and Python
- You are using Anaconda or some other Python virtual environment
- You have Rasa installed in that environment
- Your model contains no syntax errors (i.e., it can be trained)
- Your NLU training data is in Markdown (.md) files, the format used by Rasa 1.x
Activate your Python environment (with conda activate or whatever you use). We'll use two libraries to make the script easier to write:
- requests
- xlsxwriter
So you'll need to run
pip install requests xlsxwriter
Writing the script
We'll divide the script into two parts:
- Load all the intents and examples into a list (called messages), calling the parse API endpoint for each example and saving the results along with the intent and example text.
- Iterate through all the items in messages and write out a formatted .xlsx file.
To build the script, just add all the script code that follows to a file called reports.py.
First off, our imports:
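import json
import os

import requests
import xlsxwriter
# Rasa 1.x import path; in Rasa 2.x, load_data lives in rasa.shared.nlu.training_data.loading
from rasa.nlu.training_data import load_data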
We'll initialize the messages list and set up some variables. Adjust these to your environment. I'll assume that the script is in your Rasa root directory:
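# Rasa HTTP API endpoint (rasa run --enable-api listens on port 5005 by default)
parse_url = 'http://localhost:5005/model/parse'
# Folder holding the Markdown NLU training files
nlu_directory = 'data/nlu/'
# Minimum confidence for a classification to count as OK
threshold = 0.6
# One dict per training example, filled in by the loop below
messages = []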
Now let's read in our NLU data, call /model/parse for each example and add the results to our messages list. The load_data function from rasa.nlu.training_data loads the training examples:
for filename in sorted(os.listdir(nlu_directory)):
    # Only Markdown NLU files (the Rasa 1.x format) are read
    if not filename.endswith('.md'):
        continue
    examples = load_data(os.path.join(nlu_directory, filename)).sorted_intent_examples()
    for example in examples:
        data = example.as_dict()
        text = data['text'].strip()
        print(data['intent'], text)
        # Let's hit the parser and get the classifications.
        # json= makes requests serialize and escape the text and set the content-type header.
        r = requests.post(parse_url, json={'text': text})
        if r.status_code != 200:
            print('Parse request failed with status', r.status_code, 'for:', text)
            continue
        results = r.json()
        # Keep the top three intents, padding with blanks if the model knows fewer
        ranking = results.get('intent_ranking', [])[:3]
        ranking += [{'name': '', 'confidence': 0.0}] * (3 - len(ranking))
        message = {'expected_intent': data['intent'], 'text': data['text']}
        for i, entry in enumerate(ranking, start=1):
            message['intent%d' % i] = entry['name']
            message['accuracy%d' % i] = float(entry['confidence'])
        messages.append(message)
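Building the JSON payload by hand, escaping only double quotes, breaks as soon as an example contains a backslash or a line break. Letting json.dumps do the escaping (essentially what requests does when you pass json=) avoids this. Here's a standalone sketch, separate from reports.py, comparing the two with only the standard library; the function names are hypothetical:

```python
import json


def concatenated_payload(text):
    # Hand-built JSON that only escapes double quotes
    return '{ "text": "' + text.replace('"', '\\"').strip() + '"}'


def json_payload(text):
    # json.dumps also escapes backslashes, line breaks and other control characters
    return json.dumps({"text": text.strip()})


def is_valid_json(payload):
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


sample = 'first line\nsecond "line"'
print(is_valid_json(concatenated_payload(sample)))  # False: raw line break inside a string
print(is_valid_json(json_payload(sample)))  # True
print(json.loads(json_payload(sample))["text"] == sample)  # True
```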
Now let's create a spreadsheet and set up the formats we'll be using:
workbook = xlsxwriter.Workbook('Results.xlsx')
# Cell formats; num_format 10 is Excel's built-in '0.00%' percentage format
format_red = workbook.add_format({'font_color': 'white', 'bg_color': 'red'})
format_green = workbook.add_format({'bg_color': 'green'})
format_bold = workbook.add_format({'bold': True})
format_bold_percent = workbook.add_format({'bold': True, 'num_format': 10})
format_percent = workbook.add_format({'num_format': 10})
format_percent_red = workbook.add_format({'font_color': 'white', 'bg_color': 'red', 'num_format': 10})
format_percent_yellow = workbook.add_format({'bg_color': 'yellow', 'num_format': 10})
format_percent_green = workbook.add_format({'bg_color': 'green','num_format': 10})
format_center_green = workbook.add_format({'bg_color': 'green', 'align': 'center'})
format_center_red = workbook.add_format({'font_color': 'white', 'bg_color': 'red', 'align': 'center'})
Add a worksheet and set up the column widths and headers:
worksheet = workbook.add_worksheet('Test results')
# xlsxwriter rows are zero-based: row 0 holds the headers, data starts at row 1
row = 1
worksheet.set_column('A:A', 30)
worksheet.set_column('B:B', 40)
worksheet.set_column('C:C', 5)
worksheet.set_column('D:D', 25)
worksheet.set_column('E:E', 10)
worksheet.set_column('F:F', 10)
worksheet.set_column('G:G', 5)
worksheet.set_column('H:H', 25)
worksheet.set_column('I:I', 10)
worksheet.set_column('J:J', 25)
worksheet.set_column('K:K', 10)
worksheet.set_column('L:L', 25)
worksheet.set_column('M:M', 10)
worksheet.write('A1', 'Expected intent', format_bold)
worksheet.write('B1', 'Text', format_bold)
worksheet.write('D1', 'Classified Intent', format_bold)
worksheet.write('E1', 'Accuracy', format_bold)
worksheet.write('H1', 'Ranking', format_bold)
Now let's add all the results:
total_items = 0
correct_items = 0
for item in messages:
    worksheet.write(row, 0, item['expected_intent'])
    worksheet.write(row, 1, item['text'])
    # Green if the top-ranked intent is the expected one, red otherwise
    if item['expected_intent'] == item['intent1']:
        worksheet.write(row, 3, item['intent1'], format_green)
        intent_ok = True
    else:
        worksheet.write(row, 3, item['intent1'], format_red)
        intent_ok = False
    # Confidence bands: green well above the threshold, yellow just above it, red below
    if item['accuracy1'] > threshold + .1:
        worksheet.write(row, 4, item['accuracy1'], format_percent_green)
        accuracy_ok = True
    elif item['accuracy1'] >= threshold:
        worksheet.write(row, 4, item['accuracy1'], format_percent_yellow)
        accuracy_ok = True
    else:
        worksheet.write(row, 4, item['accuracy1'], format_percent_red)
        accuracy_ok = False
    # An example is OK only if both the intent and the confidence pass
    if intent_ok and accuracy_ok:
        worksheet.write(row, 5, 'OK', format_center_green)
        correct_items += 1
    else:
        worksheet.write(row, 5, 'BAD', format_center_red)
    total_items += 1
    # The top three intents from the ranking
    worksheet.write(row, 7, item['intent1'])
    worksheet.write(row, 8, item['accuracy1'], format_percent)
    worksheet.write(row, 9, item['intent2'])
    worksheet.write(row, 10, item['accuracy2'], format_percent)
    worksheet.write(row, 11, item['intent3'])
    worksheet.write(row, 12, item['accuracy3'], format_percent)
    row += 1
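The color coding in this loop boils down to two rules. The intent cell is green only when the top-ranked intent matches the expected one, and the confidence cell is green above threshold + 0.1, yellow from threshold up to that, and red below threshold. An example counts as OK only when both checks pass. This standalone sketch (not part of reports.py) pulls those rules into a pure function so you can check them without a Rasa server. The name grade and the margin parameter are hypothetical, with margin=0.1 matching the hardcoded value.

```python
def grade(expected_intent, predicted_intent, confidence, threshold=0.6, margin=0.1):
    """Return (color, ok) for one example, following the report's rules.

    color is the confidence cell color: green above threshold + margin,
    yellow from threshold up to threshold + margin, red below threshold.
    ok is True only when the top intent is correct and confidence >= threshold.
    """
    if confidence > threshold + margin:
        color = "green"
    elif confidence >= threshold:
        color = "yellow"
    else:
        color = "red"
    ok = expected_intent == predicted_intent and confidence >= threshold
    return color, ok


print(grade("greet", "greet", 0.9))
print(grade("greet", "goodbye", 0.95))
```

Note that a wrong intent with high confidence still gets a green confidence cell; it's the red intent cell and the BAD verdict that flag it.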
And finally add some overall stats and close the file:
if total_items > 0:
    worksheet.write(row + 1, 5, correct_items, format_bold)
    worksheet.write(row + 1, 6, 'correct examples', format_bold)
    worksheet.write(row + 2, 5, total_items, format_bold)
    worksheet.write(row + 2, 6, 'total examples', format_bold)
    worksheet.write(row + 3, 5, correct_items / total_items, format_bold_percent)
    worksheet.write(row + 3, 6, '%', format_bold)
workbook.close()
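The Ranking columns are where near-confusions show up. An example that is classified correctly, but whose second intent is close behind, is one small NLU change away from failing. Here's a minimal standalone sketch, separate from reports.py, that flags such examples in a list shaped like the script's messages list. near_misses and the 0.2 cutoff are arbitrary, hypothetical choices, and the sample items are made up.

```python
def near_misses(messages, max_gap=0.2):
    """Return (gap, intent1, intent2, text) for examples whose top two
    confidences are less than max_gap apart, smallest gap first.

    Each item needs the keys the report uses: text, intent1/accuracy1
    and intent2/accuracy2.
    """
    flagged = []
    for item in messages:
        gap = item["accuracy1"] - item["accuracy2"]
        if gap < max_gap:
            flagged.append((gap, item["intent1"], item["intent2"], item["text"]))
    return sorted(flagged, key=lambda entry: entry[0])


# Two made-up items in the same shape as the script's messages list
sample = [
    {"expected_intent": "greet", "text": "hi there",
     "intent1": "greet", "accuracy1": 0.95, "intent2": "goodbye", "accuracy2": 0.03},
    {"expected_intent": "affirm", "text": "sure, why not",
     "intent1": "affirm", "accuracy1": 0.55, "intent2": "mood_great", "accuracy2": 0.40},
]

for gap, first, second, text in near_misses(sample):
    print(f"{gap:.2f}  {first} vs {second}: {text}")
```

In reports.py you could call near_misses(messages) once the messages list is filled and print the results, or write them to a second worksheet.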
How to run it
All you need to do now is put the script in your chatbot directory and run it with
python reports.py
You do need to have a Rasa instance running, with the trained model loaded. I normally open another shell prompt and run rasa run --enable-api.
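The script fails on its first request if the server isn't up, so a quick check beforehand can save some head-scratching. This is a standalone sketch using only the standard library. It assumes the server answers a plain GET on its root URL (port 5005 by default), and rasa_server_is_up is a hypothetical helper:

```python
import urllib.error
import urllib.request


def rasa_server_is_up(base_url="http://localhost:5005", timeout=2.0):
    """Return True if an HTTP server answers a GET on base_url with status 200.

    Assumes the Rasa server started with `rasa run --enable-api` responds on its root URL.
    """
    try:
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout or HTTP error: treat as "not up"
        return False


if not rasa_server_is_up():
    print("No Rasa server found on port 5005; start one with: rasa run --enable-api")
```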
Here's a sample generated from the default Rasa dialogues:
I hope that this spreadsheet can help you see where you're getting collisions, and can help you to improve your models. If you have any problems running the script, leave a comment!
Thanks for the wonderful post
You are the best, thanks a lot.
I have a situation here. The Excel sheet is not appearing in the folder. Is it necessary to have a Docker port open?
Traceback (most recent call last):
  File "reports.py", line 13, in <module>
    intents = load_data(nlu_directory + filename).sorted_intent_examples()
  File "D:\Programas\Anaconda\envs\EcoBot\lib\site-packages\rasa\shared\nlu\training_data\loading.py", line 60, in load_data
    data_sets = [_load(f, language) for f in files]
  File "D:\Programas\Anaconda\envs\EcoBot\lib\site-packages\rasa\shared\nlu\training_data\loading.py", line 60, in <listcomp>
    data_sets = [_load(f, language) for f in files]
  File "D:\Programas\Anaconda\envs\EcoBot\lib\site-packages\rasa\shared\nlu\training_data\loading.py", line 107, in _load
    raise ValueError(f"Unknown data format for file '{filename}'.")
ValueError: Unknown data format for file 'data/rules.yml'.
Please show yourself magic man
Just fixed it, gg. For anyone wondering: the issue is with rules.yml and stories.yml, because this script was made for older Rasa versions. The solution? Move those files to another folder while running the script. Also, don't forget to replace ".md" with ".yml" in the script.