How to calculate BLEU Score in Python?

Published on August 3, 2022

By Jayant Verma

The BLEU (Bilingual Evaluation Understudy) score is a metric that measures the quality of machine translation output. Although it was originally designed for translation models, it is now used for other natural language processing tasks as well.

The BLEU score compares a candidate sentence against one or more reference sentences and tells how well the candidate matches the references. The score ranges from 0 to 1.

A BLEU score of 1 means that the candidate sentence perfectly matches one of the reference sentences.
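For reference, the score is usually written as below. This is the standard textbook definition rather than anything specific to nltk, and the symbols are notation introduced here for illustration:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the clipped n-gram precision of the candidate, w_n are the weights for each n-gram order (they sum to 1), c is the candidate length in tokens, and r is the length of the reference closest in length to the candidate. BP is the brevity penalty, which lowers the score of candidates that are shorter than the references.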

BLEU is also a common evaluation metric for image captioning models.

In this tutorial, we will use the sentence_bleu() function from the nltk library. Let’s get started.

Calculating the BLEU score in Python

To calculate the BLEU score, we need to provide the reference and candidate sentences in the form of tokens.

We will learn how to do that and compute the score in this section. Let’s start with importing the necessary modules.

from nltk.translate.bleu_score import sentence_bleu

Now we can input the reference sentences in the form of a list. We also need to create tokens out of sentences before passing them to the sentence_bleu() function.

1. Input and Split the sentences

The sentences in our reference list are:

    'this is a dog'
    'it is dog'
    'dog it is'
    'a dog, it is'

We can split them into tokens using the split() method.

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
print(reference)

Output :

[['this', 'is', 'a', 'dog'], ['it', 'is', 'dog'], ['dog', 'it', 'is'], ['a', 'dog,', 'it', 'is']]

This is what the sentences look like in the form of tokens. Now we can call the sentence_bleu() function to calculate the score.
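One detail in the token lists above: split() only breaks on whitespace, so the comma stays attached in the token 'dog,' and that token will not match a plain 'dog'. A minimal sketch of a punctuation-aware alternative using the standard re module is shown below. The helper name simple_tokenize is hypothetical, and the regular expression is just one reasonable choice, not what nltk does internally.

```python
import re

def simple_tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character,
    # so "dog," becomes two tokens: "dog" and ",".
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

print(simple_tokenize('a dog, it is'))
# ['a', 'dog', ',', 'it', 'is']
```

The rest of this tutorial keeps using split(), so the outputs shown match the original token lists.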

2. Calculate the BLEU score in Python

To calculate the score use the following lines of code:

candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output :

BLEU score -> 1.0

We get a perfect score of 1 as the candidate sentence belongs to the reference set. Let’s try another one.

candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output :

BLEU score -> 0.8408964152537145

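This candidate is not in our reference set, but most of its n-grams appear in the references (it differs from 'this is a dog' only in the first word), so the score is high without being perfect: about 0.84.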
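The candidate 'it is a dog' shares no complete 4-gram with any reference, which is a known weak spot of sentence-level BLEU. Depending on the installed NLTK version, the unsmoothed call may warn about zero 4-gram overlaps and return a value close to 0 instead of the score shown above. One common workaround is to pass a smoothing function. The sketch below uses method1 from nltk's SmoothingFunction class; that choice is an assumption made for illustration, and other methods will give different scores.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split(),
]
candidate = 'it is a dog'.split()

# method1 replaces a zero n-gram match count with a small epsilon so that
# the geometric mean does not collapse; picking method1 is an assumption.
smoother = SmoothingFunction().method1
smoothed = sentence_bleu(reference, candidate, smoothing_function=smoother)
print('Smoothed BLEU score -> {}'.format(smoothed))
```

The smoothed value will be lower than 0.84 because the missing 4-gram now counts against the candidate instead of being ignored.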

3. Complete Code for Implementing BLEU Score in Python

Here’s the complete code from this section.

from nltk.translate.bleu_score import sentence_bleu
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

4. Calculating the n-gram score

When comparing sentences, you can choose how many consecutive words are matched at a time. For example, words can be matched one at a time (1-gram), in pairs (2-gram), or in triplets (3-gram).
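To make the idea of an n-gram concrete, here is a small sketch that lists the n-grams of a tokenized sentence. The helper name ngrams_of is hypothetical and is not part of nltk.

```python
def ngrams_of(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    # Slide a window of width n over the tokens, one position at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = 'it is a dog'.split()
for n in (1, 2, 3):
    print(n, ngrams_of(candidate, n))
# 1 [('it',), ('is',), ('a',), ('dog',)]
# 2 [('it', 'is'), ('is', 'a'), ('a', 'dog')]
# 3 [('it', 'is', 'a'), ('is', 'a', 'dog')]
```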

In this section we will learn how to calculate these n-gram scores.

The sentence_bleu() function accepts a weights argument whose entries correspond to the individual n-gram orders.

For example, to calculate gram scores individually you can use the following weights.

Individual 1-gram: (1, 0, 0, 0)
Individual 2-gram: (0, 1, 0, 0)
Individual 3-gram: (0, 0, 1, 0)
Individual 4-gram: (0, 0, 0, 1)

Python code for the same is given below:

from nltk.translate.bleu_score import sentence_bleu
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
candidate = 'it is a dog'.split()

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Output :

Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 0.500000
Individual 4-gram: 1.000000
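To see where these numbers come from, here is a pure-Python sketch of the clipped (modified) n-gram precision that BLEU is built on. The helper names are hypothetical, and this illustrates the idea rather than reproducing nltk's implementation.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference that contains it most often."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = max(1, sum(cand.values()))  # guard against an empty candidate
    return Fraction(clipped, total)

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split(),
]
candidate = 'it is a dog'.split()

for n in range(1, 5):
    print('{}-gram precision: {}'.format(n, modified_precision(reference, candidate, n)))
# 1-gram precision: 1
# 2-gram precision: 1
# 3-gram precision: 1/2
# 4-gram precision: 0
```

Note that this sketch reports the 4-gram precision as 0, not 1, because 'it is a dog' does not occur as a whole in any reference. The 1.000000 printed above likely reflects how the NLTK version used for this tutorial handled a zero match count, so your own output for that line may differ.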

By default, the sentence_bleu() function calculates the cumulative 4-gram BLEU score, also called BLEU-4. The weights for BLEU-4 are as follows:

(0.25, 0.25, 0.25, 0.25)

Let’s see the BLEU-4 code:

score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)

Output :

0.8408964152537145

That’s exactly the score we got earlier without passing any weights.
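Lower-order cumulative scores follow the same pattern with different weights. As a sketch in plain Python, the weighted geometric mean and the brevity penalty can be combined as below. The helper names are hypothetical, and the 1-gram to 3-gram precisions (1, 1 and 0.5) are taken as given from the individual scores printed earlier.

```python
import math

def brevity_penalty(candidate_len, reference_lens):
    """Penalise candidates that are shorter than the closest reference."""
    if candidate_len == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    closest = min(reference_lens, key=lambda r: (abs(r - candidate_len), r))
    if candidate_len > closest:
        return 1.0
    return math.exp(1 - closest / candidate_len)

def combine(precisions, weights, bp=1.0):
    """Weighted geometric mean of n-gram precisions, scaled by the penalty."""
    if any(p == 0 for p, w in zip(precisions, weights) if w > 0):
        return 0.0  # log(0) is undefined; the geometric mean collapses to 0
    log_sum = sum(w * math.log(p) for p, w in zip(precisions, weights) if w > 0)
    return bp * math.exp(log_sum)

precisions = [1.0, 1.0, 0.5]            # 1-, 2- and 3-gram precisions
bp = brevity_penalty(4, [4, 3, 3, 4])   # candidate has 4 tokens

print('BLEU-1: %f' % combine(precisions[:1], (1,), bp))
print('BLEU-2: %f' % combine(precisions[:2], (0.5, 0.5), bp))
print('BLEU-3: %f' % combine(precisions, (1/3, 1/3, 1/3), bp))
# BLEU-1: 1.000000
# BLEU-2: 1.000000
# BLEU-3: 0.793701
```

The same weight tuples, padded with zeros to the order you need, can be passed to sentence_bleu() through its weights argument.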

Conclusion

This tutorial was about calculating the BLEU score in Python. We learned what it is and how to calculate individual and cumulative n-gram BLEU scores. Hope you had fun learning with us!

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
