How to Use Hugging Face's New Evaluate Library
By now I’m sure we’ve all heard of Hugging Face, the company leading the way in open-source AI models with its Transformers library, which has over 64k stars on GitHub. Just a few days ago, Hugging Face released yet another Python library called Evaluate. This package makes it easy to evaluate and compare AI models. At release it included 44 metrics, such as accuracy, precision, and recall, which are the three metrics we'll cover in this tutorial. Anyone can contribute new metrics, so I suspect there will soon be far more.
There are many other metrics that I suggest you explore. For example, they included a metric called perplexity, which measures how likely a sequence is under a given model. They also included a metric called SQuAD, which is used to evaluate question answering models. The three metrics we'll cover (accuracy, precision, and recall) are fundamental and commonly used for many AI tasks, such as text classification. By reading this article, you'll gain a basic understanding of how to use Evaluate, which you can then apply to quickly learn other metrics.
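If you'd like to see everything that's available, the library includes a helper for listing its metrics. Below is a minimal sketch that assumes evaluate.list_evaluation_modules behaves as described in the library's documentation.
import evaluate  # installed in the Install section below
# List the names of the evaluation modules that ship with the library.
# The include_community flag controls whether community-contributed modules are listed.
print(evaluate.list_evaluation_modules(module_type="metric"))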
Check out the code for this tutorial in Google Colab. Also check out this in-depth tutorial that covers how to apply Hugging Face's Evaluate library to evaluate a text classification model.
Install
Let’s first install the Python package from PyPI.
pip install evaluate
Import
import evaluate
Metrics
We need to use a function called "load" to load each of the metrics. This function returns an EvaluationModule object.
Accuracy
Documentation for the accuracy metric
accuracy_metric = evaluate.load("accuracy")
Precision
Documentation for the precision metric
precision_metric = evaluate.load("precision")
Recall
Documentation for the recall metric
recall_metric = evaluate.load("recall")
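As a side note, several metrics can also be bundled together and computed in one call. The sketch below is based on evaluate.combine as described in the library's documentation, so treat the exact behaviour as an assumption.
# Load accuracy, precision, and recall as a single combined module.
clf_metrics = evaluate.combine(["accuracy", "precision", "recall"])
# Its compute method returns all three scores in one dictionary.
print(clf_metrics.compute(predictions=[0, 1, 1], references=[0, 1, 0]))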
Display
One interesting feature of EvaluationModule objects is that printing them displays their documentation. Below is an example of printing the accuracy metric.
print(accuracy_metric)
Output:
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (list of int): Predicted labels.
    references (list of int): Ground truth labels.
    normalize (boolean): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (list of float): Sample weights. Defaults to None.
Returns:
    accuracy (float or int): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if normalize is set to True. A higher score means higher accuracy.
...
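The same information is also available programmatically. For example, the input schema shown above can be read from the module's attributes; the exact attribute names below are an assumption based on the library's documentation.
# Expected input columns and their types.
print(accuracy_metric.features)
# A longer prose description of the metric.
print(accuracy_metric.description)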
Data
Let's create some data to use with the metrics. The predictions variable stores sample outputs from a model, and the references variable contains the ground-truth labels (the correct answers). We'll compare the predictions variable against the references variable in the next step.
predictions = [0, 1, 1, 1, 1, 1]
references = [0, 1, 1, 0, 1, 1]
Results
For each of the metrics, we can use the metric's "compute" method to produce a result.
Accuracy
accuracy_result = accuracy_metric.compute(references=references, predictions=predictions)
print(accuracy_result)
Output: {'accuracy': 0.8333333333333334}
The output is a dictionary with a single key called accuracy. We can isolate this value as shown below.
print(accuracy_result['accuracy'])
Output: 0.8333333333333334
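Five of the six predictions match their references, so the score works out to 5/6 ≈ 0.833. As a quick sanity check, here's the same calculation done by hand with the lists defined above.
# Accuracy: correctly classified samples divided by the total number of samples.
correct = sum(p == r for p, r in zip(predictions, references))
print(correct / len(references))  # 5 / 6 = 0.8333...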
Precision
precision_result = precision_metric.compute(references=references, predictions=predictions)
print(precision_result)
print(precision_result["precision"])
Output:
{'precision': 0.8}
0.8
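By default, precision is computed for the positive class (label 1): the model predicted 1 five times and four of those predictions are correct, giving 4/5 = 0.8. The metric also supports multi-class labels through an averaging strategy; the snippet below is a sketch that assumes the average keyword behaves like it does in the underlying scikit-learn precision_score, and the example labels are made up.
# Hypothetical multi-class example with three labels.
multi_predictions = [0, 2, 1, 2, 1, 0]
multi_references = [0, 1, 1, 2, 1, 0]
macro_precision = precision_metric.compute(
    predictions=multi_predictions,
    references=multi_references,
    average="macro",  # unweighted mean of the per-class precision scores
)
print(macro_precision)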
Recall
recall_result = recall_metric.compute(references=references, predictions=predictions)
print(recall_result)
print(recall_result['recall'])
Output:
{'recall': 1.0}
1.0
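The references contain four positive labels and the model predicted 1 for every one of them, so recall is 4/4 = 1.0. Here's the same calculation done by hand with the lists defined above.
# Recall: true positives divided by all actual positives (label 1).
true_positives = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
actual_positives = sum(r == 1 for r in references)
print(true_positives / actual_positives)  # 4 / 4 = 1.0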
Conclusion
We just covered how to use Hugging Face's Evaluate library to compute accuracy, precision, and recall. I suggest you now follow along with this tutorial to apply what you've learned by evaluating a text classification model. Be sure to subscribe to Vennify's YouTube channel and sign up for our email list.
Once again, here's the code for this tutorial in Google Colab.
Book a Call
We may be able to help you or your company with your next NLP project. Feel free to book a free 15-minute call with us.