# A quick tour

🤗 Evaluate provides access to a wide range of evaluation tools. It covers modalities such as text, computer vision, and audio, and includes tools to evaluate both models and datasets. These tools are split into three categories.

## Types of evaluations

Different aspects of a typical machine learning pipeline can be evaluated, and for each aspect 🤗 Evaluate provides a tool:

- **Metric**: A metric is used to evaluate a model's performance and usually involves the model's predictions as well as some ground truth labels. You can find all integrated metrics at [evaluate-metric](https://huggingface.co/evaluate-metric).
- **Comparison**: A comparison is used to compare two models. This can for example be done by comparing their predictions to ground truth labels and computing their agreement. You can find all integrated comparisons at [evaluate-comparison](https://huggingface.co/evaluate-comparison).
- **Measurement**: The dataset is as important as the model trained on it. With measurements one can investigate a dataset's properties. You can find all integrated measurements at [evaluate-measurement](https://huggingface.co/evaluate-measurement).

Each of these evaluation modules lives on the Hugging Face Hub as a Space. Each comes with an interactive widget and a documentation card describing its use and limitations. See, for example, [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy).
Each metric, comparison, and measurement is a separate Python module, but there is a single entry point for using any of them: [`evaluate.load`]!

## Load

Any metric, comparison, or measurement is loaded with the `evaluate.load` function:

```py
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")
```

If you want to make sure you are loading the right type of evaluation (especially if there are name clashes) you can explicitly pass the type:

```py
>>> word_length = evaluate.load("word_length", module_type="measurement")
```

### Community modules

Besides the modules implemented in 🤗 Evaluate you can also load any community module by specifying the repository ID of the metric implementation:

```py
>>> element_count = evaluate.load("lvwerra/element_count", module_type="measurement")
```

See the [Creating and Sharing Guide](/docs/evaluate/main/en/creating_and_sharing) for information about uploading custom metrics.

### List available modules

With [`list_evaluation_modules`] you can check which modules are available on the Hub. You can filter by module type and skip community metrics if you want, and you can also request additional details such as likes:

```python
>>> evaluate.list_evaluation_modules(
...   module_type="comparison",
...   include_community=False,
...   with_details=True)

[{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
 {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]
```

## Module attributes

All evaluation modules come with a range of useful attributes that help you use the module. These attributes are stored in an [`EvaluationModuleInfo`] object.

|Attribute|Description|
|---|---|
|`description`|A short description of the evaluation module.|
|`citation`|A BibTeX string for citation when available.|
|`features`|A `Features` object defining the input format.|
|`inputs_description`|This is equivalent to the module's docstring.|
|`homepage`|The homepage of the module.|
|`license`|The license of the module.|
|`codebase_urls`|Link to the code behind the module.|
|`reference_urls`|Additional reference URLs.|

Let's have a look at a few examples. First, let's look at the `description` attribute of the accuracy metric:

```py
>>> accuracy = evaluate.load("accuracy")
>>> accuracy.description
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
```

You can see that it describes how the metric works in theory. If you use this metric in your work, especially in an academic publication, you will want to reference it properly. For that you can look at the `citation` attribute:

```py
>>> accuracy.citation
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}
```

Before we can apply a metric or other evaluation module to a use case, we need to know what the input format of the metric is:

```py
>>> accuracy.features
{
    'predictions': Value(dtype='int32', id=None),
    'references': Value(dtype='int32', id=None)
}
```

Note that features always describe the type of a single input element.
In general we will add lists of elements so you can always think of a list around the types in `features`. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.

## Compute

Now that we know how the evaluation module works and what inputs it expects, we want to actually use it! When it comes to computing the actual score there are two main ways to do it:

1. All-in-one
2. Incremental

In the incremental approach the necessary inputs are added to the module with [`EvaluationModule.add`] or [`EvaluationModule.add_batch`] and the score is calculated at the end with [`EvaluationModule.compute`]. Alternatively, one can pass all the inputs at once to `compute()`. Let's have a look at the two approaches.

### How to compute

The simplest way to calculate the score of an evaluation module is by calling `compute()` directly with the necessary inputs. Simply pass the inputs as seen in `features` to the `compute()` method.

```py
>>> accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
{'accuracy': 0.5}
```

Evaluation modules return the results in a dictionary. However, in some instances you build up the predictions iteratively or in a distributed fashion, in which case `add()` or `add_batch()` are useful.

### Calculate a single metric or a batch of metrics

In many evaluation pipelines you build the predictions iteratively, such as in a for-loop. In that case you could store the predictions in a list and at the end pass them to `compute()`. With `add()` and `add_batch()` you can skip the step of storing the predictions separately. If you are only creating a single prediction at a time you can use `add()`:

```py
>>> for ref, pred in zip([0,1,0,1], [1,0,0,1]):
...     accuracy.add(references=ref, predictions=pred)
>>> accuracy.compute()
{'accuracy': 0.5}
```

Once you have gathered all predictions you can call `compute()` to compute the score based on all stored values.
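
To make the relationship between the two approaches concrete, here is a toy sketch of the accumulate-then-compute pattern in plain Python. `ToyAccuracy` is a hypothetical stand-in written for this illustration; it keeps inputs in Python lists and is not how 🤗 Evaluate stores or processes them internally.

```python
class ToyAccuracy:
    """Hypothetical stand-in for an evaluation module (illustration only)."""

    def __init__(self):
        self._references = []
        self._predictions = []

    def add(self, *, references, predictions):
        # Store a single (reference, prediction) pair.
        self._references.append(references)
        self._predictions.append(predictions)

    def add_batch(self, *, references, predictions):
        # Store several pairs at once.
        self._references.extend(references)
        self._predictions.extend(predictions)

    def compute(self, *, references=None, predictions=None):
        # All-in-one usage: inputs passed directly are added before scoring.
        if references is not None and predictions is not None:
            self.add_batch(references=references, predictions=predictions)
        refs, preds = self._references, self._predictions
        # Reset the stored inputs so the object can be reused.
        self._references, self._predictions = [], []
        correct = sum(r == p for r, p in zip(refs, preds))
        return {"accuracy": correct / len(refs)}


# All-in-one
all_in_one = ToyAccuracy().compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])

# Incremental
toy = ToyAccuracy()
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    toy.add(references=ref, predictions=pred)
incremental = toy.compute()

print(all_in_one, incremental)
```

Both calls see the same four pairs, so they return the same score; the incremental route only changes when the inputs are handed over.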
When getting predictions and references in batches you can use `add_batch()`, which adds a list of elements for later processing. The rest works as with `add()`:

```py
>>> for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
...     accuracy.add_batch(references=refs, predictions=preds)
>>> accuracy.compute()
{'accuracy': 0.5}
```

This is especially useful when you need to get the predictions from your model in batches:

```py
>>> for model_inputs, gold_standards in evaluation_dataset:
...     predictions = model(model_inputs)
...     metric.add_batch(references=gold_standards, predictions=predictions)
>>> metric.compute()
```

### Distributed evaluation

Computing metrics in a distributed environment can be tricky. Metric evaluation is executed in separate Python processes, or nodes, on different subsets of a dataset. Typically, when a metric score is additive (`f(A∪B) = f(A) + f(B)`), you can use distributed reduce operations to gather the scores for each subset of the dataset. But when a metric is non-additive (`f(A∪B) ≠ f(A) + f(B)`), it's not that simple. For example, you can't take the sum of the [F1](https://huggingface.co/spaces/evaluate-metric/f1) scores of each data subset as your **final metric**.

A common way to overcome this issue is to fall back on single-process evaluation. The metrics are then evaluated on a single GPU, which becomes inefficient.

🤗 Evaluate solves this issue by only computing the final metric on the first node. The predictions and references are computed and provided to the metric separately for each node. These are temporarily stored in an Apache Arrow table, avoiding cluttering the GPU or CPU memory. When you are ready to `compute()` the final metric, the first node is able to access the predictions and references stored on all the other nodes. Once it has gathered all the predictions and references, `compute()` will perform the final metric evaluation.
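
The non-additivity of F1 can be checked with a few lines of plain Python. The sketch below is only an illustration: `f1_score` is a hand-written binary F1 (not the library's metric), and the two shards are made-up data standing in for the subsets seen by two processes.

```python
def f1_score(references, predictions, positive=1):
    """Binary F1 for the `positive` class, computed from raw counts."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two hypothetical shards, as if produced by two separate processes.
shard_a = {"references": [1, 1, 1, 0], "predictions": [1, 1, 1, 1]}
shard_b = {"references": [1, 0, 0, 0], "predictions": [0, 0, 0, 1]}

f1_a = f1_score(**shard_a)
f1_b = f1_score(**shard_b)
mean_of_shards = (f1_a + f1_b) / 2

# Gathering all references and predictions first gives the actual score.
pooled_f1 = f1_score(
    shard_a["references"] + shard_b["references"],
    shard_a["predictions"] + shard_b["predictions"],
)

print(f"per-shard F1: {f1_a:.3f}, {f1_b:.3f}")  # 0.857, 0.000
print(f"mean of shard scores: {mean_of_shards:.3f}")  # 0.429
print(f"F1 on pooled data: {pooled_f1:.3f}")  # 0.667
```

Neither the sum nor the mean of the per-shard scores matches the score on the pooled data, which is why the predictions and references themselves have to be gathered before the final computation.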
This solution allows 🤗 Evaluate to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.

## Combining several evaluations

Often one wants to evaluate not just a single metric but a range of metrics capturing different aspects of a model. For example, for classification it is usually a good idea to compute F1-score, recall, and precision in addition to accuracy to get a better picture of model performance. Naturally, you can load several metrics and call them sequentially. However, a more convenient way is to use the [`~evaluate.combine`] function to bundle them together:

```python
>>> clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
```

The `combine` function accepts both a list of metric names and instantiated modules. The `compute` call then computes each metric:

```python
>>> clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{
  'accuracy': 0.667,
  'f1': 0.667,
  'precision': 1.0,
  'recall': 0.5
}
```

## Save and push to the Hub

Saving and sharing evaluation results is an important step. We provide the [`evaluate.save`] function to easily save metric results. You can pass either a specific filename or a directory. In the latter case, the results are saved in a file with an automatically created file name. Besides the directory or file name, the function takes any key-value pairs as inputs and stores them in a JSON file.

```py
>>> result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])

>>> hyperparams = {"model": "bert-base-uncased"}
>>> evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
PosixPath('results/result-2022_05_30-22_09_11.json')
```

The contents of the JSON file look like the following:

```json
{
    "experiment": "run 42",
    "accuracy": 0.5,
    "model": "bert-base-uncased",
    "_timestamp": "2022-05-30T22:09:11.959469",
    "_git_commit_hash": "123456789abcdefghijkl",
    "_evaluate_version": "0.1.0",
    "_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
    "_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}
```

In addition to the specified fields, it also contains useful system information for reproducing the results.

Besides storing the results locally, you should report them on the model's repository on the Hub. With the [`evaluate.push_to_hub`] function, you can easily report evaluation results to the model's repository:

```py
evaluate.push_to_hub(
  model_id="huggingface/gpt2-wikitext2",  # model repository on hub
  metric_value=0.5,                       # metric value
  metric_type="bleu",                     # metric name, e.g. accuracy.name
  metric_name="BLEU",                     # pretty name which is displayed
  dataset_type="wikitext",                # dataset name on the hub
  dataset_name="WikiText",                # pretty name
  dataset_split="test",                   # dataset split used
  task_type="text-generation",            # task id, see https://github.com/huggingface/evaluate/blob/main/src/evaluate/config.py#L154-L192
  task_name="Text Generation"             # pretty name for task
)
```

## Evaluator

The [`evaluate.evaluator`] provides automated evaluation and only requires a model, a dataset, and a metric, in contrast to the metrics in `EvaluationModule`s, which require the model's predictions. This makes it easier to evaluate a model on a dataset with a given metric, as the inference is handled internally. To make that possible it uses the [`~transformers.pipeline`] abstraction from `transformers`.
However, you can use your own framework as long as it follows the `pipeline` interface.

To run an evaluation with the `evaluator`, let's load a `transformers` pipeline with a model trained on IMDb (you can pass your own custom inference class for any framework as long as it follows the pipeline call API), the IMDb test split, and the accuracy metric.

```python
from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
data = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")
```

Then you can create an evaluator for text classification and pass the three objects to the `compute()` method. The label mapping lets `evaluate` align the pipeline outputs with the label column in the dataset:

```python
>>> task_evaluator = evaluator("text-classification")

>>> results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric=metric,
...                        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},)

>>> print(results)
{'accuracy': 0.934}
```

Calculating the value of the metric alone is often not enough to know whether a model performs significantly better than another one. With _bootstrapping_, `evaluate` computes confidence intervals and the standard error, which help estimate how stable a score is:

```python
>>> results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric=metric,
...                        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
...                        strategy="bootstrap", n_resamples=200)

>>> print(results)
{'accuracy':
    {
      'confidence_interval': (0.906, 0.9406749892841922),
      'standard_error': 0.00865213251082787,
      'score': 0.923
    }
}
```

The evaluator expects a `"text"` and a `"label"` column for the data input. If your dataset uses different column names, you can specify them with the keywords `input_column` and `label_column` (the defaults are `"text"` and `"label"`).
Currently only `"text-classification"` is supported with more tasks being added in the future.

## Visualization

When comparing several models, it is sometimes hard to spot the differences in their performance simply by looking at their scores. Often there is also no single best model; instead there are trade-offs between, for example, latency and accuracy, as larger models might perform better but are also slower. We are gradually adding different visualization approaches, such as plots, to make choosing the best model for a use case easier.

For instance, if you have a list of results from multiple models (as dictionaries), you can feed them into the `radar_plot()` function:

```python
>>> import evaluate
>>> from evaluate.visualization import radar_plot

>>> data = [
...    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
...    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
...    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
...    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6}
...    ]
>>> model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
>>> plot = radar_plot(data=data, model_names=model_names)
>>> plot.show()
```

This lets you visually compare the four models and choose the best one for your needs, based on one or several metrics.

## Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

`EvaluationSuite` scripts can be defined as follows, and support Python code for data preprocessing.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
```

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
```

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|---------:|----------------------:|-------------------:|:-------------------|:----------|
| 0.3      | 4.62804               | 2.16074            | 0.462804           | imdb      |
| 0        | 0.686388              | 14.569             | 0.0686388          | sst2      |

# Using the `evaluator`

The `Evaluator` classes make it possible to evaluate a triplet of model, dataset, and metric.
The model is wrapped in a pipeline, which is responsible for handling all preprocessing and post-processing. Out of the box, `Evaluator`s support transformers pipelines for the supported tasks, but custom pipelines can be passed, as showcased in the section [Using the `evaluator` with custom pipelines](custom_evaluator).

Currently supported tasks are:

- `"text-classification"`: will use the [`TextClassificationEvaluator`].
- `"token-classification"`: will use the [`TokenClassificationEvaluator`].
- `"question-answering"`: will use the [`QuestionAnsweringEvaluator`].
- `"image-classification"`: will use the [`ImageClassificationEvaluator`].
- `"text-generation"`: will use the [`TextGenerationEvaluator`].
- `"text2text-generation"`: will use the [`Text2TextGenerationEvaluator`].
- `"summarization"`: will use the [`SummarizationEvaluator`].
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].
- `"audio-classification"`: will use the [`AudioClassificationEvaluator`].

To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s.

Each task has its own set of requirements for the dataset format and pipeline output, so make sure to check them for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evaluate one or several models, datasets, and metrics at the same time.

## Text classification

The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Besides the model, data, and metric inputs it takes the following optional inputs:

- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. For example, the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.

By default the `"accuracy"` metric is computed.

### Evaluate models on the Hub

There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a `transformers` model and pass it to the evaluator, or you can pass an initialized `transformers.Pipeline`. Alternatively you can pass any callable that behaves like a `pipeline` call for the task in any framework.

So any of the following works:

```py
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")

# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```

Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and otherwise the CPU.
If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a non-negative integer (starting with 0) will use the associated CUDA device.

The results will look as follows:

```python
{
    'accuracy': 0.918,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
```

Note that evaluation results include both the requested metric and information about the time it took to obtain predictions through the pipeline.

The timing figures can give a useful indication of a model's inference speed but should be taken with a grain of salt: they include all the processing that goes on in the pipeline. This may include tokenization and post-processing, which can differ depending on the model. Furthermore, they depend a lot on the hardware you are running the evaluation on, and you may be able to improve performance by optimizing things like the batch size.

### Evaluate multiple metrics

With the [`combine`] function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator:

```python
import evaluate

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```

The results will look as follows:

```python
{
    'accuracy': 0.918,
    'f1': 0.916,
    'precision': 0.9147,
    'recall': 0.9187,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
```

Next let's have a look at token classification.

## Token Classification

With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments:

- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. For example, the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
- `join_by=" "`: while most datasets are already tokenized, the pipeline expects a string. Thus the tokens need to be joined before being passed to the pipeline. By default they are joined with a whitespace.

Let's have a look at how we can use the evaluator to benchmark several models.

### Benchmarking several models

Here is an example where several models are compared with the `evaluator` in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, and metric computation:

```python
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")

results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
        )
    )

df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
```

The result is a table that looks like this:

| model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
|:-------------------------------------------------------------------|-------------:|-------------------:|------------------------:|---------------------:|---------------------:|
| Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | 0.002 |
| philschmid/distilroberta-base-ner-conll2003 | 0.961 | 0.994 | 2.436 | 410.579 | 0.002 |
| xlm-roberta-large-finetuned-conll03-english | 0.969 | 0.882 | 11.996 | 83.359 | 0.012 |

### Visualizing results

You can feed the `results` list above into the `radar_plot()` function to visualize different aspects of the models' performance and choose the model that fits best, depending on the metric(s) relevant to your use case:

```python
>>> import evaluate
>>> from evaluate.visualization import radar_plot

>>> plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
>>> plot.show()
```

Don't forget to specify `invert_range` for metrics for which smaller is better (as is the case for latency in seconds).

If you want to save the plot locally, you can use the `plot.savefig()` function with the option `bbox_inches='tight'`, to make sure no part of the image gets cut off.

## Question Answering

With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that's required for these models. It has the following specific arguments:

- `question_column="question"`: the name of the column containing the question in the dataset
- `context_column="context"`: the name of the column containing the context
- `id_column="id"`: the name of the column containing the identification field of the question and answer pair
- `label_column="answers"`: the name of the column containing the answers
- `squad_v2_format=None`: whether the dataset follows the format of the squad_v2 dataset, where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.

Let's have a look at how we can evaluate QA models and compute confidence intervals at the same time.

### Confidence intervals

Every evaluator comes with the option to compute confidence intervals using [bootstrapping](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Simply pass `strategy="bootstrap"` and set the number of resamples with `n_resamples`.
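
To give an intuition for what bootstrapping does, here is a minimal sketch in plain Python. It uses a simple percentile interval on accuracy; the helper `bootstrap_accuracy` and the toy data are assumptions made for this illustration, not the library's implementation, and the SciPy function linked above supports other interval methods.

```python
import random
import statistics


def bootstrap_accuracy(references, predictions, n_resamples=200,
                       confidence_level=0.95, seed=0):
    """Percentile bootstrap for accuracy (illustrative helper, not a library API)."""
    rng = random.Random(seed)
    n = len(references)
    # 1 if the prediction is correct, 0 otherwise.
    correct = [int(r == p) for r, p in zip(references, predictions)]
    score = sum(correct) / n

    # Resample the examples with replacement and recompute the score each time.
    resampled_scores = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled_scores.append(sum(sample) / n)

    # Read the interval bounds off the sorted resampled scores.
    resampled_scores.sort()
    alpha = (1 - confidence_level) / 2
    lower = resampled_scores[int(alpha * n_resamples)]
    upper = resampled_scores[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return {
        "score": score,
        "confidence_interval": (lower, upper),
        "standard_error": statistics.stdev(resampled_scores),
    }


# Toy data: 100 examples, 10 of which are predicted incorrectly.
references = [0, 1] * 50
predictions = list(references)
for i in range(10):
    predictions[i] = 1 - predictions[i]

result = bootstrap_accuracy(references, predictions)
print(result)
```

A larger `n_resamples` gives a more stable interval at the cost of more computation. With the evaluator, the same idea is requested through `strategy` and `n_resamples`: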
```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")

data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
```

Results include confidence intervals as well as error estimates as follows:

```python
{
    'exact_match':
    {
        'confidence_interval': (79.67, 84.54),
        'score': 82.30,
        'standard_error': 1.28
    },
    'f1':
    {
        'confidence_interval': (85.30, 88.88),
        'score': 87.23,
        'standard_error': 0.97
    },
    'latency_in_seconds': 0.0085,
    'samples_per_second': 117.31,
    'total_time_in_seconds': 8.52
}
```

## Image classification

With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments as the text classification evaluator:

- `input_column="image"`: the name of the column containing the images as PIL ImageFile
- `label_column="label"`: the name of the column containing the labels
- `label_mapping=None`: maps the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column`

Let's have a look at how we can evaluate image classification models on large datasets.

### Handling large datasets

The evaluator can be used on large datasets! Below, an example shows how to use it on ImageNet-1k for image classification. Beware that this example requires downloading ~150 GB.

```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)
```

Since we are using `datasets` to store data we make use of a technique called memory mapping.
This means that the dataset is never fully loaded into memory, which saves a lot of RAM. Running the above code uses only roughly 1.5 GB of RAM while the validation split is more than 30 GB in size.

# Scikit-Learn

To run the scikit-learn examples make sure you have installed the following library:

```bash
pip install -U scikit-learn
```

The metrics in `evaluate` can be easily integrated with a Scikit-Learn estimator or [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).

However, these metrics require that we generate the predictions from the model. The predictions and labels from the estimators can be passed to `evaluate` metrics to compute the required values.

```python
import numpy as np
np.random.seed(0)
import evaluate
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```

Load data from https://www.openml.org/d/40945:

```python
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
```

Alternatively, X and y can be obtained directly from the frame attribute:

```python
# Load the dataset as a Bunch object first so that `titanic.frame` is defined
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.frame.drop('survived', axis=1)
y = titanic.frame['survived']
```

We create the preprocessing pipelines for both numeric and categorical data. Note that `pclass` could be treated as either a categorical or a numeric feature.
```python
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
```

Append the classifier to the preprocessing pipeline. Now we have a full prediction pipeline.

```python
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

As `Evaluate` metrics use lists as inputs for references and predictions, we need to convert them to Python lists.

```python
# Evaluate metrics accept lists as inputs for values of references and predictions

y_test = y_test.tolist()
y_pred = y_pred.tolist()

# Accuracy

accuracy_metric = evaluate.load("accuracy")
accuracy = accuracy_metric.compute(references=y_test, predictions=y_pred)
print("Accuracy:", accuracy)
# Accuracy: 0.79
```

You can use any suitable `evaluate` metric with the estimators as long as it is compatible with the task and predictions.

# Types of Evaluations in 🤗 Evaluate

The goal of the 🤗 Evaluate library is to support different types of evaluation, depending on different goals, datasets and models.

Here are the types of evaluations that are currently supported with a few examples for each:

## Metrics

A metric measures the performance of a model on a given dataset. This is often based on an existing ground truth (i.e. a set of references), but there are also *referenceless metrics* which allow evaluating generated text by leveraging a pretrained model such as [GPT-2](https://huggingface.co/gpt2).
Examples of metrics include:

- [Accuracy](https://huggingface.co/metrics/accuracy): the proportion of correct predictions among the total number of cases processed.
- [Exact Match](https://huggingface.co/metrics/exact_match): the rate at which the input predicted strings exactly match their references.
- [Mean Intersection over Union (IoU)](https://huggingface.co/metrics/mean_iou): the area of overlap between the predicted segmentation of an image and the ground truth divided by the area of union between the predicted segmentation and the ground truth.

Metrics are often used to track model performance on benchmark datasets, and to report progress on tasks such as [machine translation](https://huggingface.co/tasks/translation) and [image classification](https://huggingface.co/tasks/image-classification).

## Comparisons

Comparisons can be useful to compare the performance of two or more models on a single test dataset.

For instance, the [McNemar Test](https://github.com/huggingface/evaluate/tree/main/comparisons/mcnemar) is a paired nonparametric statistical hypothesis test that takes the predictions of two models and compares them, aiming to measure whether the models' predictions diverge or not. The p value it outputs, which ranges from `0.0` to `1.0`, indicates how likely the observed disagreement between the two models' predictions would be if the models performed equally well, with a lower p value indicating a more significant difference.

Comparisons have yet to be systematically used when comparing and reporting model performance; however, they are useful tools to go beyond simply comparing leaderboard scores and for getting more information on the way model predictions differ.

## Measurements

In the 🤗 Evaluate library, measurements are tools for gaining more insights on datasets and model predictions.
For instance, in the case of datasets, it can be useful to calculate the [average word length](https://github.com/huggingface/evaluate/tree/main/measurements/word_length) of a dataset's entries, and how it is distributed -- this can help when choosing the maximum input length for [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).

In the case of model predictions, it can help to calculate the average [perplexity](https://huggingface.co/metrics/perplexity) of model predictions using different models such as [GPT-2](https://huggingface.co/gpt2) and [BERT](https://huggingface.co/bert-base-uncased), which can indicate the quality of generated text when no reference is available.

All three types of evaluation supported by the 🤗 Evaluate library are meant to be mutually complementary, and help our community carry out more mindful and responsible evaluation.

We will continue adding more types of metrics, measurements and comparisons in coming months, and are counting on community involvement (via [PRs](https://github.com/huggingface/evaluate/compare) and [issues](https://github.com/huggingface/evaluate/issues/new/choose)) to make the library as extensive and inclusive as possible!
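To make the comparison idea from the McNemar paragraph more concrete, here is a minimal sketch of a McNemar-style test that uses only the Python standard library. It is not the code behind the `mcnemar` comparison module: the function name `mcnemar_sketch`, the returned keys, and the toy predictions are assumptions made for illustration, and the statistic is the plain chi-square form without continuity correction.

```python
import math


def mcnemar_sketch(references, predictions1, predictions2):
    """Illustrative McNemar test (hypothetical helper, no continuity correction)."""
    triples = list(zip(references, predictions1, predictions2))
    # b: cases where model 1 is right and model 2 is wrong
    b = sum(p1 == r and p2 != r for r, p1, p2 in triples)
    # c: cases where model 1 is wrong and model 2 is right
    c = sum(p1 != r and p2 == r for r, p1, p2 in triples)
    if b + c == 0:
        # The two models never disagree on correctness: no evidence of a difference
        return {"stat": 0.0, "p": 1.0}
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(stat / 2))
    return {"stat": stat, "p": p}


# Toy data, invented for this example
references = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_a = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
model_b = [0, 1, 0, 1, 1, 0, 0, 1, 1, 1]

result = mcnemar_sketch(references, model_a, model_b)
print(result)
```

Only the examples on which exactly one of the two models is correct contribute to the statistic, which is why the test says something different from a simple difference in accuracy.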

# 🤗 Evaluate

A library for easily evaluating machine learning models and datasets.

With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!

Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metric's limitations and usage.

Learn the basics and become familiar with loading, computing, and saving with 🤗 Evaluate. Start here if you are using 🤗 Evaluate for the first time!

How-to guides

Practical guides to help you achieve a specific goal. Take a look at these guides to learn how to use 🤗 Evaluate to solve real-world problems.

Conceptual guides

High-level explanations for building a better understanding of important topics such as considerations going into evaluating a model or dataset and the difference between metrics, measurements, and comparisons.


Technical descriptions of how 🤗 Evaluate classes and methods work.

# Considerations for model evaluation

Developing an ML model is rarely a one-shot deal: it often involves multiple stages of defining the model architecture and tuning hyper-parameters before converging on a final set. Responsible model evaluation is a key part of this process, and 🤗 Evaluate is here to help!

Here are some things to keep in mind when evaluating your model using the 🤗 Evaluate library:

## Properly splitting your data

Good evaluation generally requires three splits of your dataset:

- **train**: this is used for training your model.
- **validation**: this is used for validating the model hyperparameters.
- **test**: this is used for evaluating your model.

Many of the datasets on the 🤗 Hub are separated into 2 splits: `train` and `validation`; others are split into 3 splits (`train`, `validation` and `test`) -- make sure to use the right split for the right purpose!

Some datasets on the 🤗 Hub are already separated into these three splits. However, there are also many that only have a train/validation or only train split.

If the dataset you're using doesn't have a predefined train-test split, it is up to you to define which part of the dataset you want to use for training your model and which you want to use for hyperparameter tuning or final evaluation.

Training and evaluating on the same split can misrepresent your results! If you overfit on your training data the evaluation results on that split will look great but the model will perform poorly on new data.

Depending on the size of the dataset, you can keep anywhere from 10-30% for evaluation and the rest for training, while aiming to set up the test set to reflect the production data as closely as possible. Check out [this thread](https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090) for a more in-depth discussion of dataset splitting!
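As a rough illustration of the splitting advice above, the sketch below cuts a collection of examples into three disjoint splits using only the standard library. The helper name `train_val_test_split`, its parameters, and the chosen fractions are assumptions for this example; with a 🤗 Datasets object you would more likely rely on the dataset's own splitting utilities.

```python
import random


def train_val_test_split(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into three disjoint splits (hypothetical helper)."""
    if val_fraction + test_fraction >= 1:
        raise ValueError("validation and test fractions must leave room for training data")
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible without touching global state
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


# 1000 dummy example ids: keep 20% for testing and 10% for hyperparameter tuning
splits = train_val_test_split(range(1000), val_fraction=0.1, test_fraction=0.2)
print({name: len(rows) for name, rows in splits.items()})
```

Fixing the seed matters here: if the split changes between runs, examples can leak from the test set into training and the reported numbers stop being comparable.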
## The impact of class imbalance

While many academic datasets, such as the [IMDb dataset](https://huggingface.co/datasets/imdb) of movie reviews, are perfectly balanced, most real-world datasets are not. In machine learning a *balanced dataset* corresponds to a dataset where all labels are represented equally. In the case of the IMDb dataset this means that there are as many positive as negative reviews in the dataset. In an imbalanced dataset this is not the case: in fraud detection for example there are usually many more non-fraud cases than fraud cases in the dataset.

Having an imbalanced dataset can skew the results of your metrics. Imagine a dataset with 99 "non-fraud" cases and 1 "fraud" case. A simple model that always predicts "non-fraud" would yield 99% accuracy, which might sound good at first until you realize that you will never catch a fraud case.

Often, using more than one metric can help get a better idea of your model's performance from different points of view. For instance, metrics like **[recall](https://huggingface.co/metrics/recall)** and **[precision](https://huggingface.co/metrics/precision)** can be used together, and the **[f1 score](https://huggingface.co/metrics/f1)** is actually the harmonic mean of the two.

In cases where a dataset is balanced, using [accuracy](https://huggingface.co/metrics/accuracy) can reflect the overall model performance:

![Balanced Labels](https://huggingface.co/datasets/evaluate/media/resolve/main/balanced-classes.png)

In cases where there is an imbalance, using [F1 score](https://huggingface.co/metrics/f1) can be a better representation of performance, given that it encompasses both precision and recall.

![Imbalanced Labels](https://huggingface.co/datasets/evaluate/media/resolve/main/imbalanced-classes.png)

Using accuracy in an imbalanced setting is less ideal, since it is not sensitive to minority classes and will not faithfully reflect model performance on them.

## Offline vs. online model evaluation

There are multiple ways to evaluate models, and an important distinction is offline versus online evaluation:

**Offline evaluation** is done before deploying a model or using insights generated from a model, using static datasets and metrics.

**Online evaluation** means evaluating how a model is performing after deployment and during its use in production.

These two types of evaluation can use different metrics and measure different aspects of model performance. For example, offline evaluation can compare a model to other models based on their performance on common benchmarks, whereas online evaluation will evaluate aspects such as latency and accuracy of the model based on production data (for example, the number of user queries that it was able to address).

## Trade-offs in model evaluation

When evaluating models in practice, there are often trade-offs that have to be made between different aspects of model performance: for instance, choosing a model that is slightly less accurate but that has a faster inference time, compared to a high-accuracy model that has a higher memory footprint and requires access to more GPUs.

Here are other aspects of model performance to consider during evaluation:

### Interpretability

When evaluating models, **interpretability** (i.e. the ability to *interpret* results) can be very important, especially when deploying models in production.

For instance, metrics such as [exact match](https://huggingface.co/spaces/evaluate-metric/exact_match) have a set range (between 0 and 1, or 0% and 100%) and are easily understandable to users: for a pair of strings, the exact match score is 1 if the two strings are exactly the same, and 0 otherwise.
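The exact match rule described above is simple enough to write out by hand. The following is a small sketch, not the code of the `exact_match` module; the function name `exact_match_score` and the sample strings are made up for illustration.

```python
def exact_match_score(predictions, references):
    """Fraction of predictions that are string-identical to their reference (hypothetical helper)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    # Each pair scores 1 if the strings are identical and 0 otherwise
    per_example = [int(p == r) for p, r in zip(predictions, references)]
    return sum(per_example) / len(per_example)


# Invented examples: the last pair differs only in capitalization
predictions = ["the cat sat", "hello world", "Paris"]
references = ["the cat sat", "hello there", "paris"]

score = exact_match_score(predictions, references)
print(score)
```

In this sketch the pair "Paris" / "paris" counts as a mismatch because the comparison is case sensitive, so even an easily interpreted metric depends on normalization choices that should be reported alongside the score.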
Other metrics, such as [BLEU](https://huggingface.co/spaces/evaluate-metric/bleu), are harder to interpret: while they also range between 0 and 1, they can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used (see the [metric card](https://huggingface.co/spaces/evaluate-metric/bleu/blob/main/README.md) for more information about BLEU limitations). This means that it is difficult to interpret a BLEU score without having more information about the procedure used for obtaining it.

Interpretability can be more or less important depending on the evaluation use case, but it is a useful aspect of model evaluation to keep in mind, since communicating and comparing model evaluations is an important part of responsible machine learning.

### Inference speed and memory footprint

While recent years have seen increasingly large ML models achieve high performance on a large variety of tasks and benchmarks, deploying these multi-billion parameter models in practice can be a challenge in itself, and many organizations lack the resources for this. This is why considering the **inference speed** and **memory footprint** of models is important, especially when doing online model evaluation.

Inference speed refers to the time that it takes for a model to make a prediction -- this will vary depending on the hardware used and the way in which models are queried, e.g. in real time via an API or in batch jobs that run once a day.

Memory footprint refers to the size of the model weights and how much hardware memory they occupy. If a model is too large to fit on a single GPU or CPU, then it has to be split over multiple ones, which can be more or less difficult depending on the model architecture and the deployment method.
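Both quantities can be estimated with a few lines of code. The sketch below is an assumption-laden illustration: `measure_latency`, `weights_memory_gb`, and `dummy_model` are hypothetical names, the stand-in model just counts words, and the memory estimate covers the weights only (number of parameters times bytes per parameter), ignoring activations and framework overhead.

```python
import statistics
import time


def measure_latency(predict_fn, inputs, warmup=2):
    """Time one call per input and summarize (hypothetical helper)."""
    # Warm-up calls are excluded so one-off setup cost does not skew the numbers
    for x in inputs[:warmup]:
        predict_fn(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        timings.append(time.perf_counter() - start)
    total = sum(timings)
    return {
        "mean_latency_s": statistics.mean(timings),
        "samples_per_second": len(timings) / total if total > 0 else float("inf"),
    }


def weights_memory_gb(num_parameters, bytes_per_parameter=2):
    """Rough size of the weights alone, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def dummy_model(text):
    # Stand-in for a real model call; replace with your own predict function
    return len(text.split())


stats = measure_latency(dummy_model, ["a short query"] * 100)
# A hypothetical 7-billion-parameter model stored in 16-bit precision
footprint = weights_memory_gb(7_000_000_000, bytes_per_parameter=2)
print(stats, footprint)
```

Measured latency depends heavily on hardware, batch size, and how the model is served, so numbers like these are only comparable when they are collected under the same conditions.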
When doing online model evaluation, there is often a trade-off to be made between inference speed and accuracy or precision, whereas this is less the case for offline evaluation.

## Limitations and bias

All models and all metrics have their limitations and biases, which depend on the way in which they were trained, the data that was used, and their intended uses. It is important to measure and communicate these limitations clearly to prevent misuse and unintended impacts, for instance via [model cards](https://huggingface.co/course/chapter4/4?fw=pt) which document the training and evaluation process.

Measuring biases can be done by evaluating models on datasets such as [Wino Bias](https://huggingface.co/datasets/wino_bias) or [MD Gender Bias](https://huggingface.co/datasets/md_gender_bias), and by doing [Interactive Error Analysis](https://huggingface.co/spaces/nazneen/error-analysis) to try to identify which subsets of the evaluation dataset a model performs poorly on.

We are currently working on additional measurements that can be used to quantify different dimensions of bias in both models and datasets -- stay tuned for more documentation on this topic!

# Evaluator

The evaluator classes for automatic evaluation.
## Evaluator classes

The main entry point for using the evaluator:

[[autodoc]] evaluate.evaluator

The base class for all evaluator classes:

[[autodoc]] evaluate.Evaluator

## The task specific evaluators

### ImageClassificationEvaluator

[[autodoc]] evaluate.ImageClassificationEvaluator

### QuestionAnsweringEvaluator

[[autodoc]] evaluate.QuestionAnsweringEvaluator
    - compute

### TextClassificationEvaluator

[[autodoc]] evaluate.TextClassificationEvaluator

### TokenClassificationEvaluator

[[autodoc]] evaluate.TokenClassificationEvaluator
    - compute

### TextGenerationEvaluator

[[autodoc]] evaluate.TextGenerationEvaluator
    - compute

### Text2TextGenerationEvaluator

[[autodoc]] evaluate.Text2TextGenerationEvaluator
    - compute

### SummarizationEvaluator

[[autodoc]] evaluate.SummarizationEvaluator
    - compute

### TranslationEvaluator

[[autodoc]] evaluate.TranslationEvaluator
    - compute

### AutomaticSpeechRecognitionEvaluator

[[autodoc]] evaluate.AutomaticSpeechRecognitionEvaluator
    - compute

### AudioClassificationEvaluator

[[autodoc]] evaluate.AudioClassificationEvaluator
    - compute

# Hub methods

Methods for using the Hugging Face Hub:

## Push to hub

[[autodoc]] evaluate.push_to_hub

# Main classes

## EvaluationModuleInfo

The base class `EvaluationModuleInfo` implements the logic for the subclasses `MetricInfo`, `ComparisonInfo`, and `MeasurementInfo`.

[[autodoc]] evaluate.EvaluationModuleInfo

[[autodoc]] evaluate.MetricInfo

[[autodoc]] evaluate.ComparisonInfo

[[autodoc]] evaluate.MeasurementInfo

## EvaluationModule

The base class `EvaluationModule` implements the logic for the subclasses `Metric`, `Comparison`, and `Measurement`.

[[autodoc]] evaluate.EvaluationModule

[[autodoc]] evaluate.Metric

[[autodoc]] evaluate.Comparison

[[autodoc]] evaluate.Measurement

## CombinedEvaluations

The `combine` function combines multiple `EvaluationModule`s into a single `CombinedEvaluations`.
[[autodoc]] evaluate.combine

[[autodoc]] CombinedEvaluations

# Loading methods

Methods for listing and loading evaluation modules:

## List

[[autodoc]] evaluate.list_evaluation_modules

## Load

[[autodoc]] evaluate.load

# Visualization methods

Methods for visualizing evaluation results:

## Radar Plot

[[autodoc]] evaluate.visualization.radar_plot

# Logging methods

🤗 Evaluate strives to be transparent and explicit about how it works, but this can be quite verbose at times. We have included a series of logging methods which allow you to easily adjust the level of verbosity of the entire library. Currently the default verbosity of the library is set to `WARNING`.

To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity to the `INFO` level:

```py
import evaluate
evaluate.logging.set_verbosity_info()
```

You can also use the environment variable `EVALUATE_VERBOSITY` to override the default verbosity, and set it to one of the following: `debug`, `info`, `warning`, `error`, `critical`:

```bash
EVALUATE_VERBOSITY=error ./myprogram.py
```

All the methods of this logging module are documented below. The main ones are:

- [`logging.get_verbosity`] to get the current level of verbosity in the logger
- [`logging.set_verbosity`] to set the verbosity to the level of your choice

In order from the least to the most verbose (with their corresponding `int` values):

1. `logging.CRITICAL` or `logging.FATAL` (int value, 50): only report the most critical errors.
2. `logging.ERROR` (int value, 40): only report errors.
3. `logging.WARNING` or `logging.WARN` (int value, 30): only report errors and warnings. This is the default level used by the library.
4. `logging.INFO` (int value, 20): report errors, warnings and basic information.
5. `logging.DEBUG` (int value, 10): report all information.

By default, `tqdm` progress bars will be displayed during evaluate download and processing.
[`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to suppress or unsuppress this behavior.

## Functions

[[autodoc]] evaluate.logging.get_verbosity

[[autodoc]] evaluate.logging.set_verbosity

[[autodoc]] evaluate.logging.set_verbosity_info

[[autodoc]] evaluate.logging.set_verbosity_warning

[[autodoc]] evaluate.logging.set_verbosity_debug

[[autodoc]] evaluate.logging.set_verbosity_error

[[autodoc]] evaluate.logging.disable_propagation

[[autodoc]] evaluate.logging.enable_propagation

[[autodoc]] evaluate.logging.get_logger

[[autodoc]] evaluate.logging.enable_progress_bar

[[autodoc]] evaluate.logging.disable_progress_bar

## Levels

### evaluate.logging.CRITICAL

evaluate.logging.CRITICAL = 50

### evaluate.logging.DEBUG

evaluate.logging.DEBUG = 10

### evaluate.logging.ERROR

evaluate.logging.ERROR = 40

### evaluate.logging.FATAL

evaluate.logging.FATAL = 50

### evaluate.logging.INFO

evaluate.logging.INFO = 20

### evaluate.logging.NOTSET

evaluate.logging.NOTSET = 0

### evaluate.logging.WARN

evaluate.logging.WARN = 30

### evaluate.logging.WARNING

evaluate.logging.WARNING = 30

# Saving methods

Methods for saving evaluation results:

## Save

[[autodoc]] evaluate.save

# Creating an EvaluationSuite

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along some axis. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks which test for general language capabilities like natural language entailment or question-answering, or tasks designed to probe the model along fairness and bias dimensions.

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples as a SubTask to evaluate a model on a collection of several evaluation tasks.
See the [evaluator documentation](base_evaluator) for a list of currently supported tasks. A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script. Some datasets require additional preprocessing before passing them to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask` which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute. To create a new `EvaluationSuite`, create a [new Space](https://huggingface.co/new-space) with a .py file which matches the name of the Space, add the below template to a Python file, and fill in the attributes for a new task. The mandatory attributes for a new `SubTask` are `task_type` and `data`. 1. [`task_type`] maps to the tasks currently supported by the Evaluator. 2. [`data`] can be an instantiated Hugging Face dataset object or the name of a dataset. 3. [`subset`] and [`split`] can be used to define which name and split of the dataset should be used for evaluation. 4. [`args_for_task`] should be a dictionary with kwargs to be passed to the Evaluator. 
```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)
        self.preprocessor = lambda x: {"text": x["text"].lower()}
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="sst2",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="rte",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence1",
                    "second_input_column": "sentence2",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0,
                        "LABEL_1": 1
                    }
                }
            )
        ]
```

An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`:

```
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
```

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|-----------:|------------------------:|---------------------:|---------------------:|:------------|
| 0.5 | 0.740811 | 13.4987 | 0.0740811 | glue/sst2 |
| 0.4 | 1.67552 | 5.9683 | 0.167552 | glue/rte |

# 🤗 Transformers

To run the 🤗 Transformers examples make sure you have installed the following libraries:

```bash
pip install datasets transformers torch evaluate nltk rouge_score
```

## Trainer

The metrics in `evaluate` can be easily integrated with the [`~transformers.Trainer`]. The `Trainer` accepts a `compute_metrics` keyword argument that passes a function to compute metrics.
One can specify the evaluation interval with `evaluation_strategy` in the [`~transformers.TrainingArguments`], and based on that, the model is evaluated accordingly, and the predictions and labels passed to `compute_metrics`.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

# Prepare and tokenize dataset
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(200))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))

# Setup evaluation
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load pretrained model and evaluate model after each epoch
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
```

## Seq2SeqTrainer

We can use the [`~transformers.Seq2SeqTrainer`] for sequence-to-sequence tasks such as translation or summarization. For such generative tasks usually metrics such as ROUGE or BLEU are evaluated. However, these metrics require that we generate some text with the model rather than a single forward pass as with e.g. classification.
The `Seq2SeqTrainer` allows for the use of the generate method when setting `predict_with_generate=True` which will generate text for each sample in the evaluation set. That means we evaluate generated text within the `compute_metrics` function. We just need to decode the predictions and labels first.

```python
import nltk
from datasets import load_dataset
import evaluate
import numpy as np
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Prepare and tokenize dataset
billsum = load_dataset("billsum", split="ca_test").shuffle(seed=42).select(range(200))
billsum = billsum.train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("t5-small")
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_billsum = billsum.map(preprocess_function, batched=True)

# Setup evaluation
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # decode preds and labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return result

# Load pretrained model and evaluate model after each epoch
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    fp16=True,
    predict_with_generate=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
```

You can use any `evaluate` metric with the `Trainer` and `Seq2SeqTrainer` as long as they are compatible with the task and predictions. In case you don't want to train a model but just evaluate an existing model you can replace `trainer.train()` with `trainer.evaluate()` in the above scripts.

# Creating and sharing a new evaluation

## Setup

Before you can create a new metric make sure you have all the necessary dependencies installed:

```bash
pip install evaluate[template]
```

Also make sure your Hugging Face token is registered so you can connect to the Hugging Face Hub:

```bash
huggingface-cli login
```

## Create

All evaluation modules, be it metrics, comparisons, or measurements, live on the 🤗 Hub in a [Space](https://huggingface.co/docs/hub/spaces) (see for example [Accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy)). In principle, you could set up a new Space and add a new module following the same structure. However, we added a CLI that makes creating a new evaluation module much easier:

```bash
evaluate-cli create "My Metric" --module_type "metric"
```

This will create a new Space on the 🤗 Hub, clone it locally, and populate it with a template. Instructions on how to fill the template will be displayed in the terminal, but are also explained here in more detail.
For more information about Spaces, see the [Spaces documentation](https://huggingface.co/docs/hub/spaces).

## Module script

The evaluation module script (the file with suffix `*.py`) is the core of the new module and includes all the code for computing the evaluation.

### Attributes

Start by adding some information about your evaluation module in [`EvaluationModule._info`]. The most important attributes you should specify are:

1. [`EvaluationModuleInfo.description`] provides a brief description about your evaluation module.

2. [`EvaluationModuleInfo.citation`] contains a BibTex citation for the evaluation module.

3. [`EvaluationModuleInfo.inputs_description`] describes the expected inputs and outputs. It may also provide an example usage of the evaluation module.

4. [`EvaluationModuleInfo.features`] defines the name and type of the predictions and references. This has to be either a single `datasets.Features` object or a list of `datasets.Features` objects if multiple input types are allowed.

Then, we can move on to prepare everything before the actual computation.

### Download

Some evaluation modules require external data: NLTK, for example, needs additional resources, and the BLEURT metric needs model checkpoints. You can implement these downloads in [`EvaluationModule._download_and_prepare`], which downloads and caches the resources via the `dl_manager`. A simplified example of how BLEURT downloads and loads a checkpoint:

```py
def _download_and_prepare(self, dl_manager):
    model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name])
    self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name))
```

Or if you need to download the NLTK `"punkt"` resources:

```py
def _download_and_prepare(self, dl_manager):
    import nltk
    nltk.download("punkt")
```

Next, we need to define how the computation of the evaluation module works.

### Compute

The computation is performed in the [`EvaluationModule._compute`] method.
It takes the same arguments as `EvaluationModuleInfo.features` and should then return the result as a dictionary. Here is an example of an exact match metric:

```py
def _compute(self, references, predictions):
    em = sum([r==p for r, p in zip(references, predictions)])/len(references)
    return {"exact_match": em}
```

This method is used when you call `.compute()` later on.

## Readme

When you use the `evaluate-cli` to set up the evaluation module, the Readme structure and instructions are automatically created. It should include a general description of the metric, information about its input/output format, examples, as well as information about its limitations or biases and references.

## Requirements

If your evaluation module has additional dependencies (e.g. `sklearn` or `nltk`), the `requirements.txt` file is the place to put them. The file follows the `pip` format and you can list all dependencies there.

## App

The `app.py` is where the Spaces widget lives. In general it looks like the following and does not require any changes:

```py
import evaluate
from evaluate.utils import launch_gradio_widget

module = evaluate.load("lvwerra/element_count")
launch_gradio_widget(module)
```

If you want a custom widget you could add your gradio app here.

## Push to Hub

Finally, when you are done with all the above changes it is time to push your evaluation module to the hub. To do so navigate to the folder of your module and git add/commit/push the changes to the hub:

```
cd PATH_TO_MODULE
git add .
git commit -m "Add my new, shiny module."
git push
```

Tada 🎉! Your evaluation module is now on the 🤗 Hub and ready to be used by everybody!

# Choosing a metric for your task

**So you've trained your model and want to see how well it's doing on a dataset of your choice. Where do you start?**

There is no "one size fits all" approach to choosing an evaluation metric, but some good guidelines to keep in mind are:

## Categories of metrics

There are 3 high-level categories of metrics:

1. *Generic metrics*, which can be applied to a variety of situations and datasets, such as precision and accuracy.
2. *Task-specific metrics*, which are limited to a given task, such as Machine Translation (often evaluated using metrics [BLEU](https://huggingface.co/metrics/bleu) or [ROUGE](https://huggingface.co/metrics/rouge)) or Named Entity Recognition (often evaluated with [seqeval](https://huggingface.co/metrics/seqeval)).
3. *Dataset-specific metrics*, which aim to measure model performance on specific benchmarks: for instance, the [GLUE benchmark](https://huggingface.co/datasets/glue) has a dedicated [evaluation metric](https://huggingface.co/metrics/glue).

Let's look at each of these three cases:

### Generic metrics

Many of the metrics used in the Machine Learning community are quite generic and can be applied in a variety of tasks and datasets.

This is the case for metrics like [accuracy](https://huggingface.co/metrics/accuracy) and [precision](https://huggingface.co/metrics/precision), which can be used for evaluating labeled (supervised) datasets, as well as [perplexity](https://huggingface.co/metrics/perplexity), which can be used for evaluating different kinds of (unsupervised) generative tasks.

To see the input structure of a given metric, you can look at its metric card. For example, in the case of [precision](https://huggingface.co/metrics/precision), the format is:

```
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'precision': 1.0}
```

### Task-specific metrics

Popular ML tasks like Machine Translation and Named Entity Recognition have specific metrics that can be used to compare models.
For example, a series of different metrics have been proposed for text generation, ranging from [BLEU](https://huggingface.co/metrics/bleu) and its derivatives such as [GoogleBLEU](https://huggingface.co/metrics/google_bleu) and [GLEU](https://huggingface.co/metrics/gleu) to [ROUGE](https://huggingface.co/metrics/rouge), [MAUVE](https://huggingface.co/metrics/mauve), etc.

You can find the right metric for your task by:

- **Looking at the [Task pages](https://huggingface.co/tasks)** to see what metrics can be used for evaluating models for a given task.
- **Checking out leaderboards** on sites like [Papers With Code](https://paperswithcode.com/) (you can search by task and by dataset).
- **Reading the metric cards** for the relevant metrics and seeing which ones are a good fit for your use case. For example, see the [BLEU metric card](https://github.com/huggingface/evaluate/tree/main/metrics/bleu) or [SQuAD metric card](https://github.com/huggingface/evaluate/tree/main/metrics/squad).
- **Looking at papers and blog posts** published on the topic and seeing what metrics they report. This can change over time, so try to pick papers from the last couple of years!

### Dataset-specific metrics

Some datasets have specific metrics associated with them. This is especially the case for popular benchmarks like [GLUE](https://huggingface.co/metrics/glue) and [SQuAD](https://huggingface.co/metrics/squad).

💡 GLUE is actually a collection of different subsets on different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as “crowdsourced collection of sentence pairs with textual entailment annotations”.

If you are evaluating your model on a benchmark dataset like the ones mentioned above, you can use its dedicated evaluation metric. Make sure you respect the format that they require.
For example, to evaluate your model on the [SQuAD](https://huggingface.co/datasets/squad) dataset, you need to feed the `question` and `context` into your model and return the `prediction_text`, which is then compared with the `references` (based on matching the `id` of the question):

```py
>>> from evaluate import load
>>> squad_metric = load("squad")
>>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
>>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
>>> results = squad_metric.compute(predictions=predictions, references=references)
>>> results
{'exact_match': 100.0, 'f1': 100.0}
```

You can find examples of dataset structures by consulting the "Dataset Preview" function or the dataset card for a given dataset, and you can see how to use its dedicated evaluation function based on the metric card.

# Using the `evaluator` with custom pipelines

The evaluator is designed to work with `transformers` pipelines out-of-the-box. However, in many cases you might have a model or pipeline that's not part of the `transformers` ecosystem. You can still use `evaluator` to easily compute metrics for them. In this guide we show how to do this for a Scikit-Learn [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) and a spaCy [pipeline](https://spacy.io). Let's start with the Scikit-Learn case.

## Scikit-Learn

First we need to train a model.
We'll train a simple text classifier on the [IMDb dataset](https://huggingface.co/datasets/imdb), so let's start by downloading the dataset:

```py
from datasets import load_dataset

ds = load_dataset("imdb")
```

Then we can build a simple TF-IDF preprocessor and Naive Bayes classifier wrapped in a `Pipeline`:

```py
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(ds["train"]["text"], ds["train"]["label"])
```

Following the convention in the `TextClassificationPipeline` of `transformers`, our pipeline should be callable and return a list of dictionaries. In addition, the `evaluator` uses the `task` attribute to check whether the pipeline is compatible with it. We can write a small wrapper class for that purpose:

```py
class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        # The evaluator checks this attribute for compatibility.
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        # One dictionary per input text, holding the predicted label.
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

pipe = ScikitEvalPipeline(text_clf)
```

We can now pass this `pipeline` to the `evaluator`:

```py
from evaluate import evaluator

task_evaluator = evaluator("text-classification")
# Keyword arguments avoid depending on the positional order of `compute`.
task_evaluator.compute(model_or_pipeline=pipe, data=ds["test"], metric="accuracy")
# {'accuracy': 0.82956}
```

Implementing that simple wrapper is all that's needed to use any model from any framework with the `evaluator`. In the `__call__` you can implement all logic necessary for efficient forward passes through your model.

## spaCy

We'll use the `polarity` feature of the `spacytextblob` project to get a simple sentiment analyzer.
First you'll need to install the project and download the resources:

```bash
pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm
```

Then we can simply load the `nlp` pipeline and add the `spacytextblob` pipeline:

```py
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
```

This snippet shows how we can use the `polarity` feature added with `spacytextblob` to get the sentiment of a text:

```py
texts = ["This movie is horrible", "This movie is awesome"]
results = nlp.pipe(texts)

for txt, res in zip(texts, results):
    print(f"{txt} | Polarity: {res._.blob.polarity}")
```

Now we can wrap it in a simple wrapper class like in the Scikit-Learn example before. It just has to return a list of dictionaries with the predicted labels. If the polarity is greater than or equal to 0 we'll predict positive sentiment, and negative otherwise:

```py
class SpacyEvalPipeline:
    def __init__(self, nlp):
        self.nlp = nlp
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results = []
        for p in self.nlp.pipe(input_texts):
            # Non-negative polarity maps to the positive class (1).
            if p._.blob.polarity >= 0:
                results.append({"label": 1})
            else:
                results.append({"label": 0})
        return results

pipe = SpacyEvalPipeline(nlp)
```

That class is compatible with the `evaluator`, and we can reuse the `task_evaluator` instance from the previous example along with the IMDb test set:

```py
task_evaluator.compute(model_or_pipeline=pipe, data=ds["test"], metric="accuracy")
# {'accuracy': 0.6914}
```

This will take a little longer than the Scikit-Learn example, but after roughly 10-15 minutes you will have the evaluation results!

# Installation

Before you start, you will need to set up your environment and install the appropriate packages. 🤗 Evaluate is tested on **Python 3.7+**.

## Virtual environment

You should install 🤗 Evaluate in a [virtual environment](https://docs.python.org/3/library/venv.html) to keep everything neat and tidy.

1. Create and navigate to your project directory:

   ```bash
   mkdir ~/my-project
   cd ~/my-project
   ```

2.
   Start a virtual environment inside the directory:

   ```bash
   python -m venv .env
   ```

3. Activate and deactivate the virtual environment with the following commands:

   ```bash
   # Activate the virtual environment
   source .env/bin/activate

   # Deactivate the virtual environment
   deactivate
   ```

Once you have created your virtual environment, you can install 🤗 Evaluate in it.

## pip

The most straightforward way to install 🤗 Evaluate is with pip:

```bash
pip install evaluate
```

Run the following command to check if 🤗 Evaluate has been properly installed:

```bash
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```

This should return:

```bash
{'exact_match': 1.0}
```

## source

Building 🤗 Evaluate from source lets you make changes to the code base. To install from source, clone the repository and install with the following commands:

```bash
git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .
```

Again, you can check if 🤗 Evaluate has been properly installed with:

```bash
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```
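The check above loads a module, which may need network access. As a lighter complement, the following minimal sketch only confirms that the package is present in the active environment and reports its version. It assumes Python 3.8+ (for the standard-library `importlib.metadata`), and the helper name `installed_version` is hypothetical, not part of 🤗 Evaluate:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "evaluate" is the distribution name used in `pip install evaluate`.
version = installed_version("evaluate")
if version is None:
    print("evaluate is not installed in this environment")
else:
    print(f"evaluate {version} is installed")
```

This only inspects package metadata, so it does not confirm that a metric can actually be loaded and computed; the `exact_match` command above remains the end-to-end check.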