Benchmarking Large Language Models on the Italian Language

In a world increasingly dominated by English-centric AI development, benchmarks have often prioritized English as a default language or relied on translations, even for Large Language Models (LLMs) used across languages. To address this gap, Martin Cimmino, Paolo Albano, Michele Resta, Marco Madeddu, Viviana Patti, and Roberto Zanoli introduced Evalita-LLM, the most comprehensive benchmark yet for assessing Italian LLMs, built entirely on native Italian data to ensure linguistic fidelity and contextual accuracy.
Spanning 10 NLP tasks – including entity recognition, textual entailment, summarization, and sentiment analysis – Evalita-LLM reflects real-world applications and provides a more nuanced, fairer, and more objective appraisal of model performance. It also tests robustness to prompt variation, shedding light on how outputs can differ depending on how a prompt is formulated.
Essentially, three features set Evalita-LLM apart from other benchmarks:
- All tasks are in native Italian, removing translation biases and discrepancies.
- It includes both classification and generative tasks, allowing for more natural and complex interactions with LLMs.
- Each task is tested with multiple prompts, so the evaluation does not hinge on any single phrasing, enabling fairer comparisons (see the sketch after this list).
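To make the multi-prompt idea concrete, here is a minimal sketch, assuming a toy sentiment task: each prompt template is scored separately and the results are aggregated, so no single phrasing decides the outcome. The templates, the `query_model` stand-in, and the scoring loop are hypothetical illustrations, not Evalita-LLM's actual prompts or evaluation harness.

```python
from statistics import mean

# Hypothetical Italian sentiment prompts; not Evalita-LLM's actual templates.
PROMPT_TEMPLATES = [
    "Qual è il sentimento del testo seguente? Rispondi 'positivo' o 'negativo'.\nTesto: {text}\nRisposta:",
    "Testo: {text}\nIl sentimento espresso è positivo o negativo?\nRisposta:",
    "Classifica il sentimento ('positivo'/'negativo') del testo seguente.\n{text}\nSentimento:",
]

def query_model(prompt: str) -> str:
    # Stand-in for a call to the LLM under evaluation; replace with a real
    # inference call in practice. Here: a trivial keyword heuristic so the
    # sketch runs end to end.
    return "positivo" if "ottimo" in prompt else "negativo"

def accuracy_for_template(template: str, dataset: list[dict]) -> float:
    # Accuracy of a single prompt template over a labelled dataset.
    correct = 0
    for example in dataset:
        prediction = query_model(template.format(text=example["text"])).strip().lower()
        correct += prediction == example["label"]
    return correct / len(dataset)

def evaluate_task(dataset: list[dict]) -> dict:
    # Score every template, then report per-template and aggregate accuracy,
    # rather than ranking models on one arbitrary prompt.
    scores = [accuracy_for_template(t, dataset) for t in PROMPT_TEMPLATES]
    return {"per_prompt": scores, "mean": mean(scores), "best": max(scores)}

if __name__ == "__main__":
    toy_dataset = [
        {"text": "Un film ottimo, lo consiglio.", "label": "positivo"},
        {"text": "Servizio lento e scortese.", "label": "negativo"},
    ]
    print(evaluate_task(toy_dataset))
```

Aggregating over prompts (mean, best, or both) is one reasonable design choice; the key point is that per-prompt scores are kept visible, so prompt sensitivity is measured rather than hidden.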
All tasks are sourced and adapted from past Evalita campaigns, and the benchmark is endorsed by national institutions: the Italian Association for Computational Linguistics (AILC), the Italian Association for Artificial Intelligence (AI*IA), and the Italian Association for Speech Sciences (AISV).
So far, 22 models have been evaluated with Evalita-LLM, measuring performance in both zero-shot and few-shot settings. The evaluation reached the following conclusions:
- Model performance and ranking vary depending on the prompt, demonstrating that a single-prompt evaluation may be misleading or unfair.
- Generative tasks are computationally more intensive than multiple-choice tasks, with implications for benchmark design and task selection.
- Generative task output is more likely to be correctly parsed and evaluated when the prompts are carefully drafted.
- Few-shot prompting significantly improves performance, especially for complex tasks such as Named Entity Recognition and Relation Extraction (a zero-shot vs. few-shot prompt sketch follows this list).
- A larger model size doesn't necessarily guarantee higher accuracy, nor does pretraining specifically on the language of the test data.
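As an illustration of the zero-shot vs. few-shot distinction, here is a minimal sketch, assuming a toy entity-extraction task: the few-shot prompt simply prepends a handful of labelled demonstrations to the same instruction. The instruction wording, output format, and demonstrations are hypothetical, not Evalita-LLM's actual prompts.

```python
# Hypothetical instruction and demonstrations for an entity-extraction task.
INSTRUCTION = (
    "Estrai le entità di tipo PERSONA e LUOGO dal testo seguente, "
    "nel formato ENTITÀ -> TIPO."
)

# A handful of labelled demonstrations, e.g. drawn from a training split.
DEMONSTRATIONS = [
    ("Dante Alighieri nacque a Firenze.",
     "Dante Alighieri -> PERSONA\nFirenze -> LUOGO"),
    ("Maria vive a Torino da dieci anni.",
     "Maria -> PERSONA\nTorino -> LUOGO"),
]

def zero_shot_prompt(text: str) -> str:
    # Instruction plus the input text only.
    return f"{INSTRUCTION}\n\nTesto: {text}\nEntità:"

def few_shot_prompt(text: str, k: int = 2) -> str:
    # Instruction plus k worked examples before the input text.
    shots = "\n\n".join(
        f"Testo: {src}\nEntità:\n{tgt}" for src, tgt in DEMONSTRATIONS[:k]
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nTesto: {text}\nEntità:"

if __name__ == "__main__":
    print(few_shot_prompt("Giulia ha incontrato Marco a Bologna."))
```

The worked examples give the model a concrete output format to imitate, which is one plausible reason few-shot prompting helps most on structured-output tasks like entity and relation extraction.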
The paper shows how Evalita-LLM offers businesses and policymakers a way to measure how models behave in real-world, domain-sensitive contexts, ensuring AI systems are not only powerful but also reliable, unbiased, and locally relevant.