This is what drives the collaboration between Domyn and NVIDIA within the EU Sovereign AI initiative. Domyn-Large is a 263B-parameter reasoning model that achieves state-of-the-art results on enterprise-critical benchmarks — Text-to-SQL, Text-to-Cypher, knowledge graph extraction, and safety classification — while maintaining competitive performance on standard academic evaluations and multilingual tasks in key European languages. It follows Italia-10B and Colosseum-355B, all available as NVIDIA NIM, alongside a new wave of agentic AI systems optimized for Italian, French, German, and Spanish, aligned with the European AI Act.
In this blog, we detail how this collaboration establishes a practical blueprint for sovereign, enterprise-ready AI in Europe. We show how Domyn-Large was trained, combining Domyn’s deep expertise in regulated industries with NVIDIA’s foundational AI stack — specifically leveraging the NVIDIA NeMo framework on DGX Cloud and NVIDIA Nemotron techniques.
The collaboration between Domyn and NVIDIA marks a significant milestone in advancing the EU Sovereign AI initiative, dedicated to developing large language models (LLMs) that drive national capability, foster innovation, and ensure technology sovereignty across Italy and the wider EMEA region. Sovereign AI efforts ensure that nations can shape and deploy AI systems securely, responsibly, and in alignment with local priorities and societal values.
This project demonstrates how international collaboration can accelerate open, transparent, and accessible AI research while strengthening regional resilience. The team has already released Italia-10B and Colosseum-355B, LLMs designed for highly regulated environments, both available as NVIDIA NIM microservices. These efforts paved the way for Domyn-Large, a 263B-parameter reasoning model built for the enterprise tasks that general-purpose LLMs underserve — Text-to-SQL, Text-to-Cypher, knowledge graph extraction, and safety classification — while maintaining competitive performance on academic evaluations and multilingual tasks in key European languages.
Domyn-Large illustrates a core principle of Domyn’s approach: the ability to intervene at any stage of the training lifecycle — pre-training, mid-training, or post-training — applying each selectively to produce targeted capabilities without retraining from scratch, in service of delivering domain-specific AI for regulated industries operating under strict data sovereignty requirements. Alongside it, Domyn is developing a new wave of agentic AI systems optimized for Italian, French, German, and Spanish, all aligned with the European AI Act.
This collaboration enables knowledge exchange between Domyn and NVIDIA, helping the EU AI ecosystem move faster from research to real-world solutions. In this blog, you will find a detailed look at how Domyn models are trained, following a similar methodology to the NVIDIA Nemotron model family, and how these models will help shape the next generation of sovereign enterprise-ready AI for Europe.
Motivation and Design Goals
The initiative is driven by the growing need for sovereign AI models in Europe, a priority that NVIDIA CEO Jensen Huang highlighted at GTC Paris 2025 and reiterated during his GTC 2026 keynote. His announcement last year at GTC Paris underscored that for European model builders, having regionally developed and managed AI infrastructure is essential for data control, privacy, and cultural alignment. This gives enterprises, governments, and researchers the ability to serve local needs without relying on global platforms. The initiative especially empowers communities by providing AI tools optimized for their languages, laws, and standards, reflecting Italy’s and broader Europe’s diversity and creativity.
As a cross-institutional team, the collaborators from NVIDIA and Domyn have worked together to define a set of design goals that shape the direction and ambition of the project:
- Deliver a state-of-the-art reasoning model based on Colosseum-355B (see our previous technical blog for details), demonstrating a sustainable method to improve model intelligence without retraining from scratch.
- Achieve state-of-the-art performance on enterprise-critical tasks — Text-to-SQL, Text-to-Cypher, knowledge graph extraction, and safety classification — that directly serve regulated-industry use cases.
- Measure success on multilingual benchmarks that reflect real-world use cases in English and other European languages.
- Compress the model for efficient inference while preserving competitive performance.
- Add model intelligence via continual pretraining for extended context length and supervised fine-tuning on distilled reasoning traces.
This approach combines state-of-the-art technology with practical deployment, establishing a foundation for inclusive, regionally relevant AI that can quickly adapt as new needs and success stories emerge.
Base Model Architecture and Training Strategy
We opted to start from an existing model rather than train from scratch. This choice avoids rerunning a costly (both environmentally and computationally) end‑to‑end pretraining pipeline and lets us focus compute resources on adding new capabilities instead of relearning what the model already knows. However, continuing pretraining on an instruction‑tuned model requires careful control of the data blend and training hyperparameters to avoid eroding its instruction‑following behavior.
Base Model: Colosseum-355B
Colosseum‑355B is a large language model designed for use cases in regulated industries such as financial services, government, and heavy industry. It supports a 256k token vocabulary across multiple languages, handles multilingual single‑turn and multi‑turn chat, and offers a context length of up to 16,384 tokens.
The model was originally pretrained on NVIDIA DGX Cloud using 11 trillion tokens spanning English, more than 50 other natural languages, and a variety of programming languages. Domyn then performed FP8 continued pretraining (CPT) on an additional 2 trillion tokens using a dataset that preserved the original data distribution. After CPT, Colosseum‑355B underwent alignment through supervised fine‑tuning (SFT) and direct preference optimization (DPO). The resulting model is not a reasoning model, as it does not employ additional test-time compute during inference. For an in-depth overview of the training, please refer to our previous blog post.
Project Execution
Recent LLMs can handle 128k token sequences and generate reasoning traces before answering. In this project, we enabled these features in Colosseum-355B and compressed the model to improve training and inference efficiency, following a recipe similar to NVIDIA Llama-Nemotron-Ultra.
Our training pipeline was organized into several stages:
- Model compression, including pruning and distillation.
- Continued pretraining (CPT).
- Extended‑context CPT to unlock 128k‑token sequences.
- Supervised fine‑tuning (SFT).
All stages ran on NVIDIA DGX Cloud using the NVIDIA NeMo framework, a scalable generative AI platform optimized for large‑scale model training on NVIDIA hardware. We dive into the details of each stage in the following sections.
Model Compression Strategy
To effectively support real-world use cases, our first crucial step was model compression. This process is essential for achieving higher throughput and lower latency in production environments across various tasks. NVIDIA has published several Nemotron techniques on this subject. For example, the Minitron approach and the Puzzle neural architecture search are two popular strategies that developers can run using the NVIDIA Model Optimizer library.
In this work, we used the Minitron approach, which consists of pruning the model to reduce its size followed by logits-based distillation to recover the original model capabilities. In practical terms, we reduced the model from 355B to about 260B parameters while keeping accuracy within a few percentage points on key benchmarks.
Model compression is crucial because it reduces the model’s FLOPs, not only during training but, more importantly, at inference time, making model serving in production significantly more efficient, in particular for agentic AI applications.
Exploration Stage
The Minitron approach supports two primary pruning dimensions: depth and width. Depth pruning removes entire layers (for example a full Transformer layer including both attention and feedforward components), while width pruning reduces the model’s hidden and intermediate dimensions.
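As a rough illustration of these two pruning dimensions (a simplified sketch, not the actual NVIDIA Model Optimizer implementation), depth pruning drops whole layers while width pruning keeps only the highest-importance output channels of a weight matrix; the helper names and the importance scores below are illustrative:

```python
def depth_prune(layers, keep_idx):
    """Depth pruning: drop whole transformer layers, keeping only keep_idx."""
    return [layers[i] for i in keep_idx]

def width_prune(rows, importance, keep_n):
    """Width pruning: keep the keep_n output channels (rows of a weight
    matrix) with the highest importance score, e.g. mean activation
    magnitude, preserving their original order."""
    keep = sorted(sorted(range(len(rows)), key=lambda i: -importance[i])[:keep_n])
    return [rows[i] for i in keep]
```

In practice, the same channel selection must be applied consistently to every matrix that shares the pruned dimension (e.g. both the up- and down-projections of a feedforward block), which is what a library like Model Optimizer handles for you.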
Our base model contained 355B parameters, with a hidden size of 18,432, an intermediate size of 73,728, and a depth of 100 layers. As an initial experiment, we pruned the model to a hidden size of 17,420, an intermediate size of 69,952, and 80 layers, yielding a 260B-parameter model. This model was then distilled against the base model using a teacher–student setup for 1,000 training steps. This configuration led to a substantial degradation in downstream performance, with an average drop of approximately 15 percentage points.
We subsequently evaluated several pruning configurations and found that a combination of 10% depth pruning and 10% width pruning struck a favorable balance. This resulted in a hidden size of 16,448, an intermediate size of 65,920, and 90 layers. Under this configuration, the average downstream performance drop after distillation was limited to 3.2 percentage points. More aggressive pruning along either dimension led to steep performance degradation that distillation could not effectively recover.
We also explored iterative pruning: two lighter pruning-and-distillation cycles applied sequentially to reach the same overall 10% depth and 10% width reduction. However, this approach did not yield measurable improvements over a single pruning step.
During distillation, we experimented with weighted averaging of the distillation loss and the language modeling loss. While an equal weighting performed reasonably well, assigning a weight of 0.8 to the language modeling loss produced the best results.
Based on the exploration results, we selected the following final configuration: hidden size 16,448, intermediate size 65,920, 90 layers, and a language modeling loss weight of 0.8.
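The combined objective can be sketched for a single next-token prediction as follows; the 0.8/0.2 weighting mirrors the configuration above, while the function names and the scalar (non-batched) formulation are simplifications for illustration:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def distill_loss(student_logits, teacher_logits, target, lm_weight=0.8):
    """Weighted sum of the language-modeling loss (cross-entropy on the
    ground-truth token) and the logits-based distillation loss
    (KL divergence from the teacher's token distribution)."""
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    lm = -ls[target]                                         # cross-entropy
    kd = sum(math.exp(t) * (t - s) for t, s in zip(lt, ls))  # KL(teacher || student)
    return lm_weight * lm + (1.0 - lm_weight) * kd
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted language-modeling loss remains, which is why the weighting controls how strongly the student is anchored to the ground-truth data versus the teacher's distribution.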
Final Pruning Run
For the final teacher-student distillation, we doubled the training duration to 2,000 steps, using a global batch size of 64 and a context length of 4,096. This corresponds to approximately 0.5B training tokens. The training loss curve, in Figure 2, shows sharply diminishing returns after 1,000 steps and no substantial gains beyond 1,500 steps. The resulting checkpoint contains approximately 260B parameters, representing a significant reduction from the original 355B-parameter model.
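The quoted token budget follows directly from the run configuration:

```python
# Sanity check of the ~0.5B-token distillation budget: steps x global batch
# size x context length.
steps, global_batch, context_len = 2_000, 64, 4_096
tokens = steps * global_batch * context_len
print(tokens)  # 524,288,000 tokens, i.e. ~0.52B
```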
Despite this reduction, the average downstream performance drop was limited to just 2.3 percentage points. Multilingual and coding benchmarks were the most negatively affected, while multiple-choice benchmarks—such as MMLU—were the least impacted and even showed improvements in some categories.
This checkpoint was subsequently staged for continual pretraining in the next phase.
Inference Efficiency
The pruning stage improved model latency and throughput across tasks while preserving the model’s baseline performance on English benchmarks. The gains benefit the training phase and, most importantly, the future inference and production phases of Domyn’s models.
We focus on two representative tasks, translation and summarization. Translation maps between two languages and typically has a similar number of input and output tokens (e.g. 1k in, 1k out). Summarization, by contrast, must condense a long text into a concise summary, so input tokens are much higher than output tokens (e.g. 1k in, 128 out).
For these tasks, we analyze four key metrics:
- Throughput (tokens/sec), how many tokens the system can generate per second across all requests being served
- Concurrency, how many requests are being processed at the same time (e.g., 8, 64, 128, 256 parallel calls)
- Time to first token, the delay between sending a request and receiving the first generated token back from the model
- End-to-End Latency, the total time it takes to complete a request, from submission to the last token being generated
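Given per-request timestamps, these four metrics can be computed with a few lines of arithmetic; this is a minimal sketch of the definitions above (real benchmarking harnesses handle warmup, percentiles, and streaming details), with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Request:
    submit_t: float       # when the request was sent (seconds)
    first_token_t: float  # when the first generated token arrived
    done_t: float         # when the last token arrived
    out_tokens: int       # number of generated tokens

def serving_metrics(reqs):
    """Aggregate throughput, mean TTFT, mean end-to-end latency, and
    concurrency for a batch of requests served in parallel."""
    span = max(r.done_t for r in reqs) - min(r.submit_t for r in reqs)
    return {
        "throughput_tok_s": sum(r.out_tokens for r in reqs) / span,
        "mean_ttft_s": sum(r.first_token_t - r.submit_t for r in reqs) / len(reqs),
        "mean_e2e_s": sum(r.done_t - r.submit_t for r in reqs) / len(reqs),
        "concurrency": len(reqs),
    }
```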
Across these metrics, presented in Figure 3, Domyn-Large reaches higher throughput with lower time‑to‑first‑token, and its throughput saturates earlier, meaning you extract more tokens per second without pushing the hardware to extreme parallelism. At higher concurrency, Colosseum-355B’s latency grows rapidly while Domyn-Large’s grows more moderately, so the compressed model supports heavier multi‑tenant loads with less tail‑latency degradation—an important property for production deployments where many users hit the service at once. When plotting throughput against total time (Figure 3a) and time‑to‑first‑token (Figure 3b), Domyn-Large curves sit above and to the left of Colosseum’s, which visually reinforces that compression pushes the performance–latency frontier outward: you can produce more tokens at lower perceived latency.
Continual Pretraining and Context Length Extension
After the pruning and distillation stage, the model underwent a phase of continual pretraining on a large corpus of diverse text data. This phase had two goals: strengthen domain‑specific language understanding and extend the usable context window to 64k tokens and beyond (up to 128k during evaluation).
To achieve the first goal, we curated a dataset composed of high-quality text from various sources, including scientific articles, multilingual sources, and publicly available source code. This diverse dataset helped the model internalize domain-specific terminology and concepts, improving performance on tasks in these areas.
To extend the context length, we adopted a curriculum learning strategy: starting at 16k tokens and progressively increasing the context window to 32k and 64k tokens. This approach allowed the model to gradually adapt to longer contexts without sacrificing performance on shorter ones.
We ran this phase of training with a global batch size varying between 64 and 128, depending on the context length employed, using the AdamW optimizer with a learning rate between 1e-7 and 1e-6 and a cosine annealing schedule. RoPE theta (the base frequency of the rotary positional embeddings) was also adjusted according to the context length, starting from 10k for 16k tokens up to 1M for 64k tokens, enabling the model to handle the increased context window.
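The intuition behind raising RoPE theta can be sketched numerically: each rotary pair rotates at a frequency derived from theta, and a larger theta lowers the slowest frequencies, stretching the longest positional "wavelength" so that a longer context still maps to distinguishable rotation angles. A minimal illustration (standard RoPE formula; the helper names are ours):

```python
import math

def rope_frequencies(head_dim, theta):
    """Per-pair rotation frequencies of rotary positional embeddings:
    freq_i = theta ** (-2i / head_dim) for i in [0, head_dim // 2)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def max_wavelength(head_dim, theta):
    """The longest rotary wavelength, which roughly bounds the range of
    positions the embedding can distinguish without aliasing."""
    return 2 * math.pi / min(rope_frequencies(head_dim, theta))
```

Increasing theta from 10k to 1M pushes this maximum wavelength far beyond the 16k-token native range, which is why the adjustment accompanies each step of the context-length curriculum.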
The continual pretraining phase, run with the NeMo framework, covered approximately 50B tokens. The model was evaluated periodically on a set of validation tasks to monitor its performance and ensure effective learning from the new data. As shown in the loss curves (Figure 4), continual pretraining improved model capabilities: losses decreased consistently over training steps on domain-specific and multilingual tasks with 32k token contexts. This phase was crucial in preparing the model for real-world applications where understanding complex, domain-specific language and managing long documents are essential.
We further benchmarked the model's long-context capabilities by comparing the original Colosseum model, the compressed (pre-CPT) Domyn-Large 263B, and the post-CPT Domyn-Large 263B model using a context length of 64K. For this evaluation, we used different versions of the RULER benchmark, a synthetic benchmark designed to measure how well models handle long-context tasks beyond simple retrieval. Evaluations beyond each model's native context length (16K tokens for Colosseum and the compressed Domyn-Large, 64K tokens for Domyn-Large after CPT) were performed using YaRN, an efficient method for extending RoPE context windows.
The results, in Figure 5, show a clear trajectory. Immediately after pruning and distillation, Domyn-Large experienced a substantial performance decline, especially when evaluated beyond its native 16K context length. However, the subsequent continual pretraining phase proved highly effective: it not only restored the model's ability to manage longer contexts but significantly enhanced it. On RULER, the post-CPT Domyn-Large model achieves state-of-the-art results at 64K and 128K tokens, surpassing the performance of the starting model by a significant margin.
Model Alignment
Following the continual pretraining phase, the model underwent a supervised fine-tuning (SFT) phase using the NeMo framework to improve its ability to follow instructions, reason, and generate aligned responses. We fine-tuned the model using a highly curated dataset, built from Domyn’s experience aligning earlier, smaller models. The team balanced the token count between different tasks and domains, ensuring that the model was exposed to a wide variety of instructions and contexts.
The SFT phase was conducted with a global batch size of 128, using the AdamW optimizer and a learning rate between 1e-7 and 1e-6 with a cosine annealing schedule. Training ran for more than 20k steps, covering 2.5M samples. The context length during this phase was 32k tokens, which we found to be a good compromise between performance and computational efficiency.
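A cosine annealing schedule over these bounds can be written in a few lines; this sketch omits the warmup phase such schedules typically include, and simply decays the rate from the upper to the lower bound quoted above:

```python
import math

def cosine_lr(step, total_steps, lr_min=1e-7, lr_max=1e-6):
    """Cosine-annealed learning rate: starts at lr_max, decays smoothly
    to lr_min over total_steps following half a cosine period."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos
```

At step 0 this yields 1e-6, at the final step 1e-7, and the decay is slowest at both ends of training, which tends to stabilize both the initial adaptation and the final convergence.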
We periodically evaluated the model on a set of validation tasks to ensure it learned to follow instructions and provide aligned responses. During SFT, the validation loss decreased consistently over time, indicating that the model was successfully learning from the fine-tuning dataset. Figure 6 focuses on four validation splits used in this phase: scientific Q&A, multilingual Q&A, math problem solving, and general instruction following.
Training Data and Evaluation Strategy
The development of LLMs for different EU languages, such as Italian, is constrained by the availability and quality of suitable data and evaluation benchmarks, creating a gap between the performance of English and non-English LLMs. Models such as Llama-Nemotron achieve state-of-the-art performance in English and other high-resource languages, but their generalization and downstream task performance in low-resource languages is limited by domain, cultural, and linguistic gaps in the training data. To bridge these gaps, we focused on targeted data curation and adaptation during continued pretraining (CPT), context extension, and supervised fine-tuning (SFT).
Model evaluation is key to our project. Quantitative metrics allow us to validate that post-training procedures, data selection, and training methods actually move us towards our goals. Most importantly, evaluation provides objective evidence that we meet the project's primary requirement: increasing Colosseum-355B capabilities by enhancing reasoning while improving model efficiency. We aim to boost performance in target European languages and demonstrate improvements in general knowledge, math, code, and function calling as proxies for real-world use-case support.
We first describe the training data strategy for each stage of the pipeline, then discuss evaluation and performance.
Training Data Strategy
CPT adapts a model to a new linguistic distribution by exposing it to large volumes of text representative of the target languages and domains. SFT focuses on aligning the model behavior with preferences, task instructions, and domain-specific requirements. The effectiveness of CPT and SFT depends on the quality, diversity, and domain coverage of the data.
For CPT, we enriched the model’s internal representations so it could better capture language-specific morphology, syntax, semantics, and idiomatic patterns that may be underrepresented in English-centric datasets. The CPT data mix can be thought of as several broad streams, composed of datasets cleaned, filtered, deduplicated, and divided into categories:
- Web-scale crawl content (e.g. DCLM and Dolma)
- Code (The Stack V2)
- SFT-style instruction data, including math (nemotron-cc-math)
- Academic text (e.g. ArXiv and peS2o)
- Wikipedia
- Multilingual slice based on 56 languages extracted from FineWeb2-HQ
Together, these sources contributed roughly 3T tokens of initially available material. Instead of treating every source as a first-class citizen, we organized the corpus at the category level and sampled according to a predefined blend, so training followed a deliberate domain mixture rather than inheriting raw data proportions. The actual CPT run mirrored this sampling distribution, summarized in Figure 7.
Within the multilingual stream, we also made the weighting strategy explicit. Rather than spreading probability mass uniformly across all languages, we prioritized European coverage by splitting languages into four tiers (A, B, C, and an “Others” bucket) and assigning more weight to higher tiers. In particular, Tier A languages (Spanish, Italian, French, German) received higher sampling probability, so the model saw these languages more often during training.
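The tiered weighting described above can be turned into a concrete sampling distribution as follows; the tier weights and language assignments below are hypothetical illustrations (the post does not publish the actual values), chosen only to show the mechanics:

```python
# Hypothetical tier weights and language-to-tier assignments for illustration;
# the actual values used in training are not published in this post.
TIER_WEIGHT = {"A": 4.0, "B": 2.0, "C": 1.0, "Others": 0.5}
LANG_TIER = {"es": "A", "it": "A", "fr": "A", "de": "A",
             "pl": "B", "nl": "B", "pt": "C", "sw": "Others"}

def sampling_probs(lang_tier, tier_weight):
    """Turn per-language tier weights into a normalized sampling
    distribution over languages for the multilingual data stream."""
    w = {lang: tier_weight[t] for lang, t in lang_tier.items()}
    total = sum(w.values())
    return {lang: v / total for lang, v in w.items()}
```

Under this scheme, every Tier A language is sampled with the same probability, and that probability exceeds what the same language would receive under raw corpus proportions if it is under-represented on the web.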
For SFT, the data aimed to capture both linguistic fidelity and task consistency, ensuring the model understood the language and the tasks (reasoning, Q&A, tool use, and so on). While CPT relied on a large, diverse corpus, SFT used a more specific and curated mix designed to teach the model how to follow instructions and provide accurate responses across domains. The SFT data mix included:
- Safety, datasets designed to teach the model to provide safe, aligned responses, and avoid harmful or inappropriate content.
- Long Context, datasets that leverage the model's ability to handle long inputs, including long-form Q&A and document summarization tasks.
- Chat, conversational datasets to improve natural and coherent dialogue.
- Instruction Following, datasets focused on explicit task instructions.
- Multilingual, datasets in multiple languages to strengthen multilingual capabilities.
- Code, datasets containing source code to improve programming and code reasoning skills.
- Math, datasets with mathematical problems and solutions to enhance problem-solving abilities.
- Function calling, datasets that teach the model when and how to use external tools and APIs, an important skill for many real-world applications, including deployment on Domyn's AI platform.
- STEM, datasets centered on science, technology, engineering, and mathematics topics.
The relative contribution of each source to the SFT phase is illustrated in Figure 9.
Among the datasets used during SFT, we highlight popular NVIDIA datasets:
- AceReason-1.1-SFT, a diverse, high-quality SFT dataset focused on math and code reasoning.
- Daring-Anteater, a dataset covering diverse scenarios for broad instruction-following capabilities.
- Nemotron-Post-Training-Dataset-v2, a comprehensive post-training dataset including multilingual data for Spanish, French, German, and Italian.
- OpenCodeReasoning-2, a dataset aimed at improving code understanding, code generation, and reasoning in programming contexts.
- When2Call, a dataset to teach the model when to call tools, ask follow-ups, acknowledge limits, and handle cases where tools aren't available.
Evaluation
For evaluation, we primarily used the NVIDIA NeMo Evaluator library from the NeMo framework, which supports a wide range of benchmarks. We compared Domyn-Large with Llama-3.1-Nemotron-Ultra-253B-v1 because they share similar sizes and origins, both undergoing model compression, CPT, and post-training stages designed to enhance reasoning and non-reasoning capabilities.
For this comparison, we selected:
- Multilingual variants of MMLU to assess performance in target European languages.
- MMLU-PRO for general knowledge.
- GPQA-D for graduate-level science reasoning and AIME 25 for math.
- IF-EVAL for instruction following.
- LiveCodeBench for coding.
The bar plot in Figure 10 shows results with “thinking on” (reasoning enabled). Domyn-Large and Llama Nemotron Ultra were evaluated with a maximum sequence length of 32k using default OpenCompass prompts and settings. Domyn-Large is competitive with Llama Nemotron Ultra on multilingual MMLU, shows gains on MMLU PRO, and narrows or closes the gap compared to Colosseum-355B on IF-EVAL and LiveCodeBench.
We also evaluated function-calling performance using the Berkeley Function Calling Leaderboard v4 (BFCL) in "thinking off" mode. Domyn-Large achieves a score of 92.44 on BFCL Non-Live and 81.87 on BFCL Live, indicating strong tool-use capabilities. Finally, we tested Domyn-Large for safety using the SafetyBench framework. With reasoning enabled, the model reaches an average score of 83.17, with similar performance when reasoning is disabled.
Beyond standard academic benchmarks, we evaluated Domyn-Large on enterprise-specific capabilities central to Domyn's regulated-industry use cases: structured query generation (Text2SQL and Text2Cypher), knowledge graph triplet extraction, and safety classification. These evaluations were conducted by Domyn's Financial Services AI team.
Full methodology, evaluation code, and sampled datasets are publicly available on Hugging Face.
The bar plot in Figure 11 summarizes results across all four task types. Domyn-Large leads consistently: it achieves the highest execution accuracy on both Text2SQL and Text2Cypher, the highest Jaccard similarity scores, and the most striking advantage in safety classification—77% recall versus roughly 50% for Qwen3 235B and GPT OSS 120B. On knowledge graph triplet extraction, it scores highest overall and leads on comprehensiveness and relevance while remaining competitive on faithfulness and precision.
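For readers unfamiliar with the Jaccard metric mentioned above, here is a minimal sketch of one plausible way to score extracted (subject, relation, object) triplets against a gold set; the exact metric definition used by Domyn's team is in the published evaluation code:

```python
def jaccard_triplets(pred, gold):
    """Jaccard similarity between predicted and gold triplet sets:
    |intersection| / |union|. Returns 1.0 for a perfect match and 0.0
    for fully disjoint sets."""
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0  # both empty: trivially a perfect match
    return len(p & g) / len(p | g)
```

Note that set-based matching requires exact string equality of triplet elements; real evaluations often add normalization or fuzzy matching before comparison.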
Conclusion
Within the EU Sovereign AI initiative, the collaboration between Domyn and NVIDIA led to the development of a state-of-the-art reasoning model for regulated industries, tailored for agentic AI use cases such as tool calling, and fluent in key EU languages including Italian, German, French, and Spanish. We achieved this by applying a Minitron-style recipe (model compression, continual pretraining, and supervised fine-tuning of an aligned model) on a high-quality, carefully curated corpus to overcome low-resource data limitations while preserving the model's core instruction-following abilities.
The use of the NeMo framework on world-class infrastructure, including NVIDIA DGX Cloud, was critical in ensuring a stable and efficient training process. This project validates a practical blueprint for enhancing model capabilities along targeted directions driven by sovereign AI requirements. While we demonstrate this approach through low-resource language adaptation and long-document reasoning, the broader contribution lies in establishing the foundations—methodological, data-centric, and architectural—required to iteratively develop competitive, sovereign large language models. Performance was systematically evaluated against other open-source models of similar size on widely recognized benchmarks for multilingual knowledge, math, coding, function calling, and safety, showing that Domyn‑Large is competitive with leading models while better serving European, sovereign‑AI use cases. Critically, Domyn-Large achieves state-of-the-art results on enterprise-specific benchmarks, demonstrating that sovereign AI can deliver measurable advantage on the tasks regulated industries depend on most.
Looking ahead, Domyn will apply the learnings from this collaboration to the recently released Nemotron 3 family of LLMs. Nemotron open models will bring not only state-of-the-art capabilities and higher inference throughput, but also a clear model evolution roadmap — giving Domyn's enterprise clients confidence that their deployed models will continue to improve, with each generation inheriting the methodological advances validated in projects like this one.