Natotan: Vision-Language Embedding Model for Multimodal Military Document Retrieval

Racine AI
Natotan — Tactical Embedding Model

Natotan is a domain-adapted vision-language embedding model built for multimodal military and defense document retrieval in English and French. It is built on top of Qwen3-VL-Embedding-2B via LoRA (Low-Rank Adaptation) fine-tuning, with the adapter weights merged into the base model for frictionless deployment.

On a custom benchmark of 5,428 query-document pairs covering NATO and French Armed Forces doctrine publications, Natotan achieves an NDCG@1 of 0.384 (+9.0% vs the base model) and an MRR of 0.618 (+6.8%). It outperforms Google Gemini multimodalembedding@001 by about 230% in NDCG@10 (0.699 vs 0.212).

The model produces 2,048-dimensional embeddings, identical to the base Qwen3-VL-Embedding-2B. It is distributed in safetensors format on Hugging Face and loads in a single line with AutoModel.from_pretrained() — no separate LoRA adapter required.

| Property | Value |
| --- | --- |
| Base model | Qwen3-VL-Embedding-2B |
| Fine-tuning method | LoRA (Low-Rank Adaptation), merged |
| Embedding dimension | 2,048 |
| Languages | French + English |
| Task | Multimodal embedding / document retrieval |
| Format | safetensors |
| Benchmark | 5,428 query-document pairs |
| Categories evaluated | 16 |
| NDCG@1 | 0.384 (+9.0% vs base) |
| MRR | 0.618 (+6.8% vs base) |
| Recall@10 | 0.950 (+4.6% vs base) |

Why build a specialized embedding model for defense?

Generic embedding models — whether open-source or proprietary — consistently fail on defense document corpora. Military vocabulary is technical, multilingual, and mixes text, tactical diagrams, tables, and maps within a single document. A general-purpose model has not seen enough of this content during pre-training to produce reliable semantic representations.

The evidence is particularly striking with Google Gemini multimodalembedding@001. On the Natotan benchmark, Gemini achieves an NDCG@10 of only 0.212 where Natotan reaches 0.699, a 3.3x gap. On French documents, the difference widens further: Gemini drops to 0.132 NDCG@10 versus 0.697 for Natotan, a ratio of 5.3x.

This result is consistent with a commonly reported pattern: proprietary general-purpose models can significantly underperform on specialized domains, particularly outside of English. On this benchmark the open-source base model already outperforms Gemini by a wide margin (NDCG@10 of 0.658 versus 0.212), and LoRA fine-tuning, even with a modest compute budget, widens the lead further.

Practical use cases include document retrieval in military RAG systems, doctrine publication lookup for headquarters staff, and multimodal indexing of tactical manuals that combine text with diagrams and schematics.

How was Natotan built?

Natotan was built in 3 steps from the open-source base model Qwen3-VL-Embedding-2B, a 2-billion-parameter vision-language model published by Alibaba’s Qwen team.

| Step | Description |
| --- | --- |
| 1. LoRA fine-tuning | Domain adaptation on NATO and French Armed Forces military documents via Low-Rank Adaptation |
| 2. Weight merging | Merge of the LoRA adapter into the base model weights |
| 3. Safetensors export | Save the merged model in standard Hugging Face format |

LoRA (Low-Rank Adaptation) works by freezing the base model weights and training only low-rank matrices injected into the attention layers. This approach enables memory-efficient and compute-efficient fine-tuning while preserving the model’s general-purpose capabilities.
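To make the low-rank idea concrete, here is a minimal sketch in plain Python that counts frozen versus trainable parameters for a single adapted layer. The layer size (2,048 x 2,048) and the rank of 16 are illustrative assumptions; the article does not state the LoRA hyperparameters actually used for Natotan.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen_params, trainable_params) for one LoRA-adapted linear layer."""
    frozen = d_out * d_in              # base weight W, kept frozen
    trainable = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return frozen, trainable


# Assumed, illustrative sizes: a square 2048-wide projection with rank 16.
frozen, trainable = lora_param_counts(d_out=2048, d_in=2048, rank=16)
share = trainable / frozen
print(f"frozen={frozen:,} trainable={trainable:,} ({share:.2%} of the layer)")
```

With these assumed sizes, the trainable matrices are a small fraction of the layer, which is where the memory and compute savings come from.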

The training dataset is derived from the NATO & French Armed Forces Military Doctrine Dataset, a corpus of 454 PDF documents totaling 55,034 pages and 2.53 GB of data covering 16 categories of military publications.

After merging, the resulting model is fully self-contained: no separate LoRA adapter to load, no additional dependencies. It works exactly like the base Qwen3-VL-Embedding-2B with the same API.

python3 merge_lora.py \
  --base_model Qwen/Qwen3-VL-Embedding-2B \
  --adapter ./lora_adapters \
  --output_dir ./merged \
  --trust_remote_code
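The merge_lora.py script itself is not shown in this article. As a rough illustration of what merging means numerically, the sketch below folds a low-rank update into a weight matrix, W_merged = W + (alpha / r) * B A, and checks that the merged layer produces the same output as the base layer plus a separate adapter. All matrices and the alpha and r values are toy assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy 3x3 frozen weight with a rank-1 adapter (values are illustrative only).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]   # shape (d_out, r)
A = [[1.0, 2.0, 0.0]]        # shape (r, d_in)
alpha, r = 2.0, 1
scale = alpha / r

x = [[1.0], [2.0], [3.0]]    # one input column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged once into the weights: y = (W + scale * B A) x
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Because both paths give the same output, the merged weights can be saved and served without any adapter-specific code at inference time.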

What are Natotan’s overall retrieval performance results?

Natotan outperforms the base Qwen3-VL-Embedding-2B model on every metric and every cutoff level evaluated. The improvement is strongest at the top of the ranking: NDCG@1 increases from 0.352 to 0.384, a 9.0% gain.

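| Metric | Base | Natotan | Improvement |
| --- | --- | --- | --- |
| NDCG@1 | 0.3524 | 0.3841 | +9.0% |
| NDCG@5 | 0.6362 | 0.6802 | +6.9% |
| NDCG@10 | 0.6575 | 0.6990 | +6.3% |
| Recall@1 | 0.3524 | 0.3841 | +9.0% |
| Recall@5 | 0.8430 | 0.8930 | +5.9% |
| Recall@10 | 0.9079 | 0.9501 | +4.6% |
| MRR | 0.5785 | 0.6179 | +6.8% |
| MAP | 0.5785 | 0.6179 | +6.8% |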

In practical terms, a Recall@5 of 0.893 means the relevant document appears in the top 5 results for 89.3% of queries, compared to 84.3% with the base model. At Recall@10 the figure climbs to 95.0% — the correct document is found within 10 results for nearly all queries.

The improvement in MRR (Mean Reciprocal Rank) from 0.579 to 0.618 corresponds to the harmonic-mean rank of the first relevant result (1/MRR) moving from about 1.73 to about 1.62. Because MRR averages reciprocal ranks, this figure is not the arithmetic mean rank, which is typically higher. In a military RAG system where every rank matters, this is a meaningful gain.
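The metrics in the table can all be derived from one number per query: the rank at which the relevant document was returned. The sketch below assumes exactly one relevant document per query, which is consistent with NDCG@1 equalling Recall@1 and MRR equalling MAP in the table but is not stated explicitly in the article. The ranks are made-up illustrative values.

```python
import math


def recall_at_k(ranks, k):
    """Fraction of queries whose relevant document appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """NDCG@k with a single relevant document: the ideal DCG is 1, so the
    per-query score is 1 / log2(rank + 1) when rank <= k, else 0."""
    return sum(1 / math.log2(r + 1) for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a document that was not retrieved contributes 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)


# 1-based rank of the relevant document per query (None = not retrieved).
ranks = [1, 2, 1, 4, None]

print(recall_at_k(ranks, 1), ndcg_at_k(ranks, 1))  # identical by construction
print(recall_at_k(ranks, 5), mrr(ranks))
print(1 / mrr(ranks))  # harmonic-mean rank, not the arithmetic mean rank
```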

NDCG@5428 (the maximum cutoff matching the full corpus size) reaches 0.710, confirming that the gains are not limited to the top of the ranking but propagate through the entire result list.

How does Natotan compare to Google Gemini?

The comparison with Google Gemini multimodalembedding@001 illustrates the gap between a general-purpose proprietary model and a domain-adapted open-source model. Natotan outperforms Gemini on every metric reported below, with relative gains ranging from +177% (Recall@10) to +315% (NDCG@1).

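| Metric | Gemini | Natotan | Ratio |
| --- | --- | --- | --- |
| NDCG@1 | 0.0925 | 0.3841 | 4.2x |
| NDCG@5 | 0.1880 | 0.6802 | 3.6x |
| NDCG@10 | 0.2118 | 0.6990 | 3.3x |
| Recall@5 | 0.2690 | 0.8930 | 3.3x |
| Recall@10 | 0.3427 | 0.9501 | 2.8x |
| MRR | 0.1823 | 0.6179 | 3.4x |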
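The ratio column can be recomputed directly from the two score columns. The short sketch below does this for the values in the table and also expresses each ratio as a percentage gain; the dictionary layout and the helper name are illustrative choices, not part of any released tooling.

```python
# (Gemini, Natotan) scores copied from the table above.
scores = {
    "NDCG@1": (0.0925, 0.3841),
    "NDCG@5": (0.1880, 0.6802),
    "NDCG@10": (0.2118, 0.6990),
    "Recall@5": (0.2690, 0.8930),
    "Recall@10": (0.3427, 0.9501),
    "MRR": (0.1823, 0.6179),
}


def ratio_and_gain(baseline: float, candidate: float) -> tuple[float, float]:
    """Return (candidate / baseline, relative gain in percent)."""
    ratio = candidate / baseline
    return ratio, (ratio - 1) * 100


for metric, (gemini, natotan) in scores.items():
    ratio, gain = ratio_and_gain(gemini, natotan)
    print(f"{metric:<10} {ratio:.1f}x  (+{gain:.0f}%)")
```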

Gemini multimodalembedding@001 produces 1,408-dimensional embeddings versus 2,048 for Natotan, but the dimension difference is unlikely to explain a performance gap of this magnitude. A more plausible factor is limited exposure to military terminology and document structures during training, although Gemini's training data is not public and this remains an inference.

The most revealing result is Gemini's Recall@10 of 0.343: the relevant document appears among the top 10 results for only 34.3% of queries. For a document retrieval system, this is insufficient. Natotan achieves 95.0% at the same cutoff.

It is important to note that Gemini remains a performant model for general-purpose use cases. These results reflect only the military domain, where specialization proves indispensable.

What are the performance results by language?

Natotan maintains near-perfect parity between French and English, which is notable for an embedding model. NDCG@10 is 0.701 in English and 0.697 in French, an absolute difference of less than 0.005 (about 0.7% in relative terms).

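| Language | Metric | Base | Natotan | Improvement |
| --- | --- | --- | --- | --- |
| French | NDCG@1 | 0.3441 | 0.3865 | +12.3% |
| French | NDCG@10 | 0.6527 | 0.6966 | +6.7% |
| French | Recall@10 | 0.9064 | 0.9440 | +4.1% |
| French | MRR | 0.5727 | 0.6171 | +7.8% |
| English | NDCG@1 | 0.3607 | 0.3817 | +5.8% |
| English | NDCG@10 | 0.6623 | 0.7013 | +5.9% |
| English | Recall@10 | 0.9094 | 0.9562 | +5.1% |
| English | MRR | 0.5843 | 0.6187 | +5.9% |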

French benefits more from fine-tuning than English, with a +12.3% gain in NDCG@1 compared to +5.8% in English. This is likely because the base Qwen3-VL model had more room for improvement on French military content, a domain underrepresented in generic training data.

The contrast with Gemini is even more striking in French. Gemini achieves only 0.132 NDCG@10 in French versus 0.292 in English — a drop of over 50%. Natotan, by comparison, remains stable across both languages. For deployment in the French Armed Forces or in bilingual NATO headquarters, this stability is a decisive advantage.

The French Recall@10 of 0.944 means that 94.4% of French queries retrieve the correct document within the top 10 results. In English, the figure rises to 95.6%.

How is the evaluation benchmark constructed?

The benchmark uses 5,428 query-document pairs drawn from held-out documents not seen during training, split evenly between 2,714 English pairs and 2,714 French pairs. The documents span 16 categories of military publications, grouped under two main source themes.

| Source Theme | Pairs | % of Total |
| --- | --- | --- |
| French military publications | 3,104 | 57.2% |
| NATO publications | 2,324 | 42.8% |
| Total | 5,428 | 100% |

The 16 document categories

| Category | Pairs | Description |
| --- | --- | --- |
| amedp | 1,138 | Allied Medical Publications (NATO) |
| tta | 1,100 | All-Arms Regulatory Texts (FR) |
| tactical | 1,016 | Tactical manuals: infantry, battlegroups (FR) |
| ajp | 916 | Allied Joint Publications (NATO) |
| ajmedp | 224 | Allied Joint Medical Publications (NATO) |
| un_manuals | 200 | UN peacekeeping manuals (FR) |
| ft | 154 | Land Forces directives (FR) |
| pia | 136 | Joint Publications (FR) |
| irsem | 132 | Strategic research: IRSEM (FR) |
| cahiers_pensee | 124 | Military Thought Notebooks (FR) |
| dia | 92 | Joint Doctrine (FR) |
| lexicons | 82 | Glossaries: AAP-06, AAP-15 |
| strategic | 48 | White papers, strategic reviews (FR) |
| other | 46 | Other NATO publications |
| modern | 14 | Modern systems (FR) |
| medot | 6 | Operational decision methodology (FR) |

The 5 most represented categories (amedp, tta, tactical, ajp, ajmedp) account for 4,394 pairs, or 81% of the benchmark. This ensures statistical robustness for the main categories.

Low-count categories (modern: 14, medot: 6) serve as qualitative indicators but should not be interpreted in isolation due to high statistical variance.
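To see why small categories are noisy, one can treat a top-1 hit rate as a binomial proportion and look at its standard error, sqrt(p(1 - p) / n). This is a simplifying assumption (NDCG@10 is not a simple proportion, and the article reports no confidence intervals), but it gives a sense of scale. The hit rate of 0.5 is an illustrative worst case, not a measured value.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent queries."""
    return math.sqrt(p * (1 - p) / n)


# Category sizes taken from the table above; p = 0.5 is a worst-case illustration.
for name, n in [("medot", 6), ("modern", 14), ("tactical", 1016), ("amedp", 1138)]:
    se = standard_error(0.5, n)
    print(f"{name:<9} n={n:<5} standard error ~ {se:.3f}")
```

Under these assumptions the uncertainty for a 6-pair category is more than ten times that of a category with over 1,000 pairs, which is why single-category results on small samples should not be over-read.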

The underlying training dataset is the NATO & French Armed Forces Military Doctrine Dataset, comprising 454 PDF documents, 55,034 pages, and 2.53 GB of data.

Which document categories benefit the most from fine-tuning?

Natotan improves NDCG@10 in 13 out of 16 categories evaluated. The largest gains appear on categories where the base model was weakest, including UN manuals, tactical documents, and allied joint medical publications.

Top 5 categories by NDCG@10 gain

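| Category | n | Base | Natotan | Absolute Gain |
| --- | --- | --- | --- | --- |
| medot | 6 | 0.427 | 0.815 | +0.388 |
| un_manuals | 200 | 0.667 | 0.764 | +0.097 |
| ajmedp | 224 | 0.653 | 0.750 | +0.097 |
| strategic | 48 | 0.633 | 0.726 | +0.093 |
| tactical | 1,016 | 0.597 | 0.669 | +0.072 |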

The most dramatic improvement is on the medot category (operational decision methodology), with NDCG@10 jumping from 0.427 to 0.815 — a +90.9% gain. However, this category contains only 6 pairs and the result should be interpreted with caution.

On high-volume categories, the gains are more modest but statistically robust. The tactical category (1,016 pairs) improves by +12.1% in NDCG@10, and the tta category (1,100 pairs) by +9.1%. These two categories represent the French Army’s field manuals — the documents most frequently consulted on a daily basis.

Top 5 categories by NDCG@1 gain

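| Category | n | Base | Natotan | Relative Gain |
| --- | --- | --- | --- | --- |
| medot | 6 | 0.167 | 0.500 | +200.0% |
| ajmedp | 224 | 0.308 | 0.451 | +46.4% |
| ft | 154 | 0.299 | 0.429 | +43.4% |
| un_manuals | 200 | 0.365 | 0.510 | +39.7% |
| strategic | 48 | 0.313 | 0.417 | +33.3% |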

NDCG@1 gains are especially impactful because they measure the probability that the first returned result is the correct document. For a headquarters staff officer searching for a specific doctrine document, the difference between a relevant first result and an irrelevant one is considerable.

Categories with regression

| Category | n | Base | Natotan | Change |
| --- | --- | --- | --- | --- |
| cahiers_pensee | 124 | 0.682 | 0.678 | -0.6% |
| irsem | 132 | 0.654 | 0.644 | -1.5% |
| modern | 14 | 0.791 | 0.757 | -4.3% |

Three categories show slight regressions in NDCG@10. The cahiers_pensee (-0.6%) and irsem (-1.5%) categories are academic strategic research publications whose writing style differs from standard doctrinal documents. The modern category contains only 14 pairs, too few to tell whether its -4.3% change reflects a real regression or sampling noise.

Perfect Recall@10 on 4 categories

Natotan achieves a Recall@10 of 1.000 (the relevant document is retrieved in the top 10 results for every query) on 4 categories: medot, strategic, cahiers_pensee, and lexicons. On this benchmark, the system never misses the correct document for these document types, although medot (6 pairs) and strategic (48 pairs) are small samples.

What concrete examples illustrate the improvements?

Two qualitative examples from the benchmark illustrate Natotan’s improvements on real French-language queries.

Example 1 — Tactical query

Query: “A table detailing the section leader’s responsibilities during reconnaissance and delaying missions against a superior threat.”

| Model | Rank of relevant document |
| --- | --- |
| Base (Qwen3-VL-Embedding-2B) | Not in top 5 |
| Natotan | Rank 2 |

The base model completely fails to retrieve the relevant document within the top 5 results. Natotan places it at rank 2. This is a concrete case where fine-tuning transforms a retrieval failure into a usable answer.

Example 2 — Administrative query

Query: “A document detailing the steps for career orientation for volunteer soldiers and the conditions for contract renewal after eleven years of service.”

| Model | Rank of relevant document |
| --- | --- |
| Base (Qwen3-VL-Embedding-2B) | Rank 3 |
| Natotan | Rank 1 |

The base model retrieves the correct document but ranks it third, behind two irrelevant results. Natotan promotes it directly to rank 1.

These two examples show that Natotan’s improvements are not abstract: they translate into concrete differences in the user experience of a military document retrieval system.

How do you use Natotan in a RAG pipeline?

Natotan is a merged model that deploys like any standard Hugging Face model. There is no LoRA adapter to load separately, no additional dependencies.

Loading the model

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)

Integration into a RAG system

Natotan integrates into any RAG (Retrieval-Augmented Generation) pipeline as a document and query encoder. The 2,048-dimensional embeddings are compatible with standard vector databases: FAISS, Milvus, Qdrant, Pinecone, Weaviate.

| Component | Role |
| --- | --- |
| Natotan | Document and query encoder (2,048 dimensions) |
| Vector database | Embedding storage and similarity search (FAISS, Milvus, Qdrant…) |
| Generator LLM | Answer generation from retrieved documents |

The typical workflow is: (1) encode the corpus documents with Natotan, (2) store the embeddings in a vector database, (3) when a query arrives, encode it with Natotan, (4) retrieve the k most similar documents, (5) pass the retrieved documents to an LLM for answer generation.
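Steps (1) to (4) can be sketched without any external dependency. In the sketch below, `embed` is a hypothetical stand-in that turns text into a small bag-of-words vector so that the example runs on its own; in a real pipeline it would be replaced by a call to Natotan, whose encoding API is not shown in this article. The in-memory list plays the role of the vector database, and the corpus texts are invented.

```python
import math
from collections import Counter

VOCAB = ["reconnaissance", "section", "leader", "contract", "renewal", "medical", "evacuation"]


def embed(text: str) -> list[float]:
    """Hypothetical placeholder encoder: L2-normalised word counts over VOCAB."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query_vec, index, k):
    """Return the k (score, doc_id) pairs with the highest cosine similarity."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), doc_id) for doc_id, vec in index]
    return sorted(scored, reverse=True)[:k]


# (1) encode the corpus and (2) store the embeddings.
corpus = {
    "doc_recon": "section leader tasks during reconnaissance",
    "doc_hr": "contract renewal conditions",
    "doc_med": "medical evacuation procedure",
}
index = [(doc_id, embed(text)) for doc_id, text in corpus.items()]

# (3) encode the query and (4) retrieve the k most similar documents.
hits = top_k(embed("reconnaissance section leader"), index, k=2)
print(hits)  # step (5) would pass the retrieved documents to the generator LLM
```

Since the vectors are L2-normalised, the dot product equals cosine similarity, which is also the setup most vector databases expect for inner-product search.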

With a Recall@5 of 89.3% and a Recall@10 of 95.0%, Natotan ensures that relevant documents are retrieved in the vast majority of cases before the generation step.

What are the limitations of the model?

Natotan is optimized for a specific domain and has several limitations to be aware of before deployment.

Narrow domain. Fine-tuning was performed exclusively on NATO and French Armed Forces doctrine documents. Performance on other domains (legal, civilian medical, finance) has not been evaluated. Natotan is expected to retain much of the general-purpose capability of Qwen3-VL-Embedding-2B, but the specialization gains apply only to the training domain.

Two languages only. The benchmark covers French and English. Performance on other NATO languages (German, Spanish, Turkish, etc.) has not been measured, although the base model supports many languages.

Low-count categories. Six benchmark categories contain fewer than 100 pairs (dia: 92, lexicons: 82, strategic: 48, other: 46, modern: 14, medot: 6). Results on these categories have high statistical variance and should be interpreted with caution.

No incremental updates. The model is a frozen snapshot. It does not update automatically when new doctrine documents are published. Periodic re-fine-tuning is required to integrate new publications.

Model size. At 2 billion parameters, Natotan requires a GPU for full-speed inference. CPU deployment is possible but significantly slower.
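A back-of-the-envelope estimate of the weight footprint helps with hardware sizing. The sketch assumes exactly 2 billion parameters and counts weights only; activations, image inputs, and framework overhead come on top, so these figures are lower bounds rather than measured requirements.

```python
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 2_000_000_000  # assumed round figure for a "2B" model

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:<17} ~{weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```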

Citation

@misc{Natotan2025,
  title={Natotan: LoRA-tuned Qwen3-VL-Embedding-2B for multimodal defense document retrieval},
  year={2025},
  url={https://huggingface.co/racineai/natotan}
}

Common questions

What is Natotan?

Natotan is a 2-billion-parameter vision-language embedding model, fine-tuned via LoRA on NATO and French Armed Forces doctrine documents. It produces 2,048-dimensional embeddings and is optimized for bilingual French-English multimodal document retrieval.

What is the difference between Natotan and Qwen3-VL-Embedding-2B?

Natotan is a specialized version of Qwen3-VL-Embedding-2B obtained through LoRA fine-tuning on a military document corpus. It improves NDCG@1 by 9.0%, Recall@5 by 5.9%, and MRR by 6.8% on the military document retrieval benchmark.

Is Natotan better than Gemini for military retrieval?

Yes. On the benchmark of 5,428 query-document pairs, Natotan achieves NDCG@10 of 0.699 versus 0.212 for Gemini multimodalembedding@001 — a 3.3x performance difference. The gap is even larger in French: 0.697 for Natotan versus 0.132 for Gemini (5.3x).

Do I need to load a LoRA adapter separately?

No. The LoRA adapter weights have been merged into the base model. Natotan loads directly with AutoModel.from_pretrained() like any standard Hugging Face model, with no additional dependencies.

Does Natotan work in both French and English?

Yes. The benchmark is split evenly between 2,714 French and 2,714 English query-document pairs. Natotan improves performance in both languages, with larger gains in French (+12.3% NDCG@1) than English (+5.8%).

Which document types does Natotan improve the most?

The largest gains appear on tactical field manuals (+12.1% NDCG@10), UN peacekeeping manuals (+14.6%), allied joint medical publications (+14.9%), and strategic doctrine (+14.8%). The model improves 13 out of 16 evaluated categories.

Can Natotan be used in a military RAG system?

Yes. Natotan integrates into any RAG pipeline as a document and query encoder. Its 2,048-dimensional embeddings are compatible with FAISS, Milvus, Qdrant, Pinecone, and Weaviate. With a Recall@10 of 95.0%, it ensures relevant documents are retrieved in the vast majority of cases.

What GPU is required to run Natotan?

Natotan is a 2-billion-parameter model. It runs on any GPU with at least 8 GB of VRAM (RTX 3060, A10, T4, etc.). CPU inference is possible but slower.

What is the relationship with the NATO & French Armed Forces Military Doctrine Dataset?

Natotan was fine-tuned on the NATO & French Armed Forces Military Doctrine Dataset, a corpus of 454 PDF documents totaling 55,034 pages and 2.53 GB. The 5,428-pair evaluation benchmark is derived from the same dataset, using held-out documents not seen during training.

Can Natotan be used outside the military domain?

Natotan inherits the general-purpose capabilities of Qwen3-VL-Embedding-2B, and the lightweight LoRA fine-tuning is not expected to have degraded them substantially. However, performance on out-of-domain tasks has not been formally evaluated, so this should be verified before using the model outside the military domain.
