Natotan is a 2-billion-parameter vision-language embedding model, fine-tuned via LoRA on NATO and French Armed Forces doctrine documents. It produces 2,048-dimensional embeddings and is optimized for bilingual French-English multimodal document retrieval.

What is the difference between Natotan and Qwen3-VL-Embedding-2B?

Natotan is a specialized version of Qwen3-VL-Embedding-2B obtained through LoRA fine-tuning on a military document corpus. It improves NDCG@1 by 9.0%, Recall@5 by 5.9%, and MRR by 6.8% on the military document retrieval benchmark.

Is Natotan better than Gemini for military retrieval?

Yes. On the benchmark of 5,428 query-document pairs, Natotan achieves NDCG@10 of 0.699 versus 0.212 for Gemini multimodalembedding@001 — a 3.3x performance difference. The gap is even larger in French: 0.697 for Natotan versus 0.132 for Gemini (5.3x).

Do I need to load a LoRA adapter separately?

No. The LoRA adapter weights have been merged into the base model. Natotan loads directly with AutoModel.from_pretrained() like any standard Hugging Face model, with no additional dependencies.

Does Natotan work in both French and English?

Yes. The benchmark is split evenly between 2,714 French and 2,714 English query-document pairs. Natotan improves performance in both languages, with larger gains in French (+12.3% NDCG@1) than English (+5.8%).

Which document types does Natotan improve the most?

The largest gains appear on tactical field manuals (+12.1% NDCG@10), UN peacekeeping manuals (+14.6%), allied joint medical publications (+14.9%), and strategic doctrine (+14.8%). The model improves 13 out of 16 evaluated categories.

Can Natotan be used in a military RAG system?

Yes. Natotan integrates into any RAG pipeline as a document and query encoder. Its 2,048-dimensional embeddings are compatible with FAISS, Milvus, Qdrant, Pinecone, and Weaviate. With a benchmark Recall@10 of 95.0%, it retrieves the relevant document within the top 10 results in the vast majority of cases.

What GPU is required to run Natotan?

Natotan is a 2-billion-parameter model. It runs on any GPU with at least 8 GB of VRAM (RTX 3060, A10, T4, etc.). CPU inference is possible but slower.

What is the relationship with the NATO & French Armed Forces Military Doctrine Dataset?

Natotan was fine-tuned on the NATO & French Armed Forces Military Doctrine Dataset, a corpus of 454 PDF documents totaling 55,034 pages and 2.53 GB. The 5,428-pair evaluation benchmark is derived from the same dataset, using held-out documents not seen during training.

Can Natotan be used outside the military domain?

Natotan inherits the general-purpose capabilities of Qwen3-VL-Embedding-2B. Because LoRA fine-tuning is lightweight, the base model's general performance is expected to be largely preserved, but this has not been formally evaluated on out-of-domain tasks.

Natotan: Vision-Language Embedding Model for Multimodal Military Document Retrieval

Natotan is a domain-adapted vision-language embedding model for multimodal military and defense document retrieval in English and French. It is derived from Qwen3-VL-Embedding-2B via LoRA (Low-Rank Adaptation) fine-tuning, with the adapter weights merged into the base model for frictionless deployment.

On a custom benchmark of 5,428 query-document pairs covering NATO and French Armed Forces doctrine publications, Natotan achieves an NDCG@1 of 0.384 (+9.0% vs the base model) and an MRR of 0.618 (+6.8%). It outperforms Google Gemini multimodalembedding@001 by over 230% in NDCG@10.

The model produces 2,048-dimensional embeddings, identical to the base Qwen3-VL-Embedding-2B. It is distributed in safetensors format on Hugging Face and loads in a single line with AutoModel.from_pretrained() — no separate LoRA adapter required.

Property	Value
Base model	Qwen3-VL-Embedding-2B
Fine-tuning method	LoRA (Low-Rank Adaptation), merged
Embedding dimension	2,048
Languages	French + English
Task	Multimodal embedding / document retrieval
Format	safetensors
Benchmark	5,428 query-document pairs
Categories evaluated	16
NDCG@1	0.384 (+9.0% vs base)
MRR	0.618 (+6.8% vs base)
Recall@10	0.950 (+4.6% vs base)

Why build a specialized embedding model for defense?

Generic embedding models — whether open-source or proprietary — consistently fail on defense document corpora. Military vocabulary is technical, multilingual, and mixes text, tactical diagrams, tables, and maps within a single document. A general-purpose model has not seen enough of this content during pre-training to produce reliable semantic representations.

The evidence is particularly striking with Google Gemini multimodalembedding@001. On the Natotan benchmark, Gemini achieves an NDCG@10 of only 0.212 where Natotan reaches 0.699, a 3.3x gap. On French documents, the difference widens further: Gemini drops to 0.132 NDCG@10 versus 0.697 for Natotan, a ratio of 5.3x.

This result is consistent with a trend reported in the literature: proprietary general-purpose models significantly underperform on specialized domains, particularly outside of English. LoRA fine-tuning on top of an open-source base model, even with a modest compute budget, is enough to far exceed their performance on this benchmark.

Practical use cases include document retrieval in military RAG systems, doctrine publication lookup for headquarters staff, and multimodal indexing of tactical manuals that combine text with diagrams and schematics.

How was Natotan built?

Natotan was built in 3 steps from the open-source base model Qwen3-VL-Embedding-2B, a 2-billion-parameter vision-language model published by Alibaba’s Qwen team.

Step	Description
1. LoRA fine-tuning	Domain adaptation on NATO and French Armed Forces military documents via Low-Rank Adaptation
2. Weight merging	Merge of the LoRA adapter into the base model weights
3. Safetensors export	Save the merged model in standard Hugging Face format

LoRA (Low-Rank Adaptation) works by freezing the base model weights and training only low-rank matrices injected into the attention layers. This approach enables memory-efficient and compute-efficient fine-tuning while preserving the model’s general-purpose capabilities.
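The low-rank update and the later merge can be illustrated numerically. The following is a minimal NumPy sketch, not Natotan's training code: the layer size, rank, and scaling factor `alpha` are illustrative assumptions, and a single dense matrix stands in for one attention projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are NOT Natotan's actual hyperparameters.
d_in, d_out, rank, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))         # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = rng.normal(size=(d_out, rank)) * 0.01  # trainable up-projection
scale = alpha / rank


def lora_forward(x):
    """Adapter kept separate: frozen base path plus scaled low-rank path."""
    return x @ W.T + scale * (x @ A.T) @ B.T


def merge(W, A, B, scale):
    """Fold the low-rank update into one dense matrix of the same shape as W."""
    return W + scale * (B @ A)


W_merged = merge(W, A, B, scale)
x = rng.normal(size=(4, d_in))

# The merged matrix reproduces the adapter's output up to rounding error.
same_output = np.allclose(lora_forward(x), x @ W_merged.T)

# Share of parameters that would be trained for this one layer.
trainable_fraction = (A.size + B.size) / W.size
```

Because the merged matrix has the same shape as the original weight, the merged model needs no adapter-specific code at inference time, which is why it can be loaded like the base model.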

The training dataset is derived from the NATO & French Armed Forces Military Doctrine Dataset, a corpus of 454 PDF documents totaling 55,034 pages and 2.53 GB of data covering 16 categories of military publications.

After merging, the resulting model is fully self-contained: no separate LoRA adapter to load, no additional dependencies. It works exactly like the base Qwen3-VL-Embedding-2B with the same API.

python3 merge_lora.py \
  --base_model Qwen/Qwen3-VL-Embedding-2B \
  --adapter ./lora_adapters \
  --output_dir ./merged \
  --trust_remote_code

What are Natotan’s overall retrieval performance results?

Natotan outperforms the base Qwen3-VL-Embedding-2B model on every metric and every cutoff level evaluated. The improvement is strongest at the top of the ranking: NDCG@1 increases from 0.352 to 0.384, a 9.0% gain.

Metric	Base	Natotan	Improvement
NDCG@1	0.3524	0.3841	+9.0%
NDCG@5	0.6362	0.6802	+6.9%
NDCG@10	0.6575	0.6990	+6.3%
Recall@1	0.3524	0.3841	+9.0%
Recall@5	0.8430	0.8930	+5.9%
Recall@10	0.9079	0.9501	+4.6%
MRR	0.5785	0.6179	+6.8%
MAP	0.5785	0.6179	+6.8%

In practical terms, a Recall@5 of 0.893 means the relevant document appears in the top 5 results for 89.3% of queries, compared to 84.3% with the base model. At Recall@10 the figure climbs to 95.0% — the correct document is found within 10 results for nearly all queries.

The improvement in MRR (Mean Reciprocal Rank) from 0.579 to 0.618 means the harmonic-mean rank of the first relevant result (the reciprocal of MRR) shifts from approximately 1.73 to 1.62. In a military RAG system where every rank matters, this is a meaningful gain.
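For readers who want to reproduce these metrics on their own data, here is a minimal sketch of how they can be computed from the rank of the relevant document. It assumes exactly one relevant document per query, which matches the table above (Recall@1 equals NDCG@1, and MAP equals MRR). The function name and the toy ranks are hypothetical and are not taken from the benchmark.

```python
import math


def retrieval_metrics(ranks, k):
    """Recall@k, NDCG@k and MRR from 1-based ranks of the single relevant
    document per query. Use None when the document was not retrieved."""
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    recall = len(hits) / n
    # With one relevant document the ideal DCG is 1, so NDCG@k reduces to
    # 1 / log2(rank + 1) for a hit and 0 otherwise.
    ndcg = sum(1.0 / math.log2(r + 1) for r in hits) / n
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    return {"recall": recall, "ndcg": ndcg, "mrr": mrr}


# Toy ranks for four hypothetical queries (not benchmark data).
toy_ranks = [1, 2, 4, None]
at_3 = retrieval_metrics(toy_ranks, k=3)
at_1 = retrieval_metrics(toy_ranks, k=1)

# The reciprocal of MRR is a harmonic mean of ranks.
base_harmonic_rank = 1 / 0.5785
natotan_harmonic_rank = 1 / 0.6179
```

Note that the reciprocal of MRR is a harmonic mean, so it is pulled toward the best ranks and is lower than the arithmetic mean rank.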

NDCG@5428 (the maximum cutoff, matching the full corpus size) reaches 0.710, which suggests that ranking quality holds through the entire result list and is not limited to the top positions.

How does Natotan compare to Google Gemini?

The comparison with Google Gemini multimodalembedding@001 illustrates the gap between a general-purpose proprietary model and a domain-adapted open-source model. Natotan outperforms Gemini on every metric without exception; on the metrics tabulated below, the gaps range from +177% (Recall@10) to +315% (NDCG@1).

Metric	Gemini	Natotan	Ratio
NDCG@1	0.0925	0.3841	x4.2
NDCG@5	0.1880	0.6802	x3.6
NDCG@10	0.2118	0.6990	x3.3
Recall@5	0.2690	0.8930	x3.3
Recall@10	0.3427	0.9501	x2.8
MRR	0.1823	0.6179	x3.4

Gemini multimodalembedding@001 produces 1,408-dimensional embeddings versus 2,048 for Natotan. But the dimension difference does not explain a performance gap of this magnitude. The fundamental issue is the lack of domain specialization: Gemini was not exposed to military terminology and document structures during training.

The most revealing result is Gemini’s Recall@10 of 0.343: the relevant document appears among the top 10 results for only 34.3% of queries. For a document retrieval system, this is insufficient. Natotan achieves 95.0% at the same cutoff.

It is important to note that Gemini remains a performant model for general-purpose use cases. These results reflect only the military domain, where specialization proves indispensable.

What are the performance results by language?

Natotan maintains near parity between French and English, which is notable for an embedding model. NDCG@10 is 0.701 in English and 0.697 in French, an absolute difference of about 0.005 (roughly 0.7% in relative terms).

Language	Metric	Base	Natotan	Improvement
French	NDCG@1	0.3441	0.3865	+12.3%
French	NDCG@10	0.6527	0.6966	+6.7%
French	Recall@10	0.9064	0.9440	+4.1%
French	MRR	0.5727	0.6171	+7.8%
English	NDCG@1	0.3607	0.3817	+5.8%
English	NDCG@10	0.6623	0.7013	+5.9%
English	Recall@10	0.9094	0.9562	+5.1%
English	MRR	0.5843	0.6187	+5.9%

French benefits more from fine-tuning than English, with a +12.3% gain in NDCG@1 compared to +5.8% in English. This is likely because the base Qwen3-VL model had more room for improvement on French military content, a domain underrepresented in generic training data.

The contrast with Gemini is even more striking in French. Gemini achieves only 0.132 NDCG@10 in French versus 0.292 in English — a drop of over 50%. Natotan, by comparison, remains stable across both languages. For deployment in the French Armed Forces or in bilingual NATO headquarters, this stability is a decisive advantage.

The French Recall@10 of 0.944 means that 94.4% of French queries retrieve the correct document within the top 10 results. In English, the figure rises to 95.6%.

How is the evaluation benchmark constructed?

The benchmark uses 5,428 query-document pairs drawn from held-out documents not seen during training, split evenly between 2,714 English pairs and 2,714 French pairs. The documents span 16 categories of military publications, grouped under two main source themes.

Source Theme	Pairs	% of Total
French military publications	3,104	57.2%
NATO publications	2,324	42.8%
Total	5,428	100%

The 16 document categories

Category	Pairs	Description
amedp	1,138	Allied Medical Publications (NATO)
tta	1,100	All-Arms Regulatory Texts (FR)
tactical	1,016	Tactical manuals — infantry, battlegroups (FR)
ajp	916	Allied Joint Publications (NATO)
ajmedp	224	Allied Joint Medical Publications (NATO)
un_manuals	200	UN peacekeeping manuals (FR)
ft	154	Land Forces directives (FR)
pia	136	Joint Publications (FR)
irsem	132	Strategic research — IRSEM (FR)
cahiers_pensee	124	Military Thought Notebooks (FR)
dia	92	Joint Doctrine (FR)
lexicons	82	Glossaries — AAP-06, AAP-15
strategic	48	White papers, strategic reviews (FR)
other	46	Other NATO publications
modern	14	Modern systems (FR)
medot	6	Operational decision methodology (FR)

The 5 most represented categories (amedp, tta, tactical, ajp, ajmedp) account for 4,394 pairs, or 81% of the benchmark. This ensures statistical robustness for the main categories.

Low-count categories (modern: 14, medot: 6) serve as qualitative indicators but should not be interpreted in isolation due to high statistical variance.
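To give a sense of how much uncertainty a small sample carries, the sketch below computes a 95% Wilson score interval for a top-1 hit rate. This is an illustration added here, not part of the original evaluation. It uses the medot NDCG@1 of 0.500 on 6 pairs (3 top-1 hits, assuming one relevant document per query) and, for contrast, a hypothetical category with the same 0.5 rate on 1,016 pairs.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# medot: 3 correct top-1 results out of 6 pairs.
small = wilson_interval(3, 6)

# Hypothetical: the same 0.5 hit rate measured on 1,016 pairs.
large = wilson_interval(508, 1016)

small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

With 6 pairs the interval spans roughly 0.19 to 0.81, while with 1,016 pairs it narrows to roughly 0.47 to 0.53, which is why the low-count categories are best read as qualitative indicators.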

The underlying training dataset is the NATO & French Armed Forces Military Doctrine Dataset, comprising 454 PDF documents, 55,034 pages, and 2.53 GB of data.

Which document categories benefit the most from fine-tuning?

Natotan improves NDCG@10 in 13 out of 16 categories evaluated. The largest absolute gains appear on operational decision methodology (medot), UN manuals, allied joint medical publications, strategic documents, and tactical manuals.

Top 5 categories by NDCG@10 gain

Category	n	Base	Natotan	Absolute Gain
medot	6	0.427	0.815	+0.388
un_manuals	200	0.667	0.764	+0.097
ajmedp	224	0.653	0.750	+0.097
strategic	48	0.633	0.726	+0.093
tactical	1,016	0.597	0.669	+0.072

The most dramatic improvement is on the medot category (operational decision methodology), with NDCG@10 jumping from 0.427 to 0.815 — a +90.9% gain. However, this category contains only 6 pairs and the result should be interpreted with caution.

On high-volume categories, the gains are more modest but statistically robust. The tactical category (1,016 pairs) improves by +12.1% in NDCG@10, and the tta category (1,100 pairs) by +9.1%. These two categories represent the French Army’s field manuals — the documents most frequently consulted on a daily basis.
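The tables in this section mix absolute gains (differences in score) and relative gains (percentages of the base score). As a small worked example, the sketch below recomputes both from the rounded NDCG@10 values in the table above; the helper name is hypothetical.

```python
def gains(base, tuned):
    """Return (absolute gain, relative gain in percent) of tuned over base."""
    absolute = tuned - base
    return absolute, 100.0 * absolute / base


# Rounded NDCG@10 values taken from the table above.
medot_abs, medot_rel = gains(0.427, 0.815)
tactical_abs, tactical_rel = gains(0.597, 0.669)
```

The same absolute gain corresponds to a larger relative gain when the base score is lower, so the two views can rank categories differently.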

Top 5 categories by NDCG@1 gain

Category	n	Base	Natotan	Relative Gain
medot	6	0.167	0.500	+200.0%
ajmedp	224	0.308	0.451	+46.4%
ft	154	0.299	0.429	+43.4%
un_manuals	200	0.365	0.510	+39.7%
strategic	48	0.313	0.417	+33.3%

NDCG@1 gains are especially impactful because, with a single relevant document per query, NDCG@1 is the fraction of queries for which the first returned result is the correct document. For a headquarters staff officer searching for a specific doctrine document, the difference between a relevant first result and an irrelevant one is considerable.

Categories with regression

Category	n	Base	Natotan	Change
cahiers_pensee	124	0.682	0.678	-0.6%
irsem	132	0.654	0.644	-1.5%
modern	14	0.791	0.757	-4.3%

Three categories show slight regressions in NDCG@10. The cahiers_pensee (-0.6%) and irsem (-1.5%) categories are academic strategic research publications whose writing style differs from standard doctrinal documents. The modern category contains only 14 pairs, too few to tell whether its regression is statistically meaningful.

Perfect Recall@10 on 4 categories

Natotan achieves a Recall@10 of 1.000 (100% of relevant documents retrieved in the top 10 results) on 4 categories: medot, strategic, cahiers_pensee, and lexicons. This means that, on this benchmark, the correct document always appears in the top 10 for these document types.

What concrete examples illustrate the improvements?

Two qualitative examples from the benchmark illustrate Natotan’s improvements on real French-language queries.

Example 1 — Tactical query

Query: “A table detailing the section leader’s responsibilities during reconnaissance and delaying missions against a superior threat.”

Model	Rank of relevant document
Base (Qwen3-VL-Embedding-2B)	Not in top 5
Natotan	Rank 2

The base model completely fails to retrieve the relevant document within the top 5 results. Natotan places it at rank 2. This is a concrete case where fine-tuning transforms a retrieval failure into a usable answer.

Example 2 — Administrative query

Query: “A document detailing the steps for career orientation for volunteer soldiers and the conditions for contract renewal after eleven years of service.”

Model	Rank of relevant document
Base (Qwen3-VL-Embedding-2B)	Rank 3
Natotan	Rank 1

The base model retrieves the correct document but ranks it third, behind two irrelevant results. Natotan promotes it directly to rank 1.

These two examples show that Natotan’s improvements are not abstract: they translate into concrete differences in the user experience of a military document retrieval system.

How to use Natotan in a RAG pipeline?

Natotan is a merged model that deploys like any standard Hugging Face model. There is no LoRA adapter to load separately, no additional dependencies.

Loading the model

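from transformers import AutoModel, AutoTokenizer

# The LoRA weights are already merged, so the checkpoint loads directly.
# trust_remote_code=True allows the repository's custom model code to run.
model = AutoModel.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)
# Tokenizer for text inputs, loaded from the same repository.
tokenizer = AutoTokenizer.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)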

Integration into a RAG system

Natotan integrates into any RAG (Retrieval-Augmented Generation) pipeline as a document and query encoder. The 2,048-dimensional embeddings are compatible with standard vector databases: FAISS, Milvus, Qdrant, Pinecone, Weaviate.

Component	Role
Natotan	Document and query encoder (2,048 dimensions)
Vector database	Similarity storage and search (FAISS, Milvus, Qdrant…)
Generator LLM	Answer generation from retrieved documents

The typical workflow is: (1) encode the corpus documents with Natotan, (2) store the embeddings in a vector database, (3) when a query arrives, encode it with Natotan, (4) retrieve the k most similar documents, (5) pass the retrieved documents to an LLM for answer generation.
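Steps (2) to (4) of this workflow can be sketched with a small in-memory index. This is an illustration under stated assumptions, not a reference implementation: random vectors stand in for Natotan embeddings, cosine similarity is assumed as the scoring function, and the `InMemoryIndex` class is hypothetical. A production system would use one of the vector databases listed above.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class InMemoryIndex:
    """Hypothetical brute-force index, standing in for a vector database."""

    def __init__(self, doc_embeddings, doc_ids):
        self.vectors = l2_normalize(np.asarray(doc_embeddings, dtype=np.float32))
        self.doc_ids = list(doc_ids)

    def search(self, query_embedding, k=5):
        q = l2_normalize(np.asarray(query_embedding, dtype=np.float32))
        scores = self.vectors @ q
        top = np.argsort(-scores)[:k]  # indices of the k highest scores
        return [(self.doc_ids[i], float(scores[i])) for i in top]


rng = np.random.default_rng(0)
dim = 2048  # Natotan's embedding dimension

# Steps 1-2: random stand-ins for encoded documents, stored in the index.
doc_embeddings = rng.normal(size=(100, dim))
doc_ids = [f"doc-{i}" for i in range(100)]
index = InMemoryIndex(doc_embeddings, doc_ids)

# Step 3: stand-in for an encoded query, built as a noisy copy of document 42.
query_embedding = doc_embeddings[42] + 0.1 * rng.normal(size=dim)

# Step 4: retrieve the k most similar documents.
results = index.search(query_embedding, k=5)

# Step 5 would pass the text or images of these documents to the generator LLM.
context_ids = [doc_id for doc_id, _ in results]
```

Normalizing once at indexing time keeps the query path to a single matrix-vector product, which is the same idea an inner-product index in a vector database relies on.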

With a Recall@5 of 89.3% and a Recall@10 of 95.0% on the benchmark, Natotan retrieves the relevant document before the generation step in the vast majority of cases.

What are the limitations of the model?

Natotan is optimized for a specific domain and has several limitations to be aware of before deployment.

Narrow domain. Fine-tuning was performed exclusively on NATO and French Armed Forces doctrine documents. Performance on other domains (legal, civilian medical, finance) has not been evaluated. The base Qwen3-VL-Embedding-2B retains its general-purpose capabilities, but the specialization gains apply only to the training domain.

Two languages only. The benchmark covers French and English. Performance on other NATO languages (German, Spanish, Turkish, etc.) has not been measured, although the base model supports many languages.

Low-count categories. Six benchmark categories contain fewer than 100 pairs (dia: 92, lexicons: 82, strategic: 48, other: 46, modern: 14, medot: 6). Results on these categories have high statistical variance and should be interpreted with caution.

No incremental updates. The model is a frozen snapshot. It does not update automatically when new doctrine documents are published. Periodic re-fine-tuning is required to integrate new publications.

Model size. At 2 billion parameters, Natotan requires a GPU for full-speed inference. CPU deployment is possible but significantly slower.
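As a rough guide to sizing, the weight footprint can be estimated from the parameter count. The sketch below is a back-of-the-envelope calculation only: it uses the nominal 2 billion parameters, assumes 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats, and ignores activations, image inputs, and framework overhead. The precision of the published checkpoint is not stated here, so both cases are shown.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


n_params = 2_000_000_000  # nominal parameter count

fp32_gib = weight_memory_gib(n_params, 4)  # 32-bit floats
fp16_gib = weight_memory_gib(n_params, 2)  # 16-bit floats (fp16 or bf16)
```

This gives about 7.5 GiB in 32-bit and about 3.7 GiB in 16-bit precision for the weights alone, so half precision is the more comfortable fit on small GPUs.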

Citation

@misc{Natotan2025,
  title={Natotan: LoRA-tuned Qwen3-VL-Embedding-2B for multimodal defense document retrieval},
  year={2025},
  url={https://huggingface.co/racineai/natotan}
}