
NATO & French Military Doctrine Dataset: 377 Documents, 29,271 Pages for Visual Document Retrieval

Q: How many documents are in the NATO & French Military Doctrine Dataset?

The dataset contains **377 documents** totaling **29,271 unique page images** and **58,542 rows** (each page appears twice, once per language). The parquet files total **12.93 GB**.

Q: What is the dataset format?

The dataset is provided in **Hugging Face parquet format** with two splits: train (53,114 rows, 341 documents) and test (5,428 rows, 36 documents). Each row contains a page image (JPEG binary) and bilingual queries.

Q: What languages are covered?

The dataset has **bilingual queries** — every page image has both a French query (`query_fr`) and an English query (`query_en`). The source documents are 192 French documents and 185 NATO (English) documents.

Q: What is the schema?

The dataset has 14 columns: `id`, `doc_id`, `page_num`, `total_pages`, `folder`, `subfolder`, `filename`, `source_path`, `query_fr`, `query_en`, `language`, `query`, `created_at`, and `image` (JPEG binary).

Q: What NATO publications are included?

The NATO portion includes **47 Allied Joint Publications (AJP)**, **93 Allied Medical Publications (AMEDP)**, **20 Allied Joint Medical Publications (AJMEDP)**, and **25 other NATO standards**. Key series include AJP-01, AJP-3, AJP-4, and AJP-5.

Q: What French military doctrine categories are included?

The French portion has **12 categories**: TTA (33 docs), lexicons (21 docs), tactical (20 docs), PIA (25 docs), strategic (19 docs), DIA (23 docs), IRSEM (22 docs), MEDOT (12 docs), UN manuals (4 docs), cahiers_pensee (7 docs), FT (5 docs), and modern (1 doc).

Q: How do I load this dataset?

Use the Hugging Face datasets library: `from datasets import load_dataset; ds = load_dataset("parquet", data_dir="data/")`. The train split contains 53,114 rows and the test split 5,428 rows.
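A slightly fuller sketch of the loading step, assuming the parquet files sit in a local `data/` directory as in the one-liner above. The helper names (`jpeg_bytes`, `looks_like_jpeg`, `load_splits`) are illustrative and not part of the dataset. How the `image` cell is decoded depends on how the parquet files were written, so the helper assumes either raw bytes or a dict with a `bytes` key and raises on anything else.

```python
from typing import Any

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG streams start with the SOI marker FF D8, followed by FF


def jpeg_bytes(image_value: Any) -> bytes:
    """Return the raw bytes of an `image` cell.

    Assumption: the cell is raw bytes or a dict with a "bytes" key.
    """
    if isinstance(image_value, (bytes, bytearray)):
        return bytes(image_value)
    if isinstance(image_value, dict) and "bytes" in image_value:
        return bytes(image_value["bytes"])
    raise TypeError(f"unsupported image cell type: {type(image_value).__name__}")


def looks_like_jpeg(image_value: Any) -> bool:
    """Cheap sanity check on the first bytes of the image payload."""
    return jpeg_bytes(image_value).startswith(JPEG_MAGIC)


def load_splits(data_dir: str = "data/"):
    """Load the train and test splits from local parquet files."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("parquet", data_dir=data_dir)
```

After `ds = load_splits()`, `ds["train"].num_rows` and `ds["test"].num_rows` should match the 53,114 and 5,428 rows quoted above, and `looks_like_jpeg(ds["train"][0]["image"])` gives a quick check on the first image.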

Q: What are the best use cases?

The dataset is designed for **visual document retrieval** and **visual question answering**. Given a text query, retrieve the most relevant page image. The bilingual queries also enable **cross-lingual retrieval** research.

Racine AI February 12, 2026

The NATO & French Military Doctrine Dataset is a visual document retrieval corpus of 377 documents totaling 29,271 page images, each paired with bilingual AI-generated queries and distributed as 12.93 GB of parquet files. It is designed for document retrieval and visual question answering (VQA) tasks in the military domain.

This military doctrine dataset draws from authoritative sources: NATO, the French Ministry of Armed Forces, IRSEM (Institute for Strategic Research), the United Nations, and COEMED (Centre of Excellence for Military Medicine). It covers the full spectrum of doctrinal hierarchy, from strategic white papers down to tactical field manuals.

Each page image is paired with AI-generated queries in both French and English, giving 58,542 rows in total (each page appears twice, once per language). This structure suits cross-lingual document retrieval, visual question answering, and bilingual defense NLP pipelines.
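Because every page appears in two rows, a retrieval index usually wants one entry per page. The sketch below collapses rows to unique pages. It assumes that the pair (`doc_id`, `page_num`) identifies a page, which the schema suggests but the article does not state. The function name `unique_pages` is illustrative.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
PageKey = Tuple[object, object]


def unique_pages(rows: Iterable[Row]) -> Dict[PageKey, Row]:
    """Collapse the two per-language rows of each page into one record.

    Assumption: (doc_id, page_num) uniquely identifies a page.
    """
    pages: Dict[PageKey, Row] = {}
    for row in rows:
        key = (row["doc_id"], row["page_num"])
        # Keep the first row seen for each page and ignore later duplicates.
        pages.setdefault(key, row)
    return pages
```

Applied to the full dataset, the result should hold 29,271 entries if the key assumption is right. A different count would indicate that the key needs another column.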

What Are the Key Statistics of This Military Doctrine Dataset?

The corpus contains 377 documents spanning 29,271 unique page images, stored as 12.93 GB of parquet files.

Metric	Value
Total documents	377
Unique page images	29,271
Total rows	58,542 (each page × 2 languages)
Total size	12.93 GB (parquet with images)
Train split	53,114 rows (341 documents, 26,557 pages)
Test split	5,428 rows (36 documents, 2,714 pages)
Languages	Bilingual queries (FR + EN per page)
French documents	192 (50.9%)
NATO documents	185 (49.1%)
French pages	19,782
NATO pages	9,489
French categories	12
NATO categories	4

How Are the Documents Distributed by Language?

French documents slightly outnumber NATO documents at 192 (50.9%) versus 185 (49.1%). However, French pages represent 67.6% of the total page count.

Source	Documents	Pages	Rows	% of Docs
French	192	19,782	39,564	50.9%
NATO	185	9,489	18,978	49.1%
Total	377	29,271	58,542	100%

The average French document is larger (103 pages) than the average NATO document (51 pages). Note that “bilingual” refers to the AI-generated queries—each page image has both a French query and an English query, regardless of the source document’s original language.
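The shares and averages above follow directly from the document and page counts in the table. A short check, using only those published counts (the names `SOURCES` and `summarize` are illustrative):

```python
SOURCES = {
    "French": {"docs": 192, "pages": 19_782},
    "NATO": {"docs": 185, "pages": 9_489},
}


def summarize(sources):
    """Derive shares, average document length, and row counts per source."""
    total_docs = sum(s["docs"] for s in sources.values())
    total_pages = sum(s["pages"] for s in sources.values())
    summary = {}
    for name, s in sources.items():
        summary[name] = {
            "doc_share": round(100 * s["docs"] / total_docs, 1),
            "page_share": round(100 * s["pages"] / total_pages, 1),
            "avg_pages": round(s["pages"] / s["docs"]),
            "rows": 2 * s["pages"],  # one row per page per query language
        }
    return summary


for name, stats in summarize(SOURCES).items():
    print(name, stats)
```

The French source comes out at 50.9% of documents, 67.6% of pages, and 103 pages per document. The NATO source comes out at 49.1%, 32.4%, and 51.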

What French Military Doctrine Categories Are Included?

The French portion of the dataset is organized into 12 distinct doctrinal categories, ranging from all-arms field manuals (TTA) to strategic research papers (IRSEM). The TTA category is the largest by page count with 6,370 pages across 33 documents.

Category	Description	Docs	Pages	Rows
tta	Textes Toutes Armes (All-Arms)	33	6,370	12,740
lexicons	Glossaries, AAP-06/15	21	3,477	6,954
tactical	Tactical manuals (INF, GTIA)	20	2,792	5,584
strategic	White papers, strategic reviews	19	1,615	3,230
pia	Joint Publications (Interarmées)	25	1,608	3,216
dia	Joint Doctrine (Interarmées)	23	1,273	2,546
irsem	Strategic research (IRSEM)	22	926	1,852
medot	Operational decision methodology	12	527	1,054
un_manuals	UN peacekeeping (FR)	4	428	856
cahiers_pensee	Military thought journals	7	407	814
ft	FT/RFT Land Forces	5	352	704
modern	Modern doctrine	1	7	14
Total		192	19,782	39,564

The TTA (Textes Toutes Armes) documents are the densest, averaging 193 pages per document. These all-arms manuals form the backbone of French land forces training.

Lexicons and glossaries provide 3,477 pages of standardized military terminology. These are particularly valuable for building domain-specific tokenizers and military vocabulary resources for NLP models.
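The per-category density can be recomputed from the table above. This sketch copies the document and page counts from the table, checks that they add up to the French totals, and ranks categories by pages per document. The names are illustrative.

```python
# category: (documents, pages), copied from the table above
FRENCH_CATEGORIES = {
    "tta": (33, 6370), "lexicons": (21, 3477), "tactical": (20, 2792),
    "strategic": (19, 1615), "pia": (25, 1608), "dia": (23, 1273),
    "irsem": (22, 926), "medot": (12, 527), "un_manuals": (4, 428),
    "cahiers_pensee": (7, 407), "ft": (5, 352), "modern": (1, 7),
}


def pages_per_document(table):
    """Average number of pages per document for each category."""
    return {cat: pages / docs for cat, (docs, pages) in table.items()}


density = pages_per_document(FRENCH_CATEGORIES)
densest = max(density, key=density.get)
total_docs = sum(docs for docs, _ in FRENCH_CATEGORIES.values())
total_pages = sum(pages for _, pages in FRENCH_CATEGORIES.values())

print(densest, round(density[densest]), total_docs, total_pages)
# prints: tta 193 192 19782
```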

What NATO Doctrine Categories Are Included?

The NATO portion contains 185 documents organized into 4 categories. The AJP (Allied Joint Publications) category is the largest by page count with 4,188 pages.

Category	Description	Docs	Pages	Rows
ajp	Allied Joint Publications	47	4,188	8,376
amedp	Allied Medical Publications	93	3,759	7,518
ajmedp	Allied Joint Medical Publications	20	1,088	2,176
other	NATO standards and references	25	454	908
Total		185	9,489	18,978

The 47 AJP documents cover joint operations (AJP-3), logistics (AJP-4), planning (AJP-5), and allied joint doctrine (AJP-01). The medical publications (AMEDP and AJMEDP) add 113 documents and 4,847 pages of specialized medical doctrine relevant for force health protection and CBRN countermeasures.

What Doctrinal Levels Does the Dataset Cover?

This military doctrine dataset covers all four levels of the doctrinal hierarchy: strategic, operational, tactical, and technical. This full-spectrum coverage enables AI models to learn the relationships between high-level strategic intent and ground-level tactical execution.

Doctrinal Level	Key Documents	Document Count
Strategic	Strategic (19), IRSEM (22), AJP-01, AJP-5	41+
Operational	DIA (23), PIA (25), FT (5), AJP-3, AJP-4	53+
Tactical	TTA (33), Tactical (20), MEDOT (12), ATP series	65+
Technical	Lexicons (21), AAP-06, AAP-15	21+

The strategic level includes French white papers and IRSEM research. The operational level pairs French joint doctrine (DIA/PIA) with NATO AJP-3 and AJP-4. The tactical level features TTA all-arms manuals and tactical INF publications. The technical level provides standardized terminology through lexicons and glossaries.

Which Operational Domains Are Represented?

The dataset covers multiple operational domains across both French and NATO documents, including:

Medical and health services: 113 NATO medical publications (AMEDP + AJMEDP) plus French medical doctrine
Joint operations: AJP series and French DIA/PIA joint doctrine (95 documents: 47 AJP, 23 DIA, 25 PIA)
Tactical operations: TTA all-arms manuals and tactical INF publications (53 documents)
Strategic planning: IRSEM research and strategic reviews (41 documents)
Terminology and standards: Lexicons and glossaries (21 documents)

The NATO portion is particularly strong in medical doctrine, with 113 documents covering force health protection, CBRN countermeasures, and allied medical procedures. The French portion excels in tactical and operational doctrine, with the TTA and tactical categories alone providing over 9,000 page images.

What Time Periods Does the Dataset Span?

The dataset primarily contains modern military doctrine, with the majority of documents from the 2010s and 2020s. The corpus reflects current NATO standardization agreements and contemporary French military thinking.

Key temporal characteristics:

Modern era dominance: Most documents reflect post-2010 doctrine, ensuring relevance for current military concepts including cyber operations, multi-domain operations, and hybrid threats
Living doctrine: NATO AJP publications and French DIA/PIA documents are regularly updated, and this dataset captures recent editions
Historical depth: Some foundational texts and lexicons have roots in earlier NATO standardization efforts

Because the corpus is weighted toward current doctrine, models trained on it are exposed mainly to up-to-date military concepts.

How Are Documents Distributed by Size?

The dataset contains 377 documents with a mean of approximately 78 pages per document. Document sizes range from single-page references up to 701-page comprehensive manuals.

Statistic	Value (pages)
Total documents	377
Total pages	29,271
Mean pages/doc	78
Maximum	701

The largest document is Tactique Théorique by Général Yakovleff, at 701 pages (669 extracted after blank pages were filtered out). Document sizes vary significantly by category: TTA documents average 193 pages, while medical publications (AMEDP) average 40 pages.

The diversity of document sizes makes this dataset suitable for various ML tasks: shorter documents work well for context-window-constrained models, while longer documents provide dense material for retrieval-augmented generation pipelines.

What Are the Largest Documents?

The largest document in the dataset is Tactique Théorique by Général Michel Yakovleff at 701 pages (669 pages extracted after filtering blank pages). This foundational text on tactical theory is one of the most cited works in French military education.

The TTA (Textes Toutes Armes) category contains the densest documents, with several exceeding 400 pages. These all-arms manuals include:

Core training documents used throughout the French Army
Comprehensive tactical and operational references
Standardized procedures for all-arms operations

INF 202 (Infantry Section Manual) and TTA 150 (General Knowledge) are examples of such core training documents. Their inclusion makes the dataset representative of the texts that French military personnel study during their careers.

How Does This Dataset Compare to Other Military NLP Datasets?

This NATO doctrine dataset is distinguished by three characteristics rarely found together in defense datasets: visual page images, bilingual queries, and institutional source diversity.

Most existing military NLP corpora are text-only and monolingual. This dataset provides page images with bilingual queries, enabling multimodal and cross-lingual research.

Feature	This Dataset	Typical Military Corpora
Format	Page images + queries	Text only
Languages	Bilingual queries (FR + EN)	Monolingual (EN only)
Documents	377	50-200
Page images	29,271	N/A
Total rows	58,542	5,000-15,000
Doctrinal levels	4 (Strategic to Technical)	1-2
Size	12.93 GB	< 500 MB
Train/test split	Yes (90/10 by document)	Often missing

The bilingual query structure is particularly valuable. It enables cross-lingual retrieval research: can a model retrieve French military documents using English queries? This opens research avenues that monolingual or text-only datasets cannot support.

For researchers building defense AI systems, this dataset provides both visual document understanding capabilities and cross-lingual retrieval benchmarks.

What Are the Best Use Cases for AI and Machine Learning?

This military doctrine dataset is purpose-built for visual document understanding and retrieval tasks. Its page images with bilingual queries make it uniquely suited for multimodal AI applications.

1. Visual Document Retrieval

The primary use case: given a text query, retrieve the most relevant page image. With 58,542 query-image pairs (29,271 pages × 2 languages), the dataset enables training and evaluation of vision-language retrieval models in the military domain.
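One common way to score this task is recall@k: the fraction of queries whose source page appears among the top k retrieved pages. The sketch below assumes each query has exactly one relevant page, namely the page it was generated from. The article does not specify an evaluation protocol, so this is one reasonable choice rather than an official metric. The rankings would come from whatever vision-language model is being evaluated, and the identifiers are placeholders.

```python
from typing import Dict, Sequence


def recall_at_k(ranked_page_ids: Sequence[str], relevant_page_id: str, k: int) -> float:
    """1.0 if the relevant page is among the top-k results, else 0.0."""
    return 1.0 if relevant_page_id in ranked_page_ids[:k] else 0.0


def mean_recall_at_k(
    rankings: Dict[str, Sequence[str]],  # query id -> page ids, best first
    gold: Dict[str, str],                # query id -> the page it was generated from
    k: int,
) -> float:
    """Average recall@k over all queries that have a gold page."""
    if not gold:
        return 0.0
    scores = [recall_at_k(rankings[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)
```

For example, with two queries whose gold pages are ranked second and third, `mean_recall_at_k` returns 0.0 at k=1, 0.5 at k=2, and 1.0 at k=3.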

2. Visual Question Answering (VQA)

Each page image is paired with AI-generated queries describing the page content. This structure supports training VQA models that can answer questions about military doctrine documents based on their visual appearance.

3. Cross-Lingual Document Retrieval

Every page has both French and English queries, enabling research on cross-lingual retrieval: retrieve French military documents using English queries, or vice versa.
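A cross-lingual evaluation set can be carved out by keeping only rows whose query language differs from the language of the source document. The sketch below uses the documented `language`, `query_fr`, and `query_en` columns. The schema shown in this article does not include a column that names the source document's language, so the caller supplies a `source_lang_of` function. The `folder` values used in the usage note are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]


def cross_lingual_pairs(
    rows: Iterable[Row],
    source_lang_of: Callable[[Row], str],  # caller-supplied: "fr" or "en"
) -> Iterator[Tuple[object, Row]]:
    """Yield (query, row) pairs where the query language differs from the
    language of the source document."""
    for row in rows:
        query_lang = row["language"]  # "fr" or "en" per the schema
        if query_lang == source_lang_of(row):
            continue  # same-language pair, not cross-lingual
        yield row[f"query_{query_lang}"], row
```

A caller might pass something like `lambda r: "fr" if r["folder"] == "french" else "en"`, after checking what values the `folder` column actually holds.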

4. Document Layout Understanding

The page images preserve the visual structure of military documents—tables, diagrams, hierarchical formatting, and multi-column layouts. This supports research on document layout analysis and structure extraction.

5. Military Domain Embeddings

The dataset can be used to train specialized embedding models for military terminology and concepts. The 16 document categories (12 French, 4 NATO) provide natural clusters for evaluation.

6. RAG Systems for Military Q&A

The dataset’s organization into clear doctrinal categories makes it suitable for building retrieval-augmented generation systems. The bilingual queries enable multilingual RAG pipelines.

How Was This Defense Training Data Collected?

The dataset was built in two stages: PDF collection and parquet conversion.

Stage 1: PDF Collection

Source PDFs were collected from institutional websites using Python 3 with Requests and BeautifulSoup4. Every PDF was validated using magic bytes verification.
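Magic bytes verification means checking a file's leading bytes instead of trusting its extension or the server's content type. A minimal sketch of such a check for PDFs follows. The exact check used for this dataset is not described, so this is an illustration rather than the original code, and the function names are hypothetical.

```python
PDF_MAGIC = b"%PDF-"  # a PDF file header begins with this signature


def is_pdf_bytes(head: bytes) -> bool:
    """True if the given leading bytes carry the PDF signature."""
    return head.startswith(PDF_MAGIC)


def is_pdf_file(path: str) -> bool:
    """Read only the first few bytes of a file and test them."""
    with open(path, "rb") as fh:
        return is_pdf_bytes(fh.read(len(PDF_MAGIC)))
```

This strict version rejects HTML error pages saved with a `.pdf` extension, a common failure mode when scraping. It would also reject the rare PDF that has stray bytes before the header.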

French sources include defense.gouv.fr/cicde, c-dec.terre.defense.gouv.fr, irsem.fr, and asso-minerve.fr. NATO sources include gov.uk and coemed.org. UN peacekeeping documents were also included.

Stage 2: Parquet Conversion

The collected PDFs were converted to a Hugging Face-compatible parquet format:

Page rendering: Each PDF page was converted to a JPEG image
Query generation: AI-generated queries were created for each page in both French and English
Bilingual duplication: Each page appears twice in the dataset (once with language=fr, once with language=en)
Train/test split: Documents were split 90/10 by document (not by page), resulting in 341 train documents and 36 test documents

The final parquet files total 12.93 GB (11.87 GB train + 1.06 GB test), with page images stored as JPEG binary in the image column.
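Splitting by document rather than by page matters because pages from one document share layout and vocabulary, so a page-level split would leak near-duplicates into the test set. A simple check that the two splits share no documents, using the `doc_id` column (the function names are illustrative):

```python
from typing import Dict, Iterable, Set

Row = Dict[str, object]


def document_ids(rows: Iterable[Row]) -> Set[object]:
    """Collect the distinct doc_id values in a split."""
    return {row["doc_id"] for row in rows}


def leaked_documents(train_rows: Iterable[Row], test_rows: Iterable[Row]) -> Set[object]:
    """Documents present in both splits; empty for a clean document-level split."""
    return document_ids(train_rows) & document_ids(test_rows)
```

On the real data, comparing `set(ds["train"]["doc_id"])` with `set(ds["test"]["doc_id"])` gives the same answer without decoding any images. The two sets should contain 341 and 36 documents and have an empty intersection.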

Citation

When using this dataset, please cite:

NATO & French Military Doctrine Dataset (2026). Sources: NATO, French Ministry of Armed Forces, IRSEM, UN Peacekeeping. 377 documents, 29,271 page images, 58,542 rows with bilingual queries.

