NATO & French Military Doctrine Dataset: 377 Documents, 29,271 Pages for Visual Document Retrieval

Racine AI

The NATO & French Military Doctrine Dataset is a visual document retrieval corpus of 377 documents totaling 29,271 page images, each paired with AI-generated queries in French and English and distributed as 12.93 GB of parquet files. It is designed for document retrieval and visual question answering (VQA) tasks in the military domain.

This military doctrine dataset draws from authoritative sources: NATO, the French Ministry of Armed Forces, IRSEM (Institute for Strategic Research), the United Nations, and COEMED (Centre of Excellence for Military Medicine). It covers the full doctrinal hierarchy, from strategic white papers down to tactical field manuals.

Each page image is paired with AI-generated queries in both French and English, giving 58,542 total rows (each page appears twice, once per language). This structure makes the dataset well suited to cross-lingual document retrieval, visual question answering, and bilingual defense NLP pipelines.
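To make the row structure concrete, here is a minimal sketch of how one page could expand into two rows. The column names `query_fr`, `query_en`, `language`, and `query` come from the dataset schema; the `expand_page` helper, the sample strings, and the assumption that `query` holds the query written in the row's own `language` are illustrative only.

```python
def expand_page(doc_id: str, page_num: int, query_fr: str, query_en: str) -> list[dict]:
    """Return two rows for one page image, one per query language."""
    rows = []
    for language in ("fr", "en"):
        rows.append({
            "doc_id": doc_id,
            "page_num": page_num,
            "query_fr": query_fr,
            "query_en": query_en,
            "language": language,
            # Assumption: `query` mirrors the query written in `language`.
            "query": query_fr if language == "fr" else query_en,
        })
    return rows


# Hypothetical page, for illustration only.
rows = expand_page(
    "example_doc",
    1,
    "Quelle est la mission de la section ?",
    "What is the section's mission?",
)

# Every unique page contributes exactly two rows.
total_rows = 29_271 * 2  # 58,542
```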

What Are the Key Statistics of This Military Doctrine Dataset?

This defense training data corpus contains 377 documents spanning 29,271 unique page images in a 12.93 GB parquet format.

| Metric | Value |
| --- | --- |
| Total documents | 377 |
| Unique page images | 29,271 |
| Total rows | 58,542 (each page × 2 languages) |
| Total size | 12.93 GB (parquet with images) |
| Train split | 53,114 rows (341 documents, 26,557 pages) |
| Test split | 5,428 rows (36 documents, 2,714 pages) |
| Languages | Bilingual queries (FR + EN per page) |
| French documents | 192 (50.9%) |
| NATO documents | 185 (49.1%) |
| French pages | 19,782 |
| NATO pages | 9,489 |
| French categories | 12 |
| NATO categories | 4 |

How Are the Documents Distributed by Language?

French documents slightly outnumber NATO documents at 192 (50.9%) versus 185 (49.1%). However, French pages represent 67.6% of the total page count.

| Source | Documents | Pages | Rows | % of Docs |
| --- | --- | --- | --- | --- |
| French | 192 | 19,782 | 39,564 | 50.9% |
| NATO | 185 | 9,489 | 18,978 | 49.1% |
| Total | 377 | 29,271 | 58,542 | 100% |

French documents are about twice as long on average (103 pages versus 51 for NATO documents), which explains their larger share of pages. Note that “bilingual” refers to the AI-generated queries: each page image has both a French query and an English query, regardless of the source document’s original language.
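The averages and shares quoted in this section follow directly from the document and page counts. A short sketch of the arithmetic, using only the figures from the table (the variable names are illustrative):

```python
sources = {
    "French": {"docs": 192, "pages": 19_782},
    "NATO": {"docs": 185, "pages": 9_489},
}

total_docs = sum(s["docs"] for s in sources.values())    # 377
total_pages = sum(s["pages"] for s in sources.values())  # 29,271

for name, s in sources.items():
    avg_pages = s["pages"] / s["docs"]
    doc_share = 100 * s["docs"] / total_docs
    page_share = 100 * s["pages"] / total_pages
    print(f"{name}: {avg_pages:.0f} pages/doc, "
          f"{doc_share:.1f}% of docs, {page_share:.1f}% of pages")

# French: 103 pages/doc, 50.9% of docs, 67.6% of pages
# NATO: 51 pages/doc, 49.1% of docs, 32.4% of pages
```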

What French Military Doctrine Categories Are Included?

The French portion of the dataset is organized into 12 distinct doctrinal categories, ranging from all-arms field manuals (TTA) to strategic research papers (IRSEM). The TTA category is the largest by page count with 6,370 pages across 33 documents.

| Category | Description | Docs | Pages | Rows |
| --- | --- | --- | --- | --- |
| tta | Textes Toutes Armes (All-Arms) | 33 | 6,370 | 12,740 |
| lexicons | Glossaries, AAP-06/15 | 21 | 3,477 | 6,954 |
| tactical | Tactical manuals (INF, GTIA) | 20 | 2,792 | 5,584 |
| strategic | White papers, strategic reviews | 19 | 1,615 | 3,230 |
| pia | Joint Publications (Interarmées) | 25 | 1,608 | 3,216 |
| dia | Joint Doctrine (Interarmées) | 23 | 1,273 | 2,546 |
| irsem | Strategic research (IRSEM) | 22 | 926 | 1,852 |
| medot | Operational decision methodology | 12 | 527 | 1,054 |
| un_manuals | UN peacekeeping (FR) | 4 | 428 | 856 |
| cahiers_pensee | Military thought journals | 7 | 407 | 814 |
| ft | FT/RFT Land Forces | 5 | 352 | 704 |
| modern | Modern doctrine | 1 | 7 | 14 |
| Total | | 192 | 19,782 | 39,564 |

The TTA (Textes Toutes Armes) documents are the longest on average, at 193 pages per document. These all-arms manuals form the backbone of French land forces training.

Lexicons and glossaries provide 3,477 pages of standardized military terminology. These are particularly valuable for building domain-specific tokenizers and military vocabulary resources for NLP models.

What NATO Doctrine Categories Are Included?

The NATO portion contains 185 documents organized into 4 categories. The AJP (Allied Joint Publications) category is the largest by page count with 4,188 pages.

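| Category | Description | Docs | Pages | Rows |
| --- | --- | --- | --- | --- |
| ajp | Allied Joint Publications | 47 | 4,188 | 8,376 |
| amedp | Allied Medical Publications | 93 | 3,759 | 7,518 |
| ajmedp | Allied Joint Medical Publications | 20 | 1,088 | 2,176 |
| other | NATO standards and references | 25 | 454 | 908 |
| Total | | 185 | 9,489 | 18,978 |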

The 47 AJP documents cover joint operations (AJP-3), logistics (AJP-4), planning (AJP-5), and allied joint doctrine (AJP-01). The medical publications (AMEDP and AJMEDP) add 113 documents and 4,847 pages of specialized medical doctrine relevant for force health protection and CBRN countermeasures.

What Doctrinal Levels Does the Dataset Cover?

This military doctrine dataset covers all four levels of the doctrinal hierarchy: strategic, operational, tactical, and technical. This full-spectrum coverage enables AI models to learn the relationships between high-level strategic intent and ground-level tactical execution.

| Doctrinal Level | Key Documents | Document Count |
| --- | --- | --- |
| Strategic | Strategic (19), IRSEM (22), AJP-01, AJP-5 | 41+ |
| Operational | DIA (23), PIA (25), FT (5), AJP-3, AJP-4 | 53+ |
| Tactical | TTA (33), Tactical (20), MEDOT (12), ATP series | 65+ |
| Technical | Lexicons (21), AAP-06, AAP-15 | 21+ |

The strategic level includes French white papers and IRSEM research. The operational level pairs French joint doctrine (DIA/PIA) with NATO AJP-3 and AJP-4. The tactical level features TTA all-arms manuals and tactical INF publications. The technical level provides standardized terminology through lexicons and glossaries.

Which Operational Domains Are Represented?

The dataset covers multiple operational domains across both French and NATO documents, including:

  • Medical and health services: 113 NATO medical publications (AMEDP + AJMEDP) plus French medical doctrine
  • Joint operations: AJP series and French DIA/PIA joint doctrine (95 documents: 47 AJP, 23 DIA, 25 PIA)
  • Tactical operations: TTA all-arms manuals and tactical INF publications (53 documents)
  • Strategic planning: IRSEM research and strategic reviews (41 documents)
  • Terminology and standards: Lexicons and glossaries (21 documents)

The NATO portion is particularly strong in medical doctrine, with 113 documents covering force health protection, CBRN countermeasures, and allied medical procedures. The French portion excels in tactical and operational doctrine, with the TTA and tactical categories alone providing over 9,000 page images.

What Time Periods Does the Dataset Span?

The dataset primarily contains modern military doctrine, with the majority of documents from the 2010s and 2020s. The corpus reflects current NATO standardization agreements and contemporary French military thinking.

Key temporal characteristics:

  • Modern era dominance: Most documents reflect post-2010 doctrine, ensuring relevance for current military concepts including cyber operations, multi-domain operations, and hybrid threats
  • Living doctrine: NATO AJP publications and French DIA/PIA documents are regularly updated, and this dataset captures recent editions
  • Historical depth: Some foundational texts and lexicons have roots in earlier NATO standardization efforts

The heavy weighting toward current doctrine means that AI models trained on this data are exposed mainly to up-to-date military concepts rather than superseded editions.

How Are Documents Distributed by Size?

The dataset contains 377 documents with a mean of approximately 78 pages per document. Document sizes range from single-page references up to 701-page comprehensive manuals.

| Statistic | Value (pages) |
| --- | --- |
| Total documents | 377 |
| Total pages | 29,271 |
| Mean pages/doc | 78 |
| Maximum | 701 |

Document sizes vary significantly by category: TTA documents average 193 pages while medical publications (AMEDP) average 40 pages.

The diversity of document sizes makes this dataset suitable for various ML tasks: shorter documents work well for context-window-constrained models, while longer documents provide dense material for retrieval-augmented generation pipelines.

What Are the Largest Documents?

The largest document in the dataset is Tactique Théorique by Général Michel Yakovleff at 701 pages (669 pages extracted after filtering blank pages). This foundational text on tactical theory is one of the most cited works in French military education.

The TTA (Textes Toutes Armes) category contains the longest documents, with several exceeding 400 pages. These all-arms manuals include:

  • Core training documents used throughout the French Army
  • Comprehensive tactical and operational references
  • Standardized procedures for all-arms operations

The INF 202 (Infantry Section Manual) and TTA 150 (General Knowledge) are examples of core training documents whose inclusion makes this dataset directly representative of the texts that French military personnel study during their careers.

How Does This Dataset Compare to Other Military NLP Datasets?

This NATO doctrine dataset is distinguished by three characteristics rarely found together in defense datasets: visual page images, bilingual queries, and institutional source diversity.

Most existing military NLP corpora are text-only and monolingual. This dataset provides page images with bilingual queries, enabling multimodal and cross-lingual research.

| Feature | This Dataset | Typical Military Corpora |
| --- | --- | --- |
| Format | Page images + queries | Text only |
| Languages | Bilingual queries (FR + EN) | Monolingual (EN only) |
| Documents | 377 | 50-200 |
| Page images | 29,271 | N/A |
| Total rows | 58,542 | 5,000-15,000 |
| Doctrinal levels | 4 (Strategic to Technical) | 1-2 |
| Size | 12.93 GB | < 500 MB |
| Train/test split | Yes (90/10 by document) | Often missing |

The bilingual query structure is particularly valuable. It enables cross-lingual retrieval research: can a model retrieve French military documents using English queries? This opens research avenues that monolingual or text-only datasets cannot support.

For researchers building defense AI systems, this dataset provides both visual document understanding capabilities and cross-lingual retrieval benchmarks.

What Are the Best Use Cases for AI and Machine Learning?

This military doctrine dataset is purpose-built for visual document understanding and retrieval tasks. Its page images with bilingual queries make it uniquely suited for multimodal AI applications.

1. Visual Document Retrieval

The primary use case: given a text query, retrieve the most relevant page image. With 58,542 query-image pairs (29,271 pages × 2 languages), the dataset enables training and evaluation of vision-language retrieval models in the military domain.
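One common way to score this task is recall@k: the fraction of queries whose gold page appears among the k most similar pages. The sketch below assumes query and page embeddings have already been produced by some vision-language model. The metric choice, the function names, and the toy 2-D vectors are illustrative and not prescribed by the dataset.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, page_vecs, gold_page_ids, k: int) -> float:
    """Fraction of queries whose gold page is among the k most similar pages.

    query_vecs:    list of query embeddings
    page_vecs:     dict mapping page id -> page embedding
    gold_page_ids: gold page id for each query, aligned with query_vecs
    """
    hits = 0
    for q, gold in zip(query_vecs, gold_page_ids):
        ranked = sorted(page_vecs, key=lambda pid: cosine(q, page_vecs[pid]), reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy 2-D embeddings standing in for real model outputs.
page_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
query_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
gold = ["p1", "p2", "p3"]

r1 = recall_at_k(query_vecs, page_vecs, gold, k=1)  # third query ranks p1 first, so 2 of 3 hit
r2 = recall_at_k(query_vecs, page_vecs, gold, k=2)  # p3 is second for the third query, so all hit
```

At the scale of the full corpus a brute-force sort per query becomes slow, so a vectorized or approximate nearest-neighbor search would be the practical choice.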

2. Visual Question Answering (VQA)

Each page image is paired with AI-generated queries describing the page content. This structure supports training VQA models that can answer questions about military doctrine documents based on their visual appearance.

3. Cross-Lingual Document Retrieval

Every page has both French and English queries, enabling research on cross-lingual retrieval: retrieve French military documents using English queries, or vice versa.
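A cross-lingual evaluation set can be built by keeping only the rows whose query language differs from the language of the source document. The sketch below uses the `language` column from the schema. How to tell a French source from a NATO one depends on the values stored in `folder` or `subfolder`, which are not documented here, so the `SOURCE_LANGUAGE` mapping and the sample rows are hypothetical.

```python
# Hypothetical mapping from `folder` values to source-document language.
SOURCE_LANGUAGE = {"french": "fr", "nato": "en"}


def source_language_of(row: dict) -> str:
    """Look up the source document's language (assumed to be derivable from `folder`)."""
    return SOURCE_LANGUAGE[row["folder"]]


def cross_lingual_rows(rows: list[dict]) -> list[dict]:
    """Keep rows where the query language differs from the source language."""
    return [r for r in rows if r["language"] != source_language_of(r)]


# Toy rows: one French page and one NATO page, each present in both query languages.
rows = [
    {"doc_id": "a", "folder": "french", "language": "fr"},
    {"doc_id": "a", "folder": "french", "language": "en"},
    {"doc_id": "b", "folder": "nato", "language": "fr"},
    {"doc_id": "b", "folder": "nato", "language": "en"},
]

cross = cross_lingual_rows(rows)
# Keeps the English query on the French page and the French query on the NATO page.
```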

4. Document Layout Understanding

The page images preserve the visual structure of military documents—tables, diagrams, hierarchical formatting, and multi-column layouts. This supports research on document layout analysis and structure extraction.

5. Military Domain Embeddings

The dataset can train specialized embedding models that understand military terminology and concepts. The 16 document categories provide natural clustering for evaluation.

6. RAG Systems for Military Q&A

The dataset’s organization into clear doctrinal categories makes it suitable for building retrieval-augmented generation systems. The bilingual queries enable multilingual RAG pipelines.

How Was This Defense Training Data Collected?

The dataset was built in two stages: PDF collection and parquet conversion.

Stage 1: PDF Collection

Source PDFs were collected from institutional websites using Python 3 with Requests and BeautifulSoup4. Every PDF was validated using magic bytes verification.
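The validation code itself is not shown in this post. As a minimal sketch of what a magic-bytes check can look like: PDF files begin with the signature `%PDF-`, so comparing the first five bytes catches HTML error pages and other non-PDF payloads saved under a `.pdf` name. The function names are illustrative.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """True when the payload starts with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def is_pdf_file(path: Path) -> bool:
    """Read only the first bytes of a file and check them against the signature."""
    with path.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


ok = looks_like_pdf(b"%PDF-1.7\n...")        # a real PDF header
bad = looks_like_pdf(b"<!DOCTYPE html>...")  # an HTML error page served at a .pdf URL
```

This strict check rejects files with stray bytes before the header, which some PDF readers tolerate. Whether to accept those is a design choice for the collection pipeline.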

French sources include defense.gouv.fr/cicde, c-dec.terre.defense.gouv.fr, irsem.fr, and asso-minerve.fr. NATO sources include gov.uk and coemed.org. UN peacekeeping documents were also included.

Stage 2: Parquet Conversion

The collected PDFs were converted to a Hugging Face-compatible parquet format:

  1. Page rendering: Each PDF page was converted to a JPEG image
  2. Query generation: AI-generated queries were created for each page in both French and English
  3. Bilingual duplication: Each page appears twice in the dataset (once with language=fr, once with language=en)
  4. Train/test split: Documents were split 90/10 by document (not by page), resulting in 341 train documents and 36 test documents
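Splitting by document rather than by page keeps every page of a document in the same split, so near-identical pages from one manual never leak from train into test. The sketch below shows one way to do this on `doc_id`. The exact procedure and random seed used for this dataset are not stated, so this will not necessarily reproduce the 341/36 split, and the toy rows are hypothetical.

```python
import random


def split_by_document(rows, test_fraction=0.1, seed=0):
    """Assign whole documents to train or test, then route rows accordingly."""
    doc_ids = sorted({r["doc_id"] for r in rows})  # sort for reproducibility
    rng = random.Random(seed)
    rng.shuffle(doc_ids)
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])
    train = [r for r in rows if r["doc_id"] not in test_docs]
    test = [r for r in rows if r["doc_id"] in test_docs]
    return train, test


# Toy corpus: 10 hypothetical documents, 3 pages each, 2 query languages per page.
rows = [
    {"doc_id": f"doc{d}", "page_num": p, "language": lang}
    for d in range(10)
    for p in range(1, 4)
    for lang in ("fr", "en")
]

train, test = split_by_document(rows)
# One document (6 rows) lands in test; the other nine (54 rows) stay in train.
```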

The final parquet files total 12.93 GB (11.87 GB train + 1.06 GB test), with page images stored as JPEG binary in the image column.

Citation

When using this dataset, please cite:

NATO & French Military Doctrine Dataset (2026). Sources: NATO, French Ministry of Armed Forces, IRSEM, UN Peacekeeping. 377 documents, 29,271 page images, 58,542 rows with bilingual queries.

Common questions

How many documents are in the NATO & French Military Doctrine Dataset?

The dataset contains **377 documents** totaling **29,271 unique page images** and **58,542 rows** (each page appears twice, once per language). The parquet files total **12.93 GB**.

What is the dataset format?

The dataset is provided in **Hugging Face parquet format** with two splits: train (53,114 rows, 341 documents) and test (5,428 rows, 36 documents). Each row contains a page image (JPEG binary) and bilingual queries.

What languages are covered?

The dataset has **bilingual queries** — every page image has both a French query (`query_fr`) and an English query (`query_en`). The source documents are 192 French documents and 185 NATO (English) documents.

What is the schema?

The dataset has 14 columns: `id`, `doc_id`, `page_num`, `total_pages`, `folder`, `subfolder`, `filename`, `source_path`, `query_fr`, `query_en`, `language`, `query`, `created_at`, and `image` (JPEG binary).

What NATO publications are included?

The NATO portion includes **47 Allied Joint Publications (AJP)**, **93 Allied Medical Publications (AMEDP)**, **20 Allied Joint Medical Publications (AJMEDP)**, and **25 other NATO standards**. Key series include AJP-01, AJP-3, AJP-4, and AJP-5.

What French military doctrine categories are included?

The French portion has **12 categories**: TTA (33 docs), lexicons (21 docs), tactical (20 docs), PIA (25 docs), strategic (19 docs), DIA (23 docs), IRSEM (22 docs), MEDOT (12 docs), UN manuals (4 docs), cahiers_pensee (7 docs), FT (5 docs), and modern (1 doc).

How do I load this dataset?

Use the Hugging Face datasets library: `from datasets import load_dataset; ds = load_dataset("parquet", data_dir="data/")`. The train split contains 53,114 rows and the test split 5,428 rows.

What are the best use cases?

The dataset is designed for **visual document retrieval** and **visual question answering**. Given a text query, retrieve the most relevant page image. The bilingual queries also enable **cross-lingual retrieval** research.
