The NATO & French Military Doctrine Dataset is a visual document retrieval corpus of 377 documents totaling 29,271 page images with bilingual AI-generated queries in a 12.93 GB parquet format. It is designed for document retrieval and visual question answering (VQA) tasks in the military domain.
This military doctrine dataset draws from authoritative sources: NATO, the French Ministry of Armed Forces, IRSEM (Institute for Strategic Research), the United Nations, and COEMED (Centre of Excellence for Military Medicine). It covers the full spectrum of doctrinal hierarchy, from strategic white papers down to tactical field manuals.
Each page image is paired with AI-generated queries in both French and English, giving 58,542 rows in total (each page appears twice, once per language). This structure makes the dataset well suited to cross-lingual document retrieval, visual question answering, and bilingual defense NLP pipelines.
What Are the Key Statistics of This Military Doctrine Dataset?
This defense training data corpus contains 377 documents spanning 29,271 unique page images in a 12.93 GB parquet format.
| Metric | Value |
|---|---|
| Total documents | 377 |
| Unique page images | 29,271 |
| Total rows | 58,542 (each page × 2 languages) |
| Total size | 12.93 GB (parquet with images) |
| Train split | 53,114 rows (341 documents, 26,557 pages) |
| Test split | 5,428 rows (36 documents, 2,714 pages) |
| Languages | Bilingual queries (FR + EN per page) |
| French documents | 192 (50.9%) |
| NATO documents | 185 (49.1%) |
| French pages | 19,782 |
| NATO pages | 9,489 |
| French categories | 12 |
| NATO categories | 4 |
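The figures in this table can be cross-checked with simple arithmetic. The sketch below hard-codes the values from the table; the dictionary layout and function name are illustrative only and are not part of the dataset.

```python
# Headline figures copied from the statistics table above.
stats = {
    "docs": 377,
    "pages": 29_271,
    "rows": 58_542,
    "languages": 2,  # one French and one English query per page
    "train": {"rows": 53_114, "docs": 341, "pages": 26_557},
    "test": {"rows": 5_428, "docs": 36, "pages": 2_714},
}


def check_consistency(s):
    """Return named checks, each True when the published figures agree."""
    train, test = s["train"], s["test"]
    return {
        "rows_are_pages_times_languages": s["rows"] == s["pages"] * s["languages"],
        "split_rows_sum_to_total": train["rows"] + test["rows"] == s["rows"],
        "split_pages_sum_to_total": train["pages"] + test["pages"] == s["pages"],
        "split_docs_sum_to_total": train["docs"] + test["docs"] == s["docs"],
        "train_rows_match_pages": train["rows"] == train["pages"] * s["languages"],
        "test_rows_match_pages": test["rows"] == test["pages"] * s["languages"],
    }


checks = check_consistency(stats)
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Every check passes for the published numbers: the row count is exactly twice the page count, and the train and test splits add up to the full corpus.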
How Are the Documents Distributed by Language?
French documents slightly outnumber NATO documents at 192 (50.9%) versus 185 (49.1%). However, French pages represent 67.6% of the total page count.
| Source | Documents | Pages | Rows | % of Docs |
|---|---|---|---|---|
| French | 192 | 19,782 | 39,564 | 50.9% |
| NATO | 185 | 9,489 | 18,978 | 49.1% |
| Total | 377 | 29,271 | 58,542 | 100% |
The average French document is roughly twice as long (103 pages) as the average NATO document (51 pages). Note that “bilingual” refers to the AI-generated queries: each page image has both a French query and an English query, regardless of the source document’s original language.
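The shares and averages quoted in this section follow directly from the table. As a minimal sketch (the counts are copied from the table; the names `sources` and `summarize` are illustrative):

```python
# Document and page counts per source, copied from the table above.
sources = {
    "French": {"docs": 192, "pages": 19_782},
    "NATO": {"docs": 185, "pages": 9_489},
}

total_docs = sum(s["docs"] for s in sources.values())
total_pages = sum(s["pages"] for s in sources.values())


def summarize(name):
    """Compute document share, page share, and mean document length."""
    s = sources[name]
    return {
        "doc_share_pct": round(100 * s["docs"] / total_docs, 1),
        "page_share_pct": round(100 * s["pages"] / total_pages, 1),
        "avg_pages_per_doc": round(s["pages"] / s["docs"]),
    }


for name in sources:
    print(name, summarize(name))
```

The gap between document share and page share comes entirely from document length: the two sources contribute nearly equal document counts, but French documents are about twice as long on average.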
What French Military Doctrine Categories Are Included?
The French portion of the dataset is organized into 12 distinct doctrinal categories, ranging from all-arms field manuals (TTA) to strategic research papers (IRSEM). The TTA category is the largest by page count with 6,370 pages across 33 documents.
| Category | Description | Docs | Pages | Rows |
|---|---|---|---|---|
| tta | Textes Toutes Armes (All-Arms) | 33 | 6,370 | 12,740 |
| lexicons | Glossaries, AAP-06/15 | 21 | 3,477 | 6,954 |
| tactical | Tactical manuals (INF, GTIA) | 20 | 2,792 | 5,584 |
| strategic | White papers, strategic reviews | 19 | 1,615 | 3,230 |
| pia | Joint Publications (Interarmées) | 25 | 1,608 | 3,216 |
| dia | Joint Doctrine (Interarmées) | 23 | 1,273 | 2,546 |
| irsem | Strategic research (IRSEM) | 22 | 926 | 1,852 |
| medot | Operational decision methodology | 12 | 527 | 1,054 |
| un_manuals | UN peacekeeping (FR) | 4 | 428 | 856 |
| cahiers_pensee | Military thought journals | 7 | 407 | 814 |
| ft | FT/RFT Land Forces | 5 | 352 | 704 |
| modern | Modern doctrine | 1 | 7 | 14 |
| Total | | 192 | 19,782 | 39,564 |
The TTA (Textes Toutes Armes) documents are the densest, averaging 193 pages per document. These all-arms manuals form the backbone of French land forces training.
Lexicons and glossaries provide 3,477 pages of standardized military terminology. These are particularly valuable for building domain-specific tokenizers and military vocabulary resources for NLP models.
What NATO Doctrine Categories Are Included?
The NATO portion contains 185 documents organized into 4 categories. The AJP (Allied Joint Publications) category is the largest by page count with 4,188 pages.
| Category | Description | Docs | Pages | Rows |
|---|---|---|---|---|
| ajp | Allied Joint Publications | 47 | 4,188 | 8,376 |
| amedp | Allied Medical Publications | 93 | 3,759 | 7,518 |
| ajmedp | Allied Joint Medical Publications | 20 | 1,088 | 2,176 |
| other | NATO standards and references | 25 | 454 | 908 |
| Total | | 185 | 9,489 | 18,978 |
The 47 AJP documents cover joint operations (AJP-3), logistics (AJP-4), planning (AJP-5), and allied joint doctrine (AJP-01). The medical publications (AMEDP and AJMEDP) add 113 documents and 4,847 pages of specialized medical doctrine relevant for force health protection and CBRN countermeasures.
What Doctrinal Levels Does the Dataset Cover?
This military doctrine dataset covers all four levels of the doctrinal hierarchy: strategic, operational, tactical, and technical. This full-spectrum coverage enables AI models to learn the relationships between high-level strategic intent and ground-level tactical execution.
| Doctrinal Level | Key Documents | Document Count |
|---|---|---|
| Strategic | Strategic (19), IRSEM (22), AJP-01, AJP-5 | 41+ |
| Operational | DIA (23), PIA (25), FT (5), AJP-3, AJP-4 | 53+ |
| Tactical | TTA (33), Tactical (20), MEDOT (12), ATP series | 65+ |
| Technical | Lexicons (21), AAP-06, AAP-15 | 21+ |
The strategic level includes French white papers and IRSEM research. The operational level pairs French joint doctrine (DIA/PIA) with NATO AJP-3 and AJP-4. The tactical level features TTA all-arms manuals and tactical INF publications. The technical level provides standardized terminology through lexicons and glossaries.
Which Operational Domains Are Represented?
The dataset covers multiple operational domains across both French and NATO documents, including:
- Medical and health services: 113 NATO medical publications (AMEDP + AJMEDP) plus French medical doctrine
- Joint operations: AJP series and French DIA/PIA joint doctrine (95 documents: 47 AJP, 23 DIA, 25 PIA)
- Tactical operations: TTA all-arms manuals and tactical INF publications (53 documents)
- Strategic planning: IRSEM research and strategic reviews (41 documents)
- Terminology and standards: Lexicons and glossaries (21 documents)
The NATO portion is particularly strong in medical doctrine, with 113 documents covering force health protection, CBRN countermeasures, and allied medical procedures. The French portion excels in tactical and operational doctrine, with the TTA and tactical categories alone providing over 9,000 page images.
What Time Periods Does the Dataset Span?
The dataset primarily contains modern military doctrine, with the majority of documents from the 2010s and 2020s. The corpus reflects current NATO standardization agreements and contemporary French military thinking.
Key temporal characteristics:
- Modern era dominance: Most documents reflect post-2010 doctrine, ensuring relevance for current military concepts including cyber operations, multi-domain operations, and hybrid threats
- Living doctrine: NATO AJP publications and French DIA/PIA documents are regularly updated, and this dataset captures recent editions
- Historical depth: Some foundational texts and lexicons have roots in earlier NATO standardization efforts
Because the corpus is weighted heavily toward current doctrine, AI models trained on it are exposed mainly to up-to-date military concepts rather than superseded ones.
How Are Documents Distributed by Size?
The dataset contains 377 documents with a mean of approximately 78 pages per document. Document sizes range from single-page references up to 701-page comprehensive manuals.
| Statistic | Value (pages) |
|---|---|
| Total documents | 377 |
| Total pages | 29,271 |
| Mean pages/doc | 78 |
| Maximum | 701 |
The largest document is Tactique Théorique by Général Yakovleff with 701 pages (669 extracted). Document sizes vary significantly by category: TTA documents average 193 pages, while medical publications (AMEDP) average 40 pages.
The diversity of document sizes makes this dataset suitable for various ML tasks: shorter documents work well for context-window-constrained models, while longer documents provide dense material for retrieval-augmented generation pipelines.
What Are the Largest Documents?
The largest document in the dataset is Tactique Théorique by Général Michel Yakovleff at 701 pages (669 pages extracted after filtering blank pages). This foundational text on tactical theory is one of the most cited works in French military education.
The TTA (Textes Toutes Armes) category contains the densest documents, with several exceeding 400 pages. These all-arms manuals form the backbone of French land forces training and include:
- Core training documents used throughout the French Army
- Comprehensive tactical and operational references
- Standardized procedures for all-arms operations
The INF 202 (Infantry Section Manual) and TTA 150 (General Knowledge) are examples of core training documents whose inclusion makes this dataset directly representative of the texts that French military personnel study during their careers.
How Does This Dataset Compare to Other Military NLP Datasets?
This NATO doctrine dataset is distinguished by three characteristics rarely found together in defense datasets: visual page images, bilingual queries, and institutional source diversity.
Most existing military NLP corpora are text-only and monolingual. This dataset provides page images with bilingual queries, enabling multimodal and cross-lingual research.
| Feature | This Dataset | Typical Military Corpora |
|---|---|---|
| Format | Page images + queries | Text only |
| Languages | Bilingual queries (FR + EN) | Monolingual (EN only) |
| Documents | 377 | 50-200 |
| Page images | 29,271 | N/A |
| Total rows | 58,542 | 5,000-15,000 |
| Doctrinal levels | 4 (Strategic to Technical) | 1-2 |
| Size | 12.93 GB | < 500 MB |
| Train/test split | Yes (90/10 by document) | Often missing |
The bilingual query structure is particularly valuable. It enables cross-lingual retrieval research: can a model retrieve French military documents using English queries? This opens research avenues that monolingual or text-only datasets cannot support.
For researchers building defense AI systems, this dataset provides both visual document understanding capabilities and cross-lingual retrieval benchmarks.
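A cross-lingual benchmark can be carved out of the bilingual rows by keeping only the queries whose language differs from the language of the underlying document. The sketch below works on toy rows. Only the `language` field (query language, `fr` or `en`) is described for this dataset; the `page_id`, `source`, and `query` field names are hypothetical, and the mapping from source to document language is an assumption that may not hold for every document (bilingual glossaries, for example).

```python
# Assumption: French-source documents are written in French and NATO
# documents in English. Individual documents may deviate from this.
SOURCE_LANGUAGE = {"french": "fr", "nato": "en"}

# Toy rows; field names other than `language` are hypothetical.
rows = [
    {"page_id": "fr_doc_p1", "source": "french", "language": "fr",
     "query": "Quelle est la mission de la section ?"},
    {"page_id": "fr_doc_p1", "source": "french", "language": "en",
     "query": "What is the mission of the section?"},
    {"page_id": "nato_doc_p1", "source": "nato", "language": "fr",
     "query": "Comment la logistique interarmées est-elle organisée ?"},
    {"page_id": "nato_doc_p1", "source": "nato", "language": "en",
     "query": "How is joint logistics organised?"},
]


def cross_lingual_rows(rows):
    """Keep rows whose query language differs from the document language."""
    return [r for r in rows if r["language"] != SOURCE_LANGUAGE[r["source"]]]


pairs = cross_lingual_rows(rows)
for r in pairs:
    print(r["language"], "->", r["page_id"])
```

Under these assumptions, half of all rows are cross-lingual, because every page carries exactly one query in each language.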
What Are the Best Use Cases for AI and Machine Learning?
This military doctrine dataset is purpose-built for visual document understanding and retrieval tasks. Its page images with bilingual queries make it uniquely suited for multimodal AI applications.
1. Visual Document Retrieval
The primary use case: given a text query, retrieve the most relevant page image. With 58,542 query-image pairs (29,271 pages × 2 languages), the dataset enables training and evaluation of vision-language retrieval models in the military domain.
2. Visual Question Answering (VQA)
Each page image is paired with AI-generated queries describing the page content. This structure supports training VQA models that can answer questions about military doctrine documents based on their visual appearance.
3. Cross-Lingual Document Retrieval
Every page has both French and English queries, enabling research on cross-lingual retrieval: retrieve French military documents using English queries, or vice versa.
4. Document Layout Understanding
The page images preserve the visual structure of military documents, including tables, diagrams, hierarchical formatting, and multi-column layouts. This supports research on document layout analysis and structure extraction.
5. Military Domain Embeddings
The dataset can train specialized embedding models that understand military terminology and concepts. The 16 document categories provide natural clustering for evaluation.
6. RAG Systems for Military Q&A
The dataset’s organization into clear doctrinal categories makes it suitable for building retrieval-augmented generation systems. The bilingual queries enable multilingual RAG pipelines.
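For the retrieval use cases above, a common way to score a model is recall@k: the fraction of queries whose correct page appears among the k highest-ranked pages. The following is a minimal sketch in pure Python, using toy 2-D vectors in place of real embeddings. The function names, page identifiers, and vectors are all illustrative; in practice the embeddings would come from a vision-language encoder, which is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, page_vecs, gold_page_ids, k):
    """Fraction of queries whose gold page is among the k most similar pages.

    query_vecs:    list of query embeddings
    page_vecs:     dict mapping page_id -> page embedding
    gold_page_ids: correct page_id for each query, in the same order
    """
    if not query_vecs:
        return 0.0
    hits = 0
    for q, gold in zip(query_vecs, gold_page_ids):
        ranked = sorted(
            page_vecs, key=lambda pid: cosine(q, page_vecs[pid]), reverse=True
        )
        hits += gold in ranked[:k]
    return hits / len(query_vecs)


# Toy embeddings standing in for encoder outputs.
page_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
query_vecs = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
gold = ["p1", "p2", "p3"]

r1 = recall_at_k(query_vecs, page_vecs, gold, k=1)
r2 = recall_at_k(query_vecs, page_vecs, gold, k=2)
print(f"recall@1={r1:.3f} recall@2={r2:.3f}")
```

In this toy setup the third query ranks `p2` above its gold page `p3`, so it counts as a miss at k=1 and a hit at k=2. Running the same metric separately on French and English queries gives a simple per-language comparison.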
How Was This Defense Training Data Collected?
The dataset was built in two stages: PDF collection and parquet conversion.
Stage 1: PDF Collection
Source PDFs were collected from institutional websites using Python 3 with Requests and BeautifulSoup4. Every PDF was validated using magic bytes verification.
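Magic-byte validation guards against a common scraping failure: a URL ending in `.pdf` that actually returns an HTML error page. The exact check used for this dataset is not documented, so the following is only a sketch of the usual approach, which tests for the `%PDF-` signature at the start of the file. The function names are illustrative.

```python
PDF_MAGIC = b"%PDF-"  # signature that opens a well-formed PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True when the payload starts with the PDF signature."""
    return data[: len(PDF_MAGIC)] == PDF_MAGIC


def is_pdf_file(path) -> bool:
    """Read only the first few bytes of a file and check the signature."""
    with open(path, "rb") as fh:
        return looks_like_pdf(fh.read(len(PDF_MAGIC)))


# A real PDF header versus an HTML error page served under a .pdf URL.
print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html><body>404</body></html>"))
```

This strict check rejects files with leading junk before the header, which some PDF readers tolerate. A more lenient variant could search the first kilobyte for the signature instead.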
French sources include defense.gouv.fr/cicde, c-dec.terre.defense.gouv.fr, irsem.fr, and asso-minerve.fr. NATO sources include gov.uk and coemed.org. UN peacekeeping documents were also included.
Stage 2: Parquet Conversion
The collected PDFs were converted to a Hugging Face-compatible parquet format:
- Page rendering: Each PDF page was converted to a JPEG image
- Query generation: AI-generated queries were created for each page in both French and English
- Bilingual duplication: Each page appears twice in the dataset (once with language=fr, once with language=en)
- Train/test split: Documents were split 90/10 by document (not by page), resulting in 341 train documents and 36 test documents
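Splitting by document rather than by page keeps every page of a given document, in both query languages, on the same side of the split, so test pages never share a source document with training pages. The sketch below illustrates the idea on toy rows. The `doc_id` field name, the seed, and the rounding rule are assumptions, and the code is not expected to reproduce the published 341/36 partition.

```python
import random


def split_by_document(doc_ids, test_fraction=0.1, seed=0):
    """Partition unique document ids into (train_docs, test_docs)."""
    unique = sorted(set(doc_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def assign_rows(rows, test_docs):
    """Route each row to train or test according to its document id."""
    train, test = [], []
    for row in rows:
        (test if row["doc_id"] in test_docs else train).append(row)
    return train, test


# Toy corpus: 20 documents, 3 pages each, 2 query languages per page.
rows = [
    {"doc_id": f"doc{d:02d}", "page": p, "language": lang}
    for d in range(20)
    for p in range(3)
    for lang in ("fr", "en")
]

train_docs, test_docs = split_by_document(r["doc_id"] for r in rows)
train_rows, test_rows = assign_rows(rows, test_docs)
print(len(train_docs), len(test_docs), len(train_rows), len(test_rows))
```

Because the assignment is keyed on the document id, the two row sets cannot share a document, which is the property that a page-level random split would not guarantee.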
The final parquet files total 12.93 GB (11.87 GB train + 1.06 GB test), with page images stored as JPEG binary in the image column.
Citation
When using this dataset, please cite:
NATO & French Military Doctrine Dataset (2026). Sources: NATO, French Ministry of Armed Forces, IRSEM, UN Peacekeeping. 377 documents, 29,271 page images, 58,542 rows with bilingual queries.