ADAM — A Diverse Archive of Mankind
ADAM dataset and benchmark for evaluating and improving LLMs in biographical reasoning
Abstract
We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom’s taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.
Publication
ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning
ADAM: Overview
ADAM is the first retrieval-augmented framework specifically designed for biographical reasoning. It addresses key weaknesses in existing LLM pipelines: hallucination on factual biographical content, poor coverage of less-documented individuals, English-centric bias, and the lack of cognitively structured evaluation.
Contributions
- ADAM Framework: An integrated retrieval-augmented framework for biography, combining multilingual retrieval, popularity-aware ranking, and cognitive benchmarking.
- AdamDB: A large-scale, multilingual, multimodal biographical knowledge base covering over 4 million individuals across nearly 600 languages.
- AdamBench: A Bloom’s Taxonomy–grounded evaluation suite with multilingual and multimodal multiple-choice questions.
- AdamRAG: A retrieval-augmented generation system that reduces hallucinations through popularity-weighted retrieval and cross-lingual linking.
- Comprehensive evaluation: Systematic analysis across open-source and closed-source models, languages, modalities, and popularity tiers.
Introduction
Large language models have transformed access to information, but biography poses particular challenges that demand exact factual accuracy. ADAM addresses these challenges by combining a scalable database of structured biographical records (AdamDB), a cognitively informed benchmark (AdamBench), and a retrieval pipeline (AdamRAG) to ensure outputs are evidence-backed and auditable.
Dataset (AdamDB)
Key characteristics:
- Large-scale: ~4 million structured individual records.
- Multimodal: textual biographies with image references.
- Multilingual: coverage across nearly 600 languages.
- Popularity metrics: Wikipedia pageview-derived popularity weights used for retrieval ranking.
Construction highlights: automated extraction from WikiDBs, NER-based filtering, Wikidata Q‑ID alignment, deduplication, and multilingual name resolution.
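The Wikidata Q‑ID alignment and deduplication steps can be sketched as follows. This is a minimal illustration, assuming a hypothetical record schema with `qid`, `name`, and `language` fields; the released AdamDB schema may differ.

```python
def deduplicate_by_qid(records):
    """Merge records that share a Wikidata Q-ID, accumulating
    multilingual name variants for each individual."""
    merged = {}
    for rec in records:
        qid = rec["qid"]
        if qid not in merged:
            merged[qid] = {"qid": qid, "names": set(), "languages": set()}
        merged[qid]["names"].add(rec["name"])
        merged[qid]["languages"].add(rec["language"])
    return list(merged.values())

# Example: two language editions of the same person collapse to one record.
records = [
    {"qid": "Q7259", "name": "Ada Lovelace", "language": "en"},
    {"qid": "Q7259", "name": "Ada Lovelace", "language": "fr"},
    {"qid": "Q937", "name": "Albert Einstein", "language": "en"},
]
unique = deduplicate_by_qid(records)
```

Keying on the Q‑ID rather than the surface name is what makes multilingual name resolution robust: spelling variants across languages map to the same entity.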
Benchmark (AdamBench)
AdamBench contains multiple-choice questions organized by Bloom’s Taxonomy, written in both English and the subject’s native language where available. It tests cognitive levels from Remembering up to Creating and includes multimodal questions that combine text and images.
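A benchmark item and its scoring can be sketched as below. The field names are illustrative placeholders, not the released AdamBench schema, and the sample question is invented for the example.

```python
# Hypothetical AdamBench-style multiple-choice item.
item = {
    "subject": "Marie Curie",
    "bloom_level": "Remembering",
    "language": "en",
    "question": "In which field did Marie Curie win her first Nobel Prize?",
    "choices": ["Physics", "Chemistry", "Medicine", "Literature"],
    "answer_index": 0,
}

def accuracy(items, predict):
    """Fraction of items where the model's predicted choice index
    matches the gold answer index."""
    correct = sum(predict(it) == it["answer_index"] for it in items)
    return correct / len(items)
```

Tagging each item with a `bloom_level` is what allows accuracy to be broken down by cognitive level, e.g. Remembering versus Creating, as in the paper's analysis.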
AdamRAG (Retrieval & Generation)
AdamRAG retrieves contextual passages from AdamDB, applies context-window optimization and popularity-weighted ranking, then forwards the assembled evidence to a generator LLM. This pipeline reduces hallucinations and improves disambiguation for similar names and lesser-known subjects.
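One way to realize popularity-weighted ranking is to mix a passage's semantic similarity with a log-scaled pageview signal. The sketch below is an assumption about how such a scorer could look, not the paper's exact formula; `alpha` and the log normalization are illustrative choices.

```python
import math

def rank_passages(passages, alpha=0.7):
    """Rank retrieved passages by a convex combination of embedding
    similarity and log-scaled popularity (pageviews).

    Each passage is a dict with "similarity" in [0, 1] and raw
    "pageviews"; higher combined score ranks first."""
    max_views = max(p["pageviews"] for p in passages)

    def combined(p):
        # Log scaling dampens the huge spread between celebrities
        # and lesser-known individuals.
        pop = math.log1p(p["pageviews"]) / math.log1p(max_views)
        return alpha * p["similarity"] + (1 - alpha) * pop

    return sorted(passages, key=combined, reverse=True)
```

The `alpha` knob trades off evidence relevance against subject prominence; a lower value leans harder on popularity, which helps disambiguate common names but risks drowning out lesser-known subjects.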
Evaluation & Results
We evaluate ADAM across open-source and closed-source models under multilingual and multimodal settings. Results show retrieval augmentation consistently improves factual accuracy, with the largest gains for open-source models and for lower-order cognitive tasks. Popularity strongly mediates accuracy; retrieval reduces but does not fully eliminate this bias.
Impact and Future Work
ADAM provides a robust, reproducible framework for biographical reasoning research. Future work includes refining multimodal fusion strategies, designing fairness-aware benchmarks, and releasing the full AdamDB and AdamBench datasets to the community.