BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

By Admin

Apr 20, 2026

THE AI TODAY

arXiv:2604.16241v1
Abstract: Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. Because evaluation is closed-book, BAGEL measures the animal-related knowledge a model holds in its parameters, without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
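To make the "fine-grained analysis" idea concrete, here is a minimal sketch of how closed-book answers might be scored and broken down by source domain or knowledge category. It is not BAGEL's actual scoring code: the record fields (`source`, `category`, `answer`, `prediction`), the exact-match rule, and the toy records are all assumptions made for illustration, since the abstract does not specify a data format or metric.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def accuracy_by(records, key):
    """Return exact-match accuracy per group, where groups are the values of records[i][key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records, NOT taken from the benchmark. Each prediction is
# assumed to come from a model answering without retrieval (closed-book).
records = [
    {"source": "Xeno-canto", "category": "vocalization",
     "answer": "Turdus merula", "prediction": "turdus  merula"},
    {"source": "Wikipedia", "category": "habitat",
     "answer": "freshwater", "prediction": "marine"},
    {"source": "Wikipedia", "category": "taxonomy",
     "answer": "Felidae", "prediction": "Felidae"},
]

by_source = accuracy_by(records, "source")
by_category = accuracy_by(records, "category")
print(by_source)
print(by_category)
```

Grouping on a single key keeps the sketch short; the same function could be called with a taxonomic-group field to get the third breakdown the abstract mentions. Exact match is a deliberately strict stand-in, and a real evaluation might prefer a more lenient comparison.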

