
NeuralBench: The New Standard for Benchmarking Brain-AI Models

Published: 2026-05-07 16:45:52 | Category: Data Science

Evaluating artificial intelligence models designed to interpret brain signals has long been a chaotic process. Different labs apply their own preprocessing steps, train on disparate datasets, and report results on only a handful of tasks—making it nearly impossible to determine which model truly excels. To address this, Meta AI's research team has introduced NeuralBench, an open-source framework that standardizes the benchmarking of NeuroAI models. The first release, NeuralBench-EEG v1.0, is the largest unified benchmark of its kind, encompassing 36 downstream tasks, 94 datasets, data from 9,478 subjects, and over 13,600 hours of EEG recordings, all evaluated under a single interface. This initiative aims to bring clarity and reproducibility to a field where claims of model generality often rest on cherry-picked results.

Why was NeuralBench created?

The broader field of NeuroAI—where deep learning meets neuroscience—has grown rapidly in recent years. Techniques like self-supervised learning, originally developed for language and vision, are now being adapted to build brain foundation models: large models pretrained on unlabeled neural recordings and fine-tuned for tasks such as clinical seizure detection or decoding visual and auditory stimuli. However, the evaluation landscape was fragmented. Existing benchmarks like MOABB cover up to 148 BCI datasets but limit evaluation to just 5 tasks. Other efforts—EEG-Bench, EEG-FM-Bench, AdaBrain-Bench—each have their own constraints. For modalities like MEG and fMRI, there was no systematic benchmark at all. This lack of a common reference made it easy for researchers to cherry-pick tasks that made their models look good, undermining scientific rigor. NeuralBench was designed to fix this by providing a unified, standardized framework that forces consistent evaluation across a wide range of tasks and datasets.


What does NeuralBench-EEG v1.0 include?

The first release of NeuralBench is focused exclusively on electroencephalography (EEG) data. It includes 36 downstream tasks, drawn from 94 datasets covering 9,478 subjects and 13,603 hours of recordings. The framework evaluates 14 deep learning architectures under a single standardized interface. Tasks range from clinical applications like seizure detection to brain-computer interfacing tasks such as decoding imagined speech or motor imagery. This scale makes it the largest open benchmark for EEG-based NeuroAI models, offering a comprehensive testbed for assessing model generalization. The datasets are curated from public repositories including OpenNeuro, DANDI, and NEMAR, ensuring broad coverage of different experimental paradigms and recording conditions.

How does the NeuralBench framework work technically?

NeuralBench is built on three core Python packages that form a modular pipeline. NeuralFetch handles dataset acquisition, pulling curated data from public repositories. NeuralSet prepares the data as PyTorch-ready dataloaders, wrapping existing neuroscience tools such as MNE-Python and nilearn for preprocessing, and integrates with Hugging Face models to extract stimulus embeddings for tasks involving images, speech, or text. NeuralTrain provides modular training code built on PyTorch Lightning, Pydantic, and the exca execution-and-caching library. The entire framework is controlled through a command-line interface (CLI) once installed with pip install neuralbench; running a task takes just three commands: download the data, prepare the cache, and execute the run. Each task is configured through a lightweight YAML file that specifies the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics. This design keeps experiments reproducible and easy to reuse.
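The exact schema of these YAML files is not spelled out here, so the sketch below is only a plausible illustration of a task configuration covering those fields; every key name and value is hypothetical rather than NeuralBench's published format.

    # Hypothetical NeuralBench task configuration -- illustrative only;
    # the real schema may use different key names and structure.
    task: seizure_detection             # downstream task identifier (assumed)
    data:
      source: openneuro                 # public repository to pull from
      dataset: ds_placeholder           # dataset accession (placeholder)
      splits: {train: 0.8, val: 0.1, test: 0.1}
    preprocessing:
      resample_hz: 128                  # resampling applied via MNE-Python
      bandpass_hz: [0.5, 45.0]          # band-pass filter limits
      window_s: 10                      # epoch length in seconds
    target:
      type: classification
      labels: [background, seizure]
    training:
      batch_size: 64
      learning_rate: 3.0e-4
      max_epochs: 50
    evaluation:
      metrics: [balanced_accuracy, auroc]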

How do researchers use NeuralBench?

Using NeuralBench is straightforward. After installing the package, researchers interact with it through a command-line interface. The typical workflow involves three simple commands: first, neuralbench download <task_name> to fetch the data; second, neuralbench prepare <task_name> to create cached, preprocessed dataloaders; and third, neuralbench run <task_name> with a specified model architecture. Configuration is handled via YAML files that define every aspect of the experiment, from data splits to hyperparameters. This unified interface allows researchers to compare model performance across all 36 tasks using the exact same preprocessing and evaluation pipeline. The framework also supports custom models, enabling teams to benchmark their own architectures against the 14 included baselines. By standardizing the evaluation process, NeuralBench eliminates many sources of variability that previously made cross-study comparisons unreliable.
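In concrete terms, the end-to-end workflow looks like the snippet below. The pip command and the three subcommand names come directly from the description above; the task-name placeholder and the model-selection flag in the last step are assumptions for illustration, since the exact run syntax is not shown here.

    # Install the framework
    pip install neuralbench

    # 1. Fetch the raw data for a task
    neuralbench download <task_name>

    # 2. Build cached, preprocessed dataloaders
    neuralbench prepare <task_name>

    # 3. Train and evaluate a model on the task
    #    (the --model flag is an assumed way of picking a baseline architecture)
    neuralbench run <task_name> --model <architecture_name>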

What types of tasks and datasets are covered?

The 36 downstream tasks in NeuralBench-EEG v1.0 span a wide range of applications. They include clinical tasks such as seizure detection, sleep staging, and cognitive load classification; brain-computer interfacing tasks like motor imagery, P300 speller, and steady-state visual evoked potentials; and perception and language tasks such as decoding visual stimuli from EEG or recognizing imagined speech. The 94 datasets are drawn from major public repositories including OpenNeuro, DANDI, and NEMAR, and cover diverse recording setups, subject populations, and experimental protocols. This diversity ensures that a model's performance is tested across multiple conditions, providing a robust measure of generalizability. For example, a model trained on one dataset can be fine-tuned and evaluated on others, allowing researchers to assess how well it transfers to new tasks or recording environments.
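To make that transfer scenario concrete, the sketch below shows the general pattern in plain PyTorch rather than NeuralBench's actual API: an encoder pretrained on one EEG dataset is frozen, and a small classification head is trained on a new target dataset. All class and function names are illustrative assumptions.

    # Generic PyTorch fine-tuning sketch -- illustrative only,
    # not NeuralBench's training code or API.
    import torch
    import torch.nn as nn

    class EEGClassifier(nn.Module):
        def __init__(self, encoder: nn.Module, feat_dim: int, n_classes: int):
            super().__init__()
            self.encoder = encoder                      # backbone pretrained on a source EEG dataset
            self.head = nn.Linear(feat_dim, n_classes)  # fresh head for the target task

        def forward(self, x):                           # x: (batch, channels, time)
            return self.head(self.encoder(x))

    def finetune(model, loader, epochs=10, lr=1e-3, freeze_encoder=True):
        if freeze_encoder:                              # keep pretrained features fixed
            for p in model.encoder.parameters():
                p.requires_grad = False
        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:                         # batches from the target dataset
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                optimizer.step()
        return model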

Why is a unified benchmark critical for NeuroAI?

Without a common benchmark, claims about foundation models being "generalizable" or "foundational" often rest on cherry-picked tasks with no common reference point. This hampers scientific progress because it becomes impossible to compare methods fairly. NeuralBench addresses this by providing a standardized interface that forces all models to be evaluated on the same predefined tasks, datasets, preprocessing steps, and evaluation metrics. This levels the playing field and makes results reproducible. The framework also reduces the effort required to set up experiments—researchers no longer need to write custom data loaders or preprocessing pipelines for every new dataset. By lowering these barriers, NeuralBench encourages more systematic evaluation and accelerates the development of robust NeuroAI models that can be trusted for real-world applications like clinical diagnostics and assistive technologies.