OpenBHB: a Multi-Site Brain MRI Dataset for Age Prediction and Debiasing

Citation Author(s):
CEA Saclay
Submitted by:
Benoit Dufumier
Last updated:
Tue, 09/20/2022 - 11:37


The Open Big Healthy Brains (OpenBHB) dataset is a large (N>5000) multi-site 3D brain MRI dataset gathering 10 public datasets (IXI, ABIDE 1, ABIDE 2, CoRR, GSP, Localizer, MPI-Leipzig, NAR, NPC, RBP) of T1 images acquired across 93 different centers worldwide (North America, Europe and China). Only healthy controls are included in OpenBHB, with ages ranging from 6 to 88 years and a balanced number of males and females. All T1 images have been uniformly pre-processed with three pipelines, CAT12 (SPM), FreeSurfer and Quasi-Raw (in-house minimal pre-processing), and all passed a visual quality check. Both Voxel-Based Morphometry (VBM) and Surface-Based Morphometry (SBM) measures are available for each T1 MRI. Participants' age and sex are provided, as well as the acquisition site, MRI magnetic field strength and MRI scanner settings used for each image acquisition.

Note: OpenBHB has been divided into an official train, validation and test split for the open challenge currently deployed on brain age prediction and site-effect removal (see below). To avoid any data leakage during this challenge, data in test are kept private on the submission servers to compute the challenge metrics. Only training and validation data are openly available for now.

The OpenBHB Challenges

  1. Brain age prediction and debiasing with site-effect removal

OpenBHB has been designed for brain age prediction and debiasing with site-effect removal in current brain MRI datasets through representation learning. The challenge consists of developing new algorithms that take as input the T1 MRI images available in OpenBHB and output representation vectors that preserve biological variability (age) while removing undesirable non-biological confounding variables (acquisition site/settings). The representation quality is evaluated through linear probing on brain age prediction and site debiasing with various metrics (e.g., Mean Absolute Error). All algorithms can be submitted on RAMP (check out our webpage for more details), with a public record of their performance and an official leaderboard. This challenge should promote reproducible research in neuroimaging, and it tackles two active topics in both computer vision and neuroimaging: representation learning and debiasing.
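
A minimal sketch of the age-prediction side of linear probing, assuming the representation vectors have already been produced by a frozen encoder. The ridge probe, the helper names (`fit_ridge`, `predict`, `mean_absolute_error`) and the toy data are illustrative assumptions, not the official challenge code.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def fit_ridge(X: Matrix, y: Sequence[float], alpha: float = 1e-3) -> Tuple[List[float], float]:
    """Closed-form ridge regression with an unpenalized intercept."""
    n, d = len(X), len(X[0])
    # Append a constant 1 to each row so the last coefficient acts as the bias.
    Xa = [list(row) + [1.0] for row in X]
    k = d + 1
    # Normal equations: (Xa^T Xa + alpha * I_weights) w = Xa^T y
    A = [[sum(Xa[i][p] * Xa[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    for p in range(d):
        A[p][p] += alpha
    b = [sum(Xa[i][p] * y[i] for i in range(n)) for p in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    coeffs = [b[i] / A[i][i] for i in range(k)]
    return coeffs[:d], coeffs[d]


def predict(X: Matrix, weights: Sequence[float], bias: float) -> List[float]:
    """Apply the linear probe to a batch of representation vectors."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in X]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy 2-D "representations" (hypothetical): age = 10 * z0 + 5 * z1 + 20.
train_z = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
train_age = [20.0, 30.0, 25.0, 35.0, 45.0]
val_z = [[3.0, 2.0], [1.0, 2.0]]
val_age = [60.0, 40.0]

# The encoder stays frozen; only the linear probe is fitted on training data.
weights, bias = fit_ridge(train_z, train_age, alpha=1e-6)
val_mae = mean_absolute_error(val_age, predict(val_z, weights, bias))
print(f"validation MAE: {val_mae:.4f} years")
```

A lower validation MAE suggests the representation retains more age information. The debiasing side could follow the same pattern with a linear classifier predicting the acquisition site, where poor site prediction is the desirable outcome.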


Please read the following sections carefully.

Dataset organization

This dataset comprises 3227 training images, 757 validation images and 664 testing images (kept private) dedicated to the OpenBHB challenge. Additionally, 628 images are available with missing label information (age, sex, or scanner details); they are excluded from the current challenge. The exact content of this dataset is described in our paper. Check out the GitHub repository to submit your model.
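
As a quick sanity check, the image counts quoted in this section can be summed to confirm the overall size claim (N>5000). The variable names below are illustrative only.

```python
# Image counts as stated in the dataset description.
challenge_splits = {"train": 3227, "validation": 757, "test (private)": 664}
missing_label_images = 628

challenge_total = sum(challenge_splits.values())
overall_total = challenge_total + missing_label_images

for name, count in challenge_splits.items():
    share = 100 * count / challenge_total
    print(f"{name:>15}: {count:5d} ({share:.1f}% of challenge data)")
print(f"challenge total: {challenge_total}")
print(f"overall total:   {overall_total}")
```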

The dataset is organized as follows:

  • Official training and validation data with all modalities concatenated (VBM, SBM, Quasi-Raw) and labels (age and site) are accessible in
  • All metadata (age, sex, site, acquisition setting, magnetic field strength, etc.) can be found in participants.tsv.
  • The corresponding T1 images pre-processed with CAT12 (VBM), FreeSurfer (SBM) and Quasi-Raw can be found in training_data.
  • The discretized (site, acquisition setting) pairs used for the OpenBHB Challenge are in official_site_class_labels.tsv.
  • Additional T1 images with missing label information are in missing_label_data.
  • The metrics used for Quality Check (e.g., Euler number for FreeSurfer) can be found in qc.tsv.
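
A minimal sketch of how a tab-separated metadata file such as participants.tsv could be read with the Python standard library. The column names (`participant_id`, `age`, `sex`, `site`) and the sample rows are assumptions for illustration; the actual header of participants.tsv should be checked before reuse.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Hypothetical excerpt: the real column names and values may differ.
SAMPLE_TSV = (
    "participant_id\tage\tsex\tsite\n"
    "sub-01\t23.5\tfemale\t1\n"
    "sub-02\t67.0\tmale\t1\n"
    "sub-03\t8.2\tmale\t2\n"
    "sub-04\t41.0\tfemale\t3\n"
)


def load_participants(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# With the real file: open("participants.tsv", newline="") instead of StringIO.
rows = load_participants(io.StringIO(SAMPLE_TSV))
ages = [float(row["age"]) for row in rows]
site_counts = Counter(row["site"] for row in rows)

print(f"{len(rows)} participants, age range {min(ages)} to {max(ages)}")
print("images per site:", dict(site_counts))
```

Counting images per site in this way gives a first view of how unevenly sites are represented, which matters for the site-effect removal task.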


  • The templates used during the VBM analysis can be found in cat12vbm_space-MNI152_desc-gm_TPM.nii.gz.
  • The templates used during the Quasi-Raw analysis can be found in quasiraw_space-MNI152_desc-brain_T1w.nii.gz.
  • The Region-Of-Interest (ROI) names corresponding to the default CAT12 atlas (Neuromorphometrics) and the FreeSurfer Desikan and Destrieux atlases can be found in cat12vbm_labels.txt, freesurfer_atlas-desikan_labels.txt and freesurfer_atlas-destrieux_labels.txt respectively.
  • The surface-based feature names derived by FreeSurfer on both Desikan and Destrieux atlases are available in freesurfer_channels.txt.
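
A small sketch of how one of the label files listed above might be turned into a name-to-index lookup. It assumes one name per line, which is not confirmed by this description, and uses placeholder names rather than real ROI labels.

```python
import io
from typing import Dict, List, TextIO


def read_label_names(handle: TextIO) -> List[str]:
    """Return stripped, non-empty lines (assumes one name per line)."""
    return [line.strip() for line in handle if line.strip()]


# Placeholder content; with the real data, open e.g. cat12vbm_labels.txt.
sample_file = io.StringIO("roi_a\nroi_b\n\nroi_c\n")
names = read_label_names(sample_file)

# Map each name to its position, e.g. to select a column in a feature array.
index_of: Dict[str, int] = {name: i for i, name in enumerate(names)}
print(names)
print(index_of["roi_c"])
```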


If you use this dataset for your work, please use the following citation:


      title={{OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing}},
      author={Dufumier, Benoit and Grigis, Antoine and Victor, Julie and Ambroise, Corentin and Frouin, Vincent and Duchesnay, Edouard},




Licence and Data Usage Agreement

This dataset is under Licence CC BY-NC-SA 3.0. By downloading this dataset, you also agree to the most restrictive Data Usage Agreement (DUA) of all cohorts (see the Data Usage Agreement terms included in this dataset):

  • ABIDE 1 [1]. Licence term CC BY-NC-SA 3.0 (ShareAlike), DUA
  • ABIDE 2 [2]. Licence term CC BY-NC-SA 3.0, DUA
  • IXI [3]. Licence term CC0, DUA
  • CoRR [4]. Licence term CC0, DUA
  • GSP [5]. Licence term DUA
  • NAR [6]. Licence term CC0
  • MPI-Leipzig [7]. Licence term CC0
  • NPC [8]. Licence term CC0
  • RBP [9,10]. Licence term CC0
  • Localizer [11]. Licence term CC BY 3.0


  [1]
  [2]
  [3]
  [4] Zuo, X.N., et al. An Open Science Resource for Establishing Reliability and Reproducibility in Functional Connectomics. (In Press)
  [5] Buckner, Randy L.; Roffman, Joshua L.; Smoller, Jordan W., 2014, "Brain Genomics Superstruct Project (GSP)", Harvard Dataverse, V10.
  [6] Nastase, S. A., et al. Narratives: fMRI data for evaluating models of naturalistic language comprehension.
  [7] Babayan, A., Erbey, M., Kumral, D. et al. A mind-brain-body dataset of MRI, EEG, cognition, emotion, and peripheral physiology in young and old adults. Sci Data 6, 180308 (2019).
  [8] Sunavsky, A. and Poppenk, J. (2020). Neuroimaging predictors of creativity in healthy adults. OpenNeuro. doi: 10.18112/openneuro.ds002330.v1.1.0
  [9] Li, P., & Clariana, R. (2019). Reading comprehension in L1 and L2: An integrative approach. Journal of Neurolinguistics, 50, 94-105.
  [10] Follmer, J., Fang, S., Clariana, R., Meyer, B., & Li, P. (2018). What predicts adult readers' understanding of STEM texts? Reading and Writing, 31, 185-214.
  [11] Orfanos, D. P., Michel, V., Schwartz, Y., Pinel, P., Moreno, A., Le Bihan, D., & Frouin, V. (2017). The brainomics/localizer database. NeuroImage, 144, 309-314.

