Treatment Effect Estimation Benchmarks

Citation Author(s):
University of Essex
Submitted by:
Damian Machlanski
Last updated:
Wed, 08/16/2023 - 08:00
Data Format:


This bundle contains four well-established causal inference benchmark datasets for evaluating the performance of causal/treatment effect estimation methods: IHDP, Jobs, Twins and News. All four datasets are already publicly available; this bundle merely collects them in a single location for ease of replication.

IHDP is based on the Infant Health and Development Program (IHDP) clinical trial. Goal: predict the effect of receiving specialised childcare on the cognitive test scores of infants. Introduced by [1].

Jobs combines data from the National Supported Work Program and the Panel Study of Income Dynamics. Goal: predict the effect of job training on employment status. Introduced by [4].

Twins consists of twin births in the US between 1989 and 1991. Goal: predict the effect of higher body mass on mortality. This specifically pre-processed version of the data comes from [3].

News is a collection of news articles represented as bags of words. Goal: predict the effect of the device type used to read an article on the user experience. Introduced by [2].


See the respective references for more details about the datasets.
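As a rough illustration of what evaluating a treatment effect estimator can look like, the sketch below computes two commonly used scores: the absolute error in the average treatment effect (ATE) and the root of the precision in estimation of heterogeneous effects (PEHE). Both need the true individual effects, so they only apply where these are known, for example when outcomes are simulated. The function names, the array names and the synthetic data are assumptions made for this example and are not part of the bundle.

```python
import numpy as np


def ate_error(tau_true, tau_pred):
    """Absolute difference between the true and the estimated average effect."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(abs(tau_true.mean() - tau_pred.mean()))


def sqrt_pehe(tau_true, tau_pred):
    """Root mean squared error between true and estimated individual effects."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.sqrt(np.mean((tau_true - tau_pred) ** 2)))


# Synthetic stand-in data (hypothetical, NOT taken from any of the four datasets).
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)        # a single covariate
mu0 = x                       # noiseless outcome without treatment
mu1 = x + 2.0 + 0.5 * x       # noiseless outcome with treatment
tau_true = mu1 - mu0          # true individual effect: 2 + 0.5 * x

# A deliberately naive estimator that predicts the same effect for everyone.
tau_pred = np.full(n, 2.0)

print("ATE error :", ate_error(tau_true, tau_pred))
print("sqrt PEHE :", sqrt_pehe(tau_true, tau_pred))
```

A constant prediction can score well on ATE error while scoring poorly on PEHE, which is why the two are often reported together when individual-level effects matter.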



[1] J. L. Hill, ‘Bayesian Nonparametric Modeling for Causal Inference’, Journal of Computational and Graphical Statistics, vol. 20, no. 1, pp. 217–240, Jan. 2011, doi: 10.1198/jcgs.2010.08162.

[2] F. D. Johansson, U. Shalit, and D. Sontag, ‘Learning representations for counterfactual inference’, in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, in ICML’16. New York, NY, USA, Jun. 2016, pp. 3020–3029.

[3] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling, ‘Causal Effect Inference with Deep Latent-Variable Models’, Advances in Neural Information Processing Systems, vol. 30, 2017, Accessed: May 25, 2021. [Online]. Available:

[4] J. A. Smith and P. E. Todd, ‘Does matching overcome LaLonde’s critique of nonexperimental estimators?’, Journal of Econometrics, vol. 125, no. 1–2, pp. 305–353, 2005.


Please visit the following GitHub repository for further instructions and examples on how to load and use the datasets with the Python programming language.