The Re-curated Breast Imaging Subset DDSM Dataset (RBIS-DDSM) is a curated version of 849 images from the CBIS-DDSM dataset, which is available online under a permissive copyright license (CC-BY-SA 3.0). The CBIS-DDSM dataset is an improved version of the DDSM dataset. To segment and annotate masses, the authors of CBIS-DDSM attempted to improve the ground truth by applying simple image-processing methods that enhance edges, without any manual intervention from medical experts. However, the resulting annotations (segmentation maps) are inaccurate in most of the images.


Twitter is one of the most popular social networks for sentiment analysis. This dataset contains tweets related to the stock market. We collected 943,672 tweets between April 9 and July 16, 2020, using the S&P 500 tag (#SPX500), references to the top 25 companies in the S&P 500 index, and the Bloomberg tag (#stocks). Of the 943,672 tweets, 1,300 were manually annotated as positive, neutral, or negative. A second independent annotator reviewed the manually annotated tweets.


The raw Twitter data was downloaded through the Twitter REST API search using the Tweepy (version 3.8.0) Python package, which simplifies interaction between developers and the REST API. The Twitter REST API only retrieves data from the past seven days and allows tweets to be filtered by language. Only tweets in English (en) were retained. Data collection ran from April 9 to July 16, 2020, using the following Twitter tags as search parameters: #SPX500, #SP500, SPX500, SP500, $SPX, #stocks, $MSFT, $AAPL, $AMZN, $FB, $BBRK.B, $GOOG, $JNJ, $JPM, $V, $PG, $MA, $INTC, $UNH, $BAC, $T, $HD, $XOM, $DIS, $VZ, $KO, $MRK, $CMCSA, $CVX, $PEP, $PFE. Because of the large volume of data retrieved in the raw files, only each tweet's content and creation date were stored.
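The collection step described above can be sketched roughly as follows. This is an illustration, not the authors' script: the function name `collect_tweets`, the `per_tag` parameter, and the shortened tag list are hypothetical, and `api` is assumed to be an authenticated `tweepy.API` instance from Tweepy 3.8.0, where the search method is named `search`.

```python
from typing import Iterable, List, Tuple

# A few of the search tags listed in the text; the full list is longer.
SEARCH_TAGS = ["#SPX500", "$SPX", "#stocks", "$MSFT", "$AAPL"]


def collect_tweets(api, tags: Iterable[str], per_tag: int = 100) -> List[Tuple[str, str]]:
    """Return (creation date, content) pairs for English tweets matching each tag.

    `api` is assumed to be a tweepy.API instance (Tweepy 3.8.0); any object
    with a compatible `search(q=..., lang=..., count=...)` method also works.
    """
    rows = []
    for tag in tags:
        # lang="en" asks the search endpoint for English tweets only.
        for status in api.search(q=tag, lang="en", count=per_tag):
            # Keep only the two fields the dataset stores.
            rows.append((str(status.created_at), status.text))
    return rows
```

Because the search endpoint only reaches back about seven days, a script like this would have to be run repeatedly over the collection period, with duplicates removed afterwards.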


The file tweets_labelled_09042020_16072020.csv consists of 5,000 tweets selected by random sampling from the 943,672 collected. Of those 5,000 tweets, 1,300 were manually annotated and then reviewed by a second independent annotator. The file tweets_remaining_09042020_16072020.csv contains the remaining 938,672 tweets.
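A split of this kind (a random sample in one file, everything else in another) could be produced along the following lines. This is a minimal sketch using only the standard library; the helper name `split_sample`, the seed, and the toy data are assumptions, and the authors' actual sampling procedure is not described beyond "random sampling".

```python
import random
from typing import List, Sequence, Tuple


def split_sample(rows: Sequence, k: int, seed: int = 0) -> Tuple[List, List]:
    """Randomly pick k rows; return (sample, remaining), both in original order."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(rows)), k))
    sample = [row for i, row in enumerate(rows) if i in picked]
    remaining = [row for i, row in enumerate(rows) if i not in picked]
    return sample, remaining


# Toy stand-in for the collected tweets.
tweets = [f"tweet {i}" for i in range(1000)]
sample, remaining = split_sample(tweets, k=50)
print(len(sample), len(remaining))  # 50 950
```

Applied to the full collection with k=5000, the two lists would have 5,000 and 938,672 entries, matching the two CSV files.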


This dataset is a companion to a paper, "Segmentation Convolutional Neural Networks for Automatic Crater Detection on Mars" by DeLatte et al. 2019. DOI link:


These are the segmentation target files for the three targets described in the paper: solid filled, thicker edge, and thinner edge. 


These files match with the tiles that can be downloaded from the THEMIS Daytime IR Global Mosaic:

Alternatively, this directory can be used for the download:

Use this file pattern to grab the tiles:

  • 0 to +30N: thm_dir_N00_*.png
  • -30N to 0: thm_dir_N-30_*.png 
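As an illustration, the two patterns above can be applied to a directory listing with Python's `fnmatch` module. The example filenames are hypothetical; only the patterns come from the list above.

```python
from fnmatch import fnmatchcase

# Patterns for the two latitude bands covered by this dataset.
TILE_PATTERNS = {
    "0 to +30N": "thm_dir_N00_*.png",
    "-30N to 0": "thm_dir_N-30_*.png",
}


def select_tiles(filenames):
    """Return the filenames that match either latitude-band pattern."""
    return sorted(
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in TILE_PATTERNS.values())
    )


# Hypothetical listing: the third file lies outside the ±30° band.
listing = ["thm_dir_N00_000.png", "thm_dir_N-30_330.png", "thm_dir_N30_000.png"]
print(select_tiles(listing))
```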


Included here are three targets for the 24 tiles covering ±30° latitude, 0-360° longitude. (Each tile is 30° by 30°, 7680 x 7680 pixels, and has a resolution of 256 pixels per degree.) Craters with a 2-32 km radius are included, as identified by the Robbins & Hynek global Mars dataset. The original data file for the crater locations and parameters can be found here: 
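The tile geometry above can be checked with a few lines of arithmetic. The conversion from kilometres to pixels is only a rough sketch: it assumes a mean Mars radius of 3389.5 km (not stated in the text) and ignores the latitude-dependent distortion of the map projection, so it is most accurate near the equator.

```python
import math

PIXELS_PER_DEGREE = 256
TILE_DEGREES = 30
MARS_RADIUS_KM = 3389.5  # assumed mean radius, not given in the text

# 30 degrees at 256 pixels per degree gives the stated tile size.
tile_pixels = TILE_DEGREES * PIXELS_PER_DEGREE  # 7680

# 360 degrees of longitude by 60 degrees of latitude, in 30-degree tiles.
tile_count = (360 // TILE_DEGREES) * (60 // TILE_DEGREES)  # 24

# Approximate ground scale at the equator.
km_per_degree = 2 * math.pi * MARS_RADIUS_KM / 360
km_per_pixel = km_per_degree / PIXELS_PER_DEGREE


def radius_km_to_pixels(radius_km: float) -> float:
    """Approximate crater radius in pixels, valid near the equator."""
    return radius_km / km_per_pixel


print(tile_pixels, tile_count)
print(round(km_per_pixel, 3), "km per pixel (approximate)")
```

Under these assumptions the scale comes out at roughly 0.23 km per pixel, so the 2-32 km radius range corresponds to circles of very roughly 9 to 140 pixels in radius.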

Segmentation targets for any other range of crater sizes can be created using that file and Python OpenCV.


To use this dataset for segmentation, download the corresponding THEMIS Daytime IR Global Mosaic tiles and use the files provided here as the target images. The filenames of the target files match the tile filenames in the THEMIS Daytime IR Global Mosaic.


The file names for each type match the following patterns:

  • solid filled: thm_dir_N*_2_32_km_segrng.png
  • thicker edge (8): thm_dir_N*_2_32_km_segrng_8_edge.png
  • thinner edge (4): thm_dir_N*_2_32_km_segrng_4_edge.png

(segrng = segmentation range, referring to the 2-32 km radius range of craters in this dataset)
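Pairing a mosaic tile with its three targets might look like the sketch below. It assumes that each target name is the tile's base name followed by the suffix from the patterns above, which is an inference from those patterns rather than something stated explicitly; the tile name in the example is hypothetical.

```python
from pathlib import Path

# Suffixes taken from the three filename patterns listed above.
TARGET_SUFFIXES = {
    "solid_filled": "_2_32_km_segrng.png",
    "thicker_edge": "_2_32_km_segrng_8_edge.png",
    "thinner_edge": "_2_32_km_segrng_4_edge.png",
}


def target_names(tile_filename: str) -> dict:
    """Map a tile filename to its assumed target filenames, keyed by target type."""
    stem = Path(tile_filename).stem  # drop the .png extension
    return {kind: stem + suffix for kind, suffix in TARGET_SUFFIXES.items()}


# Hypothetical tile name following the thm_dir_N00_*.png pattern.
names = target_names("thm_dir_N00_000.png")
for kind, name in names.items():
    print(kind, name)
```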

The numbers 4 and 8 above refer to the thickness parameter of the circle drawing function in Python OpenCV, which is described here: