This dataset contains data for ISFET-based pH sensor drift compensation using machine learning techniques.


Database for FMCW THz radars (HR workspace) and sample code for federated learning 


Reinforcement Learning (RL) agents can learn to control a nonlinear system without using a model of the system. However, having a model brings benefits, mainly in terms of a reduced number of unsuccessful trials before achieving acceptable control performance. Several modelling approaches have been used in the RL domain, such as neural networks, local linear regression, or Gaussian processes. In this article, we focus on a technique that has not been used much so far: symbolic regression, based on genetic programming.


Real-life business processes change over time, in both planned and unexpected ways. These changes are called concept drifts, and their detection is a major challenge in process mining, since the inherent complexity of the data makes it difficult to distinguish between a change and an anomalous execution. The following logs were generated synthetically in order to test the quality of different concept drift detection algorithms.


The log files are available in 4 different sizes: 2500, 5000, 7500 and 10000 traces.

Each log has a sudden drift at every 10% of the log.
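
As a rough illustration of the drift placement described above, the following sketch computes the trace indices at which a change point would be expected for each log size. It assumes the drifts fall at 10%, 20%, ..., 90% of the log (nine change points, with the end of the log not counted); the function name `drift_positions` is hypothetical and not part of the dataset.

```python
def drift_positions(num_traces: int, step: float = 0.10) -> list[int]:
    """Return the trace indices where a sudden drift is expected.

    Assumption: drifts fall at 10%, 20%, ..., 90% of the log, so the
    boundary at 100% (the end of the log) is not a change point.
    """
    n_segments = round(1 / step)
    return [num_traces * k // n_segments for k in range(1, n_segments)]


# The four log sizes listed in the description.
for size in (2500, 5000, 7500, 10000):
    print(size, drift_positions(size))
```

Such a list can serve as ground truth when scoring the change points reported by a detection algorithm, provided the assumption about the placement holds for the actual logs.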

The change patterns applied to the model are the ones from the paper "Change patterns and change support features - Enhancing flexibility in process-aware information systems".


Code duplicates in large code corpora have adverse effects on the evaluation and use of machine learning models that rely on them. Most existing corpora suffer from this problem to some extent. This dataset contains a "duplication" index for some of the existing corpora in Big Code research. The method for collecting this dataset is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].



For each of the existing datasets, a single .json file is provided. Each JSON file has the following format:


[ duplicate_group_1, duplicate_group_2, ...]


where each duplicate group is a list of filenames of that dataset that are near duplicates.


For the corpora that were given as a single file (e.g. Hashimoto et al.) the line number of the original record is given.
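
The format above can be consumed with a few lines of Python. This is a minimal sketch, not the author's tooling: it keeps the first member of every duplicate group and collects the remaining members as candidates for removal. The file names in `sample` and the helper name `files_to_drop` are hypothetical; a real index would be read with `json.load` from the .json file of the corpus of interest.

```python
import json


def files_to_drop(duplicate_groups):
    """Keep the first member of each group; return all others as a set."""
    drop = set()
    for group in duplicate_groups:
        drop.update(group[1:])
    return drop


# Hypothetical content in the documented format: a list of duplicate groups,
# each group being a list of file names (or line numbers for single-file corpora).
sample = json.loads(
    '[["a/Foo.java", "b/Foo.java", "c/Foo_copy.java"],'
    ' ["x/Bar.java", "y/Bar.java"]]'
)

print(sorted(files_to_drop(sample)))
```

Which member of a group to keep is a design choice; keeping the first one is only the simplest option.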


This dataset contains a sequence of network events extracted from a commercial network monitoring platform, Spectrum, by CA. These events, which are categorized by their severity, cover a wide range of situations, from a link state change up to critical CPU usage by certain devices. Regarding the layers they cover, they focus on the physical, network and application layers. As such, the whole set gives a complete overview of the network’s general state.


The dataset consists of a single plain-text file in CSV format. This CSV file contains the following variables:

• Severity: the importance of the event. It is divided into four levels: Blank, Minor, Major and Critical.

• Created On: the date and time when the event was created. The scheme is "month/day/year hour:minute:second".

• Name: (anonymized) name of the device the event happened on.

• EventType: hexadecimal code detailing the category the event pertains to.

• Event: message associated with the event.


Thus, each event is a combination of an event type on a certain device at a certain time; it is described by its severity and explained by the event message.
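
The variables listed above map naturally onto a small loader. The sketch below makes several assumptions that the description does not confirm: the file has a header row with exactly these column names, it is comma-delimited, hours are in 24-hour format, and an empty severity field corresponds to the "Blank" level. The sample rows, device names and event codes are invented for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# "month/day/year hour:minute:second" (24-hour clock is an assumption).
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def read_events(handle):
    """Read events from a CSV handle and parse the 'Created On' timestamps."""
    events = []
    for row in csv.DictReader(handle):
        row["Created On"] = datetime.strptime(row["Created On"], TIME_FORMAT)
        events.append(row)
    return events


# Hypothetical rows; the real file would be opened with open(path, newline="").
sample = io.StringIO(
    "Severity,Created On,Name,EventType,Event\n"
    "Major,03/14/2019 08:15:02,device_17,0x10d35,Link down\n"
    "Critical,03/14/2019 08:16:40,device_04,0x10f12,High CPU usage\n"
    ",03/14/2019 08:17:05,device_17,0x10d36,Link up\n"
)

events = read_events(sample)
# Treat an empty severity field as the "Blank" level (assumption).
severity_counts = Counter(e["Severity"] or "Blank" for e in events)
print(severity_counts)
```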


The compressed file contains:

  • Data files in spreadsheet format from three different networks (friendship, companionship and acquaintances).
  • Analysis files from UCINET, Pajek, Cytoscape and Gephi.

It is thus possible to corroborate the results mentioned in different studies that refer to these data.


OntoSNAQA is a name that combines Social Network Analysis (SNA), People, and Questionnaires (Questions and Answers, QA). This ontology will be updated in this GitHub project and at its URLs. It is an ontology that combines three different domains:

  • People
  • Questionnaires
  • Social Network Analysis terms

The main objective of this ontology is to achieve a completely automated Social Network Analysis.


We introduce a benchmark of distributed algorithms execution over big data. The datasets are composed of metrics about the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster regarding system resource agnostic indicators: CPU consumption, memory usage, operating system processes load, net traffic, and I/O operations. The metrics were collected every five seconds for each algorithm on five different data volume scales, totaling 275 distinct datasets.


The sections below explain the specification of the cluster of machines, the content of the data used to run the distributed machine learning (DML) algorithms, the structure of the data metrics (logs) gathered, and the methods applied to collect the execution logs.


Datasets Structure

Each of the 275 datasets corresponds to one execution of one DML algorithm at one specific volume (scale). To keep the data properly separated, the filenames are organized as <dml_algorithm>_<resource>_<scale>.csv, where:

  • <dml_algorithm> stands for one of the eleven executed techniques (see the section about DML algorithms, below)
  • <resource> stands for disk, CPU, memory, processes, or network (see the section about metrics, below)
  • <scale> stands for b1, b2, gigantic, huge, or large (see the section about data volumes, below)


This way, a file named als_cpu_huge.csv holds the CPU load metrics of the Alternating Least Squares algorithm when applied to a problem at the "huge" scale. Likewise, a file named kpar_mem_B1.csv stores the memory usage metrics of K-means Parallel (dense k-means), and so on.
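
The naming convention above can be decoded programmatically. The following is a minimal sketch under two assumptions: the algorithm, resource and scale tokens are separated by underscores exactly as in the two example names, and letter case in the scale token is not significant (the examples use both "b1" and "B1"). The names `MetricFile` and `parse_metric_filename` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class MetricFile(NamedTuple):
    algorithm: str
    resource: str
    scale: str


def parse_metric_filename(filename: str) -> MetricFile:
    """Split '<dml_algorithm>_<resource>_<scale>.csv' into its three parts."""
    stem = Path(filename).stem
    # Split from the right so an underscore inside the algorithm token survives.
    algorithm, resource, scale = stem.rsplit("_", 2)
    # Lower-case resource and scale so "B1" and "b1" compare equal.
    return MetricFile(algorithm, resource.lower(), scale.lower())


print(parse_metric_filename("als_cpu_huge.csv"))
print(parse_metric_filename("kpar_mem_B1.csv"))
```

A name with fewer than three underscore-separated parts raises a `ValueError`, which makes unexpected files easy to spot when iterating over a directory.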


Computer Cluster Specification

The experiments were hosted on the Google Cloud Dataproc environment. The cluster was composed of eight high-power machines, a master node and seven worker nodes, totaling 128 cores, connected to the same internal network and fully dedicated to the execution of the machine learning algorithms. All nodes had the same specification, as follows.

  • operating system: Debian GNU/Linux 8.10, kernel 3.16.51-3+deb8u1
  • CPU: Intel Xeon @ 2.60 GHz 
  • architecture: x86_64, Little Endian
  • cores: 16; 2 threads per core
  • RAM memory: 106 GB
  • HDD storage: 500 GB
  • SSD storage: 375 GB

The total capacity of the cluster was: 8 nodes, 128 cores, 3.4 Terabytes of total storage capacity, 848 GB of total RAM memory, managed by the computing framework Apache Spark 2.2.0, and cluster manager YARN - Apache Hadoop 2.8.2.


The Metrics

Five system indicators were analyzed: CPU load, network traffic, memory consumption, disk access, and process load, as follows.


CPU (seven dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • system: O.S. work 
  • user: algorithm's work
  • iowait: input/output busy wait time
  • softirq: software interrupt request 
  • idle: idle time


Memory (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • buffer_cache: memory in cache
  • used (O.S. + algorithm): memory used by DML algorithms and O.S.
  • free: available memory  
  • map: number of map operations


Disk (eight dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • bytes_read: data read from storage (MB)
  • bytes_write: data written to the storage (MB)
  • io_read: number of I/O read operations
  • io_write: number of I/O write operations
  • time_spent_read: reading time
  • time_spent_write: writing time


Network (five dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • recv_packets: number of received packets 
  • send_bytes: number of sent bytes
  • send_packets: number of sent packets



Processes (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • load5 **: average load in the last 5 minutes 
  • load10: average load in the last 10 minutes
  • load15: average load in the last 15 minutes
  • proc: number of processes

** The amount of work performed by the system. An idle computer has load 0. Each process increments load by 1.
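
To show how the metric files described above might be summarized, the sketch below averages one column per worker node using only the standard library. It assumes each CSV has a header row whose column names match the dimension names listed above (here the CPU dimensions) and that values are numeric; the sample rows are invented and the helper name `mean_by_node` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def mean_by_node(handle, column):
    """Average one metric column per worker node."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(handle):
        node = int(row["node"])
        totals[node] += float(row[column])
        counts[node] += 1
    return {node: totals[node] / counts[node] for node in sorted(totals)}


# Hypothetical CPU samples for two nodes at two consecutive moments.
sample = io.StringIO(
    "moment,node,system,user,iowait,softirq,idle\n"
    "1,1,2.0,50.0,1.0,0.5,46.5\n"
    "1,2,1.0,70.0,0.0,0.0,29.0\n"
    "2,1,4.0,60.0,1.0,0.5,34.5\n"
    "2,2,3.0,80.0,0.0,0.0,17.0\n"
)

user_load = mean_by_node(sample, "user")
print(user_load)
```

The same helper applies to the memory, disk, network and process files by passing a different column name, as long as those files follow the same layout.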


The DML Algorithms

As stated above, the resource consumption metrics were gathered from the execution of eleven machine learning algorithms. They are:

1) Alternating Least Squares (ALS) - a collaborative filtering technique (collaborative filtering was introduced by the developers of Tapestry [1]). It analyzes relationships between users and items to identify possible new associations, using neighborhood-based or matrix-factorization methods [2]. The Spark implementation is based on the strategy described in [3].

2) Naive Bayes (NB) - a widely used technique for text classification. It computes the conditional probability distribution of the features and applies Bayes’ theorem [4] for prediction. Spark implements the Multinomial Naive Bayes (used in the experiments) and Bernoulli Naive Bayes approaches [5].

3) Gradient Boosted Trees (GBT) - used for classification or regression, the latter applied in the tests. Formed by ensembles that train a sequence of decision trees. The Spark implementation is based on [6,7].

4) Dense k-means (k||) - clustering technique that separates datasets into k partitions. Spark implements a parallel variant of the k-means++ algorithm [8], called k-means|| [9] (k-means parallel or dense k-means).

5) Latent Dirichlet Allocation (LDA) - clustering technique that applies a probabilistic model over discrete data collections, such as a corpus of documents. Used in document modeling, text sorting and collaborative filtering [10].

6) Linear Regression (LinR) - used for regression. Its earliest form was the least squares method, published by Legendre (1805) and Gauss (1809). The simple linear regression case deals with an equation that has, on the right side, an intercept and an explanatory variable with a slope coefficient [11].

7) Logistic Regression (LogR) - the main reasons for choosing the logistic function to analyze dichotomous output variables are its flexibility, its ease of use, and the judicious interpretation it allows [12]. The experiment uses the Spark implementation based on limited-memory BFGS [13] for classification.

8) Principal Component Analysis (PCA) - dimensionality reduction technique that allows transforming a complex dataset into a smaller dimension, revealing hidden structures in the original dataset [14,15].

9) Random Forests (RF) - a nonparametric statistical method used for regression and classification, based on decision trees and bootstrap [16]. An extensive technical discussion about RFs for Big Data is available in [17].

10) Singular Value Decomposition (SVD) - dimensionality reduction technique proposed by [18]. The Spark implementation is based on matrix optimization techniques on clusters, described in [19].

11) Support Vector Machine (SVM) - a model, used for regression and classification (used in this paper), based on high-dimensionality hyperplane construction [20]. The Spark implementation uses Linear SVM for binary classification.


The Data and the Volume (Scales) Used in the DML Algorithms

We used synthetic data (generated by the Intel HiBench framework) because of the barriers to (i) meeting the data requirements of each algorithm, (ii) adjusting the volume of data in each scale, and (iii) transferring the data from the source to the internal network of the cluster. The scales, in ascending order of size, are: large (L), huge (H), gigantic (G), big data 1 (B1), and big data 2 (B2). Scales B1 and B2 have the same volume and no content variation, whereas scales G, H, and L vary in content. The size of the data for each experiment is detailed below, in megabytes, in the order L, H, G, B1, and B2 (same size as B1):

  • NB - 359, 1792, 3594, 71885, 71885
  • LogR - 7629, 22886, 38144, 53402, 53402
  • SVM - 19077, 109875, 149544, 171674, 171674
  • RF - 8, 15258, 22886, 33567, 33567
  • LDA - 245, 653, 1976, 4260, 4260
  • k|| - 3830, 19149, 38308, 229816, 229816
  • GBT - 15, 31, 61, 92, 92
  • LinR - 45783, 114463, 305203, 762993
  • ALS - 115, 688, 1372, 1720, 1720
  • PCA - 31, 183, 229, 257, 257
  • SVD - 61, 191, 275, 374, 374

All metrics in the benchmark datasets are gathered from the execution of the above-specified DML algorithms in each one of these scales.



[1] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM. 1992;35(12):61–70.

[2] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8).

[3] Hu Y, Koren Y, Volinsky C. Collaborative filtering for implicit feedback datasets. In: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on; Ieee; 2008. p.


[4] Vapnik VN, Vapnik V. Statistical learning theory. Vol. 1. Wiley New York; 1998.

[5] Sanderson M, Christopher D, Manning H, et al. Introduction to information retrieval. Natural Language Engineering. 2010;16(1):100.

[6] Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001;:1189–1232.

[7] Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis. 2002;38(4):367–378.

[8] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms; Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.

[9] Bahmani B, Moseley B, Vattani A, et al. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.

[10] Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3(Jan):993–1022.

[11] Yan X, Su X. Linear regression analysis: theory and computing. World Scientific; 2009.

[12] Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol. 398. John Wiley & Sons; 2013.

[13] Spark. Linear Methods - RDD-based API - Logistic Regression; 2017. Available at:

[14] Shlens J. A Tutorial on Principal Component Analysis. Epidemiology. 2005;2(c):223–228.

[15] Jolliffe IT. Principal Component Analysis, Second Edition. Encyclopedia of Statistics in Behavioral Science. 2002;30(3):487.

[16] Breiman L. Random forests. Machine learning. 2001;45(1):5–32.

[17] Genuer R, Poggi JM, Tuleau-Malot C, et al. Random forests for big data. Big Data Research. 2017;9:28–46.

[18] Lehoucq RB, Sorensen DC, Yang C. ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. Vol. 6. SIAM; 1998.

[19] Bosagh Zadeh R, Meng X, Ulanov A, et al. Matrix Computations and Optimization in Apache Spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16; New York, New York, USA. ACM Press; 2016. p. 31–38.

[20] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.


SDTwittC consists of 200 authors evenly balanced by gender (100 for each). We identified the gender of the tweeters via their names and profile pictures. As potential copy-and-paste texts, both tweets and retweets were discarded from the outset; only replies were compiled. The number of replies for each author varies from hundreds to thousands. Male authors produced 233926 replies, whereas the female group produced 219740 replies.