This pre-trained Word2Vec model has 300-dimensional vectors for more than 0.5 million Nepali words and phrases. A separate Nepali language text corpus was created using the news contents freely available in the public domain. The text corpus contained more than 90 million running words. The "Nepali Text Corpus" can be accessed freely from


from gensim.models import KeyedVectors

# Load vectors
model = KeyedVectors.load_word2vec_format('.../path/to/nepali_embeddings_word2vec.txt', binary=False)

# Find the similarity between two words
model.similarity('', '')

# Find the most similar words
model.most_similar('')

# Try some vector arithmetic with Nepali words
model.most_similar(positive=['', ''], negative=[''], topn=1)

The design of the Nepali text corpus and the training of the Word2Vec model were done at the Database Systems and Artificial Intelligence Lab, School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi.


This data set comprises 4223 videos from a laser surface heat treatment process (also called laser heat treatment) applied to cylindrical workpieces made of steel. The purpose of the dataset is to detect anomalies in the laser heat treatment by learning a model from a set of non-anomalous videos. In the laser heat treatment, the laser beam follows a pattern similar to an "eight" at a frequency of 100 Hz. This pattern is sometimes modified to avoid obstacles in the workpieces. The videos are recorded with a thermal camera at 1000 frames per second.
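As a quick arithmetic check on the figures above, one repetition of the laser pattern spans about ten frames at the nominal values. The helper name below is hypothetical and not part of the dataset, and the result only holds where the pattern has not been modified.

```python
def frames_per_cycle(frame_rate_hz: float, pattern_freq_hz: float) -> float:
    """Number of video frames recorded during one repetition of the laser pattern."""
    if pattern_freq_hz <= 0:
        raise ValueError("pattern frequency must be positive")
    return frame_rate_hz / pattern_freq_hz


# Nominal values from the description: 1000 frames per second, 100 Hz pattern.
print(frames_per_cycle(1000, 100))  # 10.0
```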


See for details on the structure of the dataset.


This is the dataset for the manuscript entitled "Physics-prior Bayesian neural networks in semiconductor processing", IEEE Access


This contains data for ISFET-based pH sensor drift compensation using machine learning techniques.


Database for FMCW THz radars (HR workspace) and sample code for federated learning 


Reinforcement Learning (RL) agents can learn to control a nonlinear system without using a model of the system. However, having a model brings benefits, mainly in terms of a reduced number of unsuccessful trials before achieving acceptable control performance. Several modelling approaches have been used in the RL domain, such as neural networks, local linear regression, or Gaussian processes. In this article, we focus on a technique that has not been used much so far: symbolic regression, based on genetic programming.


Real-life business processes change over time, in both planned and unexpected ways. These changes are called concept drifts, and their detection is a major challenge in process mining, since the inherent complexity of the data makes it difficult to distinguish between a change and an anomalous execution. The following logs were generated synthetically in order to assess the quality of different concept drift detection algorithms.


The log files are available in 4 different sizes: 2500, 5000, 7500 and 10000 traces.

Each log has a sudden drift at every 10% of the log.

The change patterns applied to the model are the ones from the paper "Change patterns and change support features - Enhancing flexibility in process-aware information systems".
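The 10% drift spacing described above can be sketched as follows. This is only an illustration. It assumes the drifts fall at 10%, 20%, ..., 90% of the log (nine drifts per log) and that traces are counted from 0; the description states neither explicitly. The function name is hypothetical.

```python
from typing import List


def drift_positions(num_traces: int, step_percent: int = 10) -> List[int]:
    """Trace indices at which a drift is assumed to start (10%, 20%, ..., 90%)."""
    return [num_traces * p // 100 for p in range(step_percent, 100, step_percent)]


# The four log sizes listed in the description.
for size in (2500, 5000, 7500, 10000):
    print(size, drift_positions(size))
```

These positions can serve as ground truth when scoring the change points reported by a detection algorithm.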


Code duplicates in large code corpora have adverse effects on the evaluation and use of machine learning models that rely on them. Most existing corpora suffer from this problem to some extent. This dataset contains a "duplication" index for some of the existing corpora in Big Code research. The method for collecting this dataset is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [ArXiV, to appear in SPLASH 2019].



For each of the existing datasets, a single .json file is provided. Each JSON file has the following format:


[ duplicate_group_1, duplicate_group_2, ...]


where each duplicate group is a list of filenames of that dataset that are near duplicates.


For the corpora that were given as a single file (e.g. Hashimoto et al.) the line number of the original record is given.
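One possible way to use the index is to keep a single representative of each duplicate group and drop the rest. The sketch below does this on a small inline example in the format described above. The file names are made up, and the choice to keep the first entry of each group is an assumption, not something the dataset prescribes.

```python
import json
from typing import List, Set


def files_to_drop(duplicate_groups: List[List[str]]) -> Set[str]:
    """Keep the first entry of each duplicate group and return all the others."""
    drop: Set[str] = set()
    for group in duplicate_groups:
        drop.update(group[1:])
    return drop


# Tiny inline example in the documented format; the file names are invented.
example_json = '[["a/Foo.java", "b/Foo.java"], ["x.py", "y.py", "z.py"]]'
groups = json.loads(example_json)
print(sorted(files_to_drop(groups)))  # ['b/Foo.java', 'y.py', 'z.py']
```

For a real index file, `json.load` on the opened file would replace `json.loads`. For corpora distributed as a single file, the group entries are line numbers rather than filenames, so the same logic applies to record positions.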


This dataset contains a sequence of network events extracted from a commercial network monitoring platform, Spectrum, by CA. These events, which are categorized by their severity, cover a wide range, from a link state change up to critical CPU usage by certain devices. They focus on the physical, network and application layers. As such, the whole set gives a complete overview of the network's general state.


The dataset consists of a single plain text file in CSV format. This CSV file contains the following variables:

• Severity: the importance of the event. It is divided into four different levels: Blank, Minor, Major and Critical.

• Created On: the date and time when the event was created. The scheme is "month/day/year hour:minute:second".

• Name: (anonymized) name of the device the event happened on.

• EventType: hexadecimal code detailing the category the event pertains to.

• Event: message associated with the event.


Thus, a given event is a combination of an event type on a certain device at a certain time. It is described by its severity and explained by the event message.
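The variables listed above can be read with a few lines of Python. This is a minimal sketch. It assumes a header row with exactly these column names, a comma delimiter, and a four-digit year in "Created On"; none of these is confirmed by the description. The sample rows, device names and event codes are invented for illustration.

```python
import csv
import io
from datetime import datetime

# Two made-up rows in the documented column layout (header row assumed).
SAMPLE = (
    "Severity,Created On,Name,EventType,Event\n"
    "Major,03/15/2019 14:02:31,device-17,0x10d35,Link down\n"
    "Critical,03/15/2019 14:05:00,device-02,0x10f06,High CPU usage\n"
)


def read_events(handle):
    """Parse rows into dicts, converting 'Created On' to datetime and EventType to int."""
    events = []
    for row in csv.DictReader(handle):
        # Scheme from the description: month/day/year hour:minute:second.
        row["Created On"] = datetime.strptime(row["Created On"], "%m/%d/%Y %H:%M:%S")
        # EventType is a hexadecimal code; base 16 accepts it with or without "0x".
        row["EventType"] = int(row["EventType"], 16)
        events.append(row)
    return events


events = read_events(io.StringIO(SAMPLE))
print(events[0]["Created On"].isoformat())  # 2019-03-15T14:02:31
```

Passing an opened file object instead of the `io.StringIO` wrapper would read the actual dataset file.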


The compressed file contains:

  • Data files in spreadsheet format from three different networks (friendship, companionship and acquaintances).
  • Analysis files from UCINET, Pajek, Cytoscape and Gephi.

It is thus possible to corroborate the results mentioned in different studies that refer to these data.