Machine Learning
PT7 Web is an annotated Portuguese language Corpus built from samples collected from Sep 2018 to Mar 2020 from seven Portuguese-speaking countries: Angola, Brazil, Portugal, Cape Verde, Guinea-Bissau, Macao e Mozambique. The records were filtered from Common Crawl — a public domain petabyte-scale dataset of webpages in many languages, mixed together in temporal snapshots of the web, monthly available [1]. The Brazilian pages were labeled as the positive class and the others as the negative class (non-Brazillian Portuguese).
- Categories:
Parallel sentences in English and French, with mathematical expressions tokenized. The French sentences were extracted from course notes on error-correcting codes authored by Dr. Monica Nevins, University of Ottawa.
- Categories:
From state-of-the-art visualization algorithms, we distill six working principles which are, by hypothesis, sufficient to produce visual projections qualitatively similar to those obtained with these state-of-the-art algorithms. These working principles are presented through the geometrical reasoning of the classical Multidimensional Scaling algorithm, and their effectiveness is illustrated through a novel straightforward algorithm for image visualization.
- Categories:
The dataset has Gaussian Blobs of varying samples, centers and features. The number of samples ranges from 500 to 50,000. Similarly, the number of centers varies from 2 to 100, while the number of features varies from 2 to 2048. These different sets of Gaussian blobs can be used for testing clustering algorithms for their scalability and effectiveness. There are two kinds of files inside the compressed sets. Files ending with "_X.csv" consist of datapoints, while the files ending with "_y.csv" represent respective class data.
- Categories:
The dataset is composed of digital signals obtained from a capacitive sensor electrodes that are immersed in water or in oil. Each signal, stored in one row, is composed of 10 consecutive intensity values and a label in the last column. The label is +1 for a water-immersed sensor electrode and -1 for an oil-immersed sensor electrode. This dataset should be used to train a classifier to infer the type of material in which an electrode is immersed in (water or oil), given a sample signal composed of 10 consecutive values.
- Categories:
Optical Character Recognition (OCR) system is used to convert the document images, either printed or handwritten, into its electronic counterpart. But dealing with handwritten texts is much more challenging than printed ones due to erratic writing style of the individuals. Problem becomes more severe when the input image is doctor's prescription. Before feeding such image to the OCR engine, the classification of printed and handwritten texts is a necessity as doctor's prescription contains both handwritten and printed texts which are to be processed separately.
- Categories:
Annotated image dataset of household objects from the RoboFEI@Home team
This data set contains two sets of pictures of household objects, created by the RoboFEI@Home team to develop object detection systems for a domestic robot.
The first data set was created with objects from a local supermarket. Product brands are typical from Brazil. The second data set is composed of objects from the RoboCup@Home 2018 OPL competition.
- Categories:
This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.
- Categories:
Most text-simplification systems require an indicator of the complexity of the words. The prevalent approaches to word difficulty prediction are based on manual feature engineering. Using deep learning based models are largely left unexplored due to their comparatively poor performance. We have explored the use of one of such in predicting the difficulty of words. We have treated the problem as a binary classification problem. We have trained traditional machine learning models and evaluated their performance on the task.
- Categories:
DataSet used in learning process of the traditional technique's operation, considering different devices and scenarios, the proposed approach can adapt its response to the device in use, identifying the MAC layer protocol, perform the commutation through the protocol in use, and make the device to operate with the best possible configuration.
- Categories: