The greatest challenge in machine learning problems is selecting suitable techniques and resources, such as tools and datasets. Despite millions of speakers around the globe and a rich literary history of more than a thousand years, computational linguistic work on the Punjabi Shahmukhi script is hard to find. Shahmukhi is a context-specific Perso-Arabic script, and Punjabi written in it remains a low-resource language. The selection of the best algorithm for a machine learning problem depends heavily on the availability of a dataset for that specific task.


Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based Sentiment Analysis (ABSA). It aims to extract aspect-opinion pairs from a sentence and identify the sentiment polarity associated with each pair. For instance, given the sentence "Large rooms and great breakfast", ASTE outputs the triplets T = {(rooms, large, positive), (breakfast, great, positive)}. Although several approaches to ABSA have recently been proposed, those for Portuguese have mostly been limited to extracting aspects, without addressing the ASTE task.
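As a minimal sketch of the output format described above, the triplets for the example sentence can be held in a small typed structure. The `Triplet` type and the `spans_in_sentence` helper are hypothetical names introduced for illustration and do not come from any ASTE library. No extraction is performed here: the triplets are written by hand from the example.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    """One ASTE output item (hypothetical type, for illustration only)."""
    aspect: str
    opinion: str
    polarity: str  # e.g. "positive", "negative", or "neutral"


def spans_in_sentence(sentence: str, triplets: List[Triplet]) -> bool:
    """Check that every aspect and opinion term literally occurs in the sentence."""
    lowered = sentence.lower()
    return all(
        t.aspect.lower() in lowered and t.opinion.lower() in lowered
        for t in triplets
    )


sentence = "Large rooms and great breakfast"
# Hand-written triplets taken from the example in the text, not model output.
triplets = [
    Triplet("rooms", "large", "positive"),
    Triplet("breakfast", "great", "positive"),
]
```

A check such as `spans_in_sentence(sentence, triplets)` is one simple way to validate that annotated aspect and opinion terms are actually present in the source sentence.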


Dataset associated with a paper in the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems:

"Talk the talk and walk the walk: Dialogue-driven navigation in unknown indoor environments"

If you use this code or data, please cite the above paper.



See the docs directory.


The General Data Protection Regulation (GDPR), enforceable since 2018, profoundly impacts information-processing organizations, which must comply with it. In this research, we consider GDPR compliance a high-level goal that should be addressed at the outset of software development, that is, during requirements engineering (RE). We hypothesize that Natural Language Processing (NLP) can offer a viable means to automate this process.


Wine has been popular with the public for centuries, and the market offers a wide variety of wines to choose from. Among all wine regions, Bordeaux, France, is considered the most famous in the world. In this paper, we try to understand Bordeaux wines made in the 21st century through a Wineinformatics study. We developed and studied two datasets: the first contains all Bordeaux wines from 2000 to 2016, and the second contains all wines listed in a famous collection of Bordeaux wines, the 1855 Bordeaux Wine Official Classification, from 2000 to 2016.


The dataset consists of Wine Spectator reviews of Bordeaux wines, written in natural language, from 2000 to 2016. A total of 14,349 wines were collected: 4,263 wines scored 90/100 or above and 10,086 wines scored 89/100 or below. Detailed information is available in the paper. The reviews were processed by the Computational Wine Wheel to produce the uploaded dataset. The first attribute is the name of the wine, the second is its vintage, the third is the score given by Wine Spectator, and the fourth is the price. A price of $NA indicates that the price was not available at the time the wine was reviewed. The remaining attributes are characteristics describing the wine, each with a true/false value.
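The attribute layout described above could be read along the following lines. This is only a sketch under stated assumptions: the file format (comma-separated with a header row), the column names, the `$` price prefix, the lowercase `true`/`false` encoding, and both sample rows are hypothetical and are not taken from the actual dataset. Only the attribute order, the `$NA` marker, and the 90-and-above versus 89-and-below split come from the description.

```python
import csv
import io

# Hypothetical sample in the described layout: name, vintage, score, price,
# then true/false characteristic columns. Rows and column names are invented
# for illustration and are not real dataset records.
SAMPLE = """name,vintage,score,price,CHERRY,OAK
Example Chateau A,2005,93,$45,true,false
Example Chateau B,2012,87,$NA,false,true
"""

FIXED_COLUMNS = {"name", "vintage", "score", "price"}


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed fields."""
    price_text = row["price"]
    # "$NA" marks a price that was unavailable when the wine was reviewed.
    price = None if price_text == "$NA" else float(price_text.lstrip("$"))
    score = int(row["score"])
    return {
        "name": row["name"],
        "vintage": int(row["vintage"]),
        "score": score,
        "price": price,
        # The two groups in the description: 90 and above vs. 89 and below.
        "is_90_plus": score >= 90,
        # Every remaining column is treated as a true/false characteristic.
        "attributes": {
            key: value.strip().lower() == "true"
            for key, value in row.items()
            if key not in FIXED_COLUMNS
        },
    }


wines = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Mapping `$NA` to `None` rather than zero keeps missing prices from being mistaken for real values in later statistics.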


For publications, please cite the following papers:

Dong, Zeqing, Xiaowan Guo, Syamala Rajana, and Bernard Chen. "Understanding 21st Century Bordeaux Wines from Wine Reviews Using Naïve Bayes Classifier." Beverages 6, no. 1 (2020): 5.

Chen, Bernard, Christopher Rhodes, Aaron Crawford, and Lorri Hambuchen. "Wineinformatics: applying data mining on wine sensory reviews processed by the computational wine wheel." In 2014 IEEE International Conference on Data Mining Workshop, pp. 142-149. IEEE, 2014.

Chen, Bernard, Christopher Rhodes, Alexander Yu, and Valentin Velchev. "The Computational Wine Wheel 2.0 and the TriMax Triclustering in Wineinformatics." In Industrial Conference on Data Mining, pp. 223-238. Springer, Cham, 2016.


The age of Artificial Intelligence (AI) is coming. Since Natural Language Processing (NLP) is a core AI technology for communication between humans and devices, it is vital to understand its technological trends. Early NLP research focused on syntactic processing, such as information extraction and subject modeling, but later developed into semantically oriented analysis. To analyze technological trends in NLP, especially in semantic analysis, patent data, which contain objective and extensive information, are analyzed.


This dataset page is currently being updated. The tweets collected by the model deployed at are shared here. However, because of COVID-19, all computing resources I have are being used for a dedicated collection of the tweets related to the pandemic. You can go through the following datasets to access those tweets:


A benchmark dataset is required for any classification or recognition system. To the best of our knowledge, no benchmark dataset for handwritten character recognition of the Manipuri Meetei-Mayek script exists in the public domain so far. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and one of the languages listed in the Eighth Schedule of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India.