artificial intelligence; machine learning; natural language processing;

Gramatika is a synthetic grammatical error correction (GEC) dataset for Indonesian. It contains 1.5 million sentences with a total of 4,666,185 errors. Only 30,000 sentences (2%) are correct, with no errors. Each sentence contains at most 6 errors and at most 2 errors of the same type. The dataset is divided into train, dev, and test splits in an 8:1:1 ratio (1,199,705, 150,171, and 150,124 sentences, respectively).
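
The figures in the Gramatika description can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset's tooling: the split sizes come from the description, while the function names and the representation of a sentence's errors as a list of error-type labels are assumptions made for illustration only.

```python
from collections import Counter

# Split sizes as reported in the dataset description.
SPLITS = {"train": 1_199_705, "dev": 150_171, "test": 150_124}


def split_fractions(splits):
    """Return the total sentence count and each split's share of it."""
    total = sum(splits.values())
    return total, {name: size / total for name, size in splits.items()}


def satisfies_error_limits(error_types, max_errors=6, max_per_type=2):
    """Check the per-sentence limits stated in the description.

    `error_types` is a hypothetical representation: one label per error
    in a single sentence. The actual dataset format may differ.
    """
    if len(error_types) > max_errors:
        return False
    counts = Counter(error_types)
    return all(n <= max_per_type for n in counts.values())


total, fractions = split_fractions(SPLITS)
print(total)  # 1500000
print({name: round(share, 3) for name, share in fractions.items()})
```

The three splits sum to exactly 1,500,000 sentences, and each share is within a fraction of a percent of the nominal 8:1:1 ratio. A call such as `satisfies_error_limits(["A", "A", "B"])` returns `True`, while three errors of the same type or more than six errors in total return `False`.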

Categories:
14 Views

The Thai Deaf Corpus (TDC) is constructed from a writing activity where deaf students randomly select picture words using the image picker wheel, then write sentences corresponding to these words on the writing sheet. The sentences are transcribed and corrected manually to create the TDC.

Categories:
236 Views

An AI-based Ancient Hebrew Language Translator aims to revive Ancient Hebrew by constructing a comprehensive dataset of contemporary and ancient Hebrew samples. The Google Vision API is integrated to provide Optical Character Recognition (OCR) for image processing. Translation through the model starts from English, leading to a multilingual interface. This initiative represents a crucial step in preserving ancient languages in the digital age.

Categories:
10 Views