Linear Code Sentences English/French

Citation Author(s):
Tanya Schmah, University of Ottawa
Aditya Ohri, University of Ottawa
Submitted by: Tanya Schmah
Last updated: Tue, 05/17/2022 - 22:21
DOI: 10.21227/9h6z-z514

Abstract 

Parallel sentences in English and French, with mathematical expressions tokenized. The French sentences were extracted from course notes on error-correcting codes authored by Dr. Monica Nevins, University of Ottawa.

Instructions: 

The dataset consists of three files of "parallel sentences":

linearcode160-en.txt

linearcode160-fr-target.txt

linearcode160-fr-polymath.txt

"Parallel sentences" means that, for every line number i from 1 to 160, line i of each of the three files is meant to convey the same meaning. When the dataset is used in a machine translation task, the first two files (English and French) should be used.

These files were constructed as follows. The French sentences were extracted from course notes on error-correcting codes authored by Dr. Monica Nevins, University of Ottawa, who has given permission for these sentences to be published in this form. The sentences were extracted, the math expressions were tokenized, and the sentences were translated into English by Aditya Ohri using the PolyMath Translator and Google Translate. "Tokenized" means that each mathematical expression has been converted into a unique math token of the form MATHnX, where n is the token number. Tanya Schmah manually corrected the English translations and also corrected the format of some English and French sentences.
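As an illustration of the token format, the sketch below pulls the token numbers out of a sentence and checks whether two sentences carry the same math tokens. It assumes that the "X" in MATHnX is a literal letter and that n is written in decimal digits (for example MATH12X); the dataset description does not spell this out, so adjust the pattern if the files differ. The function names are hypothetical.

```python
import re
from typing import List

# Assumption: a math token is "MATH", then decimal digits, then a literal "X".
MATH_TOKEN = re.compile(r"MATH(\d+)X")


def math_token_numbers(sentence: str) -> List[int]:
    """Return the number n of every MATHnX token, in order of appearance."""
    return [int(n) for n in MATH_TOKEN.findall(sentence)]


def same_math_tokens(source: str, target: str) -> bool:
    """Return True if both sentences contain the same math tokens,
    counting repeats but ignoring word order."""
    return sorted(math_token_numbers(source)) == sorted(math_token_numbers(target))
```

A check like this may help spot lines where a translation dropped or duplicated a math token, since word order can legitimately differ between English and French.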

The third file, linearcode160-fr-polymath.txt, is the uncorrected output of the PolyMath Translator, which translated the first file from English to French.
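The line-by-line correspondence described above can be checked when loading the data. The following is a minimal sketch, assuming the three files sit in one directory, are UTF-8 encoded, and hold one sentence per line; the encoding and the function names are assumptions, not part of the dataset documentation.

```python
from pathlib import Path
from typing import List, Tuple

EXPECTED_LINES = 160  # line count stated in the dataset description

FILE_NAMES = [
    "linearcode160-en.txt",
    "linearcode160-fr-target.txt",
    "linearcode160-fr-polymath.txt",
]


def read_lines(path: Path) -> List[str]:
    # Assumption: UTF-8 text, one sentence per line.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(data_dir: str, expected: int = EXPECTED_LINES) -> List[Tuple[str, str, str]]:
    """Return (English, French target, French PolyMath output) triples,
    one triple per line number."""
    root = Path(data_dir)
    columns = [read_lines(root / name) for name in FILE_NAMES]
    for name, lines in zip(FILE_NAMES, columns):
        if len(lines) != expected:
            raise ValueError(f"{name}: expected {expected} lines, found {len(lines)}")
    return list(zip(*columns))
```

For a machine translation task, only the first two elements of each triple are needed, for example `pairs = [(en, fr) for en, fr, _ in load_parallel(data_dir)]`, where `data_dir` is wherever the files were downloaded.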
