Open Access
Linear Code Sentences English/French
- Submitted by: Tanya Schmah
- Last updated: Tue, 05/17/2022 - 22:21
- DOI: 10.21227/9h6z-z514
Abstract
Parallel sentences in English and French, with mathematical expressions tokenized. The French sentences were extracted from course notes on error-correcting codes authored by Dr. Monica Nevins, University of Ottawa.
The dataset consists of three files of "parallel sentences":
linearcode160-en.txt
linearcode160-fr-target.txt
linearcode160-fr-polymath.txt
"Parallel sentences" means that, for every line number i from 1 to 160, line i of each of the three files is meant to convey the same meaning. When the dataset is used in a machine translation task, the first two files (English and French) should be used.
These files were constructed as follows. The French sentences were extracted from course notes on error-correcting codes authored by Dr. Monica Nevins, University of Ottawa, who has given permission for these sentences to be published in this form. The sentences were extracted, the math expressions tokenized, and the sentences translated into English by Aditya Ohri using the PolyMath Translator and Google Translate. "Tokenized" means that each mathematical expression has been converted into a unique math token of the form MATHnX, where n is the token number. Tanya Schmah manually corrected the English translations and also corrected the format of some English and French sentences.
The third file is the uncorrected output of the PolyMath Translator, which translated the first file from English to French.
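The line-by-line correspondence and the MATHnX token format described above can be checked with a short script. The following is a minimal sketch, not part of the dataset: the helper names (read_lines, pair_sentences, math_tokens, token_mismatches) are hypothetical, UTF-8 encoding is assumed, and the regular expression assumes that n is written as a decimal number. Whether both sides of a pair always carry the same token numbers is also an assumption, so token_mismatches should be read as a sanity check rather than a guarantee.

```python
import re
from pathlib import Path

# Assumed shape of a math token: "MATH", a decimal token number n, then "X".
MATH_TOKEN = re.compile(r"MATH(\d+)X")


def read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines (UTF-8 is an assumption)."""
    return Path(path).read_text(encoding=encoding).splitlines()


def pair_sentences(source, target):
    """Pair line i of source with line i of target; fail if lengths differ."""
    if len(source) != len(target):
        raise ValueError(f"line counts differ: {len(source)} vs {len(target)}")
    return list(zip(source, target))


def math_tokens(sentence):
    """Return the sorted, de-duplicated token numbers found in a sentence."""
    return sorted({int(n) for n in MATH_TOKEN.findall(sentence)})


def token_mismatches(pairs):
    """Return 1-based line numbers whose two sides use different token numbers."""
    return [
        i
        for i, (src, tgt) in enumerate(pairs, start=1)
        if math_tokens(src) != math_tokens(tgt)
    ]


# Example usage, assuming the files are in the working directory:
# en = read_lines("linearcode160-en.txt")
# fr = read_lines("linearcode160-fr-target.txt")
# pairs = pair_sentences(en, fr)
# print(len(pairs), token_mismatches(pairs))
```

Keeping the pairing and token checks as pure functions on lists of strings means the same code can compare the English file against either French file.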
Dataset Files
- linearcode160-en.txt (13.72 kB)
- linearcode160-fr-target.txt (15.18 kB)
- linearcode160-fr-polymath.txt (15.17 kB)