Morse Code Symbol Classification

Abstract

Morse code is a system of communication using dots and dashes to represent numbers, letters and symbols. For example, the letter 'B' is represented as a dash followed by 3 dots, i.e. "–...". The dataset used in this competition is synthetically generated, and mimics a human writing dots and dashes on a piece of paper. In this sense, each sample is like a 1-dimensional version of an image represented by numeric pixel values. The challenge is to classify each 1-dimensional input into 1 out of 64 classes, which represent various letters, numbers and symbols.

The intensity of the pen on paper is represented by numbers. A single data point has 64 numerical features, with long runs of consecutive high-intensity features denoting dashes, shorter runs of consecutive high-intensity features denoting dots, and runs of consecutive low-intensity features denoting spaces. For example, the "–..." of 'B' is 'leading space, dash, space, dot, space, dot, space, dot, trailing space'. This could be represented as [(0.1,0.2,0,...), (0.8,0.9,0.8,0.7,1,0.8), (0.1), (0.9,0.9,0.8), (0.3,0.2,0), (0.6,0.9), (0.4,0), (1), (0,0.4,0.2,...)]. Notice that dots and spaces have roughly the same length, while dashes are longer. Leading and trailing spaces pad each input sample to exactly 64 features.
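To make the encoding concrete, here is a minimal sketch that thresholds the intensities and converts runs of high-intensity features into dots and dashes. The function name decode_runs, the threshold of 0.5 and the dash cutoff of 4 features are assumptions chosen for this illustration; they are not part of the dataset or its repository. The sketch prints an ASCII hyphen for a dash.

```python
import numpy as np

def decode_runs(features, threshold=0.5, dash_min_len=4):
    """Turn a 1-D intensity vector into a dot/dash string.

    threshold and dash_min_len are illustrative assumptions,
    not values defined by the dataset.
    """
    high = np.asarray(features, dtype=float) > threshold
    symbols = []
    run = 0  # length of the current run of high-intensity features
    for is_high in high:
        if is_high:
            run += 1
        elif run > 0:
            # A low-intensity feature ends the run: classify it by length.
            symbols.append("-" if run >= dash_min_len else ".")
            run = 0
    if run > 0:  # a run that extends to the last feature
        symbols.append("-" if run >= dash_min_len else ".")
    return "".join(symbols)

# The 'B' example from the text, with the elided padding left out.
example_b = [0.1, 0.2, 0,                    # leading space
             0.8, 0.9, 0.8, 0.7, 1, 0.8,     # dash
             0.1,                            # space
             0.9, 0.9, 0.8,                  # dot
             0.3, 0.2, 0,                    # space
             0.6, 0.9,                       # dot
             0.4, 0,                         # space
             1,                              # dot
             0, 0.4, 0.2]                    # trailing space
print(decode_runs(example_b))  # -...
```

A fixed threshold like this is unlikely to be reliable on the competition data, where added noise and overlapping dot and dash lengths are meant to defeat simple rules; it is shown only to illustrate how the features map to symbols.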

Instructions:

DATASET: https://github.com/souryadey/morse-dataset/blob/master/difficult.npz

SCRIPT: https://github.com/souryadey/morse-dataset/blob/master/load_data.py

The dataset used in this competition has 320k training samples, 64k validation samples, and 64k test samples. These are provided in difficult.npz, with keys titled xtr, ytr, xva, yva, xte, yte, i.e. input x and output y data for training, validation and test. Each input sample has 64 numerical features. (These represent a single Morse symbol with leading and trailing spaces, plus added noise that makes spaces easy to confuse with dots and dashes. Dash length is between 3 and 9 features, while dots and spaces are between 1 and 3 features.) Each output sample is a one-hot encoding of 1 out of 64 classes. Loading the data requires Python and NumPy.
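As a rough sketch of how the splits could be read with NumPy, assuming difficult.npz has been downloaded locally and contains the six keys listed above. The helper names load_splits and one_hot_to_labels are made up for this example; this is not the provided load_data.py script.

```python
import numpy as np

SPLIT_KEYS = ("xtr", "ytr", "xva", "yva", "xte", "yte")

def load_splits(path="difficult.npz"):
    """Read the six arrays from the .npz archive into a dict."""
    with np.load(path) as data:
        return {key: data[key] for key in SPLIT_KEYS}

def one_hot_to_labels(y):
    """Convert one-hot rows of shape (N, 64) to integer labels of shape (N,)."""
    return np.argmax(y, axis=1)

# Example usage (requires the downloaded file):
# splits = load_splits("difficult.npz")
# print(splits["xtr"].shape, splits["ytr"].shape)
# ytr_labels = one_hot_to_labels(splits["ytr"])
```

Converting the one-hot outputs to integer labels is convenient because the submission format also expects whole-number labels.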

The evaluation metric is classification accuracy. To make a submission, provide a Python script titled main.py which accepts 1 argument, XTE_FINAL. This will be of exactly the same format and size as xte, i.e. a NumPy array of shape (64000, 64). We will run your script as follows:

>>> main.py(XTE_FINAL)

Your code must output an array of 64000 predictions, each element being a whole number between 0 and 63 denoting the predicted class label. We will compare this with our own test labels, and the resulting accuracy will be your score.
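The expected output format and the scoring can be sketched as follows. This is only an outline under stated assumptions: the function main here takes the input array and returns the labels, and _placeholder_scores is a hypothetical stand-in (an untrained, randomly initialised linear layer) that should be replaced by a trained classifier. The accuracy helper is likewise an illustration of the metric, not the organisers' scoring code.

```python
import numpy as np

NUM_CLASSES = 64

def _placeholder_scores(x):
    """Hypothetical stand-in for a trained model: untrained random linear scores."""
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((x.shape[1], NUM_CLASSES))
    return x @ weights

def main(XTE_FINAL):
    """Return one integer label in [0, 63] per input row."""
    x = np.asarray(XTE_FINAL, dtype=float)
    scores = _placeholder_scores(x)                    # shape (N, 64)
    preds = np.argmax(scores, axis=1).astype(np.int64)
    assert preds.shape == (x.shape[0],)                # one prediction per sample
    return preds

def accuracy(pred_labels, true_labels):
    """Fraction of predictions that match the integer ground-truth labels."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(pred_labels == true_labels))
```

If the ground truth is one-hot encoded like yte, it can be converted to integer labels with np.argmax(yte, axis=1) before calling accuracy.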

A script is provided to extract the data from the npz file. For more details on dataset generation, refer to the GitHub repository morse-dataset: https://github.com/souryadey/morse-dataset, and the award-winning research paper 'Morse Code Datasets for Machine Learning': https://ieeexplore.ieee.org/document/8494011

Good luck!