By: Rabindra Lamsal, Database Systems and Artificial Intelligence Lab at Jawaharlal Nehru University, New Delhi
During a crisis, many people use Twitter to quickly communicate with friends, family, and even local law enforcement. With so much information flowing through this public platform, our goal at the Database Systems and Artificial Intelligence Lab at Jawaharlal Nehru University, New Delhi was to design a useful way to classify it. More specifically, we wanted to develop a disaster response system that could classify crisis-related tweets into categories such as community needs, loss of life, or damages.
To start, the system needed a way to filter out news-specific tweets related to the particular event. The remaining tweets could then either be routed to an appropriate department or mined for geo-location information to tentatively sketch the critically affected area. All of this is done by a deep learning classifier running on the backend of the disaster response system, which classifies tweets obtained from the real-time Twitter stream.
For a year and a half, we worked extensively with variants of Recurrent Neural Networks (RNNs), namely Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, to study their performance as well as their overarching behaviors alongside word embeddings. During this time, we also made 300-dimensional word vectors for more than half a million Nepali words and phrases available as open-access content on IEEE DataPort.
Evolving our Tweet Analysis to Include Sentiment
We wanted the deep learning model to be able to operate behind any Web application with minimal computing resources. Therefore, as an experiment, we deployed the model as a Web app on a Microsoft Azure virtual machine with 1 GB of memory and 1 vCPU. This time, however, the deep learning model was modified to also perform sentiment analysis on the tweets received from the real-time Twitter stream.
The deployed Web app monitors the real-time Twitter stream for specified keywords, then downloads and stores tweets containing those keywords in an SQLite database. For example, in the spring of 2020, we monitored keywords related to the global pandemic, including corona, covid-19, coronavirus, and variants of sars-cov-2.
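The keyword-filter-and-store step can be sketched as follows. This is a minimal illustration using Python's standard sqlite3 module with an in-memory database in place of the live Twitter stream; the table layout and function names are hypothetical, not taken from the actual app:

```python
import sqlite3

# Pandemic-related keywords monitored in spring 2020 (from the article)
KEYWORDS = {"corona", "covid-19", "coronavirus", "sars-cov-2"}

def matches_keywords(text):
    """Return True if the tweet text mentions any monitored keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in KEYWORDS)

def store_matching(conn, tweets):
    """Insert tweets containing a monitored keyword into the database."""
    cur = conn.cursor()
    for created_at, text in tweets:
        if matches_keywords(text):
            cur.execute(
                "INSERT INTO tweets (created_at, text) VALUES (?, ?)",
                (created_at, text),
            )
    conn.commit()

# Demo with an in-memory database instead of a live stream
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (created_at TEXT, text TEXT)")
sample = [
    ("2020-04-01 10:00", "Stay safe during the corona outbreak"),
    ("2020-04-01 10:01", "Nice weather today"),
    ("2020-04-01 10:02", "New COVID-19 guidelines announced"),
]
store_matching(conn, sample)
stored = conn.execute("SELECT text FROM tweets").fetchall()
print(len(stored))  # 2 of the 3 sample tweets match a keyword
```

In the deployed app, the sample list would be replaced by tweets arriving from the Twitter streaming API, with the same filter-then-insert loop.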
The SQLite database we created contains three columns: the date and time of the tweet, the actual tweet text, and a sentiment score. The LSTM deep network computes the sentiment score of each tweet on a scale from -1 to +1, with -1 being negative, 0 neutral, and +1 positive. Re-sampling the tweets lets the Web app plot the sentiment score as a continuous series over time. After testing various caching and re-sampling methods, the Web app could graph the sentiment scores of around 850,000 tweets in near real time.
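The re-sampling idea can be illustrated with a short sketch: per-tweet scores in [-1, +1] are averaged into fixed time buckets, so the graph reads as a smooth series rather than a scatter of individual tweets. The bucket size and function names here are illustrative assumptions, not details of the original system:

```python
from collections import defaultdict
from datetime import datetime

def resample_scores(rows, bucket_minutes=5):
    """Average per-tweet sentiment scores (-1 to +1) into fixed time
    buckets, yielding a continuous series suitable for plotting."""
    buckets = defaultdict(list)
    for timestamp, score in rows:
        dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        # Truncate the minute down to the start of its bucket
        minute = (dt.minute // bucket_minutes) * bucket_minutes
        key = dt.replace(minute=minute, second=0, microsecond=0)
        buckets[key].append(score)
    # One (bucket start, mean score) point per bucket, in time order
    return sorted((k, sum(v) / len(v)) for k, v in buckets.items())

rows = [
    ("2020-04-01 10:01", -0.8),
    ("2020-04-01 10:03", 0.4),
    ("2020-04-01 10:07", 0.9),
]
series = resample_scores(rows)
for when, avg in series:
    print(when, round(avg, 2))
```

The first two tweets fall into the 10:00 bucket and average to -0.2, while the third lands in the 10:05 bucket; plotting such bucket means is what keeps the graph responsive even at hundreds of thousands of tweets.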
Benefits of Hosting Data on the IEEE DataPort Platform
There are three primary reasons we selected the IEEE DataPort platform to host our data. First, the 2TB storage capacity provided to researchers for sharing datasets is impressive. Second, dataset users can directly contact dataset owners with specific inquiries. Lastly, dataset owners receive a digital object identifier (DOI), so anyone using a standard or open-access dataset can easily cite the original material in their research.
Additionally, since IEEE itself backs this platform, researchers will not think twice about using the materials available here. I primarily work on applied machine learning research; therefore, I prefer having a centralized platform like IEEE DataPort for easily accessing standard datasets. Finally, I firmly believe that the name "IEEE" encourages people to be genuine in sharing the datasets they've created, pushing the limits of research in the field of machine intelligence.
The data from the research conducted by Rabindra Lamsal won both the Spring 2019 and Spring 2020 IEEE DataPort Data Competitions.