Abstract

The presence of organisations in Online Social Networks (OSNs) has motivated malicious users to look for attack vectors that increase their chances of carrying out successful attacks and obtaining either private information or access to the organisation. This article hypothesises that organisations have specific languages that their members use in OSNs, and that malicious users could exploit these languages to carry out impersonation attacks. To test whether such organisation-specific languages exist, we propose two tasks: classifying tweets in isolation by their author’s organisation, and classifying users’ entire timelines by organisation. To support both tasks, we generated this dataset of over 15 million tweets from more than 5000 members of five different organisations.

Instructions:

This dataset contains tweets from members of five different organisations. It is used to analyse whether those members use specific languages that can be differentiated between organisations. For the dataset, we selected organisations that have many members and that come from diverse fields:

Organisation A is an NGO focused on human rights.
Organisation B is a multinational aerospace corporation.
Organisation C is a multinational professional services network.
Organisation D is a political party.
Organisation E is a multinational technology company.

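The tweets from organisations D and E are split across several files because, as single files, they exceeded the file size limit allowed by GitHub. To join the files of one of these organisations into a single file, use this command (shown here for organisation D; the same pattern applies to organisation E):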

cat organisationD_*.csv > organisationD.csv
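
Where a shell with `cat` is not available, the same join can be sketched in Python. This is a minimal sketch, not part of the dataset's tooling: the function name `join_parts` is hypothetical, and it assumes the part files follow the `organisationD_*.csv` naming shown in the command above. Like `cat`, it copies the parts byte for byte and does not inspect or deduplicate any header rows.

```python
import glob
import shutil


def join_parts(pattern: str, output_path: str) -> list[str]:
    """Concatenate every file matching `pattern` into `output_path`.

    Hypothetical helper: it copies raw bytes, as `cat` does, and makes no
    assumption about the CSV columns inside the files.
    """
    # Sorting the matches approximates the order in which a shell expands the glob.
    parts = sorted(p for p in glob.glob(pattern) if p != output_path)
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are never held fully in memory.
                shutil.copyfileobj(src, out)
    return parts


# Example call, assuming the part files are in the current directory:
# join_parts("organisationD_*.csv", "organisationD.csv")
```

The function returns the list of part files in the order they were written, which makes it easy to check that no part was missed before using the joined file.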

Dataset Files

Organisation_language_twitter.zip (335.54 MB)

Analysing the Existence of Organisation Specific Languages on Twitter: The Dataset
