7200 .csv files, each containing a 10 kHz recording of a 1 ms, 100 Hz sound, captured at 1 cm intervals over a 20 cm x 60 cm localization area on a table. 3600 files (3 at each of the 1200 positions) were recorded without an obstacle between the loudspeaker and the microphone; the other 3600 RIR recordings are affected by an object (a book) placed in the path. The OOLA is initially trained offline in batch mode on the first instance of the RIR recordings without the book. It then learns online, in incremental mode, how the RIR is changed by the book.
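For orientation, a minimal Python sketch for reading one recording follows; the filename and the single-column CSV layout are assumptions, since the actual naming scheme is defined by the MATLAB scripts in the folders listed under Instructions.

import numpy as np

# Hypothetical filename; the actual naming scheme is defined by the MATLAB sources.
rir = np.loadtxt("rir_x01_y01_take1.csv", delimiter=",")

fs = 10_000                      # 10 kHz sampling rate (from the description above)
t = np.arange(rir.size) / fs     # time axis in seconds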

Instructions: 

folder 'load and preprocess offline data': MATLAB source code and raw/working offline (no additional obstacle) data files

folder 'lvq and kmeans test': MATLAB source code to test and compare the in-sample failure rate with and without LVQ

folder 'online data load and preprocess': MATLAB source code and raw/working online (additional obstacle) data files

folder 'OOL': MATLAB source code configurable for cases 1-4

folder 'OOL2': MATLAB source code for case 5

folder 'plots': plots and simulations


As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under the diverse challenging conditions in which trained models can be deployed. To achieve this goal, we introduced a large-scale (~1.72M frames) traffic sign detection video dataset (CURE-TSD), which is among the most comprehensive datasets with controlled synthetic challenging conditions.

Instructions: 

The name format of the video files is as follows: “sequenceType_sequenceNumber_challengeSourceType_challengeType_challengeLevel.mp4” (a parsing sketch follows the field list below)

  • sequenceType: 01 – Real data 02 – Unreal data

  • sequenceNumber: A number in the range [01 – 49]

  • challengeSourceType: 00 – No challenge source (which means no challenge) 01 – After Effects

  • challengeType: 00 – No challenge 01 – Decolorization 02 – Lens blur 03 – Codec error 04 – Darkening 05 – Dirty lens 06 – Exposure 07 – Gaussian blur 08 – Noise 09 – Rain 10 – Shadow 11 – Snow 12 – Haze

  • challengeLevel: A number in the range [01-05], where 01 is the least severe and 05 is the most severe challenge.
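A minimal Python sketch for decoding these names (the example filename is hypothetical; the fields are kept as zero-padded strings):

def parse_video_name(name):
    # "sequenceType_sequenceNumber_challengeSourceType_challengeType_challengeLevel.mp4"
    stem = name.rsplit(".", 1)[0]
    seq_type, seq_num, source, chal_type, level = stem.split("_")
    return {"sequenceType": seq_type, "sequenceNumber": seq_num,
            "challengeSourceType": source, "challengeType": chal_type,
            "challengeLevel": level}

print(parse_video_name("01_04_01_09_03.mp4"))  # hypothetical name: real data,
# sequence 04, After Effects source, rain challenge, severity 3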

Test Sequences

We split the video sequences into a 70% training set and a 30% test set. The sequence numbers corresponding to the test set are given below:

[01_04_x_x_x, 01_05_x_x_x, 01_06_x_x_x, 01_07_x_x_x, 01_08_x_x_x, 01_18_x_x_x, 01_19_x_x_x, 01_21_x_x_x, 01_24_x_x_x, 01_26_x_x_x, 01_31_x_x_x, 01_38_x_x_x, 01_39_x_x_x, 01_41_x_x_x, 01_47_x_x_x, 02_02_x_x_x, 02_04_x_x_x, 02_06_x_x_x, 02_09_x_x_x, 02_12_x_x_x, 02_13_x_x_x, 02_16_x_x_x, 02_17_x_x_x, 02_18_x_x_x, 02_20_x_x_x, 02_22_x_x_x, 02_28_x_x_x, 02_31_x_x_x, 02_32_x_x_x, 02_36_x_x_x]

The videos with all other sequence numbers are in the training set. Note that “x” above refers to the variations listed earlier. A sketch of the resulting filename-based split follows.
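Since the first two fields identify a sequence, the five-character prefix of each filename decides its set; a minimal sketch:

TEST_PREFIXES = {
    "01_04", "01_05", "01_06", "01_07", "01_08", "01_18", "01_19", "01_21",
    "01_24", "01_26", "01_31", "01_38", "01_39", "01_41", "01_47", "02_02",
    "02_04", "02_06", "02_09", "02_12", "02_13", "02_16", "02_17", "02_18",
    "02_20", "02_22", "02_28", "02_31", "02_32", "02_36",
}

def is_test_video(name):
    # The leading "sequenceType_sequenceNumber" prefix identifies the sequence.
    return name[:5] in TEST_PREFIXES

print(is_test_video("01_04_01_09_03.mp4"))  # True: sequence 01_04 is a test sequence
print(is_test_video("01_01_00_00_00.mp4"))  # False: sequence 01_01 is in the training set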

The name format of the annotation files is as follows: “sequenceType_sequenceNumber.txt“

Challenge source type, challenge type, and challenge level do not affect the annotations. Therefore, video sequences that start with the same sequence type and sequence number have the same annotations.

  • sequenceType: 01 – Real data 02 – Unreal data

  • sequenceNumber: A number in the range [01 – 49]

The format of each line in the annotation file (txt) is: “frameNumber_signType_llx_lly_lrx_lry_ulx_uly_urx_ury”. You can see a visual coordinate-system example on our GitHub page; a parsing sketch follows the field list below.

  • frameNumber: A number in the range [001-300]

  • signType: 01 – speed_limit 02 – goods_vehicles 03 – no_overtaking 04 – no_stopping 05 – no_parking 06 – stop 07 – bicycle 08 – hump 09 – no_left 10 – no_right 11 – priority_to 12 – no_entry 13 – yield 14 – parking
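A minimal sketch for reading one annotation line; judging by the field names, ll/lr/ul/ur presumably denote the lower-left, lower-right, upper-left, and upper-right corners of the sign's bounding box (the example line is hypothetical):

def parse_annotation_line(line):
    # "frameNumber_signType_llx_lly_lrx_lry_ulx_uly_urx_ury"
    fields = line.strip().split("_")
    corners = list(map(int, fields[2:]))
    return {"frameNumber": fields[0], "signType": fields[1],
            "ll": tuple(corners[0:2]), "lr": tuple(corners[2:4]),
            "ul": tuple(corners[4:6]), "ur": tuple(corners[6:8])}

print(parse_annotation_line("001_06_10_60_40_60_10_20_40_20"))  # hypothetical stop-sign box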


As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under the diverse challenging conditions in which trained models can be deployed.

Instructions: 

The name format of the provided images is as follows: "sequenceType_signType_challengeType_challengeLevel_Index.bmp" (a labeling sketch follows the field list below)

  • sequenceType: 01 - Real data 02 - Unreal data

  • signType: 01 - speed_limit 02 - goods_vehicles 03 - no_overtaking 04 - no_stopping 05 - no_parking 06 - stop 07 - bicycle 08 - hump 09 - no_left 10 - no_right 11 - priority_to 12 - no_entry 13 - yield 14 - parking

  • challengeType: 00 - No challenge 01 - Decolorization 02 - Lens blur 03 - Codec error 04 - Darkening 05 - Dirty lens 06 - Exposure 07 - Gaussian blur 08 - Noise 09 - Rain 10 - Shadow 11 - Snow 12 - Haze

  • challengeLevel: A number in the range [01-05], where 01 is the least severe and 05 is the most severe challenge.

  • Index: A number that distinguishes different instances of traffic signs under the same conditions.
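Because the label is embedded in each name, (path, label) pairs for a classifier can be built directly from a directory listing. A minimal sketch, where the directory layout is an assumption:

import glob
import os

def labeled_images(root):
    # "sequenceType_signType_challengeType_challengeLevel_Index.bmp":
    # the second field is the traffic-sign class (01-14).
    pairs = []
    for path in glob.glob(os.path.join(root, "*.bmp")):
        sign_type = os.path.basename(path).split("_")[1]
        pairs.append((path, int(sign_type)))
    return pairs

pairs = labeled_images("images")  # hypothetical directory holding the .bmp files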


Network traffic analysis, i.e., the umbrella of procedures for distilling information from network traffic, is an enabler for highly valuable profiling information, besides being the workhorse for several key network management tasks. While its nature is currently being revolutionized by the rising share of traffic generated by mobile and hand-held devices, existing design solutions are evaluated mainly on private traffic traces, and only a few public datasets are available, which clearly limits repeatability and further advances on the topic.

Instructions: 

MIRAGE-2019 is a human-generated dataset for mobile traffic analysis with associated ground-truth, having the goal of advancing the state-of-the-art in mobile app traffic analysis.

MIRAGE-2019 covers the traffic generated by more than 280 experimenters using 40 mobile apps on 3 devices.

APP LIST reports the details on the apps contained in the two versions of the dataset.

If you are using the MIRAGE-2019 human-generated dataset for scientific papers, academic lectures, project reports, or technical documents, please help us increase its impact by citing the following reference:

Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, Valerio Persico and Antonio Pescapè, "MIRAGE: Mobile-app Traffic Capture and Ground-truth Creation", 4th IEEE International Conference on Computing, Communications and Security (ICCCS 2019), October 2019, Rome, Italy.



A benchmark dataset is constantly required for any characterization framework. To the best of our knowledge, no benchmark dataset for handwritten characters of the Telugu Aksharaalu script exists in the open domain until now. The Telugu script (Telugu: తెలుగు లిపి, romanized: Telugu lipi), an abugida from the Brahmic family of scripts, is used to write the Telugu language, a Dravidian language spoken in the Indian states of Andhra Pradesh and Telangana, as well as in a few other neighboring states. The Telugu script is also widely used for writing Sanskrit texts.


WiFi measurements dataset for WiFi fingerprint indoor localization, compiled on the first and ground floors of the Escuela Técnica Superior de Ingeniería Informática, in Seville, Spain. The facility covers approximately 24,000 m², although only accessible areas were surveyed.

Instructions: 

The training dataset consists of 7175 fingerprints collected from 489 different locations. Each fingerprint is stored as a JSON object corresponding to a unique scan with the following values:

  • _id: a unique identifier for the fingerprint, used to differentiate one fingerprint from another.

  • avgMagneticMagnitude: average magnetic magnitude during scanning, measured with the mobile phone sensor; this value is not used but is provided in case it is useful.

  • location: object with the coordinates of the real world in which the sample was captured.

    • floor: number indicating the floor in which the sample was captured.

    • lat: latitude as part of the coordinate at which the sample was captured.

    • lon: longitude as part of the coordinate at which the sample was captured.

  • timestamp: UNIX timestamp in which the sample was captured.

  • userId: identifier of the user who captured the sample; this value has been anonymized so that it is not directly identifiable but remains unique.

  • wifiDevices: list of APs appearing in the sample.

    • bssid: unique AP identifier; this value has been anonymized so that it is not directly identifiable but remains unique.

    • frequency: AP WiFi frequency.

    • level: AP WiFi signal strength (RSSI).

    • ssid: AP name; this value has been anonymized so that it is not directly identifiable but can be used to compare APs with the same name.

The training dataset was compiled by taking samples every 3 meters on average, with 15 samples per location. Each location was scanned for approximately 40 seconds of consecutive scans with a bq Aquaris E5 4G device running stock Android 6.0.1, without any movement during the process. The following is an example of a fingerprint (the list of WiFi devices has been shortened to two APs, as it was too long); a simple localization sketch using these fields follows the example.

{
"_id":"5cc81e8ac28d6d2533709425",
"avgMagneticMagnitude":40.615368,
"location":{
"floor":1,
"lat": 37.357746,
"lon": -5.9878354
},
"timestamp":1556618890,
"userId":"USER-0",
"wifiDevices":[
{
"bssid":"AP-BSSID-0",
"frequency":2457,
"level":-75,
"ssid":"AP-SSID-0"
},
...
{
"bssid":"AP-BSSID-23",
"frequency":2437,
"level":-64,
"ssid":"AP-SSID-6"
}
]
}
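These fields are sufficient for simple fingerprint matching. Below is a minimal k-nearest-neighbours localization sketch; it is one possible approach, not the method used to produce the dataset. It assumes the fingerprints have been gathered into a JSON list (the file name training.json is hypothetical) and substitutes a weak default RSSI for APs missing from a scan:

import json

def rssi_vector(fp):
    # Map each AP's BSSID to its signal strength (RSSI) for one fingerprint.
    return {ap["bssid"]: ap["level"] for ap in fp["wifiDevices"]}

def rssi_distance(a, b, missing=-100):
    # Euclidean distance over the union of APs seen in either scan.
    aps = set(a) | set(b)
    return sum((a.get(k, missing) - b.get(k, missing)) ** 2 for k in aps) ** 0.5

def knn_locate(training, sample, k=3):
    # Average the coordinates of the k training fingerprints closest in RSSI space.
    sv = rssi_vector(sample)
    nearest = sorted(training, key=lambda fp: rssi_distance(rssi_vector(fp), sv))[:k]
    lat = sum(fp["location"]["lat"] for fp in nearest) / k
    lon = sum(fp["location"]["lon"] for fp in nearest) / k
    return lat, lon

with open("training.json") as f:  # hypothetical file holding a JSON list of fingerprints
    training = json.load(f)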

The testing dataset consists of two tests with a total of 390 samples, taken at random locations but within areas covered by the training dataset, and with different devices. This dataset is grouped by test, and within each test are the captured samples, so both the individual error and the average error can be obtained, and the error can be recalculated to evaluate different algorithms. Each test is stored as a JSON object with the following values:

  • _id: a unique identifier for the test, used to differentiate one test from another.

  • userId: identifier of the user who performed the test; this value has been anonymized so that it is not directly identifiable but remains unique.

  • startTimestamp: UNIX timestamp that indicates when the test was started.

  • endTimestamp: UNIX timestamp that indicates when the test was ended.

  • samples: list of samples taken during testing.

    • timestamp: UNIX timestamp that indicates when the sample was collected.

    • real: object with the coordinates of the real world in which the sample was captured.

      • floor: number indicating the floor in which the sample was captured.

      • lat: latitude as part of the coordinate at which the sample was captured.

      • lon: longitude as part of the coordinate at which the sample was captured.

    • predicted: object with the predicted coordinates of the real world.

      • floor: number indicating the floor predicted.

      • lat: latitude as part of the predicted coordinate.

      • lon: longitude as part of the predicted coordinate.

    • wifiDevices: list of APs appearing in the sample.

      • bssid: unique AP identifier; this value has been anonymized so that it is not directly identifiable but remains unique.

      • frequency: AP WiFi frequency.

      • level: AP WiFi signal strength (RSSI).

      • ssid: AP name; this value has been anonymized so that it is not directly identifiable but can be used to compare APs with the same name.

    • error: approximate distance between the actual location and the predicted location.

  • error: average distance between the actual locations and the predicted locations.

The testing dataset was compiled two days after the training phase by taking samples at random locations roughly 3 meters apart on average, performing a single scan per location. The samples were taken with two devices, each corresponding to one of the tests: a bq Aquaris E5 4G running stock Android 6.0.1 and a Xiaomi Redmi 4X running Android 7.1.2 with MIUI 10 Global 9.5.16. Before each sample was taken, the device was held still for 5 seconds. The following is an example of a test entry (the list of samples has been shortened to one sample and the WiFi device list to two APs, as they were too long); a sketch for recomputing the error fields follows the example.

{
"_id":"5d13245e279a550b548e3bfe",
"userId":"USER-0",
"startTimestamp": 1557212799.6555429,
"endTimestamp": 1557222705.0710876,
"samples":[
{
"timestamp":1557212799.6552203,
"real":{
"floor":0,
"lat":37.358547,
"lon":-5.9867215
},
"predicted":{
"floor":0,
"lat":37.358547,
"lon":-5.9868493
},
"wifiDevices":[
{
"bssid":"AP-BSSID-156",
"frequency":2412,
"level":-80,
"ssid":"AP-SSID-5"
},
...
{
"bssid":"AP-BSSID-146",
"frequency":2462,
"level":-36,
"ssid":"AP-SSID-6"
}
],
"error":5.233510868645419
},
...
],
"error":3.975672826048607
}
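The per-sample and average errors can be recomputed, for instance when testing a different prediction algorithm. A minimal sketch, assuming the error is the great-circle (haversine) distance in meters between the real and predicted coordinates; the dataset only states that it is an approximate distance:

import json
import math

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two latitude/longitude pairs.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def test_errors(test):
    # Per-sample distances between real and predicted positions, plus their mean.
    errors = [haversine(s["real"]["lat"], s["real"]["lon"],
                        s["predicted"]["lat"], s["predicted"]["lon"])
              for s in test["samples"]]
    return errors, sum(errors) / len(errors)

with open("test.json") as f:  # hypothetical file holding one test object
    errors, mean_error = test_errors(json.load(f))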

In order to provide more information about the device used in each fingerprint of the dataset, the following relationship between users and devices is given:

USER-0: Xiaomi Redmi 4X (Android 7.1.2 with MIUI 10 Global 9.5.16)

USER-1: BQ Aquaris E5 4G (Android stock 6.0.1)


This FFT-75 dataset contains randomly sampled, potentially overlapping file fragments from 75 popular file types (see details below). To the best of our knowledge, it is the most diverse and balanced dataset of its kind available. The dataset is labeled with class IDs and is ready for training supervised machine learning models. We distinguish 6 scenarios of different granularity and provide variants with 512-byte and 4096-byte blocks. In each case, we sampled a balanced dataset and split the data as follows: 80% for training, 10% for testing, and 10% for validation.

Instructions: 

See documentation (readme.md).


Measurements collected from R1 for root-cause analysis of the network service states defined from quality and service design perspectives.


We introduce a benchmark of distributed algorithm execution over big data. The datasets comprise metrics on the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster, in terms of system-level resource indicators: CPU consumption, memory usage, operating system process load, network traffic, and I/O operations. The metrics were collected every five seconds for each algorithm on five different data volume scales, totaling 275 distinct datasets.

Instructions: 

The sections below explain the specification of the cluster of machines, the content of the data used to run the DML algorithms, the structure of the data metrics (logs) gathered, and the methods applied to collect the execution logs.

 

Datasets Structure

Each one of the 275 datasets corresponds to one execution of one DML algorithm at one specific volume (scale). To keep the data organized, the filenames follow the pattern <dml_algorithm>_<resource>_<scale>.csv, where:

  • <dml_algorithm> stands for one of the eleven executed techniques (see the section about DML algorithms below)
  • <resource> stands for disk, CPU, memory, processes, or network (see the section about metrics below)
  • <scale> stands for b1, b2, gigantic, huge, or large (see the section about data volumes below)

 

Thus, a file named als_cpu_huge.csv holds the CPU load metrics of the Alternating Least Squares algorithm applied to a problem at the "huge" scale. Likewise, a file named kpar_mem_B1.csv stores the memory usage metrics of K-means Parallel (dense k-means), and so on. A minimal parsing sketch is given below.
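A minimal parsing sketch, assuming neither the algorithm nor the resource name contains an underscore, and normalizing the scale to lower case (the examples above show both "huge" and "B1"):

import os

def parse_log_name(filename):
    # Split "<dml_algorithm>_<resource>_<scale>.csv" into its three components.
    stem = os.path.splitext(os.path.basename(filename))[0]
    algorithm, resource, scale = stem.split("_")
    return {"algorithm": algorithm, "resource": resource, "scale": scale.lower()}

print(parse_log_name("als_cpu_huge.csv"))
# {'algorithm': 'als', 'resource': 'cpu', 'scale': 'huge'}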

 

Computer Cluster Specification

The experiments were hosted on the Google Cloud Dataproc environment. The cluster was composed of eight high-power machines, a master node and seven worker nodes, totaling 128 cores, integrated into the same internal network and fully dedicated to the execution of the machine learning algorithms. All nodes have the same specification, described as follows.

  • operating system: Debian GNU/Linux 8.10, kernel 3.16.51-3+deb8u1
  • CPU: Intel Xeon @ 2.60 GHz 
  • architecture: x86_64, Little Endian
  • cores: 16; 2 threads per core
  • RAM memory: 106 GB
  • HDD storage: 500 GB
  • SSD storage: 375 GB

The total capacity of the cluster was 8 nodes, 128 cores, 3.4 terabytes of storage, and 848 GB of RAM, managed by the computing framework Apache Spark 2.2.0 with the cluster manager YARN (Apache Hadoop 2.8.2).

 

The Metrics

Five system indicators were analyzed: CPU load, network traffic, memory consumption, disk access, and process load, as follows. A loading sketch follows the field lists.

 

CPU (seven dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • system: O.S. work 
  • user: algorithm's work
  • iowait: input/output busy wait time
  • softirq: software interrupt request 
  • idle: idle (unused) time

 

Memory (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • buffer_cache: memory in cache
  • used (O.S. + algorithm): memory used by DML algorithms and O.S.
  • free: available memory  
  • map: number of map operations

 

Disk (eight dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • bytes_read: data read from storage (MB)
  • bytes_write: data written to the storage (MB)
  • io_read: number of I/O read operations
  • io_write: number of I/O write operations
  • time_spent_read: reading time
  • time_spent_write: writing time

 

Network (five dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • recv_packets: number of received packets 
  • send_bytes: number of sent bytes
  • send_packets: number of sent packets

 

Processes (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • load5 **: average load in the last 5 minutes 
  • load10: average load in the last 10 minutes
  • load15: average load in the last 15 minutes
  • proc: number of processes

** The amount of work performed by the system. An idle computer has load 0. Each process increments the load by 1.
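A minimal loading sketch with pandas, assuming each CSV carries the dimensions listed above as its column names (the filename follows the real pattern; the column names are an assumption):

import pandas as pd

# One execution log: CPU metrics of ALS at the "huge" scale.
df = pd.read_csv("als_cpu_huge.csv")

# Average share of CPU time spent in the algorithm itself, per worker node,
# assuming columns are named after the dimensions listed above.
print(df.groupby("node")["user"].mean())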

 

The DML Algorithms

As stated above, the metrics about consumption were gathered from the execution of eleven machine learning algorithms. They are:

1) Alternating Least Squares (ALS) - collaborative filtering technique introduced by the Tapestry developers [1]. It analyzes relationships between users and items to identify possible new associations, using neighborhood-based or matrix-factorization methods [2]. The Spark implementation is based on the strategy described in [3].

2) Naive Bayes (NB) - a widely used technique for text classification. It computes the conditional probability distribution of the features and applies Bayes' theorem [4] for prediction. Spark implements the Multinomial Naive Bayes (used in the experiments) and Bernoulli Naive Bayes approaches [5].

3) Gradient Boosted Trees (GBT) - used for classification or regression, the latter applied in the tests. Formed by ensembles that train a sequence of decision trees. The Spark implementation is based on [6,7].

4) Dense k-means (k||) - clustering technique that separates datasets into k partitions. Spark implements a parallel variant of the k-means++ algorithm [8], called k-means|| [9] (k-means parallel, or dense k-means).

5) Latent Dirichlet Allocation (LDA) - clustering technique that applies a probabilistic model over discrete data collections, such as a corpus of documents. Used in document modeling, text classification, and collaborative filtering [10].

6) Linear Regression (LinR) - used for regression. Its ancestral form was the least squares method, published by Legendre (1805) and Gauss (1809). The simple linear regression case deals with an equation whose right-hand side has an intercept and one explanatory variable with a slope coefficient [11].

7) Logistic Regression (LogR) - the main reason for choosing the logistic function for the analysis of dichotomous output variables is its flexibility: it is easy to use and allows judicious interpretation [12]. The experiment uses the Spark implementation based on limited-memory BFGS [13] for classification.

8) Principal Component Analysis (PCA) - dimensionality reduction technique that transforms a complex dataset into one of smaller dimension, revealing hidden structures in the original dataset [14,15].

9) Random Forests (RF) - a nonparametric statistical method used for regression and classification, based on decision trees and bootstrap [16]. An extensive technical discussion about RFs for Big Data is available in [17].

10) Singular Value Decomposition (SVD) - dimensionality reduction technique; the Spark implementation relies on ARPACK [18] and on the matrix optimization techniques for clusters described in [19].

11) Support Vector Machine (SVM) - a model for regression and classification (the latter used in this paper), based on constructing hyperplanes in high-dimensional spaces [20]. The Spark implementation uses Linear SVM for binary classification.

 

The Data and the Volume (Scales) Used in the DML Algorithms

We used synthetic data (generated with the Intel HiBench framework) due to (i) barriers to fitting the data requirements of each algorithm, (ii) the need to adjust the volume of data at each scale, and (iii) the cost of transferring the data from the source to the internal network of the cluster. The scales, in ascending order of size, are: large (L), huge (H), gigantic (G), big data 1 (B1), and big data 2 (B2). Scales B1/B2 have the same volume with no content variation, while scales G, H, and L vary in content. The size of the data for each experiment is detailed below, in megabytes, in the order L, H, G, B1, B2 (B2 has the same size as B1):

  • NB - 359, 1792, 3594, 71885, 71885
  • LogR - 7629, 22886, 38144, 53402, 53402
  • SVM - 19077, 109875, 149544, 171674, 171674
  • RF - 8, 15258, 22886, 33567, 33567
  • LDA - 245, 653, 1976, 4260, 4260
  • k|| - 3830, 19149, 38308, 229816, 229816
  • GBT - 15, 31, 61, 92, 92
  • LinR - 45783, 114463, 305203, 762993
  • ALS - 115, 688, 1372, 1720, 1720
  • PCA - 31, 183, 229, 257, 257
  • SVD - 61, 191, 275, 374, 374

All metrics in the benchmark datasets are gathered from the execution of the above-specified DML algorithms in each one of these scales.

 

References

[1] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM. 1992;35(12):61–70.

[2] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8).

[3] Hu Y, Koren Y, Volinsky C. Collaborative filtering for implicit feedback datasets. In: Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on; IEEE; 2008. p. 263–272.

[4] Vapnik VN, Vapnik V. Statistical learning theory. Vol. 1. Wiley New York; 1998.

[5] Sanderson M, Christopher D, Manning H, et al. Introduction to information retrieval. Natural Language Engineering. 2010;16(1):100.

[6] Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001:1189–1232.

[7] Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis. 2002;38(4):367–378.

[8] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms; Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.

[9] Bahmani B, Moseley B, Vattani A, et al. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.

[10] Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of machine Learning research. 2003;3(Jan):993–1022.

[11] Yan X, Su X. Linear regression analysis: theory and computing. World Scientific; 2009.

[12] Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol. 398. John Wiley & Sons; 2013.

[13] Spark. Linear Methods - RDD-based API - Logistic Regression; 2017. Available at: https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#logistic-r....

[14] Shlens J. A Tutorial on Principal Component Analysis. Epidemiology. 2005;2(c):223–228.

[15] Jolliffe IT. Principal Component Analysis, Second Edition. Encyclopedia of Statistics in Behavioral Science. 2002;30(3):487.

[16] Breiman L. Random forests. Machine learning. 2001;45(1):5–32.

[17] Genuer R, Poggi JM, Tuleau-Malot C, et al. Random forests for big data. Big Data Research. 2017;9:28–46.

[18] Lehoucq RB, Sorensen DC, Yang C. ARPACK users’ guide: solution of large-scale eigen-value problems with implicitly restarted Arnoldi methods. Vol. 6. Siam; 1998.

[19] Bosagh Zadeh R, Meng X, Ulanov A, et al. Matrix Computations and Optimization in Apache Spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16; New York, New York, USA. ACM Press; 2016. p. 31–38.

[20] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.


In an aging population, the demand for nurses to care for elders increases. Helping nurse workers make their work more efficient will help increase elders' quality of life, as the nurses can focus their efforts on care activities instead of other activities such as documentation.
Activity recognition can be used for this goal. If we can recognize what activity a nurse is engaged in, we can partially automate the documentation process to reduce the time spent on this task, and monitor care-plan compliance to ensure that all care activities have been carried out for each elder, among other uses.
