CRAWDAD mit/reality

Citation Author(s):
Alex (Sandy)
Last updated:
Thu, 11/09/2006 - 08:00


Traces of communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year.

The authors have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year. This data represents over 350,000 hours (~40 years) of continuous data on human behavior. Such rich data on complex social systems have implications for a variety of fields.

collection environment :

Our study consists of one hundred Nokia 6600 smart phones pre-installed 
with several pieces of software we have developed as well as a version of 
the Context application from the University of Helsinki. 
Seventy-five users are either students or faculty in the MIT Media Laboratory, 
while the remaining twenty-five are incoming students at the MIT Sloan business 
school adjacent to the laboratory. Of the seventy-five users at the lab, 
twenty are incoming master's students and five are incoming MIT freshmen.

network configuration :

We exploit the fact that modern phones use both a short-range RF network 
(e.g., Bluetooth) and a long-range RF network (e.g., GSM), and that 
the two networks can augment each other for location and activity inference.
We logged cell tower IDs to determine approximate location and, at the same
time, logged the Bluetooth devices in proximity.
Bluetooth is a wireless protocol in the 2.40-2.48 GHz range, developed 
by Ericsson in 1994 and released in 1998 as a serial-cable replacement 
to connect different devices.

data collection methodology :

The information we are collecting includes call logs, Bluetooth devices in proximity, 
cell tower IDs, application usage, and phone status (such as charging and idle), 
which comes primarily from the Context application. The study will generate 
data collected from one hundred human subjects over the course of nine months and 
represent approximately 500,000 hours of data on users' location, communication, 
and device usage behavior.



Traceset of communication, proximity, location, and activity information.

  • description: The authors have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year. This data represents over 350,000 hours (~40 years) of continuous data on human behavior.
  • measurement purpose: Social Network Analysis, Human Behavior Modeling
  • methodology: Every Bluetooth device is capable of device discovery, which allows it to collect information on other Bluetooth devices within 5-10 meters. This information includes the Bluetooth MAC address (BTID), device name, and device type. The BTID is a hex number unique to the particular device. The device name can be set at the user's discretion; e.g., Tony's Nokia. Finally, the device type is a set of three integers that correspond to the kind of device discovered; e.g., Nokia mobile phone or IBM laptop. To log BTIDs we designed a software application, BlueAware, that runs passively in the background on MIDP2-enabled mobile phones. Bluetooth was primarily designed to enable wireless headsets or laptops to connect to phones, but as a byproduct, devices are becoming aware of other Bluetooth devices carried by people nearby. Our application records and timestamps the BTIDs encountered in a proximity log and makes them available to other applications. BlueAware starts automatically in the background when the phone is turned on, making it essentially invisible to the user. A second system, Bluedar, was developed to be placed in a social setting and continuously scan for visible devices, wirelessly transmitting detected BTIDs to a server over an 802.11b network. The heart of the device is a Bluetooth beacon designed by Mat Laibowitz incorporating a class 2 Bluetooth chipset that can be controlled by an XPort web server. We integrated this beacon with an 802.11b wireless bridge and packaged them in an unobtrusive box. An application was written to continuously telnet into multiple Bluedar systems, repeatedly scan for Bluetooth devices, and transmit the discovered proximate BTIDs to our server. Because the Bluetooth chipset is a class 2 device, it is able to detect any visible Bluetooth device within a working range of up to twenty-five meters.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware
  • version: 20050701
  • change: the initial version
  • release date: 2005-06-01
  • date/time of measurement start: 2004-07-26
  • date/time of measurement end: 2005-05-05
  • limitation: 1. Continually scanning and logging BTIDs can expend an older mobile phone battery in about 18 hours. While continuous scans provide a rich depiction of a user's dynamic environment, most individuals expect phones to have standby times exceeding 48 hours. Therefore BlueAware was modified to only scan the environment once every five minutes, providing at least 36 hours of standby time. 2. While the custom logging application on the phone crashes occasionally (approximately once every week), these crashes fortunately do not result in significant data loss. An additional small application was written to start on boot and continually review the running processes on the phone, verifying that our logging application is always running. Should there be a time where this is not the case, the application is immediately restarted. This functionality also ensures that logging begins immediately once the phone is turned on. However, while this logging application is now fairly robust and can be assumed to be running anytime the phone is on, the dataset generated is certainly not without noise. 3. By scanning only periodically every five minutes, shorter proximity events may be missed.
  • hole: 1. All the data from a phone are stored on a flash memory card, which has a finite number of read-write cycles. Initial versions of our application wrote over the same cells of the memory card. This led to failure of a new card after about a month of data collection, resulting in the complete loss of data. When the application was changed to store the incremental logs in RAM and subsequently write each complete log to the flash memory, our data corruption issues virtually vanished. However, ten cards were lost before this problem was identified, destroying portions of the data collected during the months of September and October for six Sloan students and four Media Lab students. 2. Another source of missing data is due to powered-off devices. On average we have logs accounting for approximately 85.3% of the time since the phones have been deployed. Less than 5% of this is due to data corruption, while the majority of the missing 14.7% is due to almost one fifth of the subjects turning off their phones at night. 3. There is a small probability (between 1-3% depending on the phone) that a proximate, visible device will not be discovered during a scan. Typically this is due to either a low level Symbian crash of an application called the "BTServer", or a lapse in the device discovery protocol. The BT server crashes and restarts approximately once every three days (at a 5 minute scanning interval) and accounts for a small fraction of the total error. However, to detect other subjects, we can leverage the redundancy implicit in the system. Because both of the subjects' phones are actually scanning, the probability of a simultaneous crash or device discovery error is less than 1 in 1000 scans.
  • error: 1. The ten-meter range of Bluetooth, along with the fact that it can penetrate some types of walls, means that people who are not physically proximate may incorrectly be logged as such. 2. An error comes from the phone being either explicitly turned off by the user or exhausting its battery. According to our collected survey data, users report exhausting the battery approximately 2.5 times each month. One fifth of our subjects manually turn the phone off on a regular basis during specific contexts such as classes, movies, and (most frequently) when sleeping. Immediately before the phone powers down, the event is timestamped and the most recent log is closed. A new log is created when the phone is restarted and again a timestamp is associated with the event. 3. A more critical source of error occurs when the phone is left on but not carried by the user. From surveys, we have found that 30% of our subjects claim to never forget their phones, while 40% report forgetting them about once each month, and the remaining 30% state that they forget the phone approximately once each week. Identifying the times when the phone is on but left at home or in the office presents a significant challenge when working with the dataset. To grapple with the problem, we have created a 'forgotten phone' classifier. Features include staying in the same location for an extended period of time, charging, and remaining idle through missed phone calls, text messages and alarms. When applied to a subsection of the dataset which had corresponding diary text labels, the classifier was able to identify the day where the phone was forgotten, but also mislabeled a day when the user stayed home sick. By ignoring both days, we risk throwing out data on outlying days, but have greater certainty that the phone is actually with the user. A significantly harder problem is to determine whether the user has temporarily moved beyond ten meters of his or her office without taking the phone. Empirically, this appears to happen with many subjects on a regular basis and there do not seem to be enough unique features of the event to accurately classify it. However, this phenomenon does not diminish the extremely strong correlation between detected proximity and self-reported interactions. Lastly, while frequency of proximity within the workplace can be useful, the most salient data comes from detecting a proximity event outside MIT, where temporarily forgetting the phone is less likely to repeatedly occur.
  • note: In return for the use of the Nokia 6600 phones, students have been asked to fill out web-based surveys regarding their social activities and the people they interact with throughout the day. Comparison of the logs with survey data has given us insight into our dataset's ability to accurately map social network dynamics. Through surveys of approximately forty senior students, we have validated that the reported frequency of (self-report) interaction is strongly correlated with the number of logged BTIDs (R=.78, p=.003), and that the dyadic self-report data has a similar correlation with the dyadic proximity data (R=.74, p~=.0001). Additionally, a subset of subjects kept detailed activity diaries over several months. Comparisons revealed no systematic errors with respect to proximity and location, except for omissions due to the phone being turned off.
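
The redundancy argument in the "hole" entry can be checked with a short calculation. The sketch below is illustrative only and is not part of the BlueAware software. It assumes that the two phones' discovery failures are independent (the source does not state this), and the `Sighting` tuple layout and both function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical record layout: (scanning person id, detected person id, scan slot index).
Sighting = Tuple[int, int, int]


def joint_miss_probability(p_a: float, p_b: float) -> float:
    """Chance that both phones miss each other in the same scan.

    Assumes the two misses are independent events (an assumption, not a
    statement from the dataset description).
    """
    return p_a * p_b


def symmetric_proximity(sightings: Iterable[Sighting]) -> Set[Tuple[FrozenSet[int], int]]:
    """Treat a pair as proximate in a slot if either phone logged the other."""
    events: Set[Tuple[FrozenSet[int], int]] = set()
    for scanner, seen, slot in sightings:
        # An unordered pair makes (1 sees 2) and (2 sees 1) the same event.
        events.add((frozenset((scanner, seen)), slot))
    return events


if __name__ == "__main__":
    # Worst case quoted in the text: each phone misses 3% of scans.
    print(joint_miss_probability(0.03, 0.03))
    print(symmetric_proximity([(1, 2, 0), (2, 1, 0), (1, 3, 1)]))
```

Under the independence assumption, the worst case of 3% per phone gives a joint miss rate of about 0.0009, which is consistent with the "less than 1 in 1000 scans" figure quoted above.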

mit/reality/blueaware Traces

• activityscpan: Activity span logs.
  • configuration: activity span logs
  • format: oid, endtime, starttime, person_oid
  • description: Activity span logs.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware/activityscpan
  • version: 20050701
  • change: The initial version
  • release date: 2005-07-01
  • date/time of measurement start: 2004-07-26
  • date/time of measurement end: 2005-05-05
• callspan: Call span logs.
  • configuration: call span logs
  • format: oid, endtime, starttime, person_oid, phonenumber_oid, callid, contact, description, direction, duration, number, status, remote
    "person_oid" refers to the person running the software on their phone, for which this call was logged. It is who this callspan is 'attached' to, and will always be attached to some person_oid.
    "direction" refers to the direction of the call from the perspective of this particular person/cellphone that recorded this callspan (the same as the person referred to by person_oid). Can be Incoming, Missed Call, or Outgoing.
    "phonenumber_oid" refers to the number 'on the other end' of the network, which may be a landline, a cell phone line, or even that phone network's voicemail. So in other words, person_oid and phonenumber_oid represent the two ends of the phone call, with the direction of the phone call represented in the direction field.
    If you want to utilize all 897921 callspan records, you might want to define these "calls" as between two phonenumbers, instead of as between two persons. So the call would exist between callspan.person_oid's phonenumber_oid, and the callspan.phonenumber_oid.
    In addition, if the callspan records a call between two people that were running the software and part of the study (they both are part of the study), then there are a few additional properties that will hold about the callspan:
      For some person src: src.oid = callspan.person_oid (for all calls)
      For some person dst: dst.phonenumber_oid = callspan.phonenumber_oid (only for in-network calls)
    There should also be a symmetric callspan going in the other direction. For some callspan Y:
      Y.person_oid = dst.oid
      Y.phonenumber_oid = src.phonenumber_oid
  • description: Call span logs.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware/callspan
  • version: 20050701
  • change: The initial version
  • release date: 2005-07-01
  • date/time of measurement start: 2004-08-03
  • date/time of measurement end: 2004-12-25
• cellspan: Cell span logs.
  • configuration: cell span logs
  • format: oid, endtime, starttime, person_oid, celltower_oid
  • description: Cell span logs.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware/cellspan
  • version: 20050701
  • change: The initial version
  • release date: 2005-07-01
  • date/time of measurement start: 2004-07-26
  • date/time of measurement end: 2005-05-05
• coverspan: Cover span logs.
  • configuration: cover span logs
  • format: oid, endtime, starttime, person_oid
  • description: Cover span logs.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware/coverspan
  • version: 20050701
  • change: The initial version
  • release date: 2005-07-01
  • date/time of measurement start: 2004-07-27
  • date/time of measurement end: 2005-05-05
• devicespan: Device span logs.
  • configuration: device span logs
  • format: oid, endtime, starttime, person_oid, device_oid
  • description: Device span logs.
  • last modified: 2006-10-17
  • dataname: mit/reality/blueaware/devicespan
  • version: 20050701
  • change: The initial version
  • release date: 2005-07-01
  • date/time of measurement start: 2004-07-26
  • date/time of measurement end: 2005-05-05
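
The callspan notes above describe how an in-network call should appear twice, once from each participant's phone. The following is a minimal sketch of that lookup, assuming a hypothetical person table that carries `oid` and `phonenumber_oid`; the class and function names are illustrative and are not part of any dataset tooling. It applies only the two identifier conditions given in the notes. Matching on start time is not described in the source and is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Person:
    oid: int               # study participant id
    phonenumber_oid: int   # id of that participant's own phone number


@dataclass(frozen=True)
class CallSpan:
    oid: int
    person_oid: int        # participant whose phone logged the call
    phonenumber_oid: int   # number on the other end of the call
    direction: str         # "Incoming", "Missed Call", or "Outgoing"


def find_symmetric(call: CallSpan, calls: List[CallSpan], persons: List[Person]) -> List[CallSpan]:
    """Return callspans Y with Y.person_oid = dst.oid and
    Y.phonenumber_oid = src.phonenumber_oid, or [] if the call is not in-network."""
    by_oid = {p.oid: p for p in persons}
    by_number = {p.phonenumber_oid: p for p in persons}

    src = by_oid[call.person_oid]              # holds for all calls
    dst = by_number.get(call.phonenumber_oid)  # only found for in-network calls
    if dst is None:
        return []
    return [
        y for y in calls
        if y.person_oid == dst.oid and y.phonenumber_oid == src.phonenumber_oid
    ]


if __name__ == "__main__":
    persons = [Person(1, 101), Person(2, 102)]
    calls = [
        CallSpan(10, 1, 102, "Outgoing"),  # person 1 calls person 2
        CallSpan(11, 2, 101, "Incoming"),  # the same call as logged by person 2
        CallSpan(12, 1, 999, "Outgoing"),  # call to a number outside the study
    ]
    print(find_symmetric(calls[0], calls, persons))
    print(find_symmetric(calls[2], calls, persons))
```

Because the two conditions identify only the pair of participants, the result can contain several callspans when the same two people called each other more than once; narrowing to a single counterpart would need an additional rule, such as comparing starttime values.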

The files in this directory are a CRAWDAD dataset hosted by IEEE DataPort. 

About CRAWDAD: the Community Resource for Archiving Wireless Data At Dartmouth is a data resource for the research community interested in wireless networks and mobile computing. 

CRAWDAD was founded at Dartmouth College in 2004, led by Tristan Henderson, David Kotz, and Chris McDonald. CRAWDAD datasets are hosted by IEEE DataPort as of November 2022. 

Note: Please use the Data in an ethical and responsible way with the aim of doing no harm to any person or entity for the benefit of society at large. Please respect the privacy of any human subjects whose wireless-network activity is captured by the Data and comply with all applicable laws, including without limitation such applicable laws pertaining to the protection of personal information, security of data, and data breaches. Please do not apply, adapt or develop algorithms for the extraction of the true identity of users and other information of a personal nature, which might constitute personally identifiable information or protected health information under any such applicable laws. Do not publish or otherwise disclose to any other person or entity any information that constitutes personally identifiable information or protected health information under any such applicable laws derived from the Data through manual or automated techniques. 

Please acknowledge the source of the Data in any publications or presentations reporting use of this Data. 


Nathan Eagle, Alex (Sandy) Pentland, mit/reality, Date: 20050701


Dataset Files



File: mit-reality-readme.txt (1.58 KB)
