Modern technologies have made the capture and sharing of digital video commonplace; the combination of modern smartphones, cloud storage, and social media platforms has enabled video to become a primary source of information for many people and institutions. As a result, it is important to be able to verify the authenticity and source of this information, including identifying the source camera model that captured it. While a variety of forensic techniques have been developed for digital images, less research has been conducted on the forensic analysis of videos. In part, this is due to a lack of standard digital video databases, which are necessary to develop and evaluate state-of-the-art video forensic algorithms. To address this need, in this paper we present the Video Authentication and Camera IDentification (Video-ACID) database, a large collection of videos specifically collected for the development of camera model identification algorithms. The Video-ACID database contains over 12,000 videos from 46 physical devices representing 36 unique camera models. Videos in this database are hand-collected in a diversity of real-world scenarios, are unedited, and have known and trusted provenance. In this paper, we describe the qualities, structure, and collection procedure of Video-ACID, which includes clearly marked videos for evaluating camera model identification algorithms. Finally, we provide baseline camera model identification results on these evaluation videos using a state-of-the-art deep-learning technique. The Video-ACID database is publicly available at misl.ece.drexel.edu/video-acid
The Video Authentication and Camera Identification Database (Video-ACID) contains 12,173 total videos from 46 devices, comprising 36 unique camera models. These cameras span a variety of device categories, manufacturers, and models. Specifically, our database contains videos from 19 different smartphones or tablets. We also include videos from 10 point-and-shoot digital cameras, 3 digital camcorders, 2 single-lens reflex cameras, and 2 action cameras. Our dataset contains videos from 18 different camera manufacturers, including Apple, Asus, Canon, Fujifilm, GoPro, Google, Huawei, JVC, Kodak, LG, Motorola, Nikon, Nokia, Olympus, Panasonic, Samsung, Sony, and Yi. For nine of the camera models in the Video-ACID database, videos were collected using two or more physical devices of the same make and model. Table 1 shows the make and model of each class, as well as some properties of the videos recorded using those cameras.

VOLUME 4, 2016. Hosler et al.: Video-ACID: a Database for the Study of Video Forensics

A. CAPTURE PROCEDURE

Videos were captured by hand by a team of researchers who had physical access to each device. These videos are unaltered, in the original format directly output by the camera. To ensure consistency and avoid biases across the Video-ACID database, the following guidelines were developed and used during data collection.

Device Settings: Cameras can often be configured to capture video with many different parameters, including frame size and frame rate. Videos in this dataset were captured with each camera operating at its highest quality setting, usually 1080p at 30 frames per second. Digital zoom can introduce distortions into multimedia content and the associated forensic traces, so during the data collection process all cameras were left at their default zoom level. Many of these camera models have multiple image sensors on a single device.
In these scenarios, videos were captured using the higher-quality rear-facing camera, as opposed to the front-facing "selfie" camera.

Duration: The videos captured for the Video-ACID dataset are each roughly five seconds or more in duration. This length was chosen in light of several constraints. First, videos must be long enough to exhibit forensically significant behavior, providing a lower bound of several GOP sequences in duration. Second, some forensic algorithms operate on individual frames, as opposed to an entire video; in light of this, we would like to maximize the number of frames available in each video. Third, data collection is an expensive process, and we would like to maximize the number of videos that can be recorded in a given time period. We found that videos of five seconds in duration fit all of these constraints.

Content: Content diversity is important for many forensic tasks. We collected videos from a variety of different scenes with each camera, including near and far-field focus, indoor and outdoor settings, varied lighting conditions, horizontal and vertical capture, and varied backgrounds such as greenery, urban sprawl, snowy landscapes, etc. All videos incorporate some sort of motion or change in scene content, lessening the redundancy of frames within a single video. This motion is typically in the form of panning or rotating the camera, or changing the distance of the camera from the scene. Figure 1 shows still frames from different videos in the Video-ACID database, demonstrating the range and variety of scene content in this database.

B. CAMERA CAPTURE PROPERTIES

Table 1 summarizes the collected videos and the capture parameters of each device. This table contains information about the capture and compression properties of videos recorded by each camera model, such as the resolution, codec profile, frame rate mode, and GOP structure.

Resolution: The resolution of a video corresponds to the width and height of the video in pixels.
Additionally, a suffix of 'I' or 'P' is added to indicate the scan type of the video. A video with an interlaced scan is displayed by updating every other row of a frame, reducing the amount of data that needs to be stored. A progressive video is displayed by updating every row of the display for each frame.

Codec Profile: Of the 36 camera models used to collect data, most encode captured videos according to the H.264 video coding standard. Associated with this standard are Profiles and Levels, indicating the complexity and speed required to decode a given video. However, two of these cameras encode video using the MJPEG codec, which does not have a profile or level. The profile of a video determines the complexity needed to decode that video, while the level indicates the speed required. For example, "Baseline" and "Constrained Baseline" videos use Context-Adaptive Variable-Length Coding (CAVLC), while "Main" and "High" profile videos use Context-based Adaptive Binary Arithmetic Coding (CABAC). Both entropy coding schemes are lossless; however, CABAC is much more computationally intensive to encode and decode than CAVLC. Notably, B-frames are not available when using the "Baseline" profiles, but are available when encoding "Main" and "High" profile video. While the Profile indicates a video stream's complexity in terms of the capability necessary to decode it, a video's Level indicates the bitrate necessary to decode the stream. For example, decoding a 1080p video at 30 FPS requires a decoder capable of Level 4 or above. Most modern flagship smartphones use the "High" profile, while older phones, cheaper phones, and point-and-shoot digital cameras are more likely to use the "Main" or "Baseline" profiles.

Frame Rate: The Video-ACID dataset contains a mix of "Variable" and "Fixed" frame rate video. In fixed frame rate videos, each frame is displayed for the same amount of time as every other. In variable frame rate videos, the timing between frames can change.
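To make these profile distinctions concrete, the following sketch encodes them in a small lookup table. This is an illustrative, hypothetical helper (`profile_properties` is not part of any standard library), and it follows the simplified description above; the H.264 standard itself also permits Main and High profile streams to use CAVLC.

```python
# Hypothetical lookup summarizing the H.264 profile distinctions
# described in the text: entropy coder and B-frame availability.
# Note: real H.264 Main/High streams may also use CAVLC; CABAC is
# listed here following the simplified description above.
PROFILE_PROPERTIES = {
    "Constrained Baseline": {"entropy_coding": "CAVLC", "b_frames": False},
    "Baseline":             {"entropy_coding": "CAVLC", "b_frames": False},
    "Main":                 {"entropy_coding": "CABAC", "b_frames": True},
    "High":                 {"entropy_coding": "CABAC", "b_frames": True},
}

def profile_properties(profile_name):
    """Return the entropy-coding mode and B-frame availability for a profile."""
    return PROFILE_PROPERTIES[profile_name]
```

A forensic analyst could use such a table as a quick consistency check, e.g. flagging a "Baseline" stream that nonetheless contains B-frames.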
For example, a camera may detect fast motion in a scene and increase the frame rate to better capture this motion.

GOP Structure: The length and sequence of a Group of Pictures (GOP) is not fixed by the codec. Instead, as long as a GOP starts with an I-frame, each encoder is allowed to determine its own sequence of P- and B-frames. A video's GOP sequence is usually parameterized by the length of the GOP (the number of frames between I-frames) and the maximum number of B-frames allowed between anchor frames. In Table 1, the N value is the number of frames between I-frames, and the M value is the maximum number of B-frames between P-frames. For MJPEG-encoded videos there are no predicted frames, so the distance between I-frames is 1. As seen in Table 1, many cameras use different M and N parameters. Across similar devices from the same manufacturer, however, these parameters are likely to be constant. For example, all the Samsung devices use M=1 and N=30, while Google's devices use M=1, N=29.

FIGURE 1. Sample frames from captured videos.

C. STRUCTURE

Within Video-ACID, we provide two datasets: a "Full" dataset and a "Duplicate Devices" dataset. The Full dataset contains all videos from all camera models. The Duplicate Devices dataset contains videos from camera models represented by multiple devices. We organize these videos in the following way:

Full Dataset: The Full dataset contains all videos from all devices in the Video-ACID database. We split this dataset into disjoint sets of training and evaluation videos. To do this, we randomly select 25 videos from each camera model to act as the evaluation set. In the case of multiple devices of the same make and model, these 25 videos are randomly split across the devices. The rest of the videos are left for training.
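As a rough illustration of the splitting procedure just described, the sketch below draws 25 evaluation videos for one camera model across all of its devices and keeps the remainder for training. `split_videos` is a hypothetical helper written for this illustration, not code distributed with the database.

```python
import random

def split_videos(videos_by_device, eval_per_model=25, seed=0):
    """Randomly split one camera model's videos into train/eval sets.

    videos_by_device maps a device id (e.g. 'DeviceA') to its list of
    video filenames. Evaluation videos are drawn at random across all
    devices of the model, mirroring the procedure described above.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    all_videos = [v for vids in videos_by_device.values() for v in vids]
    eval_set = set(rng.sample(all_videos, eval_per_model))
    train_set = [v for v in all_videos if v not in eval_set]
    return train_set, sorted(eval_set)
```

Because the evaluation videos are sampled from the pooled list, a model with two devices contributes roughly, but not exactly, half of its evaluation videos from each device, consistent with the random per-model split described above.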
Many existing camera model identification algorithms operate on patches of an image or video. In light of this, we selected an additional 25 videos from the Nikon Coolpix S3700, because its small frame size limits the number of unique non-overlapping patches that can be extracted. Table 2 shows the total number of training and evaluation videos for each camera model. Within the root directory of our "Full" dataset, we have a directory for training data and another for evaluation data. Within these directories, there is a subdirectory for each camera model. Camera models are identified by both a model number, from 0 to 35, and a name describing the make and model of the camera. Within each of these camera model directories, we separate videos by the device which captured them. For most models, this is just a single subdirectory labeled "DeviceA". When multiple devices of the same model were used to capture videos, there are multiple subdirectories: "DeviceA", "DeviceB", etc. The videos are named according to the following scheme: MXX_DY_T0000.mp4. MXX is the model number assigned to the camera. DY is the device identifier, e.g. DA for Device A. The prefix of the video number is either 'T' or 'E', indicating whether the video belongs to the training set or the evaluation set, respectively. Finally, a four-digit number is assigned to index videos captured by the same device. For example, M30_DA_E0010.mp4 refers to the 11th evaluation video captured by device A of a Samsung Galaxy S7. The full file path is then "eval/M30_Samsung_Galaxy_S7/DeviceA/M30_DA_E0010.mp4".

Duplicate Devices Dataset: The Duplicate Devices dataset contains only those camera models from which videos were captured using multiple devices. This dataset is useful for studying the device dependence of various forensic algorithms. Table 3 lists the nine camera models and the number of training and evaluation videos in this Duplicate Devices dataset.
From each of these camera models we select the A device and the B device. Videos from the A devices are divided into Train-A and Eval-A, where the Eval-A directory contains 25 videos from each model and the Train-A directory contains the rest. The videos from the B devices are divided the same way. One camera model, the Google Pixel 1, has videos from three devices; Device C of the Google Pixel 1 is excluded from the Duplicate Devices dataset. These directories are structured similarly to the "Full" set, with the model numbers prefixing the model names. Because the device is implied by the directory, the device subdirectories are omitted, and the videos sit directly beneath the camera model directory. For example, trainA/M03_Kodak_Ektra/M03_DA_0010.mp4 is the file path pointing to the 11th training video captured by the A device of the Kodak Ektra.
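Filenames in both datasets follow the schemes described above and can be parsed mechanically. The sketch below is a hypothetical helper (`parse_video_name` is not distributed with Video-ACID) that handles both the Full-dataset names, which carry a 'T' or 'E' marker before the index, and the Duplicate Devices names, which omit it:

```python
import re

# Pattern for the MXX_DY_T0000.mp4 scheme described above: model
# number, device letter, optional 'T' (training) / 'E' (evaluation)
# marker, and a four-digit video index.
FILENAME_RE = re.compile(
    r"M(?P<model>\d{2})_D(?P<device>[A-Z])_(?P<split>[TE])?(?P<index>\d{4})\.mp4"
)

def parse_video_name(name):
    """Parse a Video-ACID filename into its components.

    The 'split' field is None for Duplicate Devices filenames, where
    the train/eval split is implied by the directory instead.
    """
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a recognized Video-ACID filename: {name}")
    return {
        "model": int(m.group("model")),
        "device": m.group("device"),
        "split": {"T": "train", "E": "eval"}.get(m.group("split")),
        "index": int(m.group("index")),
    }
```

For example, the filename M30_DA_E0010.mp4 from the text parses to model 30, device A, evaluation split, index 10 (the 11th evaluation video, since indices start at 0).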