Synthetic Event Streams
As stated in the Process Mining Manifesto, comparing different Process Mining tools and techniques remains difficult. Any exploratory comparison starts with the datasets employed, which should represent well the different behaviours that data might exhibit. Although event logs are available, the majority of them were not created for online scenarios. Therefore, one of the challenges in Process Mining research is to provide reliable benchmark datasets that represent online settings. We contribute to this aim by proposing a benchmark dataset composed of 942 event streams with concept drift. The event streams explore different characteristics of an online scenario, such as drift types, perspectives, sizes and noise percentages.
This package contains 942 synthetic event streams that simulate concept drift in business processes. Each stream has only one drift. The streams vary in size, drift type, drift perspective and noise percentage. Each event in the stream contains four main attributes: case identification, event name, event start time and event completion time.
All event streams share a few common characteristics: (i) the inter-arrival time of cases is fixed at 20 minutes, i.e. every 20 minutes an event from a new case arrives in the stream; (ii) the time between events of the same case follows a normal distribution. For baseline behavior, the mean time was set to 30 minutes and the standard deviation to 3 minutes, while for drifted behavior the mean and standard deviation were 5 and 0.5 minutes, respectively; (iii) for time drifts, the process model stays the same throughout a single event stream, i.e. the drift happens only in the time perspective, which avoids introducing other factors; (iv) all drifts were created with 100, 500 and 1000 cases; (v) noise was introduced in the event stream for all the trace drifts. We chose to introduce noise in the form of anomalous cases. An anomaly consists of removing either the first or the last half of the trace. Different percentages of anomalous cases were applied (5%, 10%, 15% and 20%) in relation to the total stream size. Note that standard cases were swapped for anomalous ones, which preserves the event stream size.
We explored four different types of drifts to compose the dataset of event streams:
- Sudden drift: the first half of the stream is composed of the baseline model, and the second half is composed of the drifted model. The same idea applies to trace and time drifts (for time drifts the change is only in the time distribution and not in the actual model).
- Recurring drift: for stream sizes of 100 traces, cases follow the division 33-33-34. The initial and the last concepts are the baseline, and the inner one is the drifted behavior, i.e. the baseline behavior starts the stream, fades after 33 traces and reappears for the last 34 traces, indicating a recurring characteristic; the same applies to time drifts. For 500 and 1000 traces, the division is 167-167-166 and 330-330-340, respectively.
- Gradual drift: one concept slowly replaces another; 20% of the stream was dedicated to the transition between concepts.
- Incremental drift: for the trace perspective, an intermediate model between the baseline and the drifted model is required, since the process change is incremental. Only complex change patterns were used, because intermediate models can be created from them, whereas a simple change pattern is already the final form of the drift. 20% of the stream was dedicated to the intermediate behavior, so the division was 40-20-40 (baseline, intermediate model, incremental drift); the other sizes follow the same proportion. For incremental time drifts, all change patterns were used, since the incremental drift was applied to the time perspective regardless of the model. The transition state (20% of the stream) was subdivided into four parts, with the mean of the time distribution decreasing by 5 minutes from one part to the next, following the incremental change of time.
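The segment layouts, timing and noise scheme described above can be sketched in a few lines of Python. This is only an illustration, not the generator used to build the dataset. All function and constant names are hypothetical. The centred 40-20-40 split for gradual drifts is an assumption (the description states it only for incremental drifts), and the 25/20/15/10-minute schedule is one possible reading of the four-part incremental time transition.

```python
import random

# Time between events of the same case, in minutes (values from the description).
BASELINE_MEAN, BASELINE_STD = 30.0, 3.0
DRIFTED_MEAN, DRIFTED_STD = 5.0, 0.5

# Recurring splits exactly as listed in the description.
RECURRING_SPLITS = {100: (33, 33, 34), 500: (167, 167, 166), 1000: (330, 330, 340)}


def segment_lengths(drift_type, n_cases):
    """Return the number of cases in each consecutive segment of a stream."""
    if drift_type == "sudden":
        half = n_cases // 2
        return (half, n_cases - half)
    if drift_type == "recurring":
        return RECURRING_SPLITS[n_cases]
    if drift_type in ("gradual", "incremental"):
        # 20% transition in the middle; centring it is assumed for gradual drifts.
        middle = n_cases // 5
        first = (n_cases - middle) // 2
        return (first, middle, n_cases - first - middle)
    raise ValueError(f"unknown drift type: {drift_type}")


def incremental_time_means(parts=4, step=5.0):
    """Mean gap for each part of an incremental time transition (assumed schedule)."""
    return [BASELINE_MEAN - step * (i + 1) for i in range(parts)]


def sample_gap(mean, std, rng):
    """Draw one normally distributed gap between two events of a case."""
    return rng.gauss(mean, std)


def inject_noise(traces, noise_pct, rng):
    """Swap noise_pct percent of the traces for truncated (anomalous) copies."""
    noisy = [list(t) for t in traces]
    n_anomalous = round(len(noisy) * noise_pct / 100)
    for idx in rng.sample(range(len(noisy)), n_anomalous):
        half = len(noisy[idx]) // 2
        # Remove either the first or the last half of the trace.
        noisy[idx] = noisy[idx][half:] if rng.random() < 0.5 else noisy[idx][:half]
    return noisy
```

For example, segment_lengths("recurring", 500) returns (167, 167, 166) and incremental_time_means() returns [25.0, 20.0, 15.0, 10.0]. Because inject_noise replaces cases rather than appending them, the stream keeps its original number of cases.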
The streams are organized according to the following parameters:
- Drift types (A): gradual, incremental, recurring and sudden
- Drift perspectives (B): time and trace
- Noise percentage (C): 0, 5, 10, 15, 20
- Number of cases in the stream (D): 100, 500, 1000
- Change patterns (E): baseline, cb, cd, cf, cp, IOR, IRO, lp, OIR, pl, pm, re, RIO, ROI, rp, sw
File names follow the pattern [A]_[B]_noise[C]_[D]_[E], where the letters refer to the parameters listed above.
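When loading many streams, it can help to recover the parameters from the file name. The following is a minimal sketch that assumes names match the pattern above exactly. The function name parse_stream_name, the dictionary keys and the .csv extension in the example are hypothetical and not part of the dataset description.

```python
import re
from pathlib import PurePath

# One named group per parameter of the pattern [A]_[B]_noise[C]_[D]_[E].
NAME_PATTERN = re.compile(
    r"^(?P<drift_type>gradual|incremental|recurring|sudden)"
    r"_(?P<perspective>time|trace)"
    r"_noise(?P<noise>\d+)"
    r"_(?P<cases>\d+)"
    r"_(?P<change_pattern>\w+)$"
)


def parse_stream_name(filename):
    """Split a stream file name into its five parameters."""
    stem = PurePath(filename).stem  # drop the directory and any extension
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    info = match.groupdict()
    info["noise"] = int(info["noise"])   # noise percentage (C)
    info["cases"] = int(info["cases"])   # number of cases (D)
    return info


# Hypothetical file name, used only to illustrate the pattern.
print(parse_stream_name("sudden_trace_noise5_100_cb.csv"))
```

The resulting dictionary can be used to group or filter streams, for instance to select all sudden trace drifts with 1000 cases.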
An identical version of this dataset in the MXML format is available at: http://www.uel.br/grupo-pesquisa/remid/?page_id=145