Method for detecting evolutionary timescales

Let us consider a series of time-stamped events recorded during some period of time. Our goal is to create time slices, i.e. sets of consecutive time intervals, that capture the significant evolutionary timescales of the dynamics of the system. There are some minimal conditions that the data must fulfil for this procedure to be useful: events must be recurrent in time, and the total recording period must be long enough to capture the evolution of the system.

In this framework, the detected timescales must have the following properties: i) Evolution is understood as changes in the event landscape, i.e., intervals depend only on the identities of the events within them, not on the rate of events in time. ii) Timescales should become shorter when the system evolves faster, and longer when it evolves more slowly. iii) Particular times at which all events suddenly change (critical times) should have an interval boundary placed exactly at that point; this is a limiting case of (ii).

To obtain timescales satisfying the above properties, our method iteratively constructs the time intervals [tn, tn+1) according to a similarity measure between each pair of consecutive slices Sn−1 and Sn, where Sn is the set of events between times tn and tn+1. By looking at the similarity measure as a function of time, one can detect multiple timescales. The algorithm is designed to detect the longest possible timescales in a parameter-free fashion. For a discussion of timescale detection and why this is a good choice for real data, see Supplementary Information (SI), Section S1.

For the sake of clarity, assume that our method is in progress and we have a previous slice Sn−1. We want to find the next slice Sn that is defined by the interval [tn, tn + Δtn). The increment Δtn is the objective variable, and is determined by locally optimizing the Jaccard index between Sn−1 and Sn:

$$ J(\Delta t_n) = \frac{|S_{n-1} \cap S_n(\Delta t_n)|}{|S_{n-1} \cup S_n(\Delta t_n)|}. \qquad (1) $$
This is shown in Fig. 1(a). The Jaccard index15, J(Δtn), estimates the similarity of the two sets as the fraction of their common items (events) with respect to the total number of distinct items in them. In our case, the optimization is performed by increasing Δtn until a peak of J(Δtn) is found (see Materials and Methods). The Jaccard similarity can then be used to understand the underlying event dynamics. We stress that, in the case where the same event can occur multiple times, one could also consider weighted similarity measures that account for the number of occurrences of each event; these are discussed in the SI. In the examples discussed below, the results are well aligned with expectations even though multiple occurrences of the same event are disregarded.
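As a minimal illustration of this optimization, the following Python sketch (not the authors' implementation) represents the event stream as (event_id, time) tuples and grows Δtn in hypothetical unit steps until J passes its first peak; the step size, the rescanning of the full event list, and the tie-breaking in favor of longer intervals are simplifying assumptions of ours.

```python
def jaccard(a, b):
    """Jaccard index of two sets of event IDs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def events_in(events, t_start, t_end):
    """Set of event IDs occurring in [t_start, t_end)."""
    return {e for e, t in events if t_start <= t < t_end}

def next_interval(events, prev_slice, t_n, t_max, step=1.0):
    """Grow the candidate interval [t_n, t_n + dt) in increments of `step`
    until the Jaccard similarity with the previous slice passes its first
    peak; return (end time, event set, similarity) of the chosen interval."""
    dt = step
    best_dt = dt
    best_set = events_in(events, t_n, t_n + dt)
    best_j = jaccard(prev_slice, best_set)
    while t_n + dt + step <= t_max:
        dt += step
        cand = events_in(events, t_n, t_n + dt)
        j = jaccard(prev_slice, cand)
        if j < best_j:          # past the peak: keep the maximum found so far
            break
        best_dt, best_set, best_j = dt, cand, j
    return t_n + best_dt, best_set, best_j
```

A streaming implementation would avoid rescanning the whole event list for each candidate Δtn; the sketch favors readability over efficiency.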

Figure 1: Schematic representation of the application of our method to synthetic data.

In each plot, the horizontal axis represents time in arbitrary units, and the vertical axis the event ID. A dot marks each point where a given event occurs at a given time. The gray and white segments show the detected intervals. (a) The Jaccard score J is computed for the sets of events in adjacent intervals; intervals are adjusted to maximize J. (b) Interval identification in a trivial case: each region contains a characteristic set of events, all distinct from those of the other regions. (c) The total event rate (see the second interval) does not affect interval size; only the characteristic set of events does. (d) Example of varying interval sizes: in the first half of the time span, events change more slowly than in the second half, and thus intervals are longer. Because this is data with clear transitions, the first intervals in (a–c) are merged.

This iterative process requires an initial slice, which is computed as follows. Beginning at the initial time t0, we construct two intervals [t0, t0 + Δt) and [t0 + Δt, t0 + 2Δt). Repeating the process above, we find the maximum of J(Δt) and use it to set the first two intervals at once. This initialization preserves the meaning of our slices and does not introduce any additional hypothesis. After computing both slices, the second one is discarded so that all subsequent intervals have the same semantic meaning, and we start the iterative process described above. Note that the method is parameter-free. The Jaccard index is used in a similar way in refs 13, 14 to assess the similarity of consecutive snapshots, which are eventually grouped by means of hierarchical clustering to build relevant intervals, and in ref. 16 to study the effect of time-window length in social network analysis.
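Under the same assumptions, and reusing jaccard and events_in from the sketch above, the initialization step could look as follows; scanning Δt in unit steps is again our own simplifying choice.

```python
def first_slice(events, t0, t_max, step=1.0):
    """Bootstrap sketch: grow dt until the Jaccard index between the two
    provisional intervals [t0, t0 + dt) and [t0 + dt, t0 + 2*dt) passes its
    first peak; keep only the first interval and discard the second."""
    dt = step
    best_dt = dt
    best_j = jaccard(events_in(events, t0, t0 + dt),
                     events_in(events, t0 + dt, t0 + 2 * dt))
    while t0 + 2 * (dt + step) <= t_max:
        dt += step
        j = jaccard(events_in(events, t0, t0 + dt),
                    events_in(events, t0 + dt, t0 + 2 * dt))
        if j < best_j:          # past the peak
            break
        best_dt, best_j = dt, j
    return t0 + best_dt, events_in(events, t0, t0 + best_dt)
```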

The logic behind the method is easy to understand. Consider first a smoothly evolving system. We take the previous interval Sn−1 as fixed and try to find the next interval Sn(Δt). On the one hand, if Δt is small, increasing Δt will most likely add events already present in Sn−1; the numerator (intersection term) of Eq. (1) then increases more than the denominator (union term), and J(Δt) grows. On the other hand, if Δt is very large, increasing Δt will tend to add new events not seen in the previous interval; the denominator (union term) of Eq. (1) then grows faster than the numerator (intersection term), and J(Δt) decreases. Together, these two tendencies imply the existence of an intermediate value of Δt that balances them, which is a solution satisfying our basic requirements for sliced data. If the smooth evolution of the temporal data is interrupted by an abrupt change (a critical time), the previous reasoning still holds, and a slice boundary is placed directly at the snapshot where the anomalous behavior begins.

Our method has several technical advantages. First and foremost, it is fast and scalable. The method is O(N) in the total data size N (e.g. the number of distinct events) when intervals are small, and efficient hash-table set implementations also allow linear scaling with interval size, so extremely large datasets can be processed. The data processing is done online: only the events from the previous interval need to be kept in memory to compute the next interval. There are no input parameters or a priori assumptions to be made before the method can begin. The input is simple, consisting of (event_id, time) tuples. The method finds an intrinsic initial scale of the data, in which each interval represents roughly the same amount of change.
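A driver loop tying the two sketches above together might look like this; only the previous slice's event set is retained between iterations, mirroring the online character described here (again a sketch under our earlier assumptions, not the released code).

```python
def slice_stream(events, t0, t_max, step=1.0):
    """Full slicing sketch: bootstrap the first interval, then iterate,
    keeping only the previous slice's set of event IDs in memory.
    Returns the interval boundaries and the inter-interval similarities."""
    t, prev = first_slice(events, t0, t_max, step)
    boundaries = [t0, t]
    similarities = []
    while t < t_max:
        t_next, cur, j = next_interval(events, prev, t, t_max, step)
        boundaries.append(t_next)
        similarities.append(j)
        t, prev = t_next, cur       # only the previous slice is kept
    return boundaries, similarities
```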

Validation on synthetic data

First, we generated synthetic temporal data to be analyzed with our method. The data display several common configurations of the temporal evolution of a toy system. Figure 1 shows some examples and the timescales that our method detects. In Fig. 1(a) we sketch the details of the method. Figure 1(b) shows three clearly distinguishable sets of events, which are correctly sliced into their respective timescales; within each interval, events repeat very often, but between intervals there are no repetitions. Figure 1(c) shows that the rate of short-term repetitions does not matter from our evolutionary point of view: the second interval has the same characteristic events at all times, so no evolutionary changes are observed. Figure 1(d) shows a continually evolving system. During the first half, the long-term rate of evolutionary change is slow (events are repeated for a longer time before dying out), and thus the intervals are longer; thereafter, the process occurs much faster, and thus the intervals are shorter.

We have also validated our method on more complex synthetic data designed to display non-trivial but controlled evolutionary timescales. We choose a dynamics in which there is a fixed number of events that, at any given time, may be active or inactive (short-term repetition). We impose that the volume of active events per time unit is, on average, constant. Additionally, to account for long-term evolution, we change the identities of the events at a (periodically) varying rate. Low rates of change, i.e., changing the identities of only a few events at any time, should result in long time intervals for the slices, while faster rates lead to larger variation in the identities of the events and should produce shorter slices.
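A generator in this spirit might look as follows. It is a sketch, not the exact benchmark described in Materials and Methods: the pool size, the fraction of events emitted per time step, and the sinusoidal shape of the turnover rate are illustrative choices of ours.

```python
import math
import random

def periodic_benchmark(t_max=1000, period=500, n_active=100,
                       base_rate=0.01, peak_rate=0.2, seed=0):
    """Generate (event_id, time) tuples: a fixed-size pool of active event
    identities emits a constant volume of events per time step, while the
    identities themselves are replaced at a periodically varying rate."""
    rng = random.Random(seed)
    active = list(range(n_active))
    next_id = n_active
    events = []
    for t in range(t_max):
        # turnover probability oscillates with the given period:
        # slowest near t = 0, period, 2*period, fastest near period/2
        phase = 0.5 * (1 - math.cos(2 * math.pi * t / period))
        p_replace = base_rate + (peak_rate - base_rate) * phase
        for i in range(n_active):
            if rng.random() < p_replace:    # retire an identity, mint a new one
                active[i] = next_id
                next_id += 1
        # constant volume of observed events per time step
        for e in rng.sample(active, k=n_active // 2):
            events.append((e, t))
    return events
```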

We test our method on this benchmark using a periodic evolutionary rate with period τ = 500 (see Materials and Methods). In Fig. 2, the method shows the desired behavior: it produces short intervals in regions where change is fastest (t ≈ 250, 750) and longer intervals where the variability of events is low, i.e. evolution is slow (t ≈ 0, 500, 1000). Most notably, the Jaccard similarity also reflects these dynamics: when change is slowest, the inter-interval similarity is high, and vice versa. These similarities could be used to perform some form of agglomerative clustering to produce super-intervals representing longer-term dynamics. We also show the Shannon entropy within each interval; our method produces intervals containing roughly the same amount of information.
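For reference, the per-interval entropy could be computed as below, assuming (as we do here, which may differ from the authors' exact definition) that it is the Shannon entropy of the empirical distribution of event IDs observed within the interval.

```python
import math
from collections import Counter

def interval_entropy(events, t_start, t_end):
    """Shannon entropy (in bits) of the empirical distribution of event IDs
    observed in [t_start, t_end)."""
    counts = Counter(e for e, t in events if t_start <= t < t_end)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```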

Figure 2: Artificial data with changing event turnover rate with period τ = 500.

(Top) Representation of events. Events have been ordered to show the envelope of the dynamics. Events change at the slowest rate at t = 500 and 1000, and there are critical points at t = 1200 and 1400. For clarity, only 10% of events are shown. (Middle) Detected intervals (gray/white regions) and inter-interval similarity (green lines) for the slicing algorithm. The interval length reacts to the rate of change of events, and the critical points correspond to interval boundaries and to large drops in similarity; the similarity also reflects the rate of change of events at other times. (Bottom) Event rate (dark blue) and interval entropy (light yellow) over time. The overall event rate is constant and cannot be used to detect dynamical properties, while each interval contains about the same amount of information, 7.3 to 7.7 bits.

At times t = 1200 and t = 1400, we add two “critical points”, at which the entire set of active events changes instantly; this is the limiting case of evolutionary change. As one would expect, slice boundaries are placed exactly at these points. Furthermore, the drastic drop in similarity indicates that a singular change has happened there and that the dynamics across this boundary are uncorrelated.

Analysis of real temporal data

In this section we apply our approach to real-world data, where any true signal is obscured by noise and we have no information about the underlying dynamics.

We start with the Enron email dataset (see Materials and Methods). This dataset was made public during the legal investigation concerning the Enron Corporation, which ended in the bankruptcy and collapse of the corporation17. Here, each event is an email communication sent or received by any of the senior managers who were subsequently investigated. During the course of the investigation, there were major structural changes in the company (see SI). Figure 3 shows the results of applying the method to the Enron data. Generally, the intervals are on the order of several weeks, which appears to be a reasonable time frame for changes in email communication patterns. As expected, the detected intervals do not simply follow changes in the rate of events, but rather changes in event composition. When comparing interval boundaries with some of the major events of the scandal (exogenous information, shown as red vertical lines), we see that the boundaries often align with these events; note that there is no a priori reason for external events to always result in abrupt changes in communication patterns. Additionally, we present the results of applying the change-point detection method of Peel and Clauset (cpdetect)8. Unlike cpdetect, our method reveals the underlying basic evolutionary timescale (weeks) rather than only major change points. On the other hand, cpdetect requires a basic scale as an input before it can be applied (in this case, one week). Performing a randomization hypothesis test, we find evidence (at significance p = 0.0194) that cpdetect change points correspond to lower-than-random similarity values within our method (see SI).

Figure 3: Comparison of our method and the change point detection method of ref. 8 on the Enron email data.

Thick red lines indicate events from the company's timeline. (Top) Results of our method on the network, in the same format as Fig. 2. (Middle) Results of the cpdetect method on the network; dark vertical lines mark change points. (Bottom) Event density of the network. Both methods align with the company's events to some degree. Our method provides more fine-grained intervals, not just the major change points, and can align more accurately with the real events since it does not take a predefined window size as an input.

Next, we focus on the MIT Reality Mining personal mobility dataset. In this experiment, about 100 subjects, most of them affiliated with the MIT Media Lab, carried cellphones that periodically scanned for nearby devices using the Bluetooth personal area network18. Here, events correspond to pairs of devices being within Bluetooth range (i.e. physically close). Note that we do not limit the analysis to interactions between the 100 subjects only, but process all ≈5 × 10⁴ unique Bluetooth pairs seen (including detected devices not part of the experiment), a total of 1.8 × 10⁶ data points. The major events (see SI) correspond to particular days in the MIT academic calendar or internal Media Lab events. In Fig. 4 we plot the time slices generated by our method. Again, we see that the basic evolutionary timescale is about one week. We see a correlation with some of the real events, such as a long interval during the winter holidays (late December) and Spring Break (late March). At the beginning (August) and end (June) of the recording there is minimal data volume, comparable to noise, which prevents the method from finding reliable timescales.
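A minimal sketch of this event mapping, with a hypothetical input format of (timestamp, scanning_device, detected_device) records, could be:

```python
def bluetooth_to_events(scans):
    """Map Bluetooth scan records to (event_id, time) tuples, where the
    event identity is the (from, to) pair of scanning and detected device.
    The input format (timestamp, from_id, to_id) is a hypothetical choice."""
    return [((src, dst), t) for t, src, dst in scans]
```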

Figure 4: Slicing of the MIT reality mining dataset.

In this dataset, an event is the (from, to) pair of a Bluetooth scan, representing two devices located in proximity of each other. The times of major events are known, such as important conferences or the starts/ends of semesters (see SI). The slicing finds such major events, including the beginning of the 2005 semester and the beginning of classes in February 2005.

In our last example, we look at the social media response to the semifinals of the UEFA Champions League 2014–2015, a major European soccer tournament. We collected data from Twitter by searching for tweets containing the hashtag #UCL, the most widely used hashtag referring to the competition. Since all captured tweets contain the hashtag #UCL, we define events using the other hashtags that a tweet may contain.

If a tweet contains more than one other hashtag, each of them is considered a separate event. For example, if at time t there is a tweet containing #UCL, #Goal and #Football, we consider that we have two events at time t, one for #Goal and the other for #Football. In Fig. 5(a), we show a one-week span around the first two semifinal games. Because Twitter discourse is rather unstructured, intervals are very small, comprising at most a few hours. During the games themselves (the evenings of May 5 and 6), we see extremely short intervals representing the fast evolution of topics. In Fig. 5(b), we zoom in on the first semifinal game. As the game begins, intervals become much shorter, as we capture events that did not appear in the period preceding the match. We see other useful signatures, such as the rapid dynamics at the beginning of the second half (21:45) and at the time of the game-winning goal (21:58).
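The extraction of events from tweets could be sketched as follows; the record format (timestamp, list of hashtags) and the lower-casing of hashtags are assumptions of ours, not details of the original pipeline.

```python
def tweets_to_events(tweets):
    """Map each tweet to one event per co-occurring hashtag (hypothetical
    input format: an iterable of (timestamp, [hashtags]) records).
    The tracking hashtag #UCL itself is not treated as an event."""
    events = []
    for timestamp, hashtags in tweets:
        for tag in hashtags:
            if tag.lower() != "#ucl":
                events.append((tag.lower(), timestamp))
    return events

# Example: one tweet with #UCL, #Goal and #Football yields two events.
print(tweets_to_events([(1000, ["#UCL", "#Goal", "#Football"])]))
# -> [('#goal', 1000), ('#football', 1000)]
```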

Figure 5

(a) Slicing of Twitter hashtag sequences generated during the UEFA Champions League 2014–2015 football tournament. The format of the figure is as in Fig. 2. In these data, all tweets containing the #UCL hashtag are captured, and the co-occurring hashtags are taken as our events. The two increases in the event rate correspond to the two matches, which are captured through very fine intervals in the data slicing. (b) First semifinal of the UEFA Champions League 2014–2015, between Juventus and Real Madrid. The format of the figure is as in Fig. 2. The game runs from 20:45 until 22:30, with halftime from 21:30 until 21:45. We see sped-up activity around the goals in the slicing (top) that is not simply explained by the event rate (bottom); when goals are scored (20:53, 21:12 and 21:55), there is a much faster turnover of topics.

One might wonder whether our method detects something in the data beyond what one would expect in random data. To study this, we perform a shuffling test. For each dataset, we compare the original data with 100 randomized versions, obtained by shuffling all event IDs while preserving the set of timestamps. For the base case, we compute the widths of all intervals and use a Mann–Whitney U test to compare this distribution with the distribution of all widths found in the 100 resamples. The resulting p-values are listed in Table 1 and show that our results are clearly distinguishable from those found in random event streams.
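A sketch of such a test, reusing slice_stream from the earlier sketches and SciPy's Mann-Whitney U implementation, is given below; the number of resamples follows the description above, while the two-sided alternative and other details are our assumptions.

```python
import random
from scipy.stats import mannwhitneyu

def width_distribution(boundaries):
    """Interval widths from an ordered list of boundary times."""
    return [b - a for a, b in zip(boundaries[:-1], boundaries[1:])]

def shuffle_test(events, t0, t_max, n_resamples=100, seed=0):
    """Compare interval widths on the real data against widths obtained
    after shuffling event IDs while keeping all timestamps fixed."""
    rng = random.Random(seed)
    real_widths = width_distribution(slice_stream(events, t0, t_max)[0])
    ids = [e for e, _ in events]
    times = [t for _, t in events]
    null_widths = []
    for _ in range(n_resamples):
        shuffled_ids = ids[:]
        rng.shuffle(shuffled_ids)
        shuffled = list(zip(shuffled_ids, times))
        null_widths.extend(width_distribution(slice_stream(shuffled, t0, t_max)[0]))
    return mannwhitneyu(real_widths, null_widths, alternative="two-sided")
```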

Table 1: Test of results on real datasets with respect to a baseline consisting of a random reshuffling of the events.
