UCR Suite for Time Series Subsequence Search

The UCR Suite: Funded by NSF IIS - 1161997 II.

This webpage was built in support of the UCR Suite, software that enables ultrafast subsequence search under both Dynamic Time Warping (DTW) and Euclidean Distance (ED). The work first appeared in a SIGKDD 2012 paper.

ACM SIGKDD Best Paper Award Winner 2012

ACM SIGKDD Test of Time Winner 2022

"We observe that UCR-Suite wins in exact query answering and on hard queries." Echihabi et al., VLDB 2019.

"Recent optimizations on DTW similarity search (the UCR Suite) can make this entire operation feasible in real time." Stuart Russell et al., CHI 2013.

The UCR Suite was developed by:

UC Riverside: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
University of Sao Paulo: Gustavo Batista
Brigham and Women's Hospital: Brandon Westover
Authors Rakthanmanon, Campana, Mueen and Batista contributed equally, and should be considered joint first authors.

How fast is the UCR-Suite? It depends on the data, query length, query shape, hardware, warping constraint, etc. However, to a first approximation:

We can search a million datapoints in a second...
We can search billions of datapoints in minutes...
We can search trillions of datapoints in hours.

What are the advantages of the UCR-Suite?

It is exact, not approximate.
It does not require parameters to be set.
It requires zero preprocessing time.
It correctly z-normalizes the data.
It has no minimum or maximum query length (we have searched queries as short as 16 and as long as 72,500; see the DNA video).
It can also handle exact queries under uniform scaling.
The same idea works for both streaming data, and batch offline search.
Finally, we are simply much faster than any known technique.
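
To make "exact" and "z-normalizes" concrete, here is a minimal brute-force sketch of subsequence search under z-normalized Euclidean distance. It is a slow reference baseline, not the UCR Suite implementation; the function names (znorm, best_match_ed), the handling of constant windows, and the use of early abandoning are assumptions made for illustration.

```python
import math

def znorm(x):
    """Return a z-normalized copy of x (zero mean, unit standard deviation)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        # Assumption: a constant window is mapped to all zeros.
        return [0.0] * n
    return [(v - mean) / std for v in x]

def best_match_ed(series, query):
    """Exact nearest subsequence of `series` to `query` under z-normalized ED.

    Every window is z-normalized on its own before comparison, so matches
    are invariant to offset and amplitude. Returns (position, distance).
    """
    m = len(query)
    q = znorm(query)
    best_sq, best_pos = math.inf, -1
    for i in range(len(series) - m + 1):
        c = znorm(series[i:i + m])
        d = 0.0
        for a, b in zip(q, c):
            d += (a - b) ** 2
            if d >= best_sq:
                # Early abandoning: this window can no longer beat the best.
                break
        if d < best_sq:
            best_sq, best_pos = d, i
    return best_pos, math.sqrt(best_sq)

# The query [1, 2, 4] matches the window [10, 20, 40] after z-normalization.
pos, dist = best_match_ed([5, 1, 4, 2, 10, 20, 40, 3, 3, 7], [1, 2, 4])
print(pos, dist)
```

Because every window is examined (or abandoned only once it provably cannot win), the result is exact, and no index or preprocessing pass over the data is needed.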

Here we show we can search a day-long ECG tracing in 35 seconds under DTW, using a single core.

Using the same query, we can search a year of ECG (8,518,554,188 datapoints) in 18 minutes using a multi-core machine.

Thus we can search 256Hz signals about thirty thousand times faster than real time.
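
The "thirty thousand times" figure can be checked from the numbers quoted above. The short calculation below is only an arithmetic check; the variable names are illustrative.

```python
# Figures quoted in the text above.
datapoints = 8_518_554_188      # ECG datapoints searched
sampling_rate_hz = 256          # samples per second
search_seconds = 18 * 60        # 18 minutes on a multi-core machine

# How long the signal took to record, in seconds.
recording_seconds = datapoints / sampling_rate_hz

# Ratio of recording time to search time.
speedup = recording_seconds / search_seconds
print(round(speedup))           # roughly 30,800
```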

Here we show we can support very long queries. We search for a query of length 72,500 in 21,435,268 datapoints in 18 seconds.
The reference dendrogram we compared to at the end of this video is from:
D. P. Locke, et al. 2011. Comparative and demographic analysis of orangutan genomes. Nature 469, 529-533.

How does changing the width of the warping constraint affect the speed-up? See here for the numbers; in brief, it makes very little difference. Over the range of 0 to 15, which would include the best accuracy setting for the vast majority of the UCR archive problems, the difference is barely perceptible.
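
For readers unfamiliar with the warping width, the sketch below shows one common way to constrain DTW: a band that only lets point i of one sequence align with points j of the other where |i - j| <= r. This is an illustrative assumption about what "width" means here, expressed in samples; the function name dtw_band is hypothetical and this is not the UCR Suite code.

```python
import math

def dtw_band(a, b, r):
    """DTW distance between equal-length sequences, restricted to |i - j| <= r.

    r = 0 reduces to Euclidean distance; larger r allows more warping.
    """
    n = len(a)
    if len(b) != n or r < 0:
        raise ValueError("sequences must have equal length and r must be >= 0")
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n])

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]   # same bump, shifted one step earlier
print(dtw_band(a, b, 0))    # no warping allowed
print(dtw_band(a, b, 1))    # a one-step warp absorbs the shift
```

In this naive form the work per comparison grows with r, since each row fills up to 2r + 1 cells.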

Code:

The code.

Data:

Face (four) dataset has been available for 8 years here, with Gun/NoGun data, and all UCR archive data.
The raw DNA came from UCSC; our code to convert it to time series is here.
The music symbols were collected by Alicia Fornes; they are here. See Fig. 10 of this paper for samples.
The online motif data is here.
The code for random walk is here, including the exact seeds we used. See also.
The 20 million random walk dataset is here, including all the queries used.
The 22 hours and 23 minutes of ECG data (20,140,000 datapoints) shown in the video above is here, together with the exact query.
The 1,000 star light curve data is the entire training set from StarLightCurves, archived here.
The 1.08 years of ECG data came from Physionet.org. Here we list the exact set of data we trawled. This is too large for our servers to host. If you want the exact data, just send us a 16 GB thumb drive with your return address; we will pay return shipping.