This is a supporting webpage of our KDD 2015 paper

Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

by

Nurjahan Begum, Liudmila Ulanova, Jun Wang, and Eamonn Keogh

The paper is here

Real Life Case Study 1: Electromagnetic Articulograph dataset

This data contains tongue and lip time-series movement data in producing 25 words collected from multiple speakers of native English. Electromagnetic Articulograph was used to collect the data, where small wired sensors were attached on the surface of tongue and lips. Each data file is a sample (example) - production of a word, by one speaker.
The data files have been segmented (from long sequences) into individual words, which means the beginning of a data file is the beginning of a word production, the end of the data file if the end of the word production.

12 sensors were used for data collection, including HC, HL, HR, T1, T2, T3, T4, UL, LL, JC, JL, and JR. HC, HL, HR are head sensors; T1 - T4 are tongue sensors, from tongue tip to tongue back in the midline; UL and LL are lip sensors; JC, JL, and JR are jaw sensors; Data from head sensors are used for calculating head-independent movement of other sensors.

The validity of jaw sensor data is not verified. Thus, currently, only tongue and lip data are available for word classification.

In our experiment we used data from the lower lip sensor. We preprocessed the data to contain 18 different words.

Data

For any data related question, contact (Dr. Jun Wang)

Anytime Experiment

We demonstrate the accelerated multi dimensional clustering problem under DTW in the Table below. As we can see, using a single axis to do the clustering does not give the best results (row 1-3). In addition, using all the axes is useful neither (row 8). However, if we carefully omit irrelevant axis (in our case it is X axis, because it is only useful for patients with facial asymmetry), then we can perform very well in terms of cluster quality, and distance computation (row 5).


We can now ask how effective our pruning strategy is when performing this clustering. As we can see from the figure below.top) TADPole achieves ~94% pruning , and the zoomed-in view of bottom) confirms that TADPole performs almost as good as the oracle algorithm. In wall clock terms, TADPole takes only 2.89 seconds, allowing interactive analysis and feedback to the patient.

Real Life Case Study 2: Pulsus Dataset

Pulsus Paradoxus is defined as a significant decline in the pulse with inspiration. It is a symptom of Cardiac Tamponade, a life-threatening condition where high-pressure fluid fills the sac surrounding the heart. Of the several ways to detect Pulsus, the least invasive and simplest uses the PPG (PhotoPlethysmoGram).


In the figure below we show the PPG measurement process. From the right figure, we can see that the Pulsus instances are within a compact cluster, and the non-Pulsus instances seem to form a number of sub-clusters. Our medical collaborator suggests this reflects the fact that there is one way to have Tamponade, but multiple ways to have a healthy heartbeat.


For this case study we consider a dataset of 500 PPGs from two sources, the MIMIC II Waveform Database [1][2] and our collaborators. As shown in the figure below, bcause the raw PPG data is very complex, we converted the PPGs to amplitude spectrums (middle) and clustered in that space. We also show the Pulsus (green) and non-Pulsus (red) in the figure below(bottom).

Data

Real Life Case Study 3: Person Reidentification Dataset

Person re-identification is the task of recognizing individuals across spatially disjointed cameras, and an important problem for understanding human behavior in areas covered by surveillance cameras.


We considered the PRID dataset [3], randomly extracting 1,000 images of 12 different individuals. We converted the images to color histogram.

Data

Image Data

Anytime Experiment

From the figure below, we see that the performance of TADPole is very close to an oracle version of the algorithm.

When does TADPole compute distances, and when does it prune?


We analyzed when the distance computation of TADPole is more likely and when not on the Re Identification dataset. For this, we bi-clustered the sparse distance matrix returned by TADPole using the algorithm in [4]. The figure below shows the bi-clustered distance matrix. The black cells in A) shows the cells when TADPoel calculates the actual DTW distance. The colored cells in B) show the object pairs from A) with the same ground truth label. The colored cells in C) show the object pairs from A) with the different ground truth label.


Observations from Intra-class Distance Computations by TADPole

- Pair of items from same class cause more distance computations (72%) - Pair of items from different class cause more distance prunings (13.8%)

Density computation
Pair of items which are very similar OR dissimilar in terms of the cutoff distance will have their distance computation pruned.

Turns out that
- items from same class mostly stay in the border line for cutoff distance causing more distance computation

- items from different class are mostly dissimilar causing more distance prunings


NN distance from high density list computation

Turns out that
- items from same class are similar, has lower LB, causes more distance computation(83%)

- items from different class are mostly dissimilar,causes more distance prunings (17%)
Examples


Pair of items from same class with distance pruning means,
They are either very similar OR very dissimilar. An example:


Pair of items from different class with distance computation means,
They are very similar. An example:

Clustering Quality and Time Experiment

The spreadsheet of the experiment is here.

Proof of Exactness



Code

Given the all-pair distance matrix,

- an example k-means algorithm implementation is here.

- an example DBSCAN algorithm implementation is here.

- an example Spectral Clustering algorithm implementation is here.

An example TADPole code with all parameters set, and sample data is available here.


References

[1] Goldberger, A. L. et al. Physiobank, Physiotoolkit, and Physionet Components of A New Research Resource for Complex Physiologic Signals. Circulation, 101(23), e215-e220, 2000.

[2] Saeed, M., et al. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A Public-access Intensive Care Unit Database. Critical care medicine, 39(5), 952, 2011.

[3] Hirzer, Martin, et al. Person Re-identification by Descriptive and Discriminative Classification. Image Analysis. Springer Berlin Heidelberg, pp. 91-102, 2011.

[4] Dhillon, I. S. Co-clustering documents and words using bipartite spectral graph partitioning,ACM SIGKDD, 2001.