A Briefly Annotated Timeline of Similarity Search in Time Series

Oct. 13, 2010, NB: This “living document” assembles a collection of works that address similarity search in large databases of time series, starting in 1993 and ending in 2008. I am not so clear on what has happened since then — so references and pointers are appreciated. My briefly annotated timeline of similarity search in audio and music signals is here.


1993


R. Agrawal, C. Faloutsos, and A. Swami, “Efficient similarity search in sequence databases,” in Proc. Int. Conf. Foundations Data Org. Algo., (Chicago, IL), pp. 69-84, Oct. 1993.

The authors propose a method for similarity
matching between time series of the same length.
They use the DFT to transform each one
and approximate its shape by keeping the first 3-5 DFT coefficients,
which are compared to queries using the Euclidean distance.
Then they construct an “F-index” structure with an R*-tree —
a tree-indexing method for spatial data.
They justify keeping the first few (complex) coefficients of the DFT
since many real world signals have power spectral densities
that drop off like \(f^{-\alpha}\), with \(\alpha > 1\).
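
As a minimal sketch of this feature extraction (in Python with NumPy; the function names and parameter values are mine, not the paper’s), one keeps the first few coefficients of a unitary DFT and compares feature vectors with the Euclidean distance. By Parseval’s theorem the feature-space distance lower-bounds the true distance, so no qualifying series is falsely dismissed:

```python
import numpy as np

def dft_features(x, n_coeffs=3):
    """First few coefficients of a unitary DFT, as a real feature vector."""
    X = np.fft.fft(x) / np.sqrt(len(x))  # unitary scaling, so Parseval holds
    c = X[:n_coeffs]
    return np.concatenate([c.real, c.imag])

# toy usage: range query over a database of equal-length series
rng = np.random.default_rng(0)
db = [np.cumsum(rng.standard_normal(128)) for _ in range(1000)]
query = np.cumsum(rng.standard_normal(128))
fq = dft_features(query)
eps = 10.0  # feature distance <= true distance, so this prunes safely
candidates = [x for x in db if np.linalg.norm(dft_features(x) - fq) <= eps]
```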

1994


C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series databases,” in Proc. ACM SIGMOD Int. Conf. Mgmt. Data, (Minneapolis, MN), pp. 419-429, 1994.

The authors extend the
approach in Agrawal et al. (1993) to subsequence matching.
A sliding window of the minimum query size is applied to the data
before the DFT; the first few coefficients are kept, creating a “trail”
in the feature space.
Trails are broken into sub-trails, which are generalized
using minimum bounding rectangles,
and then indexed using an R*-tree.
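
A sketch of the trail construction, under the same assumptions as the previous snippet (names are mine; a real implementation would split sub-trails adaptively and index the rectangles in an R*-tree):

```python
import numpy as np

def trail(x, w, n_coeffs=2):
    """Truncated-DFT features of every length-w sliding window of x."""
    feats = []
    for i in range(len(x) - w + 1):
        X = np.fft.fft(x[i:i + w]) / np.sqrt(w)
        c = X[:n_coeffs]
        feats.append(np.concatenate([c.real, c.imag]))
    return np.array(feats)  # one point in feature space per window

def minimum_bounding_rectangle(points):
    """Corners of the MBR that generalizes a sub-trail."""
    return points.min(axis=0), points.max(axis=0)

# e.g., break a trail into fixed-length sub-trails and bound each one
t = trail(np.random.default_rng(1).standard_normal(256), w=64)
mbrs = [minimum_bounding_rectangle(t[i:i + 16]) for i in range(0, len(t), 16)]
```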

1995


R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim, “Fast similarity search in the presence of noise, scaling, and translation in time-series databases,” in VLDB ’95: Proceedings of the 21th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 490-501, Morgan Kaufmann Publishers Inc., 1995.

The authors extend the approach of Agrawal et al. (1993)
to make it invariant to scaling, translation, and noise —
but at the price of generality.
They propose a new measure of similarity: two sequences are similar
if they have enough non-overlapping time-ordered pairs
of subsequences that are similar.
Their approach is data-specific, involving
fitting waveforms within envelopes drawn around one another,
stitching together elements, scaling series, and changing offsets.

1996


C.-S. Li, P. S. Yu, and V. Castelli, “Hierarchyscan: A hierarchical similarity search algorithm for databases of long sequences,” in Proc. Int. Conf. Data Eng., (New Orleans, LA), pp. 546-553, Feb. 1996.

The authors propose “HierarchyScan,” an enhancement
of the method proposed in Faloutsos et al. 1994,
and very similar in flavor to P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008.
Here they use a correlation measure in the frequency domain to gauge similarity.
The feature extraction process involves an orthonormal transformation of
overlapping and normalized segments of the signals of the database.
These transformed vectors are then partitioned
into non-overlapping subsets of features,
the size of which maximizes the speed of the correlation process.
The features of the query signal are found in the same way
as those of the database.
Starting with the most discriminative subset of features of the query
(for instance, the subset with the largest energy),
the correlation in the time domain is approximated
by point-wise multiplication in the frequency domain
(here the authors forget to time-reverse one of the signals,
i.e., to conjugate one of the transforms).
Inverse transforming back to the time domain and comparing the maximum
against a prescribed threshold determines whether
further comparison between the two should occur.
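
For reference, here is how the frequency-domain correlation should look (a sketch in Python with NumPy, not the authors’ code). Conjugating one transform is what makes point-wise multiplication compute correlation rather than convolution:

```python
import numpy as np

def circular_cross_correlation(x, y):
    """Circular cross-correlation of x and y via the frequency domain."""
    X = np.fft.fft(x)
    Y = np.fft.fft(y)
    # without np.conj this would be circular convolution, i.e., correlation
    # against a time-reversed y (the missing "flip" noted above)
    return np.fft.ifft(X * np.conj(Y)).real

# the maximum of the correlation is compared against a threshold to decide
# whether to continue comparing further feature subsets
x = np.sin(np.linspace(0, 4 * np.pi, 64, endpoint=False))
peak = circular_cross_correlation(x, np.roll(x, 10)).max()
```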

1997


D. Rafiei and A. Mendelzon, “Similarity-based queries for time series data,” in SIGMOD ’97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, (New York, NY, USA), pp. 13-25, ACM, 1997.

The authors recognize that
the Euclidean distance between two time series
can be reduced dramatically by moving averages,
time scaling (by an integer factor), and inversion.
By assigning a cost to these transformations,
one can determine the “best” fit of two time series
within a limiting cost.
They use the truncated DFT for the search space,
and derive simple relations that show how each
transformation changes these feature vectors.
To index the feature space they use minimum bounding rectangles
and an R-tree index.

F. Korn, H. V. Jagadish, and C. Faloutsos, “Efficiently supporting ad hoc queries in large datasets of time sequences,” in SIGMOD ’97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, (Tucson, AZ), pp. 289-300, ACM, 1997.

The authors recognize that
ad hoc queries on compressed databases are
not convenient given the nature of compression;
so they look for ways to compress a database that
will allow such ad hoc queries.
Assuming that data matrices are large and skinny,
they propose using SVD to compress the data set
to its principal components, with which they build
an index structure; being optimal in the least-squares sense,
this will outperform the truncated DFT of Agrawal et al. (1993) or a DCT feature vector.
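
A sketch of the idea in Python with NumPy (all names are mine): the right singular vectors of the data matrix give a data-adapted basis, and each series is described by its projection onto the leading ones:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 64))   # "large and skinny": many series, few samples

k = 8                                  # number of principal components kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
features = A @ Vt[:k].T                # k-dimensional feature vector per series

# a query is projected onto the same basis before searching the index
q = rng.standard_normal(64)
fq = q @ Vt[:k].T
nearest = int(np.argmin(np.linalg.norm(features - fq, axis=1)))
```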

1998


D. Rafiei and A. Mendelzon, “Efficient retrieval of similar time sequences using DFT,” in Proc. Int. Conf. Found. Data Org., (Kobe, Japan), pp. 249-257, Nov. 1998.

The authors extend the approach in Agrawal et al. (1993)
to exploit the conjugate symmetry of the DFT of real sequences.
This speeds up search in the database.
I have talked about this here.

B.-K. Yi, H. V. Jagadish, and C. Faloutsos, “Efficient retrieval of similar time sequences under time warping,” in ICDE ’98: Proceedings of the Fourteenth International Conference on Data Engineering, (Washington, DC, USA), pp. 201-208, IEEE Computer Society, 1998.

The authors propose a time-warp distance
for similarity search among a database of arbitrary length sequences.
A pair of sequences is time-warped by “stuttering” their elements
such that their peaks are aligned — which requires dynamic programming,
and thus a high computational cost.
They avoid having to compute this for every pair
by first finding a subset of the database which
contains signals most similar to the query.
This set is then reduced by computing
the time-warping distances.
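
For concreteness, a minimal dynamic-programming sketch of such a time-warping distance (in Python; this is the textbook recurrence, not necessarily the paper’s exact formulation):

```python
import numpy as np

def time_warp_distance(x, y):
    """Dynamic-programming time-warping distance, O(len(x) * len(y))."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # "stutter" an element of x, of y, or advance both together
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(time_warp_distance([0, 0, 1, 2, 1], [0, 1, 1, 2, 2, 1]))
```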

1999


K. P. Chan and A. W.-C. Fu, “Efficient time series matching by wavelets,” in Proc. Int. Conf. Data Eng., (Sydney, Australia), pp. 126-133, Mar. 1999.

The authors show that similarity search
in time series can be performed using feature vectors
built from a Haar wavelet decomposition instead of DFT.
Similar to the truncated DFT approach, they use
a truncated Haar wavelet transform,
and show that no false dismissals are possible
with an appropriate scaling of the \(\epsilon\) parameter
by the frame constant.
An R-tree is used to index the feature space.
They show that the specificity of the wavelet coefficients
reduces the number of candidate series that must be examined
compared with the DFT approach.
Also related is: Z. R. Struzik and A. Siebes, “The haar wavelet transform in the time series similarity paradigm,” in PKDD ’99: Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, (London, UK), pp. 12-22, Springer-Verlag, 1999.
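
A sketch of the feature extraction (Python; names are mine): an orthonormal Haar transform, truncated to its coarsest coefficients, so that, as with the truncated DFT, the feature-space distance lower-bounds the true one:

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar wavelet transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # smooth (approximation)
        dif = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail
        details = list(dif) + details             # finest details go last
        x = avg
    return np.array(list(x) + details)            # [approx, coarse ... fine]

def haar_features(x, n_coeffs=4):
    """Keep only the coarsest coefficients as the feature vector."""
    return haar_transform(x)[:n_coeffs]
```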

D. Rafiei, “On similarity-based queries for time series data,” in ICDE ’99: Proceedings of the 15th International Conference on Data Engineering, (Washington, DC, USA), pp. 410-417, Mar 1999.

This work extends that in Rafiei and Mendelzon 1998
to using a combination of transformations instead of just one,
e.g., scaling and translation.

K. Kashino, G. Smith, and H. Murase, “Time-series active search for quick retrieval of audio and video,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 6, pp. 2993-2996, Mar. 1999.

The authors propose a method
for the search of audio and video data within a large audio or video stream.
For audio data, the feature vector is a set of \(N\) (e.g., 7)
normalized short-term band powers,
found using \(N\) second-order IIR bandpass filters
(e.g., \(Q=15\)) with center frequencies equally spaced
on a logarithmic frequency axis.
These features are used because they are
less expensive to compute than MFCCs.
The feature vectors of the windowed data
(e.g., 128 samples at 11 kHz sampling rate)
are reduced to a histogram representation.
The distance between two histograms
is calculated using “histogram intersection,”
which is the sum of the minimum values in bin pairs over all bins.
Then the distance between two windows of data
is the minimum histogram intersection computed
over all sub-windows.
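
A sketch of the histogram-intersection similarity (Python; the simple 1-D value bins here are a stand-in for the paper’s vector-quantized feature histograms):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two normalized histograms (in [0, 1])."""
    return np.minimum(h1, h2).sum()

# toy usage with simple 1-D value bins
edges = np.linspace(0.0, 1.0, 17)
h1, _ = np.histogram([0.10, 0.12, 0.50], bins=edges)
h2, _ = np.histogram([0.11, 0.50, 0.52], bins=edges)
sim = histogram_intersection(h1 / h1.sum(), h2 / h2.sum())
```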

2000


C. Wang and X. S. Wang, “Supporting content-based searches on time series via approximation,” in Proc. Int. Conf. on Scientific and Statistical Database Management, pp. 69-81, July 2000.

The authors propose to approximate
time series to perform efficient similarity search for
five types of content: shapes, trends, cyclic components,
autocorrelation functions, and partial autocorrelation functions.
Two types of approximation are explored: non-linear truncated wavelet
decomposition (picking the most significant coefficients
of a linear B-spline wavelet); and line fitting (approximating a
subsequence with a line until the error reaches a maximum,
after which a new line is begun).
A similarity search is then performed on these
approximation functions using a distance measure
specific to the content of interest.

Y.-L. Wu, D. Agrawal, and A. El Abbadi, “A comparison of dft and dwt based similarity search in time-series databases,” in Proc. Ninth Int. Conf. on Information and Knowledge Management, (McLean, Virginia, United States), pp. 488-495, 2000.

Contradicting the claims of Chan and Fu (1999),
the authors find that the Haar wavelet decomposition performs no better
than the DFT exploiting conjugate symmetry, as proposed in Rafiei and Mendelzon (1998).

E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast similarity search in large time series databases,” Journal of Knowledge and Information Systems, vol. 3, pp. 263-286, 2000.

The authors propose a new
dimensionality reduction technique: Piecewise Aggregate Approximation (PAA).
This is essentially sample-and-hold using the mean value
of each segment, which results in a downsampled series.
To calculate the distance between two time series
using their approximations, the authors use
the standard and weighted Euclidean distance.
With the segment length selected wisely, PAA produces
results identical to a Haar transform with less computation;
but it can be generalized to accommodate a greater range of query lengths.
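
PAA itself is a few lines (a sketch in Python; the lower-bounding distance below is the standard scaled Euclidean distance from the paper):

```python
import numpy as np

def paa(x, n_segments):
    """Mean of each equal-length segment; len(x) must divide evenly."""
    return np.asarray(x, dtype=float).reshape(n_segments, -1).mean(axis=1)

def paa_distance(a, b, segment_len):
    """Scaled Euclidean distance; lower-bounds the true Euclidean distance."""
    return np.sqrt(segment_len) * np.linalg.norm(np.asarray(a) - np.asarray(b))

x = np.sin(np.linspace(0, 2 * np.pi, 128))
y = np.cos(np.linspace(0, 2 * np.pi, 128))
assert paa_distance(paa(x, 8), paa(y, 8), 16) <= np.linalg.norm(x - y)
```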

P. Indyk, N. Koudas, and S. Muthukrishnan, “Identifying representative trends in massive time series data sets using sketches,” in Proc. Int. Conf. Very Large Data Bases, (Cairo, Egypt), pp. 363-372, Sep. 2000.

The authors propose mining large
discrete time series for “representative trends,” which are defined to be
observation intervals that behave suitably differently from other intervals.
Two trends that they focus on are the “relaxed period” and the “average trend,”
for which they develop fast algorithms in this paper.
(The relaxed period is defined as the contents of the
smallest disjoint partitioning of the dataset which,
when repeated to the length of the original data,
results in the smallest distance from the original data.
Similarly, the average trend of length \(l\)
is the subsequence of the same length
that has the smallest total distance to all other disjoint
subsequences in the same dataset.)
As calculating these efficiently and accurately is burdensome,
they develop approximate methods by making “sketches” of the data,
which appears to be “sensing” the data using random vectors
as done in compressed sensing.
Unlike compressed sensing, however,
they do not wish for perfect reconstruction
as they keep the original time series,
but only for a guarantee that the norm of a difference of two sketches
will be nearly the same as that of the two original higher-dimensional vectors,
which is guaranteed by the Johnson-Lindenstrauss Lemma.
They avoid the further cost of having to compute sketches
for every possible subsequence length by
finding an approximate relationship between the sketches of
subvectors and those of the smaller or larger subvectors of interest.
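
A sketch of the sketching step (Python; the dimensions and scaling are my choices): a random Gaussian projection whose pairwise distances approximate the originals, per the Johnson-Lindenstrauss Lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 32                               # original and sketch dimensions

R = rng.standard_normal((k, n)) / np.sqrt(k)  # random "sensing" matrix

x = rng.standard_normal(n)
y = rng.standard_normal(n)
true_dist = np.linalg.norm(x - y)
sketch_dist = np.linalg.norm(R @ x - R @ y)   # close to true_dist w.h.p.
print(true_dist, sketch_dist)
```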

B.-K. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrary lp norms,” in VLDB ’00: Proceedings of the 26th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 385-394, Morgan Kaufmann Publishers Inc., 2000.

The authors present an indexing scheme for similarity search using a measure defined by any \(\ell_p\) norm.
They also propose and test a new feature extraction method, which essentially models a time-series by uniform sample and hold.

2001


E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani, “Locally adaptive dimensionality reduction for indexing large time series databases,” ACM Trans. Database Syst., vol. 27, no. 2, pp. 188-228, 2001.

Extending the work in Keogh et al. 2000,
the authors propose a locally adaptive version of
Piecewise Aggregate Approximation:
Adaptive Piecewise Constant Approximation (APCA).
Instead of using segments of a constant length,
the segment length is adapted to the local behavior
of the data such that the approximation error is reduced.
They propose an indexing procedure by using
a spatial indexing method (R*-tree) with
minimum bounding rectangles in the
time-amplitude domain.

2002


I. Popivanov and R. J. Miller, “Similarity search over time-series data using wavelets,” in Proc. IEEE Int. Conf. Data Eng., (San Jose, CA), pp. 212-221, Feb. 2002.

The authors show that it is possible
for other wavelets to perform better than the Haar at similarity search.
They also prove that any wavelet decomposition can guarantee
no false dismissals occur with an appropriate scaling
of the \(\epsilon\) parameter by the frame constant.

K. Kawagoe and T. Ueda, “A similarity search method of time series data with combination of fourier and wavelet transforms,” in Proc. Int. Symp. Temporal Representation, Reasoning, (Washington, DC), pp. 86-92, 2002.

The authors propose a combination
of the DFT and Haar DWT for similarity search.
Initial matches to a query, found using the Euclidean
distance between truncated DFT features,
are filtered using the Euclidean distance between
truncated Haar features,
and then the time-domain Euclidean distances
are computed on the remaining matches.
They use the DFT features first because
their experience has shown a lack of
selectivity in the Haar features.

P. Patel, E. Keogh, J. Lin, and S. Lonardi, “Mining motifs in massive time series databases,” in ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining, (Washington, DC, USA), pp. 370-377, IEEE Computer Society, 2002.

After describing the field of
similarity search in massive time series databases
as a solved problem, the authors suggest the more
interesting data mining problem of finding patterns
in data that frequently occur — which they call “motifs.”
They approximate each time series using PAA of Keogh et al. 2000,
and discretize the values in such a way that the
Euclidean distance between the approximations
will lower bound the true Euclidean distance —
thus satisfying the lower bound distance requirement for no false dismissals.
They propose and test an algorithm
to search for repeating patterns in a data set
while excluding trivial matches
(e.g., the same sequence appearing a few samples later).

E. Keogh and S. Kasetty, “On the need for time series data mining benchmarks: a survey and empirical demonstration,” in KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, (New York, NY, USA), pp. 102-111, ACM, 2002.

The authors make the claim that we do not know the real utility of all of the work thus far done on the problems of indexing, clustering, classification, and segmentation in time-series.
This is because the datasets used in the experiments
are not representative of the diversity present in real-world data (they say, “89% of the test databases could comfortably fit on a 1.44 Mb floppy disk”),
and many experiments suffer from data bias (“use of a particular set of testing data to confirm a desired finding”) and implementation bias (“disparity in the quality of implementation of a proposed approach vs. the quality of implementation of the competing approaches”).
This is a rather hard-hitting paper, pointing out numerous contradictions, naming names, and asking how anyone could screw up the Euclidean distance.
This is somewhat softened in the conclusion.
(I wonder what the response was to this paper?)
They perform an exhaustive set of experiments using algorithms and similarity measures selected from over 360 papers, testing with 50 real-world datasets from finance, medicine, chemistry, networking, etc.
They find that several contradictory performance rankings
are obtained just by changing the dataset.

2003


K. P. Chan, A. W. Fu, and C. Yu, “Haar wavelets for efficient similarity search of time-series: With and without time warping,” IEEE Trans. Knowledge Data Eng., vol. 15, pp. 686-705, May 2003.

The authors extend the work in Chan and Fu 1999
to include time warping as discussed in Yi et al. 1998,
but using Haar wavelet decompositions.

M. Vlachos, J. Lin, E. Keogh, and D. Gunopulos, “A wavelet-based anytime algorithm for k-means clustering of time series,” in Proc. Workshop on Clustering High Dimensionality Data and Its Applications, (San Francisco, CA), pp. 23-30, May 2003.

The authors propose
clustering time series based on their Haar wavelet decompositions
starting at the lowest resolution, using K-means.
The resulting performance is experimentally
found to be better than “hierarchical clustering”
performed on the untransformed data,
and at far less computational cost.
I think the applicability of this work is limited
as it is neither shift-invariant nor scale-invariant,
and it does not allow data of variable lengths.
There also still remains the problem
of selecting the number of clusters to use.

2004


A. Angeles-Yreta, H. Solís-Estrella, V. Landassuri-Moreno, and J. Figueroa-Nazuno, “Similarity search in seismological signals,” in Proc. Fifth Mexican Int. Conf. in Computer Science, 2004.

The authors attempt to
apply PAA from Keogh et al. 2000 to approximate seismological signals,
and then query for similar seismological signals using a time warping distance.
Unfortunately, they do not appreciate that PAA here amounts to
downsampling by a large factor (e.g., reducing 5000 samples to 5).
The resulting “similar” signals are dissimilar.

T. Kahveci and A. K. Singh, “Optimizing similarity search for arbitrary length time series queries,” IEEE Trans. Knowledge Data Eng., vol. 16, pp. 418-433, Apr. 2004.

The authors address the problem of
variable length queries in similarity search.
They propose two new indexing methods, both using
minimum bounding rectangles,
built by splitting the database entries
into non-overlapping segments of increasing size
(powers of 2, starting with the minimum size query possible).
The DFT of each of these sequences is taken,
and a few coefficients are kept.
This creates a multi-resolution index structure
that facilitates quick searching for long queries.
Experiments show a large reduction in
computational cost compared with sequential scanning,
although this comes at a large increase in memory.
To alleviate this requirement, the authors explore the use
of compression of the index structure
and show that even with a 5-to-1 compression,
their indexing technique works extremely well.

J. Lin and E. Keogh, “Finding or Not Finding Rules in Time Series,” in Advances in Econometrics, vol. 19: Applications of Artificial Intelligence in Finance and Economics, 2004.

See Po’D.

2008


W. Wu and J. Hu, “Similarity search based on random projection for high frequency time series,” in Proc. IEEE Conf. Cybernetics and Intelligent Syst., (Chengdu, China), pp. 388-393, Sep. 2008.

The authors discuss the use of random projections (essentially Indyk’s sketching)
for similarity search in time series with high frequency components.
