*Oct. 13, 2010, NB: This “living document” assembles a collection of works that address similarity search in large databases of time-series, starting in 1993 and ending in 2004. I am not so clear what has happened since then — so references and pointers are appreciated. My briefly annotated timeline of similarity search in audio and music signals is here.*

**1993**

R. Agrawal, C. Faloutsos, and A. Swami, “Efficient similarity search in sequence databases,” in Proc. Int. Conf. Foundations Data Org. Algo., (Chicago, IL), pp. 69-84, Oct. 1993.

The authors propose similarity matching for time series of equal length. They transform each series with the DFT and approximate its shape by keeping the first 3-5 DFT coefficients, which are compared to queries using the Euclidean distance. They then construct an “F-index” structure with an R*-tree, a tree-indexing method for spatial data. Keeping only the first few (complex) coefficients of the DFT is justified because many real-world signals have power spectral densities that drop off like \(f^{-\alpha}\), with \(\alpha > 1\).
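The no-false-dismissal guarantee behind this style of indexing is easy to see in code. The following is my own minimal sketch in Python/NumPy (not the authors' implementation): with a unitary DFT, the full transform preserves Euclidean distances (Parseval), so a truncated feature distance can only shrink, never grow.

```python
import numpy as np

def dft_features(x, k=3):
    """First k complex coefficients of the unitary DFT, as a real vector.

    The 1/sqrt(N) scaling makes the full transform distance-preserving,
    so truncation can only shrink distances between series.
    """
    X = np.fft.fft(x) / np.sqrt(len(x))
    c = X[:k]
    return np.concatenate([c.real, c.imag])

def feature_distance(x, y, k=3):
    """Distance in feature space; never exceeds the true Euclidean
    distance, so pruning on it causes no false dismissals."""
    return float(np.linalg.norm(dft_features(x, k) - dft_features(y, k)))
```

The index answers a range query by discarding every series whose feature distance already exceeds the query radius; the survivors are then checked against the full series.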

**1994**

C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series databases,” in Proc. ACM SIGMOD Int. Conf. Mgmt. Data, (Minneapolis, MN), pp. 419-429, 1994.

The authors extend the approach in Agrawal et al. (1993) to subsequence matching. A sliding window of the minimum query size is run over the data, the DFT of each window is taken, and the first few coefficients are kept, creating a “trail” in the feature space. Trails are broken into sub-trails, which are generalized using minimum bounding rectangles, and then indexed using an R*-tree.
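The trail construction can be sketched as follows (Python/NumPy, my own illustration with hypothetical helper names): each window position becomes a low-dimensional feature point, and a sub-trail is summarized by its minimum bounding rectangle.

```python
import numpy as np

def feature_trail(x, w, k=2):
    """Map each length-w sliding window of x to its first k unitary-DFT
    coefficients (real and imaginary parts), giving a trail of points."""
    points = []
    for i in range(len(x) - w + 1):
        X = np.fft.fft(x[i:i + w]) / np.sqrt(w)
        points.append(np.concatenate([X[:k].real, X[:k].imag]))
    return np.array(points)

def bounding_rectangle(sub_trail):
    """Minimum bounding rectangle of a sub-trail: per-dimension min/max."""
    return sub_trail.min(axis=0), sub_trail.max(axis=0)
```

Because consecutive windows overlap heavily, adjacent trail points are close together, which is why grouping them into sub-trail rectangles is much tighter than one rectangle per point.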

**1995**

R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim, “Fast similarity search in the presence of noise, scaling, and translation in time-series databases,” in VLDB ’95: Proceedings of the 21th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 490-501, Morgan Kaufmann Publishers Inc., 1995.

The authors extend the approach of Agrawal et al. (1993) to make it invariant to scaling, translation, and noise, but at the price of generality. They propose a new measure of similarity: two sequences are similar if they have enough non-overlapping, time-ordered pairs of subsequences that are similar. Their approach is data-specific, involving fitting waveforms within envelopes drawn around one another, stitching together elements, scaling series, and changing offsets.

**1996**

C.-S. Li, P. S. Yu, and V. Castelli, “Hierarchyscan: A hierarchical similarity search algorithm for databases of long sequences,” in Proc. Int. Conf. Data Eng., (New Orleans, LA), pp. 546-553, Feb. 1996.

The authors propose “HierarchyScan,” an enhancement of the method proposed in Faloutsos et al. (1994), and very similar in flavor to P. Jost and P. Vandergheynst, “On finding approximate nearest neighbours in a set of compressible signals,” in Proc. European Signal Process. Conf., (Lausanne, Switzerland), pp. 1-5, Aug. 2008. Here they use a correlation measure in the frequency domain to gauge similarity. The feature extraction process applies an orthonormal transformation to overlapping and normalized segments of the signals in the database. The transformed vectors are then partitioned into non-overlapping subsets of features, the size of which is chosen to maximize the speed of the correlation process. The features of the query signal are found in the same way as those of the database. Starting with the most discriminative subset of features of the query (for instance, the subset with the largest energy), the correlation in the time domain is approximated by point-wise multiplication in the frequency domain (here the authors forget to flip one of the time-domain signals). Inverse transforming this to the time domain and comparing the maximum against a prescribed threshold determines whether further comparison between the two should occur.
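For reference, the correct frequency-domain correlation, including the conjugation (equivalently, the time-reversal of one signal) that the parenthetical above refers to, looks like this; a generic Python/NumPy sketch, not the paper's code.

```python
import numpy as np

def circular_correlation(a, b):
    """Circular cross-correlation of equal-length sequences via the FFT.

    Correlation in time corresponds to multiplying one spectrum by the
    *conjugate* of the other (the same as time-reversing one signal);
    multiplying the raw spectra would instead compute convolution.
    """
    A = np.fft.fft(a)
    B = np.fft.fft(b)
    return np.fft.ifft(A * np.conj(B)).real
```

The peak of the returned sequence gives the lag at which the two signals align best, which is the quantity HierarchyScan thresholds.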

**1997**

D. Rafiei and A. Mendelzon, “Similarity-based queries for time series data,” in SIGMOD ’97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, (New York, NY, USA), pp. 13-25, ACM, 1997.

The authors recognize that the Euclidean distance between two time series can be reduced dramatically by moving averages, time scaling (by an integer factor), and inversion. By assigning a cost to these transformations, one can determine the “best” fit of two time series within a limiting cost. They use the truncated DFT for the search space, and derive simple relations that show how each transformation changes these feature vectors. To index the feature space they use minimum bounding rectangles and an R-tree index.

F. Korn, H. V. Jagadish, and C. Faloutsos, “Efficiently supporting ad hoc queries in large datasets of time sequences,” in SIGMOD ’97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, (Tucson, AZ), pp. 289-300, ACM, 1997.

The authors recognize that ad hoc queries on compressed databases are not convenient given the nature of compression; so they look for ways to compress a database that still admits such ad hoc queries. Assuming that the data matrices are large and skinny, they propose using the SVD to compress the data set to its principal components, with which to build an index structure that, being optimal in the least-squares sense, will outperform the truncated DFT of Agrawal et al. (1993) or a DCT feature vector.


**1998**

D. Rafiei and A. Mendelzon, “Efficient retrieval of similar time sequences using DFT,” in Proc. Int. Conf. Found. Data Org., (Kobe, Japan), pp. 249-257, Nov. 1998.

The authors extend the approach in Agrawal et al. (1993) to exploit the conjugate symmetry of the DFT of real sequences. This speeds the search in the database. I have talked about this here.
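The gain from the symmetry can be sketched as follows (Python/NumPy, my own illustration): for a real sequence of length \(N\), coefficient \(N-k\) is the conjugate of coefficient \(k\), so each stored coefficient with \(1 \le k < N/2\) accounts for its mirror as well, tightening the lower bound at no extra storage cost.

```python
import numpy as np

def truncated_lb(x, y, k=3):
    """Lower bound on ||x - y|| from the first k+1 unitary-DFT coefficients."""
    D = np.fft.fft(x - y) / np.sqrt(len(x))
    return float(np.sqrt(np.sum(np.abs(D[:k + 1]) ** 2)))

def symmetric_lb(x, y, k=3):
    """Tighter bound for real sequences: coefficients 1..k also stand in
    for their conjugate mirrors N-k..N-1, so their energy counts twice."""
    D = np.fft.fft(x - y) / np.sqrt(len(x))
    return float(np.sqrt(np.abs(D[0]) ** 2
                         + 2 * np.sum(np.abs(D[1:k + 1]) ** 2)))
```

A tighter lower bound prunes more candidates at the index level, which is where the speed-up comes from.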

B.-K. Yi, H. V. Jagadish, and C. Faloutsos, “Efficient retrieval of similar time sequences under time warping,” in ICDE ’98: Proceedings of the Fourteenth International Conference on Data Engineering, (Washington, DC, USA), pp. 201-208, IEEE Computer Society, 1998.

The authors propose a time-warp distance for similarity search over a database of arbitrary-length sequences. A pair of sequences is time-warped by “stuttering” their elements such that their peaks are aligned, which requires dynamic programming and thus a high computational cost. They avoid computing this for every pair by first finding a subset of the database containing the signals most similar to the query; this set is then refined by computing the time-warping distances.
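The stuttering alignment is the classic dynamic-programming time-warping recursion; a minimal sketch (Python/NumPy, my own illustration of the general technique, not the paper's exact distance):

```python
import numpy as np

def time_warp_distance(a, b):
    """Dynamic-programming time-warping distance: either sequence may
    repeat ('stutter') elements so that similar shapes align."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend by stuttering a, stuttering b, or advancing both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

The \(O(nm)\) table is exactly the computational cost the authors' pre-filtering step is designed to avoid paying for every pair.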

**1999**

K. P. Chan and A. W.-C. Fu, “Efficient time series matching by wavelets,” in Proc. Int. Conf. Data Eng., (Sydney, Australia), pp. 126-133, Mar. 1999.

The authors show that similarity search in time series can be performed using feature vectors built from a Haar wavelet decomposition instead of the DFT. Analogous to the truncated DFT approach, they use a truncated Haar wavelet transform, and show no false dismissals are possible with an appropriate scaling of the \(\epsilon\) parameter (which is the frame constant). An R-tree is used to index the feature space. They show that the selectivity of the wavelet coefficients leaves fewer candidate series to prune than when using the DFT approach.

Also related is: Z. R. Struzik and A. Siebes, “The Haar wavelet transform in the time series similarity paradigm,” in PKDD ’99: Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, (London, UK), pp. 12-22, Springer-Verlag, 1999.
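A minimal orthonormal Haar transform (Python/NumPy sketch, assuming power-of-two lengths): because the orthonormal transform preserves Euclidean distances exactly, truncating it can only shrink them, which is what rules out false dismissals.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar decomposition of a length-2^m sequence, coarsest
    coefficients first. Preserves the Euclidean norm exactly."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))  # differences
        x = (x[0::2] + x[1::2]) / np.sqrt(2)              # running averages
    return np.concatenate([x] + details[::-1])
```

Keeping the first few coefficients retains the coarse shape of the series, mirroring the role of the low-frequency DFT coefficients in the earlier work.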

D. Rafiei, “On similarity-based queries for time series data,” in ICDE ’99: Proceedings of the 15th International Conference on Data Engineering, (Washington, DC, USA), pp. 410-417, Mar 1999.

This work extends that in Rafiei and Mendelzon (1998) to using a combination of transformations instead of just one, e.g., scaling and translation.

K. Kashino, G. Smith, and H. Murase, “Time-series active search for quick retrieval of audio and video,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 6, pp. 2993-2996, Mar. 1999.

The authors propose a method for finding audio and video data within a large audio or video stream. For audio data, the feature vector is the normalized short-term power spectrum in \(N\) bands (e.g., \(N = 7\)), found using \(N\) second-order IIR bandpass filters (e.g., \(Q = 15\)) with center frequencies equally spaced in logarithmic frequency. These features are used because they are less expensive to compute than MFCCs. The feature vectors of the windowed data (e.g., 128 samples at an 11 kHz sampling rate) are reduced to a histogram representation. The distance between two histograms is calculated using “histogram intersection,” which is the sum of the minimum values in bin pairs over all bins. The distance between two windows of data is then the minimum histogram intersection computed over all sub-windows.
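Histogram intersection itself is one line; a sketch (Python/NumPy, my own illustration, assuming histograms normalized to sum to one):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: sum over bins of the
    smaller bin value. Equals 1.0 for identical histograms, 0.0 for
    histograms with no overlapping mass."""
    return float(np.minimum(h1, h2).sum())
```

Note that intersection is a similarity (larger means more alike), so in active search it is compared against a threshold from above rather than below.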

**2000**

C. Wang and X. S. Wang, “Supporting content-based searches on time series via approximation,” in Proc. Int. Conf. on Scientific and Statistical Database Management, pp. 69-81, July 2000.

The authors propose to approximate time series so as to perform efficient similarity search for five types of content: shapes, trends, cyclic components, autocorrelation functions, and partial autocorrelation functions. Two types of approximation are explored: non-linear truncated wavelet decomposition (picking the most significant coefficients of a linear B-spline); and line fitting (approximating a subsequence with a line until the error reaches a maximum, after which a new line is begun). A similarity search is then performed on these approximating functions using a distance measure specific to the content of interest.

Y.-L. Wu, D. Agrawal, and A. El Abbadi, “A comparison of dft and dwt based similarity search in time-series databases,” in Proc. Ninth Int. Conf. on Information and Knowledge Management, (McLean, Virginia, United States), pp. 488-495, 2000.

Contradicting the claims of Chan and Fu (1999), the authors find that the Haar wavelet decomposition performs no better than the DFT when the conjugate symmetry proposed in Rafiei and Mendelzon (1998) is used.


E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast similarity search in large time series databases,” Journal of Knowledge and Information Systems, vol. 3, pp. 263-286, 2000.

The authors propose a new dimensionality reduction technique: Piecewise Aggregate Approximation (PAA). This is essentially sample-and-hold using the mean value of each segment, which results in a downsampled series. To calculate the distance between two time series using their approximations, the authors use the standard and weighted Euclidean distances. With the segment length selected wisely, PAA produces results identical to a Haar transform with less computation; but it can be generalized to accommodate a greater range of query lengths.
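PAA and its lower-bounding distance fit in a few lines (Python/NumPy sketch; assumes the series length divides evenly into the segments):

```python
import numpy as np

def paa(x, n_segments):
    """Piecewise Aggregate Approximation: the mean of each of n_segments
    equal-length segments (sample-and-hold / downsampling by averaging)."""
    x = np.asarray(x, dtype=float)
    return x.reshape(n_segments, -1).mean(axis=1)

def paa_distance(x, y, n_segments):
    """Segment-weighted Euclidean distance between PAA representations;
    lower-bounds the true Euclidean distance, so no false dismissals."""
    w = len(x) // n_segments
    return float(np.sqrt(w)
                 * np.linalg.norm(paa(x, n_segments) - paa(y, n_segments)))
```

The \(\sqrt{w}\) weighting is what makes the segment-space distance a valid lower bound: each segment mean stands in for \(w\) original samples.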

P. Indyk, N. Koudas, and S. Muthukrishnan, “Identifying representative trends in massive time series data sets using sketches,” in Proc. Int. Conf. Very Large Data Bases, (Cairo, Egypt), pp. 363-372, Sep. 2000.

The authors propose mining large discrete time series for *representative trends*, which are defined to be observation intervals that behave suitably differently from other intervals. Two trends that they focus on are the “relaxed period” and the “average trend,” for which they develop fast algorithms in this paper. (The relaxed period is defined by the contents of the smallest disjoint partitioning of the dataset which, when repeated to the length of the original data, results in the smallest distance from the original data. Similarly, the average trend of length \(l\) is the subsequence of that length that has the smallest total distance to all other disjoint subsequences of the same dataset.)

As calculating these exactly is burdensome, they develop approximate methods by making “sketches” of the data, which amounts to “sensing” the data with random vectors, as done in compressed sensing. Unlike compressed sensing, however, they do not wish for perfect reconstruction, as they keep the original time series, but only for a guarantee that the norm of the difference of two sketches will be nearly the same as that of the two original higher-dimensional vectors, which is guaranteed by the Johnson-Lindenstrauss lemma. They avoid the further cost of computing this for every possible subsequence length by finding an approximate relationship between sketches of subvectors related to the smaller or larger subvector of interest.
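The sketching step can be illustrated as follows (Python/NumPy, my own sketch): project onto \(k\) random Gaussian directions; for \(k\) large enough, the distance between sketches concentrates around the true distance, as the Johnson-Lindenstrauss lemma guarantees.

```python
import numpy as np

def make_sketcher(dim, k, seed=0):
    """Return a function mapping a dim-vector to a k-dimensional sketch.

    Scaling the Gaussian projection by 1/sqrt(k) makes squared sketch
    distances unbiased estimates of the original squared distances."""
    R = np.random.default_rng(seed).standard_normal((k, dim)) / np.sqrt(k)
    return lambda x: R @ x
```

The same fixed projection must be used for every series, so that sketch-space comparisons remain meaningful across the whole database.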

B.-K. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrary lp norms,” in VLDB ’00: Proceedings of the 26th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 385-394, Morgan Kaufmann Publishers Inc., 2000.

The authors present an indexing scheme for similarity search using a measure defined by any \(\ell_p\) norm. They also propose and test a new feature extraction method, which essentially models a time series by uniform sample-and-hold.

**2001**

E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani, “Locally adaptive dimensionality reduction for indexing large time series databases,” ACM Trans. Database Syst., vol. 27, no. 2, pp. 188-228, 2001.

Extending the work in Keogh et al. (2000), the authors propose a locally adaptive version of Piecewise Aggregate Approximation: Adaptive Piecewise Constant Approximation (APCA). Instead of using segments of constant length, the segment length is adapted to the local behavior of the data so as to reduce the approximation error. They propose an indexing procedure using a spatial indexing method (R*-tree) with minimum bounding rectangles in the time-amplitude domain.

**2002**

I. Popivanov and R. J. Miller, “Similarity search over time-series data using wavelets,” in Proc. IEEE Int. Conf. Data Eng., (San Jose, CA), pp. 212-221, Feb. 2002.

The authors show that it is possible for wavelets other than the Haar to perform better at similarity search. They also prove that any wavelet decomposition can guarantee no false dismissals with an appropriate scaling of the \(\epsilon\) parameter by the frame constant.

K. Kawagoe and T. Ueda, “A similarity search method of time series data with combination of fourier and wavelet transforms,” in Proc. Int. Symp. Temporal Representation, Reasoning, (Washington, DC), pp. 86-92, 2002.

The authors propose a combination of the DFT and the Haar DWT for similarity search. Initial matches to a query, found using the Euclidean distance between truncated DFT features, are filtered using the Euclidean distance between truncated Haar features, and then the time-domain Euclidean distances are computed on the remaining matches. They use the DFT features first because their experience has shown poor selectivity of the Haar features.

P. Patel, E. Keogh, J. Lin, and S. Lonardi, “Mining motifs in massive time series databases,” in ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining, (Washington, DC, USA), pp. 370-377, IEEE Computer Society, 2002.

After describing the field of similarity search in massive time series databases as a solved problem, the authors suggest the more interesting data mining problem of finding frequently occurring patterns in data, which they call “motifs.” They approximate each time series using the PAA of Keogh et al. (2000), and discretize the values in such a way that the Euclidean distance between the approximations lower-bounds the true Euclidean distance, thus satisfying the lower-bounding requirement for no false dismissals. They propose and test an algorithm to search for repeating patterns in a data set while being sensitive to trivial matches (e.g., the same sequence appearing a few samples later).

E. Keogh and S. Kasetty, “On the need for time series data mining benchmarks: a survey and empirical demonstration,” in KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, (New York, NY, USA), pp. 102-111, ACM, 2002.

The authors claim that we do not know the real utility of all the work done thus far on the problems of indexing, clustering, classification, and segmentation of time series. This is because the datasets used in the experiments are not representative of the diversity present in real-world data (they note that “89% of the test databases could comfortably fit on a 1.44 Mb floppy disk”), and many experiments suffer from data bias (“use of a particular set of testing data to confirm a desired finding”) and implementation bias (“disparity in the quality of implementation of a proposed approach vs. the quality of implementation of the competing approaches”). This is a rather hard-hitting paper, pointing out numerous contradictions, naming names, and asking how anyone can screw up the Euclidean distance. This is somewhat softened in the conclusion. (I wonder what the response was to this paper?) They perform an exhaustive set of experiments using algorithms and similarity measures selected from over 360 papers, testing with 50 real-world datasets from finance, medicine, chemistry, networking, etc. They find that several contradictory performance rankings are obtained just by changing the dataset.

**2003**

K. P. Chan, A. W. Fu, and C. Yu, “Haar wavelets for efficient similarity search of time-series: With and without time warping,” IEEE Trans. Knowledge Data Eng., vol. 15, pp. 686-705, May 2003.

The authors extend the work in Chan and Fu (1999) to include time warping as discussed in Yi et al. (1998), but using Haar wavelet decompositions.

M. Vlachos, J. Lin, E. Keogh, and D. Gunopulos, “A wavelet-based anytime algorithm for k-means clustering of time series,” in Proc. Workshop on Clustering High Dimensionality Data and Its Applications, (San Francisco, CA), pp. 23-30, May 2003.

The authors propose clustering time series with K-means applied to their Haar wavelet decompositions, starting at the lowest resolution. The resulting performance is experimentally found to be better than “hierarchical clustering” performed on the untransformed data, and at far lower computational cost. I think the applicability of this work is limited, as it is neither shift-invariant nor scale-invariant, and it does not allow data of variable lengths. There also remains the problem of selecting the number of clusters to use.

**2004**

A. Angeles-Yreta, H. Solís-Estrella, V. Landassuri-Moreno, and J. Figueroa-Nazuno, “Similarity search in seismological signals,” in Proc. Fifth Mexican Int. Conf. in Computer Science, 2004.

The authors attempt to apply the PAA of Keogh et al. (2000) to approximate seismological signals, and then query for similar seismological signals using a time-warping distance. Unfortunately, they do not appreciate that PAA here amounts to downsampling by a large factor (e.g., reducing 5000 samples to 5). The resulting “similar” signals are dissimilar.

T. Kahveci and A. K. Singh, “Optimizing similarity search for arbitrary length time series queries,” IEEE Trans. Knowledge Data Eng., vol. 16, pp. 418-433, Apr. 2004.

The authors address the problem of variable-length queries in similarity search. They propose two new indexing methods using minimum bounding rectangles, built by splitting the database entries into non-overlapping segments of increasing size (powers of 2, starting with the minimum possible query size). The DFT of each of these segments is taken, and a few coefficients are kept. This creates a multi-resolution index structure that facilitates quick searching for long queries. Experiments show a large difference in computational complexity between this method and sequential scanning, although it requires a large increase in memory. To alleviate this requirement, the authors explore compression of the index structure, and show that even with 5-to-1 compression their indexing technique works extremely well.

J. Lin and E. Keogh, “Finding or Not Finding Rules in Time Series,” in Advances in Econometrics, vol. 19: Applications of Artificial Intelligence in Finance and Economics, 2004.

See Po’D.

**2008**

W. Wu and J. Hu, “Similarity search based on random projection for high frequency time series,” in Proc. IEEE Conf. Cybernetics and Intelligent Syst., (Chengdu, China), pp. 388-393, Sep. 2008.

The authors discuss the use of random projections (essentially Indyk’s sketching) for similarity search in time series with high-frequency components.