
 

Latin American applied research

versión On-line ISSN 1851-8796

Lat. Am. appl. res. vol.44 no.3 Bahía Blanca jul. 2014

 

Improving time series classification accuracy: combining global and local information in the similarity criterion

X. He, C. Shao†‡ and Y. Xiong

School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China.
Anhui Province Key Laboratory of Software in Computing and Communication, 230027, Hefei, China. E-mail: xiaoxuhe@mail.ustc.edu.cn

Abstract: Time series classification is widely used in many domains, so improving its accuracy has attracted considerable attention. This paper proposes a new similarity measure, SIMscl, that combines global and local information to improve the accuracy of the one-nearest-neighbor (1NN) classifier. The global information captures intrinsic properties of a time series through two indicators, its shape information and its complexity; the local information captures the exact match of values and is computed with LB_keogh. To extract the shape information, we also propose a method based on the multi-scale discrete Haar wavelet transform, key point extraction, and symbolization. The proposed shape similarity SIMshape and hybrid similarity SIMscl are evaluated on two data sets, star light curve and beef. The experiments show that SIMshape correctly classifies some time series that are misclassified by Euclidean Distance (ED), LB_keogh, and Complexity Invariant Distance (CID), and that SIMscl achieves higher accuracy than ED, LB_keogh, and CID in 1NN time series classification.

Keywords: Time Series; Classification; Complexity; Shape Information; Similarity Measure.

I. INTRODUCTION

As a data mining task in its own right, time series classification is pervasive in many domains, including medicine, finance, science, and entertainment (Esling and Agon, 2012; Tak-chung, 2011; Ding et al., 2008). It is also often used as a subroutine in higher-level data mining applications such as summarization, outlier discovery, and rule finding (Esling and Agon, 2012; Tak-chung, 2011; Povinelli et al., 2004). Time series classification is therefore arguably one of the most fundamental data mining applications, and improving its performance has received considerable attention. Classification accuracy is affected by two factors, the size of the training set and the similarity measure (Ding et al., 2008; Lines et al., 2012; Geurts, 2001). This work is concerned only with the similarity measure; the size of the training set is assumed to be fixed.

Because they are parameter-free and easy to compute, ED and its variants are the most straightforward similarity measures for time series classification (Ding et al., 2008). However, ED has been shown to be extremely brittle because it is sensitive to distortions along the time axis (Esling and Agon, 2012; Ding et al., 2008).

To handle time warping, Dynamic Time Warping (DTW) was introduced to time series classification; it allows a time series to be "stretched" or "compressed" to provide a better match with another time series (Berndt and Clifford, 1994). The main problem with DTW is its high computational complexity, O(n^2) (Keogh and Ratanamahatana, 2005). To overcome this limitation, several lower bounding techniques have been introduced to speed up the computation of DTW (Keogh and Ratanamahatana, 2005; Keogh et al., 2006; Jeong et al., 2011). The best known is LB_keogh (Keogh and Ratanamahatana, 2005; Keogh et al., 2006), which has complexity O(n). It has also been shown that LB_keogh improves the accuracy of DTW by discarding pathological matchings (Ratanamahatana and Keogh, 2004). The weakness of LB_keogh is that it is not applicable to complex time series with different numbers of peaks and valleys: aligning only some of them does not fully solve the problem of local distortions.
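To make the contrast between ED and DTW concrete, the following is a minimal sketch of the textbook dynamic-programming formulation of DTW, which fills an (n+1) x (m+1) cost table and is therefore quadratic in the series length. The function names `euclidean` and `dtw_distance` and the squared point cost are illustrative choices, not taken from the paper.

```python
import math


def euclidean(x, y):
    """Point-to-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(x, y):
    """Unconstrained DTW with squared point costs (quadratic time and space)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # stretch y
                                 cost[i][j - 1],      # stretch x
                                 cost[i - 1][j - 1])  # one-to-one step
    return math.sqrt(cost[n][m])


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # the same bump, shifted one step earlier
print(euclidean(a, b), dtw_distance(a, b))
```

For this pair the one-step shift gives a Euclidean distance of 2, while DTW aligns the two bumps and reports 0, which is the time-axis brittleness of ED described above.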

CID was recently introduced (Batista et al., 2011). It is a simple, parameter-free method that mitigates the problem that pairs of complex objects which subjectively seem very similar tend to be further apart under current distance measures than pairs of simple objects. However, CID may lose effectiveness under global scaling.
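As a rough illustration of CID as defined by Batista et al., the sketch below multiplies the Euclidean distance by the ratio of the larger to the smaller complexity estimate, where complexity is the root of the summed squared successive differences. The handling of constant series (zero complexity) is an assumption made here so that the function is total; it is not specified in this paper.

```python
import math


def euclidean(x, y):
    """Point-to-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def complexity_estimate(series):
    """Root of the summed squared differences between successive points."""
    return math.sqrt(sum((series[i + 1] - series[i]) ** 2
                         for i in range(len(series) - 1)))


def cid(x, y):
    """ED scaled by max(CE) / min(CE), so the correction factor is >= 1."""
    ed = euclidean(x, y)
    ce_x, ce_y = complexity_estimate(x), complexity_estimate(y)
    low, high = min(ce_x, ce_y), max(ce_x, ce_y)
    if high == 0.0:
        return ed            # assumption: two constant series, no correction
    if low == 0.0:
        return float("inf")  # assumption: constant vs. non-constant series
    return ed * high / low


x = [0, 1, 0, 1, 0]
y = [0, 2, 0, 2, 0]  # same pattern with doubled amplitude
print(euclidean(x, y), cid(x, y))
```

Here the complexity estimates are 2 and 4, so CID doubles the Euclidean distance. Amplitude scaling alone changes the complexity estimate, which is consistent with the remark that CID may lose effectiveness under global scaling.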

It is easy to see that these similarity methods share a common shortcoming: each tolerates only one or two kinds of distortion. By contrast, similar sequences in real time series classification exhibit many kinds of distortion. It is therefore unsurprising that the accuracy of time series classification based on these methods remains unsatisfactory. Increasing the accuracy requires a similarity method that tolerates more kinds of deformation (Esling and Agon, 2012; Jeong et al., 2011).

In this paper, we propose a novel similarity measure (SIMscl) that combines global and local information. Our method is robust to multiple distortions. The major contributions of this paper are the following:

  • We introduce a feature extraction model for time series that captures shape information at different scales. The shape information is extracted through the multi-scale discrete Haar wavelet transform, key point extraction, and symbolization. Symbolization plays the major role in dimensionality reduction, while the multi-scale discrete Haar wavelet transform and key point extraction retain the essential characteristics of the original time series.
  • Based on the shape information at different scales, we first design SIMshape. To strengthen SIMshape, we then bring two calibration factors into the similarity criterion: the complexity and the local information. The complexity reduces the influence of excessive distortions; the local information attends to the exact match of corresponding values, which supplies the local comparison that SIMshape overlooks. Finally, SIMscl, the enhanced form of SIMshape, is used to find homomorphic sequences in 1NN time series classification.

The rest of the paper is organized as follows. The proposed method is presented in Section II. Experimental results and the corresponding analyses are given in Section III. Finally, conclusions are drawn in Section IV.

II. METHODS

The proposed scheme has two main parts: the shape information extraction method and the new similarity measure SIMscl. Each is described in detail in the following subsections.

A. The shape information extraction method

Our goal is to obtain the shape information W of a time series Yt over its decomposition levels, where k is the number of decomposition levels. The detailed procedure is given in Algorithm 1.

First, we check whether the length of the input sequence Yt is an integer power of 2, as required by the Discrete Wavelet Transform (DWT) (Chaovalit et al., 2011). If it is not, the series is extended to the next integer power of 2 by padding zeros at the end.
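The padding step can be sketched as follows. The helper name `pad_to_power_of_two` is hypothetical, and rejecting empty input is an assumption made here.

```python
def pad_to_power_of_two(series):
    """Append zeros until the length is the next integer power of 2."""
    n = len(series)
    if n == 0:
        raise ValueError("series must not be empty")
    target = 1
    while target < n:
        target *= 2
    return list(series) + [0.0] * (target - n)


print(len(pad_to_power_of_two([3.0, 1.0, 4.0, 1.0, 5.0])))  # 5 points become 8
```

A series whose length is already a power of 2 is returned unchanged.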

Then, we apply the discrete Haar wavelet transform to Yt recursively. At the first decomposition level, Yt is decomposed into a low-frequency part (the approximation coefficients cA1) and a high-frequency part (the detail coefficients cD1). At the second level, cA1 is further divided into an approximation part cA2 and a detail part cD2. This process is repeated until the new approximation has length 1. At this point we have the approximation coefficients at every level, and the number of decomposition levels k equals log2(n), where n is the (padded) length of Yt.
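The recursive decomposition can be sketched with the orthonormal Haar filters, in which each pair of neighbours is replaced by its scaled sum (approximation) and scaled difference (detail). The 1/sqrt(2) normalisation and the sign of the detail coefficients are conventions assumed here; the paper does not state which convention it uses, and the function names are illustrative.

```python
import math


def haar_step(x):
    """One Haar level: scaled pairwise sums (cA) and differences (cD)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_decompose(series):
    """Return ([cA1, ..., cAk], [cD1, ..., cDk]) for a power-of-2 length."""
    n = len(series)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of 2 and at least 2")
    approximations, details = [], []
    current = list(series)
    while len(current) > 1:
        current, detail = haar_step(current)
        approximations.append(current)
        details.append(detail)
    return approximations, details


cA, cD = haar_decompose([2.0, 4.0, 6.0, 8.0, 8.0, 6.0, 4.0, 2.0])
print([len(a) for a in cA])  # n = 8 gives k = log2(8) = 3 levels
```

Each level halves the length, so a series of length n yields exactly log2(n) approximation sequences, the last of which has a single coefficient.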

Next, we extract the key points of each approximation coefficient sequence cAi (i = 1, ..., k-1). The key points are the first point, the last point, and the local extreme points.
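A minimal sketch of key point extraction is given below. It treats a point as a local extreme when the differences on its two sides have strictly opposite signs; how flat segments (plateaus) should be handled is not stated in the paper, so ignoring them is an assumption of this sketch.

```python
def key_points(values):
    """Keep the first point, the last point, and strict local extrema."""
    n = len(values)
    if n <= 2:
        return list(values)
    kept = [values[0]]
    for i in range(1, n - 1):
        left = values[i] - values[i - 1]
        right = values[i + 1] - values[i]
        if left * right < 0:  # slope changes sign: local maximum or minimum
            kept.append(values[i])
    kept.append(values[-1])
    return kept


print(key_points([0, 1, 2, 1, 0, 3]))  # keeps 0, the peak 2, the valley 0, and 3
```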

The next step converts each key point sequence into a sequence of symbols. This uses an intermediate representation: the relative change between adjacent key points, that is, the difference between each key point and the one before it. Once we have the intermediate representation, we can symbolize it. Because the key point sequence consists of local extreme points, the change between adjacent key points is either an increase or a decrease. An increase is mapped to 1 and a decrease to -1.

The final step performs the dimensionality reduction. Because the key points are local extrema, the symbols 1 and -1 alternate in every symbol sequence, so a subsequence such as (1, 1) or (-1, -1) never occurs. Each symbol sequence is therefore fully determined by its length and its first symbol, and we store only their product. For example, the symbol sequence (1, -1, 1, -1, 1, -1) is simplified to 6, the product of its length 6 and its first symbol 1. We simplify the symbol sequence at each scale from 1 to k-1 and collect the products as the shape information W.
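The symbolization and compression steps can be sketched as two small functions. The names `symbolize` and `compress` are illustrative; skipping equal adjacent key points and returning 0 for an empty symbol sequence are assumptions of this sketch, since the paper does not discuss those cases.

```python
def symbolize(points):
    """Map each change between adjacent key points to 1 (up) or -1 (down)."""
    symbols = []
    for previous, current in zip(points, points[1:]):
        if current > previous:
            symbols.append(1)
        elif current < previous:
            symbols.append(-1)
        # equal neighbours carry no direction and are skipped (assumption)
    return symbols


def compress(symbols):
    """Store an alternating sequence as (length) * (first symbol)."""
    if not symbols:
        return 0  # assumption: no change at this scale
    return len(symbols) * symbols[0]


print(symbolize([0, 2, 0, 3]))            # up, down, up
print(compress([1, -1, 1, -1, 1, -1]))    # length 6, first symbol 1
```

The sign of the stored value records whether the sequence starts by rising or falling, and its magnitude records how many monotone segments follow, which is all that an alternating sequence contains.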

At this point the shape information W has been extracted. W reduces the dimension from n to log2(n) - 1, which improves efficiency, especially for long sequences. W also retains the shape information of the time series at different scales, which limits the information loss. Because each approximation cAi (i = 1, ..., k-1) carries the meaningful signal of Yt, and its key point sequence retains the main shape of that approximation, the essential information of Yt is preserved in W.

B. The new similarity measure

Given the shape information W, the next problem is to design a robust similarity measure. In time series classification, two similar time series exhibit many kinds of deformation, such as amplitude shifting, amplitude scaling, time scaling, time warping, linear drift, and noise. In other words, the two series are similar in their basic shape but differ in local values because of the distortions. On this basis, we first introduce the shape similarity in Eq. 1:

(1)

where Xt and Yt are the two time series being compared, which need not have the same length; k is the number of decomposition levels; and WXt and WYt are the shape information of Xt and Yt, respectively.

As shown in Eq. 1, we define an array xcom of length (k-1) and a weight function. The array xcom indicates whether each element of WXt also appears in WYt: if the i-th element is common to WXt and WYt, we set xcom(i)=1; otherwise, we set xcom(i)=0. Because the global information contributes more to the basic shape than the detailed information does, we assign a higher weight to the global information and a lower weight to the detailed information.
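The idea of a weighted match over xcom can be illustrated as follows. This is only a sketch under stated assumptions, not the paper's Eq. 1: the weight function is not reproduced in this text, so the linear weights below (proportional to the decomposition level, so that coarser, more global levels count more) are hypothetical, as is the requirement that both shape vectors have the same number of levels.

```python
def sim_shape(w_x, w_y):
    """Weighted fraction of scales at which two shape vectors agree.

    w_x and w_y list the compressed shape value per level, finest first.
    """
    if not w_x or len(w_x) != len(w_y):
        raise ValueError("shape vectors must be non-empty and equally long")
    levels = len(w_x)
    # xcom[i] = 1 when both series have the same shape value at level i + 1
    xcom = [1 if a == b else 0 for a, b in zip(w_x, w_y)]
    # hypothetical weights: level i + 1 gets weight proportional to i + 1
    total = levels * (levels + 1) / 2
    weights = [(i + 1) / total for i in range(levels)]
    return sum(w * c for w, c in zip(weights, xcom))


print(sim_shape([6, -3, 2], [4, -3, 2]))  # disagreement only at the finest level
```

With these weights a mismatch at the finest level costs 1/6 of the score while a mismatch at the coarsest level costs 1/2, which mirrors the stated preference for global over detailed information.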

Note that the deformations considered in this paper must not alter the basic features of the original time series; that is, they do not change the global shape of a sequence. To guard against excessive deformations, a complexity factor is introduced into the similarity criterion. The complexity factor reflects the difference in complexity between two time series and serves as one calibration factor for SIMshape. For fast computation and a natural interpretation, the complexity of a time series Yt is estimated by Eq. 2:

(2)

where the estimate is based on the rate of change of the time series, which makes it a reasonable measure of complexity (Gonzalez Andino et al., 2000; Keogh et al., 2007).

However, shape information and complexity both measure the difference between time series from a global view. An accurate similarity measure should also consider differences in local information, which is the other calibration factor for SIMshape. Since LB_keogh performs well at reflecting differences in detail (Ding et al., 2008), we use it for local information matching. Its implementation is described in Keogh and Ratanamahatana (2005) and Keogh et al. (2006).
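For readers unfamiliar with LB_keogh, the sketch below follows its usual definition: an upper and lower envelope is built around one series within a warping window r, and only the parts of the other series that fall outside the envelope contribute to the bound. The window size r is a parameter whose value the paper does not state, and the naive per-point window scan is chosen for clarity rather than speed.

```python
import math


def lb_keogh(query, candidate, r):
    """LB_Keogh bound of `candidate` against the envelope of `query`."""
    if len(query) != len(candidate):
        raise ValueError("series must have the same length")
    n = len(query)
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if c > upper:
            total += (c - upper) ** 2  # candidate pokes above the envelope
        elif c < lower:
            total += (c - lower) ** 2  # candidate dips below the envelope
    return math.sqrt(total)


q = [0, 0, 1, 2, 1, 0, 0]
c = [0, 1, 2, 1, 0, 0, 0]  # same bump, shifted by one step
print(lb_keogh(q, c, 0), lb_keogh(q, c, 1))
```

With r = 0 the envelope collapses onto the query and the bound equals the Euclidean distance (2 for this pair); with r = 1 the shifted bump fits inside the envelope and the bound drops to 0.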

Based on the above discussion, the new similarity measure SIMscl is defined in Eq. 3:

(3)

SIMscl can be regarded as an enhanced version of SIMshape obtained by introducing the two calibration factors. Because it combines global and local information in the similarity criterion, SIMscl can be more robust to multiple distortions.

III. EXPERIMENTS

This section presents two groups of experiments with the 1NN classifier, shown in Algorithm 2, to evaluate the effectiveness of the proposed similarity measures. The first experiment shows that the shape similarity SIMshape can handle distortions under which ED, LB_keogh, and CID fail. The second shows that SIMscl is more accurate than ED, LB_keogh, and CID in time series classification.
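A generic 1NN classifier with a pluggable distance can be sketched as below. This is not a reproduction of Algorithm 2, which is not available in this text; the function names are illustrative, and a similarity score for which larger means closer would have to be negated before being passed as `distance`.

```python
import math


def euclidean(x, y):
    """Point-to-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify_1nn(train_series, train_labels, query, distance):
    """Return the label of the training series closest to `query`."""
    best_label, best_distance = None, float("inf")
    for series, label in zip(train_series, train_labels):
        d = distance(query, series)
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label


def accuracy(train_series, train_labels, test_series, test_labels, distance):
    """Fraction of test series whose 1NN label matches the true label."""
    correct = sum(
        classify_1nn(train_series, train_labels, s, distance) == y
        for s, y in zip(test_series, test_labels)
    )
    return correct / len(test_series)


train = [[0, 0, 0], [5, 5, 5]]
labels = ["a", "b"]
print(classify_1nn(train, labels, [1, 0, 1], euclidean))
```

Because the distance is a parameter, the same harness can compare several measures on identical training and testing splits.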

Our experiments were conducted on two data sets (Batista et al., 2011; Keogh et al., 2012): one artificial data set, star light curve, and one real data set, beef. The main features of these data sets are described in Table 1. Because many factors disturb real measurements, real time series exhibit all kinds of deformation, so it is worthwhile to design a similarity measure with a higher tolerance for them. Since the star light curve data set contains multiple distortions and the beef data set comes from a real application, they are well suited to verifying how similarity methods perform under multiple distortions.

Table 1: Characteristics of the experimental time series

A. The Performance of SIMshape

First, we use ED, LB_keogh, and CID in turn as the similarity measure to classify the star light curve and beef data sets. From the classification results, we identify the time series that each measure classifies wrongly. We then build an array IU that stores the time series misclassified by all three of ED, LB_keogh, and CID. Next, we use SIMshape to classify the series in IU. Table 2 lists IU, the positions of its elements in the testing set, and the time series correctly classified by SIMshape.
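The construction of IU amounts to intersecting the sets of test positions that each baseline gets wrong. A minimal sketch, with hypothetical function names and made-up toy predictions used only to show the mechanics:

```python
def misclassified(predictions, truth):
    """Positions in the testing set where the prediction is wrong."""
    return {i for i, (p, t) in enumerate(zip(predictions, truth)) if p != t}


def jointly_misclassified(predictions_by_measure, truth):
    """Positions misclassified by every measure (the array IU)."""
    error_sets = [misclassified(p, truth)
                  for p in predictions_by_measure.values()]
    if not error_sets:
        return []
    return sorted(set.intersection(*error_sets))


# Toy labels, not data from the paper.
truth = [1, 1, 2, 2]
predictions = {
    "ED": [1, 2, 1, 2],
    "LB_keogh": [2, 2, 1, 2],
    "CID": [1, 2, 1, 1],
}
print(jointly_misclassified(predictions, truth))
```

In this toy case only positions 1 and 2 are wrong under all three measures, so they alone would be passed on to SIMshape.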

Table 2: The performance of SIMshape in IU

Table 2 shows that, for the star light curve data set, SIMshape assigns the right labels to five time series from IU, and for the beef data set, SIMshape correctly classifies three time series from IU. These results imply that SIMshape can solve some cases that are difficult for ED, LB_keogh, and CID.

These results are reasonable. Two homomorphic sequences exhibit many kinds of deformation, such as amplitude shifting, amplitude scaling, time scaling, time warping, linear drift, and noise, which greatly increase the complexity of the time series. However, small deformations do not alter the basic features of the original time series; that is, they do not change its general shape. The two time series are therefore similar in overall trend but different in local values. CID is sensitive to complexity; LB_keogh fails under amplitude distortion; and ED is computed point to point, so any distortion causes a mismatch. The shape similarity SIMshape, by contrast, is based on basic shape information, so it remains valid as long as the distortion does not alter the true nature of the time series.

B. The Performance of SIMscl in Classification

In this group of experiments, we again use ED, LB_keogh, and CID as baselines and compare the classification accuracy of SIMscl with theirs on the two data sets. The classification results are shown in Table 3.

Table 3: The classification accuracy of ED, LB_keogh, CID and SIMscl

Table 3 shows that SIMscl has higher classification accuracy than the other three similarity methods on both data sets. This result is expected. Because SIMscl is based on shape information, complexity, and local information, it can handle the cases solved by ED, LB_keogh, CID, and SIMshape. Moreover, SIMscl can be considered an enhanced version of SIMshape that adds two calibration factors, complexity and local information. The first group of experiments indicates that SIMshape correctly classifies some time series misclassified by ED, LB_keogh, and CID, so SIMscl can be more accurate than those three methods. More fundamentally, unlike most traditional similarity methods, which use only local information, SIMscl also takes into account intrinsic properties of the time series, here the shape information and the complexity. Overall, the higher accuracy of SIMscl can be attributed to the combination of global and local information in the similarity criterion.

IV. CONCLUSIONS

In this paper, to improve classification accuracy, we have presented a similarity measure, SIMscl, that tolerates more kinds of distortion. To combine global and local information in the similarity criterion, SIMscl is based on the shape information, the complexity, and the local information. SIMscl can therefore also be considered an enhancement of SIMshape that introduces two correction factors, complexity and local information. The proposed approaches have been compared with three existing methods: ED, LB_keogh, and CID. The experiments confirm the following properties of the proposed work:

1) The proposed similarity measure SIMshape can correctly classify some time series misclassified by ED, LB_keogh, and CID.

2) SIMscl achieves higher classification accuracy than ED, LB_keogh, and CID.

The experimental results also indicate that some intrinsic properties of time series, such as the complexity and the shape information, affect the performance of a similarity measure. Combining global and local information in the similarity criterion can therefore improve the accuracy of similarity methods, which is of great value in time series classification. In the future, we plan to identify other essential features of time series and incorporate them into current similarity methods to further enhance classification performance.

ACKNOWLEDGEMENTS
This work is supported by the Natural Science Foundation of China (NSFC) under Grant No. 61174144, No.61232018 and Grant No. 60874065.

REFERENCES
1. Batista, G.E.A.P.A., X. Wang and E.J. Keogh, "A Complexity-Invariant Distance Measure for Time Series," 11th SIAM International Conference on Data Mining, 699-710 (2011).
2. Berndt, D. and J. Clifford, "Using Dynamic Time Warping to Find Patterns in Time Series," Workshop on Knowledge Discovery in Databases, 359-370 (1994).
3. Chaovalit, P., A. Gangopadhyay, G. Karabatis and Z. Chen, "Discrete wavelet transform-based time series analysis and mining," ACM Comput. Surv., 43, 1-37 (2011).
4. Ding, H., G. Trajcevski, P. Scheuermann, X. Wang and E. Keogh, "Querying and mining of time series data: experimental comparison of representations and distance measures," Proc. VLDB Endow., 1, 1542-1552 (2008).
5. Esling, P. and C. Agon, "Time-series data mining," ACM Comput. Surv., 45, 1-34 (2012).
6. Geurts, P., "Pattern Extraction for Time Series Classification," Principles of Data Mining and Knowledge Discovery, Eds. L. De Raedt and A. Siebes, Springer Berlin / Heidelberg, 2168, 115-127 (2001).
7. Gonzalez Andino, S.L., R. Grave de Peralta Menendez, G. Thut, L. Spinelli, O. Blanke, C.M. Michel, M. Seeck and T. Landis, "Measuring the complexity of time series: An application to neurophysiological signals," Human Brain Mapping, 11, 46-57 (2000).
8. Jeong, Y.S., M.K. Jeong and O.A. Omitaomu, "Weighted dynamic time warping for time series classification," Pattern Recognition, 44, 2231-2240 (2011).
9. Keogh, E. and C.A. Ratanamahatana, "Exact indexing of dynamic time warping," Knowledge and Information Systems, 7, 358-386 (2005).
10. Keogh, E., L. Wei, X. Xi, S.-H. Lee and M. Vlachos, "LB_Keogh supports exact indexing of shapes under rotation invariance with arbitrary representations and distance measures," Proceedings of the 32nd international conference on Very large data bases, Seoul, Korea, 882-893 (2006).
11. Keogh, E., S. Lonardi, C.A. Ratanamahatana, L. Wei, S.-H. Lee and J. Handley, "Compression-based data mining of sequential data," Data Mining and Knowledge Discovery, 14, 99-129 (2007).
12. Keogh, E., X. Xi, L. Wei and C. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage (2012).
13. Lines, J., L.M. Davis, J. Hills and A. Bagnall, "A shapelet transform for time series classification," Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, Beijing, China, 289-297 (2012).
14. Povinelli, R.J., M.T. Johnson, A.C. Lindgren and Y. Jinjin, "Time series classification using Gaussian mixture models of reconstructed phase spaces," IEEE Transactions on Knowledge and Data Engineering, 16, 779-783 (2004).
15. Ratanamahatana, C.A. and E. Keogh, "Making time-series classification more accurate using learned constraints," SIAM Proceedings Series, 11-12 (2004).
16. Tak-chung, F. "A review on time series data mining," Engineering Applications of Artificial Intelligence, 24, 164-181 (2011).

Received: September 5, 2012
Accepted: December 1, 2013
Recommended by Subject Editor: José Guivant