Publication:
Optimal distance bounds on time-series data

dc.contributor.coauthorVlachos, Michail
dc.contributor.coauthorYu, Philip S.
dc.contributor.departmentDepartment of Electrical and Electronics Engineering
dc.contributor.kuauthorKozat, Süleyman Serdar
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.date.accessioned2024-11-09T23:58:58Z
dc.date.issued2009
dc.description.abstractMost data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitating mining operations directly on the compressed data. Naturally, the quality or tightness of the estimated distances on the compressed objects directly affects the search performance. We motivate our work within the setting of search engine weblog repositories, where keyword demand trends over time are represented and stored as compressed time-series data. Search and analysis over such sequence data has important applications for the search engines, including discovery of important news events, keyword recommendation and efficient keyword-to-advertisement mapping. We present new mechanisms for very fast search operations over the compressed time-series data, with specific focus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimation between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Additionally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal based compression technique (Fourier, Wavelet, PCA, etc). The experimental results indicate that the new optimal bounds lead to a significant improvement in the pruning power of search compared to previous state-of-the-art, in many cases eliminating more than 80% of the candidate search sequences.
dc.description.indexedbyScopus
dc.description.openaccessYES
dc.description.publisherscopeInternational
dc.description.sponsoredbyTubitakEuN/A
dc.description.volume1
dc.identifier.isbn9781-6156-7109-0
dc.identifier.linkhttps://www.scopus.com/inward/record.uri?eid=2-s2.0-72849126859andpartnerID=40andmd5=4ebdbc0ca716bf118d930ce67ca49c88
dc.identifier.quartileN/A
dc.identifier.scopus2-s2.0-72849126859
dc.identifier.urihttps://hdl.handle.net/20.500.14288/15560
dc.keywordsCompression techniques
dc.keywordsData repositories
dc.keywordsDistance bound
dc.keywordsDistance estimation
dc.keywordsEuclidean distance
dc.keywordsFast search
dc.keywordsFourier
dc.keywordsMining operations
dc.keywordsNearest neighbors
dc.keywordsNew mechanisms
dc.keywordsON time
dc.keywordsOptimal bounds
dc.keywordsOrthonormal
dc.keywordsSearch performance
dc.keywordsSequence data
dc.keywordsSequential data
dc.keywordsSimilarity search
dc.keywordsTight bound
dc.keywordsTime-series data
dc.keywordsEstimation
dc.keywordsInformation retrieval
dc.keywordsMining
dc.keywordsSearch engines
dc.keywordsWorld Wide Web
dc.keywordsData handling
dc.language.isoeng
dc.publisherSociety for Industrial and Applied Mathematics
dc.relation.ispartofSociety for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics
dc.subjectEngineering
dc.subjectElectrical and electronics engineering
dc.titleOptimal distance bounds on time-series data
dc.typeConference Proceeding
dspace.entity.typePublication
local.contributor.kuauthorKozat, Süleyman Serdar
local.publication.orgunit1College of Engineering
local.publication.orgunit2Department of Electrical and Electronics Engineering
relation.isOrgUnitOfPublication21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isOrgUnitOfPublication.latestForDiscovery21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isParentOrgUnitOfPublication8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files