Data Science Lab

Sequential pattern mining refers to identifying frequent subsequences in sequence database as patterns. It provides an effective way to analyze the sequential data. Sequential pattern mining has proven to be very essential for handling order-based critical business problems, such as behavior analysis, gene analysis in bioinformatics and weblog mining. For example, sequential pattern mining is widely employed in DNA and protein to discover interesting structures and functions of molecular or DNA sequences. The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. In the last two decades, researchers have proposed many techniques and algorithms to extract the frequent sequential patterns, which the downward closure property (also known as Apriori property) plays a fundamental role. On the other hand, the relative importance of each item is introduced in frequent pattern mining, and the “high utility itemset mining” is proposed. Instead of selecting high frequency patterns, the utility-based methods extract itemsets with high utilities, and many algorithms and strategies are proposed. These methods can only process the itemsets in the utility framework.

However, all the above methods suffer from the following common issues and problems to different extents: 1) Sometimes, too many patterns are identified by frequent sequential pattern mining algorithms. Most of them may not be informative to business decision-making, since they do not show the business value and impact. In some cases, such as fraud detection, some truly interesting sequences may be filtered because of their low frequencies. 2) Even if there is an algorithm considers the business impacts (namely utility), it can only obtain high utility sequences based on a given minimum utility threshold, it is very difficult for users to specify an appropriate minimum utility and to directly obtain the most valuable patterns. 3) The algorithm in the utility framework may generate a large number of patterns, many of them are redundant. Real valuable patterns which the users are interested might be flooded in hundreds of thousands of similar patterns. Another critical issue is that the existing methods cause dramatic running time and memory when sequences are very long or the threshold is low, thus resulting in a huge number of patterns extracted.

Although high utility sequential pattern mining is very essential, to discover the patterns is challenging. Because: 1) The downward closure property does not hold in utility-based sequence mining. This means that most of the existing algorithms cannot be directly transferred, e.g. from frequent sequential pattern mining to high utility sequential pattern mining. Further more, compared to high utility itemset mining, utility-based sequence analysis faces the critical combinational explosion and computational complexity caused by sequencing between sequential elements (itemsets). 2) Since the minimum utility is not given in advance, the algorithm essentially starts the searching from 0 minimum support. This not only incurs very high computational costs, but also the challenge of how to raise the minimum threshold without missing any top-k high utility sequences. 3) Due to the fundamental difference, incorporating the traditional closure concept into high utility sequential pattern mining makes the outcome patterns irreversible lossy and no longer recoverable, which will be reasoned in the following chapters. Therefore, it is exceedingly challenging to address the above issues by designing a novel representation for high utility sequential patterns. In order to address these research limitations and challenges, this thesis proposes a high utility sequential pattern mining framework, and proposed both a threshold-based and top-k-based mining algorithms. Furthermore, a compact and lossless representation of utility-based sequence is presented, and an efficient algorithm is provided to mine such kind of pattern. The chapter 2 thoroughly reviews the related works in the frequent sequential pattern mining and high utility itemset/sequence mining. The chapter 3 incorporates utility into sequential pattern mining, and a generic framework for high utility sequence mining is defined. Two efficient algorithm, namely USpan and USpan+, are presented to mine for high utility sequential patterns. In USpan and USpan+, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences and design concatenation mechanisms for calculating the utility of a node and its children with three effective pruning strategies. The chapter 4 proposes a novel framework called top-k high utility sequential pattern mining to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning for filtering unpromising items. The chapter 5 proposes a novel concise framework to discover US-closed (Utility Sequence closed) high utility sequential patterns, with theoretical prove that it is lossless representation of high-utility patterns. An efficient algorithm named CloUSpan is introduced to extract the US-closed patterns. Two effective strategies are used to enhance the performance of CloUSpan. All of the algorithms are examined in both synthetic and real datasets. The performances, including the running time and memory consumptions, are compared. Furthermore, the utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential patterns provide insightful knowledge for the users.