SVM-based methods including one-class SVM, binary SVM and multi-class classification SVM have shown their great potential compared with many classification methods. This thesis aims to develop a series of SVM-based algorithms to cope with the challenges in SVM-based multi-class classification, outlier detection and data streams. These challenges are briefly introduced as follows.
In SVM-based multi-class classification, traditional SVM-based methods mainly adopt the strategy of mapping the dataset with all classes into a single feature space via a kernel function, in which SVM is constructed for each decomposed binary classification problem. However, it is not always possible – and is sometimes very costly – to find an appropriate kernel function to render all the classes distinguishable from one another in a single feature space. This is because each class is always derived from individual uniform distribution and the selection of kernel function is related to the distribution of individual classes. This always results that the classification accuracy is not being as good as expected. How to improve the performance of multi-class classification is a challenge for SVM-based multi-class classification algorithms.
In outlier detection, most of the existing works on outlier detection have not explicitly dealt with the uncertainty of the input data. An underlying assumption is that the training dataset is perfectly labeled for building outlier detection models or classifiers. However, in many real-world applications, the data may be corrupted with noises or may only be partially complete. Moreover, another important observation is that, negative examples or outliers, although very few, do exist in many applications. For example, in the network intrusion domain, in addition to extensive data about the normal traffic conditions in the network, there also exist a small number of cyber attacks that can be collected to facilitate outlier detection. Therefore, how to cope with data uncertainty and incorporate a small number of outliers into the learning phase to improve the performance of outlier detection is very important.
In one-class data streams, one of challenges is to learn one-class classifiers on uncertain data streams. This is because the presence of uncertain data always makes one-class learning far more difficult than traditional data stream learning methods. Therefore, how to cope with data uncertainty in one-class data streams is a key challenge in one-class uncertain data stream learning.
To cope with the above challenges, this thesis aims to (1) design a more efficient and accurate SVM-based multi-class classification algorithm; (2) design a novel and robust support vector data description (SVDD) approach for outlier detection with uncertain data; (3) design a novel approach to one-class-based uncertain data stream learning.
Firstly, design a more efficient and accurate SVM-based multi-class classification algorithm. In order to improve the performance of SVM-based multi-class classification, we will propose an innovative approach – called multi-space-mapping (MSM) with SVM, which maps the data set with all classes into different feature spaces and then constructs SVM for each binary classification in terms of a binary tree architecture. The theoretical analysis indicates that the SVM-based MSM more easily determines suitable kernel function in comparison to traditional SVM-based multi-class classification methods. The computational complexity of MSM at its worst lies between that of the one-against-all scheme and one-against-one scheme. Substantial experiments have been conducted on one-against-all, one-against-one, fuzzy SVM (FSVM), decision directed acyclic graph (DDAG) algorithms and MSM using fifteen UCI data sets. The statistical results show that MSM outperforms them in terms of classification accuracy and standard deviation. Regarding the time taken to determine suitable kernel function, the one-against-all scheme always takes the longest time, while our MSM method requires the least amount of time for some data sets.
Secondly, design a novel and robust support vector data description (SVDD) approach for outlier detection with uncertain data. We will present a novel hybrid approach to outlier detection by incorporating local data uncertainty into the construction of a global classifier. To deal with local data uncertainty, we introduce a confidence value to each data example in the training data, which measures the strength of the corresponding class label. Our proposed method works in two steps. In the first step, we generate a pseudo training dataset by computing a confidence value of each input example on its class label. We present two different mechanisms: kernel k-means clustering algorithm and kernel LOF-based algorithm, to compute the confidence values based on the local data behavior. In the second step, we construct a global classifier for outlier detection by generalizing the SVDD-based learning framework to incorporate both positive and negative examples as well as their associated confidence values. By integrating local and global outlier detection, our proposed method explicitly handles the uncertainty of the input data and enhances the ability of SVDD in reducing the sensitivity to noise. Extensive experiments on real life datasets demonstrate that our proposed method can achieve a better tradeoff between detection rate and false alarm rate as compared to four state-of-the-art outlier detection algorithms.
Finally, design a novel approach to one-class-based uncertain data stream learning. Our proposed approach works in three steps. Firstly, we put forward a local kernel density-based method to generate a bound score for each instance, which refines the location of the corresponding instance. Secondly, we construct an uncertain one-class classifier by incorporating the generated bound score into a one-class SVM-based learning phase. Thirdly, we devise an ensemble classifier, integrated from uncertain one-class classifiers built on the current and historical chunks, to cope with the concept drift involved in the uncertain data stream environment. Our proposed method explicitly handles the uncertainty of the input data and enhances the ability of one-class learning in reducing the sensitivity to noise. Experiments demonstrate that our proposed approach can achieve better accuracy and is highly robust to noise in comparison with state-of-the-art one-class learning method.
In all, extensive experiments have demonstrated that our developed approaches in this thesis can cope with the challenges compared with the traditional methods. In the future, we would like to extend our algorithms developed in the thesis for on-line learning and prediction.