In supervised learning, distance and similarity measures are central to many classification algorithms. When measuring the similarity of categorical data, the strategies used by traditional classifiers, such as the overlap similarity and the frequency-based similarity, often overlook the relationships between different data attributes and assume that the attributes are independent of each other. Likewise, for numerical data, the commonly used Euclidean and Minkowski distances operate on each feature in isolation and assume that the features in a data set have no mutual connections. Ignoring these inter-attribute relationships distorts the real similarity or distance between instances and may lead to incorrect results. The same problem arises in other supervised learning settings, such as class-imbalanced or multi-label classification tasks. To address these limitations and challenges, this thesis presents an in-depth analysis of coupled similarity in supervised learning, applied not only to categorical-feature data sets but also to numerical-feature and mixed-type data, in order to express similarity in a way that more closely reflects the true nature of the problem.
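As a point of reference, the decoupled measures criticized above can be stated in a few lines. The sketch below (illustrative only, not code from the thesis) makes explicit that both the overlap similarity and the Euclidean distance compare each attribute in isolation, with no term linking different features:

```python
import numpy as np

def overlap_similarity(x, y):
    """Overlap similarity: the fraction of attributes on which two
    categorical objects agree; each attribute is compared in isolation."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def euclidean_distance(u, v):
    """Euclidean distance: likewise a purely per-attribute computation."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum((u - v) ** 2)))
```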
Firstly, in Chapter 3 we propose a coupled fuzzy kNN to classify imbalanced categorical data in which objects, attributes and classes are strongly related. It incorporates the size membership of a class, together with attribute weighting, into a coupled similarity measure that effectively captures the inter-coupling and intra-coupling relationships among categorical attributes. Because it reveals the true inner relationships between attributes, this similarity strategy makes the instances of each class more compact when expressed as distances, which is an advantage when dealing with class-imbalanced data. The experimental results show that our proposed method achieves a more stable and higher average performance than classic algorithms such as kNN, kENN, CCWkNN, SMOTE-based kNN, Decision Tree and Naive Bayes, especially on class-imbalanced categorical data.
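For concreteness, a minimal sketch of a frequency-based intra-coupled similarity for a single categorical attribute is given below. The exact coupled formulation in Chapter 3 also weights attributes and incorporates class-size membership, so the formula and function names here should be read as illustrative assumptions:

```python
from collections import Counter

def intra_coupled_sim(column):
    """Build an intra-coupled similarity for one categorical attribute
    from value frequencies (a common frequency-based form, assumed here
    for illustration): pairs of frequent values score as more similar."""
    counts = Counter(column)
    def sim(x, y):
        fx, fy = counts[x], counts[y]
        return (fx * fy) / (fx + fy + fx * fy)
    return sim

# Usage: "red" (3 occurrences) and "blue" (2) are more similar to each
# other than "blue" and the rare value "green" (1).
sim = intra_coupled_sim(["red", "red", "blue", "green", "red", "blue"])
print(sim("red", "blue"), sim("blue", "green"))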
We also introduce an analogous coupled distance for continuous features, which considers the intra-coupled and inter-coupled relationships between numerical attributes and their extensions. As detailed in Chapter 4, we build the coupled distance from the Pearson correlations between the attributes and their square roots and squares. Extensive experiments, supported by statistical analysis, verify that our coupled distance outperforms the original distance, demonstrating that the coupling strategy on continuous features substantially improves the expression of the real distance between objects with numerical features.
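A rough sketch of this idea, assuming the attribute extensions named above (square root and square) and a Pearson-correlation weighting of the difference vector, might look as follows; the precise combination used in Chapter 4 may differ:

```python
import numpy as np

def extend(A):
    """Extend each numeric attribute with its square root and square
    (absolute values are taken before the root for safety)."""
    A = np.asarray(A, dtype=float)
    return np.hstack([A, np.sqrt(np.abs(A)), A ** 2])

def coupled_numeric_distance(X, u, v):
    """Hypothetical correlation-weighted distance: Pearson correlations
    among the extended attributes of the data set X couple the features."""
    R = np.corrcoef(extend(X), rowvar=False)   # inter-attribute coupling
    u = np.asarray(u, dtype=float).reshape(1, -1)
    v = np.asarray(v, dtype=float).reshape(1, -1)
    d = extend(u) - extend(v)
    return float(np.sqrt((d @ R @ d.T).item()))  # R is PSD, so this is real
```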
The notion of similarity is often associated only with categorical data, while the notion of distance is often associated only with numerical data; few methods take both into account, especially while also considering the coupling relationships between features. In Chapter 5, we propose a new method that brings our coupling concept to mixed-type data, that is, data containing both numerical and categorical features. In our method, we first discretize the numerical attributes, mapping continuous values into separate groups, so that the inter-coupling distance can be applied to them as it is to categorical features (coupling similarity); we then combine this new coupled distance with the original Euclidean distance to overcome the shortcomings of previous algorithms. The experimental results show an improvement over the basic kNN algorithm and several of its variants, as sketched below.
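The overall combination could be sketched as follows, assuming equal-width discretization of the numeric part and a supplied per-attribute coupled similarity for the categorical part; the weight `alpha` and the helper names are illustrative rather than the thesis's exact scheme:

```python
import numpy as np

def discretize(X_num, bins=5):
    """Equal-width binning of each numeric attribute into `bins` groups,
    so the coupled (categorical) similarity can be applied to them."""
    X_num = np.asarray(X_num, dtype=float)
    edges = [np.linspace(col.min(), col.max(), bins + 1)[1:-1]
             for col in X_num.T]
    return np.stack([np.digitize(col, e) for col, e in zip(X_num.T, edges)],
                    axis=1)

def mixed_distance(x_num, y_num, x_cat, y_cat, coupled_sim, alpha=0.5):
    """Combine Euclidean distance on the raw numeric features with a
    coupled dissimilarity on the categorical (and discretized numeric)
    features; coupled_sim(j, a, b) is the similarity on attribute j."""
    d_euc = float(np.linalg.norm(np.asarray(x_num, dtype=float)
                                 - np.asarray(y_num, dtype=float)))
    d_cpl = sum(1.0 - coupled_sim(j, a, b)
                for j, (a, b) in enumerate(zip(x_cat, y_cat)))
    return alpha * d_euc + (1.0 - alpha) * d_cpl
```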
We also extend our coupling concept to multi-label classification tasks. Traditional single-label classifiers are known to be unsuitable for multi-label tasks because class labels overlap conceptually. The most widely used classifier for multi-label problems, ML-kNN, learns a single classifier for each label independently, so it is effectively a binary relevance classifier; in other words, it ignores the correlations between different labels, a drawback for which it is often criticized. To overcome this problem, we introduce a coupled label similarity that explores the inner relationships between different labels according to their natural co-occurrence. This similarity reflects how close the different labels are to one another. By integrating it into the multi-label kNN algorithm, we overcome ML-kNN's shortcoming and improve its performance significantly. Evaluated over three commonly used verification criteria for multi-label classifiers (Hamming Loss, One Error and Average Precision), our proposed coupled multi-label classifier outperforms ML-kNN, BR-kNN and even IBLR. These results indicate that our proposed coupled label similarity is well suited to multi-label learning problems and works more effectively than competing methods.
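A minimal sketch of a co-occurrence-based label similarity is shown below; the normalization used here is an assumption, and the coupled label similarity defined in the thesis may differ:

```python
import numpy as np

def coupled_label_similarity(Y):
    """From a binary label matrix Y (n_samples x n_labels), derive a
    label-label similarity from co-occurrence counts: roughly the
    conditional frequency of label j given label i."""
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                           # co[i, j] = #samples with both labels
    freq = np.maximum(co.diagonal(), 1.0)  # per-label counts (avoid div by 0)
    return co / freq[:, None]

# Usage: labels that frequently co-occur receive a high similarity.
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(coupled_label_similarity(Y))
```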
All the classifiers analyzed in this thesis are based on our coupled similarity (or distance) and are applied to different tasks in supervised learning. The performance of these models is examined using widely adopted verification criteria, such as ROC, Accuracy Rate, Average Precision and Hamming Loss. This analysis provides researchers with insightful knowledge for uncovering the inner relationships between features in supervised learning tasks.
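These criteria are all standard and available off the shelf; for instance, with scikit-learn (shown here purely as an illustration, not as the thesis's evaluation code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])   # multi-label ground truth
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])   # hard predictions
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7], [0.7, 0.4, 0.3]])

print("Hamming Loss:", hamming_loss(y_true, y_pred))       # fraction of wrong labels
print("Subset Accuracy:", accuracy_score(y_true, y_pred))  # exact-match rate
print("ROC AUC (micro):", roc_auc_score(y_true, y_score, average="micro"))
# Note: sklearn's average_precision_score is the area-under-PR-curve
# variant, which differs from the ranking-based Average Precision
# commonly reported for multi-label classifiers such as ML-kNN.
```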