Data Science Lab

Traditional supervised learning rests on several assumptions: (1) each training label is associated with a single instance; (2) both positive and negative examples are available for learning the classifier; and (3) the concept underlying the data is stable and does not change over time. However, these assumptions do not always hold. This thesis focuses on complex data that does not satisfy all of these assumptions. The complex data studied in this thesis comprises multiple-instance data, streaming Web page data, and positive and unlabelled (PU) data.

To address these challenges, this thesis designs novel algorithms for such complex data. First, it proposes a multiple-instance learning (MIL) method named SMILE (Similarity-based Multiple-Instance LEarning), which assigns a similarity weight to each instance in the positive bags. Second, it presents a multi-instance streaming learning framework for classifying Web pages in time-evolving data streams. Third, it puts forward a robust PU learning (RPUL) approach that associates each undetermined instance with two weights, which indicate the probability that the instance belongs to the positive class and to the negative class, respectively. Extensive experiments show that the proposed approaches cope with these challenges more effectively than traditional methods.

Keywords: Support Vector Machine, Multiple-Instance Learning, Web Page Stream Classification, Positive and Unlabelled Instance Learning.
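
The two-weight idea behind RPUL can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not say how the weights are computed, so the sketch assumes a simple distance-to-centroid rule, assumes the two weights are normalised to sum to 1, and assumes a set of likely negatives has already been extracted from the unlabelled data. The function names `centroid`, `two_weights`, and `weighted_training_set` are hypothetical. Each undetermined instance enters the training set twice, once per class, with its corresponding weight; a weighted classifier such as a weighted SVM could then be trained on the result.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def two_weights(x, pos_center, neg_center):
    """Return (w_pos, w_neg) for an undetermined instance x.

    Assumption (not from the thesis): the closer x is to a class centroid,
    the larger its weight towards that class; the weights sum to 1.
    """
    d_pos = math.dist(x, pos_center)
    d_neg = math.dist(x, neg_center)
    if d_pos + d_neg == 0:
        return 0.5, 0.5  # both centroids coincide with x: no preference
    w_pos = d_neg / (d_pos + d_neg)
    return w_pos, 1.0 - w_pos


def weighted_training_set(positives, negatives, undetermined):
    """Build (instance, label, weight) rows for a weighted classifier.

    Labelled positives and assumed likely negatives get weight 1.0.
    Every undetermined instance appears twice: once as +1 with w_pos,
    once as -1 with w_neg.
    """
    pos_c, neg_c = centroid(positives), centroid(negatives)
    rows = [(x, +1, 1.0) for x in positives]
    rows += [(x, -1, 1.0) for x in negatives]
    for x in undetermined:
        w_pos, w_neg = two_weights(x, pos_c, neg_c)
        rows.append((x, +1, w_pos))
        rows.append((x, -1, w_neg))
    return rows
```

Duplicating each undetermined instance with soft weights, rather than forcing a hard label, limits the damage from a wrong guess about an ambiguous instance, which is one plausible reading of "robust" here.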