Data Science Lab

A Latent Variable Model (LVM) is a statistical model that aims to uncover hidden information behind data. These models have been widely used in real-world applications such as community detection, link prediction, and recommender systems. However, LVMs face significant challenges in modeling complex relations because they assume that the data are independent and identically distributed (IID). Real-world data, in contrast, are often coupled in terms of object attributes, object relations, or even hidden variable relations. For example, in social networks, users who report a similar ‘age’, ‘location’, and ‘high school’ are often friends. Non-IID learning therefore has the potential to describe these hierarchical relations in real-world data, which are typically not independent and identically distributed.

In this thesis, we are interested in determining the relations behind observations and hidden variables in LVMs. More specifically, we focus on coupling relations in non-IID data across several LVMs, including the Latent Class Model (LCM), the Latent Feature Model (LFM), and the Latent Factor Model-Matrix Factorization (LFM-MF). In particular, we aim to model the following relations: (1) relations between attributes in observed data (e.g., user/item metadata such as the ‘location’ of a user or the ‘genre’ of a movie); (2) relations between different sources of observed data (e.g., metadata and users’ friendships); and (3) relations between latent variables in an LVM. We also apply Bayesian Nonparametric (BNP) techniques to the proposed LVMs to automatically tune the number of latent variables for efficient computation.

Furthermore, to work with large and sparse data, we introduce several methods for better inference in the proposed LVMs. The empirical analysis of the proposed models reveals that they significantly outperform state-of-the-art models in the same family. Together with the improved optimization techniques (i.e., BNP and the inference methods), the proposed models show potential for online modeling of large, sparse data.