Introduction
Large-scale statistical learning aims to develop advanced statistical methods for complex machine learning problems involving large, sparse, and multi-source data with complex relations and dynamics. Such methods are critical for real-life applications such as collaborative filtering, recommender systems, network analysis, text and document analysis, count data analysis, data mining, and natural language processing.
Classic statistical models face challenges in handling large, sparse, and multi-source data: computation over the entire (mostly missing) data is intensive and inefficient, and such models represent multi-source relations and dynamics poorly. Large-scale statistical learning instead performs computation only on the non-missing data, combined with efficient Bayesian inference methods.
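As a minimal sketch of this idea (hypothetical variable names, not any specific published implementation), the following evaluates a Poisson matrix-factorization log-likelihood over only the observed entries of a sparse count matrix, so the cost scales with the number of non-missing entries rather than with the full matrix size:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.special import gammaln

# Illustrative setup: a sparse count matrix with a handful of
# observed (user, item, count) triples.
rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 5
rows = rng.integers(0, n_users, size=200)
cols = rng.integers(0, n_items, size=200)
vals = rng.poisson(3.0, size=200) + 1
X = coo_matrix((vals, (rows, cols)), shape=(n_users, n_items))

# Non-negative latent factors (e.g. Gamma-distributed in Poisson MF).
theta = rng.gamma(1.0, 1.0, size=(n_users, k))
beta = rng.gamma(1.0, 1.0, size=(n_items, k))

def poisson_loglik_observed(X, theta, beta):
    """Poisson log-likelihood summed over observed entries only.

    Each observed entry (i, j) gets rate theta_i . beta_j; the full
    dense matrix of rates is never materialized.
    """
    rates = np.einsum("ej,ej->e", theta[X.row], beta[X.col])
    return float(np.sum(X.data * np.log(rates) - rates - gammaln(X.data + 1)))

ll = poisson_loglik_observed(X, theta, beta)
print(ll)  # finite and negative; cost scales with nnz, not n_users * n_items
```

The key point is that the per-entry rates are computed by gathering only the factor rows indexed by the observed coordinates, which is what makes the approach practical on very large, very sparse matrices.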
Research Topics
There are many interesting problems and topics to be explored in large-scale statistical learning, including:
- Large-scale Bayesian inference: developing inference techniques, such as Markov chain Monte Carlo (MCMC) and variational inference (VI), that scale Bayesian probabilistic models to large and sparse data;
- Bayesian nonparametrics: reducing the computational cost of nonparametric models, including Dirichlet processes, Gaussian processes, and latent feature models, on large datasets;
- Modeling count data: developing statistical models for count data with sparsity;
- Modeling sparse data: developing statistical models for sparse data, e.g., Poisson matrix factorization;
- Modeling multi-source data: developing statistical techniques to model relations in heterogeneous data;
- Modeling dynamic data: developing statistical models for both static and dynamic data with temporal transitions and dynamics;
- Modeling networking behaviors: developing statistical models for interactive networks and networking behaviors;
- Modeling large-scale recommendation: modeling rating, item, user, and user-item relations for collaborative filtering and other large, sparse, and multi-source recommendation problems;
- Deep Bayesian networks: developing deep/hierarchical Bayesian networks with deep learning mechanisms, e.g., attention and dropout.
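To make the sparse-data factorization topics above concrete, here is a minimal sketch (assumed setup, not the authors' method) of multiplicative KL-divergence updates for Poisson/NMF-style matrix factorization that touch only the observed entries, in the spirit of Poisson matrix factorization for recommendation:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Illustrative sparse count data.
rng = np.random.default_rng(1)
n, m, k = 30, 20, 4
rows = rng.integers(0, n, 150)
cols = rng.integers(0, m, 150)
vals = rng.poisson(2.0, 150) + 1
X = coo_matrix((vals, (rows, cols)), shape=(n, m))

# Positive initial factors (Gamma draws, as in Poisson factorization).
theta = rng.gamma(1.0, 1.0, (n, k)) + 1e-6
beta = rng.gamma(1.0, 1.0, (m, k)) + 1e-6

def loglik(X, theta, beta):
    """Poisson log-likelihood (up to a constant) over observed entries."""
    r = np.einsum("ej,ej->e", theta[X.row], beta[X.col])
    return float(np.sum(X.data * np.log(r) - r))

def update(factor_a, factor_b, idx_a, idx_b, data):
    """One multiplicative KL-NMF update of factor_a, observed entries only."""
    rates = np.einsum("ej,ej->e", factor_a[idx_a], factor_b[idx_b])
    ratio = data / rates
    num = np.zeros_like(factor_a)
    den = np.zeros_like(factor_a)
    np.add.at(num, idx_a, ratio[:, None] * factor_b[idx_b])
    np.add.at(den, idx_a, factor_b[idx_b])
    return factor_a * num / np.maximum(den, 1e-12)

prev = loglik(X, theta, beta)
for _ in range(20):
    theta = update(theta, beta, X.row, X.col, X.data)
    beta = update(beta, theta, X.col, X.row, X.data)
curr = loglik(X, theta, beta)
print(prev, curr)  # the log-likelihood does not decrease across updates
```

These multiplicative updates are majorize-minimize steps, so the observed-entry log-likelihood is non-decreasing; full Bayesian treatments (e.g., Gamma-Poisson models with Gibbs sampling or variational inference) add priors and uncertainty on top of this same sparse computation pattern.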
Tutorials
- Trong Dinh Thac Do, Longbing Cao and Jinjin Guo. Statistical Machine Learning: Big, Multi-source and Sparse Data with Complex Relations and Dynamics. AAAI2020.
- Trong Dinh Thac Do and Longbing Cao. Statistical Machine Learning of Large, Sparse, and Multi-source Data. PAKDD2019 (tutorial slides).
References
Qing Liu, Trong Dinh Thac Do and Longbing Cao. Answer Keyword Generation for Community Question Answering by Multi-aspect Gamma-Poisson Matrix Completion. IEEE Intelligent Systems.
Trong Dinh Thac Do and Longbing Cao. Gamma-Poisson Dynamic Matrix Factorization Embedded with Metadata Influence. NIPS2018.
Trong Dinh Thac Do and Longbing Cao. Metadata-dependent Infinite Poisson Factorization for Efficiently Modelling Sparse and Large Matrices in Recommendation. IJCAI2018.
Trong Dinh Thac Do and Longbing Cao. Coupled Poisson Factorization Integrated with User/Item Metadata for Modeling Popular and Sparse Ratings in Scalable Recommendation. AAAI2018.
Xuhui Fan, Richard Yi Da Xu, Longbing Cao and Yin Song. Learning Nonparametric Relational Models by Conjugately Incorporating Node Information in a Network. IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2016.2521376, 2016.
Xuhui Fan, Richard Xu and Longbing Cao. Copula Mixed-Membership Stochastic Blockmodel. IJCAI2016.
Xuhui Fan, Longbing Cao and Richard Yi Da Xu. Dynamic Infinite Mixed-Membership Stochastic Blockmodel. IEEE Transactions on Neural Networks and Learning Systems, 26(9): 2072-2085, 2015.