Data Science Lab

Introduction

Human beings and natural, socioeconomic, cultural, virtual and cyber spaces, systems, behaviors and data are complex. Two typical system complexities are heterogeneities and interactions (non-IIDnesses [3]): systems and their entities, attributes, behaviors and dynamics are heterogeneous, and diverse interactions may exist within and between systems, entities, and so on. These non-IIDnesses go beyond the classic IID (or i.i.d.) thinking and assumptions widely applied to science, nature and societal systems, under which things are assumed homogeneous and independent, data are sampled in an independent and identically distributed manner, and variables are drawn IID from a given distribution; that is, the focus is on IIDnesses rather than non-IIDnesses [1]. Such IID thinking is the foundational assumption of classic statistics and mathematics, information theory, machine learning and analytics, and of their applications to AI and data science domains such as vision, text, communication, behavior, action, planning, search, recommendation and decision-making, although these domains actually go beyond IID.

Can non-i.i.d. theories in statistics solve the non-IIDnesses?

The answer is unfortunately no, because the statistical non-i.i.d. theories are too narrow and specific, only addressing a small proportion of non-IIDnesses [1].

Some compromise approaches and solutions grounded in statistics, mathematics and information theory address statistically non-i.i.d. settings; they mostly rely on dependence theories, the law of large numbers, the central limit theorem, etc. However, real-life non-IIDnesses, and the non-IID thinking, informatics, statistics, and learning they require [1, 2], surpass the classic non-i.i.d. settings in the statistical/mathematical sense, which typically focus on correlation and dependency. Non-IID thinking and learning are not opposed to classic i.i.d. analysis under i.i.d. assumptions. Non-IIDnesses refer to any settings and complexities beyond IIDnesses. Beyond independence refers to any setting that is not independent, such as dependent, correlated, coupled, entangled, or interactive. Beyond homogeneity refers to any setting that is not identically distributed, e.g., with heterogeneous types, distributions, structures, and relations over variables, sources, samplings, time, or space, or with heterogeneous results from distinct processes and methods. Together, non-IIDnesses refer to complexities beyond IID, including interactions, coupling relationships, heterogeneities and nonstationarities over time, space, sampling, source, domain, modality, modelling process, or methods, etc. Non-IID learning refers to learning from data with non-IIDnesses. In contrast, IIDnesses refer to data with independence and identical distributions, and IID learning refers to learning from data with IIDnesses [1,5].
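As a minimal illustration of the two departures from IID described above, the following Python sketch draws one IID sample and one sample that is both heterogeneous (two sources with different means) and coupled (each value depends on the previous one). It is not a method from the cited work: the function names, the autoregressive form of the coupling, and all parameter values are assumptions chosen purely for illustration.

```python
import random
import statistics


def iid_sample(n, seed=0):
    """Draw n values independently from one standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def non_iid_sample(n, seed=0, coupling=0.9, source_means=(0.0, 5.0)):
    """Draw n values that violate both the 'I' and the 'ID' in IID.

    Heterogeneity: the first half comes from a source with mean
    source_means[0], the second half from a source with mean source_means[1].
    Coupling: each value keeps a fraction `coupling` of the previous
    value's deviation from its source mean.
    """
    rng = random.Random(seed)
    values = []
    deviation = 0.0
    for i in range(n):
        mean = source_means[0] if i < n // 2 else source_means[1]
        deviation = coupling * deviation + rng.gauss(0.0, 1.0)
        values.append(mean + deviation)
    return values


def lag1_autocorrelation(xs):
    """Correlation between consecutive values: a crude check for dependence."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    iid = iid_sample(2000)
    non_iid = non_iid_sample(2000)
    half = len(non_iid) // 2
    print("lag-1 autocorrelation (IID):    ", round(lag1_autocorrelation(iid), 3))
    print("lag-1 autocorrelation (non-IID):", round(lag1_autocorrelation(non_iid), 3))
    print("mean of first half / second half (non-IID):",
          round(statistics.fmean(non_iid[:half]), 2),
          round(statistics.fmean(non_iid[half:]), 2))
```

In the IID sample the lag-1 autocorrelation should be close to zero, whereas in the non-IID sample it should be large and the two halves should have clearly different means. Real non-IIDnesses are far richer than this toy case, which covers only one simple form of dependence and one simple form of heterogeneity.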

In a non-IID data problem (see Figure 1(a)), non-IIDnesses (see Figure 1(c)) refer to any interactions and couplings [4] and any heterogeneities. The interactions and couplings include both well-explored relationships, such as co-occurrence, neighborhood, dependency, linkage, correlation, and causality, and poorly explored and rarely studied ones, such as sophisticated cultural and religious connections and influence. They exist within and between two or more aspects, such as entity, entity class, entity property (variable), process, fact and state of affairs, or other types of entities or properties (such as learners and learned results) appearing or produced prior to, during and after a target process (such as a learning task). By contrast, IIDnesses ignore or simplify them, as shown in Figure 1(b) [1,3,4].

Figure 1: (b) IIDness and IID learning vs. (c) non-IIDness and non-IID learning [1,3].

Learning visible and especially invisible non-IIDnesses is a fundamental challenge in the age of data, data science and new-generation AI for an intrinsic understanding of data, behaviors, and systems with weak and/or unclear structures, distributions, relationships, semantics, and dynamics, etc. In many cases, locally visible but globally invisible (or vice versa) non-IIDnesses are present in a range of forms, structures, and layers and on diverse entities. Often, individual learners cannot tell the whole story because they are unable to identify such complex non-IIDnesses. Effectively learning the widespread, varied, visible and invisible non-IIDnesses is thus crucial for obtaining the truth and a complete picture of the underlying problem.

We frequently focus only on explicit non-IIDness, which is visible to us and easy to learn. Typically, work on the hybridization of multiple methods and on the combination of multiple sources of data into a big table for analysis falls into this category. Computing non-IIDness refers to understanding, formalizing and quantifying the non-IID aspects, entities, interactions, layers, forms, and strength. This includes extracting, discovering and estimating the interactions and heterogeneities between learning components, including the method, objective, task, level, dimension, process, measure, and outcome, especially when the learning involves several of these components, such as multiple methods or multiple tasks. We are concerned with understanding non-IIDness at a range of levels, from values, attributes, objects, methods and measures to processing outcomes (such as mined patterns). Such non-IIDness is both comprehensive and complex.
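One generic way to put a number on the strength of an interaction between two categorical attributes is their empirical mutual information, which is zero when the attributes are independent in the sample and grows as they become coupled. The sketch below only illustrates what quantifying non-IIDness at the attribute level can look like; it is not the coupling measure defined in [3, 4], and the function name and toy records are assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in bits) between two categorical attributes.

    `pairs` is a list of (value_a, value_b) tuples, one per object.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2(p(a, b) / (p(a) * p(b))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy records: (occupation, transport mode)
coupled = [("student", "bus")] * 4 + [("manager", "car")] * 4
independent = [
    (occupation, transport)
    for occupation in ("student", "manager")
    for transport in ("bus", "car")
] * 2

if __name__ == "__main__":
    print("coupled attributes:    ", mutual_information(coupled))
    print("independent attributes:", mutual_information(independent))
```

In the first toy table each occupation always appears with the same transport mode, so the two attributes share one full bit of information; in the second every combination is equally frequent and the score is zero. Mutual information captures only one pairwise, attribute-level form of coupling and says nothing about heterogeneity or about couplings at the value, object, or method levels.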

Figure 2: IID to non-IID space [1].

Non-IIDnesses in complex systems, behaviors, and data

Here, non-IIDness [1,3] refers to any interactions and coupling relationships (for instance, co-occurrence, neighborhood, dependency, linkage, correlation, or causality) and heterogeneities between two or more aspects, such as object, object class, object property (variable), process, fact and state of affairs, or other types of entities or properties (such as learners and learned results) appearing or produced prior to, during and after a target process (such as a learning task).

In a learning system, as shown in Figures 3 and 4, non-IIDnesses may exist within and/or between aspects, such as entity (objects, object class, instance, or group/community) and its/their properties (variables), context (environment) and its constraints, interactions (exchange of information, material or energy) between entities or between the entity and its/their environment, learning objectives (targets, such as risk level or fraud), the corresponding learning methods (models, algorithms or systems) and resultant outcomes (such as patterns or clusters).

Figure 3. Terminology and conceptual map of non-IIDness: non-ID—heterogeneities, and non-I—interactions. [1]

Figure 4. Various aspects of hierarchical non-IIDnesses in complex data, behaviors, systems and businesses [1,4]

**Non-IIDnesses and open learning challenge deep learning**

Can deep neural networks and broadly deep learning address non-IIDnesses sufficiently?

The answer is unfortunately no as well, because current deep models still rely on the principle of fitting and are thus challenged by issues including network vulnerability and distributional vulnerability. These issues correspond to problems including unsupervised and self-supervised deep learning, out-of-distribution (OOD) detection [5], open set learning, etc.

Disentangled representation learning, or decoupled representation learning, has been promoted for explainable deep learning. Is the disentanglement and decoupling thinking aligned with the intrinsic non-IIDnesses? Perhaps, not really.

Many open learning problems require non-IID thinking, informatics, and learning theories.

Open learning issues include data-related open data (open set) and open domain, modeling-related open models and open architectures, process-related OOD sampling and open parameterization, and results-related open measurement and open evaluation. Addressing open learning requires thinking and paradigm breakthroughs, including non-IID thinking, informatics, statistics, computing, and specifically, non-IID statistical theories, non-IID information theories, non-IID machine/deep learning, and non-IID analytics [1,2].

The area of Non-IID and its research directions

Figure 5 summarizes some challenges and prospects of non-IID data processing, non-IID feature engineering, non-IID representation, non-IID pattern mining, non-IID statistical learning, non-IID reinforcement learning, non-IID deep learning, non-IID transfer learning, non-IID federated learning, non-IID multi-modal/source/task learning, non-IID vision learning, non-IID natural language processing/document/text analysis, and non-IID behavior modeling, as well as non-IID applications including non-IID outlier detection and non-IID recommendation [3].

Figure 5. A research map of non-IID learning [1,2]

Below, we illustrate the main prospects of inventing new and effective data science (including analytics and machine learning) thinking, paradigms, theories and tools for non-IID learning (also called non-IIDness learning, non-IID data learning, or learning from non-IID data). We examine how to address non-IID data characteristics (note, not just IID objects) in terms of new feature analysis that considers feature relations and distributions; new learning theories, algorithms and models for analytics; and new metrics for similarity measurement and evaluation, etc.

Deep understanding of non-IID data characteristics: This is to identify, specify and quantify the non-IID data characteristics, factors, aspects, forms, types, and levels of non-IIDness in the raw data, behaviors, systems and problems during their understanding, acquisition, manipulation and enhancement, and to identify the difference between what can be captured by existing data/behavior/problem understanding technologies and systems and what is left out.
New and effective non-IID feature analysis and construction: This is to invent new theories and tools for the analysis of feature relationships by considering non-IIDness within and between features and objects, and to develop new theories and algorithms for extracting, transforming, selecting, mining and constructing explicit and implicit features.
New non-IID learning theories, algorithms and models: This is to create new theories, algorithms and models for analyzing, learning, and mining non-IID data by considering value-to-object (and up to results if required) couplings and heterogeneity.
New non-IID similarity and evaluation metrics: This is to develop new similarity and dissimilarity learning methods and metrics, as well as evaluation metrics that consider non-IIDness in data, behaviors, systems and business.
New non-IID simulation and visualization theories and methods: This is to develop new theories, methods and tools, as well as evaluation metrics that simulate, visualize, and present non-IIDness in data, behaviors, systems and business.
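To make the contrast between IID-style and coupling-aware similarity concrete, the following toy sketch compares plain value matching with a score that looks at how two values of one attribute co-occur with the values of another attribute. It is only loosely in the spirit of the coupled similarity work listed under "Non-IID clustering" below and does not reproduce the definitions in those papers; the function names and the records are hypothetical.

```python
from collections import Counter, defaultdict


def overlap_similarity(v1, v2):
    """IID-style matching: two categorical values are similar only if equal."""
    return 1.0 if v1 == v2 else 0.0


def cooccurrence_similarity(records, v1, v2):
    """Similarity of two values of attribute A via their co-occurrence with attribute B.

    `records` is a list of (a_value, b_value) pairs. The score sums, over all
    values y of B, min(P(y | v1), P(y | v2)). It is 1.0 when v1 and v2 have
    the same conditional distribution over B, and 0.0 when they never share
    a value of B.
    """
    by_a = defaultdict(Counter)
    for a, b in records:
        by_a[a][b] += 1
    n1 = sum(by_a[v1].values())
    n2 = sum(by_a[v2].values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both values must occur in records")
    shared = set(by_a[v1]) & set(by_a[v2])
    return float(sum(min(by_a[v1][y] / n1, by_a[v2][y] / n2) for y in shared))


# Hypothetical toy records: (drink, time of day)
records = [
    ("tea", "morning"), ("tea", "morning"),
    ("coffee", "morning"), ("coffee", "morning"),
    ("wine", "evening"), ("wine", "evening"),
]

if __name__ == "__main__":
    print(overlap_similarity("tea", "coffee"))                # plain matching
    print(cooccurrence_similarity(records, "tea", "coffee"))  # same context
    print(cooccurrence_similarity(records, "tea", "wine"))    # disjoint context
```

Plain matching treats "tea" and "coffee" as completely dissimilar, while the co-occurrence score rates them as similar because they appear in the same context. A practical non-IID metric would also need to account for couplings within an attribute, across more than two attributes, and between objects.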

More broadly, many existing data-oriented theories, designs, mechanisms, systems and tools may need to be reinvented when non-IIDness is taken into consideration. In addition to non-IID learning for data mining, machine learning and general data analytics, this involves well-established bodies of knowledge, including mathematical and statistical foundations, descriptive analytics theories and tools, data management theories and systems, information retrieval theories and tools, multi-media analysis, and X-analytics.

Relevant research on non-IID learning

Non-IID learning concepts

[1] Longbing Cao. Beyond i.i.d.: Non-IID Thinking, Informatics, and Learning, IEEE Intelligent Systems, 37:4, 5-17, 2022.
[2] Longbing Cao, Philip S. Yu, Zhilin Zhao. Shallow and Deep Non-IID Learning on Complex Data, KDD 2022: 4774-4775.
Longbing Cao. Non-IID Federated Learning, IEEE Intell. Syst. 37(2): 14-15 (2022).
Can Wang, Fosca Giannotti, Longbing Cao. Learning Complex Couplings and Interactions, IEEE Intell. Syst. 36(1): 3-5, 2021.
[3] Longbing Cao. Non-IIDness Learning in Behavioral and Social Data, The Computer Journal, 57(9): 1358-1370 (2014).
[4] Longbing Cao. Coupling Learning of Complex Interactions, Journal of Information Processing and Management, 51(2): 167-186 (2015).
[5] Zhilin Zhao, Longbing Cao and Kun-Yu Lin. Revealing the Distributional Vulnerability of Discriminators by Implicit Generators, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-13, 2022.
[6] Longbing Cao. Combined Mining: Analyzing Object and Pattern Relations for Discovering and Constructing Complex but Actionable Patterns, WIREs Data Mining and Knowledge Discovery, 3(2): 140-155, 2013.
[7] Longbing Cao, Huaifeng Zhang, Yanchang Zhao, Dan Luo, Chengqi Zhang. Combined Mining: Discovering Informative Knowledge in Complex Data, IEEE Trans. SMC Part B, 41(3): 699-712, 2011.

Non-IID representation

Chengzhang Zhu, Longbing Cao and Jianpin Yin. Unsupervised Heterogeneous Coupling Learning for Categorical Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1): 533-549, 2022.
Songlei Jian, Liang Hu, Longbing Cao, and Kai Lu. Metric-based Auto-Instructor for Learning Mixed Data Representation, AAAI2018.
Songlei Jian, Longbing Cao, Guansong Pang, Kai Lu, Hang Gao. Embedding-based Representation of Categorical Data with Hierarchical Value Couplings, IJCAI2017.

OOD detection, mitigation and prediction

Zhilin Zhao, Longbing Cao. Weighting non-IID batches for out-of-distribution detection. Machine Learning (2024). https://doi.org/10.1007/s10994-024-06605-z.
Zhilin Zhao, Longbing Cao and Kun-Yu Lin. Revealing the Distributional Vulnerability of Discriminators by Implicit Generators, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-13, 2022.
Zhilin Zhao, Longbing Cao, Chang-Dong Wang: Gray Learning from Non-IID Data with Out-of-distribution Samples. CoRR abs/2206.09375 (2022)
Zhilin Zhao, Longbing Cao, Kun-Yu Lin: Supervision Adaptation Balances In-Distribution Generalization and Out-of-Distribution Detection. CoRR abs/2206.09380 (2022)
Zhilin Zhao, Longbing Cao: Label and Distribution-discriminative Dual Representation Learning for Out-of-Distribution Detection. CoRR abs/2206.09387 (2022)

Non-IID clustering

Can Wang, Zhong She, Longbing Cao. Coupled Attribute Analysis on Numerical Data, IJCAI 2013.
Can Wang, Xiangjun Dong, Fei Zhou, Longbing Cao, Chi-Hung Chi. Coupled Attribute Similarity Learning on Categorical Data, IEEE Transactions on Neural Networks and Learning Systems, 26(4): 781-797 (2015).
Can Wang, Longbing Cao, Minchun Wang, Jinjiu Li, Wei Wei, Yuming Ou. Coupled Nominal Similarity in Unsupervised Learning, CIKM 2011, 973-978.

Non-IID KNN/classification

Chunming Liu, Longbing Cao. A Coupled k-Nearest Neighbor Algorithm for Multi-label Classification, PAKDD2015, 176-187.
Chunming Liu, Longbing Cao, Philip S Yu. A Hybrid Coupled k-Nearest Neighbor Algorithm on Imbalance Data, IJCNN 2014.
Chunming Liu, Longbing Cao, Philip S Yu. Coupled Fuzzy k-Nearest Neighbors Classification of Imbalanced Non-IID Categorical Data, IJCNN 2014.

Non-IID data discretization

Can Wang, Mingchun Wang, Zhong She, Longbing Cao. CD: A Coupled Discretization Algorithm, PAKDD2012, 407-418

Non-IID recommender systems

Quangui Zhang, Longbing Cao, Chengzhang Zhu, Zhiqiang Li and Jinguang Sun. CoupledCF: Learning Explicit and Implicit User-item Couplings in Recommendation for Deep Collaborative Filtering. IJCAI2018.
Liang Hu, Longbing Cao, Jian Cao, Zhipeng Gu, Guandong Xu, Jie Wang. Improving the Quality of Recommendations for Users and Items in the Tail of Distribution. ACM Trans. Info Sys., 2017.
Liang Hu, Longbing Cao, Jian Cao, Zhipeng Gu, Guandong Xu, Dingyu Yang. Learning Informative Priors from Heterogeneous Domains to Improve Recommendation in Cold-Start User Domains. ACM Trans. Info Sys., 35(2), doi:10.1145/2976737, 2016.
Longbing Cao. Non-IID Recommender Systems: A Review and Framework of Recommendation Paradigm Shifting. Engineering, 2: 212-224, doi:10.1016/J.ENG.2016.02.013, 2016.
Longbing Cao, Philip Yu. Non-IID Recommendation Theories and Systems. IEEE Intelligent Systems, 31(2), 81-84, 2016.
Fangfang Li, Guandong Xu, Longbing Cao. Coupled Matrix Factorization within Non-IID Context, PAKDD2015, 707-719.
Liang Hu, Longbing Cao, Guandong Xu, Jian Cao, and Wei Cao. Bayesian Heteroskedastic Choice Modeling on Non-identically Distributed Linkages, ICDM2014.
Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu and Wei Cao. Deep Modeling of Group Preferences for Group-based Recommendation, AAAI 2014, 1861-1867.
Fangfang Li, Guandong Xu, Longbing Cao: Coupled Item-Based Matrix Factorization. WISE (1) 2014: 1-14
Liang Hu, Jian Cao, Guandong Xu, Jie Wang, Zhiping Gu, Longbing Cao. Cross-Domain Collaborative Filtering via Bilinear Multilevel Analysis, IJCAI2013.
Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, Can Zhu. Cross-Domain Collaborative Filtering Triadic Factorization, WWW 2013.
Fangfang Li, Guandong Xu, Longbing Cao, Zhendong Niu. Coupled Group-based Matrix Factorization for Recommender System, WISE 2013.
Yonghong Yu, Can Wang, Yang Gao, Longbing Cao, Qianqian Chen: A Coupled Clustering Approach for Items Recommendation. PAKDD (2) 2013

Non-IID document/text analysis

Jinjin Guo, Longbing Cao, Zhiguo Gong: Recurrent Coupled Topic Modeling over Sequential Documents. ACM Trans. Knowl. Discov. Data 16(1): 8:1-8:32, 2022.
Shufeng Hao, Chongyang Shi, Longbing Cao, Zhendong Niu, Ping Guo: Learning deep relevance couplings for ad-hoc document retrieval. Expert Syst. Appl. 183: 115335, 2021.
Shufeng Hao, Chongyang Shi, Zhendong Niu, Longbing Cao. Concept Coupling Learning for Improving Concept Lattice-based Document Retrieval. Engineering Applications of Artificial Intelligence, Volume 69, 65-75, 2018.
Qianqian Chen, Liang Hu, Jia Xu, Wei Liu, Longbing Cao. Document similarity analysis via involving both explicit and implicit semantic couplings. DSAA 2015: 1-10.
Xin Cheng, Duoqian Miao, Can Wang, Longbing Cao. Coupled Term-Term Relation Analysis for Document Clustering, IJCNN2013.

Pattern/rule relation analysis

Shoujin Wang, Longbing Cao. Inferring Implicit Rules by Learning Explicit and Hidden Item Dependency, IEEE Transactions on Systems, Man, and Cybernetics: Systems.
Jinjiu Li, Can Wang, Longbing Cao, Philip S. Yu. Efficient Selection of Globally Optimal Rules on Large Imbalanced Data Based on Rule Coverage Relationship Analysis, SDM 2013.
Yanchang Zhao, Huaifeng Zhang, Longbing Cao, Chengqi Zhang. Combined Pattern Mining: from Learned Rules to Actionable Knowledge, LNCS 5360/2008, 393-403, 2008.
Huaifeng Zhang, Yanchang Zhao, Longbing Cao and Chengqi Zhang. Combined Association Rule Mining, PAKDD2008.

Non-IID ensemble clustering

Can Wang, Zhong She, Longbing Cao. Coupled Clustering Ensemble: Incorporating Coupling Relationships Both between Base Clusterings and Objects, ICDE2013.

Group/Coupled behavior analytics

Can Wang, Longbing Cao, Chi-Hung Chi: Formalization and Verification of Group Behavior Interactions. IEEE Trans. Systems, Man, and Cybernetics: Systems 45(8): 1109-1124 (2015)
Wei Cao, Liang Hu, Longbing Cao: Deep Modeling Complex Couplings within Financial Markets. AAAI 2015: 2518-2524
Wei Cao, Longbing Cao, Yin Song: Coupled market behavior based financial crisis detection. IJCNN 2013: 1-8
Yin Song, Longbing Cao, et al. Coupled Behavior Analysis for Capturing Coupling Relationships in Group-based Market Manipulation, KDD 2012, 976-984.
Yin Song and Longbing Cao. Graph-based Coupled Behavior Analysis: A Case Study on Detecting Collaborative Manipulations in Stock Markets, IJCNN 2012, 1-8.
Longbing Cao, Yuming Ou, Philip S Yu. Coupled Behavior Analysis with Applications, IEEE Trans. on Knowledge and Data Engineering, 24(8): 1378-1392 (2012).
Longbing Cao, Yuming Ou, Philip S YU, Gang Wei. Detecting Abnormal Coupled Sequences and Sequence Changes in Group-based Manipulative Trading Behaviors, KDD2010, 85-94.

Non-IID computer vision and image processing

Yonggang Huang, Yuying Liu, Longbing Cao, Jun Zhang, I Pan. Exploring Feature Coupling and Model Coupling for Image Source Identification, IEEE Transactions on Information Forensics & Security, 2018
Shi, Y., Li, W., Gao, Y., Cao, L., Shen, D. Beyond IID: Learning to combine non-iid metrics for vision tasks. AAAI’17
Zhe Xu, Ya Zhang, Longbing Cao. Social Image Analysis from a Non-IID Perspective, IEEE Transactions on Multimedia.
Yinghuan Shi, Heung-Il Suk, Yang Gao, Dinggang Shen. Joint Coupled-Feature Representation and Coupled Boosting for Alzheimer’s Disease Diagnosis, CVPR, 2014

Keyword query with couplings

Xiangfu Meng, Longbing Cao and Jingyu Shao. Semantic Approximate Keyword Query Based on Keyword and Query Coupling Relationship Analysis, CIKM2014.

Statistical relation learning

Trong Dinh Thac Do and Longbing Cao. Gamma-Poisson Dynamic Matrix Factorization Embedded with Metadata Influence, NIPS2018.
Trong Dinh Thac Do and Longbing Cao. Metadata-dependent Infinite Poisson Factorization for Efficiently Modelling Sparse and Large Matrices in Recommendation. IJCAI2018.
Trong Dinh Thac Do and Longbing Cao. Coupled Poisson Factorization Integrated with User/Item Metadata for Modeling Popular and Sparse Ratings in Scalable Recommendation. AAAI2018.
Xuhui Fan, Richard Xu, Longbing Cao. Copula Mixed-Membership Stochastic Blockmodel. IJCAI2016.
Xuhui Fan, Richard Xu, Longbing Cao, Yin Song. Learning Nonparametric Relational Models by Conjugately Incorporating Node Information in a Network. IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2016.2521376.
Xuhui Fan, Longbing Cao, Richard Yi Da Xu. Dynamic Infinite Mixed-Membership Stochastic Blockmodel, IEEE Transactions on Neural Networks and Learning Systems, 26(9): 2072-2085 (2015).
Wei Cao, Liang Hu, Longbing Cao. Deep Modeling Complex Couplings within Financial Markets, AAAI2015, 2518-2524.
Liang Hu, Longbing Cao, Guandong Xu, Jian Cao, and Wei Cao. Bayesian Heteroskedastic Choice Modeling on Non-identically Distributed Linkages, ICDM2014.
Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu and Wei Cao. Deep Modeling of Group Preferences for Group-based Recommendation, AAAI 2014, 1861-1867.

Non-IID feature engineering/Non-IID outlier detection

Guansong Pang, Longbing Cao and Ling Chen. Homophily outlier detection in non-IID categorical data, Data Min. Knowl. Discov. 35(4): 1163-1224, 2021.
Guansong Pang, Hongzuo Xu, Longbing Cao and Wentao Zhao. Selective Value Coupling Learning for Detecting Outliers in High-Dimensional Categorical Data. CIKM2017.
Guansong Pang, Longbing Cao, Ling Chen, Huan Liu. Modeling Homophily Couplings for Wrapper-based Noise-resilient Outlier Detection. IJCAI2017.
Guansong Pang, Longbing Cao, Ling Chen. Outlier Detection in Complex Categorical Data by Modelling the Feature Value Couplings. IJCAI2016.
Guansong Pang, Longbing Cao, Ling Chen. Outlier Detection in Complex Categorical Data by Modelling the Feature Value Couplings. ICDM2016.

Some relevant activities on non-IID learning

IJCAI 2023 tutorial Deep Non-IID Learning, 2023
UTS guest lecture slides Shallow to Deep Non-IID Learning: Complex Systems, Behaviors and Data, for subject Technology Research Foundation, 2022
KDD’2022 tutorial Shallow and Deep Non-IID Learning on Complex Data, KDD’2022, with the introduction
IJCAI2019 tutorial Non-IID Learning of Complex Data and Behaviors, with Tutorial Slides
Special Issue on Learning complex couplings and interactions, with IEEE Intelligent Systems, Can Wang, Fosca Giannotti, Longbing Cao: Learning Complex Couplings and Interactions, 36(1):3-5, IEEE Intelligent Systems, 2021.
KDD2017 tutorial Non-IID Learning, with Tutorial Slides, Youtube video part 1 and Youtube video part 2.
CIKM2014 tutorial on Learning Non-IID Big Data, with slides about Non-IIDness Learning/Learning from Non-IID Data/Non-IID Learning (the slides were updated in Nov 2016).
PAKDD2014 tutorial on Non-IIDness Learning in Big Data, with PAKDD2014.
DSAA2017 Special session on non-IID Learning, with DSAA2017.
Workshop on Beyond IID in Information Theory.
ECML 2009 LNIID Workshop on Learning from non-IID data: Theory, Algorithms and Practice, http://ama.liglab.fr/~amini/ecml-wk-lniid.html