Biclustering in Data Mining
Last Updated: 05 Oct, 2022
With recent technological advances in areas such as IT and biomedicine, modern computers can produce and store enormous volumes of data, and many practitioners face difficulties in extracting the required information from it. The problem of partitioning objects into groups therefore arises in many areas. Vector partitioning problems consist of partitioning n-dimensional vectors into p parts; such problems arise mainly in data mining.
"Data mining is a broad area covering a variety of methodologies for analyzing and modeling large data."
Partitioning data samples according to some criterion by analyzing patterns is called clustering. Biclustering is the data mining technique that allows simultaneous clustering of the rows and columns of a matrix. Given a set of m samples, each represented by an n-dimensional feature vector, the entire dataset can be represented as a matrix of m rows and n columns. A biclustering algorithm generates biclusters: subsets of rows that exhibit similar behavior across subsets of columns. A biclustering of a dataset is a collection of pairs of sample and feature subsets B = (L1, F1), (L2, F2), ..., (Lr, Fr) such that the collection (L1, L2, ..., Lr) forms a partition of the set of samples and the collection (F1, F2, ..., Fr) forms a partition of the set of features. Each pair (Lk, Fk) is a bicluster.
Types of Biclusters:
- Biclusters with a constant value: Reordering rows and columns groups together rows and columns with similar values. A perfect constant bicluster is a matrix in which all values are equal.
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
- Bicluster with constant values on rows or columns: Before identifying these biclusters, the rows and columns should be normalized.
- Bicluster with constant values on rows:
20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
21.0 | 21.0 | 21.0 | 21.0 | 21.0 |
22.0 | 22.0 | 22.0 | 22.0 | 22.0 |
23.0 | 23.0 | 23.0 | 23.0 | 23.0 |
24.0 | 24.0 | 24.0 | 24.0 | 24.0 |
- Bicluster with constant values on columns:
20.0 | 21.0 | 22.0 | 23.0 | 24.0 |
20.0 | 21.0 | 22.0 | 23.0 | 24.0 |
20.0 | 21.0 | 22.0 | 23.0 | 24.0 |
20.0 | 21.0 | 22.0 | 23.0 | 24.0 |
20.0 | 21.0 | 22.0 | 23.0 | 24.0 |
- Bicluster with coherent values: The entries follow a coherent pattern: in the additive model each row (or column) is obtained from another by adding a constant, while in the multiplicative model rows (or columns) differ by a constant factor.
- Additive:
1.0 | 4.0 | 5.0 | 0.0 | 1.5 |
4.0 | 7.0 | 8.0 | 3.0 | 4.5 |
3.0 | 6.0 | 7.0 | 2.0 | 3.5 |
5.0 | 8.0 | 9.0 | 4.0 | 5.5 |
2.0 | 5.0 | 6.0 | 1.0 | 2.5 |
- Multiplicative:
1.0 | 0.5 | 2.0 | 0.2 | 0.8 |
2.0 | 1.0 | 4.0 | 0.4 | 1.6 |
3.0 | 1.5 | 6.0 | 0.6 | 2.4 |
4.0 | 2.0 | 8.0 | 0.8 | 3.2 |
5.2 | 2.5 | 10.0 | 1.0 | 4.0 |
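As a small check, the additive pattern in the first matrix above can be verified with a short NumPy sketch (an illustration, not part of any standard library API): if the bicluster is additive, every row differs from the first row by a constant.

```python
import numpy as np

# Additive bicluster from the example above: every row differs
# from the first row by a constant offset.
additive = np.array([
    [1.0, 4.0, 5.0, 0.0, 1.5],
    [4.0, 7.0, 8.0, 3.0, 4.5],
    [3.0, 6.0, 7.0, 2.0, 3.5],
    [5.0, 8.0, 9.0, 4.0, 5.5],
    [2.0, 5.0, 6.0, 1.0, 2.5],
])

def is_additive(m, tol=1e-9):
    # Row i minus row 0 should be a constant vector for every i,
    # i.e. its peak-to-peak spread should be (near) zero.
    diffs = m - m[0]
    return bool(np.all(np.ptp(diffs, axis=1) < tol))

print(is_additive(additive))  # True
```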
- Unusually high/low values: The entries may be decimals, integers, etc. In the example below, the top-left 2x2 block holds unusually low (negative) values and the bottom-right 2x2 block holds unusually high values.
-10 | -10 | 0.1 | 0.1 |
-10 | -10 | 0.2 | 0.3 |
0.3 | 0.2 | 10 | 10 |
0.3 | 0.2 | 10 | 10 |
- Submatrices with low variance: In the matrix v below, the entries v22, v23, v32, v33, v42, v43 lie in the narrow range 0.1 to 0.2 and so form a submatrix with low variance, while the remaining entries spread from 0.0 to 0.9.
0.5 | 0.5 | 0.0 | 0.0 |
0.5 | 0.1 | 0.2 | 0.7 |
0.8 | 0.2 | 0.2 | 0.7 |
0.8 | 0.1 | 0.1 | 0.9 |
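A quick NumPy sketch (illustrative only) confirms that this inner submatrix has far lower variance than the matrix as a whole:

```python
import numpy as np

# The matrix v from the example above.
v = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.1, 0.2, 0.7],
    [0.8, 0.2, 0.2, 0.7],
    [0.8, 0.1, 0.1, 0.9],
])

# The low-variance submatrix: rows 2-4, columns 2-3 (0-indexed slices).
sub = v[1:4, 1:3]

print(sub.var())  # variance of the low-variance block
print(v.var())    # variance of the whole matrix, much larger
```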
Bi-Partite Graph:
A graph whose vertex set divides into two disjoint sets V1 and V2 such that every edge joins a vertex in V1 to a vertex in V2. A data matrix can be viewed as the weighted adjacency matrix of a bipartite graph: the rows form one vertex set, the columns form the other, and each nonzero entry is an edge.
Row/Column | C1 | C2 | C3 | C4 |
R1 | 0.1 | 0.0 | 0.0 | 0.2 |
R2 | 0.5 | 0.0 | 0.0 | 0.3 |
R3 | 0.0 | 0.2 | 0.1 | 0.0 |
R4 | 0.0 | 0.2 | 0.0 | 0.2 |
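Reading the table above as a weighted adjacency matrix, a small Python sketch (illustrative names, not a library API) lists the edges of the bipartite graph, i.e. the nonzero entries:

```python
# Weighted adjacency matrix from the table above: row vertices R1-R4,
# column vertices C1-C4.
matrix = [
    [0.1, 0.0, 0.0, 0.2],
    [0.5, 0.0, 0.0, 0.3],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.0, 0.2],
]

# An edge joins row vertex Ri to column vertex Cj whenever entry (i, j) is nonzero.
edges = [(f"R{i+1}", f"C{j+1}", w)
         for i, row in enumerate(matrix)
         for j, w in enumerate(row) if w > 0]

print(edges)  # 8 weighted edges, e.g. ('R1', 'C1', 0.1)
```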
Spectral Co-Clustering:
It takes a bipartite graph as input: the data is divided into two sets of nodes connected by edges. The algorithm finds biclusters whose values are higher than those in the rest of the matrix and rearranges the matrix so that these biclusters lie along the diagonal.
For an input matrix A, the normalized matrix is
An = R^{-1/2} A C^{-1/2}
where R is the diagonal matrix whose i-th entry is the row sum Σj Aij, and C is the diagonal matrix whose j-th entry is the column sum Σi Aij.
The singular value decomposition
An = U Σ V^T
provides the partitions of the rows and columns of A: the left singular vectors give the row partition and the right singular vectors give the column partition. For k biclusters, the first L = ⌈log2 k⌉ singular vectors are taken, and the rows and columns are embedded together as
Z = \begin{bmatrix}
R^{-1/2} U \\
C^{-1/2} V \\
\end{bmatrix}
whose rows are then clustered (for example with k-means) to obtain the row and column labels.
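Assuming scikit-learn is available, the algorithm above is implemented by sklearn.cluster.SpectralCoclustering; a minimal sketch, following the library's standard synthetic-data recipe:

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

# Plant 5 constant-valued biclusters in a 300x300 matrix, add noise,
# and shuffle rows/columns so the block structure is hidden.
data, rows, columns = make_biclusters(
    shape=(300, 300), n_clusters=5, noise=5, shuffle=True, random_state=0
)

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(data)

# Jaccard-based consensus score against the planted biclusters;
# a value near 1.0 means the hidden structure was recovered.
score = consensus_score(model.biclusters_, (rows, columns))
print(f"consensus score: {score:.2f}")
```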
Spectral Biclustering:
It assumes the input matrix has a hidden checkerboard structure: the rows and columns can be partitioned so that the entries of any bicluster in the Cartesian product of row clusters and column clusters are approximately constant.
Types of Normalization:
- Independent row and column normalization: This method makes the rows sum to one constant and the columns sum to a different constant.
- Bistochastization: This method makes both the rows and the columns sum to the same constant.
- Log normalization: The log of the data matrix is taken, L = log A, and then transformed according to
K_ij = L_ij - L_i. - L_.j + L..
where L_i. is the mean of row i of L, L_.j is the mean of column j of L, and L.. is the overall mean of L.
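Assuming scikit-learn is available, sklearn.cluster.SpectralBiclustering implements this checkerboard model, with the normalization selected via its method parameter; a minimal sketch using the library's checkerboard generator:

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard
from sklearn.metrics import consensus_score

# A 300x300 matrix with a hidden 4x3 checkerboard of constant blocks,
# shuffled so the structure is not visible.
data, rows, columns = make_checkerboard(
    shape=(300, 300), n_clusters=(4, 3), noise=10, shuffle=True, random_state=42
)

# method="log" selects the log normalization described above.
model = SpectralBiclustering(n_clusters=(4, 3), method="log", random_state=0)
model.fit(data)

# A score near 1.0 means the checkerboard was recovered.
score = consensus_score(model.biclusters_, (rows, columns))
print(f"consensus score: {score:.2f}")
```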
Biclustering Evaluation:
To compare individual biclusters, the general formula is based on the Jaccard index:
J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
where |A ∩ B| is the number of elements common to both A and B.
The Jaccard index is at its minimum when the biclusters do not overlap at all and at its maximum when they are identical. The consensus score built on it ranges between 0 and 1.0:
0 --> minimum (good)
=> all pairs of biclusters are totally dissimilar, i.e. the biclusters are very well separated.
1.0 --> maximum (not good)
=> occurs when both sets of biclusters are identical.
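A minimal, illustrative Python sketch of the Jaccard index, treating each bicluster as the set of (row, column) cells it covers (the example biclusters are hypothetical):

```python
def jaccard(a, b):
    """Jaccard index J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Two hypothetical biclusters given as the cells they cover.
b1 = {(0, 0), (0, 1), (1, 0), (1, 1)}  # rows {0,1} x columns {0,1}
b2 = {(1, 1), (1, 2), (2, 1), (2, 2)}  # rows {1,2} x columns {1,2}

print(jaccard(b1, b2))  # 1 shared cell of 7 total -> 1/7 ≈ 0.142857
print(jaccard(b1, b1))  # identical biclusters -> 1.0
```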