
Datacube segmentation via deep spectral clustering


Published 22 July 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Citation: Alessandro Bombini et al 2024 Mach. Learn.: Sci. Technol. 5 035024. DOI: 10.1088/2632-2153/ad622f


Abstract

Extended vision techniques are ubiquitous in physics. However, the data cubes stemming from such analyses often pose a challenge in their interpretation, due to the intrinsic difficulty in discerning the relevant information from the spectra composing the data cube. Furthermore, the huge dimensionality of data cube spectra makes their statistical interpretation a complex task; nevertheless, this complexity encodes a massive amount of statistical information that can be exploited in an unsupervised manner to outline some essential properties of the case study at hand, e.g. it is possible to obtain an image segmentation via (deep) clustering of the data cube's spectra, performed in a suitably defined low-dimensional embedding space. To tackle this topic, we explore the possibility of applying unsupervised clustering methods in encoded space, i.e. performing deep clustering on the spectral properties of datacube pixels. A statistical dimensional reduction is performed by an ad hoc trained (variational) AutoEncoder, in charge of mapping spectra into lower-dimensional metric spaces, while the clustering process is performed by a (learnable) iterative K-means clustering algorithm. We apply this technique to two use cases of different physical origins: a set of macro mapping x-ray fluorescence (MA-XRF) synthetic data on pictorial artworks, and a dataset of simulated astrophysical observations.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The advent of machine learning (ML), and more specifically deep learning, has revolutionised numerous fields, from computer vision to natural language processing [1]. In recent years, these powerful computational tools have also begun to reshape the field of applied physics.

Deep learning, a subset of ML, employs artificial neural networks with multiple layers (hence ‘deep’) to model and understand complex patterns. In the realm of applied physics, deep learning can be used to recognise patterns in large datasets, predict system behaviour, or control physical systems.

One of the key advantages of deep learning in applied physics is its ability to process and learn from vast amounts of data, often surpassing traditional computational physics methods in both speed and accuracy. This is particularly useful in fields where experimental or simulation data is abundant, such as astrophysics or nuclear physics applied to cultural heritage.

In astrophysics, the growing influx of data from advanced astronomical facilities calls for the use of deep learning algorithms, which excel in modelling complex patterns and handling large-scale data, often outperforming traditional data analysis methods (see [2, 3], and references therein). In the research field of galaxy formation and evolution, deep learning methods, such as convolutional neural networks, have shown remarkable proficiency. Recent studies [4–6] have employed these methods for more accurate and efficient galaxy morphological classification, a critical aspect in understanding galaxy evolution. Similarly, deep learning has revolutionised the study of gravitational lensing. For instance, the work [7] illustrates how deep neural networks can analyse Hubble Space Telescope images to measure galaxy mass distribution more precisely, thus enhancing our understanding of cosmic structures. The field of exoplanet discovery has also greatly benefited from deep learning (see e.g. [8, 9]). As an example, [10] have successfully applied neural networks to Kepler mission data, significantly improving the detection rate of new exoplanets. In cosmology, the analysis of the cosmic microwave background has been revolutionised by deep learning (see, e.g. [11, 12] and references therein), as highlighted in recent works [13, 14], which have extracted subtle cosmological signals from complex datasets more effectively than traditional methods. Moreover, the application of deep learning to transient astrophysical phenomena, such as fast radio bursts and supernovae, has enabled rapid data processing and analysis, crucial for timely scientific observations (e.g. [15, 16]). In summary, the growing field of deep learning in astrophysics is not just a technical advancement but a necessary evolution to keep pace with the ever-growing complexity and volume of astronomical data.

In the field of nuclear physics applied to cultural heritage (for a nice introduction to the subject, see, e.g. [17–24] and references therein), deep learning can be used to analyse data from non-destructive testing techniques, such as neutron imaging or gamma-ray spectroscopy. For example, it can help identify the composition of ancient artifacts or detect hidden structures in historical buildings, providing valuable insights without damaging these precious objects. This involves the use of advanced algorithms to statistically analyse the data obtained from imaging using nuclear techniques. These algorithms can identify patterns and correlations in the data, providing valuable insights into the artwork's composition, age, and condition [25–38] (for other ML approaches in cultural heritage, see [39], and references therein).

Out of many deep learning methods, deep clustering is one approach that may be relevant for the topic at hand (for nice reviews of deep clustering, see [40–45], and references therein). Deep clustering is a branch of unsupervised learning that combines traditional clustering methods with deep learning.

The goal of classical clustering is to partition a dataset into groups, called clusters, such that data points in the same cluster are more similar to each other than to those in other clusters. Traditional clustering methods, such as K-means [46], rely on predefined distance metrics to measure the similarity between data points. However, these methods often fail to capture complex patterns in high-dimensional data.

This is where deep clustering comes into play. Deep clustering leverages the power of deep neural networks to learn a suitable data representation for clustering in an end-to-end fashion. The deep neural network is trained to map the input data into a lower-dimensional space where the traditional clustering algorithm is applied. The key idea is to learn a mapping that makes the clustering task easier in the transformed space.

There are several approaches to deep clustering, including AutoEncoder (AE)-based methods [47, 48], deep embedded clustering [49, 50], and methods based on self-supervised learning [51–54]. AE-based methods use an AE [55] to learn a compressed representation of the input data, and then perform clustering on the compressed data. Deep embedded clustering extends this idea by jointly optimizing the AE and the clustering objective.

In self-supervised learning methods, auxiliary tasks are designed to exploit the inherent structure of the data, and the learned representations are used for clustering. For example, a network might be trained to predict the rotation of an image, and the learned features are then used for clustering.

Deep clustering has shown promising results in various applications. We explore here the possibility of using such a method to perform datacube segmentation via deep spectral clustering.

In section 2, we describe the deep embedding models we will use. In section 3, we train deep embedding models on two use cases, using two synthetic datasets: an astrophysical one, and a heritage science one. Finally, in section 4 we report the conclusions and the planned future steps.

2. Methods: deep clustering with AEs

The core idea of this report is to obtain datacube segmentation by performing unsupervised clustering on the dimensionally reduced latent space. The whole process is self-supervised, since only spectra are used during training. The whole architecture comprises:

  • (i)  
    An Encoder: $\mathrm{Enc} : X \in \mathcal{M}\subseteq \mathbb{R}^{d = 512} \mapsto \mu \in \mathcal{N}\subseteq \mathbb{R}^{n \ll 512}$, $\mu = \mathrm{Enc} [X]$;
  • (ii)  
    A clustering algorithm in latent space: $\mathrm{IKMeans}: \mu^{(i)} \mapsto C_{I = 1, \ldots N}$;
  • (iii)  
    A Decoder: $\mathrm{Dec} : \mu \in \mathcal{N}\subseteq \mathbb{R}^{n \ll 512} \mapsto X \in \mathcal{M}\subseteq \mathbb{R}^{d = 512} $, $\tilde X = \mathrm{Dec} [\mu]$;

Having a well-trained architecture of that sort, starting from a datacube $\mathcal{I}_{x,y; e}$ of shape $(\mathrm{width}, \mathrm{height}, \mathrm{depth})$, where the first two indices represent the $x$- and $y$-pixel position, while e is the spectral index, we can extract the $\mathrm{height} \cdot \mathrm{width}$ spectra $X^{(i = 1,\ldots \mathrm{height} \cdot \mathrm{width})}$, each of dimension $\mathrm{depth}$, pass them to the Encoder to get a lower-dimensional statistical representation of the data, $\mu^{(i)} = \mathrm{Enc}[X^{(i)}]$, and perform a clustering on those, to get a set of clusters $C_{I = 1, \ldots N}$ and their barycenters $c_I = \frac{1}{\# C_I} \sum_{j | \mu^{(j)}\in C_I} \mu^{(j)}$. By using the Decoder, it is now possible to see how each barycenter maps into the spectral space, $\bar{X}_I = \mathrm{Dec}[c_I]$; furthermore, it is possible to create binary maps of the clusters onto the image, by retaining the pixel-to-index relation $(x,y) \leftrightarrow i$. We will show that those binary maps effectively represent an unsupervised segmentation of the datacubes.
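To make the pipeline concrete, the following Python sketch traces these steps end-to-end; `encoder`, `decoder` and `ikmeans` are placeholders for the trained modules described in the following subsections, not the released implementation.

```python
import torch

def segment_datacube(datacube, encoder, decoder, ikmeans):
    """Unsupervised segmentation of a (width, height, depth) datacube."""
    w, h, d = datacube.shape
    spectra = datacube.reshape(w * h, d)      # one depth-dimensional spectrum per pixel
    with torch.no_grad():
        mu = encoder(spectra)                 # low-dimensional representations mu^(i)
        labels, n_clusters = ikmeans(mu)      # clusters C_I in latent space
        # barycenters c_I and their decoded spectra Dec[c_I]
        centers = torch.stack([mu[labels == I].mean(dim=0) for I in range(n_clusters)])
        decoded_barycenters = decoder(centers)
    # binary cluster maps via the pixel-to-index relation (x, y) <-> i
    binary_maps = torch.stack([(labels == I).reshape(w, h) for I in range(n_clusters)])
    return binary_maps, decoded_barycenters
```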

In the following, we are going to describe the implementation of the deep clustering architecture, based on the deep clustering network architecture of [49].

2.1. AEs architecture

The first ingredient is an AE architecture [1, 56–58], based either on plain multi-layer perceptrons (MLPs) with drop-out layers, or on self-normalising neural networks (SNNs) [59]. The AE class can be instantiated with a tunable number of layers, while the node counts are fixed algorithmically, starting from the input size and dividing either by $2\cdot k$ (where k is the layer index) or by $2^k$, down to the latent space dimension n. The Encoder and Decoder are instantiated with a specular architecture, i.e. the first Encoder layer has the same size as the last Decoder one, and so on inwards. The layers' weights are initialised using the Kaiming-He initialisation [60]. For the MLP module, the layers' activation is a ReLU and the dropout probability can be tuned, while the activation for the SNN module is, of course, a SELU. For a visual representation of an AE, see figure 1.
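A minimal PyTorch sketch of such a builder is reported below; the `make_mlp` helper and its arguments are illustrative, with the example sizes taken from the astrophysical model of section 3.1.

```python
import torch.nn as nn

def make_mlp(dims, snn=False, p_drop=0.1):
    """Stack of Linear layers with Kaiming-He initialisation; ReLU + Dropout
    for the plain MLP variant, SELU + AlphaDropout for the SNN variant."""
    layers = []
    for k, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        lin = nn.Linear(d_in, d_out)
        nn.init.kaiming_normal_(lin.weight)   # Kaiming-He initialisation
        layers.append(lin)
        if k < len(dims) - 2:                 # no activation after the last layer
            layers.append(nn.SELU() if snn else nn.ReLU())
            layers.append(nn.AlphaDropout(p_drop) if snn else nn.Dropout(p_drop))
    return nn.Sequential(*layers)

# Encoder and Decoder are specular; example sizes as in section 3.1:
encoder = make_mlp([1024, 512, 256, 32, 3], snn=True)
decoder = make_mlp([3, 128, 256, 1024], snn=True)
```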

Figure 1. Graphical representation of an AE architecture.

The reconstruction loss used is the mean-squared error,

$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i = 1}^{N} \big\Vert \tilde X^{(i)} - X^{(i)} \big\Vert_2^2 . \qquad (1)$

2.2. Iterative K-means

The clustering algorithm performed in the embedding space $\mathcal{N}$ is an iterative version of the standard K-means clustering algorithm [46], where the number of clusters K is obtained by optimising the silhouette score [61]. The K-means initialisation can be selected to be either random points or k-means++ [62], while the algorithm implementation is the standard Lloyd's algorithm [63].

The iterative selection works as explained in algorithm 1 (dubbed IKMeans; see footnote 5).

Algorithm 1. Iterative K-means.
 1: procedure Iterative K-means(X, Ni , Nf )
 2:  for $k \in [N_i, N_{f}]$ do     $\rhd$ Iterate over k clusters
 3:    $X^{(i)} \in C_{I = 1, \ldots k} \leftarrow \mathrm{KMeans}[X]$;     $\rhd$ Perform K-means
 4:    $s_k \leftarrow \mathrm{silhouette}[C_I]$;     $\rhd$ Compute silhouette score
 5:  Select $\hat{k} : \max_{k} s_k = s_{\hat{k}}$     $\rhd$ Pick k
 6:  return $s_{\hat{k}}, C_{I = 1, \ldots \hat{k}}$

The Silhouette score is computed as [65]

$\mathrm{silhouette}[C_{I = 1, \ldots k}] = \frac{1}{N} \sum_{i = 1}^{N} \frac{b_I(i) - a_I(i)}{\max \left\{ a_I(i), b_I(i) \right\}} , \qquad (2)$

i.e. $a_I(i)$ is the mean intra-cluster distance of the point i, while $b_I(i)$ is the smallest mean distance of the point i to all points in any other cluster (the extra-cluster distance). Notice that $\mathrm{silhouette}[C_{I = 1, \ldots k}] \in [-1, +1]$, where values closer to −1 indicate very ill-formed clusters, while values closer to +1 indicate well-formed clusters.
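A functional sketch of algorithm 1 written with scikit-learn is reported below; the actual IKMeans used in this work is a PyTorch module (see section 2.3), so this only illustrates the cluster-number selection.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def iterative_kmeans(X, n_min=2, n_max=10, init="k-means++"):
    """Run K-means for k in [n_min, n_max]; keep the k maximising the silhouette."""
    best_s, best_labels = -1.0, None        # silhouette is bounded in [-1, +1]
    for k in range(n_min, n_max + 1):       # iterate over candidate cluster numbers
        labels = KMeans(n_clusters=k, init=init, n_init=10).fit_predict(X)
        s = silhouette_score(X, labels)     # mean silhouette over all points
        if s > best_s:
            best_s, best_labels = s, labels
    return best_s, best_labels              # (s_khat, C_{I = 1, ..., khat})
```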

2.3. Deep clustering

Assembling the ingredients introduced in the sections above, the AE architecture and the iterative K-means clustering, it is possible to build a deep clustering AE, by defining the total loss as

$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \gamma \, \mathcal{L}_{\mathrm{sil}} , \qquad (3)$

where we have defined

$\mathcal{L}_{\mathrm{sil}} = 1 - \mathrm{silhouette}[C_{I = 1, \ldots \hat{k}}] . \qquad (4)$

It is important to notice that the implementation of the IKMeans clustering algorithm has to be written in a way that does not break the back-propagation algorithm. We implemented IKMeans as a PyTorch module, which ensures adherence to the differentiable programming paradigm.
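As an illustration, a simplified silhouette term can be written with differentiable torch operations, where distances carry gradients to the embeddings while the hard cluster labels are treated as constants (a sketch, not the released module; it assumes every cluster label occurs in the batch):

```python
import torch

def silhouette_loss(z, labels, n_clusters):
    """1 - mean silhouette, computed with differentiable torch ops on embeddings z."""
    dist = torch.cdist(z, z)                       # pairwise distances in latent space
    scores = []
    for i in range(z.shape[0]):
        same = labels == labels[i]
        a = dist[i][same].sum() / (same.sum() - 1).clamp(min=1)   # mean intra-cluster distance
        b = min(dist[i][labels == J].mean()                       # nearest other cluster
                for J in range(n_clusters) if J != labels[i])
        scores.append((b - a) / torch.max(a, b))
    return 1.0 - torch.stack(scores).mean()        # low when clusters are well formed

# total loss in the spirit of equations (3)-(4):
# loss = mse_loss + gamma * silhouette_loss(z, labels, n_clusters)
```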

2.4. A deep clustering variational AE? Variational deep embedding

This approach suggests that we can move even further. We can use the approach of β-VAE [66] (for more details on Variational AEs, see [1, 57, 67, 68], and references therein). The use of β-VAE has been explored in the context of deep clustering in a few seminal papers [69–71], giving rise to what is called Variational Deep Embedding (VaDE) models [72]. For a visual representation of a VaDE model, see figure 2.

Figure 2. A visual representation of the deep clustering architecture.

In addition, to address the well-known issues of posterior collapse [73–76] and the information preference problem [77] during training, we employ an infoVAE architecture [78, 79], i.e. instead of using the Kullback-Leibler divergence, we use the maximum-mean discrepancy (MMD) approach [80, 81]. MMD is a framework to quantify the distance between two distributions by comparing all of their moments. We implemented it using the kernel trick: letting $k( \cdot, \cdot)$ be any positive definite kernel, the MMD between two distributions p and q is defined as

$\mathrm{MMD}(p \, \Vert \, q) = \mathbb{E}_{x, x^{\prime} \sim p}\big[k(x, x^{\prime})\big] + \mathbb{E}_{y, y^{\prime} \sim q}\big[k(y, y^{\prime})\big] - 2 \, \mathbb{E}_{x \sim p, y \sim q}\big[k(x, y)\big] . \qquad (5)$

Additionally, we allow β and/or γ to be epoch-dependent, in order to dynamically tune how the MMD and the Silhouette terms affect the reconstruction during the training. The loss we use is

$\mathcal{L}(i) = \mathcal{L}_{\mathrm{MSE}} + \beta(i) \, \mathcal{L}_{\mathrm{MMD}} + \gamma(i) \, \mathcal{L}_{\mathrm{sil}} , \qquad (6)$

where i is the ith training epoch, and where $\mathcal{L}_{\mathrm{MMD}} \equiv \mathrm{MMD} (\mathrm{Enc}[X] \, \Vert \, \mathcal{N}(0, \unicode{x1D7D9}_d) )$.
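A sketch of the MMD estimator with a Gaussian (RBF) kernel follows; the kernel choice and its bandwidth are illustrative assumptions, not necessarily those of the released code.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of samples."""
    return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))

def mmd(p_samples, q_samples, kernel=rbf_kernel):
    """Biased estimator of MMD^2(p || q) via the kernel trick, as in equation (5)."""
    return (kernel(p_samples, p_samples).mean()
            + kernel(q_samples, q_samples).mean()
            - 2 * kernel(p_samples, q_samples).mean())

# L_MMD of equation (6): compare Enc[X] with samples from the N(0, 1_d) prior,
# e.g. z = encoder(x); loss_mmd = mmd(z, torch.randn_like(z))
```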

2.5. Training vs evaluation settings

Before moving on to the topic of synthetic dataset generation, and how the deep embedding models are trained and then evaluated, we discuss in depth the logical steps necessary to get from a Deep (Variational) Embedding of 1D spectra to a 3D datacube unsupervised segmentation algorithm.

The model is trained on 1D spectra, i.e. the pixel-wise content of a standard datacube. In fact, we refer to a datacube as the data structure $I_{i,j;e}$ described by three indices, $\{i,j;e\}$; the (i, j) indices refer to the pixel position along the X and Y image axes, while e is the energy channel. In analogy, an RGB image is a datacube with three possible energy channels, corresponding to the Red, Green and Blue pixel content. So, the RGB equivalent of the training process is to train a 1D model to embed single RGB triples into a different, low-dimensional metric space; here, the spectra are multi-channel (see footnote 6), living in a very high-dimensional space ($\mathbb{N}^{512}$ or $\mathbb{N}^{1024}$ once rebinned).

Notice that the model does not learn from images (i.e. it does not have access to geographical information, like pixels proximity or location); it is trained over (synthetic) 1D signals.

The training of the deep neural network model is done using mini-batch gradient descent (with the Adam optimiser [82]). The batch size is kept quite large (512 1D spectra per batch), in order to have a meaningful per-batch clustering; for each batch, an iterative KMeans clustering is performed, to find a per-batch optimum number of clusters, in an overall fixed range of possible cluster numbers (which are training hyperparameters).

The fact that IKMeans is a distance-based clustering algorithm should force the Encoder model to map spectra into a well-behaved low dimensional space, which has to be metric (where the distance map inducing the metric is the same used in the IKMeans algorithm).

Furthermore, for each hyperparameter set training trial, a ReduceLROnPlateau learning rate scheduler is used, in its standard PyTorch implementation. We notice that our training is unsupervised and is based solely on 1D signals (not images, at this level).
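A condensed sketch of this training configuration is reported below; `model`, `ikmeans`, `gamma`, `train_spectra`, `n_epochs` and `validation_loss` are placeholder names, and the model is assumed to return both the reconstruction and the latent embedding.

```python
import torch
import torch.nn.functional as F

loader = torch.utils.data.DataLoader(train_spectra, batch_size=512, shuffle=True)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimiser)

for epoch in range(n_epochs):
    for x in loader:                     # 512 1D spectra per batch
        x_rec, z = model(x)              # reconstruction and latent embedding
        s, labels = ikmeans(z)           # per-batch optimum number of clusters
        loss = F.mse_loss(x_rec, x) + gamma * (1.0 - s)   # cf. equations (3)-(4)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    scheduler.step(validation_loss(model))   # ReduceLROnPlateau on validation loss
```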

After having trained the best model for the task, we try to assess the quality of the trained network. To do so, we generate a synthetic datacube, starting from a seed RGB image (how this is done will be explained in more depth in the following sections); we then flatten the image (thus seeing it as a collection of spectra), pass it to the trained model, and get the embedded version (by reversing the flattening) and the related clustering. We will show in the following that, as an emergent phenomenon, we obtain a sort of semantic segmentation of the datacube in this process.

2.5.1. Alternative approaches: deep embedding vs dimensional reduction

The emergence of a semantic segmentation through clustering in a lower dimensional embedded space can also be obtained using other dimensional reduction algorithms, such as Principal Component Analysis (PCA) [83, 84] (for a modern introduction to the subject, see [85] and references therein) or Manifold Learning algorithms, such as t-distributed stochastic neighbour embedding (t-SNE) [86, 87] or Uniform Manifold Approximation and Projection (UMAP) [88] (for a recent review on manifold learning and non-linear dimensional reduction, see [89–93] and references therein).

While applying dimensional reduction algorithms to a spectral datacube, followed by a clustering algorithm, is a viable method to get an emergent datacube semantic segmentation, this approach is rather different from the one we propose here.

Firstly, a deep embedding model, having a decoder network, is invertible. While PCA is itself invertible, being a linear mapping, manifold learning methods such as t-SNE and UMAP are inherently difficult to invert.

Furthermore, t-SNE and UMAP (as well as other non-linear dimensional reduction algorithms) have a computational cost that scales rapidly with the vector dimension; this means that, usually, we need to perform a preliminary PCA reduction to an intermediate dimensional space [87] (see footnote 7). Thus, while a deep embedding model needs a training phase, which may be slow compared to the other methods, it is fast in its inference phase.

Additionally, the embedding space may not have a metric meaning for non-linear dimensional reduction; this latter point may be relevant, because it means that distances in the embedded, low-dimensional space are not proxies for similarity (i.e. nearby points are not necessarily more similar, nor are far-away points necessarily more different).

PCA, like other linear dimensional reduction methods, is incapable of capturing complex, non-linear relations among data points, due to its linear nature. This also implies that PCA-reduced datasets may be difficult to cluster properly.

Finally, the crucial difference is that, while the deep embedding model is trained on a dataset formed as a collection of 1D spectra, and then applied onto datacubes (seen as a collection of spectra) at inference, the other models do need to be trained for each image (to have a well tailored behaviour).

In this sense, the two approaches are rather different:

  • (i)  
    In deep clustering, we train on an ad-hoc dataset, apply the trained model to datacubes, and see what happens.
  • (ii)  
    In dimensional reduction + clustering, we fit-and-predict on a per-image basis and see what happens.

The two methods are different and, in some cases, may be complementary, since they fulfil different goals.

Finally, if we train a Deep Variational Embedding model, we add a generative spin, which cannot, in any case, be obtained in the dimensional reduction case.

In table 1 we summarise these differences.

Table 1. Summary of differences between deep embedding, PCA and manifold learning algorithms.

                                   Deep embedding   PCA     Manifold learning
Invertible                              ✔             ✔        ✗
Pure high dimensionality origin         ✔             ✔        ✗
Fast inference                          ✔             ✔        ✗
Fast training                           ✗             ✔        ✔ / ✗
Metric embedding space                  ✔             ✔        ✗
Captures non-linear relations           ✔             ✗        ✔
Clustering-apt                          ✔             ✗        ✔
Image independent                       ✔             ✗        ✗
Generative                              ✔ / ✗         ✗        ✗

3. Results: deep clustering synthetic datacubes

Since this work is in an exploratory phase, we applied the developed architecture to two synthetic datasets (ordered alphabetically). The first is a synthetic dataset of astrophysical interest; the second is a synthetic dataset of imaging obtained by nuclear techniques on pictorial artworks.

How each dataset is created is described in the relevant section. In both cases, to generate the datacubes we have used ganX, an open-source Python package to create spectral datacubes starting from RGB images and a well-defined dictionary of RGB-spectral signal relations [35, 94].

All the datasets, trained models, and produced code are publicly available online. See section 5 for more details.

3.1. Deep clustering astrophysical synthetic datacubes

To create a realistic synthetic dataset of datacubes with astrophysical meaning, we started from synthetic 1D signals belonging to three classes: HII regions, planetary nebulae and shock regions. These classes represent three types of ionized gas nebulae found in galaxies, characterized by distinctive morphological features and ionization mechanisms. As a consequence, when observed, they produce optical spectra, which differ mainly in the relative intensities of their spectral lines. Therefore, traditional classification diagnostics are generally based on the differences in specific line ratios but do not exploit the information contained in their entire spectra, e.g. the Baldwin–Phillips–Terlevich diagrams [95]. These techniques generally suffer from theoretical limits (see e.g. [96]), and often require human aid or a specific and complex analysis, thus they are less suitable for handling large datasets, obtained from current instruments with integral field units. This is why ML algorithms, capable of encoding the information from a wide sample of emission-line regions in the full spectra, are a valid choice.

We generate a collection of synthetic nebular optical spectra, based on photoionization models run with CLOUDY [97], retrieved from the Mexican Million Models database (3MdB) [98]. We augment the number of models by combining them linearly using random coefficients. To create mock spectra starting from these models, we first select relevant emission lines to simulate, based on real data. In this case, we create spectra based on observations of the local galaxy M33 obtained with the Multi Unit Spectroscopic Explorer (MUSE) [99]. Hence, the selected emission lines are defined in the MUSE wavelength domain. Simulated emission lines are modeled with Gaussian profiles, whose fluxes are given by the model line intensities, and whose widths consist of two components: an instrumental width, caused by the finite spectral resolution of the telescope, and a physical one, given by the velocity dispersion in the nebulae, where the latter is chosen randomly from a range of possible values. Then, Gaussian noise (with standard deviations based on real data) is added to the spectra. Thus, for each model resulting from the linear combination of two of the 'original' database models, we can create multiple spectra with different noise levels and line widths. This process is based on solid, realistic physical assumptions and includes typical noise distributions and specific instrumental effects, ensuring the resulting spectra are highly realistic.

Then, we used ganX to create a datacube, using each synthetic signal, and assigning it (arbitrarily) to a colour; the mapping is the following: pure red ($[255,0,0]$) for the HII regions, pure green ($[0, 255,0]$) for the planetary nebulae regions, pure blue ($[0,0,255]$) for the shock regions. The RGB image used as a seed to generate the test datacube is reported in figure 3.

Figure 3. RGB seed image for the synthetic test datacube.

We trained a pure AE model on 237,003 synthetic multi-class spectra, divided into 161,001 for the training set, 46,001 for the validation set, and 30,001 for the test set. We limited the size of the dataset to cope with time constraints on top of hardware limitations.

The model hyperparameters are obtained using a plain grid-search method. The Encoder has 3+1 layers, while the Decoder has 3 layers; the latent space has dimension 3, the MLPs are SNNs [59], and the sizes are $1024\mapsto 512 \mapsto 256\mapsto 32\mapsto 3$ for the encoder, and $3 \mapsto 128 \mapsto 256 \mapsto 1024$ for the decoder.

3.1.1. Results

We applied the trained model on the synthetic datacube obtained using ganX [35] on the RGB data of figure 3 with threshold 0.2, giving rise to 4 clusters.

The results are reported in figure 4; the figure reads as follows: on the left, the binary map of the cluster pixels, where pixels are coloured in yellow if they belong to the cluster, and in purple if not; in the middle, the reconstructed signal averaged over the cluster; on the right, the 'true' signal averaged over the cluster. Furthermore, the silhouette score obtained on the test datacube is 0.946110, a very high result, due to the simplicity of the synthetic datacube.

Figure 4. Clustered datacube using the trained model. On the left, the binary map of the cluster pixels. In the middle, the reconstructed signal averaged over the cluster; on the right, the 'true' signal averaged over the cluster.

It is easy to recognise in Cluster 1 the planetary nebulae signals, in Cluster 2 the HII signals, in Cluster 3 the shock regions, and in Cluster 0 the noisy background; since no noise signal was fed to the network during training, the decoder was not able to reproduce that signal; nevertheless, the clustering process in encoded space was able to isolate those signals, as expected.

Another relevant plot is figure 5, which shows the energy-integrated data-cube, i.e. the single-channel image (rendered in the viridis colour scale) obtained by integrating over the whole energy axis. The left image in figure 5 represents the original, 'true' (see footnote 8) spectral energy-integrated datacube $I_{i,j} = \sum_e I_{i,j ; e}$. The middle image is the decoded energy-integrated datacube $D_{i,j} = \sum_e \mathrm{Dec}[\mathrm{Enc}[I_{i,j ; e}]]$; it immediately shows noise reduction (or signal enhancement). The right image in figure 5 represents the embedded energy-integrated datacube $E_{i,j} = \sum_n \mathrm{Enc}[I_{i,j ; n}]$, which further refines the readability.

Figure 5. Energy-integrated spectral data-cubes.

3.1.2. Check against RGB clustering

Since, as mentioned in section 2, the datacube segmentation is an emergent result from clustering in latent space, and comes from a completely unsupervised approach, there is no direct evaluation check against standard semantic segmentation losses and measures, such as the Tversky Index (see, e.g. [100–102], and references therein).

Nevertheless, since we are discussing the application to synthetic datacubes, obtained starting from an RGB image (figure 3), we can apply the same iterative K-means clustering approach to the RGB image, and use it as a baseline against which to compare the clustering obtained with the model.

This is a check that cannot be done in a real scenario, since

  • (i)  
    we may not have an RGB image;
  • (ii)  
    RGB clustering may not be able to capture the relevant information.

Nevertheless, since our synthetic datacube generation pipeline relies on information extracted from the RGB image via a clustering process, we may use it as a proxy for a benchmark.

In figure 6 we report the result of this check, done cluster-by-cluster. The columns are, from left to right: the black-and-white image of the cluster mask (i.e. pixels are white if they belong to the cluster, and black if they do not), as obtained from the deep embedding model; the same, but obtained from the RGB clustering; the image of mislabelled pixels (i.e. pixels appear red if they are mislabelled and blue if they are correctly labelled); the confusion matrix. The confusion matrix reports the counts (from left to right, from top to bottom: true positives, false positives, false negatives, true negatives), and its subtitle reports the computed Tversky Index. The Tversky Index (TI) formula is

$\mathrm{TI} = \frac{\mathrm{TP}}{\mathrm{TP} + \alpha \, \mathrm{FP} + \beta \, \mathrm{FN}} ; \qquad (7)$

we use the parameter set $\alpha = 0.7$, $\beta = 0.3$, as in [103].
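For two binary masks, the index can be computed directly from the confusion counts; in the sketch below (an illustration, with numpy), α weights false positives and β false negatives, matching equation (7).

```python
import numpy as np

def tversky_index(pred, true, alpha=0.7, beta=0.3):
    """Tversky index of equation (7) for two boolean masks."""
    tp = np.logical_and(pred, true).sum()    # true positives
    fp = np.logical_and(pred, ~true).sum()   # false positives
    fn = np.logical_and(~pred, true).sum()   # false negatives
    return tp / (tp + alpha * fp + beta * fn)
```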

Figure 6. Deep clustering segmentation vs RGB clustering for the deep embedding model on the astrophysical synthetic datacube.

Furthermore, in the figure subtitle we report the macro averages for Accuracy, Precision, Recall, F1 score and Tversky Index.

In figure 7 we report the multi-class confusion matrix; each element of the confusion matrix reports the counts normalised to the whole population. In the figure subtitle is reported the weighted average of accuracy, recall, precision, and F1 score.

Figure 7. Multi-class confusion matrix of deep clustering segmentation vs RGB clustering on the synthetic astrophysical datacube.

Notice that the result obtained from the datacube has a large count in the square (true label, predicted label) $ = (4,1)$; this, again, is due to the fact that the network did not see any synthetic background signal during training, and yet it was still capable of clustering (most of) the background pixels.

In appendix B we report a test showing how the trained model behaves under signal degradation (i.e. adding noise such that the peak-signal-to-noise ratio is reduced).

3.2. Deep variational embedding of MA-XRF datacubes on cultural heritage

To create the synthetic dataset of imaging with nuclear techniques applied to cultural heritage assets, we employed the XRF pigment information from [104] (see footnote 9); we have used the pigment palette of [105], extended with a few additional pigments to cover the set of most common Italian medieval choir book pigments [106]. This allows us to create seven synthetic palettes (intended as different ensembles of pigment signals and associated RGB colours) with the nearest possible historical coherence within the synthetic dataset.

The RGB images are scraped from the Web Gallery of Art [107]. In particular, we will show the application of the trained DNN on a test sample (never seen by the network during training): a miniature representing the Portrait of an Old Humanist, contained in the Codex Heroica by Philostratus, illuminated by the Florentine painter Attavante degli Attavanti (circa 1487–1490) (see footnote 10). The original is held at the National Széchényi Library, Budapest.

We trained the model on 891,276 synthetic spectra, coming from 17 RGB miniatures, divided into 576,708 for the training set, 209,712 for the validation set, and 104,856 for the test set. We limited the size of the dataset to cope with time constraints on top of hardware limitations. We plan to move to the whole synthetic dataset we built out of 1636 RGB images (around $10^8$ spectra).

The model hyperparameters are obtained using a plain grid-search method. The Encoder has 4+1 layers, while the Decoder has 4 layers; the latent space has dimension 3, the MLPs are SNNs [59], and the sizes are $512\mapsto 256\mapsto 128\mapsto 64\mapsto 32\mapsto 3$ for the encoder, and $3 \mapsto 64 \mapsto 128 \mapsto 256 \mapsto 512$ for the decoder.

Furthermore, $\gamma(i) = 0.01~\forall i$, while $\beta(i) = 0.01 \cdot \Theta(i - 30)$, where $\Theta$ is the Heaviside step function and i is the training epoch.

3.2.1. Results

We applied the trained model on the synthetic XRF datacube obtained using ganX [35] on the RGB data of figure 8(a) with threshold 0.6, which gave rise to 4 RGB clusters (figure 8(b); see footnote 11).

Figure 8. Original RGB and clustered RGB of the test sample used in the text. It is a miniature contained in the Codex Heroica by Philostratus, illuminated by the Florentine painter Attavante degli Attavanti (circa 1487–1490).

The results are reported in figure 9; the figure reads as follows: on the left, the binary map of the cluster pixels, where pixels are coloured in yellow if they belong to the cluster, and in purple if not. In the middle, the reconstructed XRF averaged over the cluster is reported, i.e.

$\bar{X}^{\mathrm{rec}}_I = \frac{1}{\# C_I} \sum_{a | \mu_a \in C_I} \mathrm{Dec}[\mu_a] , \qquad (8)$

where $\mu_a = \mathrm{Enc}[X_a]$.

Figure 9. Clustered XRF using the trained variational deep embedding model. On the left, the binary map of the cluster pixels. In the middle, the reconstructed XRF averaged over the cluster; on the right, the 'true' XRF averaged over the cluster.

Similarly, on the right of figure 9, and in dash-dotted green in figure 10, we report the 'true' XRF averaged over the cluster (see footnote 12), as well as the colormap of the flattened 'true' XRF, i.e.:

$\bar{X}^{\mathrm{true}}_I = \frac{1}{\# C_I} \sum_{a | \mu_a \in C_I} X_a . \qquad (9)$

Figure 10. Plot reporting the reconstructed, latent and 'true' averages plotted against each other in the same plot.

Figure 10 shows additional plots: in solid red, the aforementioned average over the cluster in the reconstructed space; in dashed gold, the decoded average in latent space, i.e.

$\bar{X}^{\mathrm{lat}}_I = \mathrm{Dec}\left[ \frac{1}{\# C_I} \sum_{a | \mu_a \in C_I} \mu_a \right] = \mathrm{Dec}[c_I] . \qquad (10)$

For a geometric intuition, the former is the center-of-mass of the reconstructed cluster, seen as a set of vectors in the high-dimensional original space; the latter, instead, is the reconstruction (through the decoder) of the center-of-mass of the cluster computed in the low-dimensional latent space.

Evidently, the two do not match, due to the non-linearity of the decoder. Nevertheless, from figure 9 we may see that the relevant features (i.e. the highest peaks) are related in the two pictures, while the gold dashed line presents a noisier behaviour outside the relevant features, as expected (since there is no suppression of noise due to the dimensionality of the space in the latter case, and thus we have an apparent noise enhancement).

In the background of those plots, a flattened representation of the whole datacube is reported, using the jet colormap. On the Y-axis we see the pixel id (i.e. a linear combination of the (i, j) pixel indices, namely $\mathrm{id}(i,j) = i + j\cdot \mathrm{width}$), while on the X-axis we see the energy channel depth (which is shared with the aforementioned plots).

Figure 10 shows the three averages on the same plot, and the background is the absolute difference of the flattened 'true' vs reconstructed data-cubes.

Please notice that the iterative K-means in latent space has correctly, and independently, identified 4 clusters, i.e. the same number of clusters identified by a similar, but completely unrelated, iterative clustering algorithm performed in RGB space, which was used to generate the synthetic XRF.

As in the previous case, we report in figure 11 the energy-integrated data-cube. It immediately shows an unexpected noise reduction (or signal enhancement), which adds readability to the image. Finally, the right image in figure 11 represents the embedded energy-integrated datacube $E_{i,j} = \sum_n \mathrm{Enc}[I_{i,j ; n}]$, which further refines the readability. This is expected, since the non-linear low-dimensional embedding should project away the less relevant directions, especially noise.

Figure 11. Energy-integrated spectral data-cubes.

The last plot we present is figure 12, where we report, in the first row, the masked clustered RGB appearing in figure 8(b). The masking is done such that a pixel is black if it does not belong to the cluster, and coloured if it does. For example, in the first cluster, the structure of the illuminated letter Q emerges, which corresponds to the blueish cluster of figure 8(b). Indeed, each cluster has an average RGB colour similar to that of the corresponding RGB cluster of figure 8(b). The second row is just the cluster representation in a black-and-white binary colour map.

Figure 12. Masked clustered RGB image using the clusters' binary maps as masks.

3.2.2. Check against RGB clustering

We are now in a position to perform the same evaluation as the one done in section 3.1.2; we thus apply the same iterative K-means clustering approach to the RGB image, and use it as a baseline against which to compare the clustering obtained with the model for the cultural heritage dataset.

Again, in figure 13 we report the result of the check, on a cluster-by-cluster basis. We recall that the columns are, from left to right: the black-and-white image of the cluster mask (i.e. pixels are white if they belong to the cluster, and black if they do not), as obtained from the deep embedding model; the same, but obtained from the RGB clustering; the image of mislabelled pixels (i.e. pixels appear red if they are mislabelled and blue if they are correctly labelled); the confusion matrix. The confusion matrix reports the counts (from left to right, from top to bottom: true positives, false positives, false negatives, true negatives), and its subtitle reports the computed Tversky Index.

Figure 13. Deep clustering segmentation vs RGB clustering for the deep variational embedding model on the cultural heritage synthetic datacube.

In figure 14 we report the multi-class confusion matrix; each element of the confusion matrix reports the counts normalised to the whole population. In the figure subtitle is reported the weighted average of accuracy, recall, precision, and F1 score.

Figure 14. Multi-class confusion matrix of deep clustering segmentation vs RGB clustering on the cultural heritage synthetic datacube.

4. Conclusions

In this contribution, we have demonstrated the feasibility of training a deep variational embedding model to perform spectral datacube segmentation through deep clustering in latent space. This approach leverages the power of deep learning to analyse and categorise (in a self-supervised manner) complex spectral datacubes, which are multi-dimensional arrays of data commonly used in various scientific fields.

We conducted our experiments using two models that were trained on two synthetic datasets. These models were designed to test the effectiveness of our approach in two distinct scenarios: an astrophysics case and a Heritage Science case.

In the astrophysics case, our deep embedding model was tasked with analysing and segmenting synthetic spectral datacubes that represent astronomical observations. These datacubes contain spectra of simulated ionised gas nebulae belonging to different classes, namely HII regions, shock regions and planetary nebulae, and our model was able to successfully segment this data, demonstrating its potential for use in astrophysical research.

In the Heritage Science case, our deep variational embedding model was used to analyse and segment synthetic spectral datacubes of MA-XRF imaging on medieval Italian manuscripts. Such methods are often used in the study and preservation of cultural heritage artifacts (for example, but not limited to, pictorial artworks), and our model’s successful segmentation of this data shows its potential for aiding conservation scientists in their analysis.

In both cases, our deep variational embedding model was able to achieve successful segmentation of the spectral datacubes. This not only validates our approach but also opens up new possibilities for the application of deep learning techniques to the analysis of spectral datacubes in various scientific fields. Our findings pave the way for further research and development in this area, with potential benefits in both the astrophysical and Heritage Science fields.

We are eagerly anticipating the application of this methodology to real-world data, encompassing both the astrophysical and heritage science domains. This will be achieved by fine-tuning the models we have trained on synthetic data, adapting them to handle the complexities and nuances of real-world data.

Furthermore, we plan to use the cloud-based distributed computing infrastructure currently being developed at Istituto Nazionale di Fisica Nucleare (INFN) and ICSC. Using software tools for distributed computing, such as Dask, will facilitate distributed inference, enabling us to rapidly compress spectral datacubes. This approach not only enhances the efficiency of our model but also significantly reduces the computational resources required, making the process more accessible and scalable.

Acknowledgments

This work is supported by ICSC—Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by European Union—NextGenerationEU, by the European Commission within the Framework Programme Horizon 2020 with the project ‘4CH—Competence Centre for the Conservation of cultural heritage’ (GA n.101004468 - 4CH) and by the project AIRES-CH—Artificial Intelligence for digital REStoration of cultural heritage jointly funded by the Tuscany Region (Progetto Giovani Si) and INFN.

The work of AB was funded by Progetto ICSC—Spoke 2 - Codice CN00000013 - CUP I53C21000340006 - Missione 4 Istruzione e ricerca—Componente 2 Dalla ricerca all’impresa—Investimento 1.4.

The work of FGAB was funded by the research grant titled ‘Artificial Intelligence and Big Data’ and funded by the AIRES-CH Project cod. 291514 (CUP I95F21001120008).

Data availability statement

The code used in this project can be found at the ICSC Spoke 2 GitHub repository [109].

The synthetic dataset [110] and trained models [111] are available at the INFN Open Access Repository service.

The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.15161/oar.it/143545.

Appendix A: Datacube segmentation using iterative K-means clustering after non-linear dimensional reduction with UMAP

In this appendix we report a single-datacube application of the alternative pipeline described in section 2.5.1: first, we perform a non-linear dimensional reduction on the datacube spectra using the Uniform Manifold Approximation and Projection (UMAP) algorithm [88]; then, we perform the iterative K-means clustering to get an emergent datacube segmentation, as discussed for the deep embedding models in section 2.
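A sketch of this alternative pipeline, assuming the reference umap-learn package and the silhouette-based selection sketched in section 2.2, reads:

```python
import umap  # the umap-learn package

def umap_segmentation(datacube, n_components=3):
    """Per-datacube fit-and-predict: UMAP reduction followed by iterative K-means."""
    w, h, d = datacube.shape
    spectra = datacube.reshape(w * h, d)
    embedding = umap.UMAP(n_components=n_components).fit_transform(spectra)
    score, labels = iterative_kmeans(embedding)   # see the sketch in section 2.2
    return labels.reshape(w, h), score
```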

A.1. Astrophysical datacube

We perform the same check done in section 3.1.2 for the deep embedding model, using the standard UMAP code implementation of [88].

The results of the clustering are reported in figures A1 and A2. The results are overall less noisy and have higher scores with respect to those of the deep embedding model, reported in figures 6 and 7. Nevertheless, the UMAP algorithm was fitted on the whole datacube image, and thus it was also informed with the background signal. That was not the case for the deep embedding model, which was nevertheless capable of correctly clustering most of the background points.

Figure A1. UMAP+clustering segmentation vs RGB clustering on the astrophysical synthetic datacube.

Figure A2. Multi-class confusion matrix of UMAP segmentation vs RGB clustering on the synthetic astrophysical datacube.

A.2. Cultural heritage datacube

We now perform the same check done in section 3.2.2, but with UMAP as the dimensional reduction algorithm. A similar approach was hinted at in [24], but not described in depth.

The results of the clustering are reported in figures A3 and A4. The results are in line with those obtained in section 3.2.2, even if slightly worse; this may be due to the fact that this test case has a more complex spectral composition, as well as the fact that all the pixel signals are statistically similar to the ones used in the training phase: there are no out-of-distribution regions as in the previous case, where the background-like spectral signal had not been seen during training by the deep embedding model.

Figure A3. UMAP+clustering segmentation vs RGB clustering on the CH synthetic datacube.

Figure A4. Multi-class confusion matrix of UMAP segmentation vs RGB clustering on the synthetic CH datacube.

Nevertheless, as described in section 2.5.1, while it is possible to obtain datacube segmentation with a dimensional reduction algorithm followed by embedded-space clustering, the features of the two approaches are different; they are thus not competing, but rather complementary, approaches.

Appendix B: Study of model resilience against signal degradation

In the astrophysics case, since the model used for generating the synthetic signal is capable of noise modelling, it allows us to evaluate the resilience of the trained model of section 3.1 under signal degradation, obtained by reducing the peak-signal-to-noise ratio, without further re-training. This is possible because AEs can also be used to remove noise (these methods are often called denoising AEs [112]).

In the following, we use the formulae

$\mathrm{MSE} = \frac{1}{d} \sum_{e = 1}^{d} \left( X_e - Y_e \right)^2 , \qquad (\mathrm{B.1})$

$\mathrm{PSNR} = 10 \, \log_{10} \left( \frac{\max(X)^2}{\mathrm{MSE}} \right) , \qquad (\mathrm{B.2})$

where X is the original signal, and Y the one where we have added noise.
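A sketch of the degradation procedure, under the assumption of white Gaussian noise tuned to a target PSNR via the formulae above, reads:

```python
import numpy as np

def add_noise_to_psnr(x, target_psnr_db, rng=None):
    """Add white Gaussian noise so that PSNR(x, y) is approximately the target."""
    rng = np.random.default_rng() if rng is None else rng
    noise_mse = x.max() ** 2 / 10 ** (target_psnr_db / 10)   # invert equation (B.2)
    return x + rng.normal(scale=np.sqrt(noise_mse), size=x.shape)

def psnr(x, y):
    mse = np.mean((x - y) ** 2)                              # equation (B.1)
    return 10 * np.log10(x.max() ** 2 / mse)
```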

We thus re-perform, for each of the following degradation levels

Equation (B.3)

the same analysis conducted in section 3.1. We report here the results of this analysis.

In figure B1 we report the datacube segmentation as a function of the noise level; the rows represent the clusters, while the columns are defined as follows: the leftmost column, in the viridis colour scale, is the clustering obtained from the RGB image; then, from left to right, we show the corresponding found cluster at increasing values of noise.

Figure B1. Effect of noise addition on the datacube segmentation.

In figure B2 we report how the noise levels affect the multi-class weighted accuracy and F1 score.

Figure B2. Loss of weighted accuracy and F1 score.

We notice that the introduction of noise increases the misclassification of non-background signals, as expected.

Footnotes

5 For an overview of applications of unsupervised metrics in K-means clustering, see [64] and references therein.

6 They should actually be continuous, representing actual spectra (in the visible/near-infrared region for the astrophysical case, in the x-ray region for the cultural heritage case); their discrete nature is of course given by the Analog-to-Digital converter, which bins the data into discretised digital values.

7 In [88], instead of PCA, a preliminary spectral embedding is performed by considering the 1-skeleton of the global fuzzy topological representation as a weighted graph. They show that, empirically, UMAP is faster than t-SNE while scaling with the ambient dimension; still, the computational cost does increase hugely, and non-linearly, even though more slowly.

8 Here, we refer to the 'true' signal as the input signal of the neural network, while with 'reconstructed' we refer to the output of the network.

11 We recall here the non-Riemannian nature of colour spaces [108], which gives rise to the impossibility of using metric approaches in colour space, leaving only distance-based methods.

12 Here we use the quote marks on the word true because the XRF is synthetic. Nevertheless, it is the 'true' XRF in the sense that it is the input from the DNN architecture perspective.
