1 Introduction

Cell counting is a crucial step in biological experiments and medical diagnosis, providing both quantitative and qualitative information on cells. The manual process can be tedious, inefficient and prone to subjective errors, especially when cells cluster or show high variance in shape and contrast. Automation offers a quick and accurate estimation of the cell quantity in a sample. In this paper, we focus on the cell counting task of quantifying the population in fluorescent images. Previous studies [1,2,3,4] used deep learning with a U-Net architecture or a similar one, which maps each input image to a density map whose integral yields the cell count. Another approach is to use an object detection architecture (such as YOLO [5] or Faster R-CNN [6]) and count the detected cells [5, 7].

These methods achieve satisfactory results, but use a large number of parameters even when the task seems much simpler than crowd counting, for example when cells, possibly diverse in shape and partially overlapping, still appear as mere regional maxima over a dark background. Furthermore, density maps are ambiguous regarding what the model considers as a cell, especially in clusters, where the precise location of cells is often lost. This can be a limitation for users wishing to check a posteriori the reliability of the count returned by the model, especially as the deep learning model is a black box whose decisions are hardly explainable. Images where cells appear as bright objects on a dark background are an interesting study case where a much simpler model, built on a priori knowledge, may achieve results similar to those of large deep networks but with higher interpretability, regarding for example the geometrical and contrast criteria used to recognize cells.

In this paper, following and extending [8], we propose to model cells as regional maxima with sufficient dynamic and size. The dynamic of an extremum is, roughly speaking, its depth (see Fig. 1), and well-known morphological methods exist to select extrema with dynamic larger than a threshold h, called h-maxima or h-minima [9]. Accordingly, a simple morphological pipeline was presented in [8], consisting in counting the h-maxima of a size-filtered version of the input image, for a given dynamic h. Since the image resolution is constant in the dataset, as is the cell size, this approach only depends on the choice of the dynamic parameter, which needs, however, to be adapted to each image in order to achieve accurate counting. Taking advantage of the recently introduced geodesic reconstruction layers [10, 11], which provide a neural implementation of the h-maxima computation, a pipeline was designed in [8], where a small convolutional neural network (CNN) predicts the optimal dynamic h for the subsequent morphological layers. This pipeline is trained end-to-end by minimizing a joint loss whose main term focuses on an intermediate output image (the so-called h-transform of a filtered input image) and whose second term, with much smaller weight, penalizes counting errors.

The goal of the present paper is to make the latter pipeline trainable by minimizing the counting errors only, because doing so has several advantages. First, it saves us from pre-computing, for each training image, the intermediate results needed for other loss terms. Second, avoiding the pre-computation of these intermediate outputs makes any preprocessing of the input images trainable, as opposed to having to freeze it when the intermediate outputs are pre-computed with a fixed preprocessing. Third, it greatly alleviates the annotation burden, as only the number of cells is required for each input image, rather than the precise location of each cell (although the proposed method does provide the locations of the counted cells).

To that aim, in this article we investigate in more depth the theoretical aspects of the pipeline proposed in [8], in particular its derivability, so as to find out how it can be trained with gradient descent when minimizing the counting errors only. The main novel contributions of this article with respect to Liu et al. [8] are the following:

  1. A theoretical analysis of the aforementioned pipeline derivability, based on a detailed theoretical review of h-maxima.

  2. A practical solution to overcome the derivability issues revealed by this analysis, allowing for the optimization of the pipeline by minimization of a loss penalizing the counting errors only.

  3. New results on other cell image datasets (Cellpose, Fluorescent Neuronal Cells) with respect to [8].

The paper is structured as follows. In Sect. 2, the theoretical framework of h-maxima is presented, and a simple morphological algorithm using them to count cells, already introduced in [8], is recalled. In Sect. 3, we recall the deep learning architecture of [8], which includes this morphological algorithm, before studying its derivability and proposing a solution for training it with gradient descent even when minimizing the counting errors only. Experiments are conducted in Sect. 4, and their results are analyzed in Sect. 5. Conclusions are drawn in Sect. 6.

2 Setting and Definitions

We assume images to be mappings of the discrete domain \(\Omega {:}{=} ([0, W-1] \times [0, H-1])\cap \mathbb {N}^2\), \(W, H\in \mathbb {N}^*\) to \(\mathbb {R}\). We will denote by \(\mathcal {F}(\Omega ,\mathbb {R})\) the set of such functions and call pixels the elements of \(\Omega \).

2.1 Image Poset

The set \(\mathcal {F}(\Omega ,\mathbb {R})\) is a partially ordered set (or poset) for the partial order \(\le _{\mathcal {F}}\) defined, for any \(f,g \in \mathcal {F}(\Omega ,\mathbb {R})\), by

$$\begin{aligned} f \le _{\mathcal {F}} g \iff f(x) \le g(x) \ \ \forall x\in \Omega . \end{aligned}$$
(1)

In the rest of the paper, we will simply write \(f\le g\) when it is clear we are comparing functions.

The supremum and infimum of two functions f and g are denoted, respectively, by \(f\vee g\) and \(f\wedge g\) and are also induced by the pointwise ordering, i.e.,

$$\begin{aligned} \begin{aligned} \forall x\in \Omega ,\;\;&(f\vee g) (x) {:}{=} f (x) \vee g (x) \\ \text { and }&(f\wedge g) (x) {:}{=} f (x) \wedge g (x) \end{aligned} \end{aligned}$$
(2)

where here \(\vee \) and \(\wedge \) also denote the supremum and infimum on real numbers, \(f (x) \vee g (x) = \sup \lbrace f(x), g(x)\rbrace \) and \(f (x) \wedge g (x) = \inf \lbrace f(x), g(x)\rbrace \). The supremum and infimum of an arbitrary family of functions is defined likewise.

2.2 Connectivity

Throughout the paper, the notion of connectivity between pixels is crucial because the definitions of path and reconstruction operator, given below and extensively used in the paper, rely on it. It is determined by the unit discrete square \(\textrm{B}{:}{=} \lbrace x = (x_1, x_2) \in \mathbb {Z}^2, \; \max (|x_1|, |x_2|) \le 1 \rbrace \), classically referred to as the 8-connectivity structuring element in mathematical morphology.

Denoting by \(\textrm{B}_x {:}{=} x + \textrm{B}= \lbrace x + b, b\in \textrm{B}\rbrace \) the set \(\textrm{B}\) translated at \(x\in \Omega \), we say that \(y\in \Omega \) is a neighbor of x, noted \(y \in \mathcal {N}_x\), if and only if \(y \in \textrm{B}_x\cap \Omega \). Similarly, the neighbors of a set \(X\subset \Omega \) are defined as \(\mathcal {N}_X {:}{=} \cup _{x\in X}\mathcal {N}_x.\)

The notion of neighborhood leads to classical definitions of graph theory, the vertices being here the pixels and the edges being the couples \((x, y) \in \Omega ^2\) such that \(y\in \mathcal {N}_x\). A path \(\gamma \) from x to y of length \(n\in \mathbb {N}\) is a tuple \(\gamma = (u_0, u_1,\dots , u_n)\in \Omega ^{n+1}\)such that \(u_0 = x\), \(u_n = y\) and for any \(0\le k \le n-1\), \(u_{k+1}\in \mathcal {N}_{u_k}\). We will denote by \(\Gamma _{xy}\) the set of paths from x to y, and say that y is connected to x in a subset \(C \subseteq \Omega \) if there is a path from x to y included in C, i.e., \(\lbrace \gamma \in \Gamma _{xy}, \gamma \subseteq C \rbrace \ne \emptyset \). Note that in our setting, since \(\textrm{B}\) is symmetric (\(b\in \textrm{B}\iff -b\in \textrm{B}\)), y is connected to x if and only if x is connected to y, and hence, we shall simply say that x and y are connected. In particular, all pixels are connected in \(\Omega \). Furthermore, the connectivity relation is also reflexive (\(x\in \mathcal {N}_x\) because \(0\in \textrm{B}\)) and transitive. It is therefore an equivalence relation, and the equivalence classes induced by it are the connected components. Finally, we say that a subset \(C \subseteq \Omega \) is connected if any two elements x and \(y \in C\) are connected.

2.3 Morphological Operators

We shall now recall the morphological operators which will be used in the rest of the paper. Note that these definitions are not the most general ones, but adapted to the structuring element \(\textrm{B}\) considered here, which is symmetric.

2.3.1 Dilation, Erosion, Opening and Closing by \(\textrm{B}\)

Given \(f\in \mathcal {F}(\Omega ,\mathbb {R})\), the dilation of f by \(\textrm{B}\), noted \(\delta _\textrm{B}(f)\), is the element of \(\mathcal {F}(\Omega ,\mathbb {R})\) defined by

$$\begin{aligned} \forall x\in \Omega , \;\; \delta _\textrm{B}(f)(x) = \max _{b\in \textrm{B}}f(x+b) = \max _{y\in \mathcal {N}_x}f(y) \end{aligned}$$
(3)

while the erosion of f by \(\textrm{B}\), noted \(\varepsilon _\textrm{B}(f)\), is given by

$$\begin{aligned} \forall x\in \Omega , \;\; \varepsilon _\textrm{B}(f)(x) = \min _{b\in \textrm{B}}f(x+b) = \min _{y\in \mathcal {N}_x}f(y). \end{aligned}$$
(4)

Since \(0\in \textrm{B}\), here \(\delta _\textrm{B}\) is an extensive operator, i.e., \(\forall x\in \Omega ,\; \delta _\textrm{B}(f)(x)\ge f(x)\). (And \(\varepsilon _\textrm{B}\) is anti-extensive, \(\varepsilon _\textrm{B}(f)(x)\le f(x)\).)

The opening and closing of f by \(\textrm{B}\), noted \(\gamma _\textrm{B}(f)\) and \(\varphi _\textrm{B}(f)\), respectively, are the compositions of dilation and erosion, namely \(\gamma _\textrm{B}= \delta _\textrm{B}\circ \varepsilon _\textrm{B}\) and \(\varphi _\textrm{B}= \varepsilon _\textrm{B}\circ \delta _\textrm{B}\).

2.3.2 Geodesic Operators

Given two functions \(f,g \in \mathcal {F}(\Omega ,\mathbb {R})\) such that \(f\le g\), the geodesic dilation of f under g, noted \(\delta _{g}(f)\), is defined by [9, 12]

$$\begin{aligned} \delta _{g}(f):=\delta _{\textrm{B}}(f) \wedge g \end{aligned}$$
(5)

The geodesic dilation can be iterated and for \(p\ge 1\) we note \(\delta _{g}^{(p+1)}(f){:}{=} \delta _{g}\big (\delta _{g}^{(p)}(f)\big )\), with \(\delta _{g}^{(1)} = \delta _{g}\). It is well known and easy to verify that this process converges to a fixed point after a finite number k of iterations. (One can notice that, \(\delta _\textrm{B}\) being an extensive operator, the iterations of geodesic dilation produce an increasing sequence of values \(\delta _{g}^{(p)}(f)(x)\) at each pixel x; furthermore, the possible values for each \(\delta _{g}^{(p)}(f)(x)\) are finite, as they come from those of f and g which both have \(W\times H\) pixels.)

The reconstruction by dilation of g from f, noted \(\texttt {REC} (f, g)\) or \(\texttt {REC} _{g}(f)\), is then:

$$\begin{aligned} \texttt {REC} (f, g):=\texttt {REC} _{g}(f):= \delta _{g}^{(k)}(f) \end{aligned}$$
(6)

where k is the smallest integer such that \(\delta _{g}^{(k)}(f)=\delta _{g}^{(k+1)}(f)\).
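To make these definitions concrete, the following is a minimal NumPy/SciPy sketch of Eqs. (5) and (6); the helper name reconstruction_by_dilation and the use of scipy.ndimage.grey_dilation with a \(3\times 3\) flat window (the 8-connectivity element \(\textrm{B}\)) are our own illustrative choices, not part of the implementation of [8].

```python
# Sketch of the reconstruction by dilation of Eq. (6), obtained by iterating
# the geodesic dilation of Eq. (5) until a fixed point is reached.
import numpy as np
from scipy.ndimage import grey_dilation

def reconstruction_by_dilation(f, g):
    """Return REC(f, g), assuming f <= g pointwise."""
    rec = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    while True:
        dilated = np.minimum(grey_dilation(rec, size=(3, 3)), g)  # delta_g(rec), Eq. (5)
        if np.array_equal(dilated, rec):  # fixed point of Eq. (6) reached
            return rec
        rec = dilated
```

In practice, skimage.morphology.reconstruction(seed, mask, method='dilation') computes the same operator with a more efficient algorithm.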

Alternative formulations of the reconstruction by dilation exist, e.g., based on the level sets of f and g [13]. Here, we mention one based on connectivity, which will be more convenient than (6) to prove the theoretical results of this paper. For any \(x\in \Omega \), one can write

$$\begin{aligned} \texttt {REC} (f, g)(x) = \max _{y\in \Omega , \gamma \in \Gamma _{xy}} \left( f(y) \wedge \min _{z\in \gamma } g(z)\right) . \end{aligned}$$
(7)

To the best of our knowledge, this formulation has not been stated (hence not proved) in the morphological literature, but it can be deduced easily by induction on the length of paths using the sequential definition of (6), and the fact that \(\texttt {REC} (f_1\vee f_2, g) = \texttt {REC} (f_1, g)\vee \texttt {REC} (f_2, g)\).

2.4 Regional Maxima and h-maxima

In this section, besides recalling necessary definitions, we state several theoretical results which will not appear as new to experts in mathematical morphology. However, since we could not find them clearly stated and proved in the literature, we choose to present and prove them here.

Regional maxima Given an image \(f \in \mathcal {F}(\Omega ,\mathbb {R})\), we say that \(M\subset \Omega \), \(M\ne \Omega \), is a regional maximum at level \(t\in \mathbb {R}\) if M is connected, \(f(x) = t\) for all \(x\in M\), and \(f(y)<t\) for any \(y\in \mathcal {N}_M\setminus M\). In words, a regional maximum is a plateau of f surrounded by lower values. Note that this definition imposes that a regional maximum M needs to be strictly included in \(\Omega \), and hence, \(\mathcal {N}_M\setminus M\ne \emptyset \) and \(\Omega \) itself cannot be a regional maximum. We will note \(\mathcal {M}(f)\) the set of regional maxima of f and \(\mathcal{P}\mathcal{M}(f)\) the corresponding set of pixels, that is

$$\begin{aligned} \mathcal{P}\mathcal{M}(f) {:}{=} \bigcup \limits _{M\in \mathcal {M}(f)}M. \end{aligned}$$

Also, since f takes a unique value over a regional maximum M, we shall conveniently note f(M) this value (although M is not a pixel but a set of pixels). Importantly, because of the connectedness property, different regional maxima are necessarily disjoint.

Finally, we will denote by \(\texttt {RMAX} \) the characteristic function of regional maxima, i.e.,

$$\begin{aligned} \texttt {RMAX} (f) {:}{=} \mathbbm {1}_{\mathcal{P}\mathcal{M}(f)} = \sum \limits _{M\in \mathcal {M}(f)}\mathbbm {1}_{M}. \end{aligned}$$
(8)

In the case where f takes values in a quantized set \(\lbrace k\cdot \epsilon , \ 0\le k\le L\rbrace \) for some \(L\in \mathbb {N}^*\) and \(\epsilon > 0\), then regional maxima can be computed thanks to the geodesical reconstruction of f from \(f-\epsilon \). Indeed, in that case one can show [13] that the regional maxima of f are the connected components of

$$\begin{aligned} X_{\max }(f) = \lbrace x\in \Omega , \ f(x) > \texttt {REC} _f(f-\epsilon )(x) \rbrace , \end{aligned}$$
(9)

the set of pixels where the reconstruction is different from (hence smaller than) f.Footnote 1 In that case, \(\mathcal{P}\mathcal{M}(f)=X_{\max }(f)\) and therefore

$$\begin{aligned} \texttt {RMAX} (f) = \mathbbm {1}_{X_{\max }(f)}. \end{aligned}$$
(10)

Dynamic The dynamic [14] of a regional maximum M of level \(t\in \mathbb {R}\) is the smallest height one needs to come down from M in order to reach a higher regional maximum \(M^\prime \) of level \(t^\prime > t\), when there is one. Formally, in that case (refer to Section 6.3.5 in [9], and see Fig. 1)

$$\begin{aligned} \texttt {DYNMAX} (M) {:}{=} \min _{\begin{array}{c} \gamma \in \Gamma _{M M^\prime },\\ f(M^\prime ) > f(M) \end{array}} \left\{ \max _{z\in \gamma } (f(M) - f(z))\right\} \end{aligned}$$
(11)

where \(\Gamma _{M M^\prime } {:}{=} \lbrace \gamma \in \Gamma _{xy}, x\in M, y\in M^\prime \rbrace \) and the variable \(M^\prime \) spans all the regional maxima of f such that \(f(M^\prime ) > f(M)\). Note that since f(M) does not depend on the path \(\gamma \), the \(\max \) term in (11) can be written \(f(M) - \min \limits _{z\in \gamma }f(z)\), and finally

$$\begin{aligned} \texttt {DYNMAX} (M) = f(M) - \max _{\begin{array}{c} \gamma \in \Gamma _{M M^\prime },\\ f(M^\prime ) > f(M) \end{array}}\min _{z\in \gamma }f(z). \end{aligned}$$
(12)

By convention, the dynamic of the global maximum of f can be set to the difference between the maximum and the minimum values of f, denoted by

$$\begin{aligned} \Delta _\Omega f {:}{=} \max \limits _\Omega f - \min \limits _\Omega f. \end{aligned}$$
(13)

The dynamic is commonly used to distinguish between irrelevant maxima caused by noise and significant ones corresponding to underlying bright objects.

Fig. 1

The dynamics of three maxima in a one-dimensional function. Note that in our 1D illustrations, \(\Omega \) is a discrete interval and the structuring element defining the neighborhood of each pixel is \(\textrm{B}= \lbrace -1, 0, 1\rbrace \). Here, each maximum \(M_i\) is reduced to a singleton

h-maxima In order to define the h-maxima of f, we first need to introduce the h-maxima transformation. Given a parameter \(h\in [0,\Delta _\Omega f]\), the \(\texttt {HMAX} _h\) image operator is the reconstruction by dilation of f from \(f-h\), i.e.,

$$\begin{aligned} \texttt {HMAX} _{h}(f) =\texttt {REC} _{f}(f-h) \end{aligned}$$
(14)

As illustrated in Fig. 2a, this transformation suppresses what are called h-domes in [13].

Now, for a given \(h\in [0, \Delta _\Omega f]\), the h-maxima of f, denoted by \(\mathcal {M}_h(f)\), are defined as the regional maxima of \(\texttt {HMAX} _{h}(f)\):

$$\begin{aligned} \mathcal {M}_h(f) {:}{=} \mathcal {M}\big (\texttt {HMAX} _{h}(f)\big ) = \mathcal {M}\big (\texttt {REC} _{f}(f-h)\big ). \end{aligned}$$
(15)

In particular, \(\mathcal {M}_0(f) = \mathcal {M}(f)\) and \(\mathcal {M}_{\Delta _\Omega f}(f) = \emptyset \). Furthermore, as regional maxima of a function, distinct h-maxima are pairwise disjoint. Similarly to earlier, we will note \(\mathcal{P}\mathcal{M}_h(f)\) the set of pixels belonging to h-maxima, that is

$$\begin{aligned} \mathcal{P}\mathcal{M}_h(f) {:}{=} \bigcup \limits _{M\in \mathcal {M}_h(f)}M. \end{aligned}$$

Figure 2b shows in green the characteristic function of the h-maxima of f, noted \(\texttt {EMAX} _h\) for “extended maxima”:

$$\begin{aligned} {\texttt {EMAX} _h(f){:}{=} \mathbbm {1}_{\mathcal{P}\mathcal{M}_h(f)} = \sum _{M\in \mathcal {M}_h(f)} \mathbbm {1}_{M}.} \end{aligned}$$
(16)

Note that \(\texttt {EMAX} _h = \texttt {RMAX} \circ \texttt {HMAX} _h\), and \(\texttt {EMAX} _0 = \texttt {RMAX} \).

Fig. 2

Illustration of the computations of \(\texttt {HMAX} _h(f)\) and \(\texttt {EMAX} _h(f)\) for the same one-dimensional function f as in Fig. 1, and for \(h_2 > h \ge h_3\). a \(\texttt {HMAX} _h\) produces a regional maximum wherever there is a maximum with dynamic greater than h in f. b \(\texttt {EMAX} _h(f)\) is a binary function whose nonzero values indicate where the maxima of \(\texttt {HMAX} _h\), hence the h-maxima of f, are. Note that in our 1D illustrations, \(\Omega \) is a discrete interval, represented by the x axis, and the structuring element defining the neighborhood of each pixel is \(\textrm{B}= \lbrace -1, 0, 1\rbrace \)

Links with the dynamic An important property of h-maxima is that they contain the regional maxima of f with dynamic larger than h. These are the connected components of the sets where f is maximal inside each h-maximum. In turn, the regional maxima with dynamic larger than h determine the value of \(\texttt {HMAX} _h(f)\) over the h-maxima they belong to.

To put this more formally in Proposition 1, let us first introduce some notations. Given \(f\in \mathcal {F}(\Omega , \mathbb {R})\), recall [15] that its upper level set of level \(t\in \mathbb {R}\), which we will denote by \(\chi _t(f)\), is defined by

$$\begin{aligned} \chi _t(f){:}{=} \lbrace x\in \Omega , \ f(x) \ge t\rbrace . \end{aligned}$$
(17)

Also, given a non-empty set \(M\subseteq \Omega \), and \(M_0\) a connected subset of M, we denote by \(\texttt {CC}(M_0, M)\) the unique connected component of M containing \(M_0\). Furthermore, \(\arg \max \limits _M f\) denotes the set \(\lbrace x\in M, f(x) = \max \limits _{y\in M} f(y)\rbrace \). Then, the following proposition holds.

Proposition 1

Let \(f\in \mathcal {F}(\Omega , \mathbb {R})\), \(h\ge 0\) and \(\lambda \in \mathbb {R}\). Then,

$$\begin{aligned} & \left\{ \begin{array}{l} M \in \mathcal {M}_h(f)\\ M_0 \text { connected component of } \arg \max \limits _M f\\ \lambda = {{\texttt {HMAX}}}_h(f)(M) \end{array}\right. \nonumber \\ & \quad \iff \left\{ \begin{array}{l} M_0\in \mathcal {M}(f), {{\texttt {DYNMAX}}}(M_0) > h, \\ \lambda = f(M_0) - h\\ M = {{\texttt {CC}}}\left( M_0, \chi _{\lambda }(f)\right) . \end{array}\right. \end{aligned}$$
(18)

According to Proposition 1, computing the h-maxima makes it possible to detect bright objects of dynamic larger than h. This strategy is well known in the morphological literature and is the one applied to count cells with the simple morphological algorithm described in Sect. 2.5.

Although this result has been at least partially stated in the literature [9], to the best of our knowledge no proof has been published, probably because it may appear completely obvious to experts, who are used to observing it on simple examples like the one of Fig. 3. We provide a proof in “Appendix A.”

Larger h, fewer h-maxima It is well known in the morphological literature, and easy to realize on simple examples, that the number of h-maxima is a decreasing function of h. This comes from a specific ordering relation between \(\mathcal {M}_h\) for different h values. Let us introduce the ordering relation \(\preceq \) on collections of pairwise disjoint connected components of \(\Omega \), defined, for any two such collections \(\mathcal {M}\) and \(\mathcal {M}^\prime \), by

$$\begin{aligned} \mathcal {M}\preceq \mathcal {M}^\prime \iff \forall M \in \mathcal {M}, \exists M^\prime \in \mathcal {M}^\prime , M^\prime \subseteq M. \end{aligned}$$
(19)

One can check that this is a partial ordering. (Antisymmetry is ensured by the pairwise disjointness of the elements of a collection.) Then, we have the following result.

Proposition 2

For \(f\in \mathcal {F}(\Omega , \mathbb {R})\) and \(h_1, h_2 \ge 0\)

$$\begin{aligned} h_1 \le h_2 \Rightarrow \mathcal {M}_{h_2}(f) \preceq \mathcal {M}_{h_1}(f). \end{aligned}$$
(20)

Recall that for \(h\ge 0\), the elements of \(\mathcal {M}_h(f)\) are pairwise disjoint, as regional maxima of a function. Therefore, an \(h_1\)-maximum is included in at most one \(h_2\)-maximum. This is why \(\mathcal {M}_{h_2}(f) \preceq \mathcal {M}_{h_1}(f)\) implies that \(\mathcal {M}_{h_2}(f)\) has no more elements than \(\mathcal {M}_{h_1}(f)\). Hence, the number of h-maxima decreases with h from the total number of regional maxima, when \(h=0\), to zero, when \(h = \Delta _\Omega f\). (Indeed one can see that \(\texttt {HMAX} _h(f)\) is just a constant function when \(h = \Delta _\Omega f \), and therefore has no regional maximum by definition, i.e., \(\mathcal {M}_{\Delta _\Omega f}(f) = \emptyset \).) This is illustrated in Fig. 3.

Note that Proposition 2 generalizes a part of Proposition 1, namely the one stating that \(\mathcal {M}_{h}(f) \preceq \mathcal {M}_{0}(f) = \mathcal {M}(f)\) for \(h\ge 0\). The proof of Proposition 2 is provided in “Appendix B.”

Fig. 3

a A function f, the result of \(\texttt {HMAX} _h(f)\) for \(h=2.5\) and the corresponding set \(\mathcal {M}_h(f)\) of h-maxima. The dynamics of the regional maxima of f are, from left to right: 2, 1, 4, 5, 5 and 3, respectively. b The evolution of \(h\mapsto \mathcal {M}_h(f)\) for \(h \in [0, \Delta _\Omega f]\)

Right continuity of \(\varvec{\mu }: h \mapsto \mathcal {M}_h(f)\) Let us denote by \(\mu \) the function mapping \(h\in [0, \Delta _\Omega f]\) to \(\mathcal {M}_h(f)\), the set of h-maxima of f. With Proposition 2 in the previous paragraph, we saw that \(\mu \) is decreasing for the orderings \(\le \) and \(\preceq \). Now, we will see that it is also piecewise constant and right continuous, as illustrated in Fig. 3. This will be useful in the study of the derivability of \(h\mapsto \texttt {HMAX} _h(f)\), in Sect. 3.3.1.

Proposition 3

Let \(f\in \mathcal {F}(\Omega , \mathbb {R})\) non-constant. Then, there exists \(K\in \mathbb {N}\) and \(h_0< h_1< \dots < h_{K+1} \in [0, \Delta _\Omega f]\) with \(h_0 = 0\) and \(h_{K+1} = \Delta _\Omega f\), such that

$$\begin{aligned} \forall k\in & \lbrace 0, \dots , K\rbrace , \ \forall h\in [h_k, h_{k+1}),\nonumber \\ \mu (h)= & \mu (h_k) = \mathcal {M}_{h_k}(f) \end{aligned}$$
(21)

and for \(k \ne k^\prime \), \(\mathcal {M}_{h_{k^\prime }}(f) \ne \mathcal {M}_{h_k}(f)\).

From now on, we will note

$$\begin{aligned} \mathcal {H}_f {:}{=} \lbrace h_0, h_1, \dots , h_{K+1}\rbrace \end{aligned}$$

the finite subset of \([0, \Delta _\Omega f]\) introduced by the proposition. Clearly, the elements of \(\mathcal {H}_f\) are the points of discontinuity of \(\mu \), but since \(\mu (h_k) = \lim \limits _{h\downarrow h_k}\mu (h)\), \(\mu \) is right continuous at these points and, therefore, everywhere. The proof of Proposition 3 is given in “Appendix C.”

Fig. 4

Example of cell counting by identifying cells with h-maxima, in the TRP1 melanocyte dataset (see Sect. 4 for the description of datasets). From left to right: input 8-bit RGB image (resized); green channel marked with ground truth locations of the 107 cells (in red); and detection results using the best h for the image: \(h_{\texttt {I}}^{*}=54\). Note that despite the perfect count, there are some false positives and false negatives (e.g., just below the top right hand corner)

2.5 Morphological Pipeline for Cell Counting

In this section, we present a simple morphological algorithm for cell counting in images where individual cells are well approximated by regional extrema of significant dynamic,Footnote 2 like in Fig. 4. We assume images to take quantized values in \([0, V_{\max }]\cap \epsilon \mathbb {N}\) with \(V_{\max }>0\) and \(0<\epsilon <V_{\max }\).

The algorithm takes as input an image f, as parameter a number \(h\ge 0\) and returns an estimated number \(n_c\) of cells after running the following steps:

  1.

    Apply an alternate filter

    $$\begin{aligned} \tilde{f} \leftarrow \varphi _\textrm{B}\circ \gamma _\textrm{B}(f) \end{aligned}$$

    where \(\varphi _\textrm{B}\) and \(\gamma _\textrm{B}\) are, respectively, the closing and opening by \(\textrm{B}\), the unit square structuring element (see Sect. 2).

  2.

    Compute the characteristic function of the h-maxima of \(\tilde{f}\),

    $$\begin{aligned} \texttt {I} _h \leftarrow \texttt {EMAX} _h(\tilde{f}) = \texttt {RMAX} (\texttt {HMAX} _h(\tilde{f})), \end{aligned}$$

    which is a binary image (see Sect. 2.4).

  3.

    Apply to \(\texttt {I} _h\) the operator \(\texttt {CCC} \), which maps any binary image to its number of connected components:

    $$\begin{aligned} n_c \leftarrow \texttt {CCC} (\texttt {I} _h). \end{aligned}$$

Let us briefly comment on the steps of this algorithm. The alternate filter is a very common one in morphological image processing. The opening \(\gamma _{\textrm{B}}\) first removes bright structures smaller than the support of \(\textrm{B}\), after which \(\varphi _{\textrm{B}}\) has the dual effect of filling in dark structures smaller than \(\textrm{B}\). Hence, the alternate filter denoises the image by suppressing small variations while preserving the shape of structures larger than \(\textrm{B}\).

The second step of the algorithm detects, in the filtered image, the bright domes containing at least one regional maximum of dynamic strictly larger than h. Our hypothesis is that in such an image, and for a well-chosen h, each cell contains one and only one such dome. This assumption is realistic given that cells do appear as bright shapes on a dark background, and that their brightness is fairly uniform over one image. Note that, in the present setting where f takes quantized values, \(\texttt {RMAX} \) can be computed with a geodesic reconstruction, following Eqs. (9) and (10).

According to that hypothesis, counting the number \(n_c\) of connected components of the binary image \(\texttt {I} _h\) yields an estimation of the number of cells in f. As we saw in Sect. 2, with Proposition 2, this number \(n_c\) is a decreasing step function of h, ranging from the total number of regional maxima (for \(h=0\)) to zero (for \(h = \Delta _{\Omega }f\), the dynamic of any global maximum). This is why we expect that for each input image, there is an interval of h values for which \(\vert \mathcal {M}_h(f)\vert \) is close to the right number of cells.
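For illustration, the three steps above can be sketched with scikit-image as follows; the library choice and the function name count_cells are ours (the paper's pipeline uses dedicated neural layers, see Sect. 3), and skimage.morphology.h_maxima combines the \(\texttt {HMAX} \) and \(\texttt {RMAX} \) steps, up to the strict versus non-strict comparison on the dynamic.

```python
# Sketch of the morphological counting algorithm of Sect. 2.5, for a grayscale
# image f with quantized values and a given dynamic h.
import numpy as np
from skimage.measure import label
from skimage.morphology import closing, h_maxima, opening

def count_cells(f, h):
    B = np.ones((3, 3), dtype=bool)                  # 8-connectivity structuring element
    f_filt = closing(opening(f, B), B)               # step 1: alternate filter
    emax = h_maxima(f_filt, h, footprint=B)          # step 2: binary image EMAX_h
    _, n_c = label(emax, connectivity=2, return_num=True)  # step 3: CCC
    return n_c
```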

The goal of our method is to train end-to-end a deep learning architecture which first estimates the right dynamic and then applies the above algorithm. This is the motivation of the next section.

3 Integration in a Deep Learning Architecture

The morphological pipeline described in the previous section depends on one crucial parameter, namely the minimum dynamic for a regional maximum to be considered a relevant object (in our case, a cell). Following [8], in this section we describe how this pipeline is implemented as a sequence of neural layers and embedded in an architecture where it is preceded by a CNN which predicts an optimal dynamic for each image, in order to count the right number of cells at the output of the network. We also analyze the ability of such pipeline to be trained with gradient descent with respect to an appropriate loss, by studying the derivability of its components with respect to their parameters and input variables.

3.1 Counting Connected Components (\(\texttt {CCC} \)) Layer

Counting connected components is an important step in many classical methods in data processing [16], especially for images and graphs. In the context of deep learning, this problem has been used to evaluate the generalization capacity of neural networks [17]. However, to be used as a layer inside a network, it must be implemented so that gradient backpropagation can be handled by automatic differentiation software [18]. In this subsection, we present an algorithm, already introduced in [8], meant to count the connected components of a binary image and relying mainly on the geodesic reconstruction defined in (6).

Let \(f\in \mathcal {F}(\Omega ,\{0,1\})\) be a binary image, and \(\mathcal {U}_{\Omega }\in \mathcal {F}(\Omega ,[0,1])\) an injective image, i.e., such that \(\forall x,y\in \Omega , \;\; x\ne y \Rightarrow \mathcal {U}_{\Omega }(x)\ne \mathcal {U}_{\Omega }(y)\). Then, it is clear, from Eq. (7), that reconstructing \(\mathcal {U}_{\Omega }\wedge f\) under f will label every connected component \(\texttt {CC}\) of f with the maximum value taken by \(\mathcal {U}_{\Omega }\) in \(\texttt {CC}\). Since \(\mathcal {U}_{\Omega }\) is injective, this maximum is achieved only once in each connected component. Hence, setting to one each pixel \(x\in \Omega \) such that \(\texttt {REC} (\mathcal {U}_{\Omega }\wedge f,f)(x) = \mathcal {U}_{\Omega }(x)\) produces a binary image where exactly one pixel per connected component of f is lit. This is illustrated in Fig. 5.

Fig. 5

Illustration of the \(\texttt {CCC} \) algorithm. a \(\mathcal {U}_\Omega \) coincides only once with \(\texttt {REC} (\mathcal {U}_{\Omega }\wedge f,f)\) inside each connected component, and never outside the connected components. b Hence, \(\mathcal {D}_f\) has only two pixels activated, one per connected component of f

Therefore, we define

$$\begin{aligned} \mathcal {D}_f:= \mathcal {B}\big (\mathcal {U}_{\Omega },\texttt {REC} (\mathcal {U}_{\Omega }\wedge f,f)\big ) \end{aligned}$$
(22)

where the binarization operator \(\mathcal {B}\) for two functions \(f,g \in \mathcal {F}(\Omega ,\mathbb {R})\) is:

$$\begin{aligned} \forall x \in \Omega , \quad \mathcal {B}(f,g)(x):={\left\{ \begin{array}{ll} 1 \text { if } \quad f(x) = g(x)\\ 0 \text { otherwise}. \end{array}\right. } \end{aligned}$$
(23)

The image \(\mathcal {D}_f\) is the final detection result, with exactly one activated pixel per connected component of f, marking its location.

Accordingly, one can count the number of connected components of f by simply summing (22):

$$\begin{aligned} \texttt {CCC} (f):= \sum _{x\in \Omega } \mathcal {D}_f(x). \end{aligned}$$
(24)
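As a sanity check, the construction of Eqs. (22)–(24) can be sketched in NumPy as follows, reusing the reconstruction_by_dilation helper sketched in Sect. 2.3.2 (our naming); the injective image is drawn as a random permutation of strictly positive values, so that it never coincides with the reconstruction on the background.

```python
# Sketch of the CCC operator of Eqs. (22)-(24) on a binary image f with values in {0, 1}.
import numpy as np

def ccc(f, seed=0):
    rng = np.random.default_rng(seed)
    # injective image U with values in (0, 1)
    u = (rng.permutation(f.size) + 1.0).reshape(f.shape) / (f.size + 1.0)
    rec = reconstruction_by_dilation(np.minimum(u, f), f.astype(float))
    d = (rec == u)       # D_f of Eq. (22): exactly one lit pixel per connected component
    return int(d.sum())  # Eq. (24)
```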

3.2 Loss Functions

The parameters of the proposed neural architecture, which are those of the \(\texttt {CNN} \) estimating h, are optimized by minimizing a loss function which can penalize different errors made by the network at different stages.

Since we are dealing with a counting task, a natural error to penalize is the counting error. Therefore, we define the counting loss as the mean absolute error (MAE) between the estimated number of cells \(\widehat{\texttt {CCC}}_{\texttt {I}}:=\texttt {CCC} (\texttt {EMAX} _{\hat{h}_{\texttt {I}}})\) and the true number of cells \(\texttt {CCC} _{\texttt {I}}^{*}\), for an input image \(\texttt {I} \):

$$\begin{aligned} \mathcal {L}_{count}^{\texttt {I}} :=\left| \texttt {CCC} _{\texttt {I}}^{*} - \widehat{\texttt {CCC}}_{\texttt {I}} \right| = \left| \texttt {CCC} _{\texttt {I}}^{*} - \texttt {CCC} (\texttt {EMAX} _{\hat{h}_{\texttt {I}}}(\texttt {I}))\right| . \end{aligned}$$
(25)

In the case where we do know, for each training image \(\texttt {I} \), an optimal value \(h_{\texttt {I}}^*\) for which the number of h-maxima is closest to the number of cells, then we have two more possible errors to penalize. One is the error on the estimation of h, i.e., \(|h^*_{\texttt {I}} - \hat{h}_{\texttt {I}}|\), and the other one is the reconstruction loss, defined as the mean squared error (MSE) between the predicted geodesic reconstruction image \(\texttt {HMAX} _{\hat{h}_{\texttt {I}}}\) and the ground truth one \(\texttt {HMAX} _{h_{\texttt {I}}^{*}}\):

$$\begin{aligned} \mathcal {L}_{rec}^{\texttt {I}} = \frac{1}{|\Omega |}\sum _{x\in \Omega }(\texttt {HMAX} _{h_{\texttt {I}}^{*}}(x)-\texttt {HMAX} _{\hat{h}_{\texttt {I}}}(x))^2. \end{aligned}$$
(26)

Yet another possibility is the binary cross entropy between \(\texttt {EMAX} _{\hat{h}_I}\) and \(\texttt {EMAX} _{{h}_I^*}\), \(\texttt {BCE} (\texttt {EMAX} _{\hat{h}_I}, \texttt {EMAX} _{{h}_I^*})\). Note that these loss terms are less and less constraining as we move from h to the output count, in the sense that

$$\begin{aligned} |h^*_{\texttt {I}} - \hat{h}_{\texttt {I}}| = 0&\Rightarrow \mathcal {L}_{rec}^{\texttt {I}} = 0 \nonumber \\&\Rightarrow \texttt {BCE} (\texttt {EMAX} _{\hat{h}_I}, \texttt {EMAX} _{{h}_I^*}) = 0 \nonumber \\&\Rightarrow \mathcal {L}_{count}^{\texttt {I}} = 0 \end{aligned}$$
(27)

whereas none of the converse implications is true. In the present paper, we focus on the counting and reconstruction losses, as previous experiments showed no improvement when including the other two terms.

Hence, the final loss we consider is a joint loss of the form

$$\begin{aligned} \mathcal {L}^{\texttt {I}} = \alpha \mathcal {L}_{rec}^{\texttt {I}} + \beta \mathcal {L}_{count}^{\texttt {I}} \end{aligned}$$
(28)

where \(\alpha \ge 0\) and \(\beta \ge 0\) control the importance of each of the terms, and need to be adjusted as hyperparameters (see Sect. 4).
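For concreteness, a minimal TensorFlow sketch of the joint loss (28) is given below; hmax_star and hmax_pred denote the \(\texttt {HMAX} \) images computed with \(h_{\texttt {I}}^{*}\) and \(\hat{h}_{\texttt {I}}\), and count_true and count_pred the annotated and estimated cell counts, all assumed to be available as tensors (the names are ours).

```python
# Sketch of the joint loss of Eq. (28), combining Eqs. (25) and (26).
import tensorflow as tf

def joint_loss(hmax_star, hmax_pred, count_true, count_pred, alpha=0.5, beta=0.5):
    l_rec = tf.reduce_mean(tf.square(hmax_star - hmax_pred))        # Eq. (26)
    l_count = tf.abs(tf.cast(count_true, tf.float32) - count_pred)  # Eq. (25)
    return alpha * l_rec + beta * l_count                           # Eq. (28)
```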

We shall stress that while both terms can help the optimization of the parameters, in practice the reconstruction loss is much more constraining, as it requires the computation of an optimal h and its corresponding \(\texttt {HMAX} _h\) reconstruction for each training image. This implies knowing, before training, the grayscale images that will be passed to the h-estimator CNN. In particular, this prevents learning an optimal preprocessing of the images before the estimation of h, e.g., a trainable filtering that would replace the proposed alternate filter \(\varphi _{\textrm{B}}\circ \gamma _{\textrm{B}}\), and/or a trainable mapping from three-channel to one-channel images.

Therefore, training our pipeline by minimizing the counting loss only (i.e., \(\alpha =0\)) is highly desirable. It is one of the targets of the present paper, as an improvement to [8].

3.3 Derivability of the Pipeline

In order to train the proposed pipeline with a gradient descent algorithm, we need to check that a gradient can flow during backpropagation, allowing for the update of the CNN parameters. The alternate filter is non-trainable and applied to the input before the CNN, so it is not involved in the backpropagation. The CNN itself does not raise any particular problem compared to usual deep networks for images. The next step takes as input the h value estimated by the CNN, to compute the h-transformation \(\texttt {HMAX} _h(\tilde{f})\), which should therefore be viewed as a function of h for a fixed input image \(\tilde{f}\). The last step consists in counting the number of regional maxima of \(\texttt {HMAX} _h(\tilde{f})\), i.e., applying \(\texttt {CCC} \circ \texttt {RMAX} \) to \(\texttt {HMAX} _h(\tilde{f})\). Let us now analyze the derivability of these last two steps.

Fig. 6

Illustration of Proposition 4, showing the changes from \(\texttt {HMAX} _h(f)\) to \(\texttt {HMAX} _{h\pm \eta }(f)\) for a function f with integer values (black line), and two different h. In a, f and h fulfill the stability condition of the proposition in both positive and negative directions, therefore a shift of \(\pm \eta \) occurs exactly at the h-maxima (regional maxima of the green function), confirming Eq. (29). In b, however, the stability holds for nonnegative \(\eta \), but the h-maxima vary even for the smallest negative \(\eta \). Hence, in the latter case the reconstructions vary outside the h-maxima and remain unchanged in some parts of these, contradicting (29)

3.3.1 Derivative of \(h\mapsto \texttt {HMAX} _h(f)\)

Here, we consider a fixed \(f\in \mathcal {F}(\Omega , \mathbb {R})\) and see \(\texttt {HMAX} _h(f)\) as a function of h. Recall that we denote by \(\mathcal {M}_h(f)\) the set of h-maxima of f (see Sect. 2). Then, we can state the following result.

Proposition 4

Let \(h \ge 0\), and assume there exists an interval I with \(0\in I\), such that for any \(\eta \in I\), \(\mathcal {M}_{h+\eta }(f) = \mathcal {M}_{h}(f)\). Then, we have

$$\begin{aligned} & \forall \eta \in I, \nonumber \\ & \quad {{\texttt {HMAX}}}_{h+\eta }(f) = {{\texttt {HMAX}}}_h(f) -\eta {{\texttt {EMAX}}}_h(f). \end{aligned}$$
(29)

The interpretation of Proposition 4 is the following. First, the assumption that \(\mathcal {M}_{h+\eta }(f) = \mathcal {M}_{h}(f)\) for \(\eta \in I\) just means that, for the considered f and value h, the set of h-maxima of f remains the same if we change h by a small enough (but nonzero) quantity, in at least one direction, positive or negative. We will refer to this concept as the stability of h-maxima. Figure 6 shows an example where I is open, in (a), thus the stability holds in both directions, and an example where \(0 = \min I\), in (b), where the stability holds only for nonnegative \(\eta \).

Second, Eq. (29) indicates that, under the condition of stability of h-maxima, a small change in h only changes the height of the h-maxima and leaves the rest of \(\texttt {HMAX} _h(f)\) unchanged, as shown in Fig. 6. The proof of Proposition 4 is provided in “Appendix D.”

Proposition 4 leads to the following consequence regarding the Fréchet derivability of \(h\mapsto \texttt {HMAX} _h(f)\).

Corollary 5

Let \(f\in \mathcal {F}(\Omega , \mathbb {R})\) and \(\mathcal {H}_f = \lbrace h_0, h_1, \dots , h_{K+1}\rbrace \) be the finite subset of \([0, \Delta _\Omega f]\) containing the points of discontinuity of \(h\mapsto \mathcal {M}_h(f)\), as introduced by Proposition 3. Then, the function \(h \mapsto {{\texttt {HMAX}}}_{h}(f)\) is differentiable at \(h^*\) if and only if \(h^* \notin \mathcal {H}_f\), and in that case its derivative is

$$\begin{aligned} \frac{\partial {{\texttt {HMAX}}}_{h^*}(f)}{\partial h} = -{{\texttt {EMAX}}}_{h^*}(f). \end{aligned}$$
(30)

The non-derivability of \(h \mapsto \texttt {HMAX} _{h}(f)\) at any \(h\in \mathcal {H}_f\) appears clearly on the example of Fig. 6b, where \(h^* = 2\). In this case, on the one hand we do have, for \(0\le \eta <1\),

$$\begin{aligned} \texttt {HMAX} _{h^*+\eta }(f) = \texttt {HMAX} _{h^*}(f) -\eta \texttt {EMAX} _{h^*}(f). \end{aligned}$$

Therefore, if \(h \mapsto \texttt {HMAX} _{h}(f)\) was Fréchet differentiable at \(h^*=2\), its derivative would be \(\frac{\partial \texttt {HMAX} _{h^*}(f)}{\partial h} = -\texttt {EMAX} _{h^*}(f)\). But on the other hand, we have, for \(-1<\eta <0\),

$$\begin{aligned} & \texttt {HMAX} _{h^*+\eta }(f) = \texttt {HMAX} _{h^*}(f) -\eta \texttt {EMAX} _{h^*}(f) \\ & \quad - \eta \Big (\texttt {EMAX} _{h^*+\eta }(f) - \texttt {EMAX} _{h^*}(f)\Big ) \end{aligned}$$

where \(\texttt {EMAX} _{h^*+\eta }(f) - \texttt {EMAX} _{h^*}(f) = \mathbbm {1}_{\lbrace 1\rbrace }-\mathbbm {1}_E\), with \(E =\lbrace 3, 4, 10\rbrace \) if we index pixels from left to right and starting from zero. Since that quantity does not depend on \(\eta \), it does not converge to the null function when \(\eta \) goes to zero from below, which contradicts the differentiability of \(h \mapsto \texttt {HMAX} _{h}(f)\) at \(h^* = 2\). The general case is handled likewise in the proof of Corollary 5, in “Appendix E.”

It is clear from this example, and more formally from Proposition 1, that discontinuities in \(h\mapsto \mathcal {M}_h(f)\) can only be caused by discontinuities of the upper level set function \(t\mapsto \chi _t(f)\). For functions f taking quantized values (like in Fig. 6), we have the guarantee that the level sets remain constant in between these values, and that therefore \(h \mapsto \texttt {HMAX} _{h}(f)\) is differentiable in the corresponding intervals.
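In an auto-differentiation framework, Corollary 5 can be exposed as a custom gradient: the forward pass computes \(\texttt {HMAX} _h(f)\) with the (non-differentiable) reconstruction routine, and the backward pass returns the closed-form derivative \(-\texttt {EMAX} _h(f)\). The sketch below assumes helper functions hmax(f, h) and emax(f, h) wrapping the geodesic reconstruction layers of [10, 11]; their names are ours and the snippet is an illustration rather than the exact implementation used in our experiments.

```python
# Sketch of an HMAX layer whose gradient with respect to h follows Eq. (30).
import tensorflow as tf

@tf.custom_gradient
def hmax_layer(f, h):
    out = hmax(f, h)                                # forward pass: HMAX_h(f), Eq. (14)
    def grad(upstream):
        dh = -tf.reduce_sum(upstream * emax(f, h))  # chain rule with Eq. (30)
        return None, dh                             # no gradient is propagated to f
    return out, grad
```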

3.3.2 Derivability of \(\texttt {CCC} \circ \texttt {RMAX} \)

As explained earlier, the operator \(\texttt {CCC} \circ \texttt {RMAX} \) consists in counting the regional maxima of a function (in our case, \(\texttt {HMAX} _h(f)\)), \(\texttt {RMAX} \) returning the characteristic function of the regional maxima, and \(\texttt {CCC} \) the one counting the connected components of a binary image.

However, as seen in Sect. 3.3.1, the number of h-maxima remains constant over the intervals \([h_k, h_{k+1})\), and therefore, the derivative of \(\texttt {CCC} \circ \texttt {RMAX} \) at \(\texttt {HMAX} _h(f)\), \(h\notin \mathcal {H}_f\), is zero.

On the other hand, when \(h\in \mathcal {H}_f\), a discontinuity in the h-maxima occurs, causing a non-derivability of \(\texttt {CCC} \circ \texttt {RMAX} \) at \(\texttt {HMAX} _h(f)\) since, by Propositions 2 and 3, a change in \(\mathcal {M}_h(f)\) also changes its cardinality. When relying on auto-differentiation with the algorithm described in Sect. 3.1, we observe exploding gradients instead.

Therefore, in order to avoid vanishing as well as exploding gradients, we propose to use explicit (approximations of) derivatives, as explained next.

3.4 Modified Backpropagation

Let us consider the setting where the training is done by minimizing the joint loss (28) with \(\beta > 0\), that is to say including the counting loss term. As we have just shown in the previous section, numerical issues such as vanishing or exploding gradients are expected when relying on the auto-differentiation of (25), and therefore of \(h\mapsto \vert \mathcal {M}_h(f)\vert = \texttt {CCC} \circ \texttt {RMAX} \circ \texttt {HMAX} _h(f)\). In our experiments, significant optimization difficulties have been observed, especially when the counting loss term was weighted similarly as the reconstruction term, that is, \(\beta \approx \alpha \). We overcame these difficulties by using derivatives of smooth approximations of the function \(h\mapsto \vert \mathcal {M}_h(f)\vert \).

We propose two approximations, which were both tested in our experiments (see Sect. 4). The first one, more naive, is only based on the fact that \(\vert \mathcal {M}_h(f)\vert \) is a decreasing function of h. Without any further knowledge, we assume

$$\begin{aligned} \vert \mathcal {M}_h(f)\vert \approx -h + cst \end{aligned}$$
(31)

which gives

$$\begin{aligned} \frac{\partial \vert \mathcal {M}_h(f)\vert }{\partial h} \approx -1. \end{aligned}$$
(32)

We denote by (MB, \(-1\)) the class of models trained with this approximation, MB standing for modified backpropagation.

The second approximation, more sophisticated and realistic, assumes

$$\begin{aligned} \vert \mathcal {M}_h(f)\vert \approx \sum _{x\in \Omega }\mathcal {I}_N(x) \end{aligned}$$
(33)

where, for some positive integer N,

$$\begin{aligned} \mathcal {I}_N = \left( \frac{f - \texttt {HMAX} _h(f)}{h}\right) ^N. \end{aligned}$$
(34)

The justification for this approximation is given by Corollary 7, based on Proposition 6. We first introduce the notation

$$\begin{aligned} \overline{\mathcal {D}}_{h}(f) {:}{=} \left\{ M_0\in \mathcal {M}(f),\ \texttt {DYNMAX} (M_0) \ge h\right\} \end{aligned}$$
(35)

for the set of regional maxima of f with dynamic not smaller than h. We can now state the following proposition, proved in “Appendix F.”

Proposition 6

Let \(f\in \mathcal {F}(\Omega , \mathbb {R})\) non-constant, and \(h\in {(0, \Delta _\Omega f]}\). Then, for any \(x\in \Omega \),

$$\begin{aligned} & {{\texttt {HMAX}}}_h(f)(x) = f(x)-h \nonumber \\ & \quad \iff \exists M_0\in \overline{\mathcal {D}}_{h}(f), x\in M_0. \end{aligned}$$
(36)

Given that, additionally, we have by definition \(f \ge \texttt {HMAX} _h(f) \ge f-h\), it follows that for any \(x\in \Omega \) and \(h>0\),

$$\begin{aligned} 0\le \frac{f(x) - \texttt {HMAX} _h(f)(x)}{h} \le 1 \end{aligned}$$
(37)

with equality to 1 if and only if \(x\in M_0\) for some \(M_0\in \overline{\mathcal {D}}_{h}(f)\). Hence, when N goes to infinity, we get

$$\begin{aligned} \lim \limits _{N\rightarrow +\infty }\sum _{x\in \Omega }\mathcal {I}_N(x) = \sum \limits _{M_0\in \overline{\mathcal {D}}_{h}(f)}\mathbbm {1}_{M_0}. \end{aligned}$$
(38)

In words, when N is large enough, the sum in (33) approximates the number of pixels belonging to regional maxima of dynamic not smaller than h.

Now, note that for \(h\notin \mathcal {H}_f\), as introduced by Proposition 3, there is no regional maximum \(M_0\) with dynamic exactly h. Indeed, if there were one, by Proposition 1 it would belong, as a connected component of the corresponding \(\arg \max \), to an \((h-\epsilon )\)-maximum for any \(\epsilon > 0\) (smaller than h), but to no h-maximum, so h would be a discontinuity point of \(h\mapsto \mathcal {M}_h(f)\), i.e., in \(\mathcal {H}_f\). Therefore, for such h we have \(\overline{\mathcal {D}}_{h}(f) = \mathcal {D}_{h}(f)\), where we recall that

$$\begin{aligned} \mathcal {D}_{h}(f) {:}{=} \left\{ M_0\in \mathcal {M}(f), \texttt {DYNMAX} (M_0) > h\right\} \end{aligned}$$

is the set of regional maxima of f with dynamic larger than h, already defined in (C8). Furthermore, Proposition 1 tells us that \(M_0\in \mathcal {D}_{h}(f)\) if and only if \(M_0\) is a connected component of \(\arg \max \limits _M f\) for some \(M\in \mathcal {M}_h(f)\). Thus, for \(h\notin \mathcal {H}_f\), we get

$$\begin{aligned} \sum \limits _{M_0\in \overline{\mathcal {D}}_{h}(f)}\mathbbm {1}_{M_0} = \sum \limits _{M\in \mathcal {M}_h(f)}\mathbbm {1}_{\arg \max \limits _M f}. \end{aligned}$$
(39)

As a consequence, if we further assume that in every h-maximum M, the set \(\arg \max \limits _{M} f\) contains only one pixel, then when N goes to infinity, the sum in (33) does provide the exact number of h-maxima. We have just proved the following corollary.

Corollary 7

Let \(f\in \mathcal {F}(\Omega , \mathbb {R})\) be injective, \(h\in [0, \Delta _\Omega f]{\setminus } \mathcal {H}_f\), and \((\mathcal {I}_N)_{N\ge 1}\) the sequence of images defined by (34). Then,

$$\begin{aligned} \lim \limits _{N\rightarrow +\infty }\sum _{x\in \Omega }\mathcal {I}_N(x) = \vert \mathcal {M}_h(f)\vert . \end{aligned}$$
(40)

Note that the approximation is quite accurate as soon as we add some small random noise to f, preventing any two pixels from having the same value, and for large enough N, as illustrated in Figs. 7 and 8.

Fig. 7

Illustration of Corollary 7. The function \(\tilde{f}\) is defined by adding some Gaussian noise (zero mean, 0.2 standard deviation) to the function f of Figs. 3 and 6. Here, \(h=2.5\notin \mathcal {H}_{\tilde{f}}\) and \(N=10\). The value to be approximated is \(\vert \mathcal {M}_h(f)\vert = 3 = \vert \mathcal {M}_h(\tilde{f})\vert \), and we get \(\sum \limits _{x\in \Omega }\mathcal {I}_N \approx 3.14\), where \(\mathcal {I}_N\) is computed for \(\tilde{f}\), which is injective

This leads to the following approximation of the derivative

$$\begin{aligned} \frac{\partial \ \vert \mathcal {M}_h(f)\vert }{\partial h} \approx \sum _{x\in \Omega }\mathcal {I}^\prime _N(x) \end{aligned}$$
(41)

with

$$\begin{aligned} \mathcal {I}_N^\prime = -N\cdot \frac{\rho (f,h)^{N-1}}{h^N}\left( \frac{\rho (f,h)}{h} - \texttt {EMAX} _h(f)\right) \end{aligned}$$
(42)

and \(\rho (f, h) = f - \texttt {HMAX} _h(f)\). In the experimental section, we will refer to this approximation as (MB, N).
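A NumPy sketch of this surrogate and of its explicit derivative is given below; hmax and emax again denote reference implementations of \(\texttt {HMAX} _h\) and \(\texttt {EMAX} _h\) (our naming), and a small noise is assumed to have been added to f beforehand, as suggested after Corollary 7, so that f is injective.

```python
# Sketch of the (MB, N) approximation: Eqs. (33)-(34) for the count and
# Eqs. (41)-(42) for its derivative with respect to h.
import numpy as np

def soft_count_and_grad(f, h, N=50):
    rho = f - hmax(f, h)                                          # f - HMAX_h(f), in [0, h]
    i_n = (rho / h) ** N                                          # Eq. (34)
    count = i_n.sum()                                             # Eq. (33)
    di_n = -N * rho ** (N - 1) / h ** N * (rho / h - emax(f, h))  # Eq. (42)
    return count, di_n.sum()                                      # Eq. (41)
```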

Fig. 8

Illustration of Equations (33) and (41), where f is the preprocessed green channel of the cell image of Fig. 4. Left: The actual number of h-maxima is plotted in blue as a function of h, whereas its approximation \(\sum \limits _{x\in \Omega }\mathcal {I}_N(x)\) is plotted in orange. Right: In blue, we plot an estimation of \(\frac{\partial \vert \mathcal {M}_h(f)\vert }{\partial h}\) obtained by finite differences \(0.5\times (\vert \mathcal {M}_{h+1}(f)\vert - \vert \mathcal {M}_{h-1}(f)\vert )\). The orange curve is the approximation \(\sum \limits _{x\in \Omega }\mathcal {I}^{\prime }_N(x)\). \(\mathcal {I}_N\) and \(\mathcal {I}^{\prime }_N\) are computed for \(N=50\) and the perturbed injective function \(\tilde{f}\) is obtained by adding Gaussian noise to f with standard deviation \(\sigma = 0.2 h\)

4 Experiments

4.1 Data

To evaluate the benefits of our proposal, three datasets are used. It is important to note that the three types of images considered have in common that the cells of interest are bright objects on dark backgrounds, for which using the dynamic for cell counting makes sense. In the three considered datasets, the input image is transformed to a single input channel and then rescaled to a resolution of \(256 \times 256\) pixels. The image range was between 0 and \(V_{\max } = 255\), quantized with a float precision \(\epsilon \). Hence, the previously described method applies, producing binary images with values in \(\lbrace 0, 255\rbrace \). Applying the morphological pipeline described in Sect. 2.5 for all the integer values h in \([0, \Delta _{\Omega }f]\) allowed us to determine, for each training image I, an optimal dynamic \(h_{\texttt {I}}^{*}\), defined as the lower bound of the interval of h values minimizing the difference between the true number of cells and the estimated one.

4.1.1 TRP1 Dataset

The TRP1 dataset [3] we used contains two sets, called set1 and set2, of 76 fluorescent melanocyte RGB images each, with resolution \(1024\times 1024\) pixels, showing a high variability in density and shapes.Footnote 3 They come along with manual annotations of the cell coordinates. The images were acquired from in vitro pigmented reconstructed skin samples submitted to TRP1 fluorescent labeling [19]. Figures 4 and 14 show examples of images and manual annotations.

4.1.2 Fluorescent Neuronal Cells Dataset

The Fluorescent Neuronal Cells dataset [4] comprises 283 fluorescence microscopy images of neuron slices from the mouse brain, with a resolution of \(1600\times 1200\) pixels. These are accompanied by the corresponding ground truth cell location labels. The dataset presents a number of challenges for the cell counting task. Firstly, the cell shapes are variable, the contrast of the cells relative to the background varies, and the backgrounds themselves exhibit some variation in brightness. An example of an input image is shown in Fig. 15. In addition, the images contain biological structures that are not neuronal cells, and the number of cells contained in the images is highly variable, ranging from none to dozens of cells, with instances of overlapping and clasping cells. These factors combine to present significant challenges for accurate cell counting.

4.1.3 Subset of Cellpose Dataset

Originally used for generalist segmentation model training and evaluation, the Cellpose dataset [20] comprises a diverse array of microscopic cell images sourced from multiple avenues, including previous research and online searches. Furthermore, the dataset includes a proportion of non-cell images, which serve to enhance the model’s generalization capability. The dataset comprises a total of 616 images, of which 69 have been reserved for testing purposes. Each image is accompanied by ground truth segmentation labels.

Fig. 9

Proposed end-to-end differentiable pipeline using a \(\texttt {CNN} \) and an \(\texttt {HMAX} \) layer for cell counting. The network is trained by minimizing a loss function that depends on (1) the difference between the geodesic reconstruction with the dynamic \(\hat{h}\) estimated by the network, and the best possible one, and (2) the difference between the number of cells and the number of regional maxima of \(\texttt {HMAX} _{\hat{h}}\)

As our method is constrained to images featuring bright objects on darker backgrounds, we manually selected a subset of the entire dataset, comprising images that align with this hypothesis. The selected fluorescent microscopy images typically feature nuclei stained with DAPI and cytoplasmic proteins labeled with a fluorescent substance; the two stains are represented in different color channels. Examples of the selected cell images are shown in the left column of Fig. 16. Finally, 213 data samples are selected for training, validation and testing.

4.2 Overall Architecture

The proposed neural architectureFootnote 4 is illustrated in Fig. 9. Morphological layers apply an alternate filter (opening followed by closing) by a \(3\times 3\) square structuring element to the resized green channel, which yields the first preprocessed image \(\texttt {I} \). This image is passed as input to a CNN, which estimates the best dynamic \(\hat{h}_{\texttt {I}}=\texttt {CNN} _\theta (\texttt {I})\), where \(\theta \) denotes the parameters of the CNN. This dynamic is then used to compute the \(\hat{h}_{\texttt {I}}\)-maxima of the filtered image, by first feeding it to the \(\texttt {HMAX} \) reconstruction layer, followed by the \(\texttt {RMAX} \) one to get the binary image \(\texttt {EMAX} _{\hat{h}_{\texttt {I}}}\). Finally, the number of connected components of the latter is obtained at the output of the proposed connected components counting layer, \(\texttt {CCC} (\texttt {EMAX} _{\hat{h}_{\texttt {I}}})\). The CNN architecture is summarized in Fig. 10. In particular, each convolutional layerFootnote 5 is followed by a Batch Normalization layer with ReLU as activation function. Following [11], a global average pooling layer is used after the second max-pooling layer instead of stacked flatten and fully connected layers, which has the advantage of further reducing the number of parameters of the network. Moreover, extracting global feature information is coherent with the objective of our task.

Regarding the theoretical complexity of this architecture, for the chosen fixed size of kernels and structuring elements, the CNN part of the pipeline has linear complexity in the number n of input pixels, because convolutions do. A dilation also has linear complexity, but the geodesic reconstruction might require as many as \(\sqrt{n}\) dilations, each one followed by a pixelwise minimum and an image comparison test. Therefore, with our implementation (which does not use special data structures to reduce complexity) the morphological part has a much higher complexity of \(O(n^{3/2})\) in the worst case, which determines the complexity of the whole pipeline. Despite this asymptotic result, it is worth noting that the actual number of operations performed by the CNN is about \(5000 \times n = 5000 \times 256^2\), and the morphological part performs two geodesic reconstructions, resulting in at most about \(20\sqrt{n} \times n \approx 5000 \times 256^2 \) operations as well. So one could expect the CNN part to take about the same time as the morphological one during inference. In practice, however, we observe that the morphological part takes from 13 to 29 times longer than the CNN on average, depending on the dataset (although all input images are resized to the fixed input size of \(256\times 256\) pixels). This can be explained by the fact that, within the machine learning framework we use (TensorFlow), convolutions are much better optimized than our custom geodesic reconstruction, which relies on a while loop and an image comparison after each geodesic dilation.

Fig. 10 CNN architecture used in the proposed pipeline

4.3 Training Protocol

TRP1 dataset: To train the architecture presented earlier, hence fit the parameters of the CNN, we randomly split set1, containing 76 images, into 60 training images and 16 validation images. We only worked with the green channel of the images. The parameters of the CNN were adjusted through gradient descent on the training set, and the best model weights were chosen according to the evaluation metrics on the validation set. The whole set2, which also contains 76 images, was used for the final test only.

Fluorescent Neuronal Cells dataset: The same test set used in [4] is employed in our experiments, comprising 70 selected images. The remaining data are randomly split into a training set (80%) and a validation set (20%). Each fluorescent image was converted to grayscale using the linear luminance formula \(0.2126 R + 0.7152 G + 0.0722 B\), where R, G and B correspond to the red, green and blue channels, respectively. The image was then resized to a resolution of \(256 \times 256\) for input to the CNN.

Subset of the Cellpose dataset: In order to remain consistent with the original Cellpose dataset, among the 213 selected data samples we keep as test set those that were originally in the test split of the Cellpose dataset; the remaining samples are randomly split into training and validation sets with an 8:2 ratio, resulting in 158 samples for training, 39 for validation and 16 for test. For each input image, the maximum value over the RGB channels is taken at each pixel in order to convert the input image into a single channel. Prior to being fed to the CNN, the same resizing to a resolution of \(256\times 256\) is applied.
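The snippet below sketches these two dataset-specific conversions and the common resizing step; the function names are ours and the images are assumed to be float RGB arrays.

```python
# Illustrative preprocessing for the two datasets (channel reduction + resizing).
import numpy as np
import tensorflow as tf

def to_gray_fnc(rgb):
    # Fluorescent Neuronal Cells: linear luminance 0.2126 R + 0.7152 G + 0.0722 B
    return 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]

def to_gray_cellpose(rgb):
    # Cellpose subset: per-pixel maximum over the three channels
    return rgb.max(axis=-1)

def prepare(rgb, to_gray):
    gray = to_gray(rgb)[..., np.newaxis]           # single-channel image (H, W, 1)
    return tf.image.resize(gray, (256, 256))       # fixed CNN input resolution
```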

4.3.1 Data Augmentation

Convolutional operations are translation equivariant, but not rotation or flip equivariant. Moreover, for the image samples in the dataset, a flipped or rotated version still looks like a valid sample. It is therefore reasonable to use flips and rotations as data augmentation during training. More precisely, we used vertical and horizontal flip transformations, each with probability 1/2, and, with the same probability, a rotation whose angle was uniformly selected from \(\{90^{\circ },180^{\circ },270^{\circ }\}\). Neither the true number of cells nor the optimal dynamic is changed by this augmentation.
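A possible implementation of this augmentation is sketched below, assuming images stored as NumPy arrays with the spatial dimensions first.

```python
# Random flips with probability 1/2 each, and a rotation by 90, 180 or 270 degrees
# with probability 1/2; cell counts and optimal dynamics are unaffected.
import random
import numpy as np

def augment(image):
    if random.random() < 0.5:
        image = np.flipud(image)                   # vertical flip
    if random.random() < 0.5:
        image = np.fliplr(image)                   # horizontal flip
    if random.random() < 0.5:
        k = random.choice((1, 2, 3))               # 90, 180 or 270 degrees
        image = np.rot90(image, k)
    return image
```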

4.3.2 Optimization

The joint loss function was minimized with stochastic gradient descent. More precisely, we used the RMSProp optimizer [21] with an initial learning rate of \(10^{-3}\). We used a batch size of 15 images for TRP1 and Cellpose, and 17 for Fluorescent Neuronal Cells, and trained for at most 1600 epochs with early stopping as soon as there was no improvement of the validation loss for 300 epochs. Only the model weights with the lowest validation error were saved and used for the final evaluation on the test set.
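In Keras terms, this training configuration could look as follows; the loss is a mere placeholder for the joint or counting loss discussed in the paper, and `build_dynamic_predictor` refers to the CNN sketch given in Sect. 4.2.

```python
# Sketch of the optimizer, early stopping and checkpointing configuration.
import tensorflow as tf

model = build_dynamic_predictor()                  # CNN sketch from Sect. 4.2
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="mae")                          # placeholder for the actual loss

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=300),
    tf.keras.callbacks.ModelCheckpoint("best.weights.h5", monitor="val_loss",
                                       save_best_only=True,
                                       save_weights_only=True),
]
# model.fit(train_ds, validation_data=val_ds, epochs=1600,
#           batch_size=15, callbacks=callbacks)
```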

4.3.3 Hyperparameters

Loss weights \((\alpha , \beta )\): In [8], experiments with the joint loss (28) had been carried out, but without the modified backpropagation introduced in Sect. 3.4. Fivefold cross-validation experiments on set1, with 11 ratios \(\beta /\alpha \) tested (ranging from \(10^{-4}\) to \(10^2\), including zero), had indicated that the best results were expected for \(\alpha = 1,\beta = 0.001\), that is to say for very little weight given to the counting loss. We now have an explanation for this, as we showed in Sect. 3.3 that differentiating the reconstruction loss with respect to h is not an issue in general, but differentiating the counting loss is. On the contrary, with the modified backpropagation, we observed no significant effect of the balance between \(\alpha \) and \(\beta \): \((\alpha = 1, \beta = 0.001)\) gave similar performance on the validation sets as \(\alpha = \beta = 0.5\) and as \((\alpha = 0, \beta = 1)\). Therefore, we chose to set \(\alpha = \beta = 0.5\).

Table 1 Quantitative evaluation of the proposed method on the three datasets and comparison to the state of the art (for TRP1 and Fluorescent Neuronal Cells) or a baseline deep network (for Cellpose)
Fig. 11 Results on the test set of the TRP1 dataset, for the best of our models on this dataset, trained with the joint loss and modified backpropagation (MB, \(-1\)). Left: predicted \(\hat{h}\) compared to the ground truth optimal \(h^{*}\). Right: predicted versus true cell numbers. Each dot represents an individual image. In both plots, the red line represents the identity function. See Fig. 14 for examples of predictions by this model

Fig. 12 Results on the test set of the Fluorescent Neuronal Cells dataset, for the best of our models on this dataset, trained with the counting loss only and modified backpropagation (MB, \(N=50\)). Left: predicted \(\hat{h}\) compared to the ground truth optimal \(h^{*}\). Right: predicted versus true cell numbers. Each dot represents an individual image. In both plots, the red line represents the identity function. See Fig. 15 for examples of predictions by this model

Fig. 13 Results on the test set of Cellpose, for the best of our models on this dataset, trained with the counting loss only and modified backpropagation (MB, \(-1\)). Left: predicted \(\hat{h}\) compared to the ground truth optimal \(h^{*}\). Right: predicted versus true cell numbers. Each dot represents an individual image. In both plots, the red line represents the identity function. See Fig. 16 for examples of predictions by this model

Fig. 14 Examples of predictions on the test set of the TRP1 dataset, for the best of our models on this dataset, trained with the joint loss and modified backpropagation (MB, \(-1\)). From left to right: input resized RGB image; green channel image with ground truth cell locations, true number of cells \(n^*\) and optimal dynamic \(h_{\texttt {I}}^*\); green channel image with predicted cell locations and their estimated number \(n_c\) obtained for the estimated dynamic \(\hat{h}_{\texttt {I}}\)

Fig. 15 Examples of predictions on the test set of the Fluorescent Neuronal Cells dataset, for the best of our models on this dataset, trained with the counting loss only and the modified backpropagation (MB, \(N=50\)). From left to right: input resized RGB image; green channel image with ground truth cell locations, true number of cells \(n^*\) and optimal dynamic \(h_{\texttt {I}}^*\); grayscale image with predicted cell locations and their estimated number \(n_c\) obtained for the estimated dynamic \(\hat{h}_{\texttt {I}}\)

Fig. 16 Examples of predictions on the test set of Cellpose, for the best of our models on this dataset, trained with the counting loss only and the modified backpropagation (MB, \(-1\)). From left to right: input resized RGB image; maximum-of-three-channels image with ground truth cell locations, true number of cells \(n^*\) and optimal dynamic \(h_{\texttt {I}}^*\); grayscale image with predicted cell locations and their estimated number \(n_c\) obtained for the estimated dynamic \(\hat{h}_{\texttt {I}}\)

Asymptotic value N: For the experiments where the modified backpropagation was defined following (41) and (42), the value of N was chosen by testing, on training images, the trade-off between the quality of the approximation of the number of h-maxima given by (33) and its smoothness. Indeed, a large N gives a better approximation of \(h \mapsto \vert \mathcal {M}_h(f) \vert \), but also makes this approximation converge to a step function with zero derivative almost everywhere, which is what we want to avoid. This study led us to choose \(N=50\); accordingly, the class of models trained with this modified backpropagation is denoted (MB, \(N=50\)).

Random noise to compute \(\mathcal {I}_N^\prime \): Recall that the approximation (41) of \(\frac{\partial \ \vert \mathcal {M}_h(f)\vert }{\partial h}\) relies on Corollary 7, which holds for injective images. To obtain almost surely an injective version \(\tilde{f}\) of a possibly non-injective image f, we add some random noise to the latter. The level of noise must be small enough not to change (too much) the number of h-maxima, but large enough to ensure a sufficient difference between the maximum value and the second largest value within an h-maximum, hence ensuring a quicker convergence in (40). Again, trials on the training data led to the choice of adding Gaussian noise with standard deviation \(\sigma _h = 0.2 h\), where h is the value at which we want to approximate \(\frac{\partial \ \vert \mathcal {M}_h(f)\vert }{\partial h}\).
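A minimal sketch of this tie-breaking step, assuming images represented as NumPy arrays, could be:

```python
# Add Gaussian noise with standard deviation 0.2*h to make the image
# (almost surely) injective before approximating the derivative at h.
import numpy as np

def make_injective(image, h, rng=np.random.default_rng()):
    return image + rng.normal(scale=0.2 * h, size=image.shape)
```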

4.4 Evaluation Metrics

The following metrics have been used to evaluate our proposed pipeline. The average relative error averages the per-image relative counting error:

$$\begin{aligned} \mathcal {A}_{err}(\mathcal {S}) = \frac{1}{\mathcal {N}} \sum _{\texttt {I} \in \mathcal {S}}~ \frac{\left| \texttt {CCC} _{\texttt {I}}^{*}-\widehat{\texttt {CCC}}_{\texttt {I}}\right| }{\texttt {CCC} _{\texttt {I}}^{*}} \end{aligned}$$
(43)

where \(\mathcal {S}\) denotes the test set with \(\mathcal {N}\) samples. Considering the variability in cell numbers across the considered datasets, we also monitor the total relative error [2, 3], which sees the whole dataset as a single sample but avoids compensation of errors by taking an absolute difference for each image:

$$\begin{aligned} \mathcal {T}_{err}(\mathcal {S}) = \frac{\sum _{\texttt {I} \in \mathcal {S}}~ \left| \texttt {CCC} _{\texttt {I}}^{*}-\widehat{\texttt {CCC}}_{\texttt {I}} \right| }{\sum _{\texttt {I} \in \mathcal {S}}~ \texttt {CCC} _{\texttt {I}}^{*}}. \end{aligned}$$
(44)

Additionally, for comparison purposes, we use the mean absolute error (\(\textrm{MAE}\)) and the mean percentage error (\(\textrm{MPE}\)), as recommended in [4]:

$$\begin{aligned} \textrm{MAE}(\mathcal {S})= & \frac{1}{\mathcal {N}} \sum _{\texttt {I} \in \mathcal {S}}~ \left| \texttt {CCC} _{\texttt {I}}^{*} - \widehat{\texttt {CCC}}_{\texttt {I}}\right| , \end{aligned}$$
(45)
$$\begin{aligned} \textrm{MPE}(\mathcal {S})= & \frac{1}{\mathcal {N}} \sum _{\texttt {I} \in \mathcal {S}}~ \frac{\texttt {CCC} _{\texttt {I}}^{*} - \widehat{\texttt {CCC}}_{\texttt {I}}}{\texttt {CCC} _{\texttt {I}}^{*}} \end{aligned}$$
(46)
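These four metrics translate directly into NumPy, with `true` and `pred` denoting the arrays of ground truth and predicted counts over the test set:

```python
# Counting metrics (43)-(46) over a test set of per-image counts.
import numpy as np

def average_relative_error(true, pred):
    return np.mean(np.abs(true - pred) / true)          # (43)

def total_relative_error(true, pred):
    return np.sum(np.abs(true - pred)) / np.sum(true)   # (44)

def mae(true, pred):
    return np.mean(np.abs(true - pred))                 # (45)

def mpe(true, pred):
    return np.mean((true - pred) / true)                # (46)
```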

It is worth noting that finding the right number of cells does not imply correct detections, since false positives can compensate false negatives, as illustrated in Fig. 4. However, we have deliberately limited our analysis to counting errors, as counting is the primary objective. All counting methods benefit from this type of compensation, which is only possible if the method is not overly biased toward over- or underestimation.

5 Results

The results of our approach on the three test sets are summarized in Table 1 and compared to the state of the art for the TRP1 and Fluorescent Neuronal Cells datasets. Since Cellpose is traditionally a segmentation benchmark, no published counting method was available (i.e., a method which does not rely on the ground truth masks of cells, but at most on their coordinates). For this reason, we followed the method described in [3] and trained a Unet to predict density maps, as a baseline for this dataset. More details on the Unet architecture and training can be found in “Appendix G”.

Each of our models was trained with one of the two modified backpropagations introduced in Sect. 3.4, namely (MB, \(-1\)) and (MB, \(N=50\)), and either with the joint loss (\(\alpha = \beta =1\)) or with the counting loss only (\(\alpha =0\)). For each of these alternatives, ten models were trained following the protocol described in Sect. 4.3. We report the average metric on the test sets ± the standard deviation over the ten training instances. More detailed results of the best performing models are plotted in Figs. 11, 12 and 13, and some examples of their predictions are shown in Figs. 14, 15 and 16.

Based on these experiments, the following observations can be made. Firstly, our models achieve competitive results compared to the state of the art, sometimes surpassing it, with the advantage of being much lighter (50 to 100 times fewer parameters) and more interpretable: what is detected as a cell is an h-maximum, so missed or spurious detections are either due to a poor estimation of h, or to a deviation from our assumption of a one-to-one correspondence between cells and h-maxima. The robustness of our method also appears in Figs. 11, 12 and 13, where one can see that accurate count estimations can be achieved despite less accurate estimations of the optimal h. This is because the number of h-maxima is a step function of h, so the optimal number is reached on an interval of h values, which is usually wider for larger h. Furthermore, values outside that interval but close to it often give a close-to-optimal count.

Secondly, no striking difference in performance appears between models trained with one or the other proposed modified backpropagation, nor between models trained with the joint loss or the counting loss only. It is remarkable in particular that training with either the joint loss or the counting loss only, hence without the need for ground truth cell coordinates, converges well and consistently across different training instances. It is also interesting to note that the most naive approximation, \(\frac{\partial \vert \mathcal {M}_h(f)\vert }{\partial h} = -1\), implemented in the (MB, \(-1\)) models, works overall as well as the more sophisticated and accurate one, (MB, \(N=50\)). This can be explained by the fact that the actual derivative is not far from \(-1\) in the range of h values of interest, usually in the middle part of \([0, \Delta _{\Omega }(f)]\). Besides its simplicity of implementation, the interest of (MB, \(-1\)) lies in a significantly reduced computation time during backpropagation (about 20%) compared to (MB, \(N=50\)).

Thirdly, the limitations of our approach are well illustrated by the examples of Figs. 14, 15 and 16. We can see that assuming a one-to-one correspondence between cells and h-maxima, for a single h value per image, leads to some spurious detections when non-cell objects happen to match this criterion, and to missed cells, especially when clusters of cells appear as one large maximum and cannot be distinguished at the considered resolution and after the applied preprocessing (see in particular the examples from the Fluorescent Neuronal Cells dataset in Fig. 15). We should also comment on the limitations of the Unet trained on Cellpose as a baseline. Its architecture and training configuration were inspired by the study [3], designed for TRP1, and were not optimized for the Cellpose dataset. We nevertheless kept it for comparison, to show that the task is not straightforward.

6 Conclusion

In this paper, we have built on the recently introduced end-to-end trainable pipeline [8], where a CNN predicts the dynamic threshold for h-maxima to be recognized as melanocytes, and a layer counts the connected components of the h-maxima image, yielding an estimated number of cells. As reported in [8], when a counting error penalization term is included in the loss function, training with auto-differentiation often fails to converge, and satisfactory results could only be achieved with a very small weight on the counting term with respect to the geodesic reconstruction term.

As an extension of [8], this paper further provides a detailed theoretical framework for h-maxima, on which the analysis of the differentiability of the pipeline is built, explaining the issues related to the minimization of the counting loss. To overcome these issues, this work proposes and tests a training paradigm employing explicitly defined backpropagation rules, which prove to work well even when training our pipeline with the counting loss only. With respect to [8], we also demonstrated the effectiveness and generalizability of the proposed pipeline on two other fluorescent cell datasets.

Training with a counting loss only allows our method to bypass the need for pre-defined preprocessing steps, paving the way for future research avenues such as the development of trainable morphological layers for adaptive cell size and shape learning, or of convolutional layers for learned preprocessing. Moreover, future work could also involve the prediction of local h values for smaller patches and contrast enhancement for blurred regions. Furthermore, as proposed in [23], a graph approach based on hierarchical clustering and gradient descent, combined with a cell counting loss, also represents a potential avenue for dynamic analysis and cell counting tasks.

We posit that the synergy between neural networks and trainable morphological layers opens the door to a broader range of applications, such as counting objects in astronomical surveys. However, a deeper reflection is necessary to understand the best ways to initialize, optimize and regularize morphological layers [24, 25].