# A depth iterative illumination estimation network for low-light image enhancement based on retinex theory

In this part, we describe the experimental results and analysis in detail. First, we briefly introduce the parameter setting and comparison methods. Then, the qualitative and quantitative evaluation of paired and unpaired data sets is described. Finally, the experimental results are analyzed.

### Experimental settings

In the following section, we provide detailed information on our parameter settings, comparison methods, and evaluation index.

For all experiments in this paper, we maintained a uniform configuration environment: an Ubuntu system with 32 GB RAM and an NVIDIA GeForce RTX3090 GPU. The network framework was constructed using PyTorch and optimized with Adam using the following parameters: \(\beta _1=0.9\), \(\beta _2=0.99\), and \(\epsilon =0.95\). The batch size was set to 16, and the learning rate to 0.0003. Additionally, the training samples were uniformly resized to 320 \(\times\) 320, and we used 485 paired images randomly selected from the LOL dataset to train our model. The number of training epochs was set to 1000.
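A minimal PyTorch sketch of this optimizer configuration, with a placeholder module standing in for our network (the module is hypothetical; the hyperparameters are those listed above):

```python
import torch

# Placeholder network standing in for the enhancement model (hypothetical).
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Adam settings as reported above; note that eps = 0.95 follows the
# value stated in the text, which differs from PyTorch's default of 1e-8.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,             # learning rate 0.0003
    betas=(0.9, 0.99),   # (beta_1, beta_2)
    eps=0.95,
)
```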

To evaluate the performance of our proposed network on low-light image datasets, we conducted a visual analysis and compared it against other state-of-the-art methods: the traditional methods HE^{13} and Tone Mapping^{11}, the supervised method Retinex-Net^{6}, and the unsupervised methods RUAS^{29}, Zero-DCE^{30}, SCI^{33}, and RRDNet^{32}. We selected two paired datasets (LOL and LSRW) and three unpaired datasets (LIME, MEF, and NPE) for verification experiments to test their performance in image enhancement.

For quantitative evaluation, we use mean absolute error (MAE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM)^{35}, learned perceptual image patch similarity (LPIPS)^{36}, and natural image quality evaluator (NIQE)^{37} as metrics. The explicit definitions of these five evaluation metrics are given below.

MAE: The MAE is the sum of the absolute grayscale differences between the evaluated image and the original image at each pixel, divided by the size of the image. A smaller value indicates a smaller deviation from the original image and better image quality.

$$\begin{aligned} \begin{aligned} MAE=\frac{1}{M\times N}\sum _{x=1}^{M}\sum _{y=1}^{N}\left| g(x,y)-\hat{g}(x,y) \right| \end{aligned} \end{aligned}$$

(13)

*M* and *N* denote the number of pixels along the length and width of the image, respectively; *g*(*x*, *y*) and \(\hat{g}(x,y)\) are the gray-scale values at point (*x*, *y*) of the original image and the image to be evaluated, respectively.
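As a minimal illustration of Eq. (13), treating images as 2-D lists of gray-scale values:

```python
def mae(g, g_hat):
    """Mean absolute error between two equally sized grayscale images,
    given as 2-D lists of pixel values (Eq. 13)."""
    m, n = len(g), len(g[0])
    total = sum(abs(g[x][y] - g_hat[x][y])
                for x in range(m) for y in range(n))
    return total / (m * n)

# Toy 2x2 example: a deviation of 1 at every pixel gives MAE = 1.
print(mae([[0, 0], [0, 0]], [[1, 1], [1, 1]]))  # → 1.0
```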

PSNR: PSNR is a widely used image quality metric that measures the difference between two images based on their pixel-level differences. It is a full-reference metric, meaning it requires a reference image to evaluate the quality of a distorted or compressed image. Mathematically, the PSNR formula can be expressed as:

$$\begin{aligned} \begin{aligned} PSNR=10\times \log _{10}\frac{I_{max}^{2}}{MSE} \end{aligned} \end{aligned}$$

(14)

where *MSE* is the mean square error between the images and \(I_{max}\) is the maximum pixel value of the two images. The MSE is defined as:

$$\begin{aligned} MSE=\frac{1}{M\times N}\sum _{x=1}^{M}\sum _{y=1}^{N}{{\left( g(x,y)-\hat{g}(x,y) \right) }^{2}} \end{aligned}$$

(15)
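Eqs. (14) and (15) can be sketched together; \(I_{max}\) defaults to 255 for 8-bit images, and identical images would yield an infinite PSNR, so a real implementation guards against zero MSE:

```python
import math

def mse(g, g_hat):
    """Mean square error between two equally sized grayscale images (Eq. 15)."""
    m, n = len(g), len(g[0])
    return sum((g[x][y] - g_hat[x][y]) ** 2
               for x in range(m) for y in range(n)) / (m * n)

def psnr(g, g_hat, i_max=255.0):
    """Peak signal-to-noise ratio in dB (Eq. 14); higher is better.
    Assumes the two images differ somewhere (MSE > 0)."""
    return 10.0 * math.log10(i_max ** 2 / mse(g, g_hat))

# Toy example: a constant offset of 16 gives MSE = 256.
g = [[0, 0], [0, 0]]
g_hat = [[16, 16], [16, 16]]
print(round(psnr(g, g_hat), 2))  # ≈ 24.05 dB
```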

SSIM: SSIM is a useful metric for evaluating image quality in scenarios where a reference image is available and the perceptual features of the image are important. It is used to highlight differences in brightness, contrast, and structural similarity between two images. Its values range from 0 to 1, where values closer to 1 indicate a higher degree of similarity between the two images. Assuming x and y are two input images, the formula is:

$$\begin{aligned} SSIM={{(l(x,y))}^{\alpha }}{{(c(x,y))}^{\beta }}{{(s(x,y))}^{\gamma }} \end{aligned}$$

(16)

where *l*(*x*, *y*) is the brightness comparison, *c*(*x*, *y*) the contrast comparison, and *s*(*x*, *y*) the structure comparison. The exponents \(\alpha\), \(\beta\), and \(\gamma\) are all greater than 0 and adjust the relative weight of the three parts. *l*(*x*, *y*), *c*(*x*, *y*) and *s*(*x*, *y*) are given by the following equations, respectively.

$$\begin{aligned} l(x,y)=\frac{2{{\mu }_{x}}{{\mu }_{y}}+{{c}_{1}}}{\mu _{x}^{2}+\mu _{y}^{2}+{{c}_{1}}},\quad c(x,y)=\frac{2{{\sigma }_{x}}{{\sigma }_{y}}+{{c}_{2}}}{\sigma _{x}^{2}+\sigma _{y}^{2}+{{c}_{2}}},\quad s(x,y)=\frac{{{\sigma }_{xy}}+{{c}_{3}}}{{{\sigma }_{x}}{{\sigma }_{y}}+{{c}_{3}}} \end{aligned}$$

(17)

where \(\mu _x\) and \(\mu _y\) denote the means of the two images, \(\sigma _x\) and \(\sigma _y\) denote their standard deviations, and \(\sigma _{xy}\) denotes their covariance. The small constants \(c_1\), \(c_2\) and \(c_3\) avoid division by zero.
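As a sketch of Eqs. (16)–(17), the following computes SSIM from global image statistics, assuming the common choices \(\alpha =\beta =\gamma =1\) and \(c_3=c_2/2\), which collapse the three-factor product to the familiar two-factor form. Standard implementations instead average SSIM over local (e.g. 11 \(\times\) 11 Gaussian) windows:

```python
def ssim_global(x, y, l_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM (Eqs. 16-17) from global statistics, with
    alpha = beta = gamma = 1 and c3 = c2 / 2, giving the collapsed form
    ((2*mu_x*mu_y + c1)(2*sigma_xy + c2)) /
    ((mu_x^2 + mu_y^2 + c1)(sigma_x^2 + sigma_y^2 + c2))."""
    xs = [p for row in x for p in row]   # flatten 2-D image
    ys = [p for row in y for p in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((p - mu_x) ** 2 for p in xs) / n
    var_y = sum((p - mu_y) ** 2 for p in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    c1, c2 = (k1 * l_range) ** 2, (k2 * l_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Identical images give SSIM = 1, the maximum similarity.
```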

LPIPS: LPIPS is a deep-learning-based image quality assessment metric. For LPIPS, we use an AlexNet-based model to calculate perceptual similarity. The lower the LPIPS value, the closer the result is to its corresponding ground truth in terms of perceptual similarity.

Given a ground truth image reference block *x* and a noise-containing image distortion block \(x_0\), the perceptual similarity measure is formulated as follows:

$$\begin{aligned} \begin{aligned} d\left( x, x_0\right) =\sum _l \frac{1}{H_l W_l} \sum _{h, w}\left\| w_l \odot \left( \hat{y}_{h w}^l-\hat{y}_{0 h w}^l\right) \right\| _2^2 \end{aligned} \end{aligned}$$

(18)

where *d* is the distance between *x* and \(x_0\). The feature stacks \(\hat{y}_{h w}^l\) and \(\hat{y}_{0 h w}^l\) are extracted from layer *l* and unit-normalized in the channel dimension. The activations are scaled channel-wise by the vector \(w_l\), and the \(l_2\) distance is computed, averaged over the spatial dimensions, and summed over the layers.
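To make Eq. (18) concrete, the following sketch applies it to hand-made feature stacks. In practice the features come from a trained network (AlexNet in this paper), so the arrays, layer indices, and weights below are purely illustrative:

```python
import math

def unit_normalize(feat):
    """Normalize each spatial position's channel vector to unit length.
    feat is an H x W x C nested list."""
    out = []
    for row in feat:
        out_row = []
        for vec in row:
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            out_row.append([v / norm for v in vec])
        out.append(out_row)
    return out

def lpips_distance(feats_x, feats_x0, weights):
    """Eq. (18) on precomputed per-layer feature stacks.  feats_* map a
    layer index l to an H_l x W_l x C nested list; weights[l] is the
    channel-scaling vector w_l.  Real LPIPS extracts these features with
    a trained CNN; here they are plain arrays."""
    d = 0.0
    for l, fx in feats_x.items():
        fx, fx0 = unit_normalize(fx), unit_normalize(feats_x0[l])
        h, w = len(fx), len(fx[0])
        layer_sum = 0.0
        for i in range(h):
            for j in range(w):
                # Squared l2 distance over channels, scaled by w_l.
                layer_sum += sum((wl * (a - b)) ** 2
                                 for wl, a, b in zip(weights[l], fx[i][j], fx0[i][j]))
        d += layer_sum / (h * w)  # spatial average, summed over layers
    return d
```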

NIQE: NIQE is a no-reference image quality metric that measures the perceptual quality of natural images. It is based on the hypothesis that natural images have certain statistical properties correlated with their perceptual quality, including texture richness, edge sharpness, and colorfulness, among others. The NIQE index is computed as the distance between the multivariate Gaussian (MVG) model of the input image and that of natural images; the lower the NIQE value, the better the image quality. Mathematically, the NIQE formula is:

$$\begin{aligned} NIQE=D\left( v_1, v_2, m_1, m_2\right) =\sqrt{\left( \left( v_1-v_2\right) ^T\left( \frac{m_1+m_2}{2}\right) ^{-1}\left( v_1-v_2\right) \right) } \end{aligned}$$

(19)

where \(v_1\) and \(m_1\) denote the mean vector and covariance matrix of the natural-image MVG model, and \(v_2\) and \(m_2\) denote those of the distorted image's MVG model.
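The distance in Eq. (19) can be sketched as follows for 2-D feature vectors. A real NIQE implementation works with much higher-dimensional natural scene statistics features and a linear solver rather than an explicit inverse; the vectors and matrices here are illustrative only:

```python
def niqe_distance(v1, v2, m1, m2):
    """Eq. (19) for 2-D feature vectors: the Mahalanobis-like distance
    between the natural-image MVG (v1, m1) and the distorted-image MVG
    (v2, m2).  The 2x2 inverse is written out explicitly for clarity."""
    # Average the two covariance matrices: (m1 + m2) / 2.
    s = [[(m1[i][j] + m2[i][j]) / 2.0 for j in range(2)] for i in range(2)]
    # Explicit 2x2 matrix inverse.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # Quadratic form (v1 - v2)^T * inv * (v1 - v2).
    d = [v1[0] - v2[0], v1[1] - v2[1]]
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return quad ** 0.5
```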

By using diverse datasets and evaluating multiple metrics, we obtained a comprehensive evaluation of our algorithm’s performance in enhancing low-light images across various scenarios.

### Subjective visual evaluation

Figures 6 and 7 show representative visual comparisons of the various algorithms on the LOL and LSRW datasets, respectively. In Fig. 6, the enhanced results show that HE can significantly increase the brightness of low-light images; however, it applies contrast enhancement to each RGB channel separately, causing color distortion. Retinex-Net significantly improves the visual quality of low-light images, but it overly smooths details, amplifies noise, and even causes color deviation. Tone Mapping can stretch the dynamic range of the image, but its enhancement of the grandstand seating section in the image remains insufficient. Although RUAS produces delicate results without obvious noise interference, it fails to brighten extremely dark areas (such as the central seating area). SCI and RRD-Net perform poorly on darker images and cannot effectively enhance low-light images. Zero-DCE preserves image details relatively completely, but its brightness enhancement is not obvious and it noticeably reduces the color contrast of the image. From Fig. 7, it can be seen that HE introduces obvious image and color distortion; Retinex-Net amplifies inherent noise, losing image details; SCI, Zero-DCE, and RRD-Net have weak brightness enhancement capabilities; and Tone Mapping, RUAS, and our method perform extremely well in terms of brightness and color. Compared to the ground truth, our approach not only significantly enhances image brightness but also effectively preserves the colors and intricate details of the images to a considerable extent, thereby improving overall image quality. This can be attributed to the inherent mechanism of our model: it employs a multi-stage strategy for brightness adjustment, enabling robust estimation of an appropriate illumination map.
Concurrently, we optimize the reflectance map to enhance image details and contrast. By thoughtfully integrating these two components, we ensure that the resulting enhanced images not only exhibit improved visibility but also faithfully reproduce the characteristics of the original scenes.

To comprehensively evaluate various algorithms, we also selected three unpaired benchmarks (LIME, MEF, NPE) for verification experiments. In Figs. 8, 9 and 10, we show the visual contrast effects produced by these cutting-edge methods under various benchmarks. From these enhancement results, it can be seen that HE greatly improves the contrast of the image, but there is also a significant color shift phenomenon. Retinex-Net introduces visually unsatisfactory artifacts and noise. Tone Mapping and RRD-Net can preserve image details, but the overall enhancement strength is not significant, and they fail to effectively enhance local dark areas. RUAS and SCI can effectively enhance low-contrast images, but during the enhancement process, they tend to excessively enhance originally bright areas, such as the sky and clouds in Figs. 8, 9, and 10, which are replaced by an overly enhanced whitish tone. Among all the methods, Zero-DCE and our proposed method perform well on these three benchmarks, effectively enhancing image contrast while maintaining color balance and detail clarity.

### Objective evaluation

In addition to the subjective visual evaluation, we quantitatively compare the algorithms using the image quality metrics defined above to further illustrate the effectiveness of the method in this paper.

In this study, we evaluated the proposed method and seven other representative methods on the LOL and LSRW paired datasets. Table 1 summarizes the average MAE, PSNR, SSIM, and LPIPS scores on these two common datasets. In terms of evaluation metrics, higher PSNR and SSIM values indicate better image quality, whereas lower MAE, LPIPS, and NIQE values indicate better quality. The table shows that no single method achieves the best value on every image quality metric. However, our method outperforms the others in several respects: on the LOL dataset, it achieves the best PSNR and ranks second on LPIPS among the compared methods, and on the LSRW dataset, it performs best on all metrics except LPIPS.

In addition, we evaluated these datasets using the no-reference image quality evaluator (NIQE), as shown in Table 2. Except for Zero-DCE, which achieved the best score on some datasets, our NIQE scores outperform most of the other methods. Overall, Tables 1 and 2 provide strong evidence for the effectiveness and applicability of our proposed method.