Leveraging the Availability of Two Cameras for Illuminant Estimation
Abdelrahman Abdelhamed, Abhijith Punnappurath, Michael S. Brown
Samsung AI Center – Toronto
{a.abdelhamed,abhijith.p,michael.b1}@samsung.com
Abstract
Most modern smartphones are now equipped with two
rear-facing cameras – a main camera for standard imaging
and an additional camera to provide wide-angle or telephoto
zoom capabilities. In this paper, we leverage the
availability of these two cameras for the task of illumination
estimation using a small neural network to perform the illumination
prediction. Specifically, if the two cameras’ sensors
have different spectral sensitivities, the two images provide
different spectral measurements of the physical scene.
A linear 3×3 color transform that maps between these two
observations – and that is unique to a given scene illuminant
– can be used to train a lightweight neural network
comprising no more than 1460 parameters to predict the
scene illumination. We demonstrate that this two-camera
approach with a lightweight network provides results on par
or better than much more complicated illuminant estimation
methods operating on a single image. We validate our
method’s effectiveness through extensive experiments on radiometric
data, a quasi-real two-camera dataset we generated
from an existing single camera dataset, as well as
a new real image dataset that we captured using a smartphone
with two rear-facing cameras.

  1. Introduction
    An overwhelming percentage of consumer photographs
    are currently captured using smartphone cameras. A recent
    trend in smartphone imaging system design is to employ
    two (or more) rear-facing cameras to ameliorate the limitations
    imposed by the smartphone compact form factor.
    In most cases, the two rear-facing cameras have different
    focal lengths and lens configurations to allow the smartphone
to deliver DSLR-like optical capabilities (i.e., wide-angle
and telephoto). In addition, the two-camera setup has
    been leveraged for applications such as synthetic bokeh effect
    [48] and reflection removal [40]. Given the utility of the
    two-camera configuration, this design trend is likely to continue
    for the foreseeable future. In this work, we show that
the two-camera setup has another benefit, that of improving the accuracy of illuminant estimation.

Figure 1: (A) Most modern smartphones use two rear-facing cameras. Typically, the spectral characteristics of these two cameras' sensors are slightly different. (B) Thus, a two-camera system furnishes two different measurements of the scene being imaged. Our proposed two-camera algorithm harnesses this extra information for more accurate and efficient illuminant estimation.
    Illuminant estimation is the most critical step for computational
    color constancy. Color constancy refers to the
    ability of the human visual system to perceive scene colors
    as being the same even when observed under different
    illuminations [39]. Cameras do not innately possess this illumination
    adaptation ability; the raw-RGB image recorded
    by the camera sensor has significant color cast due to the
    scene’s illumination. As a result, computational color constancy
    is applied to the camera’s raw-RGB sensor image as
    one of the first steps in the in-camera imaging pipeline to
    remove this undesirable color cast. The main goal of the
    camera’s auto-white-balance (AWB) module, which is motivated
    by the concept of computational color constancy, is
    illuminant estimation. AWB involves estimating the scene
    illumination in the sensor’s raw-RGB color space and then
    applying a simple 3×3 diagonal matrix computed directly
    from the estimated illumination parameters to perform the
    white-balance correction. Thus, accurate estimation of the
    scene illumination is crucial to ensuring correct scene colors
    in the camera image.
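To make the AWB step concrete, the sketch below shows how such a diagonal correction could be applied once an illuminant estimate is available. This is an illustrative example only; the function name and the green-channel normalization convention are assumptions, not code from this work.

```python
import numpy as np

def apply_white_balance(raw_rgb, illuminant):
    """Apply a diagonal white-balance correction to a raw-RGB image.

    raw_rgb:    H x W x 3 array of linear sensor values in [0, 1].
    illuminant: length-3 estimate of the scene illuminant (R, G, B)
                in the sensor's raw-RGB color space.
    """
    r, g, b = illuminant
    # Scale each channel so the estimated illuminant maps to neutral,
    # normalizing by the green channel (a common convention).
    gains = np.array([g / r, 1.0, g / b])
    return np.clip(raw_rgb * gains, 0.0, 1.0)
```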
    We demonstrate that two-camera systems have the potential
    to provide more accurate illuminant estimation compared
    to existing single-camera methods. A key insight is
that the spectral characteristics of the main camera's sensor
are typically different from those of the second camera's sensor.
    This is due to a variety of reasons. For example, the pitch
    of the photodiodes and overall resolution of the two sensors
    are often different to accommodate the different optics associated
    with each sensor. These differences impact which
    color filter arrays (CFA) manufacturers can use in the sensor’s
    production process. This results in the two CFAs having
    different spectral sensitivities to incoming light. While
    on the surface this may appear to be a disadvantage, differences
    in the CFA between the two cameras can be corrected
    for by the later stages of the camera imaging pipeline to ensure
    the final output colors appear the same (e.g., see [36]).
    However, for our purpose, the sensors’ unprocessed raw images
    effectively provide different spectral measurements of
    the underlying scene. It is this complementary information
    that allows us to design a two-camera illumination estimation
    algorithm as shown in Fig. 1.
    Contribution We propose to train a neural network for illuminant
    estimation that receives as input a 3×3 matrix
    computed between the two cameras’ raw sensor images simultaneously
    capturing the same scene. Prior work [21] has
    shown that the color transformation between different spectral
    samples of the same scene has a unique signature that
    is related to the scene illumination. This allows the color
    transformation itself to be used as the feature for illumination
    estimation. Thus, in contrast to existing single-camera
    illumination estimation methods that train their deep networks
    directly on image data, or on image histograms, our
    network needs to examine only nine parameters in the color
    transformation matrix. As a result, we can train a very
    lightweight neural network comprising just 1460 parameters
    that can be efficiently run on-device in real time. We
    test our proposed approach extensively with experiments
    on radiometric data, a quasi-real two-camera dataset we
    generated from an existing single-camera color constancy
    dataset [16], and finally on a real two-camera dataset that we
    captured using a Samsung S20 Ultra smartphone. We compare
our technique against several state-of-the-art single-image
illuminant estimation methods and demonstrate on
    par or even improved performance.
  2. Related work
    We survey works on computational color constancy.
    These algorithms can be broadly categorized into (1)
    statistics-based and (2) learning-based methods. While
    early learning-based approaches used hand-crafted features,
    more recent works employ deep neural networks.
    Statistics-based methods operate using statistics from an
    image’s color distribution and spatial layout to estimate the
    scene illuminant. Representative examples include gray
    world [15], general gray world [6], gray edges [46], shades
    of gray [24], white patch [14], bright pixels [35], and
    PCA [16]. These methods are fast and easy to implement;
    however, they make very strong assumptions about scene
    content and fail in cases where these assumptions do not
    hold.
    Learning-based methods use labelled training data where
    the ground truth illumination corresponding to each input
    image is known from physical color charts placed in the
scene. In general, learning-based approaches are shown
to be more accurate than statistics-based methods. However,
    learning-based methods usually include many more
    parameters than statistics-based ones; their number could
    reach up to tens of millions in some models (e.g., [10])
    and they typically have relatively longer training time.
    Representative learning-based approaches include Bayesian
    methods [13, 26, 44], gamut-based methods [23, 25, 28],
    exemplar-based methods [5, 27, 34], and bias-correction
    methods [4, 18, 19]. While early learning-based methods
    used hand-crafted features, more recently, deep neural
    networks (DNN) have demonstrated superior performance
    [32, 38, 41, 45, 11, 12, 43, 10, 31, 47, 3, 8, 9]. It
    is important to note that the aforementioned methods are
designed to work with a single image captured using a three-channel
sensor. Work by [42] explored the idea of adding
    an additional color channel, but in the context of resolving
    scene metamerism. The approach in this paper is based on
a pair of images of the same scene captured using a two-camera
system.
    Our approach is inspired by the chromagenic color constancy
    technique of Finlayson et al. [22, 21, 17]. The chromagenic
    approach showed that the parameters of a 3 × 3
    linear transform that relates the color values of a scene captured
    with different spectral sensitivities are correlated with
    the scene’s illumination. The chromagenic approach used
two images captured from the same sensor, but with a color
filter applied for one of the two captures; however, two sensors
    with different spectral sensitivities could also be used.
    Classification of the scene illumination was performed using
    a set of pre-selected illuminants via a nearest-neighbour
    search operation. We build on this method and integrate it
    into a modern smartphone design with two cameras with
    different fields of view. Furthermore, we combine it with
    the power of neural networks to regress over the space of
    illuminations.
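To make the chromagenic idea concrete, the following is an illustrative sketch of nearest-neighbour illuminant selection over a bank of pre-computed transforms for pre-selected illuminants; the array layouts and names are assumptions, not the original implementation.

```python
import numpy as np

def chromagenic_classify(T_query, T_bank, illuminant_bank):
    """Pick the pre-selected illuminant whose stored 3x3 transform is
    closest (Frobenius distance) to the transform observed for the scene.

    T_query:         3 x 3 transform measured for the current scene.
    T_bank:          K x 3 x 3 transforms pre-computed for K known illuminants.
    illuminant_bank: K x 3 corresponding illuminant colors.
    """
    dists = np.linalg.norm(T_bank - T_query, axis=(1, 2))
    return illuminant_bank[np.argmin(dists)]
```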
    Figure 2: An overview of our proposed two-camera illuminant estimation algorithm. We compute a linear 3×3 transform
    matrix T that maps the downsampled raw-RGB image from the main camera to the corresponding aligned and downsampled
    raw-RGB image from the second camera. For a particular scene illuminant, this color transformation T is unique [21]. We
    feed this mapping T as input to a small lightweight neural network. The network predicts a 2D [R/G B/G] chromaticity value
    that corresponds to the illuminant estimate of the main camera.
  3. Two-camera illuminant estimation
In this section, we describe the various steps of our two-camera
illuminant estimation algorithm – spatially aligning
    the image pairs (Section 3.1); computing color transforms
    between them (Section 3.2); constructing our two-camera
    illuminant estimation network (Section 3.3); and augmenting
    our training data (Section 3.4).
    3.1. Image spatial alignment
    Our method is based on computing a color transform between
    a pair of images captured using a two-camera system.
    These two images usually have different views and need to
    be registered before computing the color transform. In our
    experiments on real images, we found that a global homography
    is sufficient for image alignment. We downsample
    the images by a factor of six prior to computing the color
    transform, and this makes our method robust to any small
    misalignments and slight parallax in the two views. Moreover,
    since the hardware arrangement of the two cameras
    does not change for a given device, the homography can be
    pre-computed and remains fixed for all image pairs from the
    same device.
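An illustrative sketch of this alignment and downsampling step is shown below, assuming OpenCV and a homography calibrated once per device; the function and parameter names are ours for illustration.

```python
import cv2

def align_and_downsample(img1, img2, H_2_to_1, scale=1.0 / 6.0):
    """Warp the second camera's image onto the first using a pre-computed
    homography, then downsample both images before fitting the color transform.

    img1, img2: raw-RGB images (H x W x 3).
    H_2_to_1:   3x3 homography mapping camera-2 pixels to camera-1 pixels.
    scale:      downsampling factor (a factor of six is used in Section 3.1).
    """
    h, w = img1.shape[:2]
    img2_warped = cv2.warpPerspective(img2, H_2_to_1, (w, h))
    small1 = cv2.resize(img1, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    small2 = cv2.resize(img2_warped, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return small1, small2
```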
    3.2. Color transforms for image pairs
Given two raw-RGB images $I_1 \in \mathbb{R}^{n \times 3}$ and $I_2 \in \mathbb{R}^{n \times 3}$ with $n$ pixels of the same scene captured by two different sensors or cameras, under the same illumination $L \in \mathbb{R}^3$, there exists a linear transformation $T \in \mathbb{R}^{3 \times 3}$ between the color values of the two images as

$$I_2 = I_1 T, \qquad (1)$$

such that $T$ is unique to the scene illumination $L$ [22, 21]. Despite Equation 1 being an approximation, for simplicity, we will use the equality sign instead. We first spatially align the two images using the pre-computed homography, downsample them, and then compute $T$ using the pseudo-inverse as follows:

$$T = \left(I_1^\top I_1\right)^{-1} I_1^\top I_2. \qquad (2)$$
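Equation 2 is an ordinary least-squares fit of $T$; an illustrative NumPy sketch (not the released code) is:

```python
import numpy as np

def compute_color_transform(I1, I2):
    """Fit the 3x3 transform T such that I2 ~= I1 @ T (Equation 2).

    I1, I2: n x 3 arrays of corresponding raw-RGB values from the two
            aligned, downsampled images.
    """
    # lstsq solves the same normal equations as (I1^T I1)^(-1) I1^T I2,
    # but is numerically more stable than forming the inverse explicitly.
    T, *_ = np.linalg.lstsq(I1, I2, rcond=None)
    return T  # 3 x 3
```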
3.3. Two-camera illuminant estimation network
Given a dataset of $M$ image pairs

$$\mathcal{I} = \{(I_{1_1}, I_{2_1}), \ldots, (I_{1_M}, I_{2_M})\}, \qquad (3)$$

we compute the corresponding color transformations between each pair of images using Equation 2:

$$\mathcal{T} = \{T_1, \ldots, T_M\}. \qquad (4)$$

Given the set of corresponding target ground truth illuminants of $I_{1_i}$ (i.e., as measured by the first camera) from each pair

$$\mathcal{L} = \{L_1, \ldots, L_M\}, \qquad (5)$$

we can train a neural network $f_\theta : \mathcal{T} \rightarrow \mathcal{L}$, with parameters $\theta$, to model the mapping between the color transforms $\mathcal{T}$ and scene illuminations $\mathcal{L}$. Then, $f_\theta$ can be used to predict the scene illumination for the main camera given the color transform between the two images:

$$\hat{L} = f_\theta(T). \qquad (6)$$
    Without loss of generality, our method can be trained to
    predict the illuminant for the second camera as well, using
    the same color transforms; however, for simplicity, we focus
    on estimating the illumination for the main camera only.
    We train our network by minimizing the L1 loss between
    the predicted illuminants and the ground truth:
$$\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M} \left\| \hat{L}_i - L_i \right\|_1 . \qquad (7)$$
Our network of choice is lightweight, consisting of a
small number (e.g., 2, 5, or 16) of dense layers; each layer
has nine neurons only. The total number of parameters
ranges from 200 for the 2-layer architecture up to 1460 parameters for the 16-layer network. The input to the network is the flattened nine values of the color transform T and the output is two values corresponding to the illumination estimate in the 2D [R/G B/G] chromaticity color space, where the green channel's value is always set to 1. An overview of our method is provided in Fig. 2.

Figure 3: Our image illumination augmentation method. Given a pair of images, we re-illuminate them by each other's illumination based on 3×3 color transformations between their color chart values. This figure shows augmentation of an image pair from one camera only. The corresponding pair of images from the second camera is augmented in the same way. The images shown are in demosaiced raw-RGB format with gamma correction for better visualization.
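As a concrete illustration of this architecture, the sketch below builds the 9-input, 2-output fully connected network in PyTorch. It is an assumption-laden example: the ReLU activations and training-loop details are not specified above, and the class and variable names are ours.

```python
import torch
import torch.nn as nn

class TwoCameraIlluminantNet(nn.Module):
    """Tiny MLP: nine transform entries in, 2D [R/G, B/G] illuminant out."""

    def __init__(self, n_hidden=2):
        super().__init__()
        layers = []
        for _ in range(n_hidden):               # 2, 5, or 16 hidden layers of 9 units each
            layers += [nn.Linear(9, 9), nn.ReLU()]
        layers.append(nn.Linear(9, 2))          # output: [R/G, B/G] chromaticity
        self.net = nn.Sequential(*layers)

    def forward(self, t_flat):                  # t_flat: (batch, 9) flattened transform T
        return self.net(t_flat)

model = TwoCameraIlluminantNet(n_hidden=2)
# 2 hidden layers -> 200 parameters; 5 -> 470; 16 -> 1460.
assert sum(p.numel() for p in model.parameters()) == 200

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()                           # Equation 7
```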
3.4. Data augmentation
Due to the lack of large datasets of image pairs captured
with two cameras under the same illumination, and to increase
the number of training samples and the generalizability
of our model, we propose to augment the training
images as follows. Given a small dataset of raw-RGB image
pairs captured with two cameras and including color
rendition charts, we extract the color values of the 24 color
chart patches, $C \in \mathbb{R}^{24 \times 3}$, from each image. Then, we compute an accurate color transformation, $T_C \in \mathbb{R}^{3 \times 3}$, between each pair of images from the main camera $(I_{1_i}, I_{1_j})$ based only on the color chart values from the two images as

$$T_C^{1_i \rightarrow 1_j} = \left(I_{1_i}^\top I_{1_i}\right)^{-1} I_{1_i}^\top I_{1_j}, \qquad (8)$$

and similarly, for image pairs from the second camera $(I_{2_i}, I_{2_j})$, as

$$T_C^{2_i \rightarrow 2_j} = \left(I_{2_i}^\top I_{2_i}\right)^{-1} I_{2_i}^\top I_{2_j}. \qquad (9)$$
Figure 4: Three samples from our radiometric dataset. Each
pair shows images from the main (left) and second (right)
cameras. The ground truth illumination colors are presented
in the bottom row.
Next, we use this bank of color transformations to augment our images by re-illuminating any given pair of images from the two cameras $(I_{1_i}, I_{2_i})$ to match their colors to any target pair of images $(I_{1_j}, I_{2_j})$, as follows:

$$I_{1_{i \rightarrow j}} = I_{1_i} \, T_C^{1_i \rightarrow 1_j}, \qquad (10)$$

$$I_{2_{i \rightarrow j}} = I_{2_i} \, T_C^{2_i \rightarrow 2_j}, \qquad (11)$$

where $i \rightarrow j$ means re-illuminating image $i$ to match the colors of image $j$. Using this illuminant augmentation method, we can increase the number of training image pairs from $M$ to $M^2$. Fig. 3 illustrates an example of re-illuminating a pair of images given another target pair of images.
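An illustrative sketch of this augmentation step is given below, assuming the 24 chart-patch values and the images are available as NumPy arrays; the helper names are ours.

```python
import numpy as np

def chart_transform(chart_i, chart_j):
    """3x3 transform fitted on the 24 color-chart patches (Equations 8-9)."""
    T, *_ = np.linalg.lstsq(chart_i, chart_j, rcond=None)
    return T

def reilluminate_pair(img1_i, img2_i, T1_ij, T2_ij):
    """Re-illuminate pair i so its colors match target pair j (Equations 10-11).

    img1_i, img2_i: H x W x 3 raw-RGB images from cameras 1 and 2.
    T1_ij, T2_ij:   chart-based transforms mapping pair i's colors to pair j's.
    """
    img1_aug = (img1_i.reshape(-1, 3) @ T1_ij).reshape(img1_i.shape)
    img2_aug = (img2_i.reshape(-1, 3) @ T2_ij).reshape(img2_i.shape)
    return np.clip(img1_aug, 0, 1), np.clip(img2_aug, 0, 1)
```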
4. Experiments
To train our two-camera illuminant estimation network,
we need a dataset of image pairs of the same scene captured
with two different cameras under the same illumination.
To our knowledge, there are no publicly available
image datasets for color constancy captured using a two-camera
system containing labelled ground truth illumination.
To validate our method, we first present a synthetic
radiometric dataset in Section 4.1. Next, in Section 4.2, we
describe how to generate a quasi-real two-camera dataset
from an existing single-camera color constancy dataset. Finally,
we evaluate our method on a real two-camera image
dataset that we captured using a Samsung S20 Ultra smartphone,
in Section 4.3.
4.1. Radiometric dataset
To evaluate our method, we generate a synthetic dataset
from radiometric data. According to the image formation
model, the sensor response is the product of the scene illumination,
the surface reflectance, and the sensor’s spectral
sensitivity, integrated over the visible spectrum. For
data generation, we adopt the experimental procedure proposed
in [6]. In particular, a scene illuminant and a random
set of surface reflectances are selected from a hyperspectral
dataset of lights and surfaces [7]. Two different camera sensors
with different spectral sensitivity functions are chosen
from the camera spectral sensitivity dataset of [33]. The
RGB responses for both sensors can then be calculated by simple numerical integration. The response induced by a pure reflector is treated as the corresponding ground truth. The advantage of this procedure is that it is easy to generate a large amount of labelled data to evaluate color constancy algorithms, and arrive at statistically meaningful performance measures.

Figure 5: Plots of ground truth illuminants (in the [R/G, B/G] chromaticity plane) for the two cameras for our (A) radiometric dataset, (B) quasi-real NUS dataset, and (C) real dataset.
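An illustrative sketch of this synthesis procedure, with assumed array shapes for the sampled spectral data:

```python
import numpy as np

def synthesize_responses(illuminant_spd, reflectances, sensitivity):
    """Raw-RGB responses from spectral data via numerical integration.

    illuminant_spd: (W,)   illuminant power at W sampled wavelengths.
    reflectances:   (S, W) reflectance spectra of S surfaces.
    sensitivity:    (W, 3) R, G, B spectral sensitivities of one sensor.

    Returns (S, 3) raw-RGB responses and the (3,) response of a pure
    reflector, which serves as the ground-truth illuminant color.
    """
    radiance = reflectances * illuminant_spd      # light reflected by each surface
    rgb = radiance @ sensitivity                  # Riemann-sum integration over wavelength
    ground_truth = illuminant_spd @ sensitivity   # pure (unit) reflector response
    return rgb, ground_truth
```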
The reflectance set of [7] consists of 1995 hyperspectral
surface reflectance measurements of various natural objects,
color charts, and so forth. The dataset of [7] also contains
87 different measured or synthesized illuminant spectra.
The camera spectral sensitivity dataset of [33] contains
the spectral sensitivity functions for 28 cameras, including
mobile phone cameras. We select two sensors from this set
to serve as our main camera and the second camera. To generate
images, we choose a scene illumination and 24 different
surfaces at random, and synthesize the raw-RGB sensor
responses for both cameras. We generate thumbnail images
of size 32×48 pixels. A few representative examples with
associated ground truth are shown in Fig. 4. In total, we
generate 18,000 pairs; 10,800 (60%) for training, and 3,600 (20%) each for validation and testing.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 4.09 | 3.68 | 1.36 | 7.51 | 2.21 | 5.56
SoG [24] | 4.56 | 4.11 | 1.51 | 8.41 | 2.43 | 6.21
GE-1 [46] | 5.20 | 4.64 | 1.65 | 9.62 | 2.76 | 7.18
GE-2 [46] | 5.41 | 4.69 | 1.72 | 10.25 | 2.83 | 7.37
WGE [29] | 4.14 | 3.25 | 1.13 | 8.72 | 1.82 | 5.41
PCA [16] | 4.55 | 3.09 | 1.03 | 10.67 | 1.68 | 5.85
WP [14] | 5.49 | 4.96 | 1.83 | 10.02 | 2.94 | 7.48
Gamut Pixel [28] | 3.68 | 3.07 | 1.05 | 7.30 | 1.70 | 5.10
Gamut Edge [28] | 6.09 | 5.34 | 1.95 | 11.49 | 3.15 | 8.42
Ours (200 params) | 2.80 | 2.20 | 0.72 | 5.87 | 1.19 | 3.81
Ours (470 params) | 2.65 | 2.00 | 0.64 | 5.72 | 1.07 | 3.61

Table 1: Angular errors (degrees) on our radiometric dataset. B and W stand for best and worst, while Q1 and Q3 denote the first and third quartile, respectively. Best results are in bold.

A plot of the distribution
of ground truth illuminants corresponding to the two
cameras for 200 random samples is shown in Fig. 5(A). It is
evident from the separation between the scatter points corresponding
to the two cameras that the same illumination
induces very different raw responses in the two sensors owing
to the difference in their spectral sensitivity functions.
For this experiment, we skip the alignment and downsampling
steps of Section 3.1 since there is no misalignment,
and compute our color transform for each pair from
the 24 correspondences. We also omit the data augmentation
procedure described in Section 3.4 since we have sufficient
training examples. We use the Adam [37] optimizer with a learning rate of $10^{-4}$. We train our network for 1 million epochs. The training process takes about 10 hours on a 32 GB NVIDIA Tesla V100 GPU. Table 1 reports statistics
of the angular errors [20] obtained by our method, along
with comparisons. The results of comparison methods were
computed using open source codes downloaded from [1]
or from the authors’ webpages. Note that all comparison
algorithms are single-image methods, and therefore were
given the image from the main camera alone as input. For
this experiment, we omit comparisons against deep learning
methods since they are typically trained on natural images,
whereas our images resemble color checker patches
only. From Table 1, we can observe that our method performs
better than well-established single-image illuminant
estimation methods. Note that although we show illuminant
estimation results only for the main camera for simplicity of
comparison, our method, without loss of generality, can be
used to predict the scene illuminant for the other camera.
Please see the supplementary material for more details.
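For reference, the angular error [20] reported in the tables can be computed as follows; lifting a predicted [R/G, B/G] chromaticity back to an RGB vector with G = 1 is shown as well (an illustrative sketch, not code from this work):

```python
import numpy as np

def chroma_to_rgb(rg_bg):
    """Lift a predicted [R/G, B/G] chromaticity to an RGB vector with G = 1."""
    r_over_g, b_over_g = rg_bg
    return np.array([r_over_g, 1.0, b_over_g])

def angular_error_deg(est_rgb, gt_rgb):
    """Recovery angular error (degrees) between two illuminant vectors."""
    est = np.asarray(est_rgb, dtype=float)
    gt = np.asarray(gt_rgb, dtype=float)
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```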
4.2. Quasi-real NUS dataset
In this section, we go a step further beyond synthetic data
towards a more real dataset. In particular, we describe a
procedure to generate a quasi-real two-camera dataset from
an existing single-camera color constancy dataset. Towards
this goal, we select the NUS [16] dataset, which has images
of the same scene mostly under the same illumination captured using different cameras.

Figure 6: Sample matched pairs from the NUS dataset that we use to generate our quasi-real dataset.

Figure 7: Our method for generating spatially aligned two-camera image pairs from the NUS dataset. A color transform is used to map the image's colors from one camera to the other.

We choose the Nikon D5200
as the main camera while the Canon 1Ds Mark III serves as
the second camera. We select only those images where the
two cameras are observing the same scene with no visible
changes in the illumination. After filtering, we obtain 195
matched pairs from the two cameras. All images in the NUS
dataset have a Macbeth color chart placed in the scene. The
ground truth scene illumination can be obtained from the
achromatic patches in the color chart. A plot of the ground
truth illuminants for the 195 images from the two cameras
is shown in the plot of Fig. 5(B), and it can be observed that
the two sensors record different measurements for the same
illumination. A few representative examples of matched image
pairs from the two cameras are shown in Fig. 6. Notice
that some pairs have a significant change in viewpoint although
the scene is the same. Therefore, we preprocess the
data to generate our quasi-real dataset, as described next.
For each image pair (I1i, I2i) from the two cameras, we
first compute an accurate 3×3 transform $T_M^{1_i \rightarrow 2_i}$ that maps
the raw-RGB image from the main camera to the second
camera using only the 24 correspondences from the color
checker patches. Next, we apply this color transform on the
main camera image to synthesize a new second camera image.
This procedure is shown in Fig. 7. These two spatially
aligned images constitute a pair in our quasi-real dataset.
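An illustrative sketch of this pair-generation step (variable names are assumptions):

```python
import numpy as np

def make_quasi_real_pair(img_main, chart_main, chart_second):
    """Synthesize a spatially aligned second-camera image from a main-camera image.

    img_main:     H x W x 3 raw-RGB image from the main camera (Nikon D5200).
    chart_main:   24 x 3 color-chart patch values from the main-camera image.
    chart_second: 24 x 3 chart values from the matched second-camera image
                  (Canon 1Ds Mark III).
    """
    # Fit the 3x3 chart-based transform T_M (main camera -> second camera).
    T_M, *_ = np.linalg.lstsq(chart_main, chart_second, rcond=None)
    img_second = np.clip(img_main.reshape(-1, 3) @ T_M, 0, None).reshape(img_main.shape)
    return img_main, img_second
```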
We use a standard three-fold cross validation protocol
to evaluate performance. For learning-based methods, including
our own approach, we augment the training folds
using the procedure described in Section 3.4. Testing is
performed on the original unaugmented set. In particular,
for each image pair, we generate another 99 randomly re-illuminated image pairs to obtain a total of 19,500 pairs. The
color chart is then masked out in all training, validation, and
testing images. The results of our method, along with comparisons,
are presented in Table 2. In addition to several
classical methods, we also test against the recent learning
approaches of [3, 10, 32, 9]. For all four learning methods,
publicly available implementations provided by the authors
were used to report results. The method of [3] is sensor-independent, and does not require re-training. The quasi-unsupervised color constancy algorithm of [10], while inherently
sensor-agnostic, can be fine-tuned if annotated training
data is available. In Table 2, we report results both without
and with fine-tuning, using the pre-trained models made
available by the authors. For the fine-tuned result, we selected
the appropriate pre-trained model for testing based on
the three-fold partitioning indices of the NUS dataset used
by the authors. For FC4 [32], we trained the model from
scratch using the hyperparameters recommended by the authors. For FFCC [9], the hyperparameters were carefully tuned to achieve the best performance. In the literature, FC4 and FFCC are currently the best-performing methods across all color constancy datasets, including NUS. It can be observed that our model with 1460 parameters outperforms both FC4 and FFCC, as well as other competitors.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 4.43 | 3.42 | 0.90 | 9.82 | 1.54 | 6.11
SoG [24] | 3.31 | 2.63 | 0.70 | 7.20 | 1.18 | 4.17
GE-1 [46] | 4.49 | 3.03 | 0.87 | 10.38 | 1.40 | 6.34
GE-2 [46] | 4.99 | 3.28 | 0.94 | 11.83 | 1.54 | 6.65
WGE [29] | 5.77 | 3.11 | 0.77 | 14.75 | 1.38 | 7.89
PCA [16] | 4.01 | 2.68 | 0.69 | 9.20 | 1.22 | 6.07
WP [14] | 4.49 | 3.47 | 0.93 | 9.99 | 1.42 | 6.09
Gamut Pixel [28] | 5.99 | 3.70 | 0.90 | 14.95 | 1.41 | 8.65
Gamut Edge [28] | 4.99 | 3.38 | 0.85 | 11.63 | 1.72 | 7.22
CM [18] | 2.80 | 2.09 | 0.66 | 6.12 | 1.21 | 3.67
Homography [19] (SoG) | 2.70 | 1.95 | 0.69 | 5.88 | 1.06 | 3.71
Homography [19] (PCA) | 2.97 | 2.16 | 0.72 | 6.47 | 1.14 | 4.22
APAP [4] (GW) | 2.64 | 2.00 | 0.60 | 5.99 | 1.02 | 3.26
APAP [4] (SoG) | 2.49 | 1.75 | 0.60 | 5.61 | 0.88 | 3.14
APAP [4] (PCA) | 2.77 | 1.83 | 0.60 | 6.45 | 0.94 | 3.49
SIIE [3] | 2.04 | 1.55 | 0.51 | 4.41 | 0.80 | 2.80
Quasi U CC [10] | 3.57 | 2.77 | 0.62 | 8.04 | 1.09 | 5.06
Quasi U CC finetuned [10] | 2.68 | 1.72 | 0.57 | 6.25 | 0.98 | 3.67
FC4 [32] | 2.65 | 2.06 | 0.67 | 5.69 | 1.12 | 3.49
FFCC [9] | 2.44 | 1.50 | 0.40 | 5.87 | 0.75 | 3.19
Ours (200 params) | 2.39 | 1.44 | 0.46 | 5.95 | 0.81 | 2.81
Ours (470 params) | 1.91 | 1.24 | 0.36 | 4.78 | 0.62 | 2.22
Ours (1460 params) | 1.69 | 1.09 | 0.37 | 4.02 | 0.59 | 2.02

Table 2: Angular errors (degrees) on the main camera from our quasi-real NUS [16] dataset. Best results are in bold.

Figure 8: Our data capture setup and representative examples from our dataset. (Left) Our imaging rig with the color chart at a fixed position relative to the camera. (Right) An outdoor scene being imaged using our setup. Also shown are sample image pairs from our real dataset captured with the main camera and the wide-angle camera. Note that while illuminant estimation is performed on the raw-RGB sensor image, we show here the corresponding sRGB images to aid visualization.
4.3. S20 real-image dataset
The final step in our evaluation is to collect and test on
a real dataset of image pairs captured using a two-camera
system. Towards this goal, we examined various recent
smartphones with two rear-facing cameras.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 3.25 | 2.55 | 0.90 | 6.94 | 1.46 | 3.73
SoG [24] | 3.07 | 2.03 | 0.62 | 7.33 | 0.98 | 4.16
GE-1 [46] | 4.79 | 3.91 | 0.94 | 10.72 | 1.49 | 6.35
GE-2 [46] | 5.23 | 3.96 | 0.95 | 11.74 | 1.57 | 7.76
WGE [29] | 6.17 | 4.76 | 0.85 | 14.17 | 1.50 | 9.79
PCA [16] | 4.56 | 3.35 | 0.85 | 10.50 | 1.32 | 6.42
WP [14] | 3.24 | 2.30 | 0.56 | 7.48 | 1.03 | 4.16
Gamut Pixel [28] | 6.81 | 5.62 | 0.97 | 14.20 | 1.61 | 10.29
Gamut Edge [28] | 5.00 | 3.60 | 0.94 | 11.20 | 1.46 | 6.58
CM [18] | 3.51 | 2.64 | 0.66 | 7.64 | 1.25 | 4.83
Homography [19] (SoG) | 3.43 | 2.38 | 0.46 | 7.93 | 1.13 | 4.90
Homography [19] (PCA) | 4.35 | 3.12 | 0.58 | 10.09 | 1.05 | 6.35
APAP [4] (GW) | 4.21 | 2.32 | 0.53 | 11.34 | 0.88 | 4.81
APAP [4] (SoG) | 3.46 | 2.29 | 0.39 | 8.53 | 0.73 | 5.18
APAP [4] (PCA) | 3.96 | 2.77 | 0.47 | 9.14 | 0.88 | 6.06
Linear regression | 2.49 | 1.79 | 0.80 | 4.94 | 1.02 | 3.29
SIIE [3] | 4.71 | 3.37 | 0.99 | 9.98 | 1.55 | 7.50
Quasi U CC [10] | 3.94 | 2.66 | 0.71 | 9.16 | 1.21 | 5.71
Quasi U CC finetuned [10] | 2.55 | 1.55 | 0.56 | 6.15 | 0.84 | 3.03
FC4 [32] | 2.14 | 1.64 | 0.69 | 4.38 | 1.15 | 2.67
FFCC [9] | 2.51 | 2.05 | 0.80 | 4.95 | 1.20 | 3.20
Ours (200 params) | 1.73 | 1.29 | 0.37 | 3.75 | 0.70 | 2.32
Ours (470 params) | 0.94 | 0.69 | 0.17 | 2.14 | 0.31 | 1.24
Ours (1460 params) | 1.08 | 0.71 | 0.16 | 2.57 | 0.27 | 1.47

Table 3: Angular errors (degrees) on the main camera from our S20 two-camera dataset. Best results are in bold.

Our method requires
access to the raw-RGB images from both cameras.
The Samsung S20 Ultra is one smartphone we found that
has the desired camera configuration and allows saving to
the raw format. The S20 Ultra is equipped with a wide-angle
rear-facing camera that provides a larger field of view
than the main camera. The two camera sensors are different:
the main camera is a Samsung HM1 sensor (108 MP,
3×3 Nonacell, 0.8μm pitch), while the second camera is a
Samsung S5K2L3SX sensor (12 MP, 1.4μm pitch). While
we do not have access to the sensors’ CFA spectral sensitivities,
it is easy to verify the CFAs are different by observing
a color checker chart under the same controlled illumination
and plotting the responses. See Fig. S1 of supplemental for
more details on how we validate that the spectral sensitivities
of the two cameras are different. We used image pairs
from the main camera and the wide-angle camera for our
experiments. We developed a simple Android application
with the aid of the Camera2 API [30] to save the raw-DNG
files from both cameras with a single button press. To obtain
the ground truth, a Macbeth color chart was placed in every
scene. For ease of ground truth labelling, we used a custom
rig (see Fig. 8) that allows the color chart to be placed at a
fixed position relative to the camera. This ensures that the
color chart always occupies a fixed spatial location in the
captured images. We collected a total of 156 image pairs,
spanning a diverse range of lighting conditions and scene
content. Some representative examples from our dataset are
shown in Fig. 8. Fig. 5(C) shows a plot of the distribution of the ground truth illuminants for the two cameras. It is clear from the spread in the distribution that our working assumption, that two-camera systems have different spectral profiles, holds true on real data.
As a preprocessing step, the raw-DNG images from our
dataset were demosaiced and the black level was adjusted.
The ground truth illumination was also extracted from the
color chart. Since the field of view is different between the
two cameras, before downsampling, we registered the images
using a fixed pre-computed homography as described
in Section 3.1. We then computed the transformation for each image pair. Data augmentation was performed as before to generate a total of 15,600 image pairs.

Figure 9: Qualitative results from our real dataset. (A) Input raw-RGB image. (B-F) Results of [3, 10, 32, 9], and our method, respectively, after correcting the input images using the estimated illuminants. (G) Result of correcting using the ground truth illuminant. A gamma has been applied to the raw-RGB images in (A) for illustration. The results in the remaining columns have been rendered to the sRGB color space using [2] to aid visualization. Black boxes are used to mask out the color charts.
The results on our real dataset are presented in Table 3.
Three-fold cross validation was used as before, and training
data was augmented for all learning-based methods.
As a baseline comparison, we applied a linear regression
model in place of our trained network. As seen from Table
3, linear regression yields good results, but our small
network performs better because of non-linearities. Results
for [3, 32, 9] were computed in the same manner as in Section
4.2. For [10], we fine-tuned the model for each fold
using the parameters recommended by the authors. It can
be observed that our model with just 200 parameters outperforms
all competitors, including FC4 and FFCC. A few
qualitative results, along with comparisons, are presented
in Fig. 9.

Dataset | Method | Mean | Med | B25% | W25%
Real | Ours w/o | 2.11 | 1.46 | 0.61 | 4.65
Real | Ours | 1.73 | 1.29 | 0.37 | 3.75
NUS | Ours w/o | 4.74 | 2.61 | 0.66 | 12.47
NUS | Ours | 2.39 | 1.44 | 0.46 | 5.95

Table 4: The results of our method with and without (w/o) data augmentation. All models shown have 200 parameters and were trained with a learning rate of $10^{-3}$. Best results are in bold.

Table 4 reports results of training without and
with our data augmentation technique. It is evident from
the results that our augmentation framework improves performance.
5. Conclusion
In this work, we take advantage of the availability of
two rear-facing cameras, commonly used in modern smartphone
design, to perform illumination estimation. Our approach
leverages the differences in the sensor’s spectral profile
between these two cameras. In particular, we trained a
lightweight neural network to estimate the scene illumination
based on a 3×3 linear color transform that maps between
the two cameras’ colors. We demonstrated state-ofthe-
art illuminant estimation performance over contemporary
single-image methods through extensive experiments
on radiometric data, a quasi-real two-camera dataset generated
from an existing single-camera dataset, and a real
dataset that we captured using a two-camera smartphone.
We believe our work may lead to design changes regarding
how current camera devices perform illuminant estimation,
leveraging the ubiquity of multi-camera devices. Our code,
datasets involving radiometric, quasi-real, and real images
from the S20 smartphone, and our trained models will be
publicly released to the community. We hope our findings
will spur further innovation in smartphone imaging through
ideas that leverage multiple cameras.
References
[1] Color constancy: Research website on illuminant estimation. https://colorconstancy.com/sourcecode/index.html. Accessed: 2020-11-01. 5
[2] Abdelrahman Abdelhamed, Stephen Lin, and Michael S.
Brown. A high-quality denoising dataset for smartphone
cameras. In CVPR, 2018. 8
[3] Mahmoud Afifi and Michael S. Brown. Sensor-independent
illumination estimation for DNN models. In BMVC, 2019.
2, 6, 7, 8
[4] Mahmoud Afifi, Abhijith Punnappurath, Graham D. Finlayson,
and Michael S. Brown. As-projective-as-possible
bias correction for illumination estimation algorithms.
JOSA-A, 36(1):71–78, 2019. 2, 6, 7
[5] Nikola Banic and Sven Loncaric. Color dog – Guiding the
global illumination estimation to better accuracy. In 10th International
Conference on Computer Vision Theory and Applications,
2015. 2
[6] Kobus Barnard, Vlad Cardei, and Brian Funt. A comparison
of computational color constancy algorithms. I: Methodology
and experiments with synthesized data. TIP, 11(9):972–
984, 2002. 2, 4
[7] Kobus Barnard, Lindsay Martin, Brian Funt, and Adam
Coath. A data set for color research. Color Research &
Application, 27(3):147–151, 2002. 4, 5
[8] Jonathan T. Barron. Convolutional color constancy. In ICCV,
2015. 2
[9] Jonathan T. Barron and Yun-Ta Tsai. Fast Fourier color constancy.
In CVPR, 2017. 2, 6, 7, 8
[10] Simone Bianco and Claudio Cusano. Quasi-unsupervised
color constancy. In CVPR, 2019. 2, 6, 7, 8
[11] Simone Bianco, Claudio Cusano, and Raimondo Schettini.
Color constancy using CNNs. In CVPR Workshops, 2015. 2
[12] Simone Bianco, Claudio Cusano, and Raimondo Schettini.
Single and multiple illuminant estimation using convolutional
neural networks. TIP, 26(9):4347–4362, 2017. 2
[13] David H. Brainard and William T. Freeman. Bayesian color
constancy. JOSA-A, 14(7):1393–1411, 1997. 2
[14] David H. Brainard and Brian A. Wandell. Analysis of the
retinex theory of color vision. JOSA-A, 3(10):1651–1661,
1986. 2, 5, 6, 7
[15] Gershon Buchsbaum. A spatial processor model for object
colour perception. Journal of the Franklin Institute, 310(1):1
– 26, 1980. 2, 5, 6, 7
[16] Dongliang Cheng, Dilip K. Prasad, and Michael S. Brown.
Illuminant estimation for color constancy: Why spatial-domain methods work and the role of the color distribution.
JOSA-A, 31(5):1049–1058, 2014. 2, 5, 6, 7
[17] Graham Finlayson. Image recording apparatus employing a
single CCD chip to record two digital optical images, 2006.
US Patent 7,046,288. 2
[18] Graham D. Finlayson. Corrected-moment illuminant estimation.
In ICCV, 2013. 2, 6, 7
[19] Graham D. Finlayson. Colour and illumination in computer
vision. Interface Focus, 8, 2018. 2, 6, 7
[20] Graham D. Finlayson, Brian V. Funt, and Kobus Barnard.
Color constancy under varying illumination. In ICCV, 1995.
5
[21] Graham D. Finlayson, Steven D. Hordley, and Peter Morovic.
Chromagenic colour constancy. In 10th Congress of
the International Colour Association, 2005. 2, 3
[22] Graham D. Finlayson, Steven D. Hordley, and Peter Morovic.
Colour constancy using the chromagenic constraint.
In CVPR, 2005. 2, 3
[23] Graham D. Finlayson, Steven D. Hordley, and Ingeborg
Tastl. Gamut constrained illuminant estimation. IJCV,
67(1):93–109, 2006. 2
[24] Graham D. Finlayson and Elisabetta Trezzi. Shades of gray
and colour constancy. In Color Imaging Conference, 2004.
2, 5, 6, 7
[25] David A. Forsyth. A novel algorithm for color constancy.
IJCV, 5:5–35, 1990. 2
[26] Peter V. Gehler, Carsten Rother, Andrew Blake, Tom Minka,
and Toby Sharp. Bayesian color constancy revisited. In
CVPR, 2008. 2
[27] Arjan Gijsenij and Theo Gevers. Color constancy using natural
image statistics and scene semantics. TPAMI, 33(4):687–
698, 2011. 2
[28] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Generalized
gamut mapping using image derivative structures for
color constancy. IJCV, 86:127–139, 2008. 2, 5, 6, 7
[29] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Improving
color constancy by photometric edge weighting.
TPAMI, 34(5):918–929, 2012. 5, 6, 7
[30] Google. Android Camera2 API. https://developer.android.com/reference/android/hardware/camera2/package-summary.html. Accessed: 2017-11-10. 7
[31] Daniel Hernandez-Juarez, Sarah Parisot, Benjamin Busam,
Ales Leonardis, Gregory Slabaugh, and Steven McDonagh.
A multi-hypothesis approach to color constancy. In CVPR,
2020. 2
[32] Yuanming Hu, Baoyuan Wang, and Stephen S Lin.
FC4: Fully convolutional color constancy with confidence-weighted pooling. In CVPR, 2017. 2, 6, 7, 8
[33] Jun Jiang, Dengyu Liu, Jinwei Gu, and Sabine Süsstrunk.
What is the space of spectral sensitivity functions for digital
color cameras? In WACV, 2013. 4, 5
[34] Hamid Reza Vaezi Joze and Mark S. Drew. Exemplar-based color constancy and multiple illumination. TPAMI,
36(5):860–873, 2014. 2
[35] Hamid Reza Vaezi Joze, Mark S. Drew, Graham D. Finlayson,
and Perla Aurora Troncoso Rey. The role of bright
pixels in illumination estimation. In Color Imaging Conference,
2012. 2
[36] Hakki C. Karaimer and Michael S. Brown. A software
platform for manipulating the camera imaging pipeline. In
ECCV, 2016. 2
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. ICLR, 2014. 5
[38] Zhongyu Lou, Theo Gevers, Ninghang Hu, and Marcel P.
Lucassen. Color constancy by deep learning. In BMVC,
2015. 2
[39] Laurence T. Maloney. Physics-based approaches to modeling
surface color perception. Color vision: From genes to
perception, 1999. 1
[40] Simon Niklaus, Xuaner Cecilia Zhang, Jonathan T. Barron,
Neal Wadhwa, Rahul Garg, Feng Liu, and Tianfan
Xue. Learned dual-view reflection removal. arXiv preprint
arXiv:2010.00702, 2020. 1
[41] Seoung Oh and Seon Kim. Approaching the computational
color constancy as a classification problem through deep
learning. Pattern Recognition, 2016. 2
[42] Dilip K. Prasad. Strategies for resolving camera metamers
using 3+1 channel. In CVPR Workshop, 2016. 2
[43] Yanlin Qian, Ke Chen, Jarno Nikkanen, Joni-Kristian Kamarainen,
and Jiri Matas. Recurrent color constancy. In ICCV,
2017. 2
[44] Charles Rosenberg, Alok Ladsariya, and Tom Minka.
Bayesian color constancy with non-gaussian models. In
NeurIPS. 2004. 2
[45] Wu Shi, Chen Change Loy, and Xiaoou Tang. Deep specialized
network for illuminant estimation. In ECCV, 2016.
2
[46] Joost Van De Weijer, Theo Gevers, and Arjan Gijsenij. Edge-based color constancy. TIP, 16(9), 2007. 2, 5, 6, 7
[47] Jin Xiao, Shuhang Gu, and Lei Zhang. Multi-domain learning
for accurate and few-shot color constancy. In CVPR,
2020. 2
[48] Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian
Häne, Sean Fanello, and Rahul Garg. Du2net: Learning
depth estimation from dual-cameras and dual-pixels. arXiv
preprint arXiv:2003.14299, 2020. 1

