Leveraging the Availability of Two Cameras for Illuminant Estimation
Abdelrahman Abdelhamed, Abhijith Punnappurath, Michael S. Brown
Samsung AI Center – Toronto
{a.abdelhamed,abhijith.p,michael.b1}@samsung.com
Abstract
Most modern smartphones are now equipped with two
rear-facing cameras – a main camera for standard imaging
and an additional camera to provide wide-angle or telephoto
zoom capabilities. In this paper, we leverage the
availability of these two cameras for the task of illumination
estimation using a small neural network to perform the illumination
prediction. Specifically, if the two cameras’ sensors
have different spectral sensitivities, the two images provide
different spectral measurements of the physical scene.
A linear 3×3 color transform that maps between these two
observations – and that is unique to a given scene illuminant
– can be used to train a lightweight neural network
comprising no more than 1460 parameters to predict the
scene illumination. We demonstrate that this two-camera
approach with a lightweight network provides results on par
or better than much more complicated illuminant estimation
methods operating on a single image. We validate our
method’s effectiveness through extensive experiments on radiometric
data, a quasi-real two-camera dataset we generated
from an existing single camera dataset, as well as
a new real image dataset that we captured using a smartphone
with two rear-facing cameras.

  1. Introduction
    An overwhelming percentage of consumer photographs
    are currently captured using smartphone cameras. A recent
    trend in smartphone imaging system design is to employ
    two (or more) rear-facing cameras to ameliorate the limitations
    imposed by the smartphone compact form factor.
    In most cases, the two rear-facing cameras have different
    focal lengths and lens configurations to allow the smartphone
to deliver DSLR-like optical capabilities (i.e., wide-angle
and telephoto). In addition, the two-camera setup has
    been leveraged for applications such as synthetic bokeh effect
    [48] and reflection removal [40]. Given the utility of the
    two-camera configuration, this design trend is likely to continue
    for the foreseeable future. In this work, we show that
the two-camera setup has another benefit, that of improving the accuracy of illuminant estimation.

Figure 1: (A) Most modern smartphones use two rear-facing cameras. Typically, the spectral characteristics of these two cameras' sensors are slightly different. (B) Thus, a two-camera system furnishes two different measurements of the scene being imaged. Our proposed two-camera algorithm harnesses this extra information for more accurate and efficient illuminant estimation.
    Illuminant estimation is the most critical step for computational
    color constancy. Color constancy refers to the
    ability of the human visual system to perceive scene colors
    as being the same even when observed under different
    illuminations [39]. Cameras do not innately possess this illumination
    adaptation ability; the raw-RGB image recorded
    by the camera sensor has significant color cast due to the
    scene’s illumination. As a result, computational color constancy
    is applied to the camera’s raw-RGB sensor image as
    one of the first steps in the in-camera imaging pipeline to
    remove this undesirable color cast. The main goal of the
    camera’s auto-white-balance (AWB) module, which is motivated
    by the concept of computational color constancy, is
    illuminant estimation. AWB involves estimating the scene
    illumination in the sensor’s raw-RGB color space and then
    applying a simple 3×3 diagonal matrix computed directly
    from the estimated illumination parameters to perform the
    white-balance correction. Thus, accurate estimation of the
    scene illumination is crucial to ensuring correct scene colors
    in the camera image.
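To make the AWB step concrete, the sketch below shows how such a diagonal correction could be applied once an illuminant estimate is available. This is an illustrative example only; the function name and the green-channel normalization convention are assumptions, not code from this work.

```python
import numpy as np

def apply_white_balance(raw_rgb, illuminant):
    """Apply a diagonal white-balance correction to a raw-RGB image.

    raw_rgb:    H x W x 3 array of linear sensor values in [0, 1].
    illuminant: length-3 estimate of the scene illuminant (R, G, B)
                in the sensor's raw-RGB color space.
    """
    r, g, b = illuminant
    # Scale each channel so the estimated illuminant maps to neutral,
    # normalizing by the green channel (a common convention).
    gains = np.array([g / r, 1.0, g / b])
    return np.clip(raw_rgb * gains, 0.0, 1.0)
```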
    We demonstrate that two-camera systems have the potential
    to provide more accurate illuminant estimation compared
    to existing single-camera methods. A key insight is
that the spectral characteristics of the main camera's sensor
are typically different from those of the second camera's sensor.
    This is due to a variety of reasons. For example, the pitch
    of the photodiodes and overall resolution of the two sensors
    are often different to accommodate the different optics associated
    with each sensor. These differences impact which
    color filter arrays (CFA) manufacturers can use in the sensor’s
    production process. This results in the two CFAs having
    different spectral sensitivities to incoming light. While
    on the surface this may appear to be a disadvantage, differences
    in the CFA between the two cameras can be corrected
    for by the later stages of the camera imaging pipeline to ensure
    the final output colors appear the same (e.g., see [36]).
    However, for our purpose, the sensors’ unprocessed raw images
    effectively provide different spectral measurements of
    the underlying scene. It is this complementary information
    that allows us to design a two-camera illumination estimation
    algorithm as shown in Fig. 1.
    Contribution We propose to train a neural network for illuminant
    estimation that receives as input a 3×3 matrix
    computed between the two cameras’ raw sensor images simultaneously
    capturing the same scene. Prior work [21] has
    shown that the color transformation between different spectral
    samples of the same scene has a unique signature that
    is related to the scene illumination. This allows the color
    transformation itself to be used as the feature for illumination
    estimation. Thus, in contrast to existing single-camera
    illumination estimation methods that train their deep networks
    directly on image data, or on image histograms, our
    network needs to examine only nine parameters in the color
    transformation matrix. As a result, we can train a very
    lightweight neural network comprising just 1460 parameters
    that can be efficiently run on-device in real time. We
    test our proposed approach extensively with experiments
    on radiometric data, a quasi-real two-camera dataset we
    generated from an existing single-camera color constancy
    dataset [16], and finally on a real two-camera dataset that we
    captured using a Samsung S20 Ultra smartphone. We compare
our technique against several state-of-the-art single-image
illuminant estimation methods and demonstrate on
    par or even improved performance.
  2. Related work
    We survey works on computational color constancy.
    These algorithms can be broadly categorized into (1)
    statistics-based and (2) learning-based methods. While
    early learning-based approaches used hand-crafted features,
    more recent works employ deep neural networks.
    Statistics-based methods operate using statistics from an
    image’s color distribution and spatial layout to estimate the
    scene illuminant. Representative examples include gray
    world [15], general gray world [6], gray edges [46], shades
    of gray [24], white patch [14], bright pixels [35], and
    PCA [16]. These methods are fast and easy to implement;
    however, they make very strong assumptions about scene
    content and fail in cases where these assumptions do not
    hold.
    Learning-based methods use labelled training data where
    the ground truth illumination corresponding to each input
    image is known from physical color charts placed in the
scene. In general, learning-based approaches are shown
to be more accurate than statistics-based methods. However,
    learning-based methods usually include many more
    parameters than statistics-based ones; their number could
    reach up to tens of millions in some models (e.g., [10])
    and they typically have relatively longer training time.
    Representative learning-based approaches include Bayesian
    methods [13, 26, 44], gamut-based methods [23, 25, 28],
    exemplar-based methods [5, 27, 34], and bias-correction
    methods [4, 18, 19]. While early learning-based methods
    used hand-crafted features, more recently, deep neural
    networks (DNN) have demonstrated superior performance
    [32, 38, 41, 45, 11, 12, 43, 10, 31, 47, 3, 8, 9]. It
    is important to note that the aforementioned methods are
designed to work with a single image captured using a three-channel
sensor. Work by [42] explored the idea of adding
    an additional color channel, but in the context of resolving
    scene metamerism. The approach in this paper is based on
a pair of images of the same scene captured using a two-camera
system.
    Our approach is inspired by the chromagenic color constancy
    technique of Finlayson et al. [22, 21, 17]. The chromagenic
    approach showed that the parameters of a 3 × 3
    linear transform that relates the color values of a scene captured
    with different spectral sensitivities are correlated with
    the scene’s illumination. The chromagenic approach used
two images captured from the same sensor, but with a color
filter applied for one of the two captures; however, two sensors
    with different spectral sensitivities could also be used.
    Classification of the scene illumination was performed using
    a set of pre-selected illuminants via a nearest-neighbour
    search operation. We build on this method and integrate it
    into a modern smartphone design with two cameras with
    different fields of view. Furthermore, we combine it with
    the power of neural networks to regress over the space of
    illuminations.
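To make the chromagenic idea concrete, the following is an illustrative sketch of nearest-neighbour illuminant selection over a bank of pre-computed transforms for pre-selected illuminants; the array layouts and names are assumptions, not the original implementation.

```python
import numpy as np

def chromagenic_classify(T_query, T_bank, illuminant_bank):
    """Pick the pre-selected illuminant whose stored 3x3 transform is
    closest (Frobenius distance) to the transform observed for the scene.

    T_query:         3 x 3 transform measured for the current scene.
    T_bank:          K x 3 x 3 transforms pre-computed for K known illuminants.
    illuminant_bank: K x 3 corresponding illuminant colors.
    """
    dists = np.linalg.norm(T_bank - T_query, axis=(1, 2))
    return illuminant_bank[np.argmin(dists)]
```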
    Figure 2: An overview of our proposed two-camera illuminant estimation algorithm. We compute a linear 3×3 transform
    matrix T that maps the downsampled raw-RGB image from the main camera to the corresponding aligned and downsampled
    raw-RGB image from the second camera. For a particular scene illuminant, this color transformation T is unique [21]. We
    feed this mapping T as input to a small lightweight neural network. The network predicts a 2D [R/G B/G] chromaticity value
    that corresponds to the illuminant estimate of the main camera.
  3. Two-camera illuminant estimation
In this section, we describe the various steps of our two-camera
illuminant estimation algorithm – spatially aligning
    the image pairs (Section 3.1); computing color transforms
    between them (Section 3.2); constructing our two-camera
    illuminant estimation network (Section 3.3); and augmenting
    our training data (Section 3.4).
    3.1. Image spatial alignment
    Our method is based on computing a color transform between
    a pair of images captured using a two-camera system.
    These two images usually have different views and need to
    be registered before computing the color transform. In our
    experiments on real images, we found that a global homography
    is sufficient for image alignment. We downsample
    the images by a factor of six prior to computing the color
    transform, and this makes our method robust to any small
    misalignments and slight parallax in the two views. Moreover,
    since the hardware arrangement of the two cameras
    does not change for a given device, the homography can be
    pre-computed and remains fixed for all image pairs from the
    same device.
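An illustrative sketch of this alignment and downsampling step is shown below, assuming OpenCV and a homography calibrated once per device; the function and parameter names are ours for illustration.

```python
import cv2

def align_and_downsample(img1, img2, H_2_to_1, scale=1.0 / 6.0):
    """Warp the second camera's image onto the first using a pre-computed
    homography, then downsample both images before fitting the color transform.

    img1, img2: raw-RGB images (H x W x 3).
    H_2_to_1:   3x3 homography mapping camera-2 pixels to camera-1 pixels.
    scale:      downsampling factor (a factor of six is used in Section 3.1).
    """
    h, w = img1.shape[:2]
    img2_warped = cv2.warpPerspective(img2, H_2_to_1, (w, h))
    small1 = cv2.resize(img1, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    small2 = cv2.resize(img2_warped, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return small1, small2
```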
    3.2. Color transforms for image pairs
Given two raw-RGB images $I_1 \in \mathbb{R}^{n \times 3}$ and $I_2 \in \mathbb{R}^{n \times 3}$ with $n$ pixels of the same scene captured by two different sensors or cameras, under the same illumination $L \in \mathbb{R}^3$, there exists a linear transformation $T \in \mathbb{R}^{3 \times 3}$ between the color values of the two images as

$$I_2 = I_1 T, \qquad (1)$$

such that $T$ is unique to the scene illumination $L$ [22, 21]. Despite Equation 1 being an approximation, for simplicity, we will use the equality sign instead. We first spatially align the two images using the pre-computed homography, downsample them, and then compute $T$ using the pseudo-inverse as follows:

$$T = \left(I_1^\top I_1\right)^{-1} I_1^\top I_2. \qquad (2)$$
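Equation 2 is an ordinary least-squares fit of $T$; an illustrative NumPy sketch (not the released code) is:

```python
import numpy as np

def compute_color_transform(I1, I2):
    """Fit the 3x3 transform T such that I2 ~= I1 @ T (Equation 2).

    I1, I2: n x 3 arrays of corresponding raw-RGB values from the two
            aligned, downsampled images.
    """
    # lstsq solves the same normal equations as (I1^T I1)^(-1) I1^T I2,
    # but is numerically more stable than forming the inverse explicitly.
    T, *_ = np.linalg.lstsq(I1, I2, rcond=None)
    return T  # 3 x 3
```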
3.3. Two-camera illuminant estimation network
Given a dataset of $M$ image pairs

$$\mathcal{I} = \{(I_{1_1}, I_{2_1}), \ldots, (I_{1_M}, I_{2_M})\}, \qquad (3)$$

we compute the corresponding color transformations between each pair of images using Equation 2:

$$\mathcal{T} = \{T_1, \ldots, T_M\}. \qquad (4)$$

Given the set of corresponding target ground truth illuminants of $I_{1_i}$ (i.e., as measured by the first camera) from each pair

$$\mathcal{L} = \{L_1, \ldots, L_M\}, \qquad (5)$$

we can train a neural network $f_\theta : \mathcal{T} \rightarrow \mathcal{L}$, with parameters $\theta$, to model the mapping between the color transforms $\mathcal{T}$ and scene illuminations $\mathcal{L}$. Then, $f_\theta$ can be used to predict the scene illumination for the main camera given the color transform between the two images:

$$\hat{L} = f_\theta(T). \qquad (6)$$
    Without loss of generality, our method can be trained to
    predict the illuminant for the second camera as well, using
    the same color transforms; however, for simplicity, we focus
    on estimating the illumination for the main camera only.
    We train our network by minimizing the L1 loss between
    the predicted illuminants and the ground truth:
$$\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M} \left\| \hat{L}_i - L_i \right\|_1 . \qquad (7)$$
Our network of choice is lightweight, consisting of a
small number (e.g., 2, 5, or 16) of dense layers; each layer
has nine neurons only. The total number of parameters
ranges from 200 for the 2-layer architecture up to 1460 parameters for the 16-layer network. The input to the network is the flattened nine values of the color transform T and the output is two values corresponding to the illumination estimate in the 2D [R/G B/G] chromaticity color space, where the green channel's value is always set to 1. An overview of our method is provided in Fig. 2.

Figure 3: Our image illumination augmentation method. Given a pair of images, we re-illuminate them by each other's illumination based on 3×3 color transformations between their color chart values. This figure shows augmentation of an image pair from one camera only. The corresponding pair of images from the second camera is augmented in the same way. The images shown are in demosaiced raw-RGB format with gamma correction for better visualization.
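As a concrete illustration of this architecture, the sketch below builds the 9-input, 2-output fully connected network in PyTorch. It is an assumption-laden example: the ReLU activations and training-loop details are not specified above, and the class and variable names are ours.

```python
import torch
import torch.nn as nn

class TwoCameraIlluminantNet(nn.Module):
    """Tiny MLP: nine transform entries in, 2D [R/G, B/G] illuminant out."""

    def __init__(self, n_hidden=2):
        super().__init__()
        layers = []
        for _ in range(n_hidden):               # 2, 5, or 16 hidden layers of 9 units each
            layers += [nn.Linear(9, 9), nn.ReLU()]
        layers.append(nn.Linear(9, 2))          # output: [R/G, B/G] chromaticity
        self.net = nn.Sequential(*layers)

    def forward(self, t_flat):                  # t_flat: (batch, 9) flattened transform T
        return self.net(t_flat)

model = TwoCameraIlluminantNet(n_hidden=2)
# 2 hidden layers -> 200 parameters; 5 -> 470; 16 -> 1460.
assert sum(p.numel() for p in model.parameters()) == 200

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()                           # Equation 7
```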
3.4. Data augmentation
Due to the lack of large datasets of image pairs captured
with two cameras under the same illumination, and to increase
the number of training samples and the generalizability
of our model, we propose to augment the training
images as follows. Given a small dataset of raw-RGB image
pairs captured with two cameras and including color
rendition charts, we extract the color values of the 24 color
chart patches, $C \in \mathbb{R}^{24 \times 3}$, from each image. Then, we compute an accurate color transformation, $T_C \in \mathbb{R}^{3 \times 3}$, between each pair of images from the main camera $(I_{1_i}, I_{1_j})$ based only on the color chart values from the two images as

$$T_C^{1_i \rightarrow 1_j} = \left(I_{1_i}^\top I_{1_i}\right)^{-1} I_{1_i}^\top I_{1_j}, \qquad (8)$$

and similarly, for image pairs from the second camera $(I_{2_i}, I_{2_j})$, as

$$T_C^{2_i \rightarrow 2_j} = \left(I_{2_i}^\top I_{2_i}\right)^{-1} I_{2_i}^\top I_{2_j}. \qquad (9)$$
Figure 4: Three samples from our radiometric dataset. Each
pair shows images from the main (left) and second (right)
cameras. The ground truth illumination colors are presented
in the bottom row.
Next, we use this bank of color transformations to augment our images by re-illuminating any given pair of images from the two cameras $(I_{1_i}, I_{2_i})$ to match their colors to any target pair of images $(I_{1_j}, I_{2_j})$, as follows:

$$I_{1_{i \rightarrow j}} = I_{1_i} \, T_C^{1_i \rightarrow 1_j}, \qquad (10)$$

$$I_{2_{i \rightarrow j}} = I_{2_i} \, T_C^{2_i \rightarrow 2_j}, \qquad (11)$$

where $i \rightarrow j$ means re-illuminating image $i$ to match the colors of image $j$. Using this illuminant augmentation method, we can increase the number of training image pairs from $M$ to $M^2$. Fig. 3 illustrates an example of re-illuminating a pair of images given another target pair of images.
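An illustrative sketch of this augmentation step is given below, assuming the 24 chart-patch values and the images are available as NumPy arrays; the helper names are ours.

```python
import numpy as np

def chart_transform(chart_i, chart_j):
    """3x3 transform fitted on the 24 color-chart patches (Equations 8-9)."""
    T, *_ = np.linalg.lstsq(chart_i, chart_j, rcond=None)
    return T

def reilluminate_pair(img1_i, img2_i, T1_ij, T2_ij):
    """Re-illuminate pair i so its colors match target pair j (Equations 10-11).

    img1_i, img2_i: H x W x 3 raw-RGB images from cameras 1 and 2.
    T1_ij, T2_ij:   chart-based transforms mapping pair i's colors to pair j's.
    """
    img1_aug = (img1_i.reshape(-1, 3) @ T1_ij).reshape(img1_i.shape)
    img2_aug = (img2_i.reshape(-1, 3) @ T2_ij).reshape(img2_i.shape)
    return np.clip(img1_aug, 0, 1), np.clip(img2_aug, 0, 1)
```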
4. Experiments
To train our two-camera illuminant estimation network,
we need a dataset of image pairs of the same scene captured
with two different cameras under the same illumination.
To our knowledge, there are no publicly available
image datasets for color constancy captured using a two-camera
system containing labelled ground truth illumination.
To validate our method, we first present a synthetic
radiometric dataset in Section 4.1. Next, in Section 4.2, we
describe how to generate a quasi-real two-camera dataset
from an existing single-camera color constancy dataset. Finally,
we evaluate our method on a real two-camera image
dataset that we captured using a Samsung S20 Ultra smartphone,
in Section 4.3.
4.1. Radiometric dataset
To evaluate our method, we generate a synthetic dataset
from radiometric data. According to the image formation
model, the sensor response is the product of the scene illumination,
the surface reflectance, and the sensor’s spectral
sensitivity, integrated over the visible spectrum. For
data generation, we adopt the experimental procedure proposed
in [6]. In particular, a scene illuminant and a random
set of surface reflectances are selected from a hyperspectral
dataset of lights and surfaces [7]. Two different camera sensors
with different spectral sensitivity functions are chosen
from the camera spectral sensitivity dataset of [33]. The
RGB responses for both sensors can then be calculated by simple numerical integration. The response induced by a pure reflector is treated as the corresponding ground truth. The advantage of this procedure is that it is easy to generate a large amount of labelled data to evaluate color constancy algorithms, and arrive at statistically meaningful performance measures.

Figure 5: Plots of ground truth illuminants (in the [R/G, B/G] chromaticity plane) for the two cameras for our (A) radiometric dataset, (B) quasi-real NUS dataset, and (C) real dataset.
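An illustrative sketch of this synthesis procedure, with assumed array shapes for the sampled spectral data:

```python
import numpy as np

def synthesize_responses(illuminant_spd, reflectances, sensitivity):
    """Raw-RGB responses from spectral data via numerical integration.

    illuminant_spd: (W,)   illuminant power at W sampled wavelengths.
    reflectances:   (S, W) reflectance spectra of S surfaces.
    sensitivity:    (W, 3) R, G, B spectral sensitivities of one sensor.

    Returns (S, 3) raw-RGB responses and the (3,) response of a pure
    reflector, which serves as the ground-truth illuminant color.
    """
    radiance = reflectances * illuminant_spd      # light reflected by each surface
    rgb = radiance @ sensitivity                  # Riemann-sum integration over wavelength
    ground_truth = illuminant_spd @ sensitivity   # pure (unit) reflector response
    return rgb, ground_truth
```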
The reflectance set of [7] consists of 1995 hyperspectral
surface reflectance measurements of various natural objects,
color charts, and so forth. The dataset of [7] also contains
87 different measured or synthesized illuminant spectra.
The camera spectral sensitivity dataset of [33] contains
the spectral sensitivity functions for 28 cameras, including
mobile phone cameras. We select two sensors from this set
to serve as our main camera and the second camera. To generate
images, we choose a scene illumination and 24 different
surfaces at random, and synthesize the raw-RGB sensor
responses for both cameras. We generate thumbnail images
of size 32×48 pixels. A few representative examples with
associated ground truth are shown in Fig. 4. In total, we
generate 18,000 pairs; 10,800 (60%) for training, and 3,600 (20%) each for validation and testing.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 4.09 | 3.68 | 1.36 | 7.51 | 2.21 | 5.56
SoG [24] | 4.56 | 4.11 | 1.51 | 8.41 | 2.43 | 6.21
GE-1 [46] | 5.20 | 4.64 | 1.65 | 9.62 | 2.76 | 7.18
GE-2 [46] | 5.41 | 4.69 | 1.72 | 10.25 | 2.83 | 7.37
WGE [29] | 4.14 | 3.25 | 1.13 | 8.72 | 1.82 | 5.41
PCA [16] | 4.55 | 3.09 | 1.03 | 10.67 | 1.68 | 5.85
WP [14] | 5.49 | 4.96 | 1.83 | 10.02 | 2.94 | 7.48
Gamut Pixel [28] | 3.68 | 3.07 | 1.05 | 7.30 | 1.70 | 5.10
Gamut Edge [28] | 6.09 | 5.34 | 1.95 | 11.49 | 3.15 | 8.42
Ours (200 params) | 2.80 | 2.20 | 0.72 | 5.87 | 1.19 | 3.81
Ours (470 params) | 2.65 | 2.00 | 0.64 | 5.72 | 1.07 | 3.61

Table 1: Angular errors (degrees) on our radiometric dataset. B and W stand for best and worst, while Q1 and Q3 denote the first and third quartile, respectively. Best results are in bold.

A plot of the distribution
of ground truth illuminants corresponding to the two
cameras for 200 random samples is shown in Fig. 5(A). It is
evident from the separation between the scatter points corresponding
to the two cameras that the same illumination
induces very different raw responses in the two sensors owing
to the difference in their spectral sensitivity functions.
For this experiment, we skip the alignment and downsampling
steps of Section 3.1 since there is no misalignment,
and compute our color transform for each pair from
the 24 correspondences. We also omit the data augmentation
procedure described in Section 3.4 since we have sufficient
training examples. We use the Adam [37] optimizer with a learning rate of $10^{-4}$. We train our network for 1 million epochs. The training process takes about 10 hours on a 32 GB NVIDIA Tesla V100 GPU. Table 1 reports statistics
of the angular errors [20] obtained by our method, along
with comparisons. The results of comparison methods were
computed using open source codes downloaded from [1]
or from the authors’ webpages. Note that all comparison
algorithms are single-image methods, and therefore were
given the image from the main camera alone as input. For
this experiment, we omit comparisons against deep learning
methods since they are typically trained on natural images,
whereas our images resemble color checker patches
only. From Table 1, we can observe that our method performs
better than well-established single-image illuminant
estimation methods. Note that although we show illuminant
estimation results only for the main camera for simplicity of
comparison, our method, without loss of generality, can be
used to predict the scene illuminant for the other camera.
Please see the supplementary material for more details.
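For reference, the angular error [20] reported in the tables can be computed as follows; lifting a predicted [R/G, B/G] chromaticity back to an RGB vector with G = 1 is shown as well (an illustrative sketch, not code from this work):

```python
import numpy as np

def chroma_to_rgb(rg_bg):
    """Lift a predicted [R/G, B/G] chromaticity to an RGB vector with G = 1."""
    r_over_g, b_over_g = rg_bg
    return np.array([r_over_g, 1.0, b_over_g])

def angular_error_deg(est_rgb, gt_rgb):
    """Recovery angular error (degrees) between two illuminant vectors."""
    est = np.asarray(est_rgb, dtype=float)
    gt = np.asarray(gt_rgb, dtype=float)
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```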
4.2. Quasi-real NUS dataset
In this section, we go a step further beyond synthetic data
towards a more real dataset. In particular, we describe a
procedure to generate a quasi-real two-camera dataset from
an existing single-camera color constancy dataset. Towards
this goal, we select the NUS [16] dataset, which has images
of the same scene mostly under the same illumination captured using different cameras.

Figure 6: Sample matched pairs from the NUS dataset that we use to generate our quasi-real dataset.

Figure 7: Our method for generating spatially aligned two-camera image pairs from the NUS dataset. A color transform is used to map the image's colors from one camera to the other.

We choose the Nikon D5200
as the main camera while the Canon 1Ds Mark III serves as
the second camera. We select only those images where the
two cameras are observing the same scene with no visible
changes in the illumination. After filtering, we obtain 195
matched pairs from the two cameras. All images in the NUS
dataset have a Macbeth color chart placed in the scene. The
ground truth scene illumination can be obtained from the
achromatic patches in the color chart. A plot of the ground
truth illuminants for the 195 images from the two cameras
is shown in the plot of Fig. 5(B), and it can be observed that
the two sensors record different measurements for the same
illumination. A few representative examples of matched image
pairs from the two cameras are shown in Fig. 6. Notice
that some pairs have a significant change in viewpoint although
the scene is the same. Therefore, we preprocess the
data to generate our quasi-real dataset, as described next.
For each image pair (I1i, I2i) from the two cameras, we
first compute an accurate 3×3 transform $T_M^{1_i \rightarrow 2_i}$ that maps
the raw-RGB image from the main camera to the second
camera using only the 24 correspondences from the color
checker patches. Next, we apply this color transform on the
main camera image to synthesize a new second camera image.
This procedure is shown in Fig. 7. These two spatially
aligned images constitute a pair in our quasi-real dataset.
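An illustrative sketch of this pair-generation step (variable names are assumptions):

```python
import numpy as np

def make_quasi_real_pair(img_main, chart_main, chart_second):
    """Synthesize a spatially aligned second-camera image from a main-camera image.

    img_main:     H x W x 3 raw-RGB image from the main camera (Nikon D5200).
    chart_main:   24 x 3 color-chart patch values from the main-camera image.
    chart_second: 24 x 3 chart values from the matched second-camera image
                  (Canon 1Ds Mark III).
    """
    # Fit the 3x3 chart-based transform T_M (main camera -> second camera).
    T_M, *_ = np.linalg.lstsq(chart_main, chart_second, rcond=None)
    img_second = np.clip(img_main.reshape(-1, 3) @ T_M, 0, None).reshape(img_main.shape)
    return img_main, img_second
```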
We use a standard three-fold cross validation protocol
to evaluate performance. For learning-based methods, including
our own approach, we augment the training folds
using the procedure described in Section 3.4. Testing is
performed on the original unaugmented set. In particular,
for each image pair, we generate another 99 randomly re-illuminated image pairs to obtain a total of 19,500 pairs. The
color chart is then masked out in all training, validation, and
testing images. The results of our method, along with comparisons,
are presented in Table 2. In addition to several
classical methods, we also test against the recent learning
approaches of [3, 10, 32, 9]. For all four learning methods,
publicly available implementations provided by the authors
were used to report results. The method of [3] is sensor-independent, and does not require re-training. The quasi-unsupervised color constancy algorithm of [10], while inherently
sensor-agnostic, can be fine-tuned if annotated training
data is available. In Table 2, we report results both without
and with fine-tuning, using the pre-trained models made
available by the authors. For the fine-tuned result, we selected
the appropriate pre-trained model for testing based on
the three-fold partitioning indices of the NUS dataset used
by the authors. For FC4 [32], we trained the model from
scratch using the hyperparameters recommended by the authors. For FFCC [9], the hyperparameters were carefully tuned to achieve the best performance. In the literature, FC4 and FFCC are currently the best-performing methods across all color constancy datasets, including NUS. It can be observed that our model with 1460 parameters outperforms both FC4 and FFCC, as well as other competitors.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 4.43 | 3.42 | 0.90 | 9.82 | 1.54 | 6.11
SoG [24] | 3.31 | 2.63 | 0.70 | 7.20 | 1.18 | 4.17
GE-1 [46] | 4.49 | 3.03 | 0.87 | 10.38 | 1.40 | 6.34
GE-2 [46] | 4.99 | 3.28 | 0.94 | 11.83 | 1.54 | 6.65
WGE [29] | 5.77 | 3.11 | 0.77 | 14.75 | 1.38 | 7.89
PCA [16] | 4.01 | 2.68 | 0.69 | 9.20 | 1.22 | 6.07
WP [14] | 4.49 | 3.47 | 0.93 | 9.99 | 1.42 | 6.09
Gamut Pixel [28] | 5.99 | 3.70 | 0.90 | 14.95 | 1.41 | 8.65
Gamut Edge [28] | 4.99 | 3.38 | 0.85 | 11.63 | 1.72 | 7.22
CM [18] | 2.80 | 2.09 | 0.66 | 6.12 | 1.21 | 3.67
Homography [19] (SoG) | 2.70 | 1.95 | 0.69 | 5.88 | 1.06 | 3.71
Homography [19] (PCA) | 2.97 | 2.16 | 0.72 | 6.47 | 1.14 | 4.22
APAP [4] (GW) | 2.64 | 2.00 | 0.60 | 5.99 | 1.02 | 3.26
APAP [4] (SoG) | 2.49 | 1.75 | 0.60 | 5.61 | 0.88 | 3.14
APAP [4] (PCA) | 2.77 | 1.83 | 0.60 | 6.45 | 0.94 | 3.49
SIIE [3] | 2.04 | 1.55 | 0.51 | 4.41 | 0.80 | 2.80
Quasi U CC [10] | 3.57 | 2.77 | 0.62 | 8.04 | 1.09 | 5.06
Quasi U CC finetuned [10] | 2.68 | 1.72 | 0.57 | 6.25 | 0.98 | 3.67
FC4 [32] | 2.65 | 2.06 | 0.67 | 5.69 | 1.12 | 3.49
FFCC [9] | 2.44 | 1.50 | 0.40 | 5.87 | 0.75 | 3.19
Ours (200 params) | 2.39 | 1.44 | 0.46 | 5.95 | 0.81 | 2.81
Ours (470 params) | 1.91 | 1.24 | 0.36 | 4.78 | 0.62 | 2.22
Ours (1460 params) | 1.69 | 1.09 | 0.37 | 4.02 | 0.59 | 2.02

Table 2: Angular errors (degrees) on the main camera from our quasi-real NUS [16] dataset. Best results are in bold.

Figure 8: Our data capture setup and representative examples from our dataset. (Left) Our imaging rig with the color chart at a fixed position relative to the camera. (Right) An outdoor scene being imaged using our setup. Also shown are sample image pairs from our real dataset captured with the main camera and the wide-angle camera. Note that while illuminant estimation is performed on the raw-RGB sensor image, we show here the corresponding sRGB images to aid visualization.
4.3. S20 real-image dataset
The final step in our evaluation is to collect and test on
a real dataset of image pairs captured using a two-camera
system. Towards this goal, we examined various recent
smartphones with two rear-facing cameras.

Method | Mean | Med | B25% | W25% | Q1 | Q3
GW [15] | 3.25 | 2.55 | 0.90 | 6.94 | 1.46 | 3.73
SoG [24] | 3.07 | 2.03 | 0.62 | 7.33 | 0.98 | 4.16
GE-1 [46] | 4.79 | 3.91 | 0.94 | 10.72 | 1.49 | 6.35
GE-2 [46] | 5.23 | 3.96 | 0.95 | 11.74 | 1.57 | 7.76
WGE [29] | 6.17 | 4.76 | 0.85 | 14.17 | 1.50 | 9.79
PCA [16] | 4.56 | 3.35 | 0.85 | 10.50 | 1.32 | 6.42
WP [14] | 3.24 | 2.30 | 0.56 | 7.48 | 1.03 | 4.16
Gamut Pixel [28] | 6.81 | 5.62 | 0.97 | 14.20 | 1.61 | 10.29
Gamut Edge [28] | 5.00 | 3.60 | 0.94 | 11.20 | 1.46 | 6.58
CM [18] | 3.51 | 2.64 | 0.66 | 7.64 | 1.25 | 4.83
Homography [19] (SoG) | 3.43 | 2.38 | 0.46 | 7.93 | 1.13 | 4.90
Homography [19] (PCA) | 4.35 | 3.12 | 0.58 | 10.09 | 1.05 | 6.35
APAP [4] (GW) | 4.21 | 2.32 | 0.53 | 11.34 | 0.88 | 4.81
APAP [4] (SoG) | 3.46 | 2.29 | 0.39 | 8.53 | 0.73 | 5.18
APAP [4] (PCA) | 3.96 | 2.77 | 0.47 | 9.14 | 0.88 | 6.06
Linear regression | 2.49 | 1.79 | 0.80 | 4.94 | 1.02 | 3.29
SIIE [3] | 4.71 | 3.37 | 0.99 | 9.98 | 1.55 | 7.50
Quasi U CC [10] | 3.94 | 2.66 | 0.71 | 9.16 | 1.21 | 5.71
Quasi U CC finetuned [10] | 2.55 | 1.55 | 0.56 | 6.15 | 0.84 | 3.03
FC4 [32] | 2.14 | 1.64 | 0.69 | 4.38 | 1.15 | 2.67
FFCC [9] | 2.51 | 2.05 | 0.80 | 4.95 | 1.20 | 3.20
Ours (200 params) | 1.73 | 1.29 | 0.37 | 3.75 | 0.70 | 2.32
Ours (470 params) | 0.94 | 0.69 | 0.17 | 2.14 | 0.31 | 1.24
Ours (1460 params) | 1.08 | 0.71 | 0.16 | 2.57 | 0.27 | 1.47

Table 3: Angular errors (degrees) on the main camera from our S20 two-camera dataset. Best results are in bold.

Our method requires
access to the raw-RGB images from both cameras.
The Samsung S20 Ultra is one smartphone we found that
has the desired camera configuration and allows saving to
the raw format. The S20 Ultra is equipped with a wide-angle
rear-facing camera that provides a larger field of view
than the main camera. The two camera sensors are different:
the main camera is a Samsung HM1 sensor (108 MP,
3×3 Nonacell, 0.8μm pitch), while the second camera is a
Samsung S5K2L3SX sensor (12 MP, 1.4μm pitch). While
we do not have access to the sensors’ CFA spectral sensitivities,
it is easy to verify the CFAs are different by observing
a color checker chart under the same controlled illumination
and plotting the responses. See Fig. S1 of supplemental for
more details on how we validate that the spectral sensitivities
of the two cameras are different. We used image pairs
from the main camera and the wide-angle camera for our
experiments. We developed a simple Android application
with the aid of the Camera2 API [30] to save the raw-DNG
files from both cameras with a single button press. To obtain
the ground truth, a Macbeth color chart was placed in every
scene. For ease of ground truth labelling, we used a custom
rig (see Fig. 8) that allows the color chart to be placed at a
fixed position relative to the camera. This ensures that the
color chart always occupies a fixed spatial location in the
captured images. We collected a total of 156 image pairs,
spanning a diverse range of lighting conditions and scene
content. Some representative examples from our dataset are
shown in Fig. 8. Fig. 5(C) shows a plot of the distribution of the ground truth illuminants for the two cameras. It is clear from the spread in the distribution that our working assumption, that two-camera systems have different spectral profiles, holds true on real data.
As a preprocessing step, the raw-DNG images from our
dataset were demosaiced and the black level was adjusted.
The ground truth illumination was also extracted from the
color chart. Since the field of view is different between the
two cameras, before downsampling, we registered the images
using a fixed pre-computed homography as described
in Section 3.1. We then computed the transformation for each image pair. Data augmentation was performed as before to generate a total of 15,600 image pairs.

Figure 9: Qualitative results from our real dataset. (A) Input raw-RGB image. (B-F) Results of [3, 10, 32, 9], and our method, respectively, after correcting the input images using the estimated illuminants. (G) Result of correcting using the ground truth illuminant. A gamma has been applied to the raw-RGB images in (A) for illustration. The results in the remaining columns have been rendered to the sRGB color space using [2] to aid visualization. Black boxes are used to mask out the color charts.
The results on our real dataset are presented in Table 3.
Three-fold cross validation was used as before, and training
data was augmented for all learning-based methods.
As a baseline comparison, we applied a linear regression
model in place of our trained network. As seen from Table
3, linear regression yields good results, but our small
network performs better because of non-linearities. Results
for [3, 32, 9] were computed in the same manner as in Section
4.2. For [10], we fine-tuned the model for each fold
using the parameters recommended by the authors. It can
be observed that our model with just 200 parameters outperforms
all competitors, including FC4 and FFCC. A few
qualitative results, along with comparisons, are presented
in Fig. 9.

Dataset | Method | Mean | Med | B25% | W25%
Real | Ours w/o | 2.11 | 1.46 | 0.61 | 4.65
Real | Ours | 1.73 | 1.29 | 0.37 | 3.75
NUS | Ours w/o | 4.74 | 2.61 | 0.66 | 12.47
NUS | Ours | 2.39 | 1.44 | 0.46 | 5.95

Table 4: The results of our method with and without (w/o) data augmentation. All models shown have 200 parameters and were trained with a learning rate of $10^{-3}$. Best results are in bold.

Table 4 reports results of training without and
with our data augmentation technique. It is evident from
the results that our augmentation framework improves performance.
5. Conclusion
In this work, we take advantage of the availability of
two rear-facing cameras, commonly used in modern smartphone
design, to perform illumination estimation. Our approach
leverages the differences in the sensor’s spectral profile
between these two cameras. In particular, we trained a
lightweight neural network to estimate the scene illumination
based on a 3×3 linear color transform that maps between
the two cameras’ colors. We demonstrated state-ofthe-
art illuminant estimation performance over contemporary
single-image methods through extensive experiments
on radiometric data, a quasi-real two-camera dataset generated
from an existing single-camera dataset, and a real
dataset that we captured using a two-camera smartphone.
We believe our work may lead to design changes regarding
how current camera devices perform illuminant estimation,
leveraging the ubiquity of multi-camera devices. Our code,
datasets involving radiometric, quasi-real, and real images
from the S20 smartphone, and our trained models will be
publicly released to the community. We hope our findings
will spur further innovation in smartphone imaging through
ideas that leverage multiple cameras.
References
[1] Color constancy: Research website on illuminant estimation. https://colorconstancy.com/sourcecode/index.html. Accessed: 2020-11-01. 5
[2] Abdelrahman Abdelhamed, Stephen Lin, and Michael S.
Brown. A high-quality denoising dataset for smartphone
cameras. In CVPR, 2018. 8
[3] Mahmoud Afifi and Michael S. Brown. Sensor-independent
illumination estimation for DNN models. In BMVC, 2019.
2, 6, 7, 8
[4] Mahmoud Afifi, Abhijith Punnappurath, Graham D. Finlayson,
and Michael S. Brown. As-projective-as-possible
bias correction for illumination estimation algorithms.
JOSA-A, 36(1):71–78, 2019. 2, 6, 7
[5] Nikola Banic and Sven Loncaric. Color dog – Guiding the
global illumination estimation to better accuracy. In 10th International
Conference on Computer Vision Theory and Applications,
2015. 2
[6] Kobus Barnard, Vlad Cardei, and Brian Funt. A comparison
of computational color constancy algorithms. I: Methodology
and experiments with synthesized data. TIP, 11(9):972–
984, 2002. 2, 4
[7] Kobus Barnard, Lindsay Martin, Brian Funt, and Adam
Coath. A data set for color research. Color Research &
Application, 27(3):147–151, 2002. 4, 5
[8] Jonathan T. Barron. Convolutional color constancy. In ICCV,
2015. 2
[9] Jonathan T. Barron and Yun-Ta Tsai. Fast Fourier color constancy.
In CVPR, 2017. 2, 6, 7, 8
[10] Simone Bianco and Claudio Cusano. Quasi-unsupervised
color constancy. In CVPR, 2019. 2, 6, 7, 8
[11] Simone Bianco, Claudio Cusano, and Raimondo Schettini.
Color constancy using CNNs. In CVPR Workshops, 2015. 2
[12] Simone Bianco, Claudio Cusano, and Raimondo Schettini.
Single and multiple illuminant estimation using convolutional
neural networks. TIP, 26(9):4347–4362, 2017. 2
[13] David H. Brainard and William T. Freeman. Bayesian color
constancy. JOSA-A, 14(7):1393–1411, 1997. 2
[14] David H. Brainard and Brian A. Wandell. Analysis of the
retinex theory of color vision. JOSA-A, 3(10):1651–1661,
1986. 2, 5, 6, 7
[15] Gershon Buchsbaum. A spatial processor model for object
colour perception. Journal of the Franklin Institute, 310(1):1
– 26, 1980. 2, 5, 6, 7
[16] Dongliang Cheng, Dilip K. Prasad, and Michael S. Brown.
Illuminant estimation for color constancy: Why spatial-domain methods work and the role of the color distribution.
JOSA-A, 31(5):1049–1058, 2014. 2, 5, 6, 7
[17] Graham Finlayson. Image recording apparatus employing a
single CCD chip to record two digital optical images, 2006.
US Patent 7,046,288. 2
[18] Graham D. Finlayson. Corrected-moment illuminant estimation.
In ICCV, 2013. 2, 6, 7
[19] Graham D. Finlayson. Colour and illumination in computer
vision. Interface Focus, 8, 2018. 2, 6, 7
[20] Graham D. Finlayson, Brian V. Funt, and Kobus Barnard.
Color constancy under varying illumination. In ICCV, 1995.
5
[21] Graham D. Finlayson, Steven D. Hordley, and Peter Morovic.
Chromagenic colour constancy. In 10th Congress of
the International Colour Association, 2005. 2, 3
[22] Graham D. Finlayson, Steven D. Hordley, and Peter Morovic.
Colour constancy using the chromagenic constraint.
In CVPR, 2005. 2, 3
[23] Graham D. Finlayson, Steven D. Hordley, and Ingeborg
Tastl. Gamut constrained illuminant estimation. IJCV,
67(1):93–109, 2006. 2
[24] Graham D. Finlayson and Elisabetta Trezzi. Shades of gray
and colour constancy. In Color Imaging Conference, 2004.
2, 5, 6, 7
[25] David A. Forsyth. A novel algorithm for color constancy.
IJCV, 5:5–35, 1990. 2
[26] Peter V. Gehler, Carsten Rother, Andrew Blake, Tom Minka,
and Toby Sharp. Bayesian color constancy revisited. In
CVPR, 2008. 2
[27] Arjan Gijsenij and Theo Gevers. Color constancy using natural
image statistics and scene semantics. TPAMI, 33(4):687–
698, 2011. 2
[28] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Generalized
gamut mapping using image derivative structures for
color constancy. IJCV, 86:127–139, 2008. 2, 5, 6, 7
[29] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Improving
color constancy by photometric edge weighting.
TPAMI, 34(5):918–929, 2012. 5, 6, 7
[30] Google. Android Camera2 API. https://developer.android.com/reference/android/hardware/camera2/package-summary.html. Accessed: 2017-11-10. 7
[31] Daniel Hernandez-Juarez, Sarah Parisot, Benjamin Busam,
Ales Leonardis, Gregory Slabaugh, and Steven McDonagh.
A multi-hypothesis approach to color constancy. In CVPR,
2020. 2
[32] Yuanming Hu, Baoyuan Wang, and Stephen S Lin.
FC4: Fully convolutional color constancy with confidence-weighted pooling. In CVPR, 2017. 2, 6, 7, 8
[33] Jun Jiang, Dengyu Liu, Jinwei Gu, and Sabine Süsstrunk.
What is the space of spectral sensitivity functions for digital
color cameras? In WACV, 2013. 4, 5
[34] Hamid Reza Vaezi Joze and Mark S. Drew. Exemplar-based color constancy and multiple illumination. TPAMI,
36(5):860–873, 2014. 2
[35] Hamid Reza Vaezi Joze, Mark S. Drew, Graham D. Finlayson,
and Perla Aurora Troncoso Rey. The role of bright
pixels in illumination estimation. In Color Imaging Conference,
2012. 2
[36] Hakki C. Karaimer and Michael S. Brown. A software
platform for manipulating the camera imaging pipeline. In
ECCV, 2016. 2
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. ICLR, 2014. 5
[38] Zhongyu Lou, Theo Gevers, Ninghang Hu, and Marcel P.
Lucassen. Color constancy by deep learning. In BMVC,
2015. 2
[39] Laurence T. Maloney. Physics-based approaches to modeling
surface color perception. Color vision: From genes to
perception, 1999. 1
[40] Simon Niklaus, Xuaner Cecilia Zhang, Jonathan T. Barron,
Neal Wadhwa, Rahul Garg, Feng Liu, and Tianfan
Xue. Learned dual-view reflection removal. arXiv preprint
arXiv:2010.00702, 2020. 1
[41] Seoung Oh and Seon Kim. Approaching the computational
color constancy as a classification problem through deep
learning. Pattern Recognition, 2016. 2
[42] Dilip K. Prasad. Strategies for resolving camera metamers
using 3+1 channel. In CVPR Workshop, 2016. 2
[43] Yanlin Qian, Ke Chen, Jarno Nikkanen, Joni-Kristian Kamarainen,
and Jiri Matas. Recurrent color constancy. In ICCV,
2017. 2
[44] Charles Rosenberg, Alok Ladsariya, and Tom Minka.
Bayesian color constancy with non-gaussian models. In
NeurIPS. 2004. 2
[45] Wu Shi, Chen Change Loy, and Xiaoou Tang. Deep specialized
network for illuminant estimation. In ECCV, 2016.
2
[46] Joost Van De Weijer, Theo Gevers, and Arjan Gijsenij. Edge-based color constancy. TIP, 16(9), 2007. 2, 5, 6, 7
[47] Jin Xiao, Shuhang Gu, and Lei Zhang. Multi-domain learning
for accurate and few-shot color constancy. In CVPR,
2020. 2
[48] Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian
Häne, Sean Fanello, and Rahul Garg. Du2net: Learning
depth estimation from dual-cameras and dual-pixels. arXiv
preprint arXiv:2003.14299, 2020. 1

