In this section, we start with a feasibility test that reveals the three key building blocks of the webcam peeking threat model, namely (1) reflection pixel size, (2) viewing angle, and (3) light signal-to-noise ratio (SNR). For the first two building blocks, we develop a mathematical model that quantifies the related impact factors. For light SNR, we analyze one major factor it encompasses, i.e., image distortions caused by shot noise, and investigate using multi-frame super resolution (MFSR) to enhance reflection images. We will analyze other physical factors that affect light SNR in Section IV-D. Experiments are conducted with the Acer laptop with its built-in 720p webcam, the pair of BLB glasses, and the pair of prescription glasses described in Appendix A.
A. Feasibility Test:

We conduct a feasibility test of recognizing single alphabet letters with a setup similar to that in Figure 1. A mannequin wears the BLB glasses at a glasses-screen distance of 30 cm. Capital letters with different cap heights (80, 60, 40, 20, 10 mm) are displayed and captured by the webcam. Figure 2 (upper) shows the captured reflections. We find that the five different cap heights result in letters with heights of 40, 30, 20, 10, and 5 pixels in the captured images. As expected, texts represented by fewer pixels are harder to recognize. The reflection pixel size acquired by adversaries is thus one key building block of the webcam peeking threat model that we need to characterize. In addition, Figure 2 (lower) shows the ideal reflections at these pixel sizes, obtained by resampling the template image. Comparing the two, we notice that small-size texts are subject to additional distortions beyond the small pixel resolution and the noise caused by the face background, resulting in a poor signal-to-noise ratio (SNR) of the textual signals.
To quantify the differences with objective metrics, we embody the notion of reflection quality in the similarity between the reflected texts and the original templates. We compared multiple widely used image structural and textural similarity indexes, including the structural similarity index (SSIM) [56], complex-wavelet SSIM (CWSSIM) [53], feature similarity (FSIM) [59], and deep image structure and texture similarity (DISTS) [32], as well as self-built indexes based on scale-invariant feature transform (SIFT) features [49]. Overall, we found that CWSSIM, which spans the interval [0, 1] with larger values representing higher reflection quality, produces the best match with human perception results. Figure 2 shows the CWSSIM score under each image.
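As a concrete illustration of this scoring step, the sketch below compares a captured reflection against its rendered template using SSIM from scikit-image. Plain SSIM serves here only as a stand-in, since CWSSIM is not part of scikit-image (complex-wavelet implementations exist in libraries such as dtcwt); the cropping and registration of the two inputs are assumed to have been done beforehand.

import numpy as np
from skimage.metrics import structural_similarity

def reflection_quality(template: np.ndarray, reflection: np.ndarray) -> float:
    """Score a captured reflection against the rendered text template.

    Both inputs are grayscale images already cropped, registered, and
    resized to the same shape. Returns a value in [0, 1]; higher means
    the reflection preserves more of the template's structure.
    """
    t = template.astype(np.float32)
    r = reflection.astype(np.float32)
    # data_range must cover the dynamic range of the inputs.
    return structural_similarity(t, r, data_range=t.max() - t.min())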
The differences show that the SNR of the reflected light corresponding to the textual targets is another key building block we need to characterize. Finally, we notice that when we rotate the mannequin beyond a certain angle threshold, the webcam images no longer contain the letters displayed on the screen. This suggests that the viewing angle is a third critical building block of the webcam peeking threat model, acting as an on/off switch for successful recognition of screen contents. In the following sections, we seek to characterize these three building blocks.
B. Reflection Pixel Size:

In the attack, the textual targets undergo a two-stage conversion process: digital (victim software) → physical (victim screen) → digital (adversary camera). In the first stage, texts, usually specified in point size in software by the user or web designers, are rendered on the victim screen with corresponding physical cap heights. In the second stage, the on-screen texts get reflected by the glasses, captured by the camera, digitized, and transferred to the adversary's software as an image with a certain pixel size. Generally, more usable pixels representing the texts enable adversaries to recognize the texts more easily. The key is thus to understand the mechanism of the point size → cap height → pixel size conversion.
Point Size → Cap Height. The mapping between digital point size and physical cap height is not unique but depends on user-specific factors and software.
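As a rough numeric illustration of this mapping, the sketch below converts a point size to an approximate cap height. Every constant in it (the 1/72-inch point, the display scaling factor, and the ~0.7 cap-height ratio) is an assumption that varies across operating systems, applications, and fonts.

def cap_height_mm(point_size: float,
                  scaling: float = 1.0,
                  cap_ratio: float = 0.7) -> float:
    """Approximate the physical cap height of on-screen text.

    point_size: nominal font size in points (1 pt = 1/72 inch).
    scaling:    OS/application display scaling factor (e.g., 1.25).
    cap_ratio:  fraction of the em size that capital letters occupy;
                roughly 0.7 for many fonts, but font-dependent.
    """
    return point_size * scaling * cap_ratio * 25.4 / 72.0

Under these assumptions, 12 pt text at 1.0 scaling maps to roughly 3 mm of cap height.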
Cap Height → Pixel Size. We remind the reader that we use pixel size only to represent the size of texts in the images acquired by the adversary. Figure 3 shows the model for this conversion process. To simplify the model, we assume the glasses lens, screen contents, and webcam are aligned on the same line with the same angle. The result of this approximation is the loss of projective transformation information, which causes only small inaccuracies in reflection pixel size estimation in most webcam peeking scenarios. Figure 3 depicts only one of the two dimensions (horizontal and vertical) of the optical system, but it applies to both. In this work we focus on the vertical dimension for analysis, i.e., the reflection pixel size we discuss is the height of the captured reflections in pixels.
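The full geometry is given in Figure 3; as a minimal sketch of the underlying optics, the function below treats the lens as a convex mirror of focal length f_g (radius 2f_g, matching Section III-C), applies the mirror equation 1/v + 1/u = 1/f to obtain a minified virtual image of the screen behind the lens, and projects that virtual image onto the webcam sensor with a pinhole model. The parameter names are ours, and the alignment simplifications above are assumed.

def reflection_pixel_height(cap_height_mm: float,
                            f_g_mm: float,       # glasses mirror focal length
                            d_screen_mm: float,  # screen-to-glasses distance
                            d_cam_mm: float,     # glasses-to-camera distance
                            f_cam_mm: float,     # webcam focal length
                            pitch_mm: float) -> float:
    """Estimate the height (in pixels) of reflected text in the webcam image."""
    # Convex mirror: virtual image minified by m, located v behind the lens.
    m = f_g_mm / (f_g_mm + d_screen_mm)
    v = f_g_mm * d_screen_mm / (f_g_mm + d_screen_mm)
    h_virtual = m * cap_height_mm
    # Pinhole projection of the virtual image onto the webcam sensor.
    h_sensor = f_cam_mm * h_virtual / (d_cam_mm + v)
    return h_sensor / pitch_mm  # sensor-plane height -> pixel count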
C. Viewing Angle:

To model the effect of the viewing angle and the range of angles that enables the webcam peeking attack, we model the lens as spherical with radius 2f_g, where f_g is the focal length of the glasses lens acting as a mirror. A detailed derivation of the viewing angle model can be found in Appendix B. We consider two cases of successful peeking under a rotation of the glasses lens. The first case, All Page, claims success as long as there exists a point on the screen whose emitted light ray can reach the camera. The second case, Center, claims success only if the contents at the center of the screen emit light that can be reflected to the camera. Table II summarizes the calculated theoretical angle ranges and the measured values. Both the theoretical model and the measurements show that the webcam peeking attack is relatively robust to human positioning with different head viewing angles, which is validated later by the user study results (Section V-B).
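Appendix B gives the analytical derivation; purely to illustrate the on/off behavior, the 2D sketch below numerically tests the Center case by sweeping sample points across the lens aperture and applying the law of reflection. All geometry values (focal length, distances, aperture, tolerance) are illustrative assumptions, and the webcam is co-located with the screen center for simplicity.

import numpy as np

def center_visible(theta_deg: float,
                   f_g: float = 400.0,      # lens focal length (mm), R = 2*f_g
                   dist: float = 300.0,     # screen/camera-to-glasses distance
                   aperture: float = 15.0,  # lens half-width (mm)
                   tol: float = 5.0) -> bool:
    """2D sketch of the 'Center' success case: does some point on the
    glasses lens (spherical mirror of radius 2*f_g) reflect a ray from
    the screen center back to the co-located webcam after the head is
    rotated by theta_deg? All values are illustrative, not measured.
    """
    R = 2.0 * f_g
    theta = np.radians(theta_deg)
    n0 = np.array([-np.cos(theta), np.sin(theta)])  # rotated apex normal
    apex = np.array([dist, 0.0])
    c = apex - R * n0                               # sphere center
    for phi in np.linspace(-aperture / R, aperture / R, 2001):
        rot = np.array([[np.cos(phi), -np.sin(phi)],
                        [np.sin(phi),  np.cos(phi)]])
        n = rot @ n0                                # outward surface normal
        p = c + R * n                               # surface point on the lens
        d = p / np.linalg.norm(p)                   # ray from screen center to p
        r = d - 2.0 * np.dot(d, n) * n              # law of reflection
        t = np.dot(-p, r)                           # closest approach to origin
        if t > 0 and np.linalg.norm(p + t * r) < tol:
            return True
    return False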
D. Image Distortion Characterization:

Generally, the possible distortions are composed of the imaging system's inherent distortions and other external distortions. Inherent distortions mainly include out-of-focus blur and various imaging noises introduced by non-ideal camera circuits. Such inherent distortions exist in camera outputs even when no user interacts with the camera. External distortions, on the other hand, mainly include factors like motion blur caused by the movement of active webcam users.
User Movement-caused Motion Blur: When users move in front of their webcams, the reflections from their glasses move accordingly, which can blur the camera images. User motions can be decomposed into two components, namely involuntary periodic small-amplitude tremors that are always present [33], and intentional non-periodic large-amplitude movements that occur occasionally due to random events, such as a user turning their head to look aside.

For tremor-based motion, existing research suggests the mean displacement amplitude of dystonia patients' head tremors is under 4 mm, with a maximum frequency of about 6 Hz [34]. Since dystonia patients have stronger tremors than healthy people, this provides an upper-bound estimate of the tremor amplitude. With the example glasses in Section III-B and a 30 fps camera, the estimated blur is under 1 pixel. Such a motion blur is thus likely to affect only the recognition of extremely small reflections. Intentional motion is not a focus of this work due to its random, occasional, and individual-specific characteristics. We experimentally include the impact of intentional user motions in the user study by letting users behave normally.
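The sinusoidal bound behind this estimate can be written out directly. The amplitude and frequency below are the cited upper bounds [34]; the conversion from millimeters at the glasses to pixels in the reflection goes through the geometric model of Section III-B rather than any constant here.

import math

# Upper-bound the glasses displacement within one camera frame, modeling
# the tremor as a sinusoid A*sin(2*pi*f*t) with the bounds from [34].
amp_mm, freq_hz, fps = 4.0, 6.0, 30.0
peak_speed = 2 * math.pi * freq_hz * amp_mm  # peak tremor speed, mm/s
max_move_mm = peak_speed / fps               # worst-case motion in one frame
# Mapping max_move_mm to blur in reflection pixels uses the mirror
# minification and camera projection of Section III-B; for the example
# glasses there, the resulting blur stays under 1 pixel.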
Distortion Analysis: To observe and analyze the dominant types of distortions, we recorded videos with the laptop webcam and a Nikon Z7 DSLR [17], the latter representing a higher-quality imaging system. The setup is the same as in the feasibility test, except that we tested with both the still mannequin and a human to analyze the effects of human tremor. Figure 14 (a) compares the ideal reflection capture with the actual captures in three consecutive video frames of the webcam (1st row) and the Nikon Z7 (2nd row) when the human wears the glasses. Empirically, we observed the following three key features of the video frames in this setup with both the mannequin and the human (see Appendix D for details):

∙ Out-of-focus blur and tremor-caused motion blur are generally negligible when the reflected texts are recognizable.
∙ Inter-frame variance: The distortions at the same position of each frame are different, generating different noise patterns for each frame.
∙ Intra-frame variance: Even within a single frame, the distortion patterns are spatially non-uniform (both variance notions are made concrete in the sketch below).
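A minimal way to make both notions measurable, assuming a registered stack of grayscale reflection crops:

import numpy as np

def variance_maps(frames: np.ndarray):
    """frames: (N, H, W) stack of registered grayscale reflection crops.

    Returns the per-pixel variance across frames (inter-frame variance)
    and the spatial variance within each frame (intra-frame spread).
    """
    f = frames.astype(np.float32)
    return f.var(axis=0), f.var(axis=(1, 2))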
One key observation is that the captured texts are subject to occlusions (the missing or faded parts) caused by shot noise [19] when an insufficient number of photons hits the sensor. This is readily explained by the short exposure time and small text pixel size, which reduce the number of photons emitted and received. In addition, other common imaging noise such as Gaussian noise gets visually amplified by the relatively high ISO values needed to compensate for the poor light sensitivity of webcam sensors. We call such noise ISO noise. Both types of distortion can cause intra-frame and inter-frame variance. The shot and ISO noise in the webcam peeking attack play on a see-saw, with an equilibrium point set by the quality of the camera's imaging sensors. This suggests that the threat level will further increase (see the comparison between the webcam's and Nikon Z7's images in Figure 14) as future webcams are equipped with better-quality sensors at lower costs.
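To see why low photon counts alone produce such occlusions, the sketch below applies a Poisson photon-arrival model to a clean text template; the photon budget photons_at_white is an illustrative assumption, not a measured value.

import numpy as np

def simulate_shot_noise(template: np.ndarray,
                        photons_at_white: float = 20.0,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """Apply a Poisson photon-arrival model to a text template in [0, 1].

    Each pixel's expected photon count scales with its brightness; at
    low counts (dim reflections, short exposures) the relative noise
    ~1/sqrt(N) is large, producing the missing or faded strokes seen
    in the captured reflections.
    """
    expected = template * photons_at_white  # mean photons per pixel
    observed = rng.poisson(expected)        # one shot-noise realization
    return np.clip(observed / photons_at_white, 0.0, 1.0)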
E. Image Enhancing with MFSR:

The analysis of distortions calls for an image reconstruction scheme that can reduce multiple types of distortions and tolerate inter-frame and intra-frame variance. One possible method is to reconstruct a better-quality image from multiple low-quality frames. Such a reconstruction problem is usually defined as multi-frame super resolution (MFSR) [58]. The basic idea is to combine non-redundant information from multiple frames to generate a single better-quality frame.
We tested three common lightweight MFSR approaches that do not require a training phase, including cubic spline interpolation [58], fast and robust MFSR [36], and adaptive kernel regression (AKR) based MFSR [41]. Test results on the reflection images show that the AKR-based approach generally yields better results than the other two in our specific application and setup. All three approaches outperform a simple averaging plus upsampling of the frames after frame registration, which may be viewed as a degraded form of MFSR. An example comparison between the different methods and the original 8 frames used for MFSR is shown in Figure 4 (a). We thus use the AKR-based approach in the following discussions.
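For reference, the registration-plus-averaging baseline mentioned above can be sketched with SciPy and scikit-image as follows; the frame format and the upsampling factor are our assumptions.

import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation
from skimage.transform import resize

def average_upsample(frames: np.ndarray, scale: int = 4) -> np.ndarray:
    """Degraded-MFSR baseline: register all frames to the first one,
    average them, then upsample the result.

    frames: (N, H, W) float grayscale stack of reflection crops.
    """
    ref = frames[0]
    registered = [ref]
    for f in frames[1:]:
        # Sub-pixel translation estimate between the reference and frame.
        offset, _, _ = phase_cross_correlation(ref, f, upsample_factor=10)
        registered.append(nd_shift(f, offset))
    avg = np.mean(registered, axis=0)  # averaging suppresses per-frame noise
    h, w = avg.shape
    return resize(avg, (h * scale, w * scale), order=3)  # cubic upsampling

Averaging suppresses the zero-mean inter-frame noise, while the upsampling step merely interpolates; full MFSR methods instead exploit sub-pixel offsets between frames to recover genuine detail.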
One parameter to decide for webcam peeking is the number of frames used to reconstruct the high-quality image. Figure 4 (b) shows the CWSSIM score improvement of the reconstructed image over the original frames for different numbers of frames used in MFSR when a human wears the glasses to generate the reflections. Note that increasing the number of frames does not monotonically increase the image quality, since a live user's occasional intentional movements can degrade the effectiveness of image registration in the MFSR process and thus undermine the reconstruction quality. Based on the results, we empirically choose to use 8 frames for the following evaluations. In addition, the improvement in CWSSIM scores also validates that the MFSR-reconstructed images have better quality than most of the original frames. We thus consider only the MFSR images in the evaluations of the following sections.