In this section, we start with a feasibility test that reveals the three key building blocks of the webcam peeking threat model, namely (1) reflection pixel size, (2) viewing angle, and (3) light signal-to-noise ratio (SNR). For the first two building blocks, we develop a mathematical model that quantifies the related impact factors. For light SNR, we analyze one major factor it encompasses, i.e., image distortions caused by shot noise, and investigate using multi-frame super resolution (MFSR) to enhance reflection images. We will analyze other physical factors that affect light SNR in Section IV-D. Experiments are conducted with the Acer laptop with its built-in 720p webcam, the pair of BLB glasses, and the pair of prescription glasses described in Appendix A.

A. Feasibility Test:
We conduct a feasibility test of recognizing single alphabet letters with a setup similar to that in Figure 1. A mannequin wears the BLB glasses at a glass-screen distance of 30 cm. Capital letters with different cap heights (80, 60, 40, 20, and 10 mm) are displayed and captured by the webcam. Figure 2 (upper) shows the captured reflections. We find that the five different cap heights result in letters with heights of 40, 30, 20, 10, and 5 pixels in the captured images. As expected, texts represented by fewer pixels are harder to recognize. The reflection pixel size acquired by adversaries is thus one key building block of the webcam peeking threat model that we need to characterize. In addition, Figure 2 (lower) shows the ideal reflections with these pixel sizes, obtained by resampling the template image. Comparing the two, we notice that small-size texts are subject to additional distortions besides the small pixel resolution and the noise caused by the face background, resulting in a poor signal-to-noise ratio (SNR) of the textual signals.

To quantify the differences using objective metrics, we embody the notion of reflection quality in the similarity between the reflected texts and the original templates. We compared multiple widely used image structural and textural similarity indexes, including the structural similarity index (SSIM) [56], complex-wavelet SSIM (CWSSIM) [53], feature similarity (FSIM) [59], and deep image structure and texture similarity (DISTS) [32], as well as self-built indexes based on scale-invariant feature transform (SIFT) features [49]. Overall, we found that CWSSIM, which spans the interval [0, 1] with larger numbers representing higher reflection quality, produces the best match with human perception results. Figure 2 shows the CWSSIM scores under each image.
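
As a concrete illustration of this scoring step, the following sketch computes a similarity score between a captured reflection and the resampled template. Since CWSSIM has no standard library implementation, plain SSIM from scikit-image is used here as a stand-in; the file paths and resizing policy are assumptions for illustration, not the paper's pipeline.

    # Sketch of the reflection-quality scoring step, using plain SSIM as a
    # stand-in for CW-SSIM. File paths and resizing policy are assumptions.
    import numpy as np
    from skimage import io, transform
    from skimage.metrics import structural_similarity

    def reflection_quality(reflection_path, template_path):
        reflection = io.imread(reflection_path, as_gray=True).astype(np.float64)
        template = io.imread(template_path, as_gray=True).astype(np.float64)
        # Resample the template to the reflection's pixel size so both images
        # live on the same grid, mirroring the "ideal reflection" baseline.
        template = transform.resize(template, reflection.shape, anti_aliasing=True)
        # Score in [0, 1]; larger means the reflected text better matches
        # the original template.
        return structural_similarity(reflection, template, data_range=1.0)
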
The differences show that the SNR of the reflected light corresponding to the textual targets is another key building block we need to characterize. Finally, we notice that when we rotate the mannequin by an angle exceeding a certain threshold, the webcam images no longer contain the letters displayed on the screen. This suggests that the viewing angle is another critical building block of the webcam peeking threat model, one which acts as an on/off function for successful recognition of screen contents. In the following sections, we seek to characterize these three building blocks.

B. Reflection Pixel Size:
In the attack, the embodiment of textual targets undergoes a two-stage conversion process: digital (victim software) → physical (victim screen) → digital (adversary camera). In the first stage, texts, usually specified in point size in software by the user or web designers, are rendered on the victim screen with corresponding physical cap heights. In the second stage, the on-screen texts get reflected by the glasses, captured by the camera, digitized, and transferred to the adversary's software as an image with a certain pixel size. Generally, more usable pixels representing the texts enable adversaries to recognize the texts more easily. The key is thus to understand the mechanism of the point size → cap height → pixel size conversion.

Point Size → Cap Height. The mapping between digital point size and physical cap height is not unique but depends on user-specific factors and software.
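
As a rough illustration of those dependencies, the sketch below converts a point size to an approximate cap height; the cap-height ratio and UI scaling factor are assumptions (both vary by font and system) rather than values from the paper.

    # A minimal sketch of the point-size -> cap-height mapping. The cap-height
    # ratio is font-dependent (roughly 0.7 for many fonts) and the scaling
    # factor reflects OS/browser zoom; both defaults here are assumptions.
    def cap_height_mm(point_size, cap_ratio=0.7, ui_scaling=1.0):
        mm_per_point = 25.4 / 72.0          # 1 point = 1/72 inch
        em_mm = point_size * mm_per_point * ui_scaling
        return em_mm * cap_ratio            # cap height is a fraction of the em

    # e.g., 12 pt text at 100% scaling: about 12 * 0.3528 * 0.7 = 3.0 mm
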
Cap Height → Pixel Size. We remind the readers that we only use pixel size to represent the size of texts in the images acquired by the adversary. Figure 3 shows the model for this conversion process. To simplify the model, we assume the glasses lens, screen contents, and webcam are aligned on the same line with the same angle. The result of this approximation is the loss of projective transformation information, which only causes small inaccuracies in reflection pixel size estimation in most webcam peeking scenarios. Figure 3 only depicts one of the two dimensions (horizontal and vertical) of the optical system but can be used for both. In this work we focus on the vertical dimension for analysis, i.e., the reflection pixel size we discuss is the height of the captured reflections in pixels.
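
The sketch below traces this conversion under the simplified aligned geometry: the lens is treated as a convex spherical mirror with focal length f_g, and the camera as a pinhole model. All numeric values are illustrative assumptions, not the paper's measured parameters.

    # Sketch of the cap-height -> pixel-size conversion under the aligned,
    # single-axis geometry of Figure 3. The lens is treated as a convex
    # spherical mirror with focal length f_g; all numeric values below are
    # illustrative assumptions.
    import math

    def reflection_pixels(cap_height_mm, d_sg_mm, d_cg_mm, f_g_mm,
                          v_res_px=720, v_fov_deg=45.0):
        # Convex mirror: virtual, demagnified image behind the lens.
        m = f_g_mm / (d_sg_mm + f_g_mm)       # magnification, 0 < m < 1
        d_i_mm = m * d_sg_mm                  # virtual image depth behind lens
        h_img_mm = cap_height_mm * m          # virtual image height
        # Pinhole camera: focal length in pixels from the vertical FOV.
        f_px = (v_res_px / 2) / math.tan(math.radians(v_fov_deg) / 2)
        return h_img_mm * f_px / (d_cg_mm + d_i_mm)

    # e.g., 10 mm caps, 300 mm screen-glass and camera-glass distances, and
    # f_g = 30 mm give m of about 0.09, i.e., text only a few pixels tall.
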
C. Viewing Angle:
To model the effect of the viewing angle and the range of angles that enables the webcam peeking attack, we model the lens as spherical with a radius of 2f_g, where f_g is the focal length of the glasses lens. A detailed derivation of the viewing angle model can be found in Appendix B. We consider two cases of successful peeking with a rotation of the glasses lens. The first case, All Page, claims success as long as there exists a point on the screen whose emitted light ray can reach the camera. The second case, Center, claims success only if the contents at the center of the screen emit light that can be reflected to the camera. Table II summarizes the calculated theoretical angle ranges and the measured values. Both the theoretical model and the measurements show that the webcam peeking attack is relatively robust to human positioning with different head viewing angles, which is validated later by the user study results (Section V-B).

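For intuition, the sketch below implements a crude first-order version of the Center case using a planar-mirror approximation rather than the spherical model of Appendix B: rotating the glasses by θ rotates the reflected ray by 2θ, and success requires the returned ray to still land within the camera's acceptance region. The tolerance values are illustrative assumptions.

    # First-order sketch of the "Center" success condition under a planar-
    # mirror approximation (the spherical model of Appendix B widens the
    # workable range). All tolerances are illustrative assumptions.
    import math

    def center_reflection_reaches_camera(theta_deg, d_cg_mm, cam_accept_mm):
        # By the law of reflection, rotating the mirror by theta rotates the
        # reflected ray by 2*theta, shifting its landing point on the screen
        # plane by about d_cg * tan(2*theta).
        displacement = d_cg_mm * math.tan(math.radians(2 * theta_deg))
        # Success if the ray still lands within the acceptance region around
        # the camera, which sits just above the display.
        return abs(displacement) <= cam_accept_mm

    # e.g., with the camera 600 mm away and a ~100 mm acceptance region,
    # theta must stay within roughly 5 degrees in this crude planar model.
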
D. Image Distortion Characterization:
Generally, the possible distortions are composed of the imaging system's inherent distortions and other external distortions. Inherent distortions mainly include out-of-focus blur and the various types of imaging noise introduced by non-ideal camera circuits. Such inherent distortions exist in camera outputs even when no user interacts with the camera. External distortions, on the other hand, mainly include factors like motion blur caused by the movement of active webcam users.

User Movement-caused Motion Blur: When users move in front of their webcams, the reflections from their glasses move accordingly, which can cause blur in the camera images. User motions can be decomposed into two components, namely involuntary periodic small-amplitude tremors that are always present [33], and intentional non-periodic large-amplitude movements occasionally caused by random events such as a user moving their head to look aside.

For tremor-based motion, existing research suggests the mean displacement amplitude of dystonia patients' head tremors is under 4 mm with a maximum frequency of about 6 Hz [34]. Since dystonia patients have stronger tremors than healthy people, this provides an upper-bound estimate of the tremor amplitude. With the example glasses in Section III-B and a 30 fps camera, the estimated pixel blur is under 1 pixel, as sketched below. Such motion blur is thus likely to affect only the recognition of extremely small reflections. Intentional motion is not a focus of this work due to its random, occasional, and individual-specific characteristics. We instead account for the impact of intentional user motion experimentally in the user study by letting users behave naturally.

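The back-of-the-envelope arithmetic behind that sub-pixel estimate, assuming sinusoidal tremor at the cited bounds and a worst-case exposure of one full frame; the magnification and pixel scale are the illustrative values from the pixel-size sketch above, not measured ones.

    # Worst-case tremor blur during a single 30 fps frame, assuming
    # sinusoidal head motion at 4 mm amplitude / 6 Hz. Magnification and
    # pixel scale are the illustrative values from the sketch above.
    import math

    A_mm, f_hz, exposure_s = 4.0, 6.0, 1.0 / 30.0
    v_max = 2 * math.pi * f_hz * A_mm          # peak tremor speed, ~151 mm/s
    head_travel = v_max * exposure_s           # <= ~5 mm during one frame
    m, px_per_mm = 0.09, 1.4                   # reflection magnification, camera scale
    blur_px = head_travel * m * px_per_mm      # ~0.6 px, i.e., under 1 pixel
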
Distortion Analysis: To observe and analyze the dominant types of distortions, we recorded videos with the laptop webcam and a Nikon Z7 DSLR [17], the latter representing a higher-quality imaging system. The setup is the same as in the feasibility test, except that we tested with both the still mannequin and a human to analyze the effects of human tremor. Figure 14 (a) compares the ideal reflection capture and the actual captures in three consecutive video frames of the webcam (1st row) and the Nikon Z7 (2nd row) when the human wears the glasses. Empirically, we observed the following three key features of the video frames in this setup with both the mannequin and the human (see Appendix D for details):

∙ Out-of-focus blur and tremor-caused motion blur are generally negligible when the reflected texts are recognizable.
∙ Inter-frame variance: the distortions at the same position of each frame are different, generating different noise patterns for each frame.
∙ Intra-frame variance: even within a single frame, the distortion patterns are spatially non-uniform.

One key observation is that the captured texts are subject to occlusions (the missing or faded parts) caused by shot noise [19] when an insufficient number of photons hits the sensor. This is readily explained by the short exposure time and small text pixel size, which together reduce the number of photons emitted and received. In addition, other common imaging noise such as Gaussian noise gets visually amplified by the relatively high ISO values needed to compensate for the poor light sensitivity of the webcam sensors. We call such noise ISO noise. Both types of distortion have the potential to cause intra-frame and inter-frame variance. Shot noise and ISO noise in the webcam peeking attack trade off against each other, with an equilibrium point set by the quality of the camera's imaging sensors. This suggests that the threat level will further increase (see the comparison between the webcam's and Nikon Z7's images in Figure 14) as future webcams get equipped with better-quality sensors at lower costs.

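To see why low photon counts occlude thin strokes, the following sketch simulates one short exposure by treating pixel intensities as expected photon counts and drawing from a Poisson distribution, which is the standard model of shot noise. The photon budget is an illustrative assumption.

    # Simulation of shot-noise occlusion: pixel values are treated as
    # expected photon counts and a Poisson draw models the photons actually
    # collected in one short exposure. The photon budget is an assumption.
    import numpy as np

    def simulate_shot_noise(clean_img, photons_at_white=30, seed=None):
        rng = np.random.default_rng(seed)
        expected = np.clip(clean_img, 0.0, 1.0) * photons_at_white
        noisy = rng.poisson(expected).astype(np.float64) / photons_at_white
        return np.clip(noisy, 0.0, 1.0)

    # With only ~30 photons at full brightness, thin strokes of small text
    # randomly drop below the background, producing the missing/faded parts
    # and the inter-frame variance observed above.
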
E. Image Enhancing with MFSR:
The analysis of distortions calls for an image reconstruction scheme that can reduce multiple types of distortions and tolerate inter-frame and intra-frame variance. One possible method is to reconstruct a better-quality image from multiple low-quality frames. Such a reconstruction problem is usually defined as multi-frame super resolution (MFSR) [58]. The basic idea is to combine non-redundant information in multiple frames to generate a better-quality frame.

We tested three common lightweight MFSR approaches that do not require a training phase: cubic spline interpolation [58], fast and robust MFSR [36], and adaptive kernel regression (AKR) based MFSR [41]. Test results on the reflection images show that the AKR-based approach generally yields better results than the other two approaches in our specific application and setup. All three approaches outperform simple averaging plus upsampling of the frames after frame registration, which may be viewed as a degraded form of MFSR (sketched below). An example comparison between the different methods and the original 8 frames used for MFSR is shown in Figure 4 (a). We thus use the AKR-based approach for the following discussions.
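
For reference, a minimal sketch of that degraded baseline: register the frames with sub-pixel phase correlation, average, and upsample. The registration method and 2x upsampling factor are our illustrative choices; the AKR-based MFSR used above replaces the naive averaging with adaptive kernel regression.

    # Degraded-MFSR baseline: frame registration, averaging, upsampling.
    # Registration method and upsampling factor are illustrative choices.
    import numpy as np
    from scipy.ndimage import shift
    from skimage.registration import phase_cross_correlation
    from skimage.transform import resize

    def average_and_upsample(frames, scale=2):
        ref = frames[0]
        aligned = [ref]
        for frame in frames[1:]:
            # Sub-pixel translation of each frame relative to the reference.
            offset, _, _ = phase_cross_correlation(ref, frame, upsample_factor=10)
            aligned.append(shift(frame, offset))
        fused = np.mean(aligned, axis=0)   # averaging suppresses per-frame noise
        out_shape = (ref.shape[0] * scale, ref.shape[1] * scale)
        return resize(fused, out_shape, anti_aliasing=True)
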
One parameter to decide for the use of webcam peeking is the number of frames used to reconstruct the high-quality image. Figure 4 (b) shows the CWSSIM score improvement of the reconstructed image over the original frames for different numbers of frames used for MFSR when a human wears the glasses to generate the reflections. Note that increasing the number of frames does not monotonically increase image quality, since live users' occasional intentional movements can degrade image registration in the MFSR process and thus undermine the reconstruction quality. Based on the results, we empirically choose to use 8 frames for the following evaluations. In addition, the improvement in CWSSIM scores also validates that the MFSR-reconstructed images have better quality than most of the original frames. We thus only consider evaluation using the MFSR images in the following sections.