A major limitation of visual communication is that people can attend to only a very limited number of features and events at any one time. For example, we can see with full resolution only in the fovea, a tiny part of the retina. To fully take in a visual scene, we must move our eyes around to bring objects into focus one at a time. Although people are rarely aware of these two to three eye movements per second, where they look ultimately determines what they perceive. We are therefore developing gaze-guidance systems to help users deploy their limited attentional resources more effectively and to augment natural sight with computer-vision technology in an unobtrusive way (see Figure 1). Such guidance might be useful in safety-critical applications, such as driving, and would ultimately allow novices to ‘see with the eyes of experts.’
The information conveyed by an image is determined jointly by the image and the observer's gaze pattern, which may vary considerably from person to person and with context. Gaze is therefore as important as brightness or colour in defining the message that reaches the observer, yet existing information and communication-technology systems do not take this into account. We propose that the observer's gaze pattern should be sensed and displayed just like physical image attributes. Gaze sensors, or eye trackers, have long been commercially available; our goal is to go further and guide the observer's gaze through the image.
Implementing such a system requires further insights into the function of human vision, and several technical challenges must be met. Our research is funded by the European Commission's Future and Emerging Technologies programme, which focuses on high-risk basic research with high-potential future applications. The consortium consists of human- and computer-vision partners from universities in Lübeck and Giessen (Germany), Groningen (the Netherlands), Leuven (Belgium) and Bucharest (Romania), as well as a German company providing eye-tracking technology.
Gaze-contingent interactive displays dynamically present information and guide the user's gaze, helping viewers deploy their limited attentional resources more effectively.
We show subjects high-resolution natural (non-computer-generated) videos while a camera tracks their eye movements. Based on the current eye position and the video input, we first predict a set of candidate points where the subject may look next. We then perform real-time image transformations to increase the likelihood that gaze moves to the desired candidate location (e.g., by increasing contrast or introducing motion) and to decrease it for all others (e.g., by reducing contrast or blurring). To evaluate the effect gaze guidance has on both eye movements and actual behaviour, we carried out similar experiments in a virtual-reality driving simulator. (Videos are available online.)1
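The contrast-based modulation step can be sketched as follows. This is a minimal illustration only, assuming grayscale frames stored as NumPy arrays; the helper names (`modulate_patch`, `guide_gaze`) are hypothetical, and the actual system applies such changes per frequency band in real time rather than directly in the pixel domain.

```python
import numpy as np

def modulate_patch(frame, center, radius, gain):
    """Scale local contrast in a square region around `center` by `gain`.

    gain > 1 boosts contrast (making the region more likely to attract gaze),
    gain < 1 attenuates it. `center` is a (row, col) pair in a grayscale frame.
    Returns a new frame; the input is not modified.
    """
    frame = frame.astype(float)
    r, c = center
    r0, r1 = max(r - radius, 0), min(r + radius, frame.shape[0])
    c0, c1 = max(c - radius, 0), min(c + radius, frame.shape[1])
    patch = frame[r0:r1, c0:c1]
    mean = patch.mean()
    # Rescale deviations from the local mean, clipping to the valid range.
    frame[r0:r1, c0:c1] = np.clip(mean + gain * (patch - mean), 0, 255)
    return frame

def guide_gaze(frame, candidates, target_idx, radius=16):
    """Boost contrast at the desired candidate point, attenuate all others."""
    for i, pt in enumerate(candidates):
        frame = modulate_patch(frame, pt, radius, 1.5 if i == target_idx else 0.5)
    return frame
```

In practice the gains would be chosen adaptively so the modulation stays below the threshold of conscious awareness.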
To investigate which image transformations can make a movie patch more or less attractive to an observer, we first collected a large set of eye-movement data on natural videos. We employed machine-learning techniques to distill the structural differences between patches observers looked at and those they missed. Learning these differences on a very simple, low-dimensional representation of the videos (the average local feature energy) significantly outperformed previously published approaches in predicting whether an observer will attend to a given new patch.
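A toy version of this learning step might look as follows. It uses the mean squared deviation of a patch as a stand-in for the band-wise feature energy described above and fits a one-dimensional logistic classifier by gradient descent; the function names are hypothetical and the project's actual machine-learning pipeline is more sophisticated.

```python
import numpy as np

def feature_energy(patch):
    """Average local feature energy: mean squared deviation from the patch
    mean (a simple stand-in for spatio-temporal band energies)."""
    p = np.asarray(patch, dtype=float)
    return ((p - p.mean()) ** 2).mean()

def train_patch_classifier(patches, labels, lr=0.5, steps=2000):
    """Fit a logistic classifier separating attended (1) from missed (0)
    patches on the single energy feature. Returns a predict function."""
    x = np.array([feature_energy(p) for p in patches])
    y = np.asarray(labels, dtype=float)
    mu, sd = x.mean(), x.std() + 1e-9          # normalise the feature
    z = (x - mu) / sd
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * z + b)))  # sigmoid
        w -= lr * ((p - y) * z).mean()          # gradient of mean log-loss
        b -= lr * (p - y).mean()

    def predict(patch):
        zz = (feature_energy(patch) - mu) / sd
        return 1.0 / (1.0 + np.exp(-(w * zz + b))) > 0.5

    return predict
```

The decision boundary learned here is what the next section's image transformations push patches across: toward it to attract gaze, away from it to repel.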
We can also use information on the class boundary that separates attended and non-attended movie patches to derive image transformations. These are then implemented as local changes on a spatio-temporal Laplacian pyramid, i.e., a decomposition of the input video into spatio-temporal frequency bands. Because decomposition of high-resolution video is computationally expensive, we implemented the system on dedicated graphics hardware to meet real-time constraints.

We obtained several intermediate results. First, we developed novel algorithms for eye-movement prediction that significantly outperform the state of the art (prediction rates improved from 65 to 84%). Second, we showed that simple gaze-guidance techniques (e.g., flashing red dots that remain unobtrusive) can improve performance in visual tasks such as reading and driving. Third, we pushed technological development to the limits using low-latency eye tracking and complex real-time, gaze-contingent modifications of high-resolution videos: we can now perform space-variant, spatio-temporal filtering of high-definition TV video with an image-processing latency of 2ms.
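The Laplacian-pyramid decomposition underlying these transformations can be sketched in a few lines. The sketch below is a simplified, spatial-only version using 2×2 box filters and assuming even image dimensions; the project's implementation is spatio-temporal, uses proper low-pass kernels, and runs on graphics hardware.

```python
import numpy as np

def downsample(img):
    """2x2 box filter and decimation: average each 2x2 block."""
    h2, w2 = img.shape[0] // 2, img.shape[1] // 2
    return img[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).mean(axis=(1, 3))

def upsample(img, shape):
    """Nearest-neighbour upsampling back to `shape` (even dims assumed)."""
    out = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return out[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels):
    """Decompose an image into band-pass levels plus a low-pass residual."""
    pyr, cur = [], img.astype(float)
    for _ in range(levels):
        low = downsample(cur)
        pyr.append(cur - upsample(low, cur.shape))  # band-pass detail
        cur = low
    pyr.append(cur)                                 # low-pass residual
    return pyr

def reconstruct(pyr):
    """Invert the decomposition by successive upsampling and addition."""
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = upsample(cur, band.shape) + band
    return cur
```

Because each level stores the exact residual, reconstruction is lossless; gaze-contingent changes amount to scaling individual bands locally before reconstructing.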
Although work on determining the optimal strategy for natural environments is still ongoing, we demonstrated that gaze guidance is possible in principle and that it may have a significant positive impact on visual performance. The next step will be to take gaze guidance from the laboratory to real-world applications, such as training and simulator scenarios. Further details about the GazeCom project are available on our website.2