beth@ptolemy.arc.nasa.gov (Elizabeth) (07/19/90)
> >At the Interactive graphics conference in Snowbird at the end of March,
> >there was a demo of a project at NASA Ames (in conjunction with the head
> >mounted display project as seen in Scientific American) in which synthesized
> >sound was fed to headphones based on relative positions of head and "source"
> >as sensed by Polhemus trackers.
> >
> >In the talk, a comment from the audience mentioned a Canadian project to
> >do the same kind of thing, but through speakers!  Much harder!  There was
> >some skepticism about whether it could work, and others claiming to have
> >experienced it.
> >
> >The contractor here at NASA doing the work is Scott Foster.
> >
> >Sam Uselton    uselton@nas.nasa.gov
> >employed by CSC / working for NASA / speaking for myself
> >
> >----------------------------------------------------------------------------
> >
> >Moderator's Note:
> >
> >The Canadian project Sam refers to is underway within a small company known
> >as Gehring Research of Toronto, Ontario.  The system is known as the Focal
> >Point 3-D Audio System.

And another posting stated:

> I refer everyone to the excellent work being done at NASA/Ames by Elizabeth
> Wenzel and Scott Foster.  I saw their equipment demoed at the 1990 Symposium
> on Interactive 3D Graphics (Snowbird, Utah, March 1990), and the audio
> effects were very well done.
>
> Their apparatus was a Sennheiser HD-540 Pro headphone with a Polhemus 6 DoF
> tracker cemented to it, and another 6 DoF tracker for you to hold.
> A PC and a snazzy custom DSP board took a monaural sound source (in the
> demo, from a CD player) and positioned it to "seem" like it was located
> at the hand-held tracker.  Move either your head or the tracker, and
> a pretty good approximation of the "right" effect happened.
>
> The Head-Related Transfer Function (HRTF) is synthesized from 144 pairs
> of Finite Impulse Responses (FIRs) measured for the sample head.
>
> Superb work.
> Read about it in COMPUTER GRAPHICS, V 24, Number 2, March 1990,
> "A publication of ACM SIGGRAPH, special issue on the 1990 Symposium on
> Interactive 3D Graphics", pp 139-140.
>
>     Best,
>      -Mike Muuss

Just to clear up a couple of things about the above discussion of 3D
sound:

Although Bo's system is certainly quite relevant, the Canadian company
referred to at Snowbird was Q-Sound, who (as I understand it) claims to
have developed true 3D sound using presentation with only two, or even
one, loudspeakers such that there is an infinite "sweet spot"; i.e., you
can move around the room and localized sources stay put.  This is quite a
different problem from headphone presentation and, as Sam Uselton
suggested, much harder.  As far as I know, the Gehring system was
developed for headphones, not speaker presentation.  At least that was my
impression when Bo demonstrated the system to me, especially since one of
the sets of HRTFs he uses was obtained from Fred Wightman and is
basically a shortened/windowed version of one of the HRTF sets we use in
our lab.  (My apologies to Bo if this assessment is incorrect.)

My opinion is that one could get a substantial enhancement of apparent
auditory spaciousness by presenting HRTF-filtered stimuli over
loudspeakers, but that reliable realtime placement and movement of sound
sources independent of listener position is basically impossible.  The
physical acoustics of this situation are tremendously complex: one needs
to take into account the realtime relationships between the positions of
the listener, the desired virtual sources, and the real sources (the
loudspeakers), as well as compensate for the auditory crosstalk between
loudspeakers, which defeats the ability to precisely control the
waveforms entering the two ears of the listener.  This sort of precision
is essential for precise manipulation of location and is why headphones
are the transducer of choice.
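The crosstalk problem described above can be sketched in a few lines.  At a
single frequency, the signals reaching the two ears are a 2x2 complex mix of
the two loudspeaker signals, and a crosstalk canceller pre-filters the
speaker feeds with the inverse of that mixing matrix.  This is only an
illustration of the algebra, not any shipping system's method; the
transfer-function values below are invented, and the function name is my own.

```python
# Sketch of one-frequency-bin crosstalk cancellation. The ear signals
# are e = H @ s, where H holds the speaker-to-ear transfer functions:
#
#   [eL]   [H_LL  H_LR] [sL]
#   [eR] = [H_RL  H_RR] [sR]
#
# Inverting H recovers the speaker signals that deliver the desired
# ear signals exactly -- but only at the listening position where H was
# measured, which is why the "sweet spot" is so small.

def cancel_crosstalk(desired_left, desired_right, H):
    """Solve the 2x2 complex system H @ s = e for speaker signals s."""
    (a, b), (c, d) = H
    det = a * d - b * c          # assumed nonzero (H invertible)
    sL = (d * desired_left - b * desired_right) / det
    sR = (-c * desired_left + a * desired_right) / det
    return sL, sR

# Invented placeholder values: direct paths near 1, crosstalk paths
# smaller and phase-shifted. Not measured data.
H = ((1.0 + 0.0j, 0.3 - 0.2j),
     (0.3 - 0.2j, 1.0 + 0.0j))

# Ask for a signal at the left ear only.
sL, sR = cancel_crosstalk(1.0 + 0.0j, 0.0 + 0.0j, H)

# Pushing the solved speaker signals back through H reproduces the
# desired ear signals.
eL = H[0][0] * sL + H[0][1] * sR
eR = H[1][0] * sL + H[1][1] * sR
```

Of course, a real canceller must do this at every frequency, track the
listener's head, and remeasure or remodel H as the listener moves, which is
exactly the difficulty raised above.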
This doesn't mean that you can't get dramatic spatial EFFECTS over
speakers, but if the goal is presentation of location as a piece of
INFORMATION, as in a spatial display, my feeling is that one can't get
reliable and predictable 3D placement with loudspeakers.  That doesn't
make such techniques useless for some applications, however.  It often
seems that, in the area of 3D sound, there is confusion between the goals
of precise information presentation and the goals of aesthetics and the
creation of special effects.  Both goals are laudable, but they, and the
methods used to achieve them, may not always coincide.

There is another point to be raised when one is listening to and trying
to evaluate 3D sound systems.  Depending upon the nature of the demo
material, one can create a strong impression of precise localization
control which may or may not really be there.  For example, cognitive
cues may strongly bias your perception of location.  You are more likely
to believe an aircraft is above you rather than below you, to hear the
sound of lighting a cigarette in front of you where your mouth is rather
than behind you, or to hear someone clipping your hair above or behind
you rather than in front of your nose.  Such cognitive effects could
actually be quite useful to spatial display designers under certain
circumstances, but they do not reflect the direct manipulation of spatial
auditory cues per se.  A related cognitive effect is visual capture: the
visual location of an object tends to dominate its auditory location,
even when the two are in substantially different positions (as in the
"ventriloquism effect" and sound systems in movie theatres).

I am appending a brief summary of a recent overview talk I gave at the
Santa Barbara conference on telepresence and virtual environments; it has
a fair number of references which people may find useful.
If anyone is interested in finding out more about the realtime 3D sound
hardware developed in our lab (the "Convolvotron"), you can contact Scott
Foster, the designer, directly at:

    Crystal River Engineering
    12350 Wards Ferry Road
    Groveland, CA 95321
    (209) 962-6382

Regards,
Beth Wenzel
NASA-Ames Research Center

----------------------------------------------------------------------------

                      VIRTUAL ACOUSTIC DISPLAYS

        Presented at the Conference on Human-Machine Interfaces
             for Teleoperators and Virtual Environments
                Santa Barbara, CA, March 4-9, 1990

                       Elizabeth M. Wenzel
             Aerospace Human Factors Research Division
                    NASA-Ames Research Center
                        Mail Stop 262-2
                   Moffett Field, CA 94035
                       (415) 604-6290

As with most research in information displays, virtual displays have
generally emphasized visual information.  Many investigators, however,
have pointed out the importance of the auditory system as an alternative
or supplementary information channel (e.g., Deatherage, 1972; Doll et
al., 1986; Patterson, 1982; Gaver, 1986).  A three-dimensional auditory
display can potentially enhance information transfer by combining
directional and iconic information in a quite naturalistic representation
of dynamic objects in the interface.  Borrowing a term from Gaver (1986),
an obvious aspect of "everyday listening" is the fact that we live and
listen in a three-dimensional world.  Indeed, a primary advantage of the
auditory system is that it allows us to monitor and identify sources of
information from all possible locations, not just the direction of gaze.
This feature would be especially useful in an application that is
inherently spatial, such as an air traffic control display for the tower
or cockpit.  A further advantage of the binaural system, often referred
to as the "cocktail party effect" (Cherry, 1953), is that it improves the
intelligibility of sources in noise and assists in the segregation of
multiple sound sources.
This effect could be critical in applications involving encoded nonspeech
messages, as in scientific "visualization", the acoustic representation
of multi-dimensional data (e.g., Bly, 1982), and the development of
alternative interfaces for the visually impaired (Edwards, 1989; Loomis
et al., 1990).

Another aspect of auditory spatial cues is that, in conjunction with
other modalities, they can act as a potentiator of information in the
display.  For example, visual and auditory cues together can reinforce
the information content of the display and provide a greater sense of
presence or realism in a manner not readily achievable by either modality
alone (Colquhoun, 1975; Warren et al., 1981; O'Leary & Rhodes, 1984).
This phenomenon will be particularly useful in telepresence applications,
such as advanced teleconferencing environments, shared electronic
workspaces, and the monitoring of telerobotic activities in remote or
hazardous situations.  Thus, the combination of direct spatial cues with
good principles of iconic design could provide an extremely powerful and
information-rich display which is also quite easy to use.

This type of display could be realized with an array of real sound
sources or loudspeakers for listeners seated in a fixed position (Doll et
al., 1986; Calhoun et al., 1987).  An alternative approach, recently
developed at NASA-Ames, generates externalized, three-dimensional sound
cues over headphones in realtime using digital signal processing (Wenzel
et al., 1988a).  Here, the synthesis technique involves the digital
generation of stimuli using Head-Related Transfer Functions (HRTFs)
measured in the two ear canals of individual subjects (see Wightman &
Kistler, 1989a).  Up to four moving or static sources can be simulated in
a head-stable environment by digital filtering of arbitrary signals with
the appropriate HRTFs.
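The basic synthesis step described above, filtering an arbitrary monaural
signal with the left- and right-ear impulse responses for a given source
direction, can be sketched as follows.  This is a toy illustration only: the
HRIR values are made-up placeholders, not measured data, and the function
names are my own, not those of any actual system.

```python
# Sketch of binaural synthesis for one static source direction: the
# mono input is convolved with the left- and right-ear Head-Related
# Impulse Responses (HRIRs), yielding a two-channel headphone signal.

def fir_filter(signal, impulse_response):
    """Direct-form FIR convolution: y[n] = sum_k h[k] * x[n-k]."""
    out = []
    for n in range(len(signal) + len(impulse_response) - 1):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if 0 <= n - k < len(signal):
                acc += h * signal[n - k]
        out.append(acc)
    return out

def binaural_render(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one source direction."""
    return fir_filter(mono, hrir_left), fir_filter(mono, hrir_right)

# Toy example: a click rendered with invented HRIRs in which the right
# ear receives the sound slightly later and attenuated, roughly as it
# would for a source off to the listener's left.
mono = [1.0, 0.0, 0.0, 0.0]
hrir_left = [0.9, 0.3, 0.1]
hrir_right = [0.0, 0.5, 0.2, 0.05]
left, right = binaural_render(mono, hrir_left, hrir_right)
```

A realtime system works the same way in principle, but with measured
HRIR pairs for many directions (e.g., the 144 pairs mentioned in the
demo description above), interpolating and switching filters as the head
and source move.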
This type of presentation system is desirable because it allows complete
control over the acoustic waveforms delivered to the two ears and the
ability to interact dynamically with the virtual display.  Other similar
approaches include an analog system developed by Loomis et al. (1990) and
digital systems which make use of transforms derived from normative
manikins and simulations of room acoustics (Genuit, 1986; Posselt et
al., 1986; McKinley & Ericson, 1988; Persterer, 1989; Lehnert & Blauert,
1989).

Such an interface also requires the careful psychophysical evaluation of
listeners' ability to accurately localize the virtual or synthetic sound
sources.  For example, a recent study by Wightman & Kistler (1989b)
confirmed the perceptual adequacy of the basic technique for static
sources; source azimuth was synthesized nearly perfectly for all
listeners, while source elevation was somewhat less well-defined in the
headphone conditions.

From an applied standpoint, measurement of each potential listener's
HRTFs may not be possible in practice.  It may also be the case that the
user of such a display will not have the opportunity for extensive
training.  Thus, a critical research issue for virtual acoustic displays
is the degree to which the general population of listeners can obtain
adequate localization cues from stimuli based on non-individualized
transforms.  Preliminary data (Wenzel et al., 1988b) suggest that using
non-listener-specific transforms to achieve synthesis of localized cues
is at least feasible.  For experienced listeners, localization
performance was only slightly degraded compared to a subject's inherent
ability, even for the less robust elevation cues, as long as the
transforms were derived from what one might call a "good" localizer.
Further, the fact that individual differences in performance,
particularly for elevation, could be traced to acoustical idiosyncrasies
in the stimulus suggests that it may eventually be possible to create a
set of "universal transforms" by appropriate averaging (Genuit, 1986) and
data-reduction techniques (e.g., principal components analysis), or
perhaps even by enhancing the spectra of empirically derived transfer
functions (Durlach & Pang, 1986).

Alternatively, even inexperienced listeners may be able to adapt to a
particular set of HRTFs as long as they provide adequate cues for
localization.  A reasonable approach is to use the HRTFs from a subject
whose measurements have been "behaviorally calibrated" and are thus
correlated with known perceptual ability in both free-field and headphone
conditions.  In a recently completed study, sixteen inexperienced
listeners judged the apparent spatial location of sources presented over
loudspeakers in the free-field or over headphones.  The headphone stimuli
were generated digitally using HRTFs measured in the ear canals of a
representative subject (a "good localizer") from Wightman & Kistler
(1989a,b).  For twelve of the subjects, localization performance was
quite good, with judgements for the non-individualized stimuli nearly
identical to those in the free-field.

In general, these data suggest that most listeners can obtain useful
directional information from an auditory display without requiring the
use of individually tailored HRTFs.  However, an important caveat: the
results described above are based on analyses in which errors due to
front/back confusions were resolved.  For free-field versus simulated
free-field stimuli, experienced listeners exhibit front/back confusion
rates of about 5 vs. 10%, and inexperienced listeners show average rates
of about 20 vs.
30%.  Although the reason for such confusions is not completely
understood, they are probably due in large part to the static nature of
the stimulus and the ambiguity resulting from the so-called cone of
confusion (see Blauert, 1983).  Several stimulus characteristics may help
to minimize these errors.  For example, the addition of dynamic cues
correlated with head motion, and well-controlled environmental cues
derived from models of room acoustics, may improve the ability to resolve
these ambiguities.

REFERENCES

Blauert, J. (1983) Spatial Hearing. The MIT Press: Cambridge, MA.

Bly, S. (1982) Sound and computer information presentation. Unpublished
   doctoral thesis (UCRL-53282), Lawrence Livermore National Laboratory
   and University of California, Davis, CA.

Calhoun, G.L., Valencia, G., & Furness, T.A. III (1987) Three-dimensional
   auditory cue simulation for crew station design/evaluation. Proc.
   Hum. Fac. Soc., 31, 1398-1402.

Cherry, E.C. (1953) Some experiments on the recognition of speech with
   one and two ears. J. Acoust. Soc. Am., 25, 975-979.

Colquhoun, W.P. (1975) Evaluation of auditory, visual, and dual-mode
   displays for prolonged sonar monitoring in repeated sessions. Hum.
   Fac., 17, 425-437.

Deatherage, B.H. (1972) Auditory and other sensory forms of information
   presentation. In H.P. Van Cott & R.G. Kincade (Eds.), Human
   Engineering Guide to Equipment Design (rev. ed.), Washington, DC:
   U.S. Government Printing Office, 123-160.

Doll, T.J., Gerth, J.M., Engelman, W.R. & Folds, D.J. (1986) Development
   of simulated directional audio for cockpit applications. USAF Report
   No. AAMRL-TR-86-014.

Durlach, N.I. & Pang, X.D. (1986) Interaural magnification. J. Acoust.
   Soc. Am., 80, 1849-1850.

Edwards, A.D.N. (1989) Soundtrack: An auditory interface for blind
   users. Hum.-Comp. Interact., 4, 45-66.

Gaver, W. (1986) Auditory icons: Using sound in computer interfaces.
   Hum.-Comp. Interact., 2, 167-177.

Genuit, K.
(1986) A description of the human outer ear transfer function by
   elements of communication theory. Proc. 12th ICA (Toronto), Paper
   B6-8.

Lehnert, H. & Blauert, J. (1989) A concept for binaural room simulation.
   ASSP Workshop on Applications of Signal Processing to Audio and
   Acoustics, New Paltz, NY.

Loomis, J.M., Hebert, C., & Cicinelli, J.G. (1990) Active localization
   of virtual sound sources. Submitted to J. Acoust. Soc. Am.

McKinley, R.L. & Ericson, M.A. (1988) Digital synthesis of binaural
   auditory localization azimuth cues using headphones. J. Acoust. Soc.
   Am., 83, S18.

O'Leary, A. & Rhodes, G. (1984) Cross-modal effects on visual and
   auditory object perception. Perc. & Psychophys., 35, 565-569.

Patterson, R.R. (1982) Guidelines for Auditory Warning Systems on Civil
   Aircraft. Civil Aviation Authority Paper No. 82017, London.

Posselt, C., Schroter, J., Opitz, M., Divenyi, P., & Blauert, J. (1986)
   Generation of binaural signals for research and home entertainment.
   Proc. 12th ICA (Toronto), Paper B1-6.

Persterer, A. (1989) A very high performance digital audio signal
   processing system. ASSP Workshop on Applications of Signal Processing
   to Audio and Acoustics, New Paltz, NY.

Warren, D.H., Welch, R.B., & McCarthy, T.J. (1981) The role of
   visual-auditory "compellingness" in the ventriloquism effect:
   Implications for transitivity among the spatial senses. Perc. &
   Psychophys., 30, 557-564.

Wenzel, E.M., Wightman, F.L., & Foster, S.H. (1988a) A virtual display
   system for conveying three-dimensional acoustic information. Proc.
   Hum. Fac. Soc., 32, 86-90.

Wenzel, E.M., Wightman, F.L., Kistler, D.J., & Foster, S.H. (1988b)
   Acoustic origins of individual differences in sound localization
   behavior. J. Acoust. Soc. Amer., 84, S79.

Wightman, F.L. & Kistler, D.J. (1989a) Headphone simulation of
   free-field listening I: stimulus synthesis. J. Acoust. Soc. Amer.,
   85, 858-867.

Wightman, F.L. & Kistler, D.J.
(1989b) Headphone simulation of
   free-field listening II: psychophysical validation. J. Acoust. Soc.
   Amer., 85, 868-878.

BIOGRAPHY

Elizabeth M. Wenzel received a B.A. in psychology from the University of
Arizona in 1976 and a Ph.D. in cognitive psychology with an emphasis in
psychoacoustics from the University of California, Berkeley, in 1984.
From 1985-1986 she was a National Research Council post-doctoral
research associate at NASA-Ames Research Center, working on the auditory
display of information for rotorcraft cockpits.  Since 1986 she has been
a Research Psychologist in the Aerospace Human Factors Research Division
at NASA-Ames, directing technology development efforts and conducting
supporting research in auditory localization for the three-dimensional
auditory display project.