beth@ptolemy.arc.nasa.gov (Elizabeth) (07/19/90)
> >At the Interactive graphics conference in Snowbird at the end of March,
> >there was a demo of a project at NASA Ames (in conjunction with the head
> >mounted display project as seen in Scientific American) in which synthesized
> >sound was fed to headphones based on relative positions of head and "source"
> >as sensed by Polhemus trackers.
> >
> >In the talk, a comment from the audience mentioned a Canadian project to
> >do the same kind of thing, but through speakers!  Much harder!  There was
> >some skepticism about whether it could work, and others claiming to have
> >experienced it.
> >
> >The contractor here at NASA doing the work is Scott Foster.
> >
> >Sam Uselton    uselton@nas.nasa.gov
> >employed by CSC / working for NASA / speaking for myself
> >
> >----------------------------------------------------------------------------
> >
> >Moderator's Note:
> >
> >The Canadian project Sam refers to is underway within a small company known
> >as Gehring Research of Toronto, Ontario.  The system is known as the Focal
> >Point 3-D Audio System.

And another posting stated:

> I refer everyone to the excellent work being done at NASA/Ames by Elizabeth
> Wenzel and Scott Foster.  I saw their equipment demoed at the 1990 Symposium
> on Interactive 3D Graphics (Snowbird, Utah, March 1990), and the audio
> effects were very well done.
>
> Their apparatus was a Sennheiser HD-540 Pro headphone with a Polhemus 6 DoF
> tracker cemented to it, and another 6 DoF tracker for you to hold.
> A PC and a snazzy custom DSP board took a monaural sound source (in the
> demo, from a CD player) and positioned it to "seem" like it was located
> at the hand-held tracker.  Move either your head or the tracker, and
> a pretty good approximation of the "right" effect happened.
>
> The Head-Related Transfer Function (HRTF) is synthesized from 144 pairs
> of Finite Impulse Responses (FIRs) measured for the sample head.
>
> Superb work.
> Read about it in COMPUTER GRAPHICS, V 24, Number 2, March 1990,
> "A publication of ACM SIGGRAPH, special issue on the 1990 Symposium on
> Interactive 3D Graphics", pp 139-140.
>
>     Best,
>      -Mike Muuss

Just to clear up a couple of things about the above discussion of 3D
sound:

Although Bo's system is certainly quite relevant, the Canadian company
referred to at Snowbird was Q-Sound, who (as I understand it) claims to
have developed true 3D sound using presentation with only two, or even
one, loudspeakers such that there is an infinite "sweet spot"; i.e., you
can move around the room and localized sources stay put.  This is quite a
different problem from headphone presentation and, as Sam Uselton
suggested, much harder.  As far as I know, the Gehring system was
developed for headphones, not speaker presentation.  At least that was my
impression when Bo demonstrated the system to me, especially since one of
the sets of HRTFs he uses was obtained from Fred Wightman and is
basically a shortened/windowed version of one of the HRTF sets we use in
our lab.  (My apologies to Bo if this assessment is incorrect.)

My opinion is that one could get a substantial enhancement of apparent
auditory spaciousness by presenting HRTF-filtered stimuli over
loudspeakers, but that reliable realtime placement and movement of sound
sources independent of listener position is basically impossible.  The
physical acoustics of this situation are tremendously complex: one needs
to take into account the realtime relationships between the positions of
the listener, the desired virtual sources, and the real sources (the
loudspeakers), as well as compensate for the auditory crosstalk between
loudspeakers, which defeats the ability to precisely control the
waveforms entering the two ears of the listener.  This sort of precision
is essential for precise manipulation of location and is why headphones
are the transducer of choice.
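The crosstalk problem described above can be sketched in a few lines.  At a
single frequency, the signals reaching the two ears are a 2x2 complex mix of
the two loudspeaker signals, and a crosstalk canceller pre-filters the
speaker feeds with the inverse of that mixing matrix.  This is only an
illustration of the algebra, not any shipping system's method; the
transfer-function values below are invented, and the function name is my own.

```python
# Sketch of one-frequency-bin crosstalk cancellation. The ear signals
# are e = H @ s, where H holds the speaker-to-ear transfer functions:
#
#   [eL]   [H_LL  H_LR] [sL]
#   [eR] = [H_RL  H_RR] [sR]
#
# Inverting H recovers the speaker signals that deliver the desired
# ear signals exactly -- but only at the listening position where H was
# measured, which is why the "sweet spot" is so small.

def cancel_crosstalk(desired_left, desired_right, H):
    """Solve the 2x2 complex system H @ s = e for speaker signals s."""
    (a, b), (c, d) = H
    det = a * d - b * c          # assumed nonzero (H invertible)
    sL = (d * desired_left - b * desired_right) / det
    sR = (-c * desired_left + a * desired_right) / det
    return sL, sR

# Invented placeholder values: direct paths near 1, crosstalk paths
# smaller and phase-shifted. Not measured data.
H = ((1.0 + 0.0j, 0.3 - 0.2j),
     (0.3 - 0.2j, 1.0 + 0.0j))

# Ask for a signal at the left ear only.
sL, sR = cancel_crosstalk(1.0 + 0.0j, 0.0 + 0.0j, H)

# Pushing the solved speaker signals back through H reproduces the
# desired ear signals.
eL = H[0][0] * sL + H[0][1] * sR
eR = H[1][0] * sL + H[1][1] * sR
```

Of course, a real canceller must do this at every frequency, track the
listener's head, and remeasure or remodel H as the listener moves, which is
exactly the difficulty raised above.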
This doesn't mean that you can't get dramatic spatial EFFECTS over
speakers, but if the goal is presentation of location as a piece of
INFORMATION, as in a spatial display, my feeling is that one can't get
reliable and predictable 3D placement with loudspeakers.  That doesn't
make such techniques useless for some applications, however.  It often
seems that, in the area of 3D sound, there is confusion between the goals
of precise information presentation and the goals of aesthetics and the
creation of special effects.  Both goals are laudable, but they, and the
methods used to achieve them, may not always coincide.

There is another point to be raised when one is listening to and trying
to evaluate 3D sound systems.  Depending upon the nature of the demo
material, one can create a strong impression of precise localization
control which may or may not really be there.  For example, cognitive
cues may strongly bias your perception of location.  You are more likely
to believe an aircraft is above you rather than below you, to hear the
sound of lighting a cigarette in front of you where your mouth is rather
than behind you, or to hear someone clipping your hair above or behind
you rather than in front of your nose.  Such cognitive effects could
actually be quite useful to spatial display designers under certain
circumstances, but they do not reflect the direct manipulation of spatial
auditory cues per se.  A related cognitive effect is visual capture: the
visual location of an object tends to dominate its auditory location,
even when the two are in substantially different positions (as in the
"ventriloquism effect" and sound systems in movie theatres).

I am appending a brief summary of a recent overview talk I gave at the
Santa Barbara conference on telepresence and virtual environments; it has
a fair number of references which people may find useful.
If anyone is interested in finding out more about the realtime 3D sound
hardware developed in our lab (the "Convolvotron"), you can contact Scott
Foster, the designer, directly at:

    Crystal River Engineering
    12350 Wards Ferry Road
    Groveland, CA 95321
    (209) 962-6382

Regards,
Beth Wenzel
NASA-Ames Research Center

----------------------------------------------------------------------------

                      VIRTUAL ACOUSTIC DISPLAYS

        Presented at the Conference on Human-Machine Interfaces
             for Teleoperators and Virtual Environments
                Santa Barbara, CA, March 4-9, 1990

                       Elizabeth M. Wenzel
             Aerospace Human Factors Research Division
                    NASA-Ames Research Center
                        Mail Stop 262-2
                   Moffett Field, CA 94035
                       (415) 604-6290

As with most research in information displays, virtual displays have
generally emphasized visual information.  Many investigators, however,
have pointed out the importance of the auditory system as an alternative
or supplementary information channel (e.g., Deatherage, 1972; Doll et
al., 1986; Patterson, 1982; Gaver, 1986).  A three-dimensional auditory
display can potentially enhance information transfer by combining
directional and iconic information in a quite naturalistic representation
of dynamic objects in the interface.  Borrowing a term from Gaver (1986),
an obvious aspect of "everyday listening" is the fact that we live and
listen in a three-dimensional world.  Indeed, a primary advantage of the
auditory system is that it allows us to monitor and identify sources of
information from all possible locations, not just the direction of gaze.
This feature would be especially useful in an application that is
inherently spatial, such as an air traffic control display for the tower
or cockpit.  A further advantage of the binaural system, often referred
to as the "cocktail party effect" (Cherry, 1953), is that it improves the
intelligibility of sources in noise and assists in the segregation of
multiple sound sources.
This effect could be critical in applications involving encoded nonspeech
messages, as in scientific "visualization", the acoustic representation
of multi-dimensional data (e.g., Bly, 1982), and the development of
alternative interfaces for the visually impaired (Edwards, 1989; Loomis
et al., 1990).

Another aspect of auditory spatial cues is that, in conjunction with
other modalities, they can act as a potentiator of information in the
display.  For example, visual and auditory cues together can reinforce
the information content of the display and provide a greater sense of
presence or realism in a manner not readily achievable by either modality
alone (Colquhoun, 1975; Warren et al., 1981; O'Leary & Rhodes, 1984).
This phenomenon will be particularly useful in telepresence applications,
such as advanced teleconferencing environments, shared electronic
workspaces, and the monitoring of telerobotic activities in remote or
hazardous situations.  Thus, the combination of direct spatial cues with
good principles of iconic design could provide an extremely powerful and
information-rich display which is also quite easy to use.

This type of display could be realized with an array of real sound
sources or loudspeakers for listeners seated in a fixed position (Doll et
al., 1986; Calhoun et al., 1987).  An alternative approach, recently
developed at NASA-Ames, generates externalized, three-dimensional sound
cues over headphones in realtime using digital signal processing (Wenzel
et al., 1988a).  Here, the synthesis technique involves the digital
generation of stimuli using Head-Related Transfer Functions (HRTFs)
measured in the two ear canals of individual subjects (see Wightman &
Kistler, 1989a).  Up to four moving or static sources can be simulated in
a head-stable environment by digital filtering of arbitrary signals with
the appropriate HRTFs.
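The basic synthesis step described above, filtering an arbitrary monaural
signal with the left- and right-ear impulse responses for a given source
direction, can be sketched as follows.  This is a toy illustration only: the
HRIR values are made-up placeholders, not measured data, and the function
names are my own, not those of any actual system.

```python
# Sketch of binaural synthesis for one static source direction: the
# mono input is convolved with the left- and right-ear Head-Related
# Impulse Responses (HRIRs), yielding a two-channel headphone signal.

def fir_filter(signal, impulse_response):
    """Direct-form FIR convolution: y[n] = sum_k h[k] * x[n-k]."""
    out = []
    for n in range(len(signal) + len(impulse_response) - 1):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if 0 <= n - k < len(signal):
                acc += h * signal[n - k]
        out.append(acc)
    return out

def binaural_render(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one source direction."""
    return fir_filter(mono, hrir_left), fir_filter(mono, hrir_right)

# Toy example: a click rendered with invented HRIRs in which the right
# ear receives the sound slightly later and attenuated, roughly as it
# would for a source off to the listener's left.
mono = [1.0, 0.0, 0.0, 0.0]
hrir_left = [0.9, 0.3, 0.1]
hrir_right = [0.0, 0.5, 0.2, 0.05]
left, right = binaural_render(mono, hrir_left, hrir_right)
```

A realtime system works the same way in principle, but with measured
HRIR pairs for many directions (e.g., the 144 pairs mentioned in the
demo description above), interpolating and switching filters as the head
and source move.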
This type of presentation system is desirable because it allows complete
control over the acoustic waveforms delivered to the two ears and the
ability to interact dynamically with the virtual display.  Other similar
approaches include an analog system developed by Loomis et al. (1990) and
digital systems which make use of transforms derived from normative
manikins and simulations of room acoustics (Genuit, 1986; Posselt et
al., 1986; McKinley & Ericson, 1988; Persterer, 1989; Lehnert & Blauert,
1989).

Such an interface also requires the careful psychophysical evaluation of
listeners' ability to accurately localize the virtual or synthetic sound
sources.  For example, a recent study by Wightman & Kistler (1989b)
confirmed the perceptual adequacy of the basic technique for static
sources; source azimuth was synthesized nearly perfectly for all
listeners, while source elevation was somewhat less well-defined in the
headphone conditions.

From an applied standpoint, measurement of each potential listener's
HRTFs may not be possible in practice.  It may also be the case that the
user of such a display will not have the opportunity for extensive
training.  Thus, a critical research issue for virtual acoustic displays
is the degree to which the general population of listeners can obtain
adequate localization cues from stimuli based on non-individualized
transforms.  Preliminary data (Wenzel et al., 1988b) suggest that using
non-listener-specific transforms to achieve synthesis of localized cues
is at least feasible.  For experienced listeners, localization
performance was only slightly degraded compared to a subject's inherent
ability, even for the less robust elevation cues, as long as the
transforms were derived from what one might call a "good" localizer.
Further, the fact that individual differences in performance,
particularly for elevation, could be traced to acoustical idiosyncrasies
in the stimulus suggests that it may eventually be possible to create a
set of "universal transforms" by appropriate averaging (Genuit, 1986) and
data-reduction techniques (e.g., principal components analysis), or
perhaps even by enhancing the spectra of empirically derived transfer
functions (Durlach & Pang, 1986).

Alternatively, even inexperienced listeners may be able to adapt to a
particular set of HRTFs as long as they provide adequate cues for
localization.  A reasonable approach is to use the HRTFs from a subject
whose measurements have been "behaviorally calibrated" and are thus
correlated with known perceptual ability in both free-field and headphone
conditions.  In a recently completed study, sixteen inexperienced
listeners judged the apparent spatial location of sources presented over
loudspeakers in the free-field or over headphones.  The headphone stimuli
were generated digitally using HRTFs measured in the ear canals of a
representative subject (a "good localizer") from Wightman & Kistler
(1989a,b).  For twelve of the subjects, localization performance was
quite good, with judgements for the non-individualized stimuli nearly
identical to those in the free-field.

In general, these data suggest that most listeners can obtain useful
directional information from an auditory display without requiring the
use of individually tailored HRTFs.  However, an important caveat: the
results described above are based on analyses in which errors due to
front/back confusions were resolved.  For free-field versus simulated
free-field stimuli, experienced listeners exhibit front/back confusion
rates of about 5 vs. 10%, and inexperienced listeners show average rates
of about 20 vs.
30%.  Although the reason for such confusions is not completely
understood, they are probably due in large part to the static nature of
the stimulus and the ambiguity resulting from the so-called cone of
confusion (see Blauert, 1983).  Several stimulus characteristics may help
to minimize these errors.  For example, the addition of dynamic cues
correlated with head motion, and well-controlled environmental cues
derived from models of room acoustics, may improve the ability to resolve
these ambiguities.

REFERENCES

Blauert, J. (1983) Spatial Hearing. The MIT Press: Cambridge, MA.

Bly, S. (1982) Sound and computer information presentation. Unpublished
   doctoral thesis (UCRL-53282), Lawrence Livermore National Laboratory
   and University of California, Davis, CA.

Calhoun, G.L., Valencia, G., & Furness, T.A. III (1987) Three-dimensional
   auditory cue simulation for crew station design/evaluation. Proc.
   Hum. Fac. Soc., 31, 1398-1402.

Cherry, E.C. (1953) Some experiments on the recognition of speech with
   one and two ears. J. Acoust. Soc. Am., 25, 975-979.

Colquhoun, W.P. (1975) Evaluation of auditory, visual, and dual-mode
   displays for prolonged sonar monitoring in repeated sessions. Hum.
   Fac., 17, 425-437.

Deatherage, B.H. (1972) Auditory and other sensory forms of information
   presentation. In H.P. Van Cott & R.G. Kincade (Eds.), Human
   Engineering Guide to Equipment Design (rev. ed.), Washington, DC:
   U.S. Government Printing Office, 123-160.

Doll, T.J., Gerth, J.M., Engelman, W.R. & Folds, D.J. (1986) Development
   of simulated directional audio for cockpit applications. USAF Report
   No. AAMRL-TR-86-014.

Durlach, N.I. & Pang, X.D. (1986) Interaural magnification. J. Acoust.
   Soc. Am., 80, 1849-1850.

Edwards, A.D.N. (1989) Soundtrack: An auditory interface for blind
   users. Hum.-Comp. Interact., 4, 45-66.

Gaver, W. (1986) Auditory icons: Using sound in computer interfaces.
   Hum.-Comp. Interact., 2, 167-177.

Genuit, K.
(1986) A description of the human outer ear transfer function by
   elements of communication theory. Proc. 12th ICA (Toronto), Paper
   B6-8.

Lehnert, H. & Blauert, J. (1989) A concept for binaural room simulation.
   ASSP Workshop on Applications of Signal Processing to Audio and
   Acoustics, New Paltz, NY.

Loomis, J.M., Hebert, C., & Cicinelli, J.G. (1990) Active localization
   of virtual sound sources. Submitted to J. Acoust. Soc. Am.

McKinley, R.L. & Ericson, M.A. (1988) Digital synthesis of binaural
   auditory localization azimuth cues using headphones. J. Acoust. Soc.
   Am., 83, S18.

O'Leary, A. & Rhodes, G. (1984) Cross-modal effects on visual and
   auditory object perception. Perc. & Psychophys., 35, 565-569.

Patterson, R.R. (1982) Guidelines for Auditory Warning Systems on Civil
   Aircraft. Civil Aviation Authority Paper No. 82017, London.

Posselt, C., Schroter, J., Opitz, M., Divenyi, P., & Blauert, J. (1986)
   Generation of binaural signals for research and home entertainment.
   Proc. 12th ICA (Toronto), Paper B1-6.

Persterer, A. (1989) A very high performance digital audio signal
   processing system. ASSP Workshop on Applications of Signal Processing
   to Audio and Acoustics, New Paltz, NY.

Warren, D.H., Welch, R.B., & McCarthy, T.J. (1981) The role of
   visual-auditory "compellingness" in the ventriloquism effect:
   Implications for transitivity among the spatial senses. Perc. &
   Psychophys., 30, 557-564.

Wenzel, E.M., Wightman, F.L., & Foster, S.H. (1988a) A virtual display
   system for conveying three-dimensional acoustic information. Proc.
   Hum. Fac. Soc., 32, 86-90.

Wenzel, E.M., Wightman, F.L., Kistler, D.J., & Foster, S.H. (1988b)
   Acoustic origins of individual differences in sound localization
   behavior. J. Acoust. Soc. Amer., 84, S79.

Wightman, F.L. & Kistler, D.J. (1989a) Headphone simulation of
   free-field listening I: stimulus synthesis. J. Acoust. Soc. Amer.,
   85, 858-867.

Wightman, F.L. & Kistler, D.J.
(1989b) Headphone simulation of
   free-field listening II: psychophysical validation. J. Acoust. Soc.
   Amer., 85, 868-878.

BIOGRAPHY

Elizabeth M. Wenzel received a B.A. in psychology from the University of
Arizona in 1976 and a Ph.D. in cognitive psychology with an emphasis in
psychoacoustics from the University of California, Berkeley, in 1984.
From 1985-1986 she was a National Research Council post-doctoral
research associate at NASA-Ames Research Center, working on the auditory
display of information for rotorcraft cockpits.  Since 1986 she has been
a Research Psychologist in the Aerospace Human Factors Research Division
at NASA-Ames, directing technology development efforts and conducting
supporting research in auditory localization for the three-dimensional
auditory display project.