herbt@apollo.sarnoff.com (Herbert H Taylor III) (06/05/91)
The recent series of postings on VR/Video following the exchange between myself and Chris Shaw have been most informative. I would also like to publicly thank Chris for asking some very tough questions. I can certainly take a "whipping" without losing my sense of humour - although I hope we can both avoid the use of the pejorative. It was unfortunate that the HDTV emphasis of previous posts drew attention away from the more important topic of VR architectures. Chris challenged the applicability of those ideas to VR and, while I remain convinced that they will prove important, they seem to represent a small conceptual detour. Likewise, a strong technical challenge was raised to our description of several 3D video based scenarios. Much of the technical criticism which followed was the result of our poor description of our ideas. Real-time 3D imagery alone will not be sufficient to construct and manage entirely "simulated" virtual worlds. We will not be able to look under the kitchen table for the "chewing gum". (Of course, if the CGI modeler doesn't have a "chewing gum" model, "it", to borrow a favorite colloquial expression of Chris', "ain't there either.") Whether these limitations preclude the use of 3D imagery as a useful interface component of VR remains a topic of further research...

In our original speculative post of ca. 3/10/91 (responding to our moderator's request for summaries of current research), we mused about our desire to explore what VR will be like in ten years. We admitted that the system we were using (aka the Princeton Engine) was not a general solution to VR processing. I hope, however, that we use supercomputers as vehicles to explore otherwise impossible ideas. Machines such as the UNC PxPL5, the CM2 or the Princeton Engine are not "practical" single user architectures - at least not yet - but surely in not too many years we will have desktop massive parallelism with the potential for real-time interactive VR applications.

This leads naturally to motivating questions: will the "ultimate" VR system of the year 2000 be more SGI-MIMD like - with a large number of powerful processors in the rendering pipeline - or will it be a hybrid of SIMD and MIMD, as in the PxPL5? Perhaps VR specific architectures will emerge. What impact does our choice of VR world model have on architecture? Is the evolution of VR only going to be in the direction of increasingly realistic CGI rendering? Are worlds derived from sampled data going to become more viable? What are the applications which motivate future VR architectures?

Certainly, the potential for applications in medicine, architecture, simulated experience, "gaming" and product design will continue to provide motivation for developing systems with improved visual realism and more natural interactions. Likewise, scientific data visualization offers fertile ground for VR research and future application - where we have this notion of interacting with and literally "experiencing" our data. There has been significant independent progress in recent years in each of the fields of interactive data visualization, VR and, specifically, Volume Visualization (VV); however, it is in the convergence of all three technologies into a single computing and interaction framework that the true enabling leap in functionality will occur. Scientists will be able to simulate and visualize complex phenomena and, in some sense, actually "participate" in their experiments.
This kind of interaction will revolutionize scientific research in much the same way that the computer itself has. There are probably those who will question our enthusiasm and observe that our scientific forebears drew marvelous insight from very simple physical models of complex structure without the benefit of computers or graphics. It is said that the structure of benzene came to Kekule in a dream. Certainly, Watson and Crick were able to visualize amazing structure without the benefit of complex "tools"... which is exactly the point. When the visualization systems of the future are as easy to use as a box of snap-together molecular models, as interactive as the microscope or as "free associative" as a dream - only then will they realize their full potential. These advances, however, will come at great computational cost.

Where are the computational boundaries for VR? To address these issues we must first establish complexity bounds for VR in terms of computation (rendering, dynamics, constraints, etc.) and I/O. The processing requirements of VR have been studied in terms of system dynamics and constraint satisfaction by [Pentland90], giving O(n**3) "calculations" per vertex for the dynamical system. For 1000 objects with 1000 vertices each, 100 TFLOP performance is required to achieve interactivity (assuming 100 floating point operations per system "calculation"). That astounding number is still two orders of magnitude removed from the NEXT generation of supercomputers. The authors propose a reduced complexity model - still with computational complexity in the 10 GFLOP range to satisfy system constraints and 100 MFLOP for the dynamics. A system which implements this approach is described in [PentE90]. [Witkin90] also discusses the constrained dynamical system in some detail. Polygon rendering has been discussed in [Ake88], with floating point requirements in the range of 40 MFLOP for 100,000 polygons. Further research needs to be done to reduce world complexity and to make higher resolution worlds with complex objects more tractable on near term computers.

Several posters to this group have suggested that VR input processing requirements are quite modest - at least in terms of the data glove. We might ask how the complexity scales as more and higher "resolution" input devices are introduced. Devices such as the eye tracker [Levoy90] and 3D head tracker [Wang90] would seem to add significant complexity to VR processing requirements even with dedicated interface hardware. Comparatively less research has been done on the physical "output" side of VR. [Minsky90] describes a system using the sense of touch. What other input or output devices can we look forward to and what are the performance specs?

Clearly, the exponential growth of the computational and I/O requirements of VR will motivate both algorithmic and architectural solutions. A recent estimate by the US Government projects "teraflop" commercial supercomputers before the year 2000 [USG91], with the first demonstration systems emerging from the DARPA High Performance Computing Systems (HPCS) initiative in 1992-3. Whether the spectacular performance of these machines can be fully harnessed for VR remains to be seen. Perhaps what is really needed is a combination of VR specific architectures with VR specific algorithms - architectures which are on the HPCS technology learning curve (i.e. that employ MCM packaging, superdense ULSI, optical interconnect, etc.) COMBINED with algorithms which can replace O(n**3) with, say, O(n log n).
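To make the arithmetic behind the quoted [Pentland90] figure concrete, here is a back-of-the-envelope sketch. The way the O(n**3) term is applied below (over the 1000 vertices of each object, once per world update, at 100 FLOPs per "calculation") is simply my reading of the numbers above, not a statement of the paper's actual model:

  /* Back-of-the-envelope FLOP budget reproducing the 100 TFLOP figure.
   * ASSUMPTIONS (mine, not from [Pentland90]): the O(n**3) term is taken
   * over the n vertices of each object and evaluated once per update,
   * with 100 FLOPs per system "calculation".
   */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double n_objects      = 1000.0;   /* objects in the world            */
      double n_vertices     = 1000.0;   /* vertices per object             */
      double flops_per_calc = 100.0;    /* FLOPs per system "calculation"  */

      /* O(n**3) "calculations" for each object's dynamical system */
      double calcs_per_update = n_objects * pow(n_vertices, 3.0);
      double flops_per_update = calcs_per_update * flops_per_calc;

      printf("calculations per update: %.3g\n", calcs_per_update);
      printf("FLOPs per update:        %.3g (= %.0f TFLOP)\n",
             flops_per_update, flops_per_update / 1e12);
      return 0;
  }

Even granting only one world update per second, that is 1.0E14 floating point operations - the 100 TFLOP figure.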
A number of research groups have proposed (and in some cases actually built) "application specific" or "algorithm specific" visualization computers, including the well known UNC PxPL5 [Fuchs82], the SUNY CUBE [Kaufman88] and the Stanford SLAM [Dem86] systems. (We are not sure if a full version of the latter machine was ever built.) In general, these researchers were motivated by the desire to explore "future" visualization algorithms. Our original motivation in developing the Princeton Engine was the desire to explore "future" video systems. However, we have found that its applicability is in no way "limited" to video and it can serve as a useful architecture to study visualization algorithms and future visualization systems. We certainly do believe we can accomplish much of what we described in our previous posts. After all, the system can turn over 30 (16 bit) GIGAOPS, or over 1 GFLOP (for you scientific types). More important, ALL system I/O is continuous and transparent to the CPU. A 48 bit x 28 MHz (= 1.4 Gbps) digital input bus and a 64 bit x 28 MHz (= 1.8 Gbps) digital output bus can drive any combination of analog or digital I/O devices. Transparent, continuous gigabit I/O should be important to the NEXT generation of VR peripherals: Data Gloves, eye trackers, you invent it. Finally, while there is no special system requirement that either the I/O or the application be "video", a typical application will often COMBINE scientific computing AND real-time data visualization.

Real-time Interactive Volume Visualization
------------------------------------------

To calibrate this machine for graphics applications, we are in the process of implementing a real-time volume rendering system that can arbitrarily rotate and render 256x256x256 volumes at 30fps. We believe we can volume render 1Kx1Kx1K at about 8fps using single axis rotation. This is possible because the Princeton Engine can perform a "continuous" real-time transpose (512x512x32 bits at 30fps) for very little CPU cost. The programmer effectively has an array and its transpose as working data structures. At any line "time" each processor has a row and a column of the current frame in hand. Therefore, "scanline" algorithms are relatively straightforward...
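For readers who have not programmed in this style, the row/column idea can be sketched on a conventional machine as a separable two-pass operation; the explicit transpose below is exactly the step the Engine provides continuously and essentially for free. The filter, image size and layout are illustrative choices of mine, not Engine code:

  /* Separable smoothing as a row pass followed by a column pass.
   * On a conventional machine the column pass needs an explicit
   * transpose (or strided access); on the Princeton Engine the
   * transpose is continuous, so both passes are plain scanline loops.
   * Caller supplies 'in' and 'out' as W*H floats.
   */
  #include <stdlib.h>

  #define W 512
  #define H 512

  /* 3-tap box filter along each row of an h x w image */
  static void filter_rows(const float *src, float *dst, int h, int w)
  {
      for (int y = 0; y < h; y++) {
          for (int x = 0; x < w; x++) {
              int xm = x > 0     ? x - 1 : x;
              int xp = x < w - 1 ? x + 1 : x;
              dst[y * w + x] = (src[y * w + xm] +
                                src[y * w + x ] +
                                src[y * w + xp]) / 3.0f;
          }
      }
  }

  /* Explicit transpose -- the step the Engine provides for free */
  static void transpose(const float *src, float *dst, int h, int w)
  {
      for (int y = 0; y < h; y++)
          for (int x = 0; x < w; x++)
              dst[x * h + y] = src[y * w + x];
  }

  void separable_smooth(const float *in, float *out)
  {
      float *tmp  = malloc(sizeof(float) * W * H);
      float *tmp2 = malloc(sizeof(float) * W * H);

      filter_rows(in, tmp, H, W);     /* horizontal pass             */
      transpose(tmp, tmp2, H, W);     /* rows become columns         */
      filter_rows(tmp2, tmp, W, H);   /* vertical pass, done as rows */
      transpose(tmp, out, W, H);      /* back to original layout     */

      free(tmp);
      free(tmp2);
  }

On the Engine both passes look like the same simple scanline loop, which is what makes single axis rotations and other scanline algorithms cheap.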
A number of recent papers suggest that Volume Visualization (VV) and Virtual Reality are closely related, convergent applications. The "edvol" system combines a VPL Data Glove and 3SPACE Polhemus Tracker to provide direct interaction with volumetric data [Kaufman90]. The authors do not characterize the size and complexity of either the VR or the VV but do describe it as "small scale". It often seems as though the electronic "media" believe that VV is "already" a standard component of VR. This obvious misconception has occurred because volume visualization demos are usually presented as real-time interactive simulations, when the visualization actually took CPU hours to orchestrate. But the "dream" clearly is real-time interactive volume rendering and visualization. One of the best demonstrations I have seen of the potential for a VV fly-through was presented by Marc Levoy (then of UNC) using volume rendered CT. In [Levoy89] the intended use of a head mounted display interface to the PxPL5 is described, while in [Levoy90] the use of eye tracking hardware is described - in both cases specifically for VV. UNC's Steve Pizer showed a video of 8fps single axis rotation volume rendering on PxPL5 at the San Diego Workshop on VV. A head display system has also been developed at UNC to assist radiologists in treatment planning. Although these examples may not in all cases qualify as pure "VR", they certainly speak to the potential for a real-time interactive VR interface to a volume visualization environment.

Volumetric data sets come in two basic forms: "real" sampled data (as in CT, MRI, ultrasound, optical or X-ray microscopy, etc.) and computed or "synthetic" data (weather models, CFD, etc.). In the latter case the volume is usually "simulated", while in the former the "raw" data is sampled and sometimes preprocessed before the volume is rendered. For example, before an MRI image is produced the "raw" sampled data must be Fourier transformed. With either approach the resulting data set is a 3D spatial volume. In the case of synthetic simulated data there is also a "timestep" - the fluid flows, the turbine spins, etc. With sampled data there is often no clear notion of time; the data is entirely static. However, the interaction with the data can be dynamic and even involve the "introduction" of time. A traveler passing through a sampled and rendered volume certainly experiences the passage of time; the "world" itself, however, remains static. By analogy one can imagine walking through a museum (static sampled "objects") versus walking along the bank of a river (dynamic simulated "objects"). In our conceptual museum, as we begin to interact with objects we can simplify the system constraint dynamics as much as desired, literally "determining" the laws of physics. That Ming Dynasty vase I knocked over? It never touched the ground. If we are "in" an MRI or CT museum we might wish to change opacity, point of view or other parameters which affect our visual perception of the phenomena we are studying. Of course, the same control of time is possible in the "synthetic" case, however only at the risk of undermining the scientific interpretation of the simulation, i.e. correctly visualizing and understanding the physics is often fundamental to the experiment.

With the emergence of increasingly real-time instrumentation there is a second form of sampled data to consider: 3D spatial volumes with time varying data. Imagine a sampled volume of a living organism or dynamic micro structure which is updated 30 times a second. If we were "inside" this museum while it was "open" we could watch cells as they proceed through nuclear envelope breakdown, divide and emerge as two identical cells. (We are working with this kind of data now.) The degree to which this form of interaction is a Virtual Reality results not from our ability to "alter the experiment" on the fly, but from our ability to control the dynamics of how we "view" the experiment while it is taking place. We may ultimately be able to turn up the heat or add some catalyst to a chemical reaction from within the Virtual experience, but that ability neither defines VR nor does the absence of that ability preclude VR. IMO, it is the VR observer's perceived sense of simulated presence combined with the ability to control the visual experience which principally defines the interaction as VR. Two related projects provide sources of volumetric data which we are using at Sarnoff and which we feel have VR "prospects": an experimental ultrasound instrument and a differential interference contrast (DIC) microscope which produces a sequence of image slices through a cell embryo.
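As a highly simplified sketch of what "rendering" such a sampled volume involves, the fragment below stacks a sequence of 2D slices (DIC optical sections, say) into a volume and projects it along one axis with front-to-back opacity compositing. The sizes, the toy opacity transfer function and the fixed view axis are illustrative assumptions of mine - this is not the Engine's scanline-parallel implementation described above:

  /* Stack 2D slices into a volume and render an axis-aligned view by
   * front-to-back opacity compositing.  Dimensions, the linear opacity
   * ramp and the single fixed view axis are illustrative only.
   */
  #include <stdint.h>
  #include <stddef.h>

  #define NX 256
  #define NY 256
  #define NZ 256

  /* copy slice z (NX*NY samples) into its place in the volume */
  void add_slice(uint8_t *volume, const uint8_t *slice, int z)
  {
      for (int y = 0; y < NY; y++)
          for (int x = 0; x < NX; x++)
              volume[((size_t)z * NY + y) * NX + x] = slice[y * NX + x];
  }

  /* project the volume along +z into an NX x NY image */
  void render_front_to_back(const uint8_t *volume, float *image)
  {
      for (int y = 0; y < NY; y++) {
          for (int x = 0; x < NX; x++) {
              float color = 0.0f;           /* accumulated intensity  */
              float trans = 1.0f;           /* remaining transparency */
              for (int z = 0; z < NZ && trans > 0.01f; z++) {
                  float v     = volume[((size_t)z * NY + y) * NX + x] / 255.0f;
                  float alpha = 0.05f * v;  /* toy opacity transfer function */
                  color += trans * alpha * v;
                  trans *= (1.0f - alpha);
              }
              image[y * NX + x] = color;
          }
      }
  }

Changing the opacity transfer function or the view direction is exactly the kind of "museum" interaction we have in mind.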
The DIC volume can be acquired at or near real-time with the latest instrumentation - hence, a "video volume" (sorry Chris, it really can be...). In the case of the ultrasound instrument the Princeton Engine will also perform the front-end signal processing required to produce a volume from the sampled data. Presently, a "raw" 3D ultrasound data set cannot be acquired in real-time; however, the signal processing to produce a data volume from an unprocessed data set can potentially be accomplished in real-time, as can the volume rendering process. The KEY POINT is that we are going to see more and more real-time instrumentation which can produce true sampled data volumes. As the acquisition times of systems such as MRI, ultrasound, electron microscopy and DIC decrease, we will also see greater coupling between the front-end signal processing, the data visualization and perhaps even the user interaction. At a recent workshop at Princeton University, "Seeing Into Materials: Imaging Complex Structures", both optical and EM microscopy systems capable of real-time 3D acquisition were described... Actually, "4D" is used to refer to a 3D spatial volume "moving" in time. BTW, these scientists have a strong intuitive feel for the potential of VR - at least what they call "VR". That is, the ability to interact with an experiment - either in situ OR as part of post analysis - as in our museum examples.

Is it fair to ask where in the taxonomy of VR systems we should place these kinds of applications? True, the worlds are derived from various real world spectra, but the interactions are entirely SIMULATED, one can change viewing parameters, etc. The exact meaning of virtually touching "objects" or surfaces in such worlds remains unclear... but really no more so than in a CGI simulated world where everything is built from models. In either case the eventual consequence of our fundamental interactions within a world must be determined by a "law giver". If I put my data gloved hand directly into the burner of a VR kitchen stove, what happens?

It should be noted that there are several potential problems with these methods of data visualization. First, a number of people report varying degrees of motion sickness when observing through a head mounted display. That may be acceptable if I am performing a "hammerhead" maneuver in my Super Decathlon simulator, but probably is not acceptable if I am inside someone's brain. (Informal poll: how many of you have experienced this effect? RSVP and I will tabulate.) A second potential problem results from "persistence" effects on the human visual system. We vividly recall in the early days of VLSI CAD workstations the problem IC draftspersons experienced after long hours staring at color stripes and squares. [Frome83] describes the so-called "McCollough effect", wherein, after looking at color stripes for only a short time, high contrast B&W stripes suddenly appear to have color where none is present. To dramatize this effect during her talk at DAC83, Francine Frome periodically displayed slides with green and red stripes. About halfway through the talk she displayed a slide with a striped "BTL" in bright green foreground offset from a bright red background - before informing the audience that the slide was totally black and white! It was quite remarkable. These and other "human factor" issues will need to be fully understood before head mounted displays achieve broad use and certainly before we let Nintendo sell one to every ten year old...
More VR/Video
-------------

In our original post we also speculated that multiple cameras might be used to develop a "video" data glove or "whole body" interface to a virtual world. In particular, we asked how such an interface would impact the future design of the data glove. We received a number of thoughtful comments following this post. It is important to note that the VR world itself COULD STILL be CGI - with video only providing a framework for interaction. With support for up to six cameras one could surround participants either individually or collectively with video. (Remember, we pay no CPU cost to load frames into memory from each camera; however, we do pay once we start to do something with the data.) Participants might wear "chroma keyed" gloves (wireless gloves of a reserved key color) or even body suits. Chroma keying is a well known technique for creating simple special effects such as the "weatherman" overlay [Ennes77] [Watk90]. We would NOT use this merely for special effects, however, but to provide a means of isolating the hands so we can build a useful model. On the Engine the amount of processing for each chroma key is only about 5-10% of the "real-time budget" at 30fps. A second chroma key is used for the background. This is similar to Myron Krueger's Videoplace, which uses white backing screens. It differs in that Videoplace produces only silhouettes of "Artificial Reality" participants as a group and provides a limited framework for identifying individual participants. (Don't get me wrong - Videoplace is still a lot of fun! I recently spent a day at the Franklin Institute watching kids play in it and was impressed by the overall effect produced.)
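The keying step itself is simple enough to sketch. The fragment below marks every pixel whose color lies within a tolerance of the reserved key color, yielding a binary hand mask for later modeling and gesture stages. The color space, threshold test and frame layout are my own illustrative choices (a broadcast keyer would normally work on color difference signals rather than raw RGB), not a description of the Engine code:

  /* Per-pixel chroma key: mark pixels close to a reserved key colour,
   * producing a binary mask of the gloved hands.  Tolerance and RGB
   * distance test are illustrative assumptions.
   */
  #include <stdint.h>

  typedef struct { uint8_t r, g, b; } Pixel;

  /* mask[i] = 1 where frame[i] is within 'tol' of the key colour */
  void chroma_key_mask(const Pixel *frame, uint8_t *mask, int n_pixels,
                       Pixel key, int tol)
  {
      for (int i = 0; i < n_pixels; i++) {
          int dr = (int)frame[i].r - (int)key.r;
          int dg = (int)frame[i].g - (int)key.g;
          int db = (int)frame[i].b - (int)key.b;
          /* squared colour distance against squared tolerance */
          mask[i] = (dr * dr + dg * dg + db * db <= tol * tol) ? 1 : 0;
      }
  }

A subsampled version of such a mask is also the sort of input we have in mind for the gesture recognition experiments described below.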
The next major technical step is to be able to exploit this interface in a useful way. In particular, we want to study the effect of multiple "individual" participants. If two video channels are paired to each participant with a distinct chroma key, can we construct a useful model and use that to interact with and control the dynamics of the visualization process? Our present plan will focus on the real-time recognition of simple hand gestures from each "pair of hands". The idea of using sign language, as some posters have suggested, is very interesting - particularly coupled with a neural net based recognizer. Recognizing full ASL in "continuous" real-time by any approach is probably ambitious; however, a useful subset might be possible. We have used a neural net approach to detect and remove characteristic AM impulse noise (aka "hair dryer noise") in a TV receiver [Pearson90]. One network is trained to detect AM impulses on an image line and a second network is trained to look at the entire image and determine which of the detected pulses are really "false" positives. This program runs in continuous real-time on the Princeton Engine. We also demonstrated real-time back-propagation (BEP) training on a simple three layer MLP (a total of 86 weights and thresholds). For hand signs, however, a new network topology would be required - with the input to the network derived from the subsampled chroma key image segment of the original image. However, if my understanding of "conversational" ASL is correct - and each hand sign is typically an entire word or concept - then the resulting training set still might be huge. Also, I believe that hand motion itself plays a significant role in the interpretation of signs - not just in the transition from one sign to another - as in cursive writing. This implies that a robust sign recognition system would need to compute a motion vector and use that as part of the training set. We would appreciate references to current work... particularly how one detects individual signs in "continuous" conversation, i.e. when does one sign end and the next begin?

Lastly, a second video experiment would involve the use of the chroma key to present to each of three remote participants a composite image of their two neighbors, to form a virtual conference. While this interaction is entirely real-time, we recognize that there will be a significant limit to the quality of interaction between subjects. We are interested in the degree of "total" immersion each person experiences. If we also mix the audio, does the participant "feel" like he or she is having a conversation with three people? Unfortunately, I would imagine that the head mounted displays would tend to undermine intimacy - "perhaps" we could image warp new faces on everybody - just kidding Chris - although now that I think about it...

References
----------

[Pentland90] "Computational Complexity Versus Virtual Worlds", A. Pentland. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH. (Based on the quality of papers in the proceedings, this must have been a great conference!)

[Witkin90] "Interactive Dynamics", A. Witkin, M. Gleicher, W. Welch. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[PentE90] "The ThingWorld Modeling System: Virtual Sculpting by Modal Forces", A. Pentland, I. Essa, M. Friedmann, B. Horowitz. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Ake88] "High Performance Polygon Rendering", K. Akeley, T. Jermoluk. Computer Graphics Vol. 22, No. 4, August 1988, ACM.

[Levoy90] "Gaze Directed Volume Visualization", M. Levoy, W. Whitaker. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Wang90] "A Real-time Optical 3D Tracker for Head Mounted Display Systems", J. Wang, V. Chi, H. Fuchs. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Minsky90] "Feeling and Seeing: Issues in Force Display", M. Minsky, O. Ming, O. Steele, F. Brooks, Jr. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[USG91] "Grand Challenges: High Performance Computing and Communications". A report by the Committee on Physical, Mathematical and Engineering Sciences; Federal Coordinating Council for Science, Engineering, and Technology; Office of Science and Technology Policy.

[Fuchs82] "Developing Pixel-Planes, A Smart Memory Based Raster Graphics System", H. Fuchs, J. Poulton, A. Paeth, A. Bell. 1982 Conference on Advanced Research in VLSI.

[Kaufman88] "Memory and Processing Architecture for 3D Voxel-Based Imagery", A. Kaufman, R. Bakalash. IEEE Computer Graphics and Applications, Vol. 8, No. 11, November 1988, pp. 10-23; reprinted in "Volume Visualization", edited by A. Kaufman, IEEE Computer Society, 1991.

[Dem86] "Scan Line Access Memories for High Speed Image Rasterization", S.G. Demetrescu. Ph.D. Dissertation, Stanford University, June 1986.

[Kaufman90] "Direct Interaction with a 3D Volumetric Environment", A. Kaufman, R. Yagel, R. Bakalash. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH. (We also highly recommend "Volume Visualization", edited by A. Kaufman, IEEE Computer Society, 1991, which contains a large survey of relevant publications.)
[Levoy89] "Design for a Real-Time High Quality Volume Rendering Workstation", M. Levoy. Chapel Hill Workshop on Volume Visualization, 1989, Department of Computer Science, University of North Carolina, C. Upson, Editor.

[Frome83] "Incorporating the Human Factor in Color CAD Systems", F.S. Frome. 20th Design Automation Conference, June 1983, IEEE Computer Society.

[Ennes77] "Television Broadcasting: Equipment, Systems and Operating Fundamentals", H. Ennes. Howard W. Sams, 1979, pp. 319-323.

[Watk90] "The Art of Digital Video", J. Watkinson. Focal Press, 1990, pp. 75-77.

[Pearson90] "Artificial Neural Networks as TV Signal Processors", C.D. Spence, J.C. Pearson, R. Sverdlove. SPIE Proceedings Vol. 1469: Applications of Artificial Neural Networks, 1991.