[sci.virtual-worlds] More on VR Architectures LONG

hht@sarnoff.com (Herbert H. Taylor x2733) (04/05/91)

  Chris Shaw has challenged a number of our assertions about VR
processing requirements while maintaining strong opinions about what VR
is or is not, particularly in regard to the use of exclusively CGI-based
worlds. With our moderator's indulgence we would like to respond to the
technical content of those arguments as they pertain to VR.

  Ivan Sutherland's extraordinary 1965 vision of the ultimate
computer display and interaction (as reported by Fred Brooks) remains
the most succinct summary of what today we call VR:

1. Display as a window into a virtual world.
2. Improve image generation until the picture in the window looks real.
3. Computer maintains world model in real time.
4. User directly manipulates virtual objects.
5. Manipulated objects move realistically.
6. Immersion in virtual world via head-mounted display.
7. Virtual World also sounds real, feels real.

  With the possible exception of the head-mounted display I would
expect that these remain the essential elements of VR. It is not so
much the physical apparatus ("head-mounted") as the effect of total
"immersion" which is critical to VR. If the same "feel" can be
achieved by other means then I believe we still have VR. Today,
however, the head-mounted display IS clearly the best way to achieve
that effect.

  The ultimate goal of this research is realism (even "real" realism)
within the Virtual world. That is not to say that virtual world
participants must appear as they "really" are in the VR.  For some of
us that could be a bit of a bummer. However, people and objects must
be recognizable in some context. Each time I return to the VR my
"neighbors" should be able to say - "Hey, there's Herb."

  It was my contention that there is nothing in VR which requires that
worlds be polygonally based. That is a choice made specifically to
meet the above criteria - not fundamental to the experience. I further
concluded that other approaches to the world processing function would
require fewer floating point operations. It was not my intention to
preclude polygon rendering, nor floating point - which is critical to
scientific computation, as Alan has pointed out. (In fact, we can turn
a GFLOP with 2048 16-bit DSP processors.) However, I certainly hope
researchers are looking, and will continue to look, for alternative
methods of building and managing virtual worlds beyond strictly
graphics-based ones - that was the real point. This is clearly an
important research topic.
I don't know about Chris Shaw, but when I look in my headset I would
much rather see a real-looking (if "imaginary") person on the other
side of the virtual world than a SEGMENTROID (TM) - Gouraud shaded or
not <8-)
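
As a purely illustrative aside on what "fewer floating point operations"
can look like in practice, here is a tiny sketch of 16-bit-style fixed-point
arithmetic of the kind a DSP array handles natively. The Q4.12 format and
the sample values are assumptions chosen for illustration, not a description
of the Princeton Engine's actual arithmetic.

    # Fixed-point multiply: integer multiply plus a shift, no FPU required.
    # Q4.12 format (4 integer bits, 12 fraction bits) is an assumed choice.

    FRAC_BITS = 12

    def to_fixed(x):
        return int(round(x * (1 << FRAC_BITS)))

    def fixed_mul(a, b):
        return (a * b) >> FRAC_BITS

    a = to_fixed(0.7071)   # e.g. a rotation coefficient
    b = to_fixed(2.5)      # e.g. a coordinate
    print(fixed_mul(a, b) / float(1 << FRAC_BITS))   # ~1.767, all in integers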

In my original post I had said: 
** >Other approaches forego polygon rendering entirely and hence
** >possess little or no floating point.

And Chris Shaw asked:
** Other approaches such as what, I wonder?

   In a workshop presentation at SIGGRAPH 89 Scott Fisher outlined
future research issues of VR. These included the objective of
"combining real and virtual environments". In a paper included in the
course notes of that workshop Fisher states that the Ames Virtual
Environment Workstation "system provides a multisensory, interactive
display environment in which a user can virtually explore a 360-degree
synthesized or REMOTELY SENSED environment and can virtually interact
with its components." [Fisher] I presume, therefore, that telepresence,
telerobotics or teleoperation are considered worthy subfields of
Virtual Reality?  Certainly much of that work is non-polygonal. Again,
Fisher: "The virtual environment display system is currently used to
interact with a simulated telerobotic environment.  The system
operator can call up multiple images of the remote task environment
that represent viewpoints from free-flying or telerobot-mounted CAMERA
platforms... Switching to telepresence control mode, the operator's
wide-angle, stereoscopic display is directly linked to the telerobot
3D camera system..."

  Similarly, if a system can process multiple simultaneous video
streams and construct a real-time video 3D world with which the user
can directly interact, does that not qualify as VR? Ultimately we would
like to walk through all sorts of complex 3D data sets, perhaps even
volume rendered or terrain rendered ones (voxels, not polygons). A
number of near real-time demonstrations of such walk throughs have
been made - and systems such as PxPl5 and the Princeton Engine will be
likely platforms for further performance gains in this area.

  Other examples of non-polygonal approaches would include the famous
video based Aspen driving simulation. Does this qualify as a VR? If a
system can provide a completely user directed experience of driving in
a "virtual" city (in this case modeled on a real one) is that not a
virtual reality?  Chris Shaw does seem to consider E&S simulators to
be VR - is that only because they use graphics?  Under the same
criteria, are not interactive video systems such as the DVI-based
Palenque walk-through also forms of VR?  This system uses data
compressed video scenes on CD which can be accessed and decompressed
in real-time. If the user changes his or her point of view (albeit
using a pointing device) the world processor locates the correct
orientation on disk and makes it appear as if the user turned.  The
interaction meets Sutherland's criteria #1, 2, 3, and 7. The choice of
pointing device was arbitrary - it would not be difficult to configure
the system with a head mounted display (criteria #6) and to track body
motion to provide a true sense of walking through the ancient ruins.
Perhaps the most difficult of Sutherland's criteria involve the
requirement for manipulated objects to move realistically under the
control of the user (criteria #4 and 5). These constraints are more
difficult because they are in addition to the inherent real-time
constraints of the world processor. It is my impression that in
present VR systems the latency between object movement and visual
feedback is the single most significant limitation to interaction and
"feel". Presumedly, in flight or driving simulators the "user
manipulated objects" are the vehicals themselves and their controls.

  In outlining what I viewed as the fundamental limitations of a
single processor model for a virtual world processor (using the Cray
as an example) I began by characterizing the number of elements in
future VR displays - under the assumption that this represented an
upper limit to the number of world elements which must be processed. I
thought it was obvious that I was using a head-mounted display to
establish a near-term system of 1Kx1Kx30 frames per second. (Obviously,
workstation monitors have had much higher resolution for several
several years.) The resolution of the head mounted display (in
addition to frame rate and latency) has been identified as A MAJOR
LIMIT TO VR.  Presumably, VPL and others are very hot after the very
latest, highest resolution head mounted displays - which I merely
pointed out will be here soon.  LCD technology developed for rear
projection HDTV will approach 800 lines per inch. It appears that this
can be readily adapted to head mounted displays. Other technologies
loom in various labs around the world. However, THE POINT I was making
was that resolution was not the entire problem - as each head mounted
display ultimately must have a continuously processed VIDEO source. If
all you want is to redirect your SGI polygon world output to the head
display, fine, no problem. Be happy. However, I hope in the research
community we want much more. In particular I hope we want HDTV.
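
To make the point concrete, here is the raw pixel throughput a 1Kx1Kx30
head-mounted display source implies, assuming 24 bits per pixel and ignoring
blanking (both assumptions of this sketch, not figures from any particular
display):

    width, height, fps = 1024, 1024, 30
    bits_per_pixel     = 24                     # assumed color depth

    pixels_per_sec = width * height * fps
    mbits_per_sec  = pixels_per_sec * bits_per_pixel / 1e6

    print("pixels/sec: %d" % pixels_per_sec)    # ~31 million
    print("Mbit/sec  : %.0f" % mbits_per_sec)   # ~755 Mbit/s, per eye
    # Two eyes double it, and HDTV-class resolution pushes it higher still;
    # whatever generates the imagery must feed this continuously.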

I also argued that networked VR motivated "MPEG-like" data compression
algorithms which in turn may rob cycles from our world processor.
Chris Shaw was incredulous about that:

** What do you need MPEG for???

   I can only say that we have been approached by a number of groups
about the feasibility of networked VR (and other applications) with a
heavy emphasis on data compression. Perhaps I'm thinking ahead with
MPEG but there are and will be requirements for "on the fly" data
compression, to be sure. Certainly interactive video applications such
as those supported by DVI technology depend on data compression.
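
A rough sense of why compression keeps coming up for networked VR, using the
raw rate from the display example above and an assumed (not measured)
compression ratio and link speed:

    raw_mbps  = 755.0        # uncompressed 1Kx1Kx30, 24-bit stream (from above)
    ratio     = 50.0         # assumed "MPEG-like" compression ratio
    link_mbps = 45.0         # e.g. a T3-class wide-area link

    compressed = raw_mbps / ratio
    print("compressed stream: %.1f Mbit/s" % compressed)        # ~15 Mbit/s
    print("fits the link    : %s" % (compressed < link_mbps))   # True
    # The catch is that the encode/decode work competes for the same cycles
    # as the world processor itself.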

 Also, in my original posting I had speculated on the requirement for
high frame rates in VR. The example I chose was motivated by a NASA
workshop on High Frame Rate High Resolution video - where scientists
were struggling with the problem of monitoring and controlling space
based experiments. These experiments must be controlled completely
from the ground station - as the astronaut crew will not be able to
interact with the testbed. Combined requirements for frame rates up to
1000 fps and resolutions up to 1Kx1K were stated - way beyond the
combined state of the art. The conclusion of the workshop was that
only by trading off spatial resolution could very high frame rates be
possible. Now, while the ground-space interaction is not a candidate
for VR because of the latency through the downlink, the captured data
might very well be incorporated into a VR - in a "post operative"
review of the experiment.
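
The workshop's conclusion can be stated in one line: with a fixed pixel
budget (sensor readout or link bandwidth), frame rate and spatial resolution
trade off directly. The budget below is an assumed figure, chosen to match
the earlier display example:

    pixel_budget = 1024 * 1024 * 30              # ~31 Mpixels/sec, assumed

    for fps in (30, 100, 250, 1000):
        per_frame = pixel_budget // fps
        side = int(per_frame ** 0.5)             # square frames, for illustration
        print("%5d fps -> roughly %4d x %4d pixels/frame" % (fps, side, side))
    # 30 fps supports ~1K x 1K; 1000 fps under the same budget drops to
    # roughly 177 x 177 - hence "trading off spatial resolution".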

I had said regarding the need for higher frame rates:
** >For example, to project images from a remotely sensed combustion experiment
** >will require hundreds of frames per second of acquisition - but in a burst
** >of only a few seconds duration. In order to walk through the data set
** >we must have considerable flexibility in the play back frame rate.

To which Chris Shaw replied:
** This is very lovely, and presumably [... my favorite machine ] does a fine 
** job.
** BUT. This isn't Virtual Reality (TM). This is [ ... something else ] 

   Again, I was attempting to simply illustrate where high frame rate
might be useful. Unfortunately, the example did not stimulate a
creative nerve for Chris. However, one could easily imagine a system
with a "continuous" high frame rate source but at reduced resolution.
The point is that if you want to integrate real imagery from a variety
of sensors into a VR, you must have frame rate conversion algorithms.
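
As a minimal sketch of what such a conversion involves, here is nearest-frame
resampling of a high-rate burst for playback at a viewer-chosen rate and
speed. The 500 fps, 3-second burst is an assumed example, and a real system
might interpolate between frames rather than simply repeat them:

    def playback_indices(capture_fps, duration_s, display_fps, speed=1.0):
        """Map each displayed frame to the nearest captured frame index."""
        n_captured = int(capture_fps * duration_s)
        n_display  = int(display_fps * duration_s / speed)
        frames = []
        for k in range(n_display):
            t = k * speed / display_fps              # time within the burst
            frames.append(min(int(round(t * capture_fps)), n_captured - 1))
        return frames

    # Review a 3-second, 500 fps burst at 30 fps, slowed down 10x:
    idx = playback_indices(capture_fps=500, duration_s=3, display_fps=30, speed=0.1)
    print(len(idx), "display frames; first few:", idx[:5])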

Chris asks:
** Where's the interaction? 

   For telerobotic applications a remotely located, variable frame
rate camera (or multiple cameras) sends a 3D burst back to the VR.
This burst can be captured in virtual world managed frame buffers.
There are two obvious modes of user interaction either via continuous
teleoperation (at low resolution) or by traversing the captured 3D
"volume" of video data.

** Can you change your point of view?

   To the extent that the system has been configured to support either
of the two modes previously described. If 3D cameras are used then in
principle one could change point of view within a captured video
volume. Arbitrary point of view. Dare I mention that in the consumer
television research community there is interest in "user directed
interactive TV," i.e. the home viewer controls the camera point of
view, zoom factor, etc. The transmitted video contains an entire video
volume into which you navigate. If all you want to do is stare at the
fans in the stands, the "information" is there for you to do so. Look
for it at about Superbowl 40.

** Can you change the experiment as it runs? No. 

   In the specific example of space based experiments, no. However,
within the limits of continuous teleoperation and if the timeframe of
the experiment is long compared to human response times, then yes.
Alternatively, within a captured volume of data you can change
visualization controls (such as opacity) while you walk through the
data. 

** The situation you describe allows N camera views at pre-programmed
** locations. If you want a new view that your camera(s) didn't get, you
** have to run the experiment again.
 
   Not exactly correct. Again, within the continuous video volume
there is sufficient information to construct a new and entirely
arbitrary point of view of the experiment.

** There's nothing wrong with this, but it ain't virtual reality, because the
** level of interaction is severely limited. 

    If level of interactivity defines VR then this system can be VR...

  Herb Taylor 

References:

"Virtual Environments, Personal Simulation & Telepresence". Scott
Sinkinson Fisher, Course Notes for SIGGRAPH Tutorial #29, 1989


[MODERATOR'S NOTE:  Without endorsing Herb's points -- which must stand
on their own -- I want to thank him for this impressive synthesis of
points of view, as well as a nice statement of his own position. -- Bob J.]

cdshaw@cs.UAlberta.CA (Chris Shaw) (04/06/91)

In article <1991Apr5.030424.16993@milton.u.washington.edu> Herb Taylor writes:
>Sutherland's vision ...  (as reported by Fred Brooks) 
>
>1. Display as a window into a virtual world.
>2. Improve image generation until the picture in the window looks real.
>3. Computer maintains world model in real time.
>4. User directly manipulates virtual objects.
>5. Manipulated objects move realistically.
>6. Immersion in virtual world via head-mounted display.
>7. Virtual World also sounds real, feels real.

There's a lot of Brooks in this statement. Sutherland's 1965 IFIP
Congress paper basically says "We can simulate anything but taste & smell". 

Anyway, I quote Sutherland's introduction here.. "We live in a physical
world whose properties we have come to know well through long
familiarity. We lack corresponding familiarity with the forces on
charged particles, forces in nonuniform fields, the effects of
nonprojective geometric transformations, and high-inertia, low friction
motion." Sutherland then goes on to talk about simulating these things
to higher and higher degrees of realism.

The key point here is SIMULATE. You can't take a picture of something
that doesn't actually exist. What do you point your camera at? Even
more to the point, almost all of Scientific Visualization is conducted
simply because the phenomenon being simulated is too difficult to
observe. Other phenomena are impossible to observe with a camera.

Design is based on the process of creation followed by critique. A CAD
system helps the creation, and simulation helps you provide the critique.
If you're designing something, you can't take a picture of the design
until you've built it.
Anyway, simulation is usually cheaper. All of the "canonical VR
applications" have simulation as a key component. 

>It is not so much the physical apparatus ("head-mounted") as the effect of
>total "immersion" which is critical to VR. 

Yes. One example is numerous projection TVs showing the world model on all
sides, sort of like the 360-degree movie at Disneyland. This has the
advantage of not encumbering the user with head-mounted cables, etc.
The disadvantage is that everything's on the wall, there's limited
stereopsis, and so on. Still, a valid approach at the outset. Talk
to Myron Krueger at UConn and to Bryan Lewis' group at IBM T.J. Watson
Labs for this kind of stuff.

>  It was my contention that there is nothing in VR which requires that
>worlds be polygonally based. That is a choice made specifically to
>meet the above criteria - not fundamental to the experience.

Yes and no. If you allow other types of synthetic surfaces (bicubic patches,
implicit surfaces, voxels, etc), then you get an obvious yes.
Otherwise one is left with camera imagery.

>I further concluded that other approaches to the world processing function
>would require less floating point operation. 

I don't see the point of avoiding floating point.
If one avoids floating point because one's box can't do floating point,
then change your box.

>telerobotics or teleoperation are considered worthy subfields of
>Virtual Reality? 

Well, I think that telerobotics is telerobotics. There are probably 5-10
papers worth mentioning (not counting repeats) in the field of Virtual
Reality, but there are numerous telerobotics and teleoperation journals. 
Which is a subfield of which? I prefer the narrower use of the term
"Virtual Reality", which essentially means "Highly Interactive 3D
Simulation". In any case, what's Virtual about telerobotics? The
operator is in some sense virtual, not the environment.
But really, this is a semantic argument.

>  Other examples of non-polygonal approaches would include the famous
>video based Aspen driving simulation. Does this qualify as a VR?

Again, no. What if you wanted to see Aspen on a cloudy day? Or anything
that's not a canned image? Well, you have to go back to Aspen & get
more footage!

>If a system can provide a completely user directed experience of driving in
>a "virtual" city (in this case modeled on a real one) 

It WAS a real city! It took about a month to shoot, and although they
had stacks of images, you could still only enter each store in a
certain way, and so forth.

>Chris Shaw does seem to consider E&S simulators to
>be VR - is that only because they use graphics?  

It's because they are *simulation*.

>the DVI based Palenque walk-through .. it would not be difficult to configure
>the system with a head mounted display (criteria #6) and to track body
>motion to provide a true sense of walking through the ancient ruins.

Ignoring the lag issue, so far so good.

>Perhaps the most difficult of Sutherland's criteria involve the
>requirement for manipulated objects to move realistically under the
>control of the user (criteria #4 and 5). 

These constraints cannot be met with canned video. The communication
is one-way in this case. VR systems have two-way communication. Hence
the teleoperation distinction.

>The resolution of the head mounted display (in addition to frame
>rate and latency) has been identified as A MAJOR LIMIT TO VR. 

Look. Unless you're talking < 1/60th second lag, latency has nothing
to do with the display technology. Latency arises from the other
system components, such as the head tracker technology, the computers
driving the trackers, and the renderers drawing the scene. (Or to be
more general, the video source drawing the scene).

Secondly, the "frame rate" limitation is not display refresh (fixed at
60Hz), but display update. In a CGI system, display update rate
depends on scene complexity -- the number of polygons/NURBS/voxels you
have to process. For video, of course, display update rate equals
display refresh rate.
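
A one-liner makes the distinction concrete. Assuming (purely for
illustration) a machine that sustains 100,000 rendered polygons per second,
the update rate falls with scene complexity while the 60 Hz refresh never
moves:

    polys_per_sec = 100000.0                    # assumed rendering throughput

    for scene_polys in (1000, 5000, 20000, 100000):
        print("%6d polygons -> %6.1f updates/sec" % (scene_polys,
                                                     polys_per_sec / scene_polys))
    # 1000 polygons update at 100 Hz (display-limited to 60), while a
    # 100,000-polygon scene crawls along at 1 update per second.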

Thirdly, the display resolution of VPL's EyePhone is very low:
320x240 max. The source of this problem, as Herb Taylor
rightly points out, is the LCD TVs that they use. You have to expand
your fonts to 2-3 times "normal" size for text to be readable in an
EyePhone.

>If all you want is to redirect your SGI polygon world output to the head
>display, fine, no problem.

Good. That's what I want. Curse my blinkered stick-in-the-mud pig-ignorance.

>** What do you need MPEG for???
>   I can only say that we have been approached by a number of groups
>about the feasibility of networked VR (and other applications) with a
>heavy emphasis on data compression.

My incredulity was based on the inherent silliness of sending
computer-generated images around, given that these images were
generated in real time and could be generated by the recipient in
real time.
A much better approach is to send the model. It's a much better use
of communication and computational resources, and you can compress
the polygonal models when you send them, too. You also get the behaviour
description of the model, and you can customize it if you wish.

MPEG would probably be useful for sending images, but it ain't
realtime, you still have to pay for the extra bandwidth, and
you're stuck with somebody's canned image.
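
A rough byte count shows why sending the model is attractive. All sizes below
are assumed round numbers for illustration, not measurements of any system:

    # "Send the model" versus "send the images", order-of-magnitude only.
    model_bytes   = 20000 * 100            # 20K polygons at ~100 bytes each, sent once
    updates_bps   = 100 * 64 * 30          # 100 objects, 64-byte state updates, 30 Hz
    raw_video_bps = 755e6 / 8              # uncompressed 1Kx1Kx30, 24-bit stream

    print("model, one time : %.1f MB"   % (model_bytes / 1e6))     # 2.0 MB
    print("model updates   : %.2f MB/s" % (updates_bps / 1e6))     # 0.19 MB/s
    print("raw video       : %.1f MB/s" % (raw_video_bps / 1e6))   # ~94 MB/s
    # Even heavily compressed video stays far above the update stream, and
    # only the model lets the recipient change viewpoint with no extra traffic.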

>Chris asks:
>** Where's the interaction?  <in relation to the astronaut experiment>
>
>   For telerobotic applications a remotely located, variable frame
>rate camera (or multiple cameras) sends a 3D burst back to the VR.
Excuse me? What's a "3D burst"?
>This burst can be captured in virtual world managed frame buffers.
>There are two obvious modes of user interaction either via continuous
>teleoperation (at low resolution) or by traversing the captured 3D
>"volume" of video data.

Huh? Can Herb tell us what algorithm he's talking about here? Granted,
given my naivete about image processing, maybe you can do this if you
have enough cameras. But I am quite doubtful that you can do full 3D
reconstruction of 3D phenomena on-the-fly with video, which is what
Herb is essentially claiming.

>** Can you change your point of view?
>   To the extent that the system has been configured to support either
>of the two modes previously described.

Be specific. This doesn't answer the question.

>If 3D cameras are used then in principal one could change point of view
>within a captured video volume. Arbitrary point of view.

3D cameras? In principle? What are you talking about here? This sounds
as far-off and high-priced as the direct neural interface. This also
looks an awful lot like a proof by claim. CGI can give you arbitrary
point of view in practice, right now, at under $100,000.

>The transmitted video contains an entire video 
>volume into which you navigate.

What's a "video volume"? How many cameras do you need? If you get a
video volume of my house, do you have to go into all my cupboards?
What if I want to move the furniture? What if I want to check the
state of my chimney? Look under the carpets? 

Now, it may seem that these are smart-ass questions, but I'm serious.
What kind of work do we need to do to create a "video volume"? I'm
guessing that you need to do a 6D scan of the entire volume of the house,
including the hidden surfaces. 3D is position, and 3D is orientation.
My point is that, in the absence of such an exhaustive scan, you're
going to have to accept some loss of position and orientation control.
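
To see how exhaustive such a scan would have to be, here is a toy count of
the images needed to sample viewpoint position and orientation for a single
room. Every sampling density below is an assumption chosen only to show the
order of magnitude:

    room_x, room_y, room_z = 5.0, 4.0, 2.5     # meters (assumed room size)
    pos_step               = 0.25              # a viewpoint every 25 cm
    yaw_steps, pitch_steps = 24, 9             # 15-degree steps, modest pitch range

    positions    = int(room_x / pos_step) * int(room_y / pos_step) * int(room_z / pos_step)
    orientations = yaw_steps * pitch_steps
    print("positions    :", positions)                # 3200
    print("orientations :", orientations)             # 216
    print("images       :", positions * orientations) # 691200
    # ...for one static room, with nothing moved and no cupboards opened.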

Strictly speaking, it's a trade-off between static visual realism and
dynamic visual realism. The more you want static visual realism, the
more appealing the video approach is. The more you want dynamic visual
realism, the more you want CGI. My basic point is that beyond stringing
together a certain set of canned still shots to simulate motion,
the video approach fails because the dynamic realism is lacking. 
Flaw number two is that you can't see hidden surfaces, even if you
want to. The fatal drawback, however, is the cost of putting
something like this together. I can't imagine a more tedious task
than scanning a room in 3D.

Of course, the CGI drawbacks are obvious. The images aren't real, and
producing them in real time depends on fast hardware. The more
polygons you have, the faster the machine needed to maintain a given
update rate.

>** Can you change the experiment as it runs? 
>within the limits of continuous teleoperation and if the timeframe of
>the experiment is long compared to human response times, then yes.

In other words, if you have true teleoperation, then the answer is
yes. But how much will it cost to build a teleoperation system that
allows you to alter arbitrary experimental parameters? Given that
system, is it a general enough tool to be used for any experiment? My
intuition tells me that the costs for the basic tool will be
astronomical.

In any case, Herb's answer dodges the real issue. If you can simulate
it, the time scale of the phenomenon is of no consequence; you slow it
down to suit. If you *must* do the experiment, then canned video is
probably the way to go. But the problem is that the video is all you
get. THERE IS NO MODEL.

>Alternatively, within a captured volume of data you can change
>visualization controls (such as opacity) while you walk through the
>data. 

I don't understand this point. Are we doing video, or are we doing
voxel stuff? If we're doing voxel image processing, then we're in the
realm of graphics. (Strictly speaking, no polygons though)

>** The situation you describe allows N camera views at pre-programmed
>** locations. If you want a new view that your camera(s) didn't get, you
>** have to run the experiment again.
> 
>   Not exactly correct. Again, within the continuous video volume
>there is sufficient information to construct a new and entirely
>arbitrary point of view of the experiment.

What is the proof of this somewhat surprising claim? Having
"sufficient information" doesn't buy you anything! For example one
has "sufficient information" to solve the satisfying assignment problem,
but it still could take exponential time in the number of variables to
solve! (The satisfying assignment problem requires that you find an
assignment to the variables of a product-of-sums boolean equation that
makes the equation true. The problem is NP-complete.)
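
A tiny brute-force solver for the problem just described makes the point:
the formula contains all the information needed, yet the search below may
still examine every one of the 2^n assignments. (This is only a toy
illustration of the complexity argument, not an efficient SAT solver.)

    from itertools import product

    def brute_force_sat(n_vars, clauses):
        """clauses: list of clauses; literal +i means variable i true, -i false."""
        for bits in product((False, True), repeat=n_vars):
            if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
                   for clause in clauses):
                return bits
        return None

    # (x1 OR not x2) AND (x2 OR x3) AND (not x1 OR not x3)
    print(brute_force_sat(3, [[1, -2], [2, 3], [-1, -3]]))
    # For n variables the loop may run 2**n times: "sufficient information"
    # says nothing about the cost of extracting an answer from it.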

>** There's nothing wrong with this, but it ain't virtual reality, because the
>** level of interaction is severely limited. 
>
>If level of interactivity defines VR then this system can be VR...

Hardly. I get 1 interaction per second typing on an ASCII terminal over
a modem. Is this VR?

>  Herb Taylor 


Clearly, this could degenerate into a stupid argument over semantics,
and that's not something I'm really interested in following up.
So, I'll summarize what I think VR is, and leave it at that.

Herb has a radically different view of what VR is than I do.
Herb gives the impression that video is a suitable replacement
for all Computer Generated Imagery in a Virtual Reality system.
I do not agree. I also don't know anybody else who would agree.
While video may be a useful adjunct to a virtual reality system,
it lacks the fundamental property of arbitrary real-time view
position and orientation control. CGI gives you this as part of
the package.

In fact, I would call a remote-controlled camera system such as
Herb mentions a teleoperation or telepresence system. Why? Because the
simulation component doesn't exist, because the view cannot be
arbitrarily controlled, and because the operator cannot manipulate
arbitrary objects, given a reasonable-cost system.

The teleoperation distinction is important also for the types of
technology that you need. Teleoperation relies on robotics if
any view change is to be possible. Computer Graphics and Robotics have
much in common, but they are also quite different. Similarly, VR and
Teleoperation have common ground, but only on the operator side. On
the side being operated, the difference is that VR is a virtual space,
and teleoperation is a real space. I have to wonder why Herb insists
on calling his work Virtual Reality when it so clearly is not.

<Rampant speculation follows>
I think that the Princeton box is a solution looking for a problem. This
is why HDTV, video compression, MPEG, etc keep coming up. The HDTV
box wants to become a VR box. This will be a truly Procrustean effort.

Ten years from now, things may be different, but I can wait.
-- 
Chris Shaw     University of Alberta
cdshaw@cs.UAlberta.ca           Now with new, minty Internet flavour!
CatchPhrase: Bogus as HELL !



[MODERATOR'S NOTE:  Thanks to both Chris and Herb for a stimulating and
provocative discussion, without degenerating into name calling.  (Is it
okay if someone calls themselves a "smart-ass"?  What humility!)  Please
don't stop if there's more to be said.  We're all learning a lot, from
two different but indicative points of view. -- Bob J., now wafting over
the Arctic....]

naimark@apple.com (Michael Naimark) (04/07/91)

There are several references to cameras and displays made here that concern
realworld recording, on which I'd like to comment. Realworld recording (for
realworld simulation or, perhaps more interestingly, for abstraction) is often
assumed to be trivial, or possible at some future time. It ain't that easy.

Disney's CircleVision (nine-screen 35mm cylindrical theater) has NO stereoscopy.
If two camera pods could be placed an interocular distance apart (which they
can't, resulting in hyperstereoscopy) and two projectors were dedicated to each
of the nine screens (i.e., via polarized filters and glasses), the degree of
parallax would change from normal stereoscopic to zero to pseudoscopic as the
viewer's head rotates 180 degrees. This "stereoscopic panorama dilemma" also
holds for shooting realworld images for goggles (like shooting with two fisheye
lenses).
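
The geometry of the dilemma fits in a few lines. With two fixed camera
positions an interocular distance apart, the effective stereo baseline seen
by a viewer whose head has rotated by theta falls off as cos(theta): full
stereo at 0 degrees, none at 90, pseudoscopic beyond. This is a simplified
geometric sketch that ignores lens and screen effects:

    import math

    interocular_mm = 65.0                       # nominal eye separation (assumed)
    for theta_deg in (0, 45, 90, 135, 180):
        baseline = interocular_mm * math.cos(math.radians(theta_deg))
        kind = ("stereoscopic" if baseline > 1 else
                "pseudoscopic" if baseline < -1 else "near zero")
        print("%3d deg: effective baseline %+6.1f mm (%s)" % (theta_deg, baseline, kind))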

The Aspen Moviemap (as well as Palenque, Paris, and Golden Gate) were shot to
maximize user control, such as shooting via distance triggering rather than time
triggering and shooting for "seamless" match-cuts at nodes, but are ultimately
lookup media with minimal computation needed (searching videodisc frame
numbers). Consequently, the end-user can never travel where the camera didn't
shoot. (A favorite flame of mine is that while much of the computer community
dismisses this as not virtual reality, they continue to call their little green
cones "trees.")

Aspen was digitized into a cartoon-like 3D model by brute force (students), led
by Walter Bender. The definitive document on what it takes to automate this
process is Michael Bove's PhD dissertation "Synthetic Movies Derived from
Multi-Dimensional Image Sensors" (MIT Media Lab, 1989). It's exciting and
depressing: this is not going to happen tomorrow. Humans are particularly good
at recognizing what is important in an image, and workstations could be
optimized to make digitization less brute force and more the artform that it is.

Cameras could also be optimized for digitization into 3- or 4-D models:
"spacecode" (my term for the sensing and recording of spatial data), z-data
sensing, panoramic optics, and auxiliary data tracks for annotation would be
good starts. My experience with major cameras and camera companies (e.g., Sony)
is that they haven't a clue how, or why.