herbt@apollo.sarnoff.com (Herbert H Taylor III) (06/05/91)
The recent series of postings on VR/Video following the exchange between myself and Chris Shaw have been most informative. I would also like to publicly thank Chris for asking some very tough questions. I can certainly take a "whipping" without losing my sense of humour - although I hope we can both avoid the use of the pejorative. It was unfortunate that the HDTV emphasis of previous posts drew attention away from the more important topic of VR architectures. Chris challenged the applicability of those ideas to VR and, while I remain convinced that they will prove important, they seem to represent a small conceptual detour. Likewise, a strong technical challenge was raised to our description of several 3D video based scenarios. Much of the technical criticism which followed was the result of our poor description of our ideas. Real-time 3D imagery alone will not be sufficient to construct and manage entirely "simulated" virtual worlds. We will not be able to look under the kitchen table for the "chewing gum". (Of course, if the CGI modeler doesn't have a "chewing gum" model, "it", to borrow a favorite colloquial expression of Chris', "ain't there either.") Whether these limitations preclude the use of 3D imagery as a useful interface component of VR remains a topic of further research...

In our original speculative post of ca. 3/10/91 (responding to our moderator's request for summaries of current research), we mused about our desire to explore what VR will be like in ten years. We admitted that the system we were using (aka the Princeton Engine) was not a general solution to VR processing. I hope, however, that we use supercomputers as vehicles to explore otherwise impossible ideas. Machines such as the UNC PxPL5, the CM2 or the Princeton Engine are not "practical" single user architectures - at least not yet - but surely in not too many years we will have desktop massive parallelism with the potential for real-time interactive VR applications.

This leads naturally to motivating questions: will the "ultimate" VR system of the year 2000 be more SGI-MIMD like - with a large number of powerful processors in the rendering pipeline - or will it be a hybrid of SIMD and MIMD, as in the PxPL5? Perhaps VR specific architectures will emerge. What impact does our choice of VR world model have on architecture? Is the evolution of VR only going to be in the direction of increasingly realistic CGI rendering? Are worlds derived from sampled data going to become more viable? What are the applications which motivate future VR architectures?

Certainly, the potential for applications in medicine, architecture, simulated experience, "gaming" and product design will continue to provide motivation for developing systems with improved visual realism and more natural interactions. Likewise, scientific data visualization offers fertile ground for VR research and future application - where we have this notion of interacting with and literally "experiencing" our data. There has been significant independent progress in recent years in each of the fields of interactive data visualization, VR and, specifically, Volume Visualization (VV); however, it is in the convergence of all three technologies into a single computing and interaction framework that the true enabling leap in functionality will occur. Scientists will be able to simulate and visualize complex phenomena and, in some sense, actually "participate" in their experiments.
This kind of interaction will revolutionize scientific research in much the same way that the computer itself has. There are probably those who will question our enthusiasm and observe that our scientific forebears drew marvelous insight from very simple physical models of complex structure without the benefit of computers or graphics. It is said that the structure of benzene came to Kekule in a dream. Certainly, Watson and Crick were able to visualize amazing structure without the benefit of complex "tools"... which is exactly the point. When the visualization systems of the future are as easy to use as a box of snap-together molecular models, as interactive as the microscope or as "free associative" as a dream - only then will they realize their full potential. These advances, however, will come at great computational cost.

Where are the computational boundaries for VR? To address these issues we must first establish complexity bounds for VR in terms of computation (rendering, dynamics, constraints, etc.) and I/O. The processing requirements of VR have been studied in terms of system dynamics and constraint satisfaction by [Pentland90], giving O(n**3) "calculations" per vertex for the dynamical system. For 1000 objects with 1000 vertices each, 100 TFLOP performance is required to achieve interactivity (assuming 100 floating point operations per system "calculation"). That astounding number is still two orders of magnitude removed from the NEXT generation of supercomputers. The authors propose a reduced complexity model - still with computational complexity in the 10 GFLOP range to satisfy system constraints and 100 MFLOP for the dynamics. A system which implements this approach is described in [PentE90]. [Witkin90] also discusses the constrained dynamical system in some detail. Polygon rendering has been discussed in [Ake88], with floating point requirements in the range of 40 MFLOP for 100,000 polygons. Further research needs to be done to reduce world complexity and to make higher resolution worlds with complex objects more tractable on near term computers.

Several posters to this group have suggested that VR input processing requirements are quite modest - at least in terms of the data glove. We might ask how the complexity scales as more and higher "resolution" input devices are introduced. Devices such as the eye tracker [Levoy90] and 3D head tracker [Wang90] would seem to add significant complexity to VR processing requirements even with dedicated interface hardware. Comparatively less research has been done on the physical "output" side of VR. [Minsky90] describes a system using the sense of touch. What other input or output devices can we look forward to and what are the performance specs?

Clearly, the exponential growth of the computational and I/O requirements of VR will motivate both algorithmic and architectural solutions. A recent estimate by the US Government projects "teraflop" commercial supercomputers before the year 2000 [USG91], with the first demonstration systems emerging from the DARPA High Performance Computing Systems (HPCS) initiative in 1992-3. Whether the spectacular performance of these machines can be fully harnessed for VR remains to be seen. Perhaps what is really needed is a combination of VR specific architectures with VR specific algorithms - architectures which are on the HPCS technology learning curve (i.e. that employ MCM packaging, superdense ULSI, optical interconnect, etc.) COMBINED with algorithms which can replace O(n**3) with, say, O(n log n).
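To make the arithmetic behind the quoted [Pentland90] figure concrete, here is a back-of-the-envelope sketch. The way the O(n**3) term is applied below (over the 1000 vertices of each object, once per world update, at 100 FLOPs per "calculation") is simply my reading of the numbers above, not a statement of the paper's actual model:

  /* Back-of-the-envelope FLOP budget reproducing the 100 TFLOP figure.
   * ASSUMPTIONS (mine, not from [Pentland90]): the O(n**3) term is taken
   * over the n vertices of each object and evaluated once per update,
   * with 100 FLOPs per system "calculation".
   */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double n_objects      = 1000.0;   /* objects in the world            */
      double n_vertices     = 1000.0;   /* vertices per object             */
      double flops_per_calc = 100.0;    /* FLOPs per system "calculation"  */

      /* O(n**3) "calculations" for each object's dynamical system */
      double calcs_per_update = n_objects * pow(n_vertices, 3.0);
      double flops_per_update = calcs_per_update * flops_per_calc;

      printf("calculations per update: %.3g\n", calcs_per_update);
      printf("FLOPs per update:        %.3g (= %.0f TFLOP)\n",
             flops_per_update, flops_per_update / 1e12);
      return 0;
  }

Even granting only one world update per second, that is 1.0E14 floating point operations - the 100 TFLOP figure.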
A number of research groups have proposed (and in some cases actually built) "application specific" or "algorithm specific" visualization computers, including the well known UNC PxPL5 [Fuchs82], the SUNY CUBE [Kaufman88] and the Stanford SLAM [Dem86] systems. (We are not sure if a full version of the latter machine was ever built.) In general, these researchers were motivated by the desire to explore "future" visualization algorithms. Our original motivation in developing the Princeton Engine was the desire to explore "future" video systems. However, we have found that its applicability is in no way "limited" to video and it can serve as a useful architecture to study visualization algorithms and future visualization systems. We certainly do believe we can accomplish much of what we described in our previous posts. After all, the system can turn over 30 (16 bit) GIGAOPS, or over 1 GFLOP (for you scientific types). More important, ALL system I/O is continuous and transparent to the CPU. A 48 bit x 28 MHz (= 1.4 Gbps) digital input bus and a 64 bit x 28 MHz (= 1.8 Gbps) digital output bus can drive any combination of analog or digital I/O devices. Transparent, continuous gigabit I/O should be important to the NEXT generation of VR peripherals: Data Gloves, eye trackers, you invent it. Finally, while there is no special system requirement that either the I/O or the application be "video", a typical application will often COMBINE scientific computing AND real-time data visualization.

Real-time Interactive Volume Visualization
------------------------------------------

To calibrate this machine for graphics applications, we are in the process of implementing a real-time volume rendering system that can arbitrarily rotate and render 256x256x256 volumes at 30fps. We believe we can volume render 1Kx1Kx1K at about 8fps using single axis rotation. This is possible because the Princeton Engine can perform a "continuous" real-time transpose (512x512x32 bits at 30fps) for very little CPU cost. The programmer effectively has an array and its transpose as working data structures. At any line "time" each processor has a row and a column of the current frame in hand. Therefore, "scanline" algorithms are relatively straightforward...
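For readers who have not programmed in this style, the row/column idea can be sketched on a conventional machine as a separable two-pass operation; the explicit transpose below is exactly the step the Engine provides continuously and essentially for free. The filter, image size and layout are illustrative choices of mine, not Engine code:

  /* Separable smoothing as a row pass followed by a column pass.
   * On a conventional machine the column pass needs an explicit
   * transpose (or strided access); on the Princeton Engine the
   * transpose is continuous, so both passes are plain scanline loops.
   * Caller supplies 'in' and 'out' as W*H floats.
   */
  #include <stdlib.h>

  #define W 512
  #define H 512

  /* 3-tap box filter along each row of an h x w image */
  static void filter_rows(const float *src, float *dst, int h, int w)
  {
      for (int y = 0; y < h; y++) {
          for (int x = 0; x < w; x++) {
              int xm = x > 0     ? x - 1 : x;
              int xp = x < w - 1 ? x + 1 : x;
              dst[y * w + x] = (src[y * w + xm] +
                                src[y * w + x ] +
                                src[y * w + xp]) / 3.0f;
          }
      }
  }

  /* Explicit transpose -- the step the Engine provides for free */
  static void transpose(const float *src, float *dst, int h, int w)
  {
      for (int y = 0; y < h; y++)
          for (int x = 0; x < w; x++)
              dst[x * h + y] = src[y * w + x];
  }

  void separable_smooth(const float *in, float *out)
  {
      float *tmp  = malloc(sizeof(float) * W * H);
      float *tmp2 = malloc(sizeof(float) * W * H);

      filter_rows(in, tmp, H, W);     /* horizontal pass             */
      transpose(tmp, tmp2, H, W);     /* rows become columns         */
      filter_rows(tmp2, tmp, W, H);   /* vertical pass, done as rows */
      transpose(tmp, out, W, H);      /* back to original layout     */

      free(tmp);
      free(tmp2);
  }

On the Engine both passes look like the same simple scanline loop, which is what makes single axis rotations and other scanline algorithms cheap.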
A number of recent papers suggest that Volume Visualization (VV) and Virtual Reality are closely related, convergent applications. The "edvol" system combines a VPL Data Glove and 3SPACE Polhemus Tracker to provide direct interaction with volumetric data [Kaufman90]. The authors do not characterize the size and complexity of either the VR or the VV but do describe it as "small scale". It often seems as though the electronic "media" believe that VV is "already" a standard component of VR. This obvious misconception has occurred because volume visualization demos are usually presented as real-time interactive simulations, when the visualization actually took CPU hours to orchestrate. But the "dream" clearly is real-time interactive volume rendering and visualization. One of the best demonstrations I have seen of the potential for a VV fly-through was presented by Marc Levoy (then of UNC) using volume rendered CT. In [Levoy89] the intended use of a head mounted display interface to the PxPL5 is described, while in [Levoy90] the use of eye tracking hardware is described - in both cases specifically for VV. UNC's Steve Pizer showed a video of 8fps single axis rotation volume rendering on PxPL5 at the San Diego Workshop on VV. A head display system has also been developed at UNC to assist radiologists in treatment planning. Although these examples may not in all cases qualify as pure "VR", they certainly speak to the potential for a real-time interactive VR interface to a volume visualization environment.

Volumetric data sets come in two basic forms: "real" sampled data (as in CT, MRI, ultrasound, optical or X-ray microscopy, etc.) and computed or "synthetic" data (weather models, CFD, etc.). In the latter case the volume is usually "simulated", while in the former the "raw" data is sampled and sometimes preprocessed before the volume is rendered. For example, before an MRI image is produced the "raw" sampled data must be Fourier transformed. With either approach the resulting data set is a 3D spatial volume. In the case of synthetic simulated data there is also a "timestep" - the fluid flows, the turbine spins, etc. With sampled data there is often no clear notion of time; the data is entirely static. However, the interaction with the data can be dynamic and even involve the "introduction" of time. A traveler passing through a sampled and rendered volume certainly experiences the passage of time; the "world" itself, however, remains static. By analogy one can imagine walking through a museum (static sampled "objects") versus walking along the bank of a river (dynamic simulated "objects"). In our conceptual museum, as we begin to interact with objects we can simplify the system constraint dynamics as much as desired, literally "determining" the laws of physics. That Ming Dynasty vase I knocked over? It never touched the ground. If we are "in" an MRI or CT museum we might wish to change opacity, point of view or other parameters which affect our visual perception of the phenomena we are studying. Of course, the same control of time is possible in the "synthetic" case, however only at the risk of undermining the scientific interpretation of the simulation, i.e. correctly visualizing and understanding the physics is often fundamental to the experiment.

With the emergence of increasingly real-time instrumentation there is a second form of sampled data to consider: 3D spatial volumes with time varying data. Imagine a sampled volume of a living organism or dynamic micro structure which is updated 30 times a second. If we were "inside" this museum while it was "open" we could watch cells as they proceed through nuclear envelope breakdown, divide and emerge as two identical cells. (We are working with this kind of data now.) The degree to which this form of interaction is a Virtual Reality results not from our ability to "alter the experiment" on the fly, but from our ability to control the dynamics of how we "view" the experiment while it is taking place. We may ultimately be able to turn up the heat or add some catalyst to a chemical reaction from within the Virtual experience, but that ability neither defines VR nor does the absence of that ability preclude VR. IMO, it is the VR observer's perceived sense of simulated presence combined with the ability to control the visual experience which principally defines the interaction as VR. Two related projects provide sources of volumetric data which we are using at Sarnoff and which we feel have VR "prospects": an experimental ultrasound instrument and a differential interference contrast (DIC) microscope which produces a sequence of image slices through a cell embryo.
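As a highly simplified sketch of what "rendering" such a sampled volume involves, the fragment below stacks a sequence of 2D slices (DIC optical sections, say) into a volume and projects it along one axis with front-to-back opacity compositing. The sizes, the toy opacity transfer function and the fixed view axis are illustrative assumptions of mine - this is not the Engine's scanline-parallel implementation described above:

  /* Stack 2D slices into a volume and render an axis-aligned view by
   * front-to-back opacity compositing.  Dimensions, the linear opacity
   * ramp and the single fixed view axis are illustrative only.
   */
  #include <stdint.h>
  #include <stddef.h>

  #define NX 256
  #define NY 256
  #define NZ 256

  /* copy slice z (NX*NY samples) into its place in the volume */
  void add_slice(uint8_t *volume, const uint8_t *slice, int z)
  {
      for (int y = 0; y < NY; y++)
          for (int x = 0; x < NX; x++)
              volume[((size_t)z * NY + y) * NX + x] = slice[y * NX + x];
  }

  /* project the volume along +z into an NX x NY image */
  void render_front_to_back(const uint8_t *volume, float *image)
  {
      for (int y = 0; y < NY; y++) {
          for (int x = 0; x < NX; x++) {
              float color = 0.0f;           /* accumulated intensity  */
              float trans = 1.0f;           /* remaining transparency */
              for (int z = 0; z < NZ && trans > 0.01f; z++) {
                  float v     = volume[((size_t)z * NY + y) * NX + x] / 255.0f;
                  float alpha = 0.05f * v;  /* toy opacity transfer function */
                  color += trans * alpha * v;
                  trans *= (1.0f - alpha);
              }
              image[y * NX + x] = color;
          }
      }
  }

Changing the opacity transfer function or the view direction is exactly the kind of "museum" interaction we have in mind.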
The DIC volume can be acquired at or near real-time with the latest instrumentation - hence, a "video volume" (sorry Chris, it really can be...). In the case of the ultrasound instrument the Princeton Engine will also perform the front-end signal processing required to produce a volume from the sampled data. Presently, a "raw" 3D ultrasound data set cannot be acquired in real-time; however, the signal processing to produce a data volume from an unprocessed data set can potentially be accomplished in real-time, as can the volume rendering process. The KEY POINT is that we are going to see more and more real-time instrumentation which can produce true sampled data volumes. As the acquisition times of systems such as MRI, ultrasound, electron microscopy and DIC decrease, we will also see greater coupling between the front-end signal processing, the data visualization and perhaps even the user interaction. At a recent workshop at Princeton University, "Seeing Into Materials: Imaging Complex Structures", both optical and EM microscopy systems capable of real-time 3D acquisition were described... Actually, "4D" is used to refer to a 3D spatial volume "moving" in time. BTW, these scientists have a strong intuitive feel for the potential of VR - at least what they call "VR". That is, the ability to interact with an experiment - either in situ OR as part of post analysis - as in our museum examples.

Is it fair to ask where in the taxonomy of VR systems we should place these kinds of applications? True, the worlds are derived from various real world spectra, but the interactions are entirely SIMULATED, one can change viewing parameters, etc. The exact meaning of virtually touching "objects" or surfaces in such worlds remains unclear... but really no more so than in a CGI simulated world where everything is built from models. In either case the eventual consequence of our fundamental interactions within a world must be determined by a "law giver". If I put my data gloved hand directly into the burner of a VR kitchen stove, what happens?

It should be noted that there are several potential problems with these methods of data visualization. First, a number of people report varying degrees of motion sickness when observing through a head mounted display. That may be acceptable if I am performing a "hammerhead" maneuver in my Super Decathlon simulator, but probably is not acceptable if I am inside someone's brain. (Informal poll: how many of you have experienced this effect? RSVP and I will tabulate.) A second potential problem results from "persistence" effects on the human visual system. We vividly recall in the early days of VLSI CAD workstations the problem IC draftspersons experienced after long hours staring at color stripes and squares. [Frome83] describes the so-called "McCollough effect", wherein, after looking at color stripes for only a short time, high contrast B&W stripes suddenly appear to have color where none is present. To dramatize this effect during her talk at DAC83, Francine Frome periodically displayed slides with green and red stripes. About halfway through the talk she displayed a slide with a striped "BTL" in bright green foreground offset from a bright red background - before informing the audience that the slide was totally black and white! It was quite remarkable. These and other "human factor" issues will need to be fully understood before head mounted displays achieve broad use and certainly before we let Nintendo sell one to every ten year old...
More VR/Video
-------------

In our original post we also speculated that multiple cameras might be used to develop a "video" data glove or "whole body" interface to a virtual world. In particular, we asked how such an interface would impact the future design of the data glove. We received a number of thoughtful comments following this post. It is important to note that the VR world itself COULD STILL be CGI - with video only providing a framework for interaction. With support for up to six cameras one could surround participants either individually or collectively with video. (Remember, we pay no CPU cost to load frames into memory from each camera; however, we do pay once we start to do something with the data.) Participants might wear "chroma keyed" gloves (wireless gloves of a reserved key color) or even body suits. Chroma keying is a well known technique for creating simple special effects such as the "weatherman" overlay [Ennes77] [Watk90]. We would NOT use this merely for special effects, however, but to provide a means of isolating the hands so we can build a useful model. On the Engine the amount of processing for each chroma key is only about 5-10% of the "real-time budget" at 30fps. A second chroma key is used for the background. This is similar to Myron Krueger's Videoplace, which uses white backing screens. It differs in that Videoplace produces only silhouettes of "Artificial Reality" participants as a group and provides a limited framework for identifying individual participants. (Don't get me wrong - Videoplace is still a lot of fun! I recently spent a day at the Franklin Institute watching kids play in it and was impressed by the overall effect produced.)
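The keying step itself is simple enough to sketch. The fragment below marks every pixel whose color lies within a tolerance of the reserved key color, yielding a binary hand mask for later modeling and gesture stages. The color space, threshold test and frame layout are my own illustrative choices (a broadcast keyer would normally work on color difference signals rather than raw RGB), not a description of the Engine code:

  /* Per-pixel chroma key: mark pixels close to a reserved key colour,
   * producing a binary mask of the gloved hands.  Tolerance and RGB
   * distance test are illustrative assumptions.
   */
  #include <stdint.h>

  typedef struct { uint8_t r, g, b; } Pixel;

  /* mask[i] = 1 where frame[i] is within 'tol' of the key colour */
  void chroma_key_mask(const Pixel *frame, uint8_t *mask, int n_pixels,
                       Pixel key, int tol)
  {
      for (int i = 0; i < n_pixels; i++) {
          int dr = (int)frame[i].r - (int)key.r;
          int dg = (int)frame[i].g - (int)key.g;
          int db = (int)frame[i].b - (int)key.b;
          /* squared colour distance against squared tolerance */
          mask[i] = (dr * dr + dg * dg + db * db <= tol * tol) ? 1 : 0;
      }
  }

A subsampled version of such a mask is also the sort of input we have in mind for the gesture recognition experiments described below.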
The next major technical step is to be able to exploit this interface in a useful way. In particular, we want to study the effect of multiple "individual" participants. If two video channels are paired to each participant with a distinct chroma key, can we construct a useful model and use that to interact with and control the dynamics of the visualization process? Our present plan will focus on the real-time recognition of simple hand gestures from each "pair of hands". The idea of using sign language, as some posters have suggested, is very interesting - particularly coupled with a neural net based recognizer. Recognizing full ASL in "continuous" real-time by any approach is probably ambitious; however, a useful subset might be possible. We have used a neural net approach to detect and remove characteristic AM impulse noise (aka "hair dryer noise") in a TV receiver [Pearson90]. One network is trained to detect AM impulses on an image line and a second network is trained to look at the entire image and determine which of the detected pulses are really "false" positives. This program runs in continuous real-time on the Princeton Engine. We also demonstrated real-time back-propagation (BEP) training on a simple three layer MLP (a total of 86 weights and thresholds). For hand signs, however, a new network topology would be required - with the input to the network derived from the subsampled chroma key image segment of the original image. However, if my understanding of "conversational" ASL is correct - and each hand sign is typically an entire word or concept - then the resulting training set still might be huge. Also, I believe that hand motion itself plays a significant role in the interpretation of signs - not just in the transition from one sign to another - as in cursive writing. This implies that a robust sign recognition system would need to compute a motion vector and use that as part of the training set. We would appreciate references to current work... particularly how one detects individual signs in "continuous" conversation, i.e. when does one sign end and the next begin?

Lastly, a second video experiment would involve the use of the chroma key to present to each of three remote participants a composite image of their two neighbors, to form a virtual conference. While this interaction is entirely real-time, we recognize that there will be a significant limit to the quality of interaction between subjects. We are interested in the degree of "total" immersion each person experiences. If we also mix the audio, does the participant "feel" like he or she is having a conversation with three people? Unfortunately, I would imagine that the head mounted displays would tend to undermine intimacy - "perhaps" we could image warp new faces on everybody - just kidding Chris - although now that I think about it...

References
----------

[Pentland90] "Computational Complexity Versus Virtual Worlds", A. Pentland. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH. (Based on the quality of papers in the proceedings, this must have been a great conference!)

[Witkin90] "Interactive Dynamics", A. Witkin, M. Gleicher, W. Welch. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[PentE90] "The ThingWorld Modeling System: Virtual Sculpting by Modal Forces", A. Pentland, I. Essa, M. Friedmann, B. Horowitz. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Ake88] "High Performance Polygon Rendering", K. Akeley, T. Jermoluk. Computer Graphics Vol. 22, No. 4, August 1988, ACM.

[Levoy90] "Gaze Directed Volume Visualization", M. Levoy, W. Whitaker. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Wang90] "A Real-time Optical 3D Tracker for Head Mounted Display Systems", J. Wang, V. Chi, H. Fuchs. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[Minsky90] "Feeling and Seeing: Issues in Force Display", M. Minsky, O. Ming, O. Steele, F. Brooks, Jr. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH.

[USG91] "Grand Challenges: High Performance Computing and Communications". A report by the Committee on Physical, Mathematical and Engineering Sciences; Federal Coordinating Council for Science, Engineering, and Technology; Office of Science and Technology Policy.

[Fuchs82] "Developing Pixel-Planes, A Smart Memory Based Raster Graphics System", H. Fuchs, J. Poulton, A. Paeth, A. Bell. 1982 Conference on Advanced Research in VLSI.

[Kaufman88] "Memory and Processing Architecture for 3D Voxel-Based Imagery", A. Kaufman, R. Bakalash. IEEE Computer Graphics and Applications, Vol. 8, No. 11, November 1988, pp. 10-23; reprinted in "Volume Visualization", edited by A. Kaufman, IEEE Computer Society, 1991.

[Dem86] "Scan Line Access Memories for High Speed Image Rasterization", S.G. Demetrescu. Ph.D. Dissertation, Stanford University, June 1986.

[Kaufman90] "Direct Interaction with a 3D Volumetric Environment", A. Kaufman, R. Yagel, R. Bakalash. 1990 Symposium on Interactive 3D Graphics, Computer Graphics Vol. 24, No. 2, March 1990, ACM SIGGRAPH. (We also highly recommend "Volume Visualization", edited by A. Kaufman, IEEE Computer Society, 1991, which contains a large survey of relevant publications.)
[Levoy89] "Design for a Real-Time High Quality Volume Rendering Workstation", M. Levoy. Chapel Hill Workshop on Volume Visualization, 1989, Department of Computer Science, University of North Carolina, C. Upson, Editor.

[Frome83] "Incorporating the Human Factor in Color CAD Systems", F.S. Frome. 20th Design Automation Conference, June 1983, IEEE Computer Society.

[Ennes77] "Television Broadcasting: Equipment, Systems and Operating Fundamentals", H. Ennes. Howard W. Sams, 1979, pp. 319-323.

[Watk90] "The Art of Digital Video", J. Watkinson. Focal Press, 1990, pp. 75-77.

[Pearson90] "Artificial Neural Networks as TV Signal Processors", C.D. Spence, J.C. Pearson, R. Sverdlove. SPIE Proceedings Vol. 1469: Applications of Artificial Neural Networks, 1991.