Now includes April 20, 2000 updates, in brown
Now includes May 12, 2000 conclusions, in red
Here are 15 images taken from a complete revolution at one of the sites (I forget which).
Here is a panorama I built out of them. Check out the interactive panorama viewer.
(You may need to download the plug-in.)
I would like to apply image-based rendering techniques to this data (soon to be
ours in the original film form, from which we can obtain digital copies) to
produce a three-dimensional environment.
We don't have the physical film, in fact. What we have is a D2
digital tape of the film (just the Timbuktu and Dubrovnik reels), with frame
numbers. What we do when we want a specific frame is tell the film lab (which is
in San Francisco), and they do the conversion for us.
We then receive the frames on DLT tape in Cineon format, the Kodak film-scanning format
from which the DPX format was derived. DPX is maintained by SMPTE for the motion picture
industry, and the headers contain much useful information regarding color metrics, frame
rates, shutter angles, and so forth. Mostly what we are interested in is the 10 bits per
color channel of image information.
These files are probably going to have to be converted into something sane before I use
them (no way I need 10 bits of color, and it's logarithmic at that). I have located
the specifications and have begun writing code to do the conversion.
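A minimal sketch of what such a conversion might look like, not the code mentioned above. It assumes the common Cineon layout (three 10-bit code values in the top 30 bits of a big-endian 32-bit word) and Kodak's commonly quoted defaults: reference white 685, reference black 95, 0.002 density per code value, negative gamma 0.6. None of these values were read from our headers, the display gamma is a guess, and the function names are hypothetical.

```python
import struct

# All constants below are assumptions (commonly quoted Cineon defaults).
REF_WHITE = 685           # code value treated as full white
REF_BLACK = 95            # code value treated as black
DENSITY_PER_CODE = 0.002  # density step per 10-bit code value
NEG_GAMMA = 0.6           # gamma of the film negative
DISPLAY_GAMMA = 2.2       # assumed gamma of the 8-bit output


def unpack_pixel(word_bytes):
    """Split one big-endian 32-bit word into three 10-bit code values."""
    (word,) = struct.unpack(">I", word_bytes)
    r = (word >> 22) & 0x3FF
    g = (word >> 12) & 0x3FF
    b = (word >> 2) & 0x3FF   # the lowest 2 bits are padding
    return r, g, b


def _gain(code):
    """Relative linear exposure for a logarithmic code value."""
    return 10.0 ** ((code - REF_WHITE) * DENSITY_PER_CODE / NEG_GAMMA)


def log10_to_8bit(code):
    """Map a 10-bit log code value to an 8-bit gamma-encoded value."""
    black = _gain(REF_BLACK)
    linear = (_gain(code) - black) / (1.0 - black)
    linear = min(max(linear, 0.0), 1.0)   # clip below black and above white
    return round(255 * linear ** (1.0 / DISPLAY_GAMMA))


# Only 1024 possible inputs, so a lookup table does all the real work.
LUT = [log10_to_8bit(code) for code in range(1024)]
```

Because each channel has only 1024 possible values, the per-pixel cost is one table lookup, which matters for 50 MB frames.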
The film lab did a good job on the film-to-tape transfer, but this Unix tape they gave us
is like a bad joke. Each file is tarred individually, with an absolute pathname,
and many different files share the same filename. Most of the ones I've been
able to read claim to be from the right camera, but there are far more than the 72 there
should be, and others are corrupted. Here is a full-size frame with
lots of JPEG compression.
I wrote a lot of irritating tar scripts trying to make this work. The files are 50 MB
each, too. This must be what it felt like to program with punched cards.
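For comparison, here is a sketch of how the same cleanup could be done with Python's standard `tarfile` module instead of shell scripts. It assumes each archive has already been copied off the tape into its own file. The function names and the `name.1.ext` numbering scheme are made up for illustration.

```python
import os
import tarfile


def unique_relative_name(member_name, used):
    """Drop the absolute directory part and number any repeated filename."""
    base = os.path.basename(member_name.rstrip("/"))
    stem, ext = os.path.splitext(base)
    candidate, n = base, 0
    while candidate in used:
        n += 1
        candidate = f"{stem}.{n}{ext}"
    used.add(candidate)
    return candidate


def unpack_archives(archive_paths, out_dir):
    """Extract every regular file into out_dir; return (written, failed)."""
    used, written, failed = set(), [], []
    os.makedirs(out_dir, exist_ok=True)
    for path in archive_paths:
        try:
            with tarfile.open(path) as tar:
                for member in tar:
                    if not member.isfile():
                        continue
                    name = unique_relative_name(member.name, used)
                    target = os.path.join(out_dir, name)
                    with tar.extractfile(member) as src, open(target, "wb") as dst:
                        dst.write(src.read())
                    written.append(target)
        except (tarfile.TarError, OSError) as err:
            # Corrupted archive: record it and carry on with the rest.
            failed.append((path, str(err)))
    return written, failed
```

Writing each member under its own generated name sidesteps both the absolute pathnames and the duplicate filenames, and the `failed` list gives a record of which archives need to be requested again.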
In the meantime, I have been using the data from DV digital video tapes. Unfortunately
this introduces interlacing. Please read my discussion
on interlacing issues.
Now I'm using the images from D2 tape, which is higher quality and which I think isn't interlaced. I have found a sequence I really want to use, of San Francisco: in the foreground is a regular geometric tiling pattern, with a cool building in the midground with waterfalls on it. I think this footage would lend itself well to depth extraction and my fake transforms. The waterfalls would be tagged as moving objects, so I would render them by cycling their pixel data on top of the image objects that result from the stationary scene.
The D2 frames come out at 720 x 480 and are very useful, but they don't
compare to the 4k x 3k film scans.
Here is a sample frame from Dubrovnik:
You can see in this stereogram that the stereo is preserved very nicely (with some careful matching of frames).
stereo footage
|
| register left-right, and record
\|/
V
stereo pairs
|
| depth-from-stereo
\|/
V
separate depth images
|
| combination of points, including tagging moving objects
\|/
V
colored point cloud
|
| removal or animation of moving objects
\|/
V
colored point cloud
|
| converted to triangles
\|/
V
triangle mesh
|
| custom display code
\|/
V
environment
Previously, I had been thinking in terms of extracting motion directly from the image
data. See below:
My code attempts to find the horizontal offset between these images by trying different values, taking the difference between the offset images, and minimizing over those results. Here is the difference with no offset:
And here is the difference from the best match found (also cropped):
It's an improvement, but not a close enough match to use as a mask for stationary/moving discrimination. I need to
I've now got images as close together as I want. Here is a poorly matching offset, just
so you can see what is going on:
And here are the same two images, but at a much better offset.
Note that the man, who is moving around, shows up clearly. A combination of multiple
frames would remove him quite effectively. Here is how I combine frames:

The difference between those two, with proper offset, looks like
That is brightened way up to be more visible. Note that nearer the edges the rotation is less well approximated by a translation, so the correspondence is worse. One of the advantages of having enormous angular resolution is that we can use just the part in the very middle if we like.
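The offset search and the brightened difference image can be sketched as follows. This is an illustration, not my actual code: it assumes grayscale images stored as lists of rows of integers in 0..255, and the function names and the `gain` parameter are hypothetical.

```python
def _overlap(width, offset):
    """Columns x for which both x and x + offset lie inside the image."""
    return max(0, -offset), min(width, width - offset)


def mean_abs_difference(left, right, offset):
    """Mean |left[y][x] - right[y][x + offset]| over the overlapping columns."""
    x0, x1 = _overlap(len(left[0]), offset)
    total = count = 0
    for row_l, row_r in zip(left, right):
        for x in range(x0, x1):
            total += abs(row_l[x] - row_r[x + offset])
            count += 1
    return total / count


def best_offset(left, right, max_offset):
    """Try every offset in [-max_offset, max_offset]; keep the best match.

    max_offset must be smaller than the image width.
    """
    candidates = range(-max_offset, max_offset + 1)
    return min(candidates, key=lambda d: mean_abs_difference(left, right, d))


def difference_image(left, right, offset, gain=1):
    """Cropped absolute difference at `offset`, brightened by `gain`."""
    x0, x1 = _overlap(len(left[0]), offset)
    return [[min(255, gain * abs(row_l[x] - row_r[x + offset]))
             for x in range(x0, x1)]
            for row_l, row_r in zip(left, right)]
```

Dividing by the number of overlapping pixels keeps large offsets, which compare fewer columns, from looking artificially good.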
Using that image as a mask gives us:


as the background and mover pixels respectively. We can then combine the two background images. We average them where they are both valid and use only the valid one where one is invalid. Where neither is valid, an inoffensive background color is chosen.
The "moving" pixels could then be shown on top of this, giving us the people and animals moving around but the background constant. It would look much better with more pictures averaged in, but unfortunately I designed my pipeline to pretty much only use two images at once.
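The masking and combination rules above are simple enough to write down directly. The sketch below assumes grayscale images as lists of rows, uses `None` to mark an invalid pixel, and takes a hypothetical `threshold` on the difference image; the fill value of 128 is just one possible inoffensive background color.

```python
def split_by_mask(image, diff, threshold):
    """Split an image into (background, movers) using a difference image.

    Pixels whose difference exceeds `threshold` count as movers; each
    layer holds None where the pixel belongs to the other layer.
    """
    background = [[p if d <= threshold else None for p, d in zip(row, drow)]
                  for row, drow in zip(image, diff)]
    movers = [[p if d > threshold else None for p, d in zip(row, drow)]
              for row, drow in zip(image, diff)]
    return background, movers


def combine_backgrounds(bg_a, bg_b, fill=128):
    """Average where both are valid, copy where one is, fill where neither."""
    def merge(a, b):
        if a is not None and b is not None:
            return (a + b) // 2
        if a is not None:
            return a
        if b is not None:
            return b
        return fill
    return [[merge(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(bg_a, bg_b)]
```

Extending `combine_backgrounds` to more than two frames would mean keeping a running sum and a count of valid samples per pixel, rather than merging pairwise.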
I had abandoned this approach in favor of detecting movement at the depth image level...
By the way, the other people using this data are Anselmo Lastra and Voicu Popescu. They are the ones working on extracting depth from the data. When it comes time for me to create image-based objects, I will either use their results or an existing depth-from-stereo library.
Famous last words! I tried to write this code myself instead, and I am
largely stuck. I understand that you can determine depth from disparity, and find
disparity by offsetting the left and right images until they match at a feature.
The first and second strips are the left and right images. The lower strips represent
the differences between left and right for a single scanline, offset left (top) to right
(bottom). Clearly, the best matches are at the centers of the Xs. Note on the right side
how the centers go downwards; since those points are farther away, that's exactly what we
want to see.
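The strips described above amount to a table of differences indexed by offset and x. Here is a sketch of building that table for one scanline and picking the best offset per pixel. It assumes a rectified pair in which a point at column x in the left image appears at column x - d in the right image (so larger d means nearer), which may not match the sign convention of my strips. The depth formula is the standard one for parallel cameras, and `focal_px` and `baseline` are hypothetical parameters.

```python
def scanline_differences(left_row, right_row, max_disparity):
    """table[d][x] = |left[x] - right[x - d]|, or None off the row's edge."""
    width = len(left_row)
    return [[abs(left_row[x] - right_row[x - d]) if x - d >= 0 else None
             for x in range(width)]
            for d in range(max_disparity + 1)]


def best_disparities(table):
    """For each x, the disparity with the smallest difference.

    Ties go to the smallest disparity, which is an arbitrary choice.
    """
    result = []
    for x in range(len(table[0])):
        scored = [(row[x], d) for d, row in enumerate(table)
                  if row[x] is not None]
        result.append(min(scored)[1])
    return result


def depth_from_disparity(disparity_px, focal_px, baseline):
    """Depth = focal length * baseline / disparity (parallel cameras)."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline / disparity_px
```

The tie-breaking rule in `best_disparities` is where this naive version falls apart: wherever the scanline has no texture, every offset scores about the same and the winner is essentially arbitrary.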
Unfortunately, not every pixel is a feature. You can see that for most of the x values
the difference is close to zero at every offset, and yet one offset must somehow be
preferred. Here is the least bad correspondence I was able to get:
You can see the general structure. But the pixels with no difference around them get
random quickly. I think the answer is some sort of feature pre-selection, and
interpolating between those points. That should work great for big planar objects
like buildings.
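One way this idea could look in code, as an untested-on-real-footage sketch: pick the pixels with a strong horizontal gradient as features, match only those, then interpolate disparity linearly between them. The gradient test, the `min_contrast` parameter and both function names are assumptions for illustration.

```python
def select_features(row, min_contrast):
    """Columns whose horizontal gradient magnitude is at least min_contrast."""
    return [x for x in range(1, len(row) - 1)
            if abs(row[x + 1] - row[x - 1]) >= min_contrast]


def interpolate_disparities(width, feature_disparities):
    """Fill a whole scanline from disparities known only at feature columns.

    feature_disparities maps column -> disparity. Values are interpolated
    linearly between neighbouring features and held constant outside them.
    """
    xs = sorted(feature_disparities)
    if not xs:
        raise ValueError("need at least one feature")
    out, j = [], 0
    for x in range(width):
        if x <= xs[0]:
            out.append(float(feature_disparities[xs[0]]))
        elif x >= xs[-1]:
            out.append(float(feature_disparities[xs[-1]]))
        else:
            while xs[j + 1] < x:      # advance to the segment containing x
                j += 1
            x0, x1 = xs[j], xs[j + 1]
            t = (x - x0) / (x1 - x0)
            d0, d1 = feature_disparities[x0], feature_disparities[x1]
            out.append((1 - t) * d0 + t * d1)
    return out
```

Linear interpolation is exactly right for a planar surface seen along a scanline only approximately, but it should be a far better guess than the random values the per-pixel search produces in flat regions.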
So, I am stuck at depth from stereo. That makes it hard to move on to combining the depth
views, while extracting motion.
What I hope to contribute on top of existing techniques is:
Conversion of an image file into another image file
is exactly the kind of thing parallelization is good for, since the work can be split
up evenly.
I have explored the OpenMP parallelization library, written some code using
it, and successfully achieved increased performance by utilizing multiple
processors on evans.
(See above.)
It is my belief that since we know the extent of the transformation between
successive frames (a rotation of one arc minute, 1/60 of a degree), we can detect
depth pixels that have moved by applying that rotation to the previous depth
image and looking for differences. This depends on having depth for the image.
I think it can also be done without the depth information, for two reasons. The
rotation is so small that it can be approximated by a translation of the projected
pixels. And the angle subtended by a feature is more important than its precise
depth, since a small change in left-right or up-down has more of an effect in
screen space than does a small change in back-front.
(See my discussion of interlacing issues for
more about this.)
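To get a feel for how good the translation approximation is, here is a small worked calculation. It assumes an ideal pinhole camera rotating about a vertical axis through its center of projection, and a made-up focal length of 3000 pixels; the real value for this footage is not something I have measured.

```python
import math

ARC_MINUTE = math.radians(1.0 / 60.0)   # rotation between successive frames


def rotated_column(x, focal_px, angle):
    """New column of the ray through column x after rotating by `angle`.

    x is measured in pixels from the image center.
    """
    return focal_px * math.tan(math.atan2(x, focal_px) + angle)


def translation_error(x, focal_px, angle):
    """True shift at column x minus the shift at the image center."""
    center_shift = focal_px * math.tan(angle)
    return (rotated_column(x, focal_px, angle) - x) - center_shift


if __name__ == "__main__":
    focal = 3000.0   # hypothetical focal length in pixels
    for x in (0, 500, 1000, 1500):
        shift = rotated_column(x, focal, ARC_MINUTE) - x
        error = translation_error(x, focal, ARC_MINUTE)
        print(f"x={x:5d}  shift={shift:.4f} px  error={error:.4f} px")
```

Under these assumptions the shift at the center is a bit under one pixel, and the error of treating the rotation as a pure translation grows roughly with the square of the distance from the center, which is why the middle of the frame matches best.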
Again, these are mathematical processes that should achieve good speedup from
parallelization, as the individual pixels are largely independent.
Most of the features in these images appear to be nearly planar, and could actually
be well-represented by large textured polygons. (The sandy ground on the market floor, for example,
is unlikely to be looked at closely enough to require modeling of the footprints.)
This would certainly outperform the image-based objects
that we display from laser data (such as the reading room), which are essentially enormous polygon soups.
Perhaps more aggressive surface simplification is needed.
If Andrei State et al. get the DPLEX working so that different processors can pump geometry through different pipes, I can take advantage of this, since I am writing the application from the ground up rather than modifying existing code, which has problems such as using GLUT. I would find this satisfying personally, as only toy applications currently run correctly in this manner.
We have some Timbuktu images showing on a seamlessly combined multi-projector setup, about 90 horizontal degrees in all. If I can get animated (if not 3D immersive) images showing on that it would be pretty cool.
I really think that if I could get some kind of depth images, the recombination and motion extraction part would work right on it. Depth from stereo is a fascinating problem but not, unfortunately, the one I originally wanted to address.
Additional references:
Did you remember to read the page with all the angles and diagrams and stuff?
To sum up, I still have to
Leaving: