Light-field videos: Part I

PUBLISHED ON MAY 24, 2773 / 5 MIN READ — LIGHT-FIELD, VIDEO, WUPI

Light-fields have become a little obsession of mine. It all started when I saw this back in 2770. I wrote a C++ program that let you view light-field images. I never published it and eventually forgot about it, because I saw inherent problems with it, the main one being that it only really worked for things at a certain distance.

Fast-forward to 2772: I found out about the LLFF project, which uses AI to estimate the distance map of images, generating a multi-plane image for each camera. That solves the main problem I saw back in 2770, and the code is public! However, in 2772 I was busy with another side project (the Wupi app), so it's only now, in 2773, that I've gone back to light-fields. In the meantime, some people at Google have started doing cool things with light-fields as well!

An interesting thing about the paper behind the LLFF repo is that they show that with, basically, 16 cameras and some AI, you can reconstruct a light-field as if it had been sampled at the Nyquist rate. From a business/technological perspective, my takeaway from watching light-field technology evolve over the years is that the same raw data yields more and more realistic light-field videos as the processing improves. That's really interesting, because it means I can start shooting videos for my light-field project now, and as the technology advances those videos will look better and better.

Unfortunately, there's nowhere I can buy a light-field video camera, and if there were, it would be quite expensive. There's also no publicly available software to process or view the footage. So I've started my very own quest to create both: I'll do the hardware and the software. I want a cheap, reliable light-field video camera, a pipeline to process its output, and a player to view/explore it.

Camera

If you're going for cheap, sometimes you have to sacrifice convenience for price. So I bought 16 Apeman A77 action cameras at about £40 each. They're capable of 4K video at a budget price, and they come with a remote control that can trigger all of them at once. The inconveniences I've run into are that the remote sometimes doesn't trigger every camera, and that the videos are recorded to each camera's own SD card, which I then have to remove and plug into the computer to read. It would be much nicer if I could download the videos over WiFi or Bluetooth, or better yet, if the cameras supported streaming. But it works and it's cheap, so for a first iteration it's more than enough!

I also discarded the option of webcams, because independent cameras scale far better and are more lightweight than 16 or more webcams attached to a single computer. Another point: I chose ultra-fast SD cards to avoid issues recording 4K video. What remains to be seen is whether that high resolution actually matters, since for now I'm downscaling the footage anyway. To play multiple videos at once I need to downscale, and because many cameras are combined to render/play a single “video”, it should be possible to upscale again from the multiple lower-res streams.
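To give an idea of the kind of preprocessing involved, here's a minimal sketch of the downscaling step using ffmpeg from Python. The file names and folder layout (cam00.mp4 … cam15.mp4 under raw/) are assumptions for illustration, not what my actual setup uses.

```python
# Minimal sketch of the downscaling step, assuming the raw 4K clips are named
# cam00.mp4 ... cam15.mp4 and that ffmpeg is on the PATH. Folder names are
# made up for illustration.
import subprocess
from pathlib import Path

SRC = Path("raw")          # hypothetical folder with the original 4K clips
DST = Path("downscaled")   # hypothetical output folder
DST.mkdir(exist_ok=True)

for i in range(16):
    src = SRC / f"cam{i:02d}.mp4"
    dst = DST / f"cam{i:02d}_640x360.mp4"
    # -vf scale=640:360 resizes every frame; encoding settings are left at
    # ffmpeg's defaults to keep the example short.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", "scale=640:360", str(dst)],
        check=True,
    )
```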

Cameras

Processing Pipeline

Back to the software. There are two parts: the processing pipeline and the player. The main idea is to use AI to estimate the distance map for each frame of each video, and then build an MPI (Multi-Plane Image) per frame, where each plane holds the image information for a certain depth. That's enough for the player to do its thing, but we need to encode the MPIs back into videos, as otherwise the light-field would be far too big to play. So I'll create as many videos as there are depths/planes in the MPI, and each video will hold that depth's plane for all cameras, tiled together in a grid. If each camera's plane has a downscaled resolution of 640x360 pixels and we have 16 cameras, an MPI video using a 4x4 grid will have a resolution of 2560x1440 pixels. Not too big, actually, considering the maximum resolution some Nvidia cards can decode is 4096x4096 px. But with 10 depths, the player will have to decode and process 10 of those streams in realtime.
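To make the grid packing concrete, here is a small Python/NumPy sketch of how one depth's planes from all 16 cameras could be tiled into a single 2560x1440 atlas frame. The array shapes and function names are my own illustration, not LLFF's API.

```python
# Sketch of the grid packing: for one depth, tile the 16 per-camera planes
# (640x360 each) into a single 2560x1440 atlas frame.
import numpy as np

GRID = 4                   # 4x4 cameras
PLANE_W, PLANE_H = 640, 360

def pack_depth_frame(planes: np.ndarray) -> np.ndarray:
    """planes: (16, 360, 640, 3) uint8 -> (1440, 2560, 3) uint8 atlas."""
    assert planes.shape == (GRID * GRID, PLANE_H, PLANE_W, 3)
    atlas = np.zeros((GRID * PLANE_H, GRID * PLANE_W, 3), dtype=planes.dtype)
    for cam, plane in enumerate(planes):
        row, col = divmod(cam, GRID)
        atlas[row * PLANE_H:(row + 1) * PLANE_H,
              col * PLANE_W:(col + 1) * PLANE_W] = plane
    return atlas

# One atlas per depth per frame; encoding each depth's sequence of atlases as
# a normal video gives the 10 per-depth MPI videos mentioned above.
```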

Back to the pipeline: I'm reusing the LLFF project for the first iteration. LLFF uses Colmap to calculate the position of each camera, then uses a trained AI to estimate the distance map, and from there it generates the MPIs, which are the output we'll use to create the MPI videos (and the metadata). Unfortunately, LLFF was conceived for still images, so using it naively will almost certainly introduce “jumps” in the estimated camera positions, as well as in the distance maps, from frame to frame. That's just a limitation of LLFF, though, and something we can (certainly) improve and fix later on. In fact, reading a bit about Colmap, it supports aligning models, which would probably solve the problem, at least for the camera positions. I fully expected this to happen, and my latest experiments confirm it, but I won't really know how bad it is until I actually wire up the player. But that's the beauty of it: I can keep improving the pipeline and player while reusing the same raw video.
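For reference, the per-frame driver loop looks roughly like this. The entry points shown (imgs2poses.py, imgs2mpis.py) are how I remember the LLFF repo's scripts; treat them, their arguments and the folder layout as assumptions, not a working recipe.

```python
# Rough per-frame driver around LLFF: pose estimation via Colmap, then MPI
# generation. Script names, arguments and paths are assumptions.
import subprocess
from pathlib import Path

FRAMES = Path("frames")    # hypothetical: frames/frame_0000/images/cam00.png ...
MPIS = Path("mpis")

for scene in sorted(FRAMES.iterdir()):
    out = MPIS / scene.name
    out.mkdir(parents=True, exist_ok=True)
    # 1) Colmap (driven by LLFF) estimates the 16 camera poses for this frame.
    subprocess.run(["python", "imgs2poses.py", str(scene)], check=True)
    # 2) The LLFF network turns the posed images into per-camera MPIs.
    subprocess.run(["python", "imgs2mpis.py", str(scene), str(out)], check=True)

# Because every frame is solved independently, the recovered poses and depth
# maps can "jump" between frames -- exactly the limitation described above.
```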

This is a frame from one of the videos I recorded:

Example Frame

The depth map:

Depth map

As you can see, the depth map is quite sketchy, so I'll really need to improve its quality.

Player

I haven't built the player yet, but I've been doing some tests. The main idea is to use the power of GPUs: Nvidia cards have specific hardware to accelerate video decoding, so I'm going to use that. A limitation I've found is that many video formats lack support for alpha channels, which in my case is quite important. For the first version I just won't use alpha channels. Alpha is useful for capturing reflections and semi-transparent objects, so this will be a limitation of the first version, but it can be fixed later, either by using a video format that supports alpha (challenging, since Nvidia's decoder only handles certain formats) or by encoding the alpha channel as another part of the video. I'll probably end up doing the latter, but for my first implementation I'm aiming for speed, not accuracy.
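To show what "encoding the alpha channel as another part of the video" could look like, here's a hedged NumPy sketch that packs RGB and alpha side by side in a single wider frame, so a codec with no alpha support can still carry it; the player would split the frame back after decoding. This is just one possible layout, not what I've settled on.

```python
# One possible layout for carrying alpha in a codec that doesn't support it:
# pack RGB and alpha side by side and let the player split them after decoding.
import numpy as np

def pack_rgba(rgba: np.ndarray) -> np.ndarray:
    """(H, W, 4) uint8 -> (H, 2*W, 3) uint8: RGB on the left, alpha on the right."""
    rgb = rgba[..., :3]
    alpha = np.repeat(rgba[..., 3:4], 3, axis=-1)   # replicate alpha to 3 channels
    return np.concatenate([rgb, alpha], axis=1)

def unpack_rgba(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_rgba, run by the player after the GPU decodes the frame."""
    h, w2, _ = packed.shape
    w = w2 // 2
    rgba = np.empty((h, w, 4), dtype=packed.dtype)
    rgba[..., :3] = packed[:, :w]
    rgba[..., 3] = packed[:, w:, 0]                 # any replicated channel works
    return rgba
```

The obvious costs of a layout like this are a doubled frame width and some compression bleed between the colour and alpha halves.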

Note: This article has been translated to Russian here
