Light-field videos: Part III. Google's DeepView


I've continued working on my light-fields project, and with some help I've had relative success, which I'm happy to publish here. Fair warning: I'm a 100% amateur, so expect plenty of errors and oddities, and the code isn't that clean either (lack of time, or too lazy?). I've published both a GitHub repo and a Colab notebook with the code to train the DeepView model, on both the Spaces dataset and the RealEstate10K (RE10K) dataset. The RE10K dataset is very large and needs pre-processing (downloading all the YouTube videos and extracting frames at certain timestamps), so I've also published the dataset in a more readily consumable format, as a set of 39 GitLab repos. Here's the first repo (change the number, up to 39, for the others).
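To give an idea of what that pre-processing involves: each RE10K camera file lists one frame per line, and (as far as I can tell from the released format; treat the exact field layout here as my assumption) each line holds a microsecond timestamp, four normalized intrinsics, two zero fields, and a row-major 3x4 world-to-camera pose. A minimal parsing sketch, plus a helper to turn the timestamp into an ffmpeg-style seek string for grabbing the right frame:

```python
def parse_re10k_line(line):
    """Parse one frame line of a RealEstate10K camera file.

    Assumed layout: timestamp (microseconds), 4 normalized intrinsics
    (fx fy cx cy), two zeros, then a row-major 3x4 world-to-camera
    pose -- 19 numbers in total.
    """
    vals = line.split()
    timestamp_us = int(vals[0])
    fx, fy, cx, cy = map(float, vals[1:5])
    pose = [float(v) for v in vals[7:19]]  # skip the two zero fields
    extrinsic = [pose[i * 4:(i + 1) * 4] for i in range(3)]
    return timestamp_us, (fx, fy, cx, cy), extrinsic


def timestamp_to_ffmpeg(ts_us):
    """Convert a microsecond timestamp to an HH:MM:SS.mmm seek string."""
    s, us = divmod(ts_us, 1_000_000)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{sec:02d}.{us // 1000:03d}"
```

With that, extracting a frame is just seeking into the downloaded video at `timestamp_to_ffmpeg(ts)` and saving a single frame.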

The viewer

Here's the viewer for the MPI for the first Spaces scene, after some training. I've used 200x200 px tiles and just 10 depth layers. NOTE: drag your mouse or click on the buttons to change the camera POV!
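For some context on what the viewer does with those 10 depth layers: an MPI is rendered by alpha-compositing the RGBA planes back to front with the standard "over" operator (the real renderer also homography-warps each plane into the target view first, which I'm skipping here). A minimal NumPy sketch of just the compositing step:

```python
import numpy as np

def composite_mpi(layers):
    """Back-to-front 'over' compositing of MPI layers.

    layers: array of shape (D, H, W, 4), RGBA in [0, 1], ordered from
    the farthest plane (index 0) to the nearest (index D-1).
    Returns an (H, W, 3) image.
    """
    out = np.zeros(layers.shape[1:3] + (3,), dtype=np.float32)
    for layer in layers:  # far to near
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out
```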


The first conclusion, and the most important one for me, is that the system is highly sensitive to the camera parameters. If the inferred camera positions/rotations aren't accurate enough, the model simply won't be able to create a working MPI. So my takeaway is that, if I want to create MPIs from images/videos recorded by myself or others, I need to work harder on establishing or inferring the camera positions/extrinsics. I already have some ideas about how to go about it: since the camera rig I'm using will sit on a planar surface, that's a strong prior, and since the cameras are fixed I could even calculate the extrinsics somewhat manually. Before doing that, I'm going to try that “manual” method to infer the extrinsics for the cameras of the first Spaces scene.
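To illustrate the "manual" idea: for a fixed rig on a known plane, each camera's extrinsic can be hand-built from its position and a look-at target. A sketch of such a helper (my own, using the OpenCV convention of +z forward and +y down; the conventions would have to match whatever the training code actually expects):

```python
import numpy as np

def look_at_extrinsic(cam_pos, target, up=(0.0, -1.0, 0.0)):
    """Build a 3x4 world-to-camera extrinsic [R | t] for a camera at
    cam_pos looking at target (OpenCV convention: +z forward, +y down).
    """
    cam_pos = np.asarray(cam_pos, dtype=np.float64)
    z = np.asarray(target, dtype=np.float64) - cam_pos
    z /= np.linalg.norm(z)                       # forward axis
    x = np.cross(np.asarray(up, dtype=np.float64), z)
    x /= np.linalg.norm(x)                       # right axis
    y = np.cross(z, x)                           # down axis
    R = np.stack([x, y, z])                      # rows = camera axes in world coords
    t = -R @ cam_pos                             # world-to-camera translation
    return np.hstack([R, t[:, None]])
```

With cameras on a plane, all the `cam_pos` values share one coordinate, which is exactly the strong prior mentioned above.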

However, apart from that, in my book I've been successful. I've trained the model and it works well enough both on the Spaces dataset it was trained on and on other datasets like RE10K that it was never trained for. Not bad for an amateur's first ML project!

Furthermore, there are a number of considerations or improvements that can be explored.

  • The tiling method for rendering higher-resolution MPIs works decently, but some borders between tiles are still visible. This improves with training, although it's more noticeable in some scenes than others. I wonder how this can be alleviated. One idea is to try a ‘raw’ approach of fixing it as a post-process step, with bundle adjustment methods or something like that. I also wonder whether this issue goes away, or becomes more prominent, when using smaller tiles (and more depths).
  • There's a window effect at the borders. Again, I wonder whether this goes away when using smaller tiles, or whether something is wrong with my code.
  • Right now, both at training and inference time, I'm using 10 depths and 4 cameras/positions to create the MPI. I've just started actually looking at the Spaces data, specifically at the camera extrinsics, and it looks like it may well be possible to improve the inferred camera positions. I wonder whether using more camera positions to create the MPI would introduce more noise than information, or whether I can clean up/improve the extrinsics and use them to improve the MPIs. Ideally, I want to use the information from all 16 (or however many) cameras at inference time.
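On the tile-border issue, one cheap post-process I might try is rendering the tiles with a few pixels of overlap and feather-blending them, so each output pixel is a weighted average of every tile that covers it. A sketch of that idea (mine, not from the DeepView paper):

```python
import numpy as np

def feather_blend_tiles(tiles, positions, out_hw, tile_hw, overlap):
    """Blend overlapping tiles with a linear feather to hide seams.

    tiles: list of (th, tw, C) arrays; positions: top-left (y, x) of
    each tile; adjacent tiles are assumed to overlap by `overlap` px.
    """
    th, tw = tile_hw
    # 1D ramps that fade in/out over the overlap region
    ramp_y = np.minimum(1.0, np.minimum(np.arange(th) + 1, th - np.arange(th)) / overlap)
    ramp_x = np.minimum(1.0, np.minimum(np.arange(tw) + 1, tw - np.arange(tw)) / overlap)
    weight = ramp_y[:, None] * ramp_x[None, :]
    acc = np.zeros(out_hw + (tiles[0].shape[-1],))
    wacc = np.zeros(out_hw + (1,))
    for tile, (y, x) in zip(tiles, positions):
        acc[y:y + th, x:x + tw] += tile * weight[..., None]
        wacc[y:y + th, x:x + tw] += weight[..., None]
    return acc / np.maximum(wacc, 1e-8)  # normalize by accumulated weight
```

Whether this beats simply training with smaller tiles remains to be seen, of course.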

This adventure has not ended yet, so expect a follow-up at some point in the future. I still need to get to a point where I can produce MPIs from my own images/videos. Also, I welcome comments and advice!
