Learning Depths of Moving People by Watching Frozen People

The goal of this work is to estimate dense depth maps in cases where both the camera and the people in the scene are freely moving. Multi-view stereo methods apply geometric constraints that do not hold for moving people, so they treat people as noise or ignore them entirely. To overcome this problem, we take a data-driven approach to predicting depth in human regions.

Data for learning the depth of moving people is difficult to collect. In this work we create and use a new dataset called MannequinChallenge. It contains thousands of YouTube videos of people imitating mannequins, frozen in natural poses, while a handheld camera tours the scene. The videos contain diverse scenes like this one; our dataset spans a large range of scenes, poses, ages, and numbers of people. Because the people are stationary, we can use structure from motion (SfM) and multi-view stereo (MVS) to recover depth. These depths are used as ground-truth supervision during training.

In our simplest model, we train a regression network to predict the MVS depth from the input RGB image. We can produce better results by including information from neighboring frames. The challenge is that during training people are stationary, but at inference time they move, and the input to the network must handle both cases. The additional input is depth computed from motion parallax: we compute optical flow between the target and a reference frame, then translate the flow into depth using the SfM camera poses. At inference, the depth computed for moving people is inaccurate, so we remove humans using a segmentation mask. The full model takes as input the RGB frame, the human segmentation mask, and the masked depth from parallax. Intuitively, the network learns to inpaint and refine the masked depth using information from the RGB image.

Our method successfully predicts depth for a large range of environments, poses, and lighting conditions in the MannequinChallenge dataset. We now apply our model to scenes where people are moving; compared to recent state-of-the-art learning-based methods, our results are more accurate and coherent over time.

Our depth maps support a variety of effects, such as synthetic defocus and focus pulls. Synthetic objects can also be inserted into the scene and properly occluded by moving people and the environment. We can also synthesize novel views of the scene, using nearby frames to fill in disocclusions in the new view. Left and right views can be generated from a monocular camera to produce a stereo video; although humans in the scene are moving, most human regions can be effectively inpainted using this technique.
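The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a purely horizontal camera translation between the two frames (so depth follows the classic stereo relation `depth = focal * baseline / disparity`, rather than a full triangulation from SfM poses), and the function and parameter names (`masked_parallax_depth`, `build_network_input`, `focal`, `baseline`) are hypothetical.

```python
import numpy as np

def masked_parallax_depth(flow_x, focal, baseline, human_mask, eps=1e-6):
    """Convert horizontal optical flow (disparity, in pixels) to depth.

    Simplified triangulation assuming a purely horizontal camera
    translation between the target and reference frames. Depth computed
    for moving people is unreliable, so human regions are zeroed out;
    the network later inpaints them from the RGB image.
    """
    disparity = np.maximum(np.abs(flow_x), eps)   # avoid divide-by-zero
    depth = focal * baseline / disparity          # stereo-like relation
    depth[human_mask] = 0.0                       # mask out human regions
    return depth

def build_network_input(rgb, human_mask, parallax_depth):
    """Stack the three inputs described above:
    RGB frame (H,W,3), binary human mask (H,W), masked depth (H,W) -> (H,W,5)."""
    return np.concatenate(
        [rgb,
         human_mask[..., None].astype(rgb.dtype),
         parallax_depth[..., None]],
        axis=-1,
    )

# Toy example on a 4x4 frame.
H, W = 4, 4
flow_x = np.full((H, W), 2.0)        # uniform 2-pixel disparity
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True                # pretend a person occupies the center
depth = masked_parallax_depth(flow_x, focal=500.0, baseline=0.1, human_mask=mask)
x = build_network_input(np.zeros((H, W, 3)), mask, depth)
print(depth[0, 0])                   # 500 * 0.1 / 2 = 25.0
print(x.shape)                       # (4, 4, 5)
```

The key design point carried over from the description is that the masked depth channel is only a hint: wherever the mask is set, the channel is zero and the network must rely on the RGB channels to fill in depth for people.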

10 thoughts on “Learning Depths of Moving People by Watching Frozen People”

  1. Can this, or is this going to, be converted into a real program without having to install all the add-ons for Python? Sorry, I am not very talented xD

  2. For a better result you could use 4 cameras horizontally and 4 cameras vertically, so that parallax from all sides can be obtained; that would make this work more effectively.

  3. Beautiful!
    All this with no synthetic data!
    * How fast does it run on a 1-megapixel image, for example?
    * I would like to see how the model works on a scene where a human is walking in a forest, with a strong wind moving everything around,
    or indoors with many non-human moving objects.
