We present a stand-alone implementation of our Merging Operator. This new repo allows any pair of monocular depth estimations to be used in our double estimation. This includes using separate networks for the base and high-res estimations, using networks not supported by this repo (such as MiDaS-v3), or using manually edited depth maps for artistic use. It should also be useful for scientists developing CNN-based MDE as a way to quickly apply double estimation to their own network. For more details please take a look here.
Input | Original result | After manual editing of base |
---|---|---|
Here is a visualization of the improvement gained by using LeReS instead of MiDaS.
RGB | Our method using MiDaS | Our method using LeRes (NEW!) |
---|---|---|
Use --max_res as an input argument to run.py, in combination with --Final, to limit the resolution of the results that our method generates.
We provide this parameter as a trade-off between run-time and resolution: it reduces the run-time when only a result up to a certain megapixel count is needed.
This parameter limits the larger dimension of the result, in pixels, while keeping the aspect ratio. For example, to generate results whose larger dimension is at most 2000 pixels, use the following:
python run.py --Final --max_res 2000 --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 0
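As a rough illustration of what such a limit means for the output size, the sketch below scales a width and height so that the larger side does not exceed a given bound while the aspect ratio is kept. The function name `limit_resolution` and the rounding behaviour are assumptions made for this example; the exact computation inside run.py may differ.

```python
def limit_resolution(width, height, max_res):
    """Scale (width, height) so the larger side is at most max_res.

    Hypothetical helper for illustration only; not part of run.py.
    """
    longer = max(width, height)
    if longer <= max_res:
        return width, height  # already within the limit, nothing to do
    scale = max_res / longer
    # Round to whole pixels and never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(limit_resolution(4000, 3000, 2000))  # (2000, 1500)
```

An image that is already smaller than the limit is returned unchanged, since the limit is an upper bound and not a target size.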
S. Mahdi H. Miangoleh*, Sebastian Dille*, Long Mai, Sylvain Paris, Yağız Aksoy. Main pdf, Supplementary pdf, Project Page.
We propose a method that can generate highly detailed high-resolution depth estimations from a single image. Our method is based on optimizing the performance of a pre-trained network by merging estimations in different resolutions and different patches to generate a high-resolution estimate.
Try our model easily on Colab:
We use existing monocular depth estimation networks to generate highly detailed estimations without re-training.
We achieve our results by getting several estimations at different resolutions. We then merge these into a structurally consistent high-resolution depth map followed by a local boosting to enhance the results and generate our final result.
Monocular depth estimation uses contextual cues such as occlusions or the relative sizes of objects to estimate the structure of the scene.
We will use a pre-trained MiDaS-v2 here, but our analysis with the SGR network also supports our claims.
When we feed the image to the network at different resolutions, some interesting patterns arise. At lower resolutions, many details in the scene are missing, such as birds in this example. At high resolutions, however, we start to see inconsistent overall structure, and this flat board gets significantly less flat. The advantage is that the network is able to generate high frequency details. This shows that there is a trade-off between structural consistency and high-frequency details with respect to input resolution.
We explain this behavior through two properties of convolutional neural networks: limited receptive field size and limited network capacity. The lack of high-frequency details at low resolutions is due to limited network capacity. A small network that generates the structure of a complex scene cannot also generate fine details.
The loss of structure at high resolutions comes from a limited receptive field size. The receptive field is the region around a pixel that contributes to the estimation at that pixel. It is set by the network configuration and training resolution, and effectively gets smaller as resolution increases. At a low resolution, every pixel can see the edges of the board, so the network judges that this is a flat wall. At a high resolution, however, some pixels do not receive any contextual information. This results in large structural inconsistencies.
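The shrinking of the effective receptive field can be made concrete with a little arithmetic. The sketch below assumes a fixed receptive field of 384 pixels, which is a placeholder value chosen for illustration and not a figure taken from this README, and reports what fraction of the image side a single pixel can see as the input resolution grows.

```python
def receptive_field_coverage(image_side, receptive_field=384):
    """Fraction of the image side spanned by a fixed-size receptive field.

    The default of 384 pixels is an assumed value for illustration.
    """
    return min(1.0, receptive_field / image_side)


for side in (384, 768, 1536, 3072):
    print(side, f"{receptive_field_coverage(side):.3f}")
```

Each doubling of the input resolution halves the share of the scene that contributes to a pixel, which is why context that was visible at low resolution can fall outside the window at high resolution.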
For any given image, we determine the highest resolution that will result in a consistent structure by making sure that every pixel has contextual information. For this purpose, we need the distribution of contextual cues in the image. We approximate contextual cues with a simple edge map.
The resolution where every pixel is at most a half receptive field size away from context edges is called R_0. When we increase the resolution any further, structural inconsistencies will arise but more details will be generated. When 20% of the pixels do not receive any context, we call this resolution R_20. Note that R_0 and R_20 depend on the image content!
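A minimal sketch of this search follows, assuming a binary edge map as the stand-in for contextual cues. Upscaling the image by a factor `s` is treated as shrinking the half receptive field to `receptive_field / (2 * s)` pixels in base-image coordinates. The function names, the square window, and the discrete list of candidate scales are all assumptions of this example and are not taken from the repository's code.

```python
import numpy as np


def context_free_fraction(edges, radius):
    """Fraction of pixels with no edge pixel within `radius` (square window)."""
    h, w = edges.shape
    # Integral image with a zero border: any window sum becomes four lookups.
    integral = np.zeros((h + 1, w + 1), dtype=np.int64)
    integral[1:, 1:] = edges.astype(np.int64).cumsum(0).cumsum(1)
    ys, xs = np.arange(h), np.arange(w)
    top, bottom = np.clip(ys - radius, 0, h), np.clip(ys + radius + 1, 0, h)
    left, right = np.clip(xs - radius, 0, w), np.clip(xs + radius + 1, 0, w)
    counts = (integral[bottom][:, right] - integral[top][:, right]
              - integral[bottom][:, left] + integral[top][:, left])
    return float((counts == 0).mean())


def max_scale(edges, receptive_field, threshold, scales):
    """Largest candidate scale whose context-free fraction is <= threshold."""
    best = None
    for s in sorted(scales):
        radius = int(receptive_field / (2 * s))  # half receptive field, base pixels
        if context_free_fraction(edges, radius) <= threshold:
            best = s
    return best


# Toy edge map: a single vertical edge in a 64x64 image.
edges = np.zeros((64, 64), dtype=bool)
edges[:, 32] = True
print(max_scale(edges, 64, 0.0, [1, 2, 4]))  # R_0-style search, threshold 0
print(max_scale(edges, 64, 0.5, [1, 2, 4]))  # a looser threshold allows a larger scale
```

With a threshold of 0 the search corresponds to R_0, and with a threshold of 0.2 it corresponds to R_20. Because the result depends on where the edges are, two images of the same size can end up with different resolutions.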
We can still go beyond R_0 by merging the high-frequency details of a higher-resolution estimate onto the structure of the base-resolution estimate. We call this Double Estimation. We train an image-to-image translation network to merge the low-resolution depth range of the base with the high-resolution details of the second estimate, without inheriting the structural inconsistencies of the high-res input. In fact, the network is so robust against low-frequency artifacts that we can even use R_20 as our high-resolution input.
Note that R_20 is bounded by the smoothest regions in the image, while there are image patches that could support a higher resolution. We choose candidate patches by tiling the image and discarding all patches without useful details (step 1). The remaining patches are expanded until their edge density matches that of the image (step 2). Finally, we merge a double estimation for each patch onto our R_20 result to generate our final result (step 3).
Step 1: Tile and discard | Step 2: Expand | Step 3: Merge |
---|---|---|
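Steps 1 and 2 can be sketched on a binary edge map as follows. This is an illustration under stated assumptions and not the repository's implementation: "without useful details" is approximated here as "edge density below that of the whole image", and the names `select_patches` and `expand_patch`, the tile size, the stride, and the growth step are all hypothetical.

```python
import numpy as np


def select_patches(edges, patch_size, stride):
    """Step 1 (sketch): tile the image and keep tiles at least as dense in edges as the image."""
    image_density = edges.mean()
    h, w = edges.shape
    kept = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            if edges[y:y + patch_size, x:x + patch_size].mean() >= image_density:
                kept.append((y, x, patch_size))
    return kept


def expand_patch(edges, y, x, size, step):
    """Step 2 (sketch): grow a square patch around its centre while it stays as dense as the image."""
    h, w = edges.shape
    image_density = edges.mean()
    cy, cx = y + size // 2, x + size // 2
    while True:
        new_size = size + 2 * step
        ny, nx = cy - new_size // 2, cx - new_size // 2
        if ny < 0 or nx < 0 or ny + new_size > h or nx + new_size > w:
            break  # the grown patch would leave the image
        if edges[ny:ny + new_size, nx:nx + new_size].mean() < image_density:
            break  # growing further would dilute the patch below the image density
        y, x, size = ny, nx, new_size
    return y, x, size


# Toy edge map: one detailed 4x4 region in an otherwise smooth 16x16 image.
edges = np.zeros((16, 16), dtype=bool)
edges[6:10, 6:10] = True
patches = select_patches(edges, patch_size=4, stride=4)
print(len(patches), "patches kept out of 16 tiles")
print(expand_patch(edges, 6, 6, 4, step=2))
```

Step 3, merging a double estimation for each selected patch onto the R_20 result, relies on the trained merge network and is not reproduced in this sketch.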
We provide the implementation of our method using MiDaS-v2, LeReS and SGRnet as the base. Note that MiDaS-v2 and SGRnet estimate inverse depth while LeReS estimates depth.
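Because the base networks differ in output convention, results from an inverse-depth network and a depth network cannot be compared side by side without a conversion. The sketch below shows one simple way to bring two maps into the same convention for a rough visual comparison. The helper names and the `eps` guard are assumptions of this example, and since such networks typically predict values only up to an unknown scale and shift, the conversion should not be read as recovering metric depth.

```python
import numpy as np


def invert_depth(values, eps=1e-6):
    """Convert depth to inverse depth (or back); eps avoids division by zero."""
    return 1.0 / np.maximum(values, eps)


def normalize(values):
    """Rescale a map to [0, 1] for visual comparison."""
    lo, hi = values.min(), values.max()
    if hi <= lo:
        return np.zeros_like(values, dtype=float)  # constant map
    return (values - lo) / (hi - lo)


depth = np.array([[1.0, 2.0], [4.0, 8.0]])  # toy depth map: larger means farther
inverse = normalize(invert_depth(depth))     # inverse depth: larger means closer
print(inverse)
```

After the conversion the nearest pixel has the largest value, matching the convention of networks that output inverse depth.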
Our mergenet model is trained using torch 0.4.1 and python 3.7 and is tested with torch<=1.8.
Download our mergenet model weights from here and put them in
./pix2pix/checkpoints/mergemodel/latest_net_G.pth
To use MiDaS-v2 or LeReS as base: Install dependencies as follows:
conda install pytorch torchvision opencv cudatoolkit=10.2 -c pytorch
conda install matplotlib
conda install scipy
conda install scikit-image
For MiDaS-v2, download the model weights from MiDaS-v2 and put them in
./midas/model.pt
Activate the environment, then run:
python run.py --Final --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 0
For LeReS, download the model weights from LeReS (Resnext101) and put them in the repository root:
./res101.pth
Activate the environment, then run:
python run.py --Final --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 2
To use SGRnet as base: Install dependencies as follows:
conda install pytorch=0.4.1 cuda92 -c pytorch
conda install torchvision
conda install matplotlib
conda install scikit-image
pip install opencv-python
Follow the official SGRnet repository to compile the syncbn module in ./structuredrl/models/syncbn. Download the model weights from SGRnet and put them in
./structuredrl/model.pth.tar
Activate the environment, then run:
python run.py --Final --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 1
Different input arguments can be used to generate the R_0 and R_20 results discussed in the paper. Set --depthNet to 0 for MiDaS-v2, 1 for SGRnet, or 2 for LeReS.
python run.py --R0 --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 0  # 0: MiDaS-v2, 1: SGRnet, 2: LeReS
python run.py --R20 --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 0  # 0: MiDaS-v2, 1: SGRnet, 2: LeReS
To generate the results with the OpenCV INFERNO colormap, use --colorize_results as in the sample below:
python run.py --colorize_results --Final --data_dir PATH_TO_INPUT --output_dir PATH_TO_RESULT --depthNet 0  # 0: MiDaS-v2, 1: SGRnet, 2: LeReS
Fill in the needed variables in the following MATLAB file and run it:
./evaluation/evaluatedataset.m
Navigate to dataset preparation instructions to download and prepare the training dataset.
python ./pix2pix/train.py --dataroot DATASETDIR --name mergemodeltrain --model pix2pix4depth --no_flip --no_dropout
python ./pix2pix/test.py --dataroot DATASETDIR --name mergemodeleval --model pix2pix4depth --no_flip --no_dropout
This implementation is provided for academic use only. Please cite our paper if you use this code or any of the models.
@INPROCEEDINGS{Miangoleh2021Boosting,
author={S. Mahdi H. Miangoleh and Sebastian Dille and Long Mai and Sylvain Paris and Ya\u{g}{\i}z Aksoy},
title={Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging},
booktitle={Proc. CVPR},
year={2021},
}
The "Merge model" code skeleton (./pix2pix folder) was adapted from the pytorch-CycleGAN-and-pix2pix repository.
For MiDaS, LeReS and SGR inference we used the scripts and models from MiDaS-v2, LeReS and SGRnet respectively (./midas, ./lib and ./structuredrl folders).
Thanks to k-washi for providing us with a Google Colaboratory notebook implementation.