## Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

This repository contains code to compute depth from a single image. It accompanies our [paper](https://arxiv.org/abs/1907.01341v3):

> Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
> René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun

and our [preprint](https://arxiv.org/abs/2103.13413):

> Vision Transformers for Dense Prediction
> René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimization.
The original model that was trained on 5 datasets (`MIX 5` in the paper) can be found [here](https://github.com/isl-org/MiDaS/releases/tag/v2).
The figure below shows an overview of the different MiDaS models; the bubble size scales with the number of parameters.
### Setup

1) Pick one or more models and download the corresponding weights to the `weights` folder (see the download example after this list):

   MiDaS 3.1
   - For highest quality: [dpt_beit_large_512](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt)
   - For moderately less quality, but better speed-performance trade-off: [dpt_swin2_large_384](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_large_384.pt)
   - For embedded devices: [dpt_swin2_tiny_256](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_tiny_256.pt), [dpt_levit_224](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_levit_224.pt)
   - For inference on Intel CPUs, OpenVINO may be used for the small legacy model: openvino_midas_v21_small [.xml](https://github.com/isl-org/MiDaS/releases/download/v3_1/openvino_midas_v21_small_256.xml), [.bin](https://github.com/isl-org/MiDaS/releases/download/v3_1/openvino_midas_v21_small_256.bin)

   MiDaS 3.0: Legacy transformer models [dpt_large_384](https://github.com/isl-org/MiDaS/releases/download/v3/dpt_large_384.pt) and [dpt_hybrid_384](https://github.com/isl-org/MiDaS/releases/download/v3/dpt_hybrid_384.pt)

   MiDaS 2.1: Legacy convolutional models [midas_v21_384](https://github.com/isl-org/MiDaS/releases/download/v2_1/midas_v21_384.pt) and [midas_v21_small_256](https://github.com/isl-org/MiDaS/releases/download/v2_1/midas_v21_small_256.pt)
2) Set up dependencies:

   ```shell
   conda env create -f environment.yaml
   conda activate midas-py310
   ```
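For example, the highest-quality MiDaS 3.1 checkpoint from step 1 can be fetched from the command line; a minimal sketch using `wget` (any of the other checkpoint URLs listed above downloads the same way):

```shell
mkdir -p weights
wget -P weights https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
```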
#### optional

For the Next-ViT model, execute

```shell
git submodule add https://github.com/isl-org/Next-ViT midas/external/next_vit
```

For the OpenVINO model, install

```shell
pip install openvino
```
### Usage

1) Place one or more input images in the folder `input`.

2) Run the model with

   ```shell
   python run.py --model_type <model_type> --input_path input --output_path output
   ```

   where `<model_type>` is chosen from [dpt_beit_large_512](#model_type), [dpt_beit_large_384](#model_type),
   [dpt_beit_base_384](#model_type), [dpt_swin2_large_384](#model_type), [dpt_swin2_base_384](#model_type),
   [dpt_swin2_tiny_256](#model_type), [dpt_swin_large_384](#model_type), [dpt_next_vit_large_384](#model_type),
   [dpt_levit_224](#model_type), [dpt_large_384](#model_type), [dpt_hybrid_384](#model_type),
   [midas_v21_384](#model_type), [midas_v21_small_256](#model_type), [openvino_midas_v21_small_256](#model_type).
   A concrete example command is shown after this list.

3) The resulting depth maps are written to the `output` folder.
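For example, to process the images in `input` with the model offering the better speed-performance trade-off (assuming `dpt_swin2_large_384.pt` has been downloaded to the `weights` folder):

```shell
python run.py --model_type dpt_swin2_large_384 --input_path input --output_path output
```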
#### optional

1) By default, inference resizes the height of the input images to the native size of the model so that they fit into the encoder. This
   size is given by the number in the model name (see the [accuracy table](#accuracy)). Some models support not just a single
   inference height but a range of heights, which can be selected with the additional command line argument `--height`
   (see the example after this list). Unsupported height values raise an error. Note that using this argument may
   reduce the model accuracy.

2) By default, inference keeps the aspect ratio of the input images when feeding them into the encoder if this is
   supported by the model (all models except Swin, Swin2 and LeViT). To resize to a square resolution instead,
   disregarding the aspect ratio while preserving the height, use the command line argument `--square`.
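For instance, the following commands run the BEiT<sub>L-512</sub> model at the reduced inference height of 384 that also appears in the [accuracy table](#accuracy), without and with square inputs (a sketch; other models may only support their native height):

```shell
python run.py --model_type dpt_beit_large_512 --input_path input --output_path output --height 384
python run.py --model_type dpt_beit_large_512 --input_path input --output_path output --height 384 --square
```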
#### via Camera

If you want the input images to be grabbed from the camera and shown in a window, omit the input and output paths
and choose a model type as shown above:

```shell
python run.py --model_type <model_type> --side
```

The argument `--side` is optional and causes both the input RGB image and the output depth map to be shown
side by side for comparison.
#### via Docker

1) Make sure you have installed Docker and the
   [NVIDIA Docker runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Installation-\(Native-GPU-Support\)).

2) Build the Docker image:

   ```shell
   docker build -t midas .
   ```

3) Run inference:

   ```shell
   docker run --rm --gpus all -v $PWD/input:/opt/MiDaS/input -v $PWD/output:/opt/MiDaS/output -v $PWD/weights:/opt/MiDaS/weights midas
   ```

   This command passes all of your NVIDIA GPUs through to the container, mounts the
   `input`, `output` and `weights` directories and then runs the inference.
#### via PyTorch Hub

The pretrained model is also available on [PyTorch Hub](https://pytorch.org/hub/intelisl_midas_v2/).
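The hub entry points can also be called directly from Python. A minimal sketch following the hub documentation (the entry point names `DPT_Large`, `DPT_Hybrid` and `MiDaS_small` are the ones listed there; the image path is a placeholder):

```python
import cv2
import torch

# Load a MiDaS model from PyTorch Hub (e.g. "DPT_Large"; "DPT_Hybrid" and "MiDaS_small" are also available).
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

# Load the matching input transform for the chosen model.
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform if "DPT" in model_type else midas_transforms.small_transform

# Read an image (placeholder path) and predict relative inverse depth at the input resolution.
img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img).to(device))
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()
```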
#### via TensorFlow or ONNX

See [README](https://github.com/isl-org/MiDaS/tree/master/tf) in the `tf` subdirectory.

Currently only supports MiDaS v2.1.

#### via Mobile (iOS / Android)

See [README](https://github.com/isl-org/MiDaS/tree/master/mobile) in the `mobile` subdirectory.

#### via ROS1 (Robot Operating System)

See [README](https://github.com/isl-org/MiDaS/tree/master/ros) in the `ros` subdirectory.

Currently only supports MiDaS v2.1. DPT-based models to be added.
### Accuracy

We provide a **zero-shot error** $\epsilon_d$ which is evaluated on 6 different datasets
(see the [paper](https://arxiv.org/abs/1907.01341v3)). **Lower error values are better**.
$\color{green}{\textsf{Overall model quality is represented by the improvement}}$ ([Imp.](#improvement)) with respect to
MiDaS 3.0 DPT<sub>L-384</sub>. The models are grouped by the height used for inference, whereas the square training resolution is given by
the number in the model name. The table also shows the **number of parameters** (in millions) and the
**frames per second** for inference at the training resolution (on an RTX 3090 GPU):
| MiDaS Model | DIW <br><sup>WHDR</sup> | Eth3d <br><sup>AbsRel</sup> | Sintel <br><sup>AbsRel</sup> | TUM <br><sup>δ1</sup> | KITTI <br><sup>δ1</sup> | NYUv2 <br><sup>δ1</sup> | $\color{green}{\textsf{Imp.}}$ <br><sup>%</sup> | Par.<br><sup>M</sup> | FPS<br><sup> </sup> |
|-------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| **Inference height 512** | | | | | | | | | |
| [v3.1 BEiT<sub>L-512</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt) | 0.1137 | 0.0659 | 0.2366 | **6.13** | 11.56* | **1.86*** | $\color{green}{\textsf{19}}$ | **345** | **5.7** |
| [v3.1 BEiT<sub>L-512</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt)$\tiny{\square}$ | **0.1121** | **0.0614** | **0.2090** | 6.46 | **5.00*** | 1.90* | $\color{green}{\textsf{34}}$ | **345** | **5.7** |
| | | | | | | | | | |
| **Inference height 384** | | | | | | | | | |
| [v3.1 BEiT<sub>L-512</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt) | 0.1245 | 0.0681 | **0.2176** | **6.13** | 6.28* | **2.16*** | $\color{green}{\textsf{28}}$ | 345 | 12 |
| [v3.1 Swin2<sub>L-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_large_384.pt)$\tiny{\square}$ | 0.1106 | 0.0732 | 0.2442 | 8.87 | **5.84*** | 2.92* | $\color{green}{\textsf{22}}$ | 213 | 41 |
| [v3.1 Swin2<sub>B-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_base_384.pt)$\tiny{\square}$ | 0.1095 | 0.0790 | 0.2404 | 8.93 | 5.97* | 3.28* | $\color{green}{\textsf{22}}$ | 102 | 39 |
| [v3.1 Swin<sub>L-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin_large_384.pt)$\tiny{\square}$ | 0.1126 | 0.0853 | 0.2428 | 8.74 | 6.60* | 3.34* | $\color{green}{\textsf{17}}$ | 213 | 49 |
| [v3.1 BEiT<sub>L-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_384.pt) | 0.1239 | **0.0667** | 0.2545 | 7.17 | 9.84* | 2.21* | $\color{green}{\textsf{17}}$ | 344 | 13 |
| [v3.1 Next-ViT<sub>L-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_next_vit_large_384.pt) | **0.1031** | 0.0954 | 0.2295 | 9.21 | 6.89* | 3.47* | $\color{green}{\textsf{16}}$ | **72** | 30 |
| [v3.1 BEiT<sub>B-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_base_384.pt) | 0.1159 | 0.0967 | 0.2901 | 9.88 | 26.60* | 3.91* | $\color{green}{\textsf{-31}}$ | 112 | 31 |
| [v3.0 DPT<sub>L-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3/dpt_large_384.pt) | 0.1082 | 0.0888 | 0.2697 | 9.97 | 8.46 | 8.32 | $\color{green}{\textsf{0}}$ | 344 | **61** |
| [v3.0 DPT<sub>H-384</sub>](https://github.com/isl-org/MiDaS/releases/download/v3/dpt_hybrid_384.pt) | 0.1106 | 0.0934 | 0.2741 | 10.89 | 11.56 | 8.69 | $\color{green}{\textsf{-10}}$ | 123 | 50 |
| [v2.1 Large<sub>384</sub>](https://github.com/isl-org/MiDaS/releases/download/v2_1/midas_v21_384.pt) | 0.1295 | 0.1155 | 0.3285 | 12.51 | 16.08 | 8.71 | $\color{green}{\textsf{-32}}$ | 105 | 47 |
| | | | | | | | | | |
| **Inference height 256** | | | | | | | | | |
| [v3.1 Swin2<sub>T-256</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_tiny_256.pt)$\tiny{\square}$ | **0.1211** | **0.1106** | **0.2868** | **13.43** | **10.13*** | **5.55*** | $\color{green}{\textsf{-11}}$ | 42 | 64 |
| [v2.1 Small<sub>256</sub>](https://github.com/isl-org/MiDaS/releases/download/v2_1/midas_v21_small_256.pt) | 0.1344 | 0.1344 | 0.3370 | 14.53 | 29.27 | 13.43 | $\color{green}{\textsf{-76}}$ | **21** | **90** |
| | | | | | | | | | |
| **Inference height 224** | | | | | | | | | |
| [v3.1 LeViT<sub>224</sub>](https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_levit_224.pt)$\tiny{\square}$ | **0.1314** | **0.1206** | **0.3148** | **18.21** | **15.27*** | **8.64*** | $\color{green}{\textsf{-40}}$ | **51** | **73** |
\* No zero-shot error, because these models were also trained on KITTI and NYU Depth V2\
$\square$ Validation performed at **square resolution**, either because the transformer encoder backbone of a model
does not support non-square resolutions (Swin, Swin2, LeViT) or for comparison with these models. All other
validations keep the aspect ratio. A difference in resolution limits the comparability of the zero-shot error and the
improvement, because these quantities are averages over the pixels of an image and do not account for the
advantage of more detail at higher resolution.\
Best values per column and validation height are shown in bold.
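For reference, a sketch of the standard metric definitions assumed here (see the [paper](https://arxiv.org/abs/1907.01341v3) for the exact evaluation protocol, which aligns the scale and shift of each prediction to the ground truth before computing the error): with aligned prediction $d_i$, ground truth $d_i^*$ and $M$ valid pixels, AbsRel is the mean absolute relative error, and the δ1 columns report the percentage of pixels whose depth ratio deviates by more than 1.25, so that lower is better in every column.

$$\mathrm{AbsRel} = \frac{1}{M}\sum_{i=1}^{M}\frac{|d_i - d_i^*|}{d_i^*},
\qquad
\delta_1\text{-error} = \frac{100}{M}\,\Big|\Big\{\, i : \max\Big(\frac{d_i}{d_i^*},\frac{d_i^*}{d_i}\Big) > 1.25 \,\Big\}\Big|$$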
#### Improvement

The improvement in the above table is defined as the relative reduction of the zero-shot error with respect to MiDaS v3.0
DPT<sub>L-384</sub>, averaged over the datasets. If $\epsilon_d$ is the zero-shot error for dataset $d$, then
the $\color{green}{\textsf{improvement}}$ is given by $100(1-(1/6)\sum_d\epsilon_d/\epsilon_{d,\rm{DPT_{L-384}}})$%.

Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible in the
improvement column (Imp.) of the table, because they would require an evaluation with respect to MiDaS v2.1 Large<sub>384</sub>
and v2.0 Large<sub>384</sub>, respectively, instead of v3.0 DPT<sub>L-384</sub>.
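As a concrete check, the Imp. value of the v3.1 BEiT<sub>L-512</sub> row (inference height 512) can be reproduced from the table entries; a minimal sketch in Python (note that the starred KITTI and NYUv2 values enter the average as well):

```python
# Zero-shot errors from the table above: DIW, Eth3d, Sintel, TUM, KITTI, NYUv2
beit_l_512 = [0.1137, 0.0659, 0.2366, 6.13, 11.56, 1.86]   # v3.1 BEiT_L-512, inference height 512
dpt_l_384 = [0.1082, 0.0888, 0.2697, 9.97, 8.46, 8.32]     # v3.0 DPT_L-384 reference

improvement = 100 * (1 - sum(e / e_ref for e, e_ref in zip(beit_l_512, dpt_l_384)) / 6)
print(f"{improvement:.0f}%")  # -> 19%, matching the Imp. column
```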
### Depth map comparison

Zoom in for better visibility.
### Speed on Camera Feed

Test configuration:
- Windows 10
- 11th Gen Intel Core i7-1185G7 3.00 GHz
- 16 GB RAM
- Camera resolution 640x480
- openvino_midas_v21_small_256

Speed: 22 FPS
### Changelog

* [Dec 2022] Released MiDaS v3.1:
  - New models based on 5 different types of transformers ([BEiT](https://arxiv.org/pdf/2106.08254.pdf), [Swin2](https://arxiv.org/pdf/2111.09883.pdf), [Swin](https://arxiv.org/pdf/2103.14030.pdf), [Next-ViT](https://arxiv.org/pdf/2207.05501.pdf), [LeViT](https://arxiv.org/pdf/2104.01136.pdf))
  - Training datasets extended from 10 to 12, now also including KITTI and NYU Depth V2 using the [BTS](https://github.com/cleinc/bts) split
  - Best model, BEiT<sub>Large 512</sub> with resolution 512x512, is on average about [28% more accurate](#accuracy) than MiDaS v3.0
  - Integrated live depth estimation from camera feed
* [Sep 2021] Integrated into [Hugging Face Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the [Gradio Web Demo](https://huggingface.co/spaces/akhaliq/DPT-Large).
* [Apr 2021] Released MiDaS v3.0:
  - New models based on [Dense Prediction Transformers](https://arxiv.org/abs/2103.13413) are on average [21% more accurate](#accuracy) than MiDaS v2.1
  - Additional models can be found [here](https://github.com/isl-org/DPT)
* [Nov 2020] Released MiDaS v2.1:
  - New model that was trained on 10 datasets and is on average about [10% more accurate](#accuracy) than [MiDaS v2.0](https://github.com/isl-org/MiDaS/releases/tag/v2)
  - New lightweight model that achieves [real-time performance](https://github.com/isl-org/MiDaS/tree/master/mobile) on mobile platforms
  - Sample applications for [iOS](https://github.com/isl-org/MiDaS/tree/master/mobile/ios) and [Android](https://github.com/isl-org/MiDaS/tree/master/mobile/android)
  - [ROS package](https://github.com/isl-org/MiDaS/tree/master/ros) for easy deployment on robots
* [Jul 2020] Added TensorFlow and ONNX code. Added [online demo](http://35.202.76.57/).
* [Dec 2019] Released a new version of MiDaS; the new model is significantly more accurate and robust
* [Jul 2019] Initial release of MiDaS ([Link](https://github.com/isl-org/MiDaS/releases/tag/v1))
### Citation

Please cite our paper if you use this code or any of the models:

```
@article{Ranftl2022,
    author  = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
    title   = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year    = {2022},
    volume  = {44},
    number  = {3}
}
```

If you use a DPT-based model, please also cite:

```
@article{Ranftl2021,
    author  = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
    title   = {Vision Transformers for Dense Prediction},
    journal = {ICCV},
    year    = {2021}
}
```
### Acknowledgements

Our work builds on and uses code from [timm](https://github.com/rwightman/pytorch-image-models) and [Next-ViT](https://github.com/bytedance/Next-ViT).
We'd like to thank the authors for making these libraries available.

### License

MIT License