In-Place Scene Labelling and Understanding with Implicit Scene Representation

  • Published: 17 Nov 2024

Comments • 9

  • @kunzhang7654 • 3 years ago

    Dear Dr. Zhi, great work, and congratulations on being accepted to ICCV 2021 as an oral presentation! I was trying to contact you by e-mail, but it seems that your address could not be reached. Could you provide the camera trajectories you used for the Replica dataset? Also, are there any plans to release the code? Thanks a lot, and looking forward to your reply!

    • @zhishuaifeng3342 • 3 years ago

      Hi Kun, thank you for your interest in our work. I am sorry, I have been busy writing my thesis. My email address should be working now; I am not sure if it was some weird server issue. If you cannot reach me via my Imperial email, you can also drop me a message at z.shuaifeng@foxmail.com. I will release the rendered Replica sequences after the upcoming thesis deadline, and sorry for the delay.

  • @kwea123 • 3 years ago

    Too much information on each slide, and the slides switch too quickly... it makes the viewer constantly pause the video to read...
    1. The pixel-wise and region-wise denoising results are counter-intuitive to me. With a 90% chance of corruption, the same 3D point has very little chance of being "consistent" across views. How can the model fuse information that is essentially random in each view? Region-wise denoising is much more plausible, because only a few images are perturbed, so the same chair has a higher probability of having the same label across views. The quantitative results for pixel-wise denoising are therefore intriguing: how can it be better than region-wise denoising despite having more noise? With 90% pixel noise I'd expect the chairs to be 90% wrong as well, resulting in far more noise than in the region-wise experiment...
    2. The results for super-resolution and label propagation are also confusing. Sparse labels with S=16 basically mean 1/256 ≈ 0.4% of the pixels per frame; in that case the ground class is likely to be dominant, and some small classes might not be sampled at all. Why is the mIoU better than for label propagation, where every class is sampled at least once, with 1% of the pixels?
    Did I misunderstand anything? Thank you

    • @zhishuaifeng3342 • 3 years ago

      Hi kwea123 (AI葵), thank you for your interest and feedback. I have also learned a lot from your NeRF tutorial videos, which are very helpful.
      I agree that the information in this video is a bit dense; we tried to strike a balance between video length and presentation experience. I could make a longer version for the project page so that people can follow the details more easily.

    • @zhishuaifeng3342 • 3 years ago

      About pixel-wise denoising:
      The performance on the pixel-wise denoising task is quite surprising at first glance, especially since some fine structures are well preserved. In that task, we randomly change the labels of a randomly selected 90% of the pixels in each training label image.
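      For concreteness, here is a minimal sketch of that corruption step (my own NumPy illustration, not our actual training code; `num_classes` and the uniform replacement are assumptions):

      ```python
      import numpy as np

      def corrupt_labels(label_map, num_classes, noise_ratio=0.9, seed=0):
          """Randomly re-assign the labels of ~`noise_ratio` of the pixels."""
          rng = np.random.default_rng(seed)
          noisy = label_map.copy()
          # Select ~90% of the pixels, independently for each frame.
          mask = rng.random(label_map.shape) < noise_ratio
          # Replace the selected pixels with uniformly random class ids.
          noisy[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
          return noisy
      ```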
      In my opinion, several factors make this possible:
      (1) Coherent consistency and smoothness within NeRF, together with the view-invariant nature of semantics, are the key.
      (2) The underlying geometry and appearance play a very important role, so pixels with similar texture and geometry tend to receive the same class. The photometric loss is important here as an auxiliary loss.
      I personally think the denoising task here acts like a "NeRF-CRF", given that a CRF also refines semantics by explicitly modelling similarity in geometry and appearance.
      (3) On average, 10% of the pixels per frame remain unchanged, and in addition a 3D position may have a corrupted label in one view but a correct label in another. I also tried 95% or even higher noise ratios, and as expected the fine structures become much harder to recover, with less accurate boundaries, etc.
      The quantitative results are not meant to show which task is easier or harder in any sense, but mainly to show that Semantic-NeRF has the ability to recover from noisy labels. Note that the evaluation is computed on full label frames, including chairs and all other classes.
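      As a quick sanity check of point (3), here is a toy Monte-Carlo simulation (my own illustration; the class count, view count, and uniform noise model are assumptions, not the paper's exact setup). With 90% uniform label noise, the true class still receives the single largest share of votes across views of the same 3D point:

      ```python
      import numpy as np

      # Toy check: corrupted votes are spread evenly over all classes, so the
      # ~10% of clean votes still make the true class the most frequent label.
      rng = np.random.default_rng(0)
      C, N, trials = 30, 50, 10_000  # classes, views per point, runs (illustrative)
      hits = 0
      for _ in range(trials):
          labels = np.zeros(N, dtype=int)                # true class id is 0
          corrupt = rng.random(N) < 0.9                  # ~90% of views corrupted
          labels[corrupt] = rng.integers(0, C, size=int(corrupt.sum()))
          hits += np.bincount(labels, minlength=C).argmax() == 0
      print(f"true class wins the plurality vote in {hits / trials:.1%} of runs")
      ```

      Semantic-NeRF does not vote explicitly, of course; the fusion happens implicitly through the shared scene representation, but the counting intuition is the same.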

    • @zhishuaifeng3342 • 3 years ago

      It is true that a larger scaling factor (x16, x32) carries a risk of missing tiny structures. And we indeed observe, for example, that the prediction of window frames (red) around blinds (purple) in SPx8 is more accurate than in SPx16. Again, the tables are not meant to compare the two tasks, but to show the capability of Semantic-NeRF.
      A better way to think about super-resolution and label propagation is how they sample the sparse/partial labels. Super-resolution (e.g., SPx16) sparsely decimates the label maps on a regular grid with a spacing of 16 pixels, while label propagation (LP) selects a "seed" pixel at random from each class in each frame.
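      A rough sketch of the two sampling patterns (my own illustration; the function names are made up) makes the difference in seed coverage concrete:

      ```python
      import numpy as np

      def super_resolution_mask(h, w, stride=16):
          """SPx`stride`: keep one labelled pixel per stride-by-stride grid cell."""
          mask = np.zeros((h, w), dtype=bool)
          mask[::stride, ::stride] = True
          return mask

      def label_propagation_mask(label_map, rng):
          """LP: keep a single randomly chosen 'seed' pixel per class per frame."""
          mask = np.zeros(label_map.shape, dtype=bool)
          for cls in np.unique(label_map):
              ys, xs = np.nonzero(label_map == cls)
              i = rng.integers(len(ys))
              mask[ys[i], xs[i]] = True
          return mask
      ```

      For example, on a 640x480 frame (a resolution chosen purely for illustration), SPx16 keeps 40 x 30 = 1200 seeds spread uniformly over the image (about 0.4% of the pixels), whereas LP keeps only one seed per visible class.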

    • @zhishuaifeng3342 • 3 years ago

      In SP, a class/instance covering more than a 16x16-pixel area is very likely to be sampled at least once (i.e., to have one or more seeds on it). Therefore I think the main difference is the coverage of seeds: SP spreads the seeds across each class, while LP learns from labels concentrated in a local neighbourhood.
      This is also one of the reasons why the prediction of the light (pink) on the ceiling (yellow) is of better quality in SP (Fig. 7 and 10) than in LP (Fig. 8): the appearance and geometry of the light and the ceiling are too similar for LP to interpolate between them, while the spread of seeds in SP helps.

    • @zhishuaifeng3342 • 3 years ago

      I hope this information and my understanding are helpful. If you have any further questions, please feel free to reach out via email.