
Deep ViT Features as Dense Visual Descriptors

Shir Amir, Yossi Gandelsman, Shai Bagon and Tali Dekel, ECCVW 2022 "WIMF" Best Spotlight Presentation (The Weizmann Inst. of Science, Berkeley AI Research) [paper][code][project][supplementary]

Intro & Overview

์ด ๋…ผ๋ฌธ์€ Pre-trained Vision Transformer (ViT)์—์„œ ์ถ”์ถœ๋œ ๊นŠ์€ ํŠน์ง•๋“ค(Deep Features)์„ ๋ฐ€์ง‘๋œ ์‹œ๊ฐ์  ๊ธฐ์ˆ ์ž (Dense Visual Descriptor)๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋Œ€ํ•ด ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ Self-supervised ViT ๋ชจ๋ธ์ธ DINO-ViT์—์„œ ์ถ”์ถœ๋œ ํŠน์ง•๋“ค์ด ๋ช‡ ๊ฐ€์ง€ ์ฃผ๋ชฉํ•  ๋งŒํ•œ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ–ˆ์Šต๋‹ˆ๋‹ค:
1. They encode powerful, well-localized semantic information (e.g., object parts) at high spatial granularity.
2. The encoded semantic information is shared across related, yet distinct, object categories.
3. The positional bias changes gradually through the layers of the model.

Building on these observations, the paper proposes zero-shot methodologies that effectively perform visual tasks such as co-segmentation, part co-segmentation, and semantic correspondence, without any additional training.
This review assumes familiarity with ViT and DINO-ViT, so basic background explanations are omitted.

Methodology

ViT as a Local Patch Descriptor.

ViT ์•„ํ‚คํ…์ฒ˜๋Š” ์ด๋ฏธ์ง€๋ฅผ n๊ฐœ์˜ ํŒจ์น˜๋กœ ๋‚˜๋ˆˆ ํ›„ n-dim space๋กœ ํ† ํฐํ™” ์‹œ์ผœ position embedding์„ ๋”ํ•ด input์œผ๋กœ ๋ฐ›์•„๋“ค์ด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ถ”๊ฐ€์ ์ธ [CLS] ํ† ํฐ์€ ์ด๋ฏธ์ง€์˜ global ํŠน์ง•์„ ํฌ์ฐฉํ•ฉ๋‹ˆ๋‹ค. initial ํ† ํฐ์„ธํŠธ (T0)T^0)๋Š” ์ด L๊ฐœ์˜ Transformer๋กœ ๋“ค์–ด๊ฐ€๋Š”๋ฐ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฐ์‚ฐ์„ ๊ฑฐ์นฉ๋‹ˆ๋‹ค.
$$\hat{T}^l=\text{MSA}(\text{LN}(T^{l-1}))+T^{l-1}, \quad T^l=\text{MLP}(\text{LN}(\hat{T}^l))+\hat{T}^l$$
• $T^l=[t^l_0,\cdots,t^l_n]$ are the output tokens of layer $l$.
• LN: Layer Normalization.
• MSA (Multi-head Self-Attention module): projects each token into Q, K and V.

$$q^l_i=W^l_q\cdot t^{l-1}_i,\quad k^l_i=W^l_k\cdot t^{l-1}_i,\quad v^l_i=W^l_v\cdot t^{l-1}_i$$
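To make the layer update concrete, here is a minimal PyTorch sketch of the two equations above. The module sizes (dim=384, 6 heads, as in ViT-S) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One Transformer encoder layer, mirroring the two equations above."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens):  # tokens: (B, n+1, dim), [CLS] + n patch tokens
        # T_hat^l = MSA(LN(T^{l-1})) + T^{l-1}
        x = self.ln1(tokens)
        t_hat = self.msa(x, x, x)[0] + tokens
        # T^l = MLP(LN(T_hat^l)) + T_hat^l
        return self.mlp(self.ln2(t_hat)) + t_hat
```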
For more details on the Transformer and ViT, see the following blog post:
The Illustrated Transformer

CNN vs. ViT

CNN features and ViT features represent two major approaches in deep-learning-based image processing, each with its own strengths and weaknesses for image recognition and analysis. Here we compare the two.
CNN Features
1. Local feature recognition: CNNs are strong at recognizing local patterns. Filters learn patterns from small regions of the image, and complex features are built up from them stage by stage.
2. Hierarchical structure: CNNs extract information hierarchically, from low-level to high-level features, starting from basic edges and progressing to increasingly complex object parts.
3. Robustness to deformation: CNNs are robust to image transformations (e.g., translation, rotation, scale changes), thanks to the spatial invariance introduced by pooling layers.
4. Efficient computation: because a CNN attends only to local regions of the image at a time, its computation can be more efficient than processing the entire image at once.
ViT Features
1. Global feature recognition: ViT recognizes features across the entire image, considering all parts simultaneously, which helps it capture global context.
2. Self-attention mechanism: ViT uses self-attention to learn relationships between different parts of the image, which is useful for capturing complex interactions within it.
3. Data hunger: ViT generally requires large amounts of data, because it needs many examples to train effectively.
4. High computational cost: ViT typically demands more computation than a CNN, because the self-attention mechanism involves expensive operations.
CNN Feature vs. ViT Feature
(Left) ViT architecture. (Right) (a) Comparison of ViT and CNN features; the features of each layer are visualized with PCA.
• Semantics vs. spatial granularity: (a) CNNs trade spatial resolution for deeper-layer semantic information; the deepest feature maps have very low resolution and therefore cannot provide well-localized semantic information. (b) ViT maintains the same spatial resolution across all layers, and its receptive field is the entire image at every layer: each token $t^l_i$ attends to every other token $t^l_j$. ViT features therefore provide fine-grained semantic information at higher spatial resolution.
• Representations across layers.
  ◦ (a) CNN-based features are well known to form a hierarchical representation: shallow layers capture edges, textures, and the like, while deep layers capture semantic information and high-level concepts.
  ◦ (b) In ViT, shallow layers mostly contain positional information, whereas in deeper layers the positional information fades and semantic information is captured. In figure (a), the deep features distinguish the dog's parts from the background, while the shallow features carry spatial information. Interestingly, the intermediate-layer features contain both positional and semantic information.
• Semantic information across super-classes.
  ◦ (a, top) Supervised ViT features are "noisier" than (b, bottom) self-supervised ViT features, which is easier to appreciate in the results below.

Properties of ViT's Features

์ €์ž๋Š” Supervised ๋ฐฉ์‹์˜ ViT์™€ Self-supervied ๋ฐฉ์‹์˜ DINO-ViT ์— ๋Œ€ํ•ด ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
5๊ฐœ ํด๋ž˜์Šค 50๊ฐœ์˜ ๋™๋ฌผ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด ViT Feauture๋ฅผ ์–ป์€ ํ›„ (์ตœ์ข… ์ถœ๋ ฅ์˜ key), t-SNE์‹œ๊ฐํ™” ํ•œ ๊ฒฐ๊ณผ.
์ด๋ฅผ ์‚ดํŽด๋ณด์ž๋ฉด, (b) DINO-ViT Feature๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ์ž„์—๋„ (!) ๊ฐ ํŒŒํŠธ์— ๋Œ€ํ•œ semantic similarity๋ฅผ ์ž˜ ํ‘œํ˜„ํ•œ ๋ฐ˜๋ฉด (c) supervised ViT์˜ ๊ฒฝ์šฐ ํด๋ž˜์Šค์— ๋Œ€ํ•œ similarity, ์ฆ‰ global ์ •๋ณด์— ์ง‘์ค‘ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
• Properties of ViT features
  ◦ At the final layer's output, the keys (K) provide better representations than the queries (Q) or values (V).
  ◦ At intermediate layers, K and Q contain more positional bias than V and the tokens.
  ◦ This can be seen in the ablation results, reproduced below.

The table shows that using the keys yields better results than using the queries, values, or tokens.
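Since the keys serve as the descriptors of choice, a minimal sketch of extracting them with a forward hook may be helpful. The `blocks[i].attn.qkv` module path follows the public DINO repository's ViT implementation, and the random input is a placeholder image:

```python
import torch

# Load the self-supervised DINO ViT-S/8 from torch hub
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').eval()
keys = {}

def save_keys(layer_idx):
    def hook(module, inputs, output):
        # qkv is a single Linear: output (B, n_tokens, 3*dim) -> q, k, v
        q, k, v = output.chunk(3, dim=-1)
        keys[layer_idx] = k.detach()
    return hook

for i, block in enumerate(model.blocks):
    block.attn.qkv.register_forward_hook(save_keys(i))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # dummy image (B, C, H, W)

dense_desc = keys[11][:, 1:]  # last layer's keys, [CLS] token dropped
```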

Experimental Results

๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ฐฉ๋ฒ•๋ก ์€ ๋‹ค์–‘ํ•œ ์‹œ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๊ฒ€์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๊ณต๋™ ๋ถ„ํ• , ๋ถ€๋ถ„ ๊ณต๋™ ๋ถ„ํ• , ์˜๋ฏธ์  ๋Œ€์‘ ์ž‘์—…์—์„œ ์ตœ์‹  ๊ฐ๋… ๋ฐฉ๋ฒ•๊ณผ ๋น„๊ตํ•˜์—ฌ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ์œผ๋ฉฐ, ํŠนํžˆ ๋น„๊ฐ๋… ๋ฐฉ๋ฒ•์— ๋น„ํ•ด ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋Š” ViT๊ฐ€ ๋‹จ์ˆœํžˆ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋” ๋ณต์žกํ•œ ์‹œ๊ฐ์  ์ดํ•ด ์ž‘์—…์—๋„ ์œ ์šฉํ•˜๊ฒŒ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

Part Co-segmentation

Part Co-segmentation์€ ๋ช‡์žฅ์˜ ์ด๋ฏธ์ง€๋“ค ์‚ฌ์ด์—์„œ ์œ ์‚ฌํ•œ ๋ถ€๋ถ„์„ ๋ถ„ํ• ํ•˜๋Š” task์ž…๋‹ˆ๋‹ค. ์ €์ž๋Š” ViT Feature์„ ์ด์šฉํ•ด ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋ ค๊ณ  ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ ‘๊ทผ ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
1. Clustering: extract descriptors from all images, gather them into a bag-of-descriptors, and run K-means clustering.
2. Voting: perform voting to select the salient clusters. For a patch $i$ of image $I$, let $\text{Attn}^I_i$ be its mean [CLS] attention and let $S^I_k$ be the set of patches of image $I$ belonging to cluster $k$. The saliency of segment $S^I_k$ is
$$\text{Sal}(S^I_k)= \frac{1}{|S^I_k|} \sum_{i\in S^I_k}\text{Attn}^I_i$$
Each segment then votes on the saliency of cluster $k$:
$$\text{Votes}(k)=\sum_I \mathbb{1}\left[\text{Sal}(S^I_k) \ge \tau\right]$$
3. If $\text{Votes}(k)$ exceeds a percentage $p$ of the images, the cluster is considered foreground. This completes the co-segmentation.
4. For part co-segmentation, the segmentation is refined with a multi-label CRF. The number of clusters is chosen with the elbow method.
(Left) Input. (Middle) Co-segmentation. (Right) Part co-segmentation.
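A minimal sketch of steps 1-3, assuming `descs` is a list of per-image (n_patches, dim) descriptor arrays and `cls_attn` a matching list of mean [CLS] attention values per patch; `k`, `tau`, and `p` are placeholder values, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def co_segment(descs, cls_attn, k=10, tau=0.2, p=0.5):
    # 1. Clustering: bag-of-descriptors over all images, then K-means
    labels = KMeans(n_clusters=k).fit_predict(np.concatenate(descs))
    votes = np.zeros(k)
    offset = 0
    for d, attn in zip(descs, cls_attn):
        img_labels = labels[offset:offset + len(d)]
        offset += len(d)
        for c in range(k):
            seg = attn[img_labels == c]          # patches of segment S_k^I
            # 2. Voting: Sal(S_k^I) = mean [CLS] attention over the segment
            if len(seg) > 0 and seg.mean() >= tau:
                votes[c] += 1
    # 3. Foreground = clusters voted for by at least a fraction p of images
    fg_clusters = np.where(votes >= p * len(descs))[0]
    return labels, fg_clusters
```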

Point Correspondences

This task finds matching points between two images. Here semantic information is less important and positional information more important, and the authors account for this as follows:
1. Positional bias: the descriptors must be position-aware, so features from intermediate layers are used. The choice of layer trades off positional against semantic information.
2. Binning: log-binning is applied to each spatial feature, aggregating information from neighboring spatial features so that context is incorporated into each descriptor.
3. "Best Buddies Pairs" (BBPs): matching points between the two images are found as mutual nearest neighbors between the two descriptor sets (a sketch of this criterion follows the list).
• $M=\{m_i\},\ Q=\{q_i\}$: the binned descriptor sets of images M and Q.
• $$\text{BB}(M,Q)=\{(m,q) \mid m\in M,\ q\in Q,\ \text{NN}(m,Q)=q \,\wedge\, \text{NN}(q,M)=m\}$$
4. Resolution increase: since ViT consumes the image as patches, spatial resolution is limited. To mitigate this, overlapping rather than non-overlapping patches are used, and the positional encodings are interpolated accordingly.
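The Best Buddies criterion itself reduces to a mutual-nearest-neighbor check. A minimal sketch, assuming cosine similarity between binned descriptors (the choice of metric here is an assumption):

```python
import numpy as np

def best_buddies(M: np.ndarray, Q: np.ndarray):
    """Return index pairs (m, q) that are mutual nearest neighbors."""
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    sim = Mn @ Qn.T                  # (n_m, n_q) cosine similarity matrix
    nn_m = sim.argmax(axis=1)        # NN(m, Q) for every m
    nn_q = sim.argmax(axis=0)        # NN(q, M) for every q
    # (m, q) is a Best Buddies Pair iff each is the other's nearest neighbor
    return [(m, int(nn_m[m])) for m in range(len(M)) if nn_q[nn_m[m]] == m]
```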
Comparison with the NBB method.
Leverage Deep ViT features to automatically detect semantically corresponding points between images from different classes, under significant variations in appearance, pose and scale.

Video Part Co-segmentation

๋”์šฑ ๋งŽ์€ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์—์„œ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Conclusion

Strengths

• Showed that deep features extracted from ViT encode well-localized semantic information at high spatial resolution.
• Demonstrated that zero-shot methodologies can perform a variety of visual tasks without additional data.
• Achieved performance competitive with supervised methods and superior to prior methods.

Weaknesses

• Data efficiency: ViT generally requires large amounts of data, since it needs many examples to train effectively. This work sidesteps the issue with lightweight zero-shot methodologies that apply ViT features directly, without extra training or data, but it is unclear whether this approach yields optimal results in every scenario.
• Generalization: the paper demonstrates consistent part segmentation across diverse domains, but whether the method works equally well in all domains or scenarios remains an open question. Performance may be especially uncertain in domains with scarce training data (e.g., MRI scans, satellite imagery).
• Dependence on the pre-trained model: the quality of the features, and thus the success of the method, depends on the quality of the pre-trained model.
• Computational cost: ViT's self-attention is computationally expensive, which may limit the method's range of application.
• Theoretical grounding: to obtain higher spatial resolution, overlapping patches are extracted and the positional encodings are interpolated accordingly. This works without additional training, but only empirical evidence is given that it behaves well across experiments.