Zero-shot Image-to-Image Translation

See also: pix2pix-zero: Zero-shot Image-to-Image Translation paper review (junbuml.ee)

Zero-shot Image-to-Image Translation

G Parmar, K Kumar Singh, R Zhang, Y Li, J Lu, JY Zhu (Adobe Research, Carnegie Mellon University)
SIGGRAPH, 2023
[paper][code]

Intro & Overview

• The paper proposes a method for performing image-to-image translation without any training.

• Advantages: training-free, prompt-free

• Disadvantages:

Methodology

• Zero-shot image-to-image translation proceeds as follows.

Inversion: invert the source image into noise (with autocorrelation regularization).

Finding direction: extract CLIP embeddings for FROM : TO (cat : dog) and use them to find the edit vector.

Preserve content: use cross-attention to keep the content (minimize changes to the image).

• Conditional GAN distillation: the diffusion model is distilled into a GAN so that images can be generated faster than with diffusion sampling.

1. Deterministic Inversion.

Inversion is the task of finding a noise map x_{inv} from which x_0 can be reconstructed. It is done simply by running the reverse process backwards. Because DDPM is stochastic, the deterministic DDIM is used instead. The DDIM reverse process is:

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted }x_0} + \underbrace{\sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t,t)}_{\text{direction pointing to }x_t} + \underbrace{\sigma_t z}_{\text{random noise}}, \quad z \sim \mathcal{N}(0, I)

Setting \sigma_t = 0 makes the step deterministic. Running the update in the opposite direction (from t to t+1) and adding the text condition c gives the inversion step:

\begin{equation}x_{t+1}=\sqrt{\bar{\alpha}_{t+1}} f_\theta\left(x_t, t, c\right)+\sqrt{1-\bar{\alpha}_{t+1}} \epsilon_\theta\left(x_t, t, c\right)\end{equation}

Here f_\theta(x_t, t, c) = \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t, c)\right) / \sqrt{\bar{\alpha}_t} is the predicted x_0.
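To make the update concrete, here is a minimal NumPy sketch of the deterministic step. It is not the authors' code: the schedule `abar`, the noise predictor `toy_eps_model`, and all function names are hypothetical stand-ins, and the text condition c is omitted for brevity.

```python
import numpy as np

def predict_x0(x_t, eps, abar_t):
    """f_theta: the clean image implied by x_t and the predicted noise."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)

def ddim_step(x_t, eps, abar_t, abar_target):
    """Deterministic DDIM update (sigma = 0) from level abar_t to abar_target.

    abar_target < abar_t adds noise (inversion, t -> t+1);
    abar_target > abar_t removes noise (sampling, t -> t-1).
    """
    x0_pred = predict_x0(x_t, eps, abar_t)
    return np.sqrt(abar_target) * x0_pred + np.sqrt(1.0 - abar_target) * eps

def ddim_invert(x0, eps_model, abar):
    """Walk x0 up the noise schedule; abar must be decreasing."""
    x = x0
    for t in range(len(abar) - 1):
        eps = eps_model(x, t)  # noise predicted at the current step
        x = ddim_step(x, eps, abar[t], abar[t + 1])
    return x

# Toy setup: a made-up schedule and a made-up noise predictor.
abar = np.linspace(0.999, 0.01, 50)
toy_eps_model = lambda x, t: 0.1 * x
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))
x_inv = ddim_invert(x0, toy_eps_model, abar)
```

Given the same predicted noise, one step is exactly reversible. With a real network the noise is re-predicted at every step, so the round trip is only approximate.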

The problem is that a noise map inverted this way does not follow the statistical properties of uncorrelated Gaussian white noise. Gaussian white noise must have no correlation between any pair of random locations, and every location must have zero mean and unit variance. The autocorrelation function of Gaussian white noise is therefore a Kronecker delta function.

The authors therefore introduce an autocorrelation objective to guide the inversion process.

\text {Autocorrelation Objective } \mathcal{L}_{auto} =\mathcal{L}_{\text{pair}} + \lambda \mathcal{L}_{\text{KL}}

• \mathcal{L}_{\text{pair}}: pairwise regularization term

\begin{equation}\mathcal{L}_{\text {pair }}=\sum_p \frac{1}{S_p^2} \sum_{\delta=1}^{S_p-1} \sum_{x, y, c} \eta_{x, y, c}^p\left(\eta_{x-\delta, y, c}^p+\eta_{x, y-\delta, c}^p\right)\end{equation}

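Let's break this down. For each pyramid level p, the term is the sum of squares of the autocorrelation coefficients over the offsets \delta, normalized by the noise map size S_p. Here \delta is the offset and \eta^p_{x,y,c} \in \mathbb{R} is the value of the level-p noise map at spatial location (x, y) and channel c. This technique was used in StyleGAN2. The authors made a few modifications to the idea.

• The offset \delta is sampled randomly instead of being fixed to 1, which improves the propagation of long-range information.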

• \mathcal{L}_{\text{KL}}: KL divergence at individual pixel locations.

StyleGAN2 enforces the zero-mean, unit-variance criterion strictly via normalization, but in diffusion this led to divergence. The authors therefore impose it with a KL divergence term, as used in VAEs.

This yields an inverted noise map whose noise has been regularized.
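The two terms can be sketched in NumPy as below. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the pyramid is built by 2x2 average pooling down to an assumed minimum size of 8, one random offset is drawn per level, each autocorrelation coefficient is squared before summing (following the "sum of squares" description), and the KL term uses the mean and variance of the whole map as a simplification.

```python
import numpy as np

def avg_pool_2x2(noise):
    """Halve H and W of a (C, H, W) map; times 2 keeps unit variance."""
    c, h, w = noise.shape
    pooled = noise.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return pooled * 2.0

def pair_loss(noise, rng, min_size=8):
    """Squared autocorrelation at a random offset, summed over pyramid levels."""
    loss = 0.0
    level = noise
    while True:
        size = level.shape[-1]
        delta = int(rng.integers(1, size))  # random offset in [1, size - 1]
        for axis in (1, 2):  # shift along each spatial axis, circularly
            shifted = np.roll(level, delta, axis=axis)
            coeff = (level * shifted).mean(axis=(1, 2))  # one per channel
            loss += float((coeff ** 2).sum())
        if size <= min_size:
            break
        level = avg_pool_2x2(level)
    return loss

def kl_loss(noise, eps=1e-7):
    """Closed-form KL( N(mu, var) || N(0, 1) ) using whole-map statistics."""
    mu, var = noise.mean(), noise.var()
    return float(0.5 * (var + mu ** 2 - 1.0 - np.log(var + eps)))

def auto_loss(noise, rng, lam=1.0):
    """L_auto = L_pair + lambda * L_KL (lam is an arbitrary default here)."""
    return pair_loss(noise, rng) + lam * kl_loss(noise)

rng = np.random.default_rng(0)
white = rng.standard_normal((4, 64, 64))  # close to ideal white noise
flat = np.ones((4, 64, 64))               # perfectly correlated map
loss_white = auto_loss(white, rng)
loss_flat = auto_loss(flat, rng)
```

White noise should score near zero on both terms, while the constant map is penalized heavily by both, which is the behavior the regularizer is meant to encourage.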

2. Finding Editing direction.

The edit direction is found using prompts and CLIP.

Prompt generation: many sentences for the source and target words are generated with GPT-3 or built from templates.

Because a contextual text embedding model gives the same word a different embedding in each sentence, the embeddings are averaged over many sentences.

CLIP distance: \Delta c_{edit} is computed as the mean difference between the CLIP embedding features of the two sets of prompts.

Sampling: the image is sampled with c+\Delta c_{edit} injected as the condition.
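The mean-difference step is simple enough to sketch directly. The arrays below are random stand-ins for text-encoder outputs, and the shapes and the name `edit_direction` are assumptions made for illustration; no real CLIP model is called.

```python
import numpy as np

def edit_direction(source_embs, target_embs):
    """Mean target embedding minus mean source embedding.

    Both inputs have shape (num_sentences, num_tokens, dim).
    """
    return target_embs.mean(axis=0) - source_embs.mean(axis=0)

# Random stand-ins for the embeddings of "cat" and "dog" sentences.
rng = np.random.default_rng(0)
cat_embs = rng.standard_normal((16, 8, 32))
dog_embs = rng.standard_normal((16, 8, 32))

delta_c_edit = edit_direction(cat_embs, dog_embs)

c = rng.standard_normal((8, 32))  # stand-in for the source prompt embedding
c_edit = c + delta_c_edit         # condition used for the edited sample
```

Averaging over sentences first means a single unusual sentence has little effect on the resulting direction.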

3. Edit using Cross-Attention Guidance

First, naively sampling with c+\Delta c_{edit} changes the image far too much (bottom right of the figure). The sample should keep the domain-independent features of the source image, such as structure, background, and color. To solve this, the authors design cross-attention guidance, which has two stages.

The reference image (512×512) is DDIM-inverted to obtain the noise \mathbf x_{inv}.

This noise is then sampled with the original prompt c (reconstruction).

M^{\text{ref}}_t: this gives the cross-attention map M^{\text{ref}}_t between the prompt c and the sample, at each timestep.
Because this map is tied to structure, it becomes the target to preserve.

M^{\text{edit}}_t: next, the cross-attention map M^{\text{edit}}_t is obtained using the edit direction c+\Delta c_{edit}.

\mathcal L_{\text{xa}}: since the two maps should match, the cross-attention loss \mathcal L_{\text{xa}} is applied.

\mathcal{L}_{\text{xa}} = \| M^{\text{edit}}_t - M^{\text{ref}}_t \|_2

According to the authors, this loss helps preserve structure.
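As a rough illustration of what is being compared, the sketch below builds toy cross-attention maps with standard scaled dot-product attention and measures their L2 distance. The projection matrices, feature shapes, and function names are all hypothetical. A loss like this would normally be minimized with respect to the noisy latent using automatic differentiation, which this NumPy sketch does not attempt.

```python
import numpy as np

def cross_attention_map(pixel_feats, text_emb, w_q, w_k):
    """Softmax attention of each pixel (row) over the text tokens (columns)."""
    q = pixel_feats @ w_q  # (num_pixels, d)
    k = text_emb @ w_k     # (num_tokens, d)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def xa_loss(m_edit, m_ref):
    """L2 (Frobenius) distance between the edited and reference maps."""
    return float(np.linalg.norm(m_edit - m_ref))

rng = np.random.default_rng(0)
pixel_feats = rng.standard_normal((16, 8))    # 16 pixels, 8 features each
c = rng.standard_normal((5, 12))              # 5 tokens, 12-dim embedding
delta_c_edit = rng.standard_normal((5, 12))   # stand-in edit direction
w_q = rng.standard_normal((8, 4))
w_k = rng.standard_normal((12, 4))

m_ref = cross_attention_map(pixel_feats, c, w_q, w_k)
m_edit = cross_attention_map(pixel_feats, c + delta_c_edit, w_q, w_k)
loss = xa_loss(m_edit, m_ref)
```

The loss is zero only when the edited condition attends to the image exactly as the original prompt did, which is why driving it down keeps the layout of the reference image.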

It is a very simple paper.

The paper also presents a way to speed up generation using a GAN, but that part is omitted here.
