INADE: Diverse Semantic Image Synthesis via Probability Distribution Modeling

"Diverse Semantic Image Synthesis via Probability Distribution Modeling" Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu, Bin Liu, Gang Hua and Nenghai Yuar, CVPR, 2021 paper | project page | code

Related Works

SPADE (2019, 03)

paper | project | code | demo
Activation value ($\gamma, \beta$: learnable parameters)
SPADE: SPatially-Adaptive DEnormalization
AdaIN: multiplies the whole feature map by the same $\gamma$ and $\beta$; SPADE: applies $\gamma$ and $\beta$ differently per pixel of the feature map

Loss

LS-GAN loss (pix2pixHD) → Hinge loss term
KL-div loss $\mathcal{L}_{KLD}=\mathcal{D}_{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)$, weight = 0.05
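The KLD term above has a well-known closed form when $q(\mathbf{z}|\mathbf{x})$ is a diagonal Gaussian and $p(\mathbf{z})$ is a standard normal. A minimal sketch (the function name and the 0.05 weighting follow the note above; the inputs are illustrative):

```python
import numpy as np

def kld_loss(mu, logvar, weight=0.05):
    """KL divergence D_KL(q || p) for q = N(mu, diag(exp(logvar))), p = N(0, I).
    Closed form: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar)."""
    return weight * 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# When q matches the prior exactly, the penalty vanishes.
mu = np.zeros(8)
logvar = np.zeros(8)
print(kld_loss(mu, logvar))  # 0.0
```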

CLADE (2020, 04)

See the separate paper review.

INADE (2021, 03)

3. Method

semantic mask $\mathbf{m}\in\mathbb{L}_p^{\text{H}\times\text{W}}$ $\rightarrow$ photo-realistic image $\mathbf{o}\in\mathbb{R}^{3\times\text{H}\times\text{W}}$
where
semantic label $l^p\in\mathbb{L}_p=\{1,2,\cdots,L^p\}$
semantic label mapping $l^m=\mathcal{G}(l^p)$

3.1. Conditional Normalization

Similar to Batch Normalization
activation tensor to the $i$-th normalization layer $\mathbf{X}^i\in\mathbb{R}^{C^i\times H^i\times W^i}$ (channels $C$, height $H$, width $W$)

Step 1) Normalization step

Step 2) Modulation step

$Y^i_{k,x,y}=\gamma^i_{k,x,y}X^i_{k,x,y}+\beta^i_{k,x,y}$, where the learned modulation parameters $\{\gamma^i,\beta^i\}\in\mathbb{R}^{C^i\times H^i\times W^i}$
In conditional normalization (e.g., AdaIN), the modulation parameters are learned from an extra condition.
In semantic image synthesis (e.g., SPADE), the modulation parameters are conditioned on the semantic mask.
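The two steps above can be sketched together. This is an illustrative numpy version (the function name and random stand-ins for the mask-predicted parameters are assumptions, not the paper's code):

```python
import numpy as np

def spatially_adaptive_modulation(x, gamma, beta, eps=1e-5):
    """BatchNorm-style normalization per channel over (N, H, W),
    then element-wise modulation with spatially varying gamma/beta."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    std = x.std(axis=(0, 2, 3), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # AdaIN would broadcast gamma/beta of shape (N, C, 1, 1);
    # SPADE predicts a full (N, C, H, W) gamma/beta from the semantic mask
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8, 8))
gamma = rng.standard_normal((2, 4, 8, 8))  # stand-in for mask-predicted params
beta = rng.standard_normal((2, 4, 8, 8))
y = spatially_adaptive_modulation(x, gamma, beta)
```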

3.2. Variational Modulation Model

Semantic-conditioned modulation succeeded in mitigating the "wash-out" of semantic information caused by repeated normalization.
However, semantic-level / instance-level image generation remained problematic, because image diversity was conditioned only on the semantic map and on global randomness.
Existing instance-level image generation methods (Pix2pixHD, Panoptic-based Image Synthesis) attended only to instance boundaries, not to diversity or realism.
Consequently, lacking instance conditioning, all instances lacked diversity.

Key to Instance-level Diversity

Uniform semantic-level distribution: keeps semantic-level characteristics deterministic
Instance-level randomness: allows diversity within the semantic distribution
To this end, the modulation parameters are modeled not as discrete values but as a parametric probability distribution for each semantic class.

Variational Modulation Model

$i$-th channel depth $C^i$
semantic category $l^m\in\mathbb{L}^m$
distribution transformation parameters of $\gamma,\beta$: $\{\mathbf{a}^i_\gamma,\mathbf{a}^i_\beta,\mathbf{b}^i_\gamma,\mathbf{b}^i_\beta\}\in\mathbb{R}^{L^m\times C^i}$
stochastic noise matrices $\{\mathbf{N}^i_\gamma,\mathbf{N}^i_\beta\}\in\mathbb{R}^{L^p\times C^i}$, sampled from the same distribution
the corresponding modulation parameters:
$$\gamma^i[l^p]=\mathbf{a}^i_\gamma[\mathcal{G}(l^p)]\otimes\mathbf{N}^i_\gamma[l^p]+\mathbf{a}^i_\beta[\mathcal{G}(l^p)]$$
$$\beta^i[l^p]=\mathbf{b}^i_\gamma[\mathcal{G}(l^p)]\otimes\mathbf{N}^i_\beta[l^p]+\mathbf{b}^i_\beta[\mathcal{G}(l^p)]$$
$\otimes$: element-wise multiplication; $[\cdot]$: accesses the vector from a matrix in row-major order
Unpacking the equations:
This is the process of computing the modulation parameters $\gamma,\beta$ for an instance label $l^p$ belonging to each semantic label.
First, for the semantic label $l^m=\mathcal{G}(l^p)$ that instance $l^p$ belongs to:
Scaling: multiply by the learned $\{\mathbf{a}^i_\gamma,\mathbf{b}^i_\gamma\}\in\mathbb{R}^{L^m\times C^i}$,
Randomness: applied via the random noise $\{\mathbf{N}^i_\gamma,\mathbf{N}^i_\beta\}\in\mathbb{R}^{L^p\times C^i}$, then
Translation: add the learned $\{\mathbf{a}^i_\beta,\mathbf{b}^i_\beta\}\in\mathbb{R}^{L^m\times C^i}$ to complete the modulation.
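The scale-noise-translate steps above can be written directly in numpy. This is a minimal sketch: the sizes and the random stand-ins for the learned parameters are illustrative, but the index-then-multiply-then-add structure follows the equations in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
Lm, Lp, Ci = 3, 5, 8           # semantic classes, instances, channel depth C^i
G = np.array([0, 0, 1, 2, 2])  # mapping G(l^p): instance label -> semantic label

# learned distribution-transformation parameters (random stand-ins here)
a_gamma = rng.standard_normal((Lm, Ci))
a_beta = rng.standard_normal((Lm, Ci))
b_gamma = rng.standard_normal((Lm, Ci))
b_beta = rng.standard_normal((Lm, Ci))

# per-instance noise: every row drawn from the same prior N(0, 1)
N_gamma = rng.standard_normal((Lp, Ci))
N_beta = rng.standard_normal((Lp, Ci))

# gamma^i[l^p] = a_gamma[G(l^p)] (*) N_gamma[l^p] + a_beta[G(l^p)]
gamma = a_gamma[G] * N_gamma + a_beta[G]
# beta^i[l^p]  = b_gamma[G(l^p)] (*) N_beta[l^p] + b_beta[G(l^p)]
beta = b_gamma[G] * N_beta + b_beta[G]
```

Instances of the same semantic class share the class's learned scale and translation but receive different noise rows, so each instance gets its own modulation parameters.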

Instance-Adaptive Modulation Sampling

Given the same set of probability distributions, diverse modulation parameters can be generated.
Now, when the generator contains multiple conditional normalization layers, a way to coordinate them is needed.
Simplest approach:
stochastic sampling independently for each normalization layer
Problem: inconsistency, which neutralizes the diversity (why?)
Therefore, a method is proposed to achieve consistent instance sampling across multiple normalization layers whose channel depths differ.
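One way to keep sampling consistent across layers with different channel depths is to draw a single base noise vector per instance and map it to each layer's $C^i$ with a fixed linear projection. This is an illustrative sketch of that idea, not necessarily the paper's exact mechanism; all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Lp, D = 5, 16                 # instances, base noise dimension
layer_channels = [8, 32, 64]  # C^i differs per normalization layer

# sample ONE base noise vector per instance, shared by every layer
base_noise = rng.standard_normal((Lp, D))

# fixed linear maps carry the shared sample to each layer's channel depth
projections = [rng.standard_normal((D, C)) / np.sqrt(D) for C in layer_channels]

# every layer sees noise derived from the same per-instance sample,
# so an instance's style stays consistent throughout the generator
layer_noise = [base_noise @ P for P in projections]
```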

Noise Sampling

Reference image $r$ $\rightarrow$ Encoder $\rightarrow$
Notes
Semantic image synthesis = one-to-many mapping problem. How? By modeling class-level conditional modulation parameters.
Related works: as discrete values, sampling per-instance modulation parameters
Proposed: as continuous probability distributions, through 1. instance-adaptive stochastic sampling that is consistent across the network, 2. prior noise remapping through linear perturbation parameters encoded from paired references

Goal and Intuition

Controllable diversity in semantic image synthesis from the perspective of semantic probability distributions
each semantic class = one distribution
each instance in this class = drawn from this distribution as a discrete sample

Proposed

Variational Modulation Models
Extension of discrete modulation parameters → class-wise continuous probability distributions (embedding the diverse styles of each semantic category in a class-adaptive manner)