ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Mingzhi Zhu1,2, Ding Shang1, Sai Qian Zhang1

1New York University     2Rensselaer Polytechnic Institute

zhum8@rpi.edu, dingshang@nyu.edu, sai.zhang@nyu.edu

Abstract

Photorealistic Codec Avatars (PCA), which enable high-fidelity human face rendering, are increasingly adopted in AR/VR applications to support immersive communication and interaction via deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained AR/VR devices such as head-mounted displays (HMDs), where latency and power efficiency are critical.

To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip (SoC) of AR/VR devices to further enhance processing efficiency.

Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge AR/VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, delivers up to 3.36× latency reduction, and sustains a rendering rate of 100 FPS in end-to-end tests, satisfying real-time VR requirements.
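To make the quantization setting concrete, here is a minimal PyTorch sketch of generic symmetric per-channel fake quantization, the basic operation underlying the PTQ baselines compared below. This is illustrative background only, not ESCA's actual method; the helper name quantize_per_channel is ours.

```python
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-output-channel fake quantization of a weight tensor.

    A generic PTQ building block: each output channel gets its own scale,
    chosen so the channel's max absolute value maps to the integer limit.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for INT4
    w_flat = w.reshape(w.shape[0], -1)                 # [out_ch, rest]
    scale = w_flat.abs().amax(dim=1).clamp(min=1e-8) / qmax
    scale = scale.reshape(-1, *([1] * (w.dim() - 1)))  # broadcastable shape
    w_int = torch.round(w / scale).clamp(-qmax - 1, qmax)
    return w_int * scale                               # dequantized ("fake-quant")

# Example: quantize a conv layer's weights in place after training.
conv = torch.nn.Conv2d(64, 128, 3)
with torch.no_grad():
    conv.weight.copy_(quantize_per_channel(conv.weight, n_bits=4))
```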

Visual Comparison: Full vs. Quantized Models

Left: Avatar rendered with the full-precision MultiFace model (FP32).
Middle: Degraded avatar showing noise and jitter artifacts from a state-of-the-art PTQ baseline (INT4).
Right: Clean, stable avatar produced by the ESCA quantization method (INT4).

ESCA Quantization Pipeline

Overview of the ESCA framework showing the integration of ICAS, FFAS, and UV-weighted Hessian-based quantization.
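The figure names the three components but does not spell out their internals here. As a rough illustration of what a UV-weighted, Hessian-based PTQ objective could look like, the sketch below builds a diagonal Hessian proxy from calibration activations, weights each calibration sample by a UV-space importance term, and grid-searches a per-row weight scale that minimizes the resulting second-order error. All names (uv_weighted_diag_hessian, search_scale, uv_weight) are hypothetical; this is an assumption-laden sketch, not the paper's algorithm.

```python
import torch

def uv_weighted_diag_hessian(x_calib: torch.Tensor, uv_weight: torch.Tensor) -> torch.Tensor:
    """Diagonal Hessian proxy H_ii = sum_j m_j * x_ij^2 for a linear layer.

    x_calib:   [n_samples, in_features] calibration activations.
    uv_weight: [n_samples] per-sample importance (e.g. visibility of the
               UV texel each sample renders to). Purely illustrative.
    """
    return (uv_weight[:, None] * x_calib ** 2).sum(dim=0)  # [in_features]

def search_scale(w_row: torch.Tensor, h_diag: torch.Tensor, n_bits: int = 4) -> float:
    """Grid-search a symmetric scale minimizing Hessian-weighted quantization error."""
    qmax = 2 ** (n_bits - 1) - 1
    best_scale, best_err = 0.0, float("inf")
    for frac in torch.linspace(0.5, 1.0, 20):
        scale = (w_row.abs().max() * frac / qmax).clamp(min=1e-8)
        w_q = torch.round(w_row / scale).clamp(-qmax - 1, qmax) * scale
        err = (h_diag * (w_row - w_q) ** 2).sum()  # second-order error proxy
        if err < best_err:
            best_scale, best_err = scale.item(), err.item()
    return best_scale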

Key Contributions

- An efficient post-training quantization (PTQ) method tailored to Codec Avatar models, combining ICAS, FFAS, and UV-weighted Hessian-based quantization to enable low-precision execution without compromising output quality.
- A custom hardware accelerator designed for integration into the system-on-chip (SoC) of AR/VR devices.
- ESCA, a full-stack optimization framework that improves FovVideoVDP scores by up to +0.39 over the best 4-bit baseline, cuts latency by up to 3.36×, and sustains 100 FPS in end-to-end tests.

FovVideoVDP Quality Scores

Method           Precision   Front    Left     Right
Full Model       FP32        6.5364   5.9480   5.8625
AdaRound+LSQ     W4A4        4.2531   3.6143   3.5606
POCA             W4A4        5.2310   4.3838   4.3457
2DQuant          W4A4        5.2987   4.3948   4.3712
GPTQ             W4A4        5.4980   4.5868   4.5729
ICAS (Ours)      W4A4        5.5901   4.7317   4.7536
UV-W (Ours)      W4A4        5.7559   4.8130   4.8187
ICAS-UV (Ours)   W4A4        5.6438   4.9145   4.9057
FFAS-UV (Ours)   W4A4        5.8541   4.9795   4.9605
AdaRound+LSQ     W8A8        6.2106   5.5004   5.4381
POCA             W8A8        6.4827   5.8511   5.7565
2DQuant          W8A8        6.4983   5.8313   5.7497
GPTQ             W8A8        6.2359   5.6188   5.3613
ICAS (Ours)      W8A8        5.6007   5.3913   5.0762
UV-W (Ours)      W8A8        6.5271   5.9101   5.7610
ICAS-UV (Ours)   W8A8        6.3690   5.6615   5.5998
FFAS-UV (Ours)   W8A8        6.5241   5.8589   5.8071

Higher FovVideoVDP scores indicate better perceptual quality. Our FFAS-UV method achieves the best results in W4A4 quantization.
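For reference, scoring a rendered frame against its full-precision reference with the FovVideoVDP metric might look like the sketch below. It assumes the interface documented for the pyfvvdp reference implementation (gfxdisp/FovVideoVDP); the display name, defaults, and return types may differ across versions, so treat this as an assumption rather than a verified recipe.

```python
import numpy as np
import pyfvvdp  # reference implementation of FovVideoVDP

# Synthetic stand-ins for a reference render and a quantized-model render.
rng = np.random.default_rng(0)
ref_frame = rng.integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
test_frame = np.clip(ref_frame + rng.normal(0, 8, ref_frame.shape), 0, 255).astype(np.uint8)

# Assumed API, based on the project's documented usage; check your version.
metric = pyfvvdp.fvvdp(display_name='standard_4k', heatmap=None)
q_jod, stats = metric.predict(test_frame, ref_frame, dim_order="HWC")
print(f"FovVideoVDP score: {float(q_jod):.4f} JOD")  # higher = better quality
```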

Inference Latency Comparison

Model            Device                        Latency (ms)
Encoder (full)   Snapdragon XR2 Gen 2                 13.80
Encoder (full)   NVIDIA Jetson Orin NX 16GB            9.96
Encoder (8-bit)  Snapdragon XR2 Gen 2                  4.00
Encoder (8-bit)  Our hardware accelerator              3.05
Decoder (full)   NVIDIA Jetson Orin NX 16GB           50.35
Decoder (full)   Snapdragon XR2 Gen 2                 25.80
Decoder (8-bit)  Snapdragon XR2 Gen 2                 14.50
Decoder (8-bit)  Our hardware accelerator             12.51
Decoder (4-bit)  Our hardware accelerator              3.13

Our custom hardware accelerator achieves up to 3.36× latency reduction compared to commercial AR/VR platforms.
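To connect these numbers to the abstract's 100 FPS claim: the fastest accelerator configuration (8-bit encoder plus 4-bit decoder) fits inside a 10 ms frame budget, assuming the two stages run sequentially per frame. A quick check:

```python
# Frame-budget check using the accelerator latencies from the table (ms).
encoder_8bit = 3.05   # Encoder (8-bit), our hardware accelerator
decoder_4bit = 3.13   # Decoder (4-bit), our hardware accelerator

frame_time = encoder_8bit + decoder_4bit   # 6.18 ms end to end (sequential)
budget_100fps = 1000 / 100                 # 10 ms per frame at 100 FPS

print(f"pipeline: {frame_time:.2f} ms per frame; budget: {budget_100fps:.1f} ms")
assert frame_time <= budget_100fps         # leaves ~3.8 ms of headroom
```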