Zhenhong Sun | Projects

Recent Projects

Foundation Representation and Manipulation

Lead Contributor

Feb 2025 - Present

This project studies compact representations of 3D assets and their controllable state transitions, providing the foundation for manipulating and predicting the next states of assets in 3D world models.

Developed an adaptive pivot voxel representation that converts complex meshes into compact 3D tokens, enabling efficient VAE reconstruction. Submitted to a Top-Tier ML Conference: P2Voxel.
Designed a topology-aware prior for decomposing 3D meshes into meaningful parts, improving structure-aware asset manipulation. Submitted to a Top-Tier Graphics Conference: Hi-TOPS.

Spatial Understanding and Synthesis

Lead Contributor

Feb 2025 - Present

This project builds coherent multi-shot 3D worlds through agentic generation pipelines, and uses LLM-based agents to perceive and understand both static scene structures and dynamic world events.

Built an executable 3D world generation pipeline that converts narrative scripts into editable multi-shot 3D scenes, improving inter-shot consistency and continuity. arXiv:2604.03315: StoryBlender.
Developed a narrative-grounded world visual attention framework that serves as the perceptual interface of 3D worlds, guiding viewpoint selection and camera trajectories to understand both static and dynamic scene events. arXiv:2606.26964: Look-Before-Move.

Interaction Dynamics and Motion

Lead Contributor

Feb 2025 - Present

This project studies text-to-action generation for humans in 3D worlds, using flow-matching generative techniques to produce expressive motion and human interactions with 3D worlds.

Studied expressive talking-avatar and portrait animation by modeling identity, lip synchronization, emotion, head turning, smiling, and spatial dynamics, improving the controllability of talking-avatar generation. Submitted to IEEE Transactions on Cybernetics: XTalker; arXiv:2602.10516: 3DXTalker.
Modeled social structures in 3D human-human interaction generation, improving social coherence, interpersonal relation modeling, and motion plausibility in multi-person dynamic scenes. arXiv:2606.24255: Social Structure Matters in 3D Human-Human Interaction Generation.

Decision Intelligence and Reasoning

Lead Contributor

Feb 2025 - Present

This project studies RL-based decision intelligence for action generation and explores diffusion LLMs as reasoning engines for perception, planning, and decision making in 3D worlds.

Developed scalable in-context and language-supervised decision learning frameworks, enabling agents to adapt policies from task context and natural language supervision across decision-making problems. ICLR 2026: Scalable; NeurIPS 2025: Text-to-Decision Agent.
Designed an evolutionary decoding strategy for diffusion-based LLMs to escape high-confidence reasoning failures, improving robustness and mathematical reasoning performance at test time. Submitted to a Top-Tier Machine Learning Conference: Escaping Confidence Trap.

Past Projects

Text-to-Image Diffusion Generation

Lead Contributor

Aug 2023 - May 2025

This project studies how to fine-tune diffusion models for domain-specific generation problems through cross-attention design and controllable generation strategies.

Developed a cross-attention enhancement for human image generation, improving body completeness and proportion quality in work published at CVPR 2024.
Developed a divide-and-conquer strategy for multi-entity generation, published at ACM MM 2024.
Developed two training-free strategies for detailed multi-instance scene generation, resulting in one TMLR 2026 paper and one journal submission.

AI for Quantum Science

Main Contributor

Aug 2022 - Aug 2023

This project focuses on machine learning tools for recognizing quantum data patterns and improving system characterization under realistic imperfections.

Adapted Transformer models to quantum state estimation, improving accuracy by an order of magnitude; published in IEEE Transactions on Cybernetics 2025.
Designed large-scale pre-training and fine-tuning strategies for robust quantum information processing, reported in IEEE Transactions on Emerging Topics in Computational Intelligence 2025.

Lightweight Neural Architecture Search

Lead Contributor

Apr 2021 - Dec 2022

Light-NAS is a distributed full-stack framework built with PyTorch and OpenMPI for efficient network design across classification, detection, 3D action recognition, latency prediction, MCUNet search, and quantization search.

Designed entropy-based NAS algorithms for detection, action recognition, and quantization, leading to ICML 2022, NeurIPS 2022, and ICLR 2023 publications.
Demonstrated practical acceleration of 3x for object detection, 1.68x for animation classification, and 2.5x for car plate recognition.
GitHub: alibaba/lightweight-neural-architecture-search

Learned Image/Video Compression

Main Contributor

Aug 2018 - Apr 2021

Learned compression methods use entropy models to approximate the distribution of compressible latents with CNNs, achieving performance comparable to traditional image codecs. In this line of work, we proposed the Interpolation Variable Rate model and Spatiotemporal Entropy approaches for image and video compression.

Won four tracks of the Challenge on Learned Image Compression at CVPR 2019.
Interpolation Variable Rate Image Compression accepted at ACM MM 2021.
GitHub: tinyvision/IPCodec