Apple has released an open-source artificial intelligence model called SHARP (introduced in the paper "Sharp Monocular View Synthesis in Less Than a Second") that can reconstruct a photorealistic 3D scene from a single two-dimensional image in under one second on standard GPU hardware.
This capability represents a major milestone in computer vision and neural rendering, especially considering that traditional 3D reconstruction techniques typically require dozens or hundreds of images from multiple angles to build a high-fidelity 3D model.
Monocular View Synthesis
SHARP tackles the monocular view synthesis problem — inferring 3D structure and geometry from only a single snapshot. This is fundamentally challenging because depth and occluded geometry are ambiguous without multiple viewpoints. SHARP addresses this by learning statistical depth and structure priors during training.
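To make the ambiguity concrete, here is the standard pinhole-camera argument (a textbook illustration, not something stated in the SHARP materials): scaling an entire scene by any factor leaves its single-image projection unchanged, so absolute depth has to come from learned priors.

```latex
% Pinhole projection of a 3D point (X, Y, Z) with focal length f:
\[ u = \frac{fX}{Z}, \qquad v = \frac{fY}{Z} \]
% Scaling the whole scene by any s > 0 projects to the same pixel:
\[ u' = \frac{f(sX)}{sZ} = u, \qquad v' = \frac{f(sY)}{sZ} = v \]
% hence depth (and anything occluded) cannot be recovered from one image alone.
```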

Representation: 3D Gaussian Splatting
Instead of traditional mesh-based geometry, SHARP uses a 3D Gaussian representation:
- Millions of small, colored ellipsoids (Gaussians) approximate volumetric scene geometry.
- This representation supports metric scale, meaning distances in the reconstructed 3D scene correspond to real-world distances.
In essence, SHARP’s output is not just a depth map but a rich 3D volumetric field that can be rendered from nearby viewpoints in real time — enabling parallax and slight viewpoint shifts that feel immersive.
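As a rough sketch of what such a representation stores per primitive (field names and shapes are illustrative, not SHARP's actual data layout):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One primitive of a 3D Gaussian Splatting scene (illustrative layout)."""
    mean: np.ndarray      # (3,) center position in metric world coordinates
    scale: np.ndarray     # (3,) per-axis extent of the ellipsoid
    rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
    color: np.ndarray     # (3,) RGB (real 3DGS pipelines often store spherical harmonics)
    opacity: float        # in [0, 1]; blended front-to-back during rasterization

# A full scene is simply a large array of such primitives.
scene = [GaussianSplat(np.zeros(3), np.full(3, 0.01),
                       np.array([1.0, 0.0, 0.0, 0.0]),
                       np.array([0.8, 0.2, 0.2]), 0.9)]
```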
Neural Architecture & Pipeline
The high-level pipeline works as follows:
- Input Processing: A single RGB image is fed to a neural-network encoder.
- Depth Estimation: The network estimates per-pixel depth and learns a depth adjustment that corrects common ambiguities (e.g., transparent surfaces).
- Gaussian Regression: Using the depth and color data, SHARP predicts the position, color, scale, and opacity of millions of 3D Gaussians — essentially generating the 3D scene representation in a single forward pass.
- Real-Time Rendering: The 3D Gaussian scene can be rendered at high frame rates (e.g., 100+ FPS) from different viewpoints using standard renderers.
Because the entire generation process is a single feed-forward pass, inference is extremely fast — less than a second on a typical GPU.
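A minimal sketch of that feed-forward flow, with hypothetical module and function names (the real apple/ml-sharp code is organized differently):

```python
import torch

def synthesize_gaussian_scene(image: torch.Tensor,
                              encoder: torch.nn.Module,
                              depth_head: torch.nn.Module,
                              gaussian_head: torch.nn.Module):
    """Single feed-forward pass from one RGB image to a 3D Gaussian scene.

    All module names are hypothetical stand-ins, not SHARP's actual classes.
    image: (1, 3, H, W) tensor in [0, 1].
    """
    with torch.no_grad():                            # pure inference, no per-scene optimization
        features = encoder(image)                    # image -> feature map
        depth = depth_head(features)                 # per-pixel depth plus learned correction
        gaussians = gaussian_head(features, depth)   # positions, colors, scales, opacities
    return gaussians                                 # hand off to any 3DGS rasterizer
```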
Training Strategy
SHARP’s training uses a two-stage curriculum:
- Synthetic Data Pretraining: Large numbers of clean, computer-generated scenes with known 3D structure teach the network basic geometry and depth estimation.
- Real-World Fine-Tuning: Real images are used with self-supervision by synthesizing alternate views and training the model to reconstruct them.
This hybrid approach yields strong generalization, enabling SHARP to perform well even on scenes it has not seen before, a property known as zero-shot generalization.
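A schematic of that two-stage curriculum in pseudo-PyTorch; every name below is a placeholder for illustration, not SHARP's real training API or loss formulation:

```python
def train_sharp_like(model, optimizer, render,
                     synthetic_loader, real_loader,
                     depth_loss, photometric_loss):
    """Two-stage curriculum sketch; all arguments are hypothetical placeholders."""
    # Stage 1: supervised pretraining on synthetic scenes with known 3D structure.
    for image, gt_depth, gt_views, view_poses in synthetic_loader:
        pred = model(image)                                   # single feed-forward pass
        loss = depth_loss(pred.depth, gt_depth)
        loss = loss + photometric_loss(render(pred, view_poses), gt_views)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: self-supervised fine-tuning on real images.
    for image, neighbor_view, neighbor_pose in real_loader:   # e.g., nearby frames of casual video
        pred = model(image)
        rendered = render(pred, neighbor_pose)                # re-render from the other viewpoint
        loss = photometric_loss(rendered, neighbor_view)      # compare against the real photo
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```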
Performance & Metrics
SHARP sets new state-of-the-art results on common view synthesis benchmarks:
- LPIPS (learned perceptual image patch similarity) and DISTS (deep image structure and texture similarity) improve (i.e., decrease) by ~25–34% and ~21–43%, respectively, versus the best prior models.
- Synthesis time is reduced by two to three orders of magnitude compared to many diffusion-based or optimization-based approaches.
This combination of speed and quality is essential for real-world applications where interactive feedback is required.
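For reference, LPIPS can be computed with the widely used `lpips` package; this evaluates any pair of images and is independent of SHARP itself:

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS; lower scores mean the images are more perceptually similar.
loss_fn = lpips.LPIPS(net='alex')

# Images as (N, 3, H, W) tensors scaled to [-1, 1].
rendered = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in for a synthesized novel view
reference = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for the ground-truth view

score = loss_fn(rendered, reference)
print(f"LPIPS: {score.item():.4f}")
```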
Availability & Open-Source Release
Apple has made SHARP publicly available under open-source licensing:
- Repository: apple/ml-sharp on GitHub, which includes model code, checkpoints, command-line utilities, and sample data.
Developers can use the CLI to generate 3D Gaussian representations (3DGS format) from arbitrary images and leverage third-party renderers for visualization.
This open-source release has already prompted community integrations, such as adapters for visual programming tools like ComfyUI that make SHARP accessible in creative pipelines.
Source: Sharp Monocular View Synthesis in Less Than a Second – Apple Machine Learning Research