Apple has released an open-source artificial intelligence model called SHARP (introduced in the paper "Sharp Monocular View Synthesis in Less Than a Second") that can reconstruct a photorealistic 3D scene from a single two-dimensional image in under one second on standard GPU hardware.
This capability represents a major milestone in computer vision and neural rendering, especially considering that traditional 3D reconstruction techniques typically require dozens or hundreds of images from multiple angles to build a high-fidelity 3D model.
Monocular View Synthesis
SHARP tackles the monocular view synthesis problem — inferring 3D structure and geometry from only a single snapshot. This is fundamentally challenging because depth and occluded geometry are ambiguous without multiple viewpoints. SHARP addresses this by learning statistical depth and structure priors during training.
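To make the ambiguity concrete, here is the standard pinhole-camera argument (a textbook illustration, not something stated in the SHARP materials): scaling an entire scene by any factor leaves its single-image projection unchanged, so absolute depth has to come from learned priors.

```latex
% Pinhole projection of a 3D point (X, Y, Z) with focal length f:
\[ u = \frac{fX}{Z}, \qquad v = \frac{fY}{Z} \]
% Scaling the whole scene by any s > 0 projects to the same pixel:
\[ u' = \frac{f(sX)}{sZ} = u, \qquad v' = \frac{f(sY)}{sZ} = v \]
% hence depth (and anything occluded) cannot be recovered from one image alone.
```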

Representation: 3D Gaussian Splatting
Instead of traditional mesh-based geometry, SHARP uses a 3D Gaussian representation:
- Millions of small, colored ellipsoids (Gaussians) approximate volumetric scene geometry.
- This representation supports metric scale, meaning distances in the reconstructed 3D scene correspond to real-world distances.
In essence, SHARP’s output is not just a depth map but a rich 3D volumetric field that can be rendered from nearby viewpoints in real time — enabling parallax and slight viewpoint shifts that feel immersive.
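As a rough sketch of what such a representation stores per primitive (field names and shapes are illustrative, not SHARP's actual data layout):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One primitive of a 3D Gaussian Splatting scene (illustrative layout)."""
    mean: np.ndarray      # (3,) center position in metric world coordinates
    scale: np.ndarray     # (3,) per-axis extent of the ellipsoid
    rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
    color: np.ndarray     # (3,) RGB (real 3DGS pipelines often store spherical harmonics)
    opacity: float        # in [0, 1]; blended front-to-back during rasterization

# A full scene is simply a large array of such primitives.
scene = [GaussianSplat(np.zeros(3), np.full(3, 0.01),
                       np.array([1.0, 0.0, 0.0, 0.0]),
                       np.array([0.8, 0.2, 0.2]), 0.9)]
```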
Neural Architecture & Pipeline
The high-level pipeline works as follows:
- Input Processing: A single RGB image is fed to a neural-network encoder.
- Depth Estimation: The network estimates per-pixel depth and learns a depth adjustment that corrects common ambiguities (e.g., transparent surfaces).
- Gaussian Regression: Using the depth and color data, SHARP predicts the position, color, scale, and opacity of millions of 3D Gaussians — essentially generating the 3D scene representation in a single forward pass.
- Real-Time Rendering: The 3D Gaussian scene can be rendered at high frame rates (e.g., 100+ FPS) from different viewpoints using standard renderers.
Because the entire generation process is a single feed-forward pass, inference is extremely fast — less than a second on a typical GPU.
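A minimal sketch of that feed-forward flow, with hypothetical module and function names (the real apple/ml-sharp code is organized differently):

```python
import torch

def synthesize_gaussian_scene(image: torch.Tensor,
                              encoder: torch.nn.Module,
                              depth_head: torch.nn.Module,
                              gaussian_head: torch.nn.Module):
    """Single feed-forward pass from one RGB image to a 3D Gaussian scene.

    All module names are hypothetical stand-ins, not SHARP's actual classes.
    image: (1, 3, H, W) tensor in [0, 1].
    """
    with torch.no_grad():                            # pure inference, no per-scene optimization
        features = encoder(image)                    # image -> feature map
        depth = depth_head(features)                 # per-pixel depth plus learned correction
        gaussians = gaussian_head(features, depth)   # positions, colors, scales, opacities
    return gaussians                                 # hand off to any 3DGS rasterizer
```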
Training Strategy
SHARP’s training uses a two-stage curriculum:
- Synthetic Data Pretraining: Large numbers of clean, computer-generated scenes with known 3D structure teach the network basic geometry and depth estimation.
- Real-World Fine-Tuning: Real images are used with self-supervision by synthesizing alternate views and training the model to reconstruct them.
This hybrid approach yields strong generalization, enabling SHARP to perform well even on scenes it has not seen before, a property known as zero-shot generalization.
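A schematic of that two-stage curriculum in pseudo-PyTorch; every name below is a placeholder for illustration, not SHARP's real training API or loss formulation:

```python
def train_sharp_like(model, optimizer, render,
                     synthetic_loader, real_loader,
                     depth_loss, photometric_loss):
    """Two-stage curriculum sketch; all arguments are hypothetical placeholders."""
    # Stage 1: supervised pretraining on synthetic scenes with known 3D structure.
    for image, gt_depth, gt_views, view_poses in synthetic_loader:
        pred = model(image)                                   # single feed-forward pass
        loss = depth_loss(pred.depth, gt_depth)
        loss = loss + photometric_loss(render(pred, view_poses), gt_views)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: self-supervised fine-tuning on real images.
    for image, neighbor_view, neighbor_pose in real_loader:   # e.g., nearby frames of casual video
        pred = model(image)
        rendered = render(pred, neighbor_pose)                # re-render from the other viewpoint
        loss = photometric_loss(rendered, neighbor_view)      # compare against the real photo
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```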
Performance & Metrics
SHARP sets new state-of-the-art results on common view synthesis benchmarks:
- LPIPS (learned perceptual image patch similarity) and DISTS (deep image structure and texture similarity) improve (i.e., decrease) by ~25–34% and ~21–43%, respectively, versus the best prior models.
- Synthesis time is reduced by two to three orders of magnitude compared to many diffusion-based or optimization-based approaches.
This combination of speed and quality is essential for real-world applications where interactive feedback is required.
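For reference, LPIPS can be computed with the widely used `lpips` package; this evaluates any pair of images and is independent of SHARP itself:

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS; lower scores mean the images are more perceptually similar.
loss_fn = lpips.LPIPS(net='alex')

# Images as (N, 3, H, W) tensors scaled to [-1, 1].
rendered = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in for a synthesized novel view
reference = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for the ground-truth view

score = loss_fn(rendered, reference)
print(f"LPIPS: {score.item():.4f}")
```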
Availability & Open-Source Release
Apple has made SHARP publicly available under open-source licensing:
- Repository: apple/ml-sharp on GitHub, which includes model code, checkpoints, command-line utilities, and sample data.
Developers can use the CLI to generate 3D Gaussian representations (3DGS format) from arbitrary images and leverage third-party renderers for visualization.
This open-source release has already prompted community integrations, such as adapters for visual programming tools like ComfyUI that make SHARP accessible in creative pipelines.
Source: Sharp Monocular View Synthesis in Less Than a Second – Apple Machine Learning Research