Large Vision Models (LVM) List

  1. Variational Autoencoders (VAEs)
    Type: Generative Model
    Use Case: Anomaly detection, image generation, data compression.
    Frameworks: TensorFlow, PyTorch.
    Key Features:
    Learns latent representations of data.
    Capable of generating new data similar to the input distribution.
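
Example (illustrative): a minimal PyTorch VAE sketch. The layer sizes, flattened 784-dimensional input, and Bernoulli reconstruction loss are assumptions for illustration, not a prescribed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: encoder -> (mu, log_var) -> reparameterize -> decoder."""
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
        return self.dec(z), mu, log_var

def vae_loss(recon_logits, x, mu, log_var):
    # Negative ELBO: reconstruction term + KL divergence to the standard normal prior.
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```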

  2. Deep Latent Variable Models (DLVMs)
    Type: Generative Model
    Use Case: Complex data modeling, generative tasks, structured output prediction.
    Frameworks: PyTorch, TensorFlow.
    Key Features:
    Extends traditional LVMs with deep neural networks.
    Used for tasks requiring structured and high-dimensional outputs.
  3. Gaussian Mixture Models (GMMs)
    Type: Probabilistic Model
    Use Case: Clustering, density estimation, anomaly detection.
    Frameworks: Scikit-learn, TensorFlow.
    Key Features:
    Models data as a mixture of multiple Gaussian distributions.
    Useful for tasks where data distribution is multimodal.
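
Example (illustrative): GMM-based anomaly scoring with Scikit-learn. The component count, random stand-in features, and 1% threshold are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X_train = np.random.randn(1000, 8)                 # stand-in for training features
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(X_train)

X_new = np.random.randn(50, 8)
log_density = gmm.score_samples(X_new)             # per-sample log-likelihood
threshold = np.percentile(gmm.score_samples(X_train), 1)   # assumed 1% cutoff
anomalies = X_new[log_density < threshold]         # low-density points flagged as anomalous
```
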
  4. Hidden Markov Models (HMMs)
    Type: Probabilistic Model
    Use Case: Time series analysis, speech recognition, bioinformatics.
    Frameworks: hmmlearn, pomegranate (HMMs are no longer part of Scikit-learn).
    Key Features:
    Models sequential data with hidden states.
    Commonly used in temporal pattern recognition tasks.
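
Example (illustrative): fitting a Gaussian HMM with hmmlearn. The two-state model and the synthetic observation sequence are assumptions for illustration.

```python
import numpy as np
from hmmlearn import hmm

# One 1-D observation sequence, shaped (n_samples, n_features).
obs = np.concatenate([np.random.normal(0, 1, 100),
                      np.random.normal(5, 1, 100)]).reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(obs)                 # EM (Baum-Welch) over the hidden states
states = model.predict(obs)    # Viterbi decoding of the most likely state path
print(model.score(obs))        # log-likelihood of the sequence
```
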
  5. Bayesian Neural Networks (BNNs)
    Type: Probabilistic Model
    Use Case: Uncertainty estimation, robust prediction, decision-making.
    Frameworks: Pyro (on PyTorch), TensorFlow Probability.
    Key Features:
    Incorporates uncertainty in model predictions.
    Useful in scenarios where model confidence is critical.
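
Example (illustrative): a compact Pyro sketch of a Bayesian linear model with Normal priors over weights and biases; the prior scales and layer sizes are assumptions. Running variational inference or MCMC over this model yields a posterior, so predictions come with uncertainty estimates rather than point values.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BayesianRegressor(PyroModule):
    def __init__(self, in_dim=4, out_dim=1):
        super().__init__()
        # Placing priors over the linear layer's parameters makes it Bayesian.
        self.linear = PyroModule[torch.nn.Linear](in_dim, out_dim)
        self.linear.weight = PyroSample(
            dist.Normal(0., 1.).expand([out_dim, in_dim]).to_event(2))
        self.linear.bias = PyroSample(dist.Normal(0., 1.).expand([out_dim]).to_event(1))

    def forward(self, x, y=None):
        mean = self.linear(x).squeeze(-1)
        sigma = pyro.sample("sigma", dist.HalfNormal(1.0))   # observation noise
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mean, sigma), obs=y)
        return mean
```
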
  6. Normalizing Flows
    Type: Generative Model
    Use Case: Density estimation, data generation, anomaly detection.
    Frameworks: PyTorch, TensorFlow.
    Key Features:
    Reversible mappings to model complex distributions.
    Scalable and flexible, suitable for high-dimensional data.
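
Example (illustrative): the change-of-variables idea behind flows, using torch.distributions. A real flow (e.g. RealNVP or Glow) learns its transform parameters; the affine and sigmoid transforms here are fixed and purely for illustration.

```python
import torch
import torch.distributions as D

base = D.Normal(torch.tensor(0.0), torch.tensor(1.0))
# A chain of invertible transforms with tractable log-det Jacobians.
transforms = [D.transforms.AffineTransform(loc=0.0, scale=2.0),
              D.transforms.SigmoidTransform()]
flow = D.TransformedDistribution(base, transforms)

x = flow.sample((5,))        # sample: push base samples through the transforms
log_px = flow.log_prob(x)    # exact density via the change-of-variables formula
```
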
  7. Generative Adversarial Networks (GANs)
    Type: Generative Model
    Use Case: Image generation, data augmentation, unsupervised learning.
    Frameworks: TensorFlow, PyTorch.
    Key Features:
    Adversarial training between a generator and discriminator.
    Capable of generating highly realistic data, particularly images.
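
Example (illustrative): one adversarial training step in PyTorch on flattened images. The network sizes, learning rates, and random stand-in batch are assumptions.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D_net = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D_net.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, img_dim)          # stand-in for a batch of real images
fake = G(torch.randn(32, latent_dim))

# Discriminator step: push real toward 1, fake toward 0 (fake detached so G is untouched).
d_loss = bce(D_net(real), torch.ones(32, 1)) + bce(D_net(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
g_loss = bce(D_net(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```
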
  8. Transformers (e.g., BERT, GPT)
    Type: Sequence Model
    Use Case: Natural Language Processing, sequence modeling.
    Frameworks: Hugging Face Transformers (PyTorch, TensorFlow).
    Key Features:
    Powerful for tasks like text generation, translation, summarization.
    Handles long-range dependencies in data efficiently.
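
Example (illustrative): the Hugging Face pipeline API; gpt2 and distilbart-cnn are common public checkpoints chosen only as examples.

```python
from transformers import pipeline

# Text generation with a small GPT-2 checkpoint.
generator = pipeline("text-generation", model="gpt2")
print(generator("Edge deployment of vision models is", max_new_tokens=20)[0]["generated_text"])

# Summarization uses the same high-level API with a different task and model.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
```
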
  9. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) Networks
    Type: Sequence Model
    Use Case: Time series prediction, language modeling, sequence generation.
    Frameworks: TensorFlow, PyTorch.
    Key Features:
    Designed for sequential data processing.
    LSTMs mitigate the vanishing gradient problem in RNNs.
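
Example (illustrative): a sequence-to-one forecast with nn.LSTM; the feature size, hidden size, and batch shape are assumptions.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, num_layers=1, batch_first=True)
head = nn.Linear(32, 1)

x = torch.randn(16, 50, 8)         # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)      # outputs: (16, 50, 32); h_n: final hidden state per layer
prediction = head(h_n[-1])         # last layer's final state -> scalar forecast per sequence
```
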
  10. Autoencoders (AEs)
    Type: Unsupervised Learning Model
    Use Case: Dimensionality reduction, denoising, feature learning.
    Frameworks: TensorFlow, PyTorch.
    Key Features:
    Learns compressed representations of input data.
    Useful for tasks requiring data reconstruction or compression.
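
Example (illustrative): a minimal autoencoder in PyTorch; the sizes are arbitrary, and training simply minimizes the reconstruction loss computed at the end.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)                     # stand-in batch of flattened images
code = encoder(x)                           # compressed 32-dimensional representation
recon = decoder(code)
loss = nn.functional.mse_loss(recon, x)     # reconstruction objective
```
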
Deployment on NVIDIA Jetson
Frameworks: Most of the models above can be implemented in TensorFlow or PyTorch and deployed on Jetson devices, with TensorRT used for optimized inference.
Optimization: Jetson devices support TensorRT, NVIDIA's high-performance deep learning inference optimizer and runtime, which can significantly accelerate model inference and is especially useful when deploying these models on edge devices.
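
Example (illustrative): a common deployment path is to export a trained PyTorch model to ONNX and then build a TensorRT engine on the Jetson with the trtexec tool that ships with TensorRT. ResNet-18 and the file names model.onnx / model.engine are placeholders.

```python
import torch
import torchvision

# 1) Export a trained model to ONNX (ResNet-18 used as a stand-in).
model = torchvision.models.resnet18(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# 2) On the Jetson, build and benchmark a TensorRT engine, e.g.:
#    trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```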

LVM – Open Source

  1. YOLOv5
    Type: Object Detection
    Use Case: Real-time object detection in images and video frames.
    Key Features:
    High-speed detection with good accuracy.
    Lightweight and optimized for real-time applications.
    Pre-trained on the COCO dataset.
    Supports transfer learning for custom datasets.
    Repository: YOLOv5 on GitHub
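
Example (illustrative): the PyTorch Hub entry point documented in the YOLOv5 repository; frame.jpg is a placeholder path.

```python
import torch

# Downloads the small COCO-pretrained checkpoint via PyTorch Hub.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

results = model("frame.jpg")     # path, URL, PIL image, or numpy array
results.print()                  # class, confidence, and box summary
boxes = results.xyxy[0]          # (x1, y1, x2, y2, confidence, class) per detection
```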

  2. Detectron2
    Type: Object Detection, Segmentation
    Use Case: Object detection, instance segmentation, and keypoint detection.
    Key Features:
    Built on PyTorch with modular design for easy customization.
    State-of-the-art models like Faster R-CNN, Mask R-CNN, and RetinaNet.
    Also supports panoptic segmentation and DensePose (via detectron2 projects).
    Repository: Detectron2 on GitHub
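
Example (illustrative): Detectron2 inference with a model-zoo Mask R-CNN config; the zero-filled array stands in for a BGR frame from OpenCV or a video decoder.

```python
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cpu"                          # use "cuda" on a GPU device

predictor = DefaultPredictor(cfg)
frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in BGR frame
outputs = predictor(frame)                        # boxes, masks, classes, scores
instances = outputs["instances"]
```
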
  3. DeepLabV3+
    Type: Semantic Segmentation
    Use Case: Semantic segmentation of images, useful for pixel-level understanding.
    Key Features:
    Atrous (dilated) convolutions with ASPP for capturing multi-scale context.
    Supports multiple backbone architectures like ResNet.
    High accuracy in segmentation tasks.
    Repository: DeepLabV3+ on TensorFlow GitHub
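
Example (illustrative): the entry above points at the TensorFlow reference implementation; for a quick PyTorch-side try-out, torchvision ships the closely related DeepLabV3 (ResNet-50) model, used below with a placeholder image path.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open("frame.jpg").convert("RGB")           # placeholder path
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"]   # (1, num_classes, H, W)
mask = out.argmax(1)                                   # per-pixel class IDs
```
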
  4. EfficientDet
    Type: Object Detection
    Use Case: Efficient, scalable object detection in images and videos.
    Key Features:
    Balances speed and accuracy with EfficientNet as the backbone.
    Scalable from small to large models (EfficientDet-D0 to D7).
    Pre-trained models available, easy to fine-tune.
    Repository: EfficientDet on GitHub
  5. Swin Transformer
    Type: Image Classification, Object Detection, Segmentation
    Use Case: General-purpose vision model with strong performance in classification, detection, and segmentation tasks.
    Key Features:
    Transformer-based architecture with hierarchical feature maps.
    Outperforms many CNN-based models in accuracy.
    Supports various tasks with a unified architecture.
    Repository: Swin Transformer on GitHub
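
Example (illustrative): one convenient way to try Swin is through timm; the model name below is a standard timm identifier, assumed to be available in the installed version, and frame.jpg is a placeholder.

```python
import timm
import torch
from PIL import Image

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True).eval()
cfg = timm.data.resolve_data_config({}, model=model)
preprocess = timm.data.create_transform(**cfg)

img = Image.open("frame.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))   # (1, 1000) ImageNet logits
top5 = logits.topk(5).indices
```
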
  6. DINO (Self-Supervised Vision Transformer)
    Type: Image Classification, Feature Extraction
    Use Case: Self-supervised learning for extracting rich visual features from images.
    Key Features:
    Self-supervised pre-training without labeled data.
    Can be fine-tuned for various downstream tasks.
    Uses Vision Transformers (ViTs) for strong performance.
    Repository: DINO on GitHub
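
Example (illustrative): the DINO repository exposes its pretrained backbones through PyTorch Hub; the ViT-S/16 variant is used here, and the random tensor stands in for a preprocessed image.

```python
import torch

# Self-supervised ViT-S/16 backbone (no classification head).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

x = torch.randn(1, 3, 224, 224)    # stand-in for a normalized image tensor
with torch.no_grad():
    features = backbone(x)         # (1, 384) embedding for retrieval, k-NN, or linear probes
```
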
  7. OpenMMLab (MMDetection, MMSegmentation, MMTracking)
    Type: Object Detection, Segmentation, Tracking
    Use Case: Modular, extensive toolbox for various vision tasks.
    Key Features:
    Modular design allows easy combination of different components.
    Wide range of pre-trained models available.
    Strong support for frame-by-frame video processing tasks.
    Repository: OpenMMLab on GitHub
  8. CLIP (Contrastive Language-Image Pre-training)
    Type: Image Classification, Zero-Shot Learning
    Use Case: Zero-shot frame classification from text prompts, image-text matching.
    Key Features:
    Joint training of images and text allows zero-shot inference.
    Can be used to filter frames based on textual descriptions.
    Pre-trained on large datasets, works well for many visual tasks.
    Repository: CLIP on GitHub
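
Example (illustrative): the frame-filtering idea with the openai/CLIP package; the prompts, placeholder image path, and 0.8 cutoff are assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
prompts = clip.tokenize(["a person riding a bicycle", "an empty street"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)    # similarity of the frame to each prompt

keep = probs[0, 0].item() > 0.8                 # keep the frame if it matches the first prompt
```
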
  9. ViT (Vision Transformer)
    Type: Image Classification, Feature Extraction
    Use Case: General-purpose image classification, can be adapted for video frame analysis.
    Key Features:
    Transformer-based model for vision tasks.
    High performance on image classification benchmarks.
    Can be adapted for frame-wise inference in videos.
    Repository: ViT on GitHub
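
Example (illustrative): image classification with the standard google/vit-base-patch16-224 checkpoint via Hugging Face Transformers (a reasonably recent version is assumed); frame.jpg is a placeholder.

```python
import torch
from transformers import AutoImageProcessor, ViTForImageClassification
from PIL import Image

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224").eval()

img = Image.open("frame.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = model.config.id2label[logits.argmax(-1).item()]
```
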
  10. OpenPose
    Type: Pose Estimation
    Use Case: Real-time multi-person pose estimation in video frames.
    Key Features:
    Detects human body, hand, facial, and foot keypoints.
    Supports real-time processing on video frames.
    Widely used for applications in sports analytics, gesture recognition, etc.
    Repository: OpenPose on GitHub

Mini LVM

Mini VLM
Type: Vision-Language Model
Use Case: Image-text retrieval, visual question answering (VQA), image captioning.
Key Features:
A smaller version of VLM (Vision-Language Model) designed for efficient inference.
Uses a reduced transformer architecture to minimize model size while retaining alignment capabilities.
Suitable for edge devices or resource-constrained environments.

VILA: On Pre-training for Visual Language Models

GitHub – NVlabs/VILA: VILA – a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

Moondream LVM: Moondream – a small open-source vision-language model designed for efficient inference on edge devices.

MiniCPM-V: GitHub – OpenBMB/MiniCPM-V: MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

DistilBERT (Vision-Language Versions)
Type: Distilled Transformer Model
Use Case: Image captioning, VQA, and multimodal tasks when combined with vision backbones.
Key Features:
A smaller, faster version of BERT, adapted for vision-language tasks.
Maintains most of BERT’s capabilities but with a smaller footprint and reduced inference time.
Can be combined with lightweight CNNs or Vision Transformers for vision-language tasks.
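
Illustrative sketch of the "distilled text encoder plus lightweight vision backbone" pattern described above: DistilBERT and MobileNetV3-Small are projected into a shared embedding space for image-text matching. The projection size, [CLS]-token pooling, and backbone pairing are assumptions, not a published recipe.

```python
import torch
import torch.nn as nn
import torchvision
from transformers import DistilBertModel, DistilBertTokenizer

class TwoTower(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.text = DistilBertModel.from_pretrained("distilbert-base-uncased")
        cnn = torchvision.models.mobilenet_v3_small(weights="DEFAULT")
        self.vision = nn.Sequential(cnn.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_proj = nn.Linear(self.text.config.dim, embed_dim)
        self.img_proj = nn.Linear(576, embed_dim)   # 576 = MobileNetV3-Small feature channels

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.img_proj(self.vision(pixel_values))
        txt = self.text(input_ids=input_ids, attention_mask=attention_mask)
        txt = self.text_proj(txt.last_hidden_state[:, 0])   # [CLS]-position token
        # Cosine similarity between towers; contrastive training would align matching pairs.
        return nn.functional.cosine_similarity(img, txt)

tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
batch = tok(["a dog on a skateboard"], return_tensors="pt", padding=True)
sim = TwoTower()(torch.randn(1, 3, 224, 224), batch["input_ids"], batch["attention_mask"])
```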

TinyBERT (Vision-Language Version)
Type: Distilled Transformer Model
Use Case: Multimodal tasks like VQA, image captioning, and text-based image retrieval.
Key Features:
Further compression of BERT, suitable for deployment on edge devices.
When adapted for vision-language tasks, it provides efficient performance with limited resources.
Useful for applications requiring rapid inference in a lightweight model.

MobileViT (Vision-Language Applications)
Type: Lightweight Vision Transformer
Use Case: Image classification, text-based image retrieval, VQA.
Key Features:
Combines the strengths of CNNs and transformers for a mobile-friendly architecture.
Can be paired with lightweight language models to create a compact vision-language system.
Optimized for mobile and edge devices, offering a good balance between accuracy and efficiency.
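
Example (illustrative): MobileViT backbones are packaged in timm; the model name mobilevit_s is assumed to exist in the installed timm version, and frame.jpg is a placeholder. For a compact vision-language system, the pooled features can be paired with a small text encoder, as sketched under DistilBERT above.

```python
import timm
import torch
from PIL import Image

model = timm.create_model("mobilevit_s", pretrained=True).eval()
cfg = timm.data.resolve_data_config({}, model=model)
preprocess = timm.data.create_transform(**cfg)

img = Image.open("frame.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))   # ImageNet-1k logits
```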

LiteVL (Lightweight Vision-Language)
Type: Vision-Language Model
Use Case: Image captioning, VQA, and multimodal understanding.
Key Features:
A lightweight version of traditional vision-language models, designed for efficiency.
Uses smaller transformer layers and reduced embedding sizes to minimize computational load.
Suitable for real-time applications on devices with limited resources.

DistillVLP (Vision-Language Pre-training)
Type: Distilled Vision-Language Model
Use Case: Image-text matching, VQA, and image captioning.
Key Features:
Distilled from larger vision-language models like VLP (Vision-Language Pre-training).
Reduces model size and complexity while maintaining good performance on vision-language tasks.
Ideal for applications where model size and inference speed are critical.

MiniGPT
Type: Vision-Language Model
Use Case: Image captioning, VQA, text-to-image retrieval.
Key Features:
A smaller version of GPT adapted for vision-language tasks.
Integrates with lightweight vision backbones to process images and text jointly.
Suitable for tasks requiring quick, efficient processing with reduced computational requirements.

TinyCLIP
Type: Vision-Language Model
Use Case: Image classification, image-text matching.
Key Features:
A compressed version of the CLIP model, optimized for speed and efficiency.
Maintains the ability to perform zero-shot learning and image-text retrieval with a smaller footprint.
Designed for deployment in resource-constrained environments.

Distilled VisualBERT
Type: Vision-Language Model
Use Case: VQA, image captioning, and other multimodal tasks.
Key Features:
A distilled version of VisualBERT, which combines BERT with vision models.
Reduces the number of transformer layers and parameter size for efficient deployment.
Provides a good balance between performance and resource usage.

LXMERT (Lightweight Adaptation)
Type: Vision-Language Model
Use Case: VQA, image captioning, image-text retrieval.
Key Features:
A version of LXMERT adapted to be more resource-efficient while retaining multimodal capabilities.
Focuses on reducing the number of parameters and computational complexity.
Suitable for real-time vision-language tasks in constrained environments.