Run:ai – enabling organizations to optimize and accelerate AI workloads

Run:ai is a company that helps organizations run AI workloads faster and with less wasted hardware. It provides a cloud-native platform designed to maximize the utilization of AI hardware infrastructure, particularly GPUs. By dynamically allocating GPU resources across multiple teams and workloads, Run:ai helps organizations manage their AI/ML infrastructure more effectively.

Key features:

Dynamic Resource Allocation: Enables the efficient sharing of GPU resources across multiple users and workloads.

Kubernetes Integration: Leverages Kubernetes for container orchestration, allowing for scalable and flexible resource management.

Cluster Management: Centralized management of GPU clusters across on-premises, cloud, and hybrid environments.

Job Prioritization: Allows prioritization and queuing of AI workloads to meet business needs.

Elastic GPU Pools: Facilitates elastic allocation of GPUs, reducing the amount of underused hardware.

Multi-Tenancy Support: Provides secure isolation between different teams or users.

Ease of Integration: Compatible with major AI/ML frameworks such as TensorFlow and PyTorch.

A few examples:

1. Submitting an AI/ML Job

Run:ai lets you submit jobs to GPU clusters through its CLI or directly with Kubernetes YAML manifests.

CLI Example

runai submit my-job \
  --image tensorflow/tensorflow:latest-gpu \
  --gpu 2 \
  --command "python train.py --epochs=10"

  • --image: Specifies the container image for the job.
  • --gpu: Sets the number of GPUs to allocate.
  • --command: The command to execute within the container.
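When jobs are submitted from a script rather than typed by hand, it can help to build the command as an argument list. The following is a minimal sketch, not part of Run:ai itself: the helper name build_submit_command is made up for illustration, and it uses only the three flags shown above.

```python
import shlex

def build_submit_command(job_name, image, gpus, command):
    """Return the argument list for a `runai submit` call.

    Hypothetical helper: only the flags from the CLI example are used.
    """
    if gpus < 0:
        raise ValueError("gpus must be non-negative")
    return [
        "runai", "submit", job_name,
        "--image", image,
        "--gpu", str(gpus),
        "--command", command,
    ]

args = build_submit_command(
    "my-job",
    "tensorflow/tensorflow:latest-gpu",
    2,
    "python train.py --epochs=10",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(args))
# To actually run it: subprocess.run(args, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the training command itself contains spaces or quotes.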

2. Kubernetes YAML Manifest Example

You can use Kubernetes manifests to submit jobs through Run:ai. Below is an example YAML file for a TensorFlow training job.

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-training
  annotations:
    runai.io/project: "default"
spec:
  template:
    spec:
      containers:
      - name: tensorflow-container
        image: tensorflow/tensorflow:latest-gpu
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 2
      restartPolicy: Never

  • annotations: Specifies the Run:ai project for the job.
  • nvidia.com/gpu: Requests the number of GPUs for the container.
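If many similar jobs are needed, the manifest can be generated instead of copied. Here is a minimal sketch under a few assumptions: the function name training_job_manifest is hypothetical, the fields mirror the example above (including the runai.io/project annotation), and the output is JSON because kubectl accepts JSON manifests as well as YAML, which keeps the example to the standard library.

```python
import json

def training_job_manifest(name, project, container_name, image, command, gpus):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Project annotation, as used in the example manifest.
            "annotations": {"runai.io/project": project},
        },
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": container_name,
                        "image": image,
                        "command": command,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

manifest = training_job_manifest(
    name="tensorflow-training",
    project="default",
    container_name="tensorflow-container",
    image="tensorflow/tensorflow:latest-gpu",
    command=["python", "train.py"],
    gpus=2,
)
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON can be applied with kubectl apply -f like any other manifest.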

3. Elastic GPU Allocation

Run:ai supports elastic resource allocation. For lightweight tasks, you can request a fraction of a GPU instead of an entire GPU.

Example YAML for Elastic Allocation

apiVersion: v1
kind: Pod
metadata:
  name: elastic-gpu-job
  annotations:
    runai.io/project: "default"
spec:
  containers:
  - name: lightweight-task
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 0.5  # Request half a GPU
    command: ["python", "light_task.py"]
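To see why fractional requests raise utilization, consider a toy model in which several lightweight tasks share whole GPUs. The sketch below uses a simple first-fit rule chosen purely for illustration; it is not Run:ai's scheduler, and the task names and the pack_first_fit function are hypothetical.

```python
from fractions import Fraction

def pack_first_fit(requests, gpu_count):
    """Place fractional GPU requests on whole GPUs using first-fit.

    Returns one list of task names per GPU. Raises ValueError if a
    request does not fit on any GPU.
    """
    free = [Fraction(1)] * gpu_count          # remaining capacity per GPU
    placement = [[] for _ in range(gpu_count)]
    for task, fraction in requests:
        need = Fraction(str(fraction))        # exact arithmetic, no float drift
        for gpu in range(gpu_count):
            if free[gpu] >= need:
                free[gpu] -= need
                placement[gpu].append(task)
                break
        else:
            raise ValueError(f"no GPU has {need} free for {task}")
    return placement

requests = [("light-a", 0.5), ("light-b", 0.5), ("light-c", 0.25), ("light-d", 0.5)]
placement = pack_first_fit(requests, gpu_count=2)
print(placement)
```

In this model, four tasks fit on two GPUs, whereas whole-GPU requests would have needed four.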

4. Job Monitoring

Run:ai lets you monitor jobs from the CLI or the web interface. The following CLI command lists submitted jobs and their status:

runai list jobs

Example output:

NAME            STATUS   GPU(REQ/LIMIT)  NODE
my-job          Running  2/2             gpu-node-1
tensorflow-job  Pending  1/1             -
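For scripted monitoring, the tabular listing can be turned into records. This is a minimal sketch that assumes the columns look like the sample above and that no field contains spaces; the real output may differ between CLI versions, and parse_job_table is a hypothetical helper.

```python
SAMPLE_OUTPUT = """\
NAME            STATUS      GPU(REQ/LIMIT)   NODE
my-job          Running     2/2             gpu-node-1
tensorflow-job  Pending     1/1             -
"""

def parse_job_table(text):
    """Parse a whitespace-separated job table into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

jobs = parse_job_table(SAMPLE_OUTPUT)
# Names of jobs that are still waiting for GPUs.
pending = [job["NAME"] for job in jobs if job["STATUS"] == "Pending"]
print(pending)
```

In practice the text would come from running the listing command, for example with subprocess.run(..., capture_output=True, text=True).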

5. Scheduling Priority

Run:ai allows job prioritization based on policies. Higher-priority jobs can preempt lower-priority ones.

Example: Setting Priority in YAML

apiVersion: batch/v1
kind: Job
metadata:
  name: high-priority-job
  annotations:
    runai.io/project: "default"
    runai.io/priority: "high"
spec:
  template:
    spec:
      containers:
      - name: high-priority-task
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
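Preemption is easiest to picture with a small simulation. The sketch below is a toy model only: the priority levels, the Job class, and the schedule function are assumptions made for illustration, and the real scheduler's policies are richer than this.

```python
from dataclasses import dataclass

# Hypothetical priority levels; higher rank wins.
PRIORITY_RANK = {"low": 0, "normal": 1, "high": 2}

@dataclass
class Job:
    name: str
    priority: str
    gpus: int

def schedule(running, incoming, total_gpus):
    """Admit `incoming`, preempting lower-priority jobs if needed.

    Returns (new_running, preempted). If the job cannot fit even after
    preempting every lower-priority job, nothing changes.
    """
    free = total_gpus - sum(job.gpus for job in running)
    # Only strictly lower-priority jobs may be preempted, lowest first.
    victims = sorted(
        (job for job in running
         if PRIORITY_RANK[job.priority] < PRIORITY_RANK[incoming.priority]),
        key=lambda job: PRIORITY_RANK[job.priority],
    )
    preempted = []
    for victim in victims:
        if free >= incoming.gpus:
            break
        preempted.append(victim)
        free += victim.gpus
    if free < incoming.gpus:
        return list(running), []  # cannot fit; leave the cluster untouched
    kept = [job for job in running if job not in preempted]
    return kept + [incoming], preempted

running = [Job("batch-a", "low", 1), Job("batch-b", "normal", 1)]
new_running, preempted = schedule(
    running, Job("high-priority-job", "high", 1), total_gpus=2
)
print([job.name for job in new_running], [job.name for job in preempted])
```

Here the cluster is full, so the high-priority job displaces the lowest-priority job and the other one keeps running.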

6. Hybrid Cloud Integration

Run:ai supports hybrid cloud deployments. The example below configures a hybrid environment that uses both on-premises and cloud GPUs.

Example Config

  • On-premises nodes: Label them with runai.io/type=on-prem.
  • Cloud nodes: Label them with runai.io/type=cloud.

kubectl label nodes on-prem-node-1 runai.io/type=on-prem
kubectl label nodes cloud-node-1 runai.io/type=cloud

When submitting jobs, specify the node type:

runai submit cloud-job --node-type cloud
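Conceptually, the node type acts as a filter over node labels. The sketch below models that idea with an in-memory dictionary; the node names and label key are taken from the example above, while nodes_matching is a hypothetical helper rather than anything Run:ai or Kubernetes provides.

```python
# Node labels as they would look after the two `kubectl label` commands.
NODES = {
    "on-prem-node-1": {"runai.io/type": "on-prem"},
    "cloud-node-1": {"runai.io/type": "cloud"},
}

def nodes_matching(nodes, node_type, label_key="runai.io/type"):
    """Return the sorted names of nodes whose label equals `node_type`."""
    return sorted(
        name for name, labels in nodes.items()
        if labels.get(label_key) == node_type
    )

cloud_nodes = nodes_matching(NODES, "cloud")
print(cloud_nodes)
```

A job submitted with the cloud node type would only be eligible for the nodes this filter returns.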
