Changelog
July 31, 2024
Introducing the Inference Tab and GPU metric-based Autoscaling

You can now one-click deploy models in your own AWS infrastructure and autoscale GPU-enabled apps based on GPU and VRAM utilization.
Inference Tab
We’ve created a new hub where you can one-click launch inference workloads in your own AWS infrastructure. Currently supported models include Deepgram, Llama-3-8B, and Mistral-7B.
We run each inference workload on GPUs well suited to the model, using industry-standard inference engines tuned for high throughput and low latency.
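Once deployed, a model is reachable over HTTP. As a rough sketch only (the endpoint URL, auth scheme, and model name below are placeholders, and the exact payload depends on your deployment; engines such as vLLM commonly expose an OpenAI-compatible API), a request might look like:

```python
import requests

# Hypothetical values: substitute the endpoint and credentials from
# your own deployment. The payload shape assumes an OpenAI-compatible
# chat-completions API, which many inference engines expose.
ENDPOINT = "https://inference.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3-8b",
        "messages": [
            {"role": "user", "content": "Summarize GPU autoscaling in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```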
GPU metric-based Autoscaling
GPU-enabled applications can now autoscale based on GPU and VRAM utilization. You can enable this from the Resources tab of your application; note that the application must be bound to a specific GPU node group.
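For intuition, this class of autoscaling keys off two signals per GPU: compute utilization and VRAM usage. The following is a minimal illustrative sketch, not how the platform implements scaling, that reads both metrics through NVIDIA's NVML bindings (`pynvml`) and applies made-up scale-up thresholds:

```python
import pynvml

# Illustrative thresholds only; the platform lets you configure the
# actual targets from the Resources tab.
SCALE_UP_GPU_UTIL = 80.0   # percent of GPU compute utilization
SCALE_UP_VRAM_UTIL = 85.0  # percent of VRAM in use

def should_scale_up() -> bool:
    """Return True if any local GPU exceeds either scale-up threshold."""
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            vram_util = 100.0 * mem.used / mem.total
            if gpu_util >= SCALE_UP_GPU_UTIL or vram_util >= SCALE_UP_VRAM_UTIL:
                return True
        return False
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print("scale up:", should_scale_up())
```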