CASE STUDY

How Toma uses Porter and Deepgram to Transform the Automotive Industry with Voice AI

Shankar Radhakrishnan
July 9, 2025
5 min read

The automotive industry has historically struggled with customer service bottlenecks, particularly around service appointments and maintenance calls. Customers often wait on hold for extended periods while dealerships juggle overwhelming call volumes with limited staff, which can result in hundreds of thousands of dollars in lost revenue per month for each dealership. That's where Toma comes in, revolutionizing how automotive dealerships handle customer interactions through voice AI.

Toma is a startup from YC's W24 batch that recently raised a $17M Series A led by a16z. They provide end-to-end voice AI solutions that act as virtual receptionists for car dealerships nationwide. Their platform handles everything from service appointment scheduling to vehicle status inquiries, achieving over 75% call resolution rates for their franchise dealership customers. With a growing customer base primarily driven by word-of-mouth, Toma has proven that voice AI can deliver genuine value in automotive customer service.

Toma initially hosted their applications on Railway, attracted by its simplicity and ease of deployments. With a small engineering team focused on building their voice AI product, they didn't want to spend precious development cycles wrangling infrastructure. But as they scaled and began handling critical customer calls for dealerships, their infra needs became more demanding.

When shared infrastructure becomes a liability

Toma began to reconsider where they were hosting for several reasons: reliability, latency, control, and scalability.

The breaking point came during a crucial demo when another user on Railway experienced a DDoS attack. Since Railway shared infrastructure across multiple customers, Toma's services went down along with the user who was targeted by the cyberattack. The timing couldn't have been worse, and when they reached out for support, Railway's customer service response was inadequate for their mission-critical use case.

"We were in the middle of important demos when everything went down. When you're handling live customer calls for dealerships, you can't afford that kind of downtime." - Anthony Krivonos, Co-founder and CTO of Toma

The incident highlighted a fundamental issue with shared infrastructure platforms: one bad actor or technical issue could impact all customers. For a voice AI company handling real-time customer interactions, this level of unpredictability wasn’t acceptable. 

Toma also needed more control over their infrastructure to optimize for the unique demands of voice AI workloads. Real-time voice processing requires consistent low latency, predictable resource allocation, automated scaling, and the ability to run specialized inference models. Railway's shared environment couldn't provide the level of control and customization Toma needed as their voice AI stack became more sophisticated. They realized they needed infra that could scale with the growing complexity of their tech stack while maintaining the reliability their customers demanded.

Managing EKS on their own?

Faced with the need to migrate quickly, Toma evaluated their options carefully. The team initially considered Amazon EKS (Elastic Kubernetes Service) since they wanted more control over their infrastructure. However, with a lean engineering team and an ambitious product roadmap ahead of them, they quickly realized that managing Kubernetes (K8s), and infra in general, themselves would consume too much engineering bandwidth.

"We considered EKS but didn't have the bandwidth to maintain infrastructure. Our team needed to focus on building our voice AI capabilities, not becoming Kubernetes experts." - Anthony Krivonos, Co-founder and CTO of Toma

The appeal of K8s was clear – it would give them the control and scalability they needed. However, they wanted to abstract away operational complexity and retain the developer experience that allowed them to focus on building their core product. 

Porter: the best of both worlds

Porter emerged as the perfect solution for Toma's needs, offering managed Kubernetes infrastructure without the need for a dedicated DevOps team. The platform struck the ideal balance between the simplicity of a traditional PaaS and the control of running in their own AWS cloud.

What impressed Toma most was the migration speed. Within just two days, they had completely migrated from Railway to Porter – a timeline that would have been impossible with a self-managed approach. This rapid migration was crucial given the urgency created by the outage they experienced.

Porter's approach solved multiple problems simultaneously. By providing a managed environment running on top of EKS, Porter gave Toma the scalability and control they needed while eliminating the operational overhead. They could even leverage their existing AWS credits to cover the cost of the underlying compute. 

Running real-time voice AI on Porter

Today, Toma runs two clusters on Porter for their production and staging environments. Their setup includes 10 applications, with canary deployments running alongside their production workloads to ensure safe releases.

They also make use of Cost Optimization, a Porter feature that efficiently bin-packs workloads across a combination of instance types, allowing users to reduce EC2 spend for application workloads by more than 50%. Toma specifically went from spending over $13k per month to ~$6k per month.


Toma's architecture reflects the complexity of real-time voice AI processing. During customer calls, Toma's system continuously transcribes speech using Deepgram's Nova 3 model, determines when customers have finished speaking through threshold detection, processes responses through multiple LLMs running in parallel, and converts responses back to speech with generative text-to-speech.

"We're doing all the orchestration ourselves. Streaming and latency is huge for us – every millisecond matters in voice interactions.” - Anthony Krivonos, Co-founder and CTO of Toma

Porter has an inference hub where users can launch inference workloads with one click. Models are deployed on the ideal GPUs, using industry-standard engines that guarantee high throughput and low latency. Day-one and day-two operations such as loading drivers, configuring autoscaling, and setting up observability are all abstracted away.

One of the key advantages of Porter has been the ability to self-host Deepgram for inference workloads. Toma runs Deepgram on a colocated GPU node to reduce latency. They’re also planning to deploy open-source models through Porter - currently, all Hugging Face models that run on vLLM are supported in Porter’s inference hub.
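To give a sense of what a colocated deployment looks like from the application side, the sketch below streams audio to a self-hosted Deepgram instance over its WebSocket listen API. The in-cluster hostname and port are hypothetical placeholders, not Toma's actual configuration; pointing at api.deepgram.com with an API key would hit the hosted service instead.

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical cluster-internal address for a self-hosted Deepgram instance.
DEEPGRAM_URL = (
    "ws://deepgram.internal:8080/v1/listen"
    "?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true"
)

async def transcribe(audio_chunks):
    async with websockets.connect(DEEPGRAM_URL) as ws:
        async def sender():
            for chunk in audio_chunks:
                await ws.send(chunk)  # raw PCM bytes
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                alts = result.get("channel", {}).get("alternatives") or [{}]
                transcript = alts[0].get("transcript")
                if transcript:
                    print(transcript, "(final)" if result.get("is_final") else "")

        await asyncio.gather(sender(), receiver())

if __name__ == "__main__":
    # One second of 16 kHz, 16-bit mono silence as placeholder audio.
    asyncio.run(transcribe([b"\x00" * 32000]))
```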

Porter's inference tab allows users to launch inference workloads with one click.
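For the open-source models mentioned above, vLLM exposes an OpenAI-compatible HTTP API, so a model launched through the inference hub can be queried with the standard OpenAI client. The endpoint and model name below are hypothetical placeholders rather than anything from Toma's stack.

```python
from openai import OpenAI  # pip install openai

# Hypothetical in-cluster endpoint for a model served by vLLM; self-hosted
# vLLM servers typically don't validate the API key.
client = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever Hugging Face model is deployed
    messages=[
        {"role": "system", "content": "You are a service scheduling assistant for a dealership."},
        {"role": "user", "content": "Can I get an oil change on Saturday morning?"},
    ],
)
print(response.choices[0].message.content)
```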

The managed nature of Porter's platform has been particularly valuable as Toma's team has grown from just Anthony to four engineers. Rather than dedicating resources to DevOps, the entire team can focus on improving their voice AI capabilities and expanding their service offerings.

How Deepgram beat Whisper

At first, Toma ran self-hosted Whisper for speech recognition but found significant limitations in accuracy and latency.

"We were using self-hosted Whisper a year ago, but transcription was sometimes missed and VAD detection wasn't as strong. We heard that latency was low for Deepgram, and that made all the difference for us." - Anthony Krivonos, Co-founder and CTO of Toma

The migration to Deepgram's Nova 3 model delivered immediate improvements in both accuracy and speed. For automotive dealership calls, where domain-specific terminology is common, Deepgram's keyword prompting capabilities proved particularly valuable. The ability to prime the model with automotive-specific vocabulary significantly improved transcription accuracy.
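Deepgram surfaces this kind of vocabulary priming through query parameters on the request (keyterm prompting on Nova 3, keyword boosting on earlier models). The snippet below sketches how such a request URL might be assembled; the vocabulary list is illustrative, not Toma's actual terms, and the exact parameters should be checked against Deepgram's docs for the model in use.

```python
from urllib.parse import urlencode

# Illustrative automotive vocabulary; Toma's actual term list isn't public.
KEYTERMS = ["powertrain", "serpentine belt", "multi-point inspection", "loaner vehicle"]

params = [
    ("model", "nova-3"),
    ("encoding", "linear16"),
    ("sample_rate", "16000"),
] + [("keyterm", term) for term in KEYTERMS]

# Works against the hosted API or a self-hosted endpoint (hostname hypothetical).
url = "ws://deepgram.internal:8080/v1/listen?" + urlencode(params)
print(url)
```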

Toma currently runs Deepgram Nova 3 on-premises through Porter, with plans to participate in Deepgram's private beta for their upcoming streaming-optimized model. The combination of Porter's infrastructure management and Deepgram's low-latency speech recognition creates a powerful foundation for real-time voice AI applications.

Toma runs Deepgram Nova 3 on-premises through Porter, colocated with their application workloads for minimal latency.

Results that matter: 75%+ resolution rates

Toma’s call resolution rates now exceed 75% for several customers, with unresolved cases escalated to human agents. This level of automation allows dealerships to handle more customer interactions with existing staff while improving response times, ultimately resulting in millions of dollars in revenue gained.

The reliability improvements have been equally significant for dealership customers, who depend on consistent availability of customer service.

By reducing latency through their Porter and Deepgram setup, Toma has made their voice AI conversations flow more naturally. The combination of low-latency speech recognition, efficient orchestration, and optimized text-to-speech generation creates voice interactions that feel responsive and seamless to end users.

What's next: outbound calls and more self-hosted LLMs

Toma's infrastructure foundation on Porter positions them well for ambitious expansion plans. The company is preparing to launch outbound calling capabilities that will allow dealerships to proactively reach customers for trade-in valuations, purchase pre-qualification, and service reminders.

The team is also exploring more self-hosted LLM deployments to further reduce latency and increase control over their language processing capabilities. Porter makes these advanced deployments straightforward, allowing Toma to experiment with cutting-edge models - all without any infrastructure constraints. As Toma continues to grow their customer base among franchise dealerships, their infrastructure on Porter scales seamlessly to handle increased call volumes and more sophisticated voice AI workloads.
