SambaNova Launches AI Inference Cloud Platform

PALO ALTO, CA — Sept. 10th, 2024 — AI chips and models company SambaNova Systems announced SambaNova Cloud, an AI inference service powered by its SN40L AI chip.

The company said developers can log on for free via an API today — no waiting list — and create their own generative AI applications using both the largest and most capable model, Llama 3.1 405B, and the lightning-fast Llama 3.1 70B. SambaNova Cloud runs Llama 3.1 70B at up to 580 tokens per second (t/s) and 405B at over 100 t/s at full precision.

“SambaNova Cloud is the fastest API service for developers. We deliver world record speed and in full 16-bit precision — all enabled by the world’s fastest AI chip,” said Rodrigo Liang, CEO of SambaNova Systems. “SambaNova Cloud is bringing the most accurate open source models to the vast developer community at speeds they have never experienced before.”

This year, Meta launched Llama 3.1 in three sizes: 8B, 70B, and 405B. The 405B model is the crown jewel for developers, offering a highly competitive alternative to the best closed-source models from OpenAI, Anthropic, and Google. Meta’s Llama 3.1 models are the most popular open-source models, and Llama 3.1 405B is the most intelligent, according to Meta, offering flexibility in how the model can be used and deployed.

“Competitors are not offering the 405B model to developers today because of their inefficient chips. Providers running on Nvidia GPUs are reducing the precision of this model, hurting its accuracy, and running it at almost unusably slow speeds,” continued Liang. “Only SambaNova is running 405B, the best open-source model created, at full precision and at well over 100 tokens per second.”

Llama 3.1 405B is an extremely large model, the largest frontier open-weights model released to date. Its size makes it costly and complex to deploy, and it is served more slowly than smaller models. SambaNova’s SN40L chips reduce this cost and complexity compared to Nvidia H100s and, by serving the model at higher speeds, lessen the speed trade-off.
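To give a rough sense of why model size and precision drive deployment cost, the sketch below estimates the memory needed for the weights alone. It is a back-of-the-envelope illustration, not a description of how any provider deploys the model: the 80 GB per-device figure is an assumption (the common H100 memory size), `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache, and any other runtime overhead.

```python
import math


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_405B = 405e9      # parameter count of Llama 3.1 405B
DEVICE_MEMORY_GB = 80    # assumption: one accelerator with 80 GB of memory

for bits in (16, 8):
    gb = weight_memory_gb(PARAMS_405B, bits)
    # Lower bound only: real deployments need extra room beyond the weights.
    min_devices = math.ceil(gb / DEVICE_MEMORY_GB)
    print(f"{bits}-bit weights: {gb:.0f} GB, at least {min_devices} devices")
```

Under these assumptions, halving the precision halves the weight footprint, which is the trade-off between cost and accuracy that the paragraph above refers to.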

“Agentic workflows are delivering excellent results for many applications. They need to process a large number of tokens to generate the final result. This makes fast token generation critical. The best open weights model today is Llama 3.1 405B, and SambaNova is the only provider running this model at 16-bit precision and at over 100 tokens/second. This impressive technical achievement opens up exciting capabilities for developers building with LLMs,” stated Andrew Ng, a recognized leader in AI, Founder of DeepLearning.AI, Founder & CEO of Landing AI, General Partner at AI Fund, Chairman and Co-Founder of Coursera and an Adjunct Professor at Stanford University’s Computer Science Department.
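The link between token throughput and agentic workloads can be illustrated with simple arithmetic. The sketch below uses the throughput figures quoted in this announcement (580 t/s for Llama 3.1 70B and 100 t/s for 405B) and a purely hypothetical workload of 10 sequential steps that each generate 500 output tokens. It counts generation time only and ignores prompt processing, network latency, and tool calls.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate the given number of output tokens at a fixed rate."""
    return output_tokens / tokens_per_second


# Hypothetical agentic workload (illustrative numbers, not from the announcement).
STEPS = 10
TOKENS_PER_STEP = 500

# Throughput figures as quoted in the announcement.
RATES = {
    "Llama 3.1 70B at 580 t/s": 580.0,
    "Llama 3.1 405B at 100 t/s": 100.0,
}

for label, rate in RATES.items():
    # Steps run one after another, so their generation times add up.
    total = STEPS * generation_seconds(TOKENS_PER_STEP, rate)
    print(f"{label}: about {total:.1f} s of generation time")
```

Because the steps run in sequence, any change in tokens per second scales the total wait proportionally, which is why generation speed matters more for multi-step workflows than for a single response.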

Llama 3.1 70B is considered the highest fidelity model for agentic AI use cases, which require high speeds and low latency. Its size makes it suitable for fine-tuning, producing expert models that can be combined in multi-agent systems suitable for solving complex tasks.

SambaNova Cloud is the first platform that allows developers to run their expert models at speeds up to 580 t/s and build agentic applications that run at unparalleled speed.

“Artificial Analysis has independently benchmarked SambaNova as achieving record speeds of 115 output tokens per second on their Llama 3.1 405B cloud API endpoint. This is the fastest output speed available for this level of intelligence across all endpoints tracked by Artificial Analysis, exceeding the speed of the frontier models offered by OpenAI, Anthropic, and Google. SambaNova’s Llama 3.1 endpoints will support speed-dependent AI use-cases, including for applications that require real-time responses or leverage agentic approaches to using language models,” said George Cameron, Co-Founder at Artificial Analysis.

“As a leading proponent of interactive AI-powered Sales Enablement SaaS solutions, Bigtincan is excited to partner with SambaNova. With SambaNova’s impressive performance, we can achieve up to 300% increased efficiency in Bigtincan SearchAI, enabling us to run the most powerful open-source models like Llama in all its configurations and agentic AI workflows with unparalleled speed and effectiveness,” said David Keane, CEO of Bigtincan Solutions, an ASX-listed SaaS company.

SambaNova’s Fast API has seen rapid adoption since its launch in early July. With SambaNova Cloud, developers can bring their own checkpoints, switch quickly between Llama models, automate workflows using a chain of AI prompts, and use existing fine-tuned models with fast inference speed. It will quickly become the go-to inference solution for developers who demand the power of 405B, total flexibility, and speed.

SambaNova Cloud is offered in three tiers: Free, Developer, and Enterprise.

  • The Free Tier (available today): offers free API access to anyone who logs in today
  • The Developer Tier (available by end of 2024): enables developers to build with higher rate limits on the Llama 3.1 8B, 70B, and 405B models
  • The Enterprise Tier (available today): provides enterprise customers with the ability to scale with higher rate limits to power production workloads

SambaNova Cloud’s impressive performance is made possible by the SambaNova SN40L AI chip. With its unique, patented dataflow design and three-tier memory architecture, the SN40L can power AI models faster and more efficiently.