Redwood City, CA – FriendliAI, an AI inference platform company, announced a partnership with NVIDIA to launch the Nemotron 3 model family on FriendliAI’s Dedicated Endpoints, where developers can deploy the models directly on its inference platform.
Highlights include:
- Up to 13× faster token generation via a hybrid Mamba-Transformer MoE architecture and the multi-token prediction (MTP) technique
- MoE routing for reduced compute load and real-time latency
- Leading accuracy on SWE-bench, GPQA Diamond, AIME 2025, Humanity's Last Exam, IFBench, RULER, and Arena Hard
- Fully open weights, datasets, and recipes for maximum transparency and control
“The combination of NVIDIA’s Nemotron 3 Nano and FriendliAI’s platform represents a milestone in unlocking the promise of agentic AI,” said Byung-Gon Chun, Founder and CEO of FriendliAI. “Efficient, affordable inference is fundamental to deploying agentic AI at scale, and our commitment to performance and scalability makes that possible.”
NVIDIA’s Nemotron 3 is a family of reasoning models designed for agentic AI and reasoning-intensive applications in fields such as software development, retail, finance, and cybersecurity. The fully open MoE small language model is purpose-built to deliver exceptional reasoning performance while maintaining the efficiency required for production use.
Inference speed is crucial for agentic AI because it enables real-time interaction, scalability, and cost efficiency.
The company said that, running Nemotron 3, FriendliAI delivers:
- Faster performance with optimized GPU kernels
- More efficient MoE serving with online quantization and speculative decoding
- Predictable latency and autoscaling for traffic spikes
- GPU cost savings of more than 50 percent on Dedicated Endpoints
- OpenAI-compatible APIs for easy integration
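Because the endpoints follow the OpenAI API convention, existing client code can be pointed at them with only a URL and key change. The sketch below builds a request in the standard OpenAI chat-completions format using only the Python standard library; the base URL, key, and model name are placeholders, not documented FriendliAI values.

```python
# Minimal sketch of targeting an OpenAI-compatible chat-completions
# endpoint. BASE_URL, API_KEY, and the model name are placeholders --
# substitute the values from your own Dedicated Endpoint.
import json
import urllib.request

BASE_URL = "https://example.invalid/v1"  # placeholder endpoint URL
API_KEY = "YOUR_API_KEY"                 # placeholder credential

def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The resulting request body uses the standard OpenAI schema, so any
# OpenAI-compatible SDK or endpoint can consume the same structure.
req = build_chat_request("nemotron-3-model-id", "Summarize this bug report.")
print(json.loads(req.data)["messages"][0]["role"])
```

The same payload shape works with the official OpenAI Python SDK by setting its `base_url` and `api_key` parameters to the endpoint's values, which is typically how "OpenAI-compatible" integration is done in practice.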
“The combination of cost efficiency and speed has positioned FriendliAI as a compelling solution for enterprises seeking to optimize their AI infrastructure investments,” added Chun.