NVIDIA Jarvis AI SDK Fuses Vision, Speech, and Other Sensors into One System

“The NVIDIA Jarvis SDK offers a complete workflow to build, train and deploy GPU-accelerated AI systems that can use visual cues such as gestures and gaze along with speech in context. For example, lip movement can be fused with speech input to identify the active speaker. Gaze can be used to understand whether the speaker is engaging the AI agent or other people in the scene. Such multi-modal fusion enables simultaneous multi-user, multi-context conversations with the AI agent that require a deeper understanding of context.”
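To make the fusion idea concrete, the sketch below shows one simple way such cues could be combined: a weighted late fusion of per-speaker lip-motion and voice-activity scores to pick the active speaker, followed by a gaze check to decide whether that speaker is addressing the agent. All names, weights, and thresholds here are hypothetical illustrations and are not part of the Jarvis API.

```python
# Illustrative sketch only: the class, function, weights, and thresholds below
# are hypothetical and are not Jarvis API calls. They show the general idea of
# fusing visual lip-motion scores with acoustic voice activity to identify the
# active speaker, then using gaze to judge whether the agent is being addressed.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerObservation:
    speaker_id: str
    lip_motion: float      # visual lip-movement score in [0, 1]
    voice_activity: float  # acoustic voice-activity score in [0, 1]
    gaze_on_agent: float   # estimated probability the speaker's gaze is on the agent


def fuse_active_speaker(
    observations: List[SpeakerObservation],
    visual_weight: float = 0.5,
    threshold: float = 0.5,
) -> Optional[SpeakerObservation]:
    """Return the speaker with the highest fused speech score, or None if nobody is speaking."""
    best: Optional[SpeakerObservation] = None
    best_score = threshold
    for obs in observations:
        # Weighted late fusion of the visual and acoustic cues.
        score = visual_weight * obs.lip_motion + (1.0 - visual_weight) * obs.voice_activity
        if score > best_score:
            best, best_score = obs, score
    return best


if __name__ == "__main__":
    # One video/audio frame with two people in the scene.
    frame = [
        SpeakerObservation("alice", lip_motion=0.9, voice_activity=0.8, gaze_on_agent=0.95),
        SpeakerObservation("bob", lip_motion=0.1, voice_activity=0.3, gaze_on_agent=0.20),
    ]
    speaker = fuse_active_speaker(frame)
    if speaker is not None:
        addressing_agent = speaker.gaze_on_agent > 0.5
        print(f"Active speaker: {speaker.speaker_id}; addressing agent: {addressing_agent}")
```

In a real multi-user setting this kind of per-frame decision would be smoothed over time, but even this simple fusion shows how a visual cue can disambiguate who is speaking and whether the utterance is directed at the agent.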