02/03/2026
RoboBrain-Audio is a native full-duplex omnimodal interaction system with lifelong memory. It supports simultaneous listening and speaking, interruption during question-and-answer exchanges, and personalized interaction based on users' information and social relationships. It is particularly well suited to the demands placed on embodied agents, such as identity recognition, continuous interaction, handling interruptions, and rapid response.
RoboBrain-Audio achieves full-duplex spoken dialog at the 7B model scale. Trained on approximately 1 million hours of paired audio-text data, only about 1% of the data volume used by existing large-scale audio foundation models, it still matches or even outperforms other models of the same type. In contrast to the time-division multiplexing (TDM) architecture commonly adopted by traditional spoken dialog models, its native full-duplex architecture reduces response latency to around 80 milliseconds.
The Impact of TDM and Native Full-Duplex Technology on the Responsiveness of Embodied Interaction Systems
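To make the responsiveness contrast concrete, here is a toy timing model in Python. It is not RoboBrain-Audio's implementation, and the source does not describe the internals of either architecture. The sketch assumes a time-division system that alternates speak and listen slots of a fixed length and only hears the user during listen slots, and a full-duplex system that processes input and output together in fixed frames. The function names, the 1000 ms slot length, and the 80 ms frame length are all illustrative assumptions; the frame length is chosen only to echo the latency figure quoted above.

```python
def tdm_reaction_ms(interrupt_at_ms: int, slot_ms: int) -> int:
    """Delay before a time-division system can notice an interruption.

    Assumed toy model: speak and listen slots of `slot_ms` alternate,
    starting with a speak slot at t = 0, and user audio is only
    processed during listen slots.
    """
    position = interrupt_at_ms % (2 * slot_ms)
    if position < slot_ms:
        # The interruption falls inside a speak slot, so the system
        # must wait for the next listen slot to begin.
        return slot_ms - position
    # The system is already listening.
    return 0


def full_duplex_reaction_ms(interrupt_at_ms: int, frame_ms: int = 80) -> int:
    """Delay before a full-duplex system can notice an interruption.

    Assumed toy model: every frame handles input and output together,
    so the wait is at most the remainder of the current frame.
    """
    return frame_ms - (interrupt_at_ms % frame_ms)


if __name__ == "__main__":
    # Compare both models for interruptions at a few arbitrary times.
    for t in (100, 450, 900):
        tdm = tdm_reaction_ms(t, slot_ms=1000)
        duplex = full_duplex_reaction_ms(t)
        print(f"interrupt at {t} ms: TDM waits {tdm} ms, full-duplex waits {duplex} ms")
```

In this model the time-division delay depends on where in the speak slot the interruption lands, while the full-duplex delay is bounded by one frame regardless of what the system is doing.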
RoboBrain-Audio supports multi-user identity recognition via facial recognition, voiceprint recognition, and other means. It can also memorize users' basic information, preferences, and interpersonal relationships to build a long-term memory and a social relationship graph. The model uses an asynchronous process in which storage, retrieval, and response run in parallel, and it is designed with a human-like two-level memory system consisting of short-term and long-term memory. This design enables long-term planning and cumulative learning in an embodied (robotic/physical) environment. The model achieves a facial recognition accuracy of 98.4% and a voiceprint recognition error rate of less than 1%. In noisy environments, its personalized conversation capability attains a factual correctness rate of 87.6% and a response quality score of 8.82 out of 10. Additionally, the system's throughput exceeds 20 fps, which far surpasses the requirements of real-time voice conversation.
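The two-level memory and the parallel storage and retrieval described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the system's actual design: the class `TwoLevelMemory`, the function `handle_turn`, the bounded-queue short-term store, and the dictionary-based relationship graph are all hypothetical names and choices, and `asyncio.sleep(0)` merely stands in for slow I/O.

```python
import asyncio
from collections import defaultdict, deque


class TwoLevelMemory:
    """Toy short-term / long-term memory. All names are hypothetical."""

    def __init__(self, short_term_size: int = 8) -> None:
        # Short-term memory: a bounded window of recent utterances.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term memory: per-user facts such as preferences.
        self.facts = defaultdict(dict)
        # Social relationship graph: user -> {other user: relation}.
        self.relations = defaultdict(dict)

    async def store(self, user, utterance, facts=None, relations=None):
        await asyncio.sleep(0)  # stand-in for a slow write
        self.short_term.append((user, utterance))
        self.facts[user].update(facts or {})
        self.relations[user].update(relations or {})

    async def retrieve(self, user):
        await asyncio.sleep(0)  # stand-in for a slow lookup
        return {
            "facts": dict(self.facts.get(user, {})),
            "relations": dict(self.relations.get(user, {})),
            "recent": [text for who, text in self.short_term if who == user],
        }


async def handle_turn(memory, user, utterance, facts=None, relations=None):
    """Store the new turn and retrieve the user's context concurrently."""
    _, context = await asyncio.gather(
        memory.store(user, utterance, facts, relations),
        memory.retrieve(user),
    )
    return context


if __name__ == "__main__":
    memory = TwoLevelMemory(short_term_size=2)
    asyncio.run(handle_turn(memory, "alice", "I like tea",
                            facts={"likes": "tea"},
                            relations={"bob": "colleague"}))
    print(asyncio.run(handle_turn(memory, "alice", "What do I like?")))
```

Because storage and retrieval are issued together rather than one after the other, a response can be prepared from previously stored context without waiting for the current turn to be written, which is the point of the asynchronous design in this sketch.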