Towards Multimodal Streaming Perception: A Real-Time
Perception Scheduling Framework Based on Relevance

Huang, Dingcheng

dc.contributor.advisor	Youcef-Toumi, Kamal
dc.contributor.author	Huang, Dingcheng
dc.date.accessioned	2025-10-29T17:42:51Z
dc.date.available	2025-10-29T17:42:51Z
dc.date.issued	2025-05
dc.date.submitted	2025-06-26T14:15:10.584Z
dc.identifier.uri	https://hdl.handle.net/1721.1/163460
dc.description.abstract	In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to assist human agents appropriately and intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality and information gain in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the relevance concept and the inherent sparsity of information in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule the necessary perception modules in real time. Our experimental results demonstrate that the proposed perception scheduling framework reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose accuracy and comparable YOLO accuracy. The framework also demonstrates high keyframe accuracy, achieving rates of up to 98% in dynamic scenes. These results validate the framework’s capability to enhance real-time perception efficiency without significantly compromising accuracy, and they indicate its potential as a scalable and systematic solution for multimodal streaming perception systems in human-robot collaboration.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Towards Multimodal Streaming Perception: A Real-Time Perception Scheduling Framework Based on Relevance
dc.type	Thesis
dc.description.degree	S.M.
dc.contributor.department	Massachusetts Institute of Technology. Department of Mechanical Engineering
mit.thesis.degree	Master
thesis.degree.name	Master of Science in Mechanical Engineering

Files in this item

Name: huang-dean1231-smme-meche-2025 ...
Size: 1.306Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
