Oropharyngeal swabbing is a pre-diagnostic procedure used to test for various respiratory diseases, including COVID-19 and influenza A (H1N1). To improve testing efficiency, robots need a real-time, accurate, and robust sampling-point localization algorithm. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved performance on image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in computer vision. Inspired by this success, we propose a novel self-aligning multi-modal transformer (SAMMT) that dynamically attends to different parts of unaligned feature maps, preventing the information loss caused by perspective disparity and simplifying the overall implementation. Unlike preexisting multi-modal transformers, our attention mechanism works in image space instead of embedding space, eliminating the need for a sensor registration process. To facilitate the multi-modal task, we collected an oropharynx localization/segmentation dataset annotated by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves localization performance by 4.2% over a purely visual model and reduces the pixel-wise error rate of the segmentation task by 16.7% compared to the CNN baseline.
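To make the "attention in image space" idea concrete, the following is a minimal sketch of cross-modal attention over unaligned spatial feature maps, assuming PyTorch and two modalities (here labeled RGB and depth for illustration). The class name, shapes, and fusion details are assumptions for exposition, not the authors' actual SAMMT implementation.

```python
# Hedged sketch: cross-modal attention computed over spatial positions of
# raw feature maps, so each query pixel can attend anywhere in the other
# modality without prior sensor registration. Illustrative only.
import torch
import torch.nn as nn

class CrossModalSpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb_feat.shape
        q = self.q(rgb_feat).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.k(depth_feat).flatten(2)                  # (B, C, H'W')
        v = self.v(depth_feat).flatten(2).transpose(1, 2)  # (B, H'W', C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)   # (B, HW, H'W')
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return rgb_feat + out  # residual fusion of the two modalities

# Usage: fuse an RGB feature map with an unaligned depth feature map.
rgb = torch.randn(1, 64, 32, 32)
depth = torch.randn(1, 64, 32, 32)
fused = CrossModalSpatialAttention(64)(rgb, depth)
```

Because the attention weights span all spatial positions of the second modality, small misalignments between sensors are absorbed by the learned attention map rather than corrected by an explicit calibration step.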
This paper focuses on multi-modal Information Perception (IP) for Soft Robotic Hands (SRHs) using Machine Learning (ML) algorithms. A flexible Optical Fiber-based Curvature Sensor (OFCS) is fabricated, consisting of a Light-Emitting Diode (LED), a photosensitive detector, and an optical fiber. Bending the roughened optical fiber lowers the transmitted light intensity, which reflects the curvature of the soft finger. Combining the curvature and pressure information, multi-modal IP is performed to improve recognition accuracy. Recognition of gesture, object shape, size, and weight is implemented with multiple ML approaches, including the Supervised Learning Algorithms (SLAs) of K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Logistic Regression (LR), and the Unsupervised Learning Algorithm (un-SLA) of K-Means Clustering (KMC). Moreover, Optical Sensor Information (OSI), Pressure Sensor Information (PSI), and Double-Sensor Information (DSI) are adopted to compare the recognition accuracies. The experimental results demonstrate that the proposed sensors and recognition approaches are feasible and effective: the recognition accuracies obtained with the above ML algorithms and the three modes of sensor information exceed 85 percent for almost all combinations. Moreover, DSI is more accurate than single-modal sensor information, and the KNN algorithm with DSI outperforms the other combinations in recognition accuracy.
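The OSI/PSI/DSI comparison amounts to training the same classifier on each modality alone and on their concatenation. Below is a minimal sketch of that protocol with KNN, assuming scikit-learn and synthetic stand-ins for the optical and pressure readings; the feature dimensions, class count, and split are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch: compare single-sensor (OSI, PSI) vs double-sensor (DSI)
# recognition accuracy with KNN. Synthetic data stands in for real SRH
# sensor readings; real experiments would load measured signals instead.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 4            # e.g. four object shapes (assumed)
osi = rng.normal(size=(n_samples, 5))    # optical curvature channels
psi = rng.normal(size=(n_samples, 5))    # pressure channels
y = rng.integers(0, n_classes, size=n_samples)

def knn_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    # Hold out 30% of samples and report test-set accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# DSI simply concatenates both modalities before classification.
for name, feats in [("OSI", osi), ("PSI", psi),
                    ("DSI", np.hstack([osi, psi]))]:
    print(f"{name}: {knn_accuracy(feats, y):.3f}")
```

Swapping `KNeighborsClassifier` for SVM or Logistic Regression, or replacing the supervised split with K-Means clustering, reproduces the rest of the comparison grid described in the abstract.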