Learning to Recognize Human Actions From Noisy Skeleton Data Via Noise Adaptation

Sijie Song     Jiaying Liu     Lilang Lin     Zongming Guo

Wangxuan Institute of Computer Technology, Peking University, Beijing.



Figure 1. (a) RGB images, (b) original noisy skeletons, (c) denoised results by R-SD, (d) denoised results by PE-AE, (e) denoised results by G-SD, (f) adapted results by G-NAN. The skeletons are slightly rotated around their torsos for better visualization. Although R-SD, PE-AE and G-SD correct the noisy skeletal joints marked by green arrows, our adapted results show more discriminative representations for action recognition, as marked by red arrows.

Abstract

Recent studies have made great progress on skeleton-based action recognition. However, most of them are developed on relatively clean skeletons, without intensive noise. We argue that models learned from relatively clean data do not generalize well to the noisy skeletons that commonly appear in the real world. In this paper, we address the challenge of recognizing human actions from noisy skeletons, which is seldom explored by previous methods. Beyond exploring this new problem, we take a new perspective to address it, i.e., noise adaptation, which avoids explicit skeleton noise modeling and reliance on skeleton ground truths. Specifically, we develop regression-based and generation-based adaptation models according to whether pairs of noisy skeletons are available. The regression-based model aims to learn noise-suppressed intrinsic feature representations by mapping pairs of noisy skeletons into a noise-robust space. When only unpaired skeletons are accessible, the generation-based model aims to adapt the features from noisy skeletons to a low-noise space by adversarial learning. To verify our proposed model and facilitate research on noisy skeletons, we collect a new dataset, the Noisy Skeleton Dataset (NSD), whose skeletons contain considerable noise and are more similar to daily-life data than those of previous datasets. Extensive experiments are conducted on the NSD, VV-RGBD and N-UCLA datasets, and the results consistently show the outstanding performance of our proposed model.

Resources

Citation

@ARTICLE{9576640,
    author={Song, Sijie and Liu, Jiaying and Lin, Lilang and Guo, Zongming},
    journal={IEEE Transactions on Multimedia},
    title={Learning to Recognize Human Actions From Noisy Skeleton Data Via Noise Adaptation},
    year={2022},
    volume={24},
    number={},
    pages={1152-1163},
}

Dataset

Our dataset aims to provide noisy skeletons that are consistent with those in the real world. The noise in skeletons is largely due to heavy occlusion caused by viewpoints. Thus, we place Microsoft Kinect V2 cameras around the actors. The horizontal angles of the cameras are -120° (side view 1), 0° (front view), and +120° (side view 2), each at a height of 120 cm. Our dataset provides simultaneous color images, depth maps, 3D joints and IR frames. We collect 1,009 untrimmed videos, each of which lasts about 1 to 2 minutes and contains about 7 action instances. In total, there are 6,952 trimmed action clips in 41 action categories. We invite 13 subjects and each subject takes part in 4 daily action videos. Some sample frames can be viewed in Fig. 1. The actors perform each action facing a random direction. As a result, the data from one of the cameras always suffer from heavy occlusion, which leads to noisy skeletons.
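As a quick sanity check on the numbers above, the short Python sketch below records the camera layout and clip counts quoted in this section and derives the average number of trimmed clips per untrimmed video. The constant and function names are illustrative assumptions and are not part of any released toolkit.

```python
# Figures quoted in the Dataset section; all names here are illustrative only.
CAMERAS = {
    "side view 1": -120,  # horizontal angle in degrees
    "front view": 0,
    "side view 2": 120,
}
CAMERA_HEIGHT_CM = 120
NUM_UNTRIMMED_VIDEOS = 1009
NUM_TRIMMED_CLIPS = 6952
NUM_ACTION_CATEGORIES = 41


def clips_per_video(num_clips: int, num_videos: int) -> float:
    """Average number of trimmed action clips per untrimmed video."""
    return num_clips / num_videos


def angular_gaps(angles):
    """Gaps in degrees between neighbouring cameras around the full circle."""
    ordered = sorted(a % 360 for a in angles)
    n = len(ordered)
    return [(ordered[(i + 1) % n] - ordered[i]) % 360 for i in range(n)]


if __name__ == "__main__":
    print(round(clips_per_video(NUM_TRIMMED_CLIPS, NUM_UNTRIMMED_VIDEOS), 2))  # 6.89
    print(angular_gaps(CAMERAS.values()))  # [120, 120, 120]
```

The average of roughly 6.89 clips per video agrees with the "about 7 action instances" stated above, and the three cameras are evenly spaced at 120° intervals around the actor.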

Figure 2. Average precision on each action category of baseline, R-NAN and G-NAN on the NSD (CS) dataset.

Method

Figure 3. Instantiations for action recognition from noisy skeletons. (a) Regression-based noise adaptation model. X1 and X2 are observed noisy skeletons for a certain action sequence. (b) Generation-based noise adaptation model. X and Z are noisy and relatively clean skeleton sequences, respectively. (Note that we omit some losses for simplicity.)
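To make the idea behind panel (a) more concrete, the sketch below shows one way a paired objective could look: a shared encoder maps both noisy observations X1 and X2 of the same action to features, and a consistency term penalizes the distance between them. This page does not give the actual losses or architecture, so the linear encoder, the mean-squared-error term, and all names here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical illustration only: encoder form, loss, and names are assumptions.
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(weights: Matrix, skeleton: Vector) -> Vector:
    """Shared linear encoder: one output feature per weight row."""
    return [sum(w * x for w, x in zip(row, skeleton)) for row in weights]


def consistency_loss(weights: Matrix, x1: Vector, x2: Vector) -> float:
    """Mean squared distance between features of two noisy views of one action."""
    f1, f2 = encode(weights, x1), encode(weights, x2)
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) / len(f1)


if __name__ == "__main__":
    # Toy encoder that ignores the third input coordinate.
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    x1 = [0.5, 1.0, 0.2]   # noisy observation 1 (flattened joint coordinates)
    x2 = [0.5, 1.0, -0.7]  # noisy observation 2; differs only in the ignored coordinate
    print(consistency_loss(W, x1, x2))  # 0.0
```

In this toy case the two observations differ only in a coordinate the encoder ignores, so their features coincide and the loss is zero. A real model would presumably train such a term jointly with a recognition loss; the figure itself notes that some losses are omitted.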