Background & Summary
A human-robot system (HRS) is a novel paradigm that comprises humans and robots working together as a unified workforce1. This paradigm is reflected in key research areas such as human-robot interaction (HRI) and human-robot collaboration (HRC)2. An HRS typically consists of fundamental modules, including perception, cognition, decision-making, and control. Among these modules, perception plays a central role. Moreover, the ability of robots to perceive task-oriented sequential manual procedures performed by humans distinguishes perception in HRS from conventional robotic perception.
Typical modalities for capturing human behaviors include RGB images, RGB-D images, videos, point clouds, and human skeletons. These modalities are often employed to track and recognize human activities. By leveraging the real-time states of human operators, task and motion planning can be implemented to enhance the collaborative capabilities of robots, thereby enabling efficient, task-oriented HRS. Notably, human motions in HRS, and in HRC in particular, are predominantly task-related, especially in collaborative assembly and disassembly.
Related datasets exist in both HRI and HRC. In HRI, MHRI3 is a multi-modal dataset containing human motions of pointing at and showing objects to a robot, while THÖR4 is a motion trajectory dataset for human-robot navigation. Similar HRI datasets also appear in other robotics-related fields such as hugging interaction5, personality and engagement6, games7, and assistive collaboration8. In HRC, the analysis of human motion is often industrial task-oriented, encompassing aspects such as body language and the task sequences involved in manipulation tasks. For instance, InHARD9 utilizes webcams and wearable motion capture devices in an HRC scenario where humans and robots operate side by side. HRI30 (ref. 10) collects data from an industrial environment in which a human collaborates with two robots, capturing human motions such as “pick up drill” and “move forward while drilling” from a wide camera view. Assembly101 (ref. 11) is a large-scale, multi-view dataset designed for sequential manual assembly tasks, primarily developed for augmented reality applications without physical robots, and therefore focuses predominantly on forearm and hand movements above a table. ATTACH12 emphasizes two-handed assembly actions performed by humans, involving 42 participants in a cabinet assembly task, with data collected using three cameras. HA-ViD13 highlights assembly knowledge from 30 participants, with video data collected from three cameras.
These datasets released in recent years have significantly contributed to HRS research. However, they still present limitations in supporting HRS for industrial tasks. Firstly, common human motion datasets, such as Kinetics14 and NTU RGB+D15, do not account for task-oriented human behaviors, especially task sequences. They typically focus on simple daily life activities, such as crossing arms or raising hands. As for HRS-related datasets, InHARD relies on wearable devices, which limits its usage because of higher hardware requirements. HRI30 focuses more on humans walking between robots while holding tools, which is less task-oriented. The advantages of Assembly101 include its large scale and multiple camera views; however, it is not related to robotic environments. ATTACH concentrates on specific manual motions but does not address task sequences, and it lacks human-robot interference such as occlusions. HA-ViD stresses that data annotation should be shared between humans and robots, yet it also lacks human-robot interference in motion. The dynamic nature of HRS often results in human-robot occlusions in practical scenarios, for example, when a moving robot partially occludes the human body. Such occlusions are seldom present in publicly available datasets of daily human activities or in existing HRS datasets, limiting their usage in practical applications.
On the other hand, human motion processing is particularly pivotal for implementing HRS. From the related work, human motion processing can be categorized into several subfields, including gesture recognition, motion classification, motion prediction, task and procedure prediction, intention recognition, autonomous control of robotic systems, and robotic task procedure generation, among others16,17. However, the limitations of existing work and their experimental conditions are evident in the following aspects:
Number of Participants. Many studies validate proposed methods with few or even a single human due to the scarcity of available participants. For practical HRS applications, the model’s ability to generalize across individuals is crucial.
Environmental Condition. Many previous publications employ relatively simple scenarios in which a human interacts with a static environment, without the presence of industrial robots. Many studies also operate only under ideal conditions, assuming high-quality data, clear vision, and the absence of occlusions. Human-robot occlusions are often neglected, resulting in methods that perform well only in controlled, ideal environments.
Hardware Barrier. Some datasets are constructed using specialized equipment like RGB-D cameras or wearable motion-capturing suits, which are impractical for large-scale deployment of models trained on such datasets in industrial settings.
Task Sequences. In numerous published studies, the categories of human motions are overly simplistic, including actions such as raising hands, crossing arms, and waving hands, which are not typically encountered in real HRC scenarios. Additionally, similar and repetitive motions, along with their corresponding task procedures, receive insufficient attention. Many datasets focus on isolated human motions rather than sequential motions within complete tasks, hindering research on robust recognition across motion category transitions.
Data Accessibility. Most studies in HRS, especially HRC, do not publicly share their datasets, making replication, benchmarking, and implementation challenging for the research community.
These limitations hinder the implementation of HRS in realistic environments and tasks. To this end, our dataset focuses on human motions in sequential assembly and disassembly tasks specially designed for HRC. It highlights samples from multiple individuals in human-robot coexisting environments with occlusions and similarity among task-oriented motion categories.
A comprehensive comparison between the proposed and existing datasets is summarized in Table 1. In detail, this dataset can be used to enhance generalization across diverse participants, incorporate multiple camera perspectives, handle real-world noise and occlusions, utilize accessible hardware setups, and focus on practical, repetitive, and sequential human motions. It includes raw video streams and human skeleton data, along with well-labeled and indexed annotations of various sequential human motions involved in assembly and disassembly tasks. Additionally, it provides complete Python scripts for video clipping, stitching, skeleton generation, data formatting, and labeling, ensuring ease of use and reproducibility. By offering these comprehensive features and being openly accessible, this dataset facilitates replication and benchmarking, thereby advancing the field of HRS, especially HRC. This dataset not only overcomes the limitations of existing datasets by providing robust, real-world scenarios with human-robot occlusions and sequential task procedures but also supports a wide range of research applications, including human motion prediction, robotic task planning, and collaborative system development. In summary, the proposed dataset has the following features.
It involved 33 individuals of different genders (F/M = 1/2), clothing, heights, and body shapes, aged 22 to 28.
Two distinct scenarios for both assembly and disassembly tasks were set, each further involving static and dynamic settings depending on whether the robot affects the environment.
Reflection of uncertainties was considered. All individuals behave according to their personal preferences, and the human body is partially blocked by the moving robot.
Easy-to-use and easy-to-produce case. In this dataset, we used 3D-printed gear system assembly and disassembly tasks with nine procedures. The gear system (from a previous SIEMENS robot learning challenge) can be easily reproduced by users, and its procedures can also be recombined or reordered.
Multiple camera views via contactless perception. Three off-the-shelf webcams were installed at different positions and heights without wearable devices.
Flexibility. The dataset includes raw videos with indexed frames, well-clipped manual procedures, and Python scripts, based on which users can reclip the videos into different procedural sequences or apply different skeleton and mesh tracking algorithms.
Multiple purposes. This dataset can support future HRS studies such as action recognition18, task sequence prediction, behavioral analysis considering uncertainties, robot task planning, robot motion planning19, turn-taking prediction in HRC, partially occluded human pose estimation, multi-source sensor fusion, etc.
Furthermore, an extensive evaluation of 13 state-of-the-art deep learning models was conducted across two practical aspects: early human motion prediction (offline) and robot task procedure generation (online). The results indicate a noticeable performance gap between offline and online settings, highlighting the need for further research to enhance online inference with smoother transitions between task procedures. Additionally, the design trade-off between model capacity and computational efficiency warrants more detailed exploration.
Methods
The proposed dataset was collected in two scenarios. Scenario A involves few human-robot occlusions, while scenario B involves more. In both scenarios, containers are placed on a workbench, with product components inside them. In scenario B, an industrial robot moves arbitrarily in front of the human operator. The two scenarios are shown in Fig. 1.
Fig. 1
Scenarios for data collection.
There are 25 participants in each scenario. Because some participants joined only a single scenario, the complete dataset contains data from 33 student volunteers aged 22 to 28. Eight of them appear only in scenario A, eight only in scenario B, and 17 in both scenarios. Among them, 11 are female and 22 are male. The height and weight distributions of all participants are shown in Fig. 2. All participants were informed about the types of data to be collected, how those data would be stored and processed, and the overall purpose of the study. All participants provided consent for data collection and public release of de-identified videos. Participants had the option to wear face masks during data collection. To ensure privacy, all raw videos in the public dataset have been anonymized by automatically blurring faces. According to the Letter of Compliance (Dnr: HS-2025-2104 KS 4.4.1) provided by the KTH Research Ethics Advisor at the KTH Research Support Office, this research does not require institutional review. Although it involves processing information about living human beings, none of the data is traceable to specific individuals, either directly or indirectly, meaning no personal data is handled. Furthermore, the research does not involve any sensitive personal information such as health data, genetic or biometric data, or information related to ethnicity, beliefs, or legal offences. It also does not involve any physical or psychological procedures on human subjects, nor does it pose any risk of harm or use of biological samples. Therefore, the study falls outside the scope of the Swedish Ethics Review Act and does not require formal ethics approval.
Fig. 2
Height and weight distribution of all the participants.
The assembly and disassembly tasks in this dataset use a gear system as the object. It contains seven components corresponding to seven task procedures. Two additional procedures are also included: moving the plate from the holder to the workbench, and moving the product installed on the plate from the workbench to the endpoint. In total, there are nine procedures in each task. The CAD model of this gear system is provided in the dataset as CAD_model.zip and can be fabricated with 3D printers.
Only off-the-shelf web cameras (Logitech C270 and HIKVISION E14a) were used for data collection, and participants were not required to wear any devices. Every video sample in this dataset was recorded simultaneously by three cameras with different perspectives, mounted at arbitrary heights. For scenario A, the cameras were located at 170 cm (left), 155 cm (middle), and 160 cm (right); for scenario B, at 165 cm (left), 155 cm (middle), and 155 cm (right). No restriction was imposed on lighting conditions, and the lighting was not coordinated across cameras.
During the assembly, participants were required to first take a plate onto which components would be installed, then take components from the containers and conduct the assembly procedures, and finally transport the assembled product with the plate to the endpoint. Disassembly was conducted in the reverse order. Each participant performed each task three times. Before data collection, every volunteer was taught how to do the assembly and disassembly tasks and given one trial to practice. Apart from the above, participants had no other rules to follow. For example, they could use either hand for any procedure, use one or both hands for grasping, and were allowed to hesitate at any time. An illustration of this protocol is shown in Fig. 3, in which anchors (the frame indices dividing two adjacent procedures) are used to divide assembly and disassembly procedures, and the red arrows and blue curve illustrate the reach-out and pull-back motion and trend, respectively. Five annotators labeled the start and end frame indices for every procedure. One of them designed the annotation protocol, led the annotation team, supervised the work of the other four annotators, and finally verified all annotations by reviewing the segmented videos based on the provided frame indices.
Fig. 3
Illustration of the protocol.
Data Records
The dataset is available at Dryad20. This dataset consists of 10,100 task procedure samples in total, that is, 5,050 samples for each scenario and 2,025 samples for each task. In each scenario, data come from 25 participants, 3 cameras, 3 repetitions of each task, 2 main tasks (assembly and disassembly), and 9 procedures. A summary of this dataset appears in Table 2. Concretely, this dataset consists of raw videos, clipped videos, and clipped 2D and 3D skeleton data.
Raw videos
The raw videos in this dataset have a resolution of 640 × 480. All videos are recorded at 30 frames per second, regardless of the camera hardware. All videos are stored as AVI files, together with TXT files indicating the frame number and timestamp from the start of the video. Each entry in the TXT files is a tuple of three elements, (i, t, T), in which i (the row index) is the index of one frame, t is the time from the start in milliseconds, and T is the clock time at which the frame was recorded. There are 900 raw videos in total, 450 for each scenario, and 225 for each task in one scenario.
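As a concrete illustration, a minimal Python sketch for reading such a frame-index file is given below. The delimiter and the exact clock-time formatting are assumptions, so they may need to be adjusted to the actual TXT files.

```python
# Minimal sketch: read a raw-video frame-index TXT file into tuples (i, t, T).
# Assumptions: one entry per line, comma- or whitespace-separated fields, and a
# clock time without embedded commas; adjust to the files in the dataset.
from pathlib import Path

def read_frame_index(txt_path):
    """Return a list of (frame_idx, ms_from_start, clock_time) tuples."""
    entries = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        parts = [p for p in line.replace(",", " ").split() if p]
        frame_idx = int(parts[0])          # i: frame index (row index)
        ms_from_start = float(parts[1])    # t: time from start in milliseconds
        clock_time = " ".join(parts[2:])   # T: recording clock time
        entries.append((frame_idx, ms_from_start, clock_time))
    return entries
```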
Task procedure anchors
Task procedure anchors are the keyframe numbers that mark the start and end of each sequential human motion in the assembly and disassembly tasks. They are stored in CSV files. In each row of such a CSV file, the first column is the name of one raw video, and the remaining columns are 18 frame anchors (the start and end frames) of the 9 task procedures in this raw video. For each task in each scenario, there is one CSV file of task procedure anchors. The anchors were annotated manually with Adobe Premiere Pro.
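The following is a minimal sketch of how such an anchor CSV can be read into (start, end) pairs per procedure. The absence of a header row and the interleaved start/end column ordering are assumptions, not a description of the provided scripts.

```python
# Minimal sketch: map each raw-video name to nine (start_frame, end_frame) pairs.
# Assumption: columns are ordered start1, end1, start2, end2, ..., start9, end9.
import csv

def read_anchors(csv_path):
    anchors = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            name = row[0]                               # raw video name
            frames = [int(v) for v in row[1:19]]        # 18 frame anchors
            anchors[name] = list(zip(frames[0::2], frames[1::2]))
    return anchors
```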
Video clips
Video clips were generated from the raw videos and the task procedure anchors using the provided Python scripts. Every raw video is clipped into 9 procedural video clips. Video clips are also in AVI format and are named based on the file name of the raw video. There are 10,100 video clips in total, 5,050 for each scenario, and 2,025 for each task in one scenario.
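A clipping step of this kind could look roughly as follows with OpenCV. This is only an illustrative sketch, not the released script; treating the end anchor as inclusive and using the XVID codec for the AVI output are assumptions.

```python
# Illustrative sketch: cut one procedure out of a raw AVI using anchor frames.
import cv2

def clip_procedure(raw_path, out_path, start_frame, end_frame, fps=30):
    cap = cv2.VideoCapture(raw_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"XVID")            # assumed AVI codec
    writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)       # jump to the start anchor
    for _ in range(start_frame, end_frame + 1):          # end anchor assumed inclusive
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)
    cap.release()
    writer.release()
```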
Skeleton frames
Skeleton frames contain the unmerged skeleton data generated by OpenPose (for 2D)21 and MMPose (for 3D)22 when taking video clips as input. We chose OpenPose for 2D skeleton generation due to its fast processing speed. We selected MMPose for 3D skeleton estimation due to its integration within the OpenMMLab ecosystem and its stronger modeling of human body priors: MMPose leverages parametric human models and applies kinematic constraints to enforce anatomical plausibility, which helps mitigate common issues such as joint floating, bone twisting, or limb penetration. End users of this dataset can switch to any skeleton generator for their own usage. For every video clip, there is a series of JSON files representing the skeleton data of each frame; it is also possible to apply other skeleton estimators to the raw videos or video clips. The skeleton frame JSON files are formatted as follows. This dataset provides skeleton data in both 2D and 3D formats. Within the 2D format, pose_keypoints_2d is heavily used, represented by (x, y, c), where x and y are the coordinate values and c is the confidence value assigned to the joint by the skeleton estimator. All 18 joints in one frame are listed in pose_keypoints_2d in the order defined by the Microsoft COCO dataset23. For the 3D skeleton, only the procedural skeleton sequences are provided. Skeleton frames are intended to make the per-frame skeleton data easier to understand, while the procedural skeleton sequences described in the following paragraph are more often used.
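For orientation, a small sketch of parsing one 2D skeleton-frame JSON into an (18, 3) array is shown below. It assumes the standard OpenPose output keys (people, pose_keypoints_2d) and a single detected person per frame; how frames without a detection should be handled is left to the user.

```python
# Minimal sketch: turn one 2D skeleton-frame JSON into an (18, 3) array of (x, y, c).
import json
import numpy as np

def load_skeleton_frame(json_path, n_joints=18):
    with open(json_path) as f:
        frame = json.load(f)
    people = frame.get("people", [])
    if not people:
        return np.zeros((n_joints, 3))                 # assumed fallback: no detection
    flat = people[0]["pose_keypoints_2d"]              # flat list x0, y0, c0, x1, ...
    return np.asarray(flat, dtype=float).reshape(n_joints, 3)
```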
Procedural skeleton sequences
A procedural skeleton sequence is a sequence of skeleton data from a single camera within one procedure of a task. One procedural skeleton sequence can be regarded as a sample, with a label indicating the category of the procedure. Procedural skeleton sequences are stored in JSON files. In the 2D skeleton JSON files, data contains the frame indices, as well as the pose and score (confidence score generated by OpenPose) data at every frame. Finally, the label and its index are stored at the end of such a JSON file, as well as in the filename. The 3D skeleton JSON files have a similar structure. Normalised procedural skeleton sequences are also provided in this dataset. The normalisation was conducted with respect to the camera resolution, so that users can train models running on cameras with different resolutions. Users can also flexibly conduct the normalisation here or in later steps, such as when building the NumPy-based files before training.
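A hedged sketch of loading one procedural 2D skeleton sequence and normalising it by the camera resolution (640 × 480 in this dataset) is given below. The exact key names (data, pose, score, label, label_index) are assumptions inferred from the description above, so they should be checked against the actual JSON files.

```python
# Minimal sketch: load a procedural 2D skeleton sequence and normalise coordinates.
# Key names are assumptions; adjust them to the real JSON structure.
import json
import numpy as np

def load_sequence(json_path, width=640, height=480):
    with open(json_path) as f:
        seq = json.load(f)
    poses, scores = [], []
    for frame in seq["data"]:
        xy = np.asarray(frame["pose"], dtype=float).reshape(-1, 2)
        xy /= np.array([width, height])                 # resolution-invariant coordinates
        poses.append(xy)
        scores.append(np.asarray(frame["score"], dtype=float))
    return np.stack(poses), np.stack(scores), seq["label"], seq["label_index"]
```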
Naming rules
All AVI raw video files and their corresponding TXT files follow the naming rule SS_PP_N_K_C, in which SS ∈ {s1, s2}, PP ∈ {p1, p2, …, p25}, N ∈ {1, 2, 3}, K ∈ {a, d}, and C ∈ {1, 2, 3} are labels for the scenario, participant number, trial number, task, and camera number, respectively. Concretely, s1 is scenario A and s2 is scenario B. As for the task, a represents the assembly task while d represents the disassembly task. Regarding cameras, camera 1 is to the right front of the human, camera 2 is to the left front of the human, and camera 3 is in the centre. All AVI video clips are named by appending a final symbol to the original raw video name, as SS_PP_N_K_C_RR, where RR ∈ {c1, c2, …, c9} indicates the procedure. Skeleton frames are named SS_PP_N_K_C_RR_F, where F is the frame index starting from 0. Procedural skeleton sequences merged from skeleton frames are also named SS_PP_N_K_C_RR.
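For convenience, a small helper that decodes a clip name such as s1_p3_2_a_1_c5 into its fields might look like the sketch below; zero-padding of the participant number is not assumed, and the helper itself is hypothetical rather than part of the released scripts.

```python
# Minimal sketch: decode a video-clip stem following the SS_PP_N_K_C_RR naming rule.
def parse_clip_name(stem):
    scenario, participant, trial, task, camera, procedure = stem.split("_")
    return {
        "scenario": scenario,                          # "s1" (scenario A) or "s2" (scenario B)
        "participant": int(participant.lstrip("p")),   # 1-25
        "trial": int(trial),                           # 1-3
        "task": "assembly" if task == "a" else "disassembly",
        "camera": int(camera),                         # 1: right front, 2: left front, 3: centre
        "procedure": int(procedure.lstrip("c")),       # 1-9
    }
```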
Technical Validation
Extensive validations have been performed on this dataset to demonstrate its practical usage quantitatively and to establish a benchmark for future research. Concretely, a variety of state-of-the-art action recognition models from the computer vision field are implemented to realize the early prediction of sequential human motions for further robot task procedure generation, inspired by previous work24 that used only the 2D skeleton data of the assembly task. These models include RGB video-based ones (C3D25, I3D26, SlowFast27, TPN28, TimeSformer29, VideoSwin30, MViT31) and skeleton-based ones (STGCN32, 2s-AGCN33, PoseC3D34, STGCNPP35). These models represent the major developments of human action recognition, e.g., from earlier Convolutional Neural Network (CNN)-based ones such as C3D and I3D to recent Transformer-based ones such as VideoSwin and MViT, and they have shown promising performance on various benchmarks in the general domain. For the skeleton-based models, Graph Convolutional Networks (GCNs) are commonly considered because they capture the topological patterns of the human body skeleton well. Both 2D skeleton data generated by OpenPose and 3D skeleton data generated by MMPose are considered as input modalities. We benchmark both video-based and skeleton-based methods in consideration of the communication bandwidth cost of robotic applications: the skeleton modality typically has a much lower data volume and requires fewer computational resources. The computer used for validation was equipped with an Intel i9-13900KF CPU, an RTX 4090 GPU, and 64 GB of RAM.
Human Motion Early Prediction
Human motion early prediction aims to recognize the task-related category of each human motion before it is completed, i.e., using a portion of video frames as the available observations. The fundamental problem is to classify the video segments based on the RGB or skeleton pose information.
The algorithmic performance for the early prediction of sequential human motions is evaluated for each of these models. Four subsets of this dataset are individually employed to report the outcomes (assembly and disassembly in scenario A and scenario B). Instead of randomly splitting each subset into training and testing parts, two practical schemes similar to those in NTU RGB+D15 are adopted: X Subject and X View. For X Subject, 70% of the participants are used for training while the remaining 30% are used for testing. For X View, cameras 2 and 3 are used for training while camera 1 is used for testing. Such schemes aim to fully exploit the diversity presented in this dataset, thus bridging the gap between existing datasets and real-world HRC cases. Moreover, the first 25% and 50% of each clipped human motion are provided as the available information for the models, respectively. This is challenging yet meaningful because the robot task procedures need to be generated as early as possible to ensure smooth procedural transitions between human and robot. Top-1 and Top-2 recognition accuracies are reported for each configuration. The number of floating-point operations (FLOPs) of each implemented model is also presented (in Giga) to reflect computational efficiency.
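A minimal sketch of how the two evaluation schemes and the partial-observation setting can be realised from the file names is shown below. It reuses the hypothetical parse_clip_name helper introduced earlier; the concrete assignment of participants to the 70% training split is left to the user, since only the ratio is fixed.

```python
# Minimal sketch: X Subject / X View splits and early-prediction truncation.
def split_x_subject(sample_stems, train_participants):
    """train_participants: set of participant numbers chosen for training (70%)."""
    train = [s for s in sample_stems if parse_clip_name(s)["participant"] in train_participants]
    test = [s for s in sample_stems if parse_clip_name(s)["participant"] not in train_participants]
    return train, test

def split_x_view(sample_stems):
    """Cameras 2 and 3 for training, camera 1 for testing."""
    train = [s for s in sample_stems if parse_clip_name(s)["camera"] in (2, 3)]
    test = [s for s in sample_stems if parse_clip_name(s)["camera"] == 1]
    return train, test

def observe(frames, ratio=0.25):
    """Keep only the first 25% (or 50%) of a clipped motion as the observation."""
    return frames[: max(1, int(len(frames) * ratio))]
```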
Tables 3–6 below show the results on the four subsets of this dataset, respectively. In general, scenario B is more challenging than scenario A, regardless of assembly or disassembly.
Robot Task Procedure Generation
Robot task procedure generation means generating a procedure in a task sequence for the robot to conduct collaboratively, given the current status of the human motion. It stresses the online inference performance of each implemented model, which is directly related to the efficiency of HRC. Specifically, the models are requested to execute real-time prediction (inference) once every 5 video frames with a temporal sliding window of 32 video frames on each complete task video (instead of the procedural videos used in the previous validation). The task sequence constraints (of the gear system mentioned before) for the robot task procedure generation validation are taken from previous work24.
For each of these models, two important performance-related values are reported for each atomic human task procedure, plus the average over all task procedures, in Tables 7–10. The first value (upper rows) is the temporally aggregated prediction accuracy, while the second value (lower rows) is the degree of delay of the first correct prediction. Concretely, the first value quantifies the percentage of correct predictions with respect to the overall attempts made. It differs from the algorithmic performance above in that it concerns continuous predictions as well as the overlap between successive procedures. The second value quantifies the percentage of elapsed video frames before a correct prediction first appears, reflecting how timely a meaningful robot task procedure can be generated based on human motion prediction. Tables 7–10 below present the online inference results for the challenging X View scheme with 50% observation of the complete videos. Considering the page limit, only the RGB video-based models are involved here because they generally perform better than the skeleton-based models.
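The two reported values can be computed along the lines of the following sketch, applied per procedure within a complete task video. How prediction windows are assigned to procedures at their boundaries, and the exact normalisation of the delay, are assumptions about the metric definitions rather than the authors' exact implementation.

```python
# Sketch of the two online metrics for one procedure: predictions are made every
# 5 frames over a 32-frame sliding window; "preds" holds the predicted labels of
# the windows falling inside the procedure, "gt" is its ground-truth label, and
# "n_frames" is the procedure length in frames.
def aggregated_accuracy(preds, gt):
    """Percentage of correct predictions over all prediction attempts."""
    return 100.0 * sum(p == gt for p in preds) / len(preds)

def first_correct_delay(preds, gt, n_frames, stride=5):
    """Percentage of frames elapsed before the first correct prediction."""
    for k, p in enumerate(preds):
        if p == gt:
            return 100.0 * (k * stride) / n_frames
    return 100.0                                       # never predicted correctly
```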
Usage Notes
Users can use the data from different folders for their research purposes. For instance, for each task in one scenario, users can access a raw video and its corresponding frame indices and timestamps from the folders raw video and raw video frames. Data from these two folders can be used for video processing and further data generation according to the specific needs of users, such as generating skeletons. Data in the folders procedural skeleton sequences and procedural skeleton sequences 3D contain the human skeleton sequences of every task procedure, which can be used for skeleton-based processing. Human skeletons were generated by estimators without bias or further manual tuning; that is, the raw videos are the ground truth, while the quality of the human skeletons depends on the skeleton estimator. Occlusions affect the precision of skeleton joint positions and may cause loss of joints. Users can switch to any other skeleton estimators for their research. Data in procedure_anchors.csv can be used to clip the raw videos into procedural ones. Data from the two scenarios can also be used jointly, for instance, to study the sensitivity of model parameters to environmental changes. Because this dataset provides divided procedural data, rearranging the sequences can boost diversity regarding the procedure sequence.
Data availability
The dataset has been deposited to Dryad: https://doi.org/10.5061/dryad.ncjsxkt6f.
Code availability
The following GitHub repository contains the Python scripts to operate and use this dataset: https://github.com/KTH-IPS/SD-Dataset.
References
Li, S. et al. Proactive human-robot collaboration: Mutual-cognitive, predictable, and self-organising perspectives. Robotics and Computer-Integrated Manufacturing 81, 102510 (2023).
Leng, J. et al. Industry 5.0: Prospect and retrospect. Journal of Manufacturing Systems 65, 279–295, https://doi.org/10.1016/j.jmsy.2022.09.017 (2022).
Azagra, P. et al. A multimodal dataset for object model learning from natural human-robot interaction. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 6134–6141, https://doi.org/10.1109/IROS.2017.8206514 (2017).
Rudenko, A. et al. THÖR: Human-robot navigation data collection and accurate motion trajectories dataset. IEEE Robotics and Automation Letters 5(2), 676–682, https://doi.org/10.1109/LRA.2020.2965416 (2020).
Bagewadi, K., Campbell, J. & Amor, H. B. Multimodal dataset of human-robot hugging interaction. arXiv preprint arXiv:1909.07471 (2019).
Celiktutan, O., Skordos, E. & Gunes, H. Multimodal human-human-robot interactions (mhhri) dataset for studying personality and engagement. IEEE Transactions on Affective Computing 10(4), 484–497, https://doi.org/10.1109/TAFFC.2017.2737019 (2019).
Wit, J., Krahmer, E. & Vogt, P. Introducing the nemo-lowlands iconic gesture dataset, collected through a gameful human–robot interaction. Behavior Research Methods 53(3), 1353–1370, https://doi.org/10.3758/s13428-020-01487-0 (2021).
Newman, B. A., Aronson, R. M., Srinivasa, S. S., Kitani, K. & Admoni, H. Harmonic: A multimodal dataset of assistive human-robot collaboration. The International Journal of Robotics Research 41(1), 3–11, https://doi.org/10.1177/02783649211050677 (2022).
Dallel, M., Havard, V., Baudry, D. & Savatier, X. InHARD - industrial human action recognition dataset in the context of industrial collaborative robotics. 2020 IEEE International Conference on Human-Machine Systems (ICHMS), 1–6, https://doi.org/10.1109/ICHMS49158.2020.9209531 (2020).
Iodice, F., De Momi, E. & Ajoudani, A. Hri30: An action recognition dataset for industrial human-robot interaction. 2022 26th International Conference on Pattern Recognition (ICPR), 4941–4947, https://doi.org/10.1109/ICPR56361.2022.9956300 (2022).
Sener, F. et al. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21096–21106, https://doi.org/10.1109/CVPR52688.2022.02042 (2022).
Aganian, D., Stephan, B., Eisenbach, M., Stretz, C. & Gross, H.-M. Attach dataset: Annotated two-handed assembly actions for human action understanding. 2023 IEEE International Conference on Robotics and Automation (ICRA), 11367–11373, https://doi.org/10.1109/ICRA48891.2023.10160633 (2023).
Zheng, H., Lee, R. & Lu, Y. Ha-vid: A human assembly video dataset for comprehensive assembly knowledge understanding. Advances in Neural Information Processing Systems 36, 67069–67081, https://doi.org/10.5555/3666122.3669052 (2023).
Kay, W. et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, https://doi.org/10.48550/arXiv.1705.06950 (2017).
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y. & Kot, A. C. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2684–2701, https://doi.org/10.1109/TPAMI.2019.2916873 (2020).
Zhou, H. et al. An attention-based deep learning approach for inertial motion recognition and estimation in human-robot collaboration. Journal of Manufacturing Systems 67, 97–110, https://doi.org/10.1016/j.jmsy.2023.01.007 (2023).
Rabby, M. K. M., Karimoddini, A., Khan, M. A. & Jiang, S. A learning-based adjustable autonomy framework for human-robot collaboration. IEEE Transactions on Industrial Informatics 18(9), 6171–6180, https://doi.org/10.1109/TII.2022.3145567 (2022).
Zhang, X., Yi, D., Behdad, S. & Saxena, S. Unsupervised human activity recognition learning for disassembly tasks. IEEE Transactions on Industrial Informatics 20(1), 785–794, https://doi.org/10.1109/TII.2023.3264284 (2024).
Merikh Nejadasl, A. et al. Ergonomically optimized path-planning for industrial human–robot collaboration. The International Journal of Robotics Research 43(12), 1884–1897, https://doi.org/10.1177/02783649241235670 (2024).