SimBa Dataset
A. Simoni, S. Pini, G. Borghi, R. Vezzani
SimBa is a collection of synthetic and real data for the robot pose estimation task in collaborative environments. Among several collaborative robots, we chose the Rethink Baxter, which moves to a set of random pick-n-place locations on a table while assuming realistic poses. The dataset contains RGB-D images with annotations for the 3D joints and the pick-n-place locations. The synthetic part contains over 350k annotated RGB-D images, while the real part contains over 20k annotated RGB-D frames.
Please cite with the following BibTeX:
@article{simoni2022semi,
  title={Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps},
  author={Simoni, Alessandro and Pini, Stefano and Borghi, Guido and Vezzani, Roberto},
  journal={IEEE Robotics and Automation Letters},
  year={2022},
  publisher={IEEE}
}
Download
Please follow the instructions in the README file to build the dataset correctly.
In the following, you can download each part of the dataset separately depending on your research interests:
Dataset info
Further details about the acquisition process, the image data, the annotations, and the dataset splits are reported below. Each section describes synthetic and real data separately.
Data acquisition
Synthetic
We collected data using ROS and Gazebo to simulate a synthetic environment in which the robot performs random pick-n-place motions. We recorded 2 runs with different initialization seeds. Each run contains 20 recorded sequences, each composed of 10 pick-n-place motions split equally between the left and right robot arms. Each sequence is recorded by 3 anchor cameras (center, left, right), which are randomly moved within a sphere of 1m diameter.
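The camera randomization can be pictured with a short sketch. This is an illustration only: the anchor coordinates and the uniform-sampling strategy are assumptions, not the exact procedure used during recording.

```python
import numpy as np

def sample_camera_position(anchor_xyz, diameter=1.0, rng=None):
    """Sample a camera position uniformly inside a sphere centered on an anchor camera.

    The dataset description states that each anchor camera (center, left, right)
    is randomly moved within a sphere of 1 m diameter; the uniform sampling used
    here is an assumption for illustration purposes.
    """
    if rng is None:
        rng = np.random.default_rng()
    radius = diameter / 2.0
    # Rejection sampling: draw points in the bounding cube until one falls inside the sphere.
    while True:
        offset = rng.uniform(-radius, radius, size=3)
        if np.linalg.norm(offset) <= radius:
            return np.asarray(anchor_xyz) + offset

# Hypothetical anchor position (not taken from the dataset files).
print(sample_camera_position([2.0, 0.0, 1.5]))
```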
Real
We collected data using ROS and a time-of-flight Microsoft Kinect One (second version) sensor, capturing the moving robot from 3 camera positions (center, left, right). We recorded 20 pick-n-place sequences from each camera location, split equally between the right and left arm.
Images
Synthetic
The synthetic part contains 10 FPS videos with RGB (1920x1080) and depth (512x424) frames. In total, there are 40 sequences with over 350k annotated frames.
Real
The real part contains 15 FPS videos with RGB (1920x1080) and depth (512x424) frames. In total, there are 20 sequences with over 20k annotated frames.
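As an example of how an RGB-D pair could be loaded, here is a minimal Python sketch. The file names, folder layout, and depth encoding (16-bit values in millimeters) are assumptions; check the README for the actual conventions.

```python
import cv2
import numpy as np

# Hypothetical file names: the actual naming convention is described in the dataset README.
rgb_path = "sequence_000/rgb/frame_000000.png"
depth_path = "sequence_000/depth/frame_000000.png"

# RGB frames are 1920x1080; depth frames are 512x424.
rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)          # uint8, HxWx3 (BGR)
depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)  # assumed 16-bit depth map

if rgb is None or depth is None:
    raise FileNotFoundError("Adjust the paths to match your local copy of SimBa.")

depth_m = depth.astype(np.float32) / 1000.0           # assumption: depth stored in millimeters
print(rgb.shape, depth.shape, depth_m.max())
```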
Annotations
Synthetic
0: "head", 8: "left upper shoulder", 1: "right upper shoulder", 9: "left lower shoulder", 2: "right lower shoulder", 10: "left upper elbow", 3: "right upper elbow", 11: "left lower elbow", 4: "right lower elbow", 12: "left upper forearm", 5: "right upper forearm", 13: "left lower forearm", 6: "right lower forearm", 14: "left wrist", 7: "right wrist", 15: "base"
The synthetic annotations contain information about the 3D robot joints positions, the pick-n-place locations and the camera positions. They are stored in JSON files with the structure described below:
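A minimal loading sketch is given below. The joint names and indices follow the list above, but the JSON key names ("joints_3d", etc.) and the file path are placeholders; the actual field layout is documented in the dataset README.

```python
import json

# Joint indexing as listed above (index 0 to 15).
JOINT_NAMES = [
    "head", "right upper shoulder", "right lower shoulder", "right upper elbow",
    "right lower elbow", "right upper forearm", "right lower forearm", "right wrist",
    "left upper shoulder", "left lower shoulder", "left upper elbow", "left lower elbow",
    "left upper forearm", "left lower forearm", "left wrist", "base",
]

with open("sequence_000/annotations.json") as f:  # hypothetical path
    annotation = json.load(f)

# "joints_3d" is a placeholder key: refer to the README for the actual JSON structure,
# which also includes the pick-n-place locations and the camera positions.
joints_3d = annotation.get("joints_3d", {})
for idx, name in enumerate(JOINT_NAMES):
    print(idx, name, joints_3d.get(name))
```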
Real
0: "head", 8: "left upper shoulder", 1: "right upper shoulder", 9: "left lower shoulder", 2: "right lower shoulder", 10: "left upper elbow", 3: "right upper elbow", 11: "left lower elbow", 4: "right lower elbow", 12: "left upper forearm", 5: "right upper forearm", 13: "left lower forearm", 6: "right lower forearm", 14: "left wrist", 7: "right wrist", 15: "base"
The real annotations contain information about the 3D robot joints positions and camera positions. The joints are saved as TXT files (one file per joint) while the camera info in a JSON file with the structure described below:
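The per-joint TXT files could be read as in the following sketch. The file naming (one TXT per joint, named after the joint) and the per-line layout are assumptions to be verified against the README.

```python
from pathlib import Path
import numpy as np

# Same joint indexing as the synthetic part (index 0 to 15).
JOINT_NAMES = [
    "head", "right upper shoulder", "right lower shoulder", "right upper elbow",
    "right lower elbow", "right upper forearm", "right lower forearm", "right wrist",
    "left upper shoulder", "left lower shoulder", "left upper elbow", "left lower elbow",
    "left upper forearm", "left lower forearm", "left wrist", "base",
]

sequence_dir = Path("real/sequence_00")  # hypothetical folder layout

# One TXT file per joint; the file names and the per-line layout (e.g. one x y z
# triplet per frame) are assumptions, not the documented format.
joints = {}
for name in JOINT_NAMES:
    txt_file = sequence_dir / f"{name.replace(' ', '_')}.txt"
    if txt_file.exists():
        joints[name] = np.loadtxt(txt_file)

print({name: values.shape for name, values in joints.items()})
```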
Splits
Synthetic
The synthetic data are split into 28 train sequences, 4 validation sequences, and 8 test sequences. Each sequence is sampled every 10 frames (a small subsampling sketch is shown after the table).
| Train | Validation | Test | Total |
|---|---|---|---|
| 26603 (70%) | 3842 (10%) | 7490 (20%) | 37935 |
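For reference, here is a minimal sketch of the frame subsampling described above (every 10th frame for the synthetic splits, every 5th for the real test set). The sequence-to-split assignment shown is hypothetical; the actual assignment is defined by the dataset files.

```python
def subsample(frame_paths, step=10):
    """Keep one frame every `step` frames (step=10 for synthetic, step=5 for real)."""
    return frame_paths[::step]

# Hypothetical split sizes, matching the counts reported in this section.
splits = {"train": 28, "validation": 4, "test": 8}
print(splits, "total sequences:", sum(splits.values()))

# Example: subsample a (hypothetical) list of frame paths from one sequence.
frames = [f"frame_{i:06d}.png" for i in range(100)]
print(len(subsample(frames, step=10)))  # -> 10 frames kept
```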
Real
The real sequences represent a single test split. Each sequence is sampled every 5 frames, yielding a test set of 4032 images.
Contacts
If you have any questions about our dataset, please use the issues section of the GitHub repo. Alternatively, send an e-mail to alessandro.simoni@unimore.it.
Acknowledgment
This work was supported by the XiLab Unimore, which helped us record the real sequences with the Baxter robot. We also thank Stan Birchfield and Timothy E. Lee of NVIDIA for the helpful discussions.