SuperQ-GRASP is a grasp pose estimation method based on primitive decomposition, designed specifically for grasping large objects that are uncommon in tabletop scenarios.
Abstract
Grasp planning and estimation have long been a research problem in robotics,
with two main approaches to finding graspable poses on objects: 1) the geometric approach,
which relies on 3D models of the object and the gripper to estimate valid grasp poses, and 2)
the data-driven, learning-based approach, with models trained to identify grasp poses from raw
sensor observations. The latter assumes comprehensive geometric coverage during the training phase.
However, the data-driven approach is typically biased toward tabletop scenarios and struggles to
generalize to out-of-distribution scenarios with larger objects (e.g., chairs). Additionally,
raw sensor data (e.g. RGB-D data) from a single view of these larger objects is often incomplete
and necessitates additional observations. In this paper, we take a geometric approach, leveraging
advancements in object modeling (e.g., NeRF) to build an implicit model from RGB images taken from
views around the target object. This model enables the extraction of an explicit mesh model while also
capturing the visual appearance from novel viewpoints, which is useful for perception tasks like object
detection and pose estimation. We further decompose the NeRF-reconstructed 3D mesh into superquadrics (SQs),
parametric geometric primitives, each mapped to a set of precomputed grasp poses, allowing grasps
to be composed on the target object from these primitives. Our proposed pipeline addresses two problems:
a) noisy depth and incomplete views of the object, through the modeling step, and b) generalization to objects
of any size.
We validate the performance of our pipeline on 5 different large objects
at different poses in real-world experiments using
the Spot robot from Boston Dynamics.
Video
SuperQ-Grasp
Grasp Pose Estimation
The primary contribution of this project is a grasp pose estimation method for large objects that are uncommon
in tabletop scenarios. The method decomposes the target object mesh into several primitive shapes,
predicts grasp poses for each individual primitive, and then filters out the invalid poses so that only the valid ones remain.
In this context, superquadrics are used as the primitive shapes,
and Marching Primitives is employed to decompose the target object's mesh into smaller superquadrics,
upon which grasp pose estimation is applied.
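To make this concrete, the sketch below illustrates, in Python/NumPy, the standard superquadric inside-outside function, how grasps defined on a canonical primitive can be mapped into the object frame through the fitted superquadric's pose, and a crude validity filter. The frame conventions, gripper dimensions, and filtering heuristic here are illustrative assumptions, not the exact checks used in our pipeline.

```python
import numpy as np

def superquadric_implicit(points, scale, eps):
    """Inside-outside function of a superquadric in its canonical frame.
    F < 1 inside, F == 1 on the surface, F > 1 outside.
    scale = (a1, a2, a3): axis lengths; eps = (e1, e2): shape exponents."""
    a1, a2, a3 = scale
    e1, e2 = eps
    x, y, z = np.abs(points).T
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

def compose_grasps(sq_pose, canonical_grasps):
    """Map grasp poses precomputed on a canonical primitive into the object
    frame, given the 4x4 pose of the fitted superquadric in that frame."""
    return [sq_pose @ g for g in canonical_grasps]

def filter_grasps(grasps, object_points, half_open=0.05, finger_depth=0.06):
    """Crude validity proxy: keep grasps that enclose some object surface
    between the fingers and whose palm region is collision-free. Assumes a
    gripper frame with z as the approach axis and x as the closing direction;
    the dimensions are placeholders, not the real Spot gripper geometry."""
    pts_h = np.c_[object_points, np.ones(len(object_points))]  # homogeneous points
    valid = []
    for g in grasps:
        p = (np.linalg.inv(g) @ pts_h.T).T[:, :3]  # object points in the gripper frame
        between_fingers = ((np.abs(p[:, 0]) < half_open) & (np.abs(p[:, 1]) < 0.02)
                           & (p[:, 2] > 0) & (p[:, 2] < finger_depth))
        palm_collision = ((np.abs(p[:, 0]) < half_open) & (np.abs(p[:, 1]) < 0.04)
                          & (p[:, 2] > -0.04) & (p[:, 2] < 0))
        if between_fingers.any() and not palm_collision.any():
            valid.append(g)
    return valid
```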
Real-world Experiments
We validate the performance of our pipeline on the robotic platform
Spot from Boston Dynamics.
We use Instant-NGP to construct the target object mesh.
Also, unlike with synthetic data in simulation, the object pose with respect to the gripper is unknown in advance in real-world experiments.
To deal with this, we rely on GroundingSAM
and LoFTR to estimate the object pose relative to the gripper.
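As a rough illustration of how this registration step can be wired together, the sketch below matches the live gripper-camera image against a rendered view of the reconstructed model and solves a PnP problem. Here, matcher stands in for LoFTR, mask for a GroundingSAM segmentation of the live image, and both views are assumed to share the intrinsics K; the returned transform maps the rendered view's camera frame to the live camera frame, and composing it with the known pose of the rendered view yields the object pose. This is an assumed wiring for illustration, not our exact implementation.

```python
import numpy as np
import cv2

def register_to_rendered_view(live_rgb, render_rgb, render_depth, K, mask, matcher):
    """Estimate the transform from the rendered view's camera frame to the live
    gripper-camera frame via 2D-3D correspondences and PnP + RANSAC.

    matcher(live, render) -> (kpts_live, kpts_render): pixel coordinates (N, 2);
    mask: boolean object mask on live_rgb; K: shared 3x3 pinhole intrinsics;
    render_depth: depth (meters) of the rendered view."""
    kpts_live, kpts_render = matcher(live_rgb, render_rgb)

    # Keep only matches that land on the segmented object in the live image.
    keep = mask[kpts_live[:, 1].astype(int), kpts_live[:, 0].astype(int)]
    kpts_live, kpts_render = kpts_live[keep], kpts_render[keep]

    # Back-project rendered keypoints to 3D in the rendered camera's frame.
    u, v = kpts_render[:, 0].astype(int), kpts_render[:, 1].astype(int)
    z = render_depth[v, u]
    ok = z > 0
    pts3d = np.stack([(u[ok] - K[0, 2]) * z[ok] / K[0, 0],
                      (v[ok] - K[1, 2]) * z[ok] / K[1, 1],
                      z[ok]], axis=1)

    # PnP with RANSAC recovers the rendered-camera-to-live-camera transform;
    # composing it with the known render pose yields the object pose.
    found, rvec, tvec, _ = cv2.solvePnPRansac(
        pts3d.astype(np.float32), kpts_live[ok].astype(np.float32), K, None)
    if not found:
        return None
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```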
Results
Experiments on synthetic data
We create a dataset of 20 objects comprising 15 synthetic objects (3 chairs, 3 carts, 2 buckets, 2 boxes, 2 suitcases,
2 tables, and 1 folding chair) selected from PartNet-Mobility,
and 5 real-world objects
(2 chairs, 1 vacuum cleaner, 1 suitcase, and 1 table). These objects represent common large objects encountered daily
and cover a diverse range of geometrical structures.
We establish two baselines to capture variations in how
Contact-GraspNet
can be employed for grasp pose estimation, allowing for comparison with our
SuperQ-GRASP method: 1) CG+Mesh: This baseline applies Contact-GraspNet to the point cloud extracted from the complete 3D mesh of the target object;
2) CG+Depth: This baseline applies Contact-GraspNet to the point cloud obtained from a single-view depth image as seen by the robot's gripper camera.
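For reference, the sketch below shows how the two baseline inputs would typically be prepared: uniform surface sampling of the complete mesh for CG+Mesh (via trimesh here) versus back-projection of a single-view depth image for CG+Depth. The point count, the metric depth assumption, and the pinhole intrinsics K are our illustrative choices; Contact-GraspNet is then run on either point cloud.

```python
import numpy as np
import trimesh

def pointcloud_from_mesh(mesh_path, n_points=20000):
    """CG+Mesh input: sample the complete object mesh uniformly by surface area."""
    mesh = trimesh.load(mesh_path, force='mesh')
    return np.asarray(mesh.sample(n_points))

def pointcloud_from_depth(depth, K):
    """CG+Depth input: back-project a single-view depth image (in meters) into
    the camera frame using pinhole intrinsics K, keeping valid pixels only."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x[valid], y[valid], depth[valid]], axis=1)
```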
Compared to the two baseline methods, our pipeline predicts more stable grasp poses in the region
closer to the camera, which is also the starting pose of the gripper for the Spot robot in our case.
Also, the predicted grasp poses are more concentrated in a specific region.
Qualitative Results on selected Objects
(Click the videos to open them in a new tab, if you want to see them more clearly)
NOTE:
Red: invalid grasp poses; Green: valid grasp poses
Blue dots: observed depth point cloud as a partial view of the object
Real-world Experiments
To validate the performance of our pipeline in real-world scenarios, we place each of the 5 real-world objects
at a specified location with an arbitrary orientation.
The Boston Dynamics Spot robot is then tasked with
estimating the object's pose, identifying a graspable pose,
and executing a reach-and-grasp action.
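A hypothetical end-to-end trial could chain these components as sketched below; the callables passed in (observe, estimate_object_pose, move_gripper_to, close_gripper) are placeholders for the perception and Spot control pieces, not actual SDK calls.

```python
import numpy as np

def run_reach_and_grasp(observe, estimate_object_pose, grasps_in_object_frame,
                        move_gripper_to, close_gripper):
    """One trial: estimate the object pose, pick the valid grasp closest to the
    gripper camera (the gripper's starting pose), then reach and grasp."""
    rgb, depth, K = observe()                          # gripper-camera observation
    T_cam_obj = estimate_object_pose(rgb, depth, K)    # object pose in camera frame
    # Express the object-frame grasp poses in the camera frame.
    grasps_cam = [T_cam_obj @ g for g in grasps_in_object_frame]
    # Prefer the grasp whose position is closest to the camera origin.
    target = min(grasps_cam, key=lambda g: np.linalg.norm(g[:3, 3]))
    move_gripper_to(target)                            # reach
    close_gripper()                                    # grasp
    return target
```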
Our pipeline demonstrates a higher success rate across four test
objects (two chairs, a vacuum cleaner, and a table),
highlighting its capability to estimate valid grasp poses
for larger objects with complex geometries, including high-genus objects like chairs.
Here are the demonstrations.
Real-world Experiment examples
Additional Results
Results on small objects
In addition, we show that our pipeline achieves competitive performance compared to the two baseline methods on
small synthetic objects that are typically placed on a tabletop. The mesh models of the objects are taken from
PartNet-Mobility.
Qualitative Results on small Objects
(Click the videos to open them in a new tab, if you want to see them more clearly)
NOTE:
Red: invalid grasp poses; Green: valid grasp poses
Blue dots: observed depth point cloud as a partial view of the object
Custom graspable region
We also demonstrate that our pipeline can allow the user to select the custom graspable
region. For each individual superquadric at the edge of the object (labeled in different
colors and associated with their own indices) , it can be regarded
as one potential graspable region , where grasp poses can be generated and evaluated. The
user can also use the index of the superquadric directly to select the desirable graspable
region to generate valid grasp poses, depending on the downstream tasks.
Custom graspable region selection examples
By default, the pipeline selects the superquadric closest to the
current gripper pose as the graspable region for generating grasp poses:
If specified, the user can also select a custom superquadric as the desired graspable region
for generating grasp poses (in this example, the user wants to grasp one edge of the back of the chair,
so the index of that superquadric, which is 54, is fed to the pipeline).
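The selection logic itself is simple; below is a minimal sketch, assuming the fitted superquadric centers and the current gripper position are expressed in the same frame (using centers rather than closest surface points is a simplification for illustration).

```python
import numpy as np

def select_graspable_region(sq_centers, gripper_position, sq_index=None):
    """Choose which superquadric to grasp.
    sq_centers: (N, 3) centers of the fitted superquadrics;
    gripper_position: (3,) current gripper position in the same frame;
    sq_index: optional user-specified index (e.g. 54 for the chair-back edge)."""
    if sq_index is not None:
        return sq_index                       # user override for downstream tasks
    distances = np.linalg.norm(sq_centers - gripper_position, axis=1)
    return int(np.argmin(distances))          # default: closest superquadric
```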