🧩 CoRL 2025 Workshop
1 Reordered list
Ranked by relevance to dexterous hands and tactile sensing
- DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
- Vision-Free Object 6D Pose Estimation for In-Hand Manipulation via Multi-Modal Haptic Attention
- Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks
- ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
- TacDexGrasp: Compliant and Robust Dexterous Grasping with QP and Tactile Feedback
- Tactile Memory with Soft Robot: Tactile Retrieval-based Contact-rich Manipulation with a Soft Wrist
- DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
- Suction Leap-Hand: Suction Cups on a Multi-fingered Hand Enables Embodied Dexterity and In-Hand Teleoperation
- mimic-one: A Scalable Model Recipe for General Purpose Robot Dexterity
- FunGrasp: Functional Grasping for Diverse Dexterous Hands
- Latent Action Diffusion for Cross-Embodiment Manipulation
- Scaling Cross-Embodiment World Models for Dexterous Manipulation
- EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-Rich Tasks
- FLASH: Flow-Based Language-Annotated Grasp Synthesis for Dexterous Hands
- Way-Tu: A Framework for Tool Selection and Manipulation Using Waypoint Representations
- HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Bimanual Dexterous Manipulation
2 All 16 Accept-Spotlight papers
2.1 ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
Short summary: Proposes a cross-modal transformer that fuses vision + tactile using cross-attention and an autoregressive tactile prediction head; training uses a curriculum from ground-truth to predicted tactile inputs to stabilize representation learning for contact-rich manipulation. (OpenReview)
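To make the fusion mechanism concrete, here is a minimal sketch of bidirectional cross-attention between visual and tactile token streams in PyTorch; the module, token counts, and dimensions are illustrative assumptions, not the ViTacFormer implementation.
```python
# Minimal sketch of vision-tactile cross-attention fusion (illustrative only;
# not the ViTacFormer code). Assumes both modalities are already tokenized.
import torch
import torch.nn as nn

class VisuoTactileCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Tactile tokens attend to visual tokens, and vice versa.
        self.tac_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_tac = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, Nv, dim), tac_tokens: (B, Nt, dim)
        v_attn, _ = self.vis_to_tac(vis_tokens, tac_tokens, tac_tokens)
        t_attn, _ = self.tac_to_vis(tac_tokens, vis_tokens, vis_tokens)
        vis_fused = self.norm_v(vis_tokens + v_attn)   # residual + norm
        tac_fused = self.norm_t(tac_tokens + t_attn)
        return vis_fused, tac_fused

# Example usage with random tokens:
fusion = VisuoTactileCrossAttention()
vis = torch.randn(2, 196, 256)   # e.g. ViT patch tokens
tac = torch.randn(2, 32, 256)    # e.g. taxel-patch tokens
vis_f, tac_f = fusion(vis, tac)
```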
Questions:
- How sensitive is performance to the tactile sensor quality/noise distribution used at training time?
- Which cross-attention design choices (layers, heads) mattered most in ablations?
- Can the model operate when tactile and vision are intermittently unavailable (e.g., occlusion / sensor dropout)? Any experiments?
- Do you freeze the visual backbone or fine-tune it jointly, and which worked better?
- How does the learned representation transfer to new tasks/objects not seen in training?
- What’s the compute/latency at inference — suitable for real-time control?
2.2 Way-Tu: A Framework for Tool Selection and Manipulation Using Waypoint Representations
Short summary: Introduces a waypoint-based representation and pipeline for selecting and manipulating tools — combining learned waypoint predictors with motion optimization to perform tool use tasks robustly. (OpenReview)
Questions:
- How are waypoints represented (Cartesian, relative frames, keyframes) and why that choice?
- How robust is tool selection when the perceived affordance is noisy or partially occluded?
- Did you compare direct end-to-end policy vs waypoint + optimizer — tradeoffs in sample efficiency & robustness?
- How do you handle tool dynamics (e.g., flexible tools) in planning?
- Can the same waypoint representation generalize across different robot embodiments?
- What failure cases are common — poor grasp, imprecise waypoint timing, optimizer convergence?
2.3 HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Bimanual Dexterous Manipulation
Short summary: HERMES provides a unified RL and sim2real pipeline that converts heterogeneous human motion sources into physically plausible mobile bimanual robot behaviors; it includes depth-image-based sim2real transfer and closed-loop localization for mobile dexterous tasks. (OpenReview)
Questions:
- How do you align heterogeneous human motion data (different capture setups) before training?
- What components most reduce the sim2real gap (depth transfer, domain randomization, etc.)? Any ablations?
- How do you integrate navigation and manipulation timing reliably in mobile setups?
- Does the policy exploit human kinematic priors or learn purely from RL?
- How sample-efficient is the approach and how much human data is needed?
- Any limits when transferring to different robot hand kinematics / DoF?
2.4 Scaling Cross-Embodiment World Models for Dexterous Manipulation
Short summary: Proposes particle-based world models that represent both human and robot embodiments as particle sets and define actions as particle displacements — enabling unified world models that scale to multiple embodiments and support cross-embodiment control. (OpenReview)
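As a rough illustration of the core idea that states are particle sets and actions are per-particle displacements, here is a simplified forward model in PyTorch; the mean-pooled context and plain MLP layers are assumptions chosen for readability, not the paper's architecture.
```python
# Illustrative particle-based world model: the state is a particle set, the
# action is a displacement of the embodiment (hand) particles. This is a
# simplified MLP stand-in, not the paper's model.
import torch
import torch.nn as nn

class ParticleWorldModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(3 + 3 + 1, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))
        self.decode = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 3))  # predicted displacement

    def forward(self, particles, action_disp, is_hand):
        # particles: (B, N, 3); action_disp: (B, N, 3), zeros for object particles
        # is_hand: (B, N, 1) flag marking embodiment vs object particles
        feats = self.encode(torch.cat([particles, action_disp, is_hand], dim=-1))
        context = feats.mean(dim=1, keepdim=True).expand_as(feats)  # global pooling
        delta = self.decode(torch.cat([feats, context], dim=-1))
        return particles + delta  # next-step particle positions
```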
Questions:
- How do you choose particle resolution and object/hand particle assignment for efficiency vs fidelity?
- Does the particle representation keep crucial contact details for high-precision tasks?
- How well does policy transfer when the robot and human have very different actuation constraints?
- Any emergent failure modes when scaling to deformable objects?
- How does the approach compare with kinematic retargeting + robot dynamics modeling?
- What are memory/computation requirements for inference on real robots?
2.5 DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
Short summary: DexUMI is a hardware+software pipeline using the human hand as an interface (via wearable exoskeleton + software retargeting/inpainting) to collect dexterous demonstrations and transfer them to different robot hands with good real-world success. (OpenReview)
Questions:
- Which kinematic constraints of the exoskeleton limit the range of demonstrable motions?
- How do you handle embodiment gaps for very different robot hands (finger count, joint limits)?
- What are privacy / safety considerations for wearables during long teleop sessions?
- How much post-processing (retargeting correction) is required before policy training?
- Is inpainting of the human hand in video robust to occlusions / lighting?
- How does performance degrade when switching to an unseen robot hand type?
2.6 mimic-one: a Scalable Model Recipe for General-Purpose Robot Dexterity
Short summary: A practical recipe combining a new 16-DoF tendon-driven hand, curated teleoperation data (with self-correction), and a large generative policy (diffusion) to achieve robust, real-world dexterous control and emergent self-correction behaviors. (OpenReview)
Questions:
- Which element of the recipe (hardware, data protocol, model) contributes most to out-of-distribution success?
- How is “self-correction” measured and how do you encourage it in training?
- What are the tradeoffs in using diffusion models vs autoregressive controllers for high-frequency control?
- How expensive is data collection and what teleop interfaces were most effective?
- Any examples where the model fails to self-correct or produces unsafe motions?
- How reproducible is the hardware design and codebase for other labs?
2.7 Latent Action Diffusion for Cross-Embodiment Manipulation
Short summary: Learns a contrastive latent action space and uses diffusion modeling in that latent space to produce cross-embodiment manipulation policies that can imitate and transfer between different hand embodiments. (OpenReview)
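A hedged sketch of what diffusion training over a shared latent action space could look like (DDPM-style noise prediction); the denoiser, latent dimensions, and observation conditioning are placeholders, not the paper's model.
```python
# Sketch of diffusion training in a shared latent action space (DDPM-style).
# The contrastive encoder producing z0 is assumed to exist and be pretrained.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class LatentDenoiser(nn.Module):
    def __init__(self, dim=64, obs_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, zt, t, obs):
        return self.net(torch.cat([zt, t, obs], dim=-1))

def diffusion_loss(denoiser, z0, obs_emb):
    """z0: (B, D) latent actions from the shared (contrastive) action encoder."""
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].unsqueeze(-1)
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps            # forward noising
    eps_hat = denoiser(zt, t.float().unsqueeze(-1) / T, obs_emb)
    return F.mse_loss(eps_hat, eps)

# Usage with random placeholders:
denoiser = LatentDenoiser()
loss = diffusion_loss(denoiser, torch.randn(8, 64), torch.randn(8, 128))
```
At inference, sampled latents would be decoded by an embodiment-specific action decoder, which is where retargeting differences between hands would surface.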
Questions:
- How is the latent action space structured and what prevents mode collapse?
- How much action retargeting is needed when moving between embodiments?
- How sample-efficient is diffusion in latent action space compared to direct action diffusion?
- Are there latency constraints for diffusion sampling in closed-loop control?
- How do you evaluate safety / constraint satisfaction when sampling actions in new embodiments?
- Did you compare with non-diffusion generative models (VAE, normalizing flows)?
2.8 Vision-Free Object 6D Pose Estimation for In-Hand Manipulation via Multi-Modal Haptic Attention
Short summary: Presents a vision-free haptic attention estimator that fuses kinesthetic, contact, and proprioceptive signals and their temporal dynamics to estimate in-hand object 6D pose — demonstrated to support reliable reorientation without vision. (OpenReview)
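For intuition, a minimal sketch of an attention-based estimator over a sliding window of haptic signals that regresses translation plus a unit quaternion; the input dimensions, encoder depth, and pose parameterization are assumptions, not the paper's architecture.
```python
# Illustrative vision-free pose estimator: a transformer encoder over a window
# of concatenated haptic signals (joint angles, torques, contact flags).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HapticPoseEstimator(nn.Module):
    def __init__(self, n_joints=16, n_contacts=4, d_model=128):
        super().__init__()
        in_dim = n_joints * 2 + n_contacts           # angles + torques + contacts
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 7)            # 3 translation + 4 quaternion

    def forward(self, haptic_seq):
        # haptic_seq: (B, T, in_dim) sliding window of haptic measurements
        feats = self.encoder(self.proj(haptic_seq))
        out = self.head(feats[:, -1])                # pose at the latest timestep
        trans, quat = out[:, :3], F.normalize(out[:, 3:], dim=-1)
        return trans, quat
```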
Questions:
- What temporal window / filtering is required for robust haptic pose estimates?
- How sensitive is estimation to slippage and changing contact modes?
- What’s the runtime and can it be used in closed-loop control at manipulation frequencies?
- How does accuracy compare to vision-based 6D pose estimators under occlusion?
- Can the haptic model generalize to new object shapes or materials?
- How do you handle ambiguous haptic signals that map to multiple pose hypotheses?
2.9 DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
Short summary: Introduces DexSkin, a soft, conformable capacitive electronic skin that provides dense, localizable tactile sensing across complex finger geometries; demonstrates its use for learning contact-rich manipulation on gripper fingers. (OpenReview)
Questions:
- What is the spatial and force resolution of the DexSkin sensors and how are they calibrated?
- How robust is DexSkin to wear, contamination, and repeated contact cycles?
- Does the skin change fingertip geometry or add compliance that affects grasp dynamics?
- How easy is integration across different robot hand designs (curved surfaces, joints)?
- Which downstream learning tasks showed the largest improvement using DexSkin?
- Is latency / sampling rate sufficient for high-bandwidth tactile control?
2.10 Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks
Short summary: Proposes a technique for zero-shot sim2real transfer of magnet-based tactile sensing on insertion tasks, with no real-world training data; likely achieved via physics-aware simulation, sensor modeling, and domain randomization to generalize to real sensors. (OpenReview)
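Since the summary above is itself speculative about the mechanism, the following is only a generic illustration of domain-randomizing a simulated magnet-based sensor's response; the linear small-deflection response and all parameter ranges are invented for the example.
```python
# Illustrative domain randomization of a simulated magnet-based tactile sensor.
# Parameter ranges and the response model are assumptions, not the paper's.
import numpy as np

def sample_sensor_params(rng):
    return {
        "gain": rng.uniform(0.8, 1.2, size=3),        # per-axis field gain
        "bias": rng.normal(0.0, 0.02, size=3),        # per-axis offset (a.u.)
        "noise_std": rng.uniform(0.005, 0.02),        # additive Gaussian noise
        "magnet_offset": rng.normal(0.0, 0.0003, 3),  # mounting error (m)
    }

def simulate_reading(displacement, params, rng):
    """Map a simulated magnet displacement (m) to a noisy 3-axis field reading."""
    d = displacement + params["magnet_offset"]
    field = 1000.0 * d                                # linear small-deflection approx.
    reading = params["gain"] * field + params["bias"]
    return reading + rng.normal(0.0, params["noise_std"], size=3)

rng = np.random.default_rng(0)
params = sample_sensor_params(rng)                    # resample per episode
print(simulate_reading(np.array([0.0, 0.0, 0.001]), params, rng))
```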
Questions:
- What aspects of the tactile sensor model were most critical for zero-shot transfer?
- How is magnetic field noise and manufacturing variation handled in simulation?
- Do you observe any failure modes on unusual object geometries or adhesives?
- How does the method generalize to non-insertion contact tasks?
- What metrics and baselines did you compare for zero-shot success?
- Would small amounts of real fine-tuning drastically improve performance?
2.11 EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-Rich Tasks
Short summary: Presents a hierarchical policy architecture that enforces SE(3) equivariance (spatial symmetries) and maps vision to force/interaction behaviors — designed to generalize spatially (e.g., peg-in-hole) from few demonstrations. (OpenReview)
Questions:
- Which parts are enforced analytically equivariant, and which are learned?
- How does equivariance affect sample efficiency and generalization empirically?
- Does equivariance hurt expressivity for asymmetric tasks?
- How do you integrate force control at the low level with vision-level equivariant policies?
- Any limits observed when task geometry or object frames change drastically?
- How sensitive is the method to calibration errors and coordinate-frame misalignment?
2.12 TacDexGrasp: Compliant and Robust Dexterous Grasping with QP and Tactile Feedback
Short summary: Uses tactile feedback with a quadratic programming (QP) controller to distribute contact forces and prevent rotational/translational slip for multi-fingered hands — a compliant tactile control approach without explicit torque models. (OpenReview)
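To ground the QP idea, here is a minimal grasp-force distribution problem in cvxpy with a linearized (pyramidal) friction cone; the contact geometry, friction estimate, and force bounds are illustrative assumptions, and torque balance is omitted for brevity, so this is not the TacDexGrasp formulation.
```python
# Minimal sketch of grasp force distribution as a QP using cvxpy.
import numpy as np
import cvxpy as cp

# Two antipodal contacts; columns of each basis are (normal, tangent1, tangent2).
bases = [np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]).T,
         np.array([[ 1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]).T]
mu, f_min = 0.5, 0.5                       # friction estimate, min normal force (N)
g = np.array([0.0, 0.0, -0.2 * 9.81])      # weight of a 200 g object (N)

x = [cp.Variable(3) for _ in bases]        # (normal, tangential) force per contact
world_forces = [B @ xi for B, xi in zip(bases, x)]
constraints = [sum(world_forces) + g == 0]            # static force balance
for xi in x:
    constraints += [xi[0] >= f_min,                   # keep contact engaged
                    cp.abs(xi[1]) <= mu * xi[0],      # pyramidal (linearized)
                    cp.abs(xi[2]) <= mu * xi[0]]      # friction cone
objective = cp.Minimize(sum(cp.sum_squares(xi) for xi in x))
cp.Problem(objective, constraints).solve()
print([xi.value for xi in x])
```
In a tactile-feedback loop, measured contact locations, normals, and slip signals would update the constraint data at each control step before re-solving.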
Questions:
- How are tactile signals mapped into QP constraints/objective — linearization choices?
- How fast is the QP solved and is it real-time at control loop rates?
- How robust is the method to unexpected external disturbances (bumps, pushes)?
- How do you estimate friction coefficients or do you avoid explicit friction estimates?
- How do you switch between manipulation vs hold/grasp modes?
- Any stability guarantees under contact switching?
2.13 FunGrasp: Functional Grasping for Diverse Dexterous Hands
Short summary: FunGrasp focuses on task-oriented / functional grasping (e.g., grasping scissors by holes) by retargeting single RGBD human functional grasp demonstrations to different robot hands and training RL policies with sim-to-real techniques and privileged information. (OpenReview)
Questions:
- How do you define & evaluate “functional correctness” vs geometric/grasp metrics?
- How robust is one-shot transfer from a single RGBD human image to unseen objects?
- What retargeting errors are typical and how are they corrected during policy training?
- Which sim2real tricks mattered most for real deployment?
- Does the method work for safety-critical tools (blades, needles)? Any constraints?
- How is the dataset of human functional grasps curated / annotated?
2.14 Suction Leap-Hand: Suction Cups on a Multi-fingered Hand Enables Embodied Dexterity and In-Hand Teleoperation
Short summary: Describes a practical hardware add-on: mounting suction cups on the fingertips and palm of a multi-fingered dexterous hand, enabling new manipulation capabilities (adhesive in-hand manipulations) and improved teleoperation for challenging in-hand tasks. (OpenReview)
Questions:
- How do suction cups change the control strategy (grasp forces, rolling/sliding actions)?
- What materials/porosities of objects break suction assumptions?
- Any tradeoffs in using suction vs frictional finger pads (speed, robustness)?
- How is suction controlled (binary vs continuous vacuum) and integrated with finger force control?
- How safe is teleoperation when using suction for delicate tasks?
- Were there tasks that suction enabled for the robot but that humans could not demonstrate (or vice versa)?
2.15 Tactile Memory with Soft Robot: Tactile Retrieval-based Contact-rich Manipulation with a Soft Wrist
Short summary: Introduces a tactile retrieval/memory system for contact-rich manipulation leveraging a soft wrist; uses stored tactile patterns to retrieve similar contact episodes to guide control in new situations. (OpenReview)
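A small sketch of what embedding-based retrieval of tactile episodes could look like with cosine similarity; the flat in-memory database and stored corrective actions are assumptions, not the paper's system.
```python
# Illustrative tactile episode retrieval via cosine similarity over embeddings.
import numpy as np

class TactileMemory:
    def __init__(self):
        self.keys, self.episodes = [], []

    def add(self, embedding, episode):
        self.keys.append(embedding / (np.linalg.norm(embedding) + 1e-8))
        self.episodes.append(episode)   # e.g. the corrective action sequence used

    def retrieve(self, query, k=1):
        q = query / (np.linalg.norm(query) + 1e-8)
        sims = np.stack(self.keys) @ q          # cosine similarity to all keys
        top = np.argsort(-sims)[:k]
        return [(self.episodes[i], float(sims[i])) for i in top]

# Usage: store flattened tactile windows (or learned embeddings) together with
# the control snippet that resolved that contact, then query at run time.
mem = TactileMemory()
mem.add(np.random.rand(64), episode={"action": "retract_and_realign"})
matches = mem.retrieve(np.random.rand(64), k=1)
```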
Questions:
- How are tactile episodes indexed and retrieved (embedding, similarity metric)?
- How does the soft wrist affect contact patterns compared to rigid wrists?
- Does retrieval generalize across different objects or only similar contacts?
- How is timeliness handled — retrieving past episodes quickly enough for closed-loop correction?
- How much memory/storage is required for the tactile database as it scales?
- What are failure modes when retrieval returns poor matches?
2.16 FLASH: Flow-Based Language-Annotated Grasp Synthesis for Dexterous Hands
Short summary: FLASH is a flow-matching model that generates language-conditioned, physically plausible dexterous grasps conditioned on hand & object point clouds and a text instruction; trained on a curated, language-annotated grasp dataset and shows generalization to novel prompts. (OpenReview)
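As a reference point for the generative model, here is a minimal conditional flow-matching training step over grasp parameters (linear interpolation path, constant target velocity); the grasp parameterization and conditioning interface are assumptions, not FLASH's code.
```python
# Sketch of a conditional flow-matching training step for grasp parameters
# (e.g. wrist pose + joint angles), conditioned on fused object/text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspVelocityField(nn.Module):
    def __init__(self, grasp_dim=28, cond_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, grasp_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: (B, grasp_dim) ground-truth grasps; cond: point-cloud + text embedding."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1)                 # time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                             # constant target velocity
    return F.mse_loss(model(x_t, t, cond), target_v)

# At inference, integrate dx/dt = model(x, t, cond) from t=0 to t=1 starting
# from Gaussian noise (e.g. a few Euler steps) to sample a grasp.
```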
Questions:
- How do you ensure generated grasps are physically admissible (no interpenetration, stable contact forces)?
- How is language embedded and aligned with geometric affordances? Any failure examples with ambiguous language?
- How large and diverse is the FLASH-drive dataset, and what annotation quality controls exist?
- How does flow-matching compare to diffusion for grasp generation here?
- Can the model propose alternative grasps ranked by task suitability?
- How does this integrate with downstream control for closing the loop (grasp execution)?