🧩 CoRL 2025 Workshop
1 Reordered list
Ranked by relevance to dexterous hands and tactile sensing
- DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
- Vision-Free Object 6D Pose Estimation for In-Hand Manipulation via Multi-Modal Haptic Attention
- Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks
- ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
- TacDexGrasp: Compliant and Robust Dexterous Grasping with QP and Tactile Feedback
- Tactile Memory with Soft Robot: Tactile Retrieval-based Contact-rich Manipulation with a Soft Wrist
- DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
- Suction Leap-Hand: Suction Cups on a Multi-fingered Hand Enables Embodied Dexterity and In-Hand Teleoperation
- mimic-one: A Scalable Model Recipe for General Purpose Robot Dexterity
- FunGrasp: Functional Grasping for Diverse Dexterous Hands
- Latent Action Diffusion for Cross-Embodiment Manipulation
- Scaling Cross-Embodiment World Models for Dexterous Manipulation
- EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-Rich Tasks
- FLASH: Flow-Based Language-Annotated Grasp Synthesis for Dexterous Hands
- Way-Tu: A Framework for Tool Selection and Manipulation Using Waypoint Representations
- HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Bimanual Dexterous Manipulation
2 All 16 Accept-Spotlight papers
2.1 ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
Short summary: Proposes a cross-modal transformer that fuses vision + tactile using cross-attention and an autoregressive tactile prediction head; training uses a curriculum from ground-truth to predicted tactile inputs to stabilize representation learning for contact-rich manipulation. (OpenReview)
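To make the fusion mechanism concrete, here is a minimal sketch of bidirectional cross-attention between visual and tactile token streams in PyTorch; the module, token counts, and dimensions are illustrative assumptions, not the ViTacFormer implementation.
```python
# Minimal sketch of vision-tactile cross-attention fusion (illustrative only;
# not the ViTacFormer code). Assumes both modalities are already tokenized.
import torch
import torch.nn as nn

class VisuoTactileCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Tactile tokens attend to visual tokens, and vice versa.
        self.tac_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_tac = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, Nv, dim), tac_tokens: (B, Nt, dim)
        v_attn, _ = self.vis_to_tac(vis_tokens, tac_tokens, tac_tokens)
        t_attn, _ = self.tac_to_vis(tac_tokens, vis_tokens, vis_tokens)
        vis_fused = self.norm_v(vis_tokens + v_attn)   # residual + norm
        tac_fused = self.norm_t(tac_tokens + t_attn)
        return vis_fused, tac_fused

# Example usage with random tokens:
fusion = VisuoTactileCrossAttention()
vis = torch.randn(2, 196, 256)   # e.g. ViT patch tokens
tac = torch.randn(2, 32, 256)    # e.g. taxel-patch tokens
vis_f, tac_f = fusion(vis, tac)
```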
Questions:
- How sensitive is performance to the tactile sensor quality/noise distribution used at training time?
- Which cross-attention design choices (layers, heads) mattered most in ablations?
- Can the model operate when tactile and vision are intermittently unavailable (e.g., occlusion / sensor dropout)? Any experiments?
- Do you freeze the visual backbone or fine-tune it jointly, and which worked better?
- How does the learned representation transfer to new tasks/objects not seen in training?
- What’s the compute/latency at inference — suitable for real-time control?
2.2 Way-Tu: A Framework for Tool Selection and Manipulation Using Waypoint Representations
Short summary: Introduces a waypoint-based representation and pipeline for selecting and manipulating tools — combining learned waypoint predictors with motion optimization to perform tool use tasks robustly. (OpenReview)
Questions:
- How are waypoints represented (Cartesian, relative frames, keyframes) and why that choice?
- How robust is tool selection when the perceived affordance is noisy or partially occluded?
- Did you compare direct end-to-end policy vs waypoint + optimizer — tradeoffs in sample efficiency & robustness?
- How do you handle tool dynamics (e.g., flexible tools) in planning?
- Can the same waypoint representation generalize across different robot embodiments?
- What failure cases are common — poor grasp, imprecise waypoint timing, optimizer convergence?
2.3 HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Bimanual Dexterous Manipulation
Short summary: HERMES provides a unified RL and sim2real pipeline that converts heterogeneous human motion sources into physically plausible mobile bimanual robot behaviors; it includes depth-image-based sim2real transfer and closed-loop localization for mobile dexterous tasks. (OpenReview)
Questions:
- How do you align heterogeneous human motion data (different capture setups) before training?
- What components most reduce the sim2real gap (depth transfer, domain randomization, etc.)? Any ablations?
- How do you integrate navigation and manipulation timing reliably in mobile setups?
- Does the policy exploit human kinematic priors or learn purely from RL?
- How sample-efficient is the approach and how much human data is needed?
- Any limits when transferring to different robot hand kinematics / DoF?
2.4 Scaling Cross-Embodiment World Models for Dexterous Manipulation
Short summary: Proposes particle-based world models that represent both human and robot embodiments as particle sets and define actions as particle displacements — enabling unified world models that scale to multiple embodiments and support cross-embodiment control. (OpenReview)
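As a rough illustration of the core idea that states are particle sets and actions are per-particle displacements, here is a simplified forward model in PyTorch; the mean-pooled context and plain MLP layers are assumptions chosen for readability, not the paper's architecture.
```python
# Illustrative particle-based world model: the state is a particle set, the
# action is a displacement of the embodiment (hand) particles. This is a
# simplified MLP stand-in, not the paper's model.
import torch
import torch.nn as nn

class ParticleWorldModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(3 + 3 + 1, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))
        self.decode = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 3))  # predicted displacement

    def forward(self, particles, action_disp, is_hand):
        # particles: (B, N, 3); action_disp: (B, N, 3), zeros for object particles
        # is_hand: (B, N, 1) flag marking embodiment vs object particles
        feats = self.encode(torch.cat([particles, action_disp, is_hand], dim=-1))
        context = feats.mean(dim=1, keepdim=True).expand_as(feats)  # global pooling
        delta = self.decode(torch.cat([feats, context], dim=-1))
        return particles + delta  # next-step particle positions
```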
Questions:
- How do you choose particle resolution and object/hand particle assignment for efficiency vs fidelity?
- Does the particle representation keep crucial contact details for high-precision tasks?
- How well does policy transfer when the robot and human have very different actuation constraints?
- Any emergent failure modes when scaling to deformable objects?
- How does the approach compare with kinematic retargeting + robot dynamics modeling?
- What are memory/computation requirements for inference on real robots?
2.5 DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
Short summary: DexUMI is a hardware+software pipeline using the human hand as an interface (via wearable exoskeleton + software retargeting/inpainting) to collect dexterous demonstrations and transfer them to different robot hands with good real-world success. (OpenReview)
Questions:
- Which kinematic constraints of the exoskeleton limit the range of demonstrable motions?
- How do you handle embodiment gaps for very different robot hands (finger count, joint limits)?
- What are privacy / safety considerations for wearables during long teleop sessions?
- How much post-processing (retargeting correction) is required before policy training?
- Is inpainting of the human hand in video robust to occlusions / lighting?
- How does performance degrade when switching to an unseen robot hand type?
2.6 mimic-one: a Scalable Model Recipe for General-Purpose Robot Dexterity
Short summary: A practical recipe combining a new 16-DoF tendon-driven hand, curated teleoperation data (with self-correction), and a large generative policy (diffusion) to achieve robust, real-world dexterous control and emergent self-correction behaviors. (OpenReview)
Questions:
- Which element of the recipe (hardware, data protocol, model) contributes most to out-of-distribution success?
- How is “self-correction” measured and how do you encourage it in training?
- What are the tradeoffs in using diffusion models vs autoregressive controllers for high-frequency control?
- How expensive is data collection and what teleop interfaces were most effective?
- Any examples where the model fails to self-correct or produces unsafe motions?
- How reproducible is the hardware design and codebase for other labs?
2.7 Latent Action Diffusion for Cross-Embodiment Manipulation
Short summary: Learns a contrastive latent action space and uses diffusion modeling in that latent space to produce cross-embodiment manipulation policies that can imitate and transfer between different hand embodiments. (OpenReview)
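A hedged sketch of what diffusion training over a shared latent action space could look like (DDPM-style noise prediction); the denoiser, latent dimensions, and observation conditioning are placeholders, not the paper's model.
```python
# Sketch of diffusion training in a shared latent action space (DDPM-style).
# The contrastive encoder producing z0 is assumed to exist and be pretrained.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class LatentDenoiser(nn.Module):
    def __init__(self, dim=64, obs_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, zt, t, obs):
        return self.net(torch.cat([zt, t, obs], dim=-1))

def diffusion_loss(denoiser, z0, obs_emb):
    """z0: (B, D) latent actions from the shared (contrastive) action encoder."""
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].unsqueeze(-1)
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps            # forward noising
    eps_hat = denoiser(zt, t.float().unsqueeze(-1) / T, obs_emb)
    return F.mse_loss(eps_hat, eps)

# Usage with random placeholders:
denoiser = LatentDenoiser()
loss = diffusion_loss(denoiser, torch.randn(8, 64), torch.randn(8, 128))
```
At inference, sampled latents would be decoded by an embodiment-specific action decoder, which is where retargeting differences between hands would surface.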
Questions:
- How is the latent action space structured and what prevents mode collapse?
- How much action retargeting is needed when moving between embodiments?
- How sample-efficient is diffusion in latent action space compared to direct action diffusion?
- Are there latency constraints for diffusion sampling in closed-loop control?
- How do you evaluate safety / constraint satisfaction when sampling actions in new embodiments?
- Did you compare with non-diffusion generative models (VAE, normalizing flows)?
2.8 Vision-Free Object 6D Pose Estimation for In-Hand Manipulation via Multi-Modal Haptic Attention
Short summary: Presents a vision-free haptic attention estimator that fuses kinesthetic, contact, and proprioceptive signals and their temporal dynamics to estimate in-hand object 6D pose — demonstrated to support reliable reorientation without vision. (OpenReview)
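For intuition, a minimal sketch of an attention-based estimator over a sliding window of haptic signals that regresses translation plus a unit quaternion; the input dimensions, encoder depth, and pose parameterization are assumptions, not the paper's architecture.
```python
# Illustrative vision-free pose estimator: a transformer encoder over a window
# of concatenated haptic signals (joint angles, torques, contact flags).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HapticPoseEstimator(nn.Module):
    def __init__(self, n_joints=16, n_contacts=4, d_model=128):
        super().__init__()
        in_dim = n_joints * 2 + n_contacts           # angles + torques + contacts
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 7)            # 3 translation + 4 quaternion

    def forward(self, haptic_seq):
        # haptic_seq: (B, T, in_dim) sliding window of haptic measurements
        feats = self.encoder(self.proj(haptic_seq))
        out = self.head(feats[:, -1])                # pose at the latest timestep
        trans, quat = out[:, :3], F.normalize(out[:, 3:], dim=-1)
        return trans, quat
```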
Questions:
- What temporal window / filtering is required for robust haptic pose estimates?
- How sensitive is estimation to slippage and changing contact modes?
- What’s the runtime and can it be used in closed-loop control at manipulation frequencies?
- How does accuracy compare to vision-based 6D pose estimators under occlusion?
- Can the haptic model generalize to new object shapes or materials?
- How do you handle ambiguous haptic signals that map to multiple pose hypotheses?
2.9 DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
Short summary: Introduces DexSkin, a soft, conformable capacitive electronic skin that provides dense, localizable tactile sensing across complex finger geometries; demonstrates its use for learning contact-rich manipulation on gripper fingers. (OpenReview)
Questions:
- What is the spatial and force resolution of the DexSkin sensors and how are they calibrated?
- How robust is DexSkin to wear, contamination, and repeated contact cycles?
- Does the skin change fingertip geometry or add compliance that affects grasp dynamics?
- How easy is integration across different robot hand designs (curved surfaces, joints)?
- Which downstream learning tasks showed the largest improvement using DexSkin?
- Is latency / sampling rate sufficient for high-bandwidth tactile control?
2.10 Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks
Short summary: Proposes a technique for zero-shot sim2real transfer of magnet-based tactile sensing on insertion tasks, with no real-world training data; likely achieved via physics-aware simulation, sensor modeling, and domain randomization to generalize to real sensors. (OpenReview)
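Since the summary above is itself speculative about the mechanism, the following is only a generic illustration of domain-randomizing a simulated magnet-based sensor's response; the linear small-deflection response and all parameter ranges are invented for the example.
```python
# Illustrative domain randomization of a simulated magnet-based tactile sensor.
# Parameter ranges and the response model are assumptions, not the paper's.
import numpy as np

def sample_sensor_params(rng):
    return {
        "gain": rng.uniform(0.8, 1.2, size=3),        # per-axis field gain
        "bias": rng.normal(0.0, 0.02, size=3),        # per-axis offset (a.u.)
        "noise_std": rng.uniform(0.005, 0.02),        # additive Gaussian noise
        "magnet_offset": rng.normal(0.0, 0.0003, 3),  # mounting error (m)
    }

def simulate_reading(displacement, params, rng):
    """Map a simulated magnet displacement (m) to a noisy 3-axis field reading."""
    d = displacement + params["magnet_offset"]
    field = 1000.0 * d                                # linear small-deflection approx.
    reading = params["gain"] * field + params["bias"]
    return reading + rng.normal(0.0, params["noise_std"], size=3)

rng = np.random.default_rng(0)
params = sample_sensor_params(rng)                    # resample per episode
print(simulate_reading(np.array([0.0, 0.0, 0.001]), params, rng))
```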
Questions:
- What aspects of the tactile sensor model were most critical for zero-shot transfer?
- How is magnetic field noise and manufacturing variation handled in simulation?
- Do you observe any failure modes on unusual object geometries or adhesives?
- How does the method generalize to non-insertion contact tasks?
- What metrics and baselines did you compare for zero-shot success?
- Would small amounts of real fine-tuning drastically improve performance?
2.11 EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-Rich Tasks
Short summary: Presents a hierarchical policy architecture that enforces SE(3) equivariance (spatial symmetries) and maps vision to force/interaction behaviors — designed to generalize spatially (e.g., peg-in-hole) from few demonstrations. (OpenReview)
Questions:
- Which parts are enforced analytically equivariant, and which are learned?
- How does equivariance affect sample efficiency and generalization empirically?
- Does equivariance hurt expressivity for asymmetric tasks?
- How do you integrate force control at the low level with vision-level equivariant policies?
- Any limits observed when task geometry or object frames change drastically?
- How sensitive is the method to calibration errors and coordinate-frame misalignment?
2.12 TacDexGrasp: Compliant and Robust Dexterous Grasping with QP and Tactile Feedback
Short summary: Uses tactile feedback with a quadratic programming (QP) controller to distribute contact forces and prevent rotational/translational slip for multi-fingered hands — a compliant tactile control approach without explicit torque models. (OpenReview)
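To ground the QP idea, here is a minimal grasp-force distribution problem in cvxpy with a linearized (pyramidal) friction cone; the contact geometry, friction estimate, and force bounds are illustrative assumptions, and torque balance is omitted for brevity, so this is not the TacDexGrasp formulation.
```python
# Minimal sketch of grasp force distribution as a QP using cvxpy.
import numpy as np
import cvxpy as cp

# Two antipodal contacts; columns of each basis are (normal, tangent1, tangent2).
bases = [np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]).T,
         np.array([[ 1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]).T]
mu, f_min = 0.5, 0.5                       # friction estimate, min normal force (N)
g = np.array([0.0, 0.0, -0.2 * 9.81])      # weight of a 200 g object (N)

x = [cp.Variable(3) for _ in bases]        # (normal, tangential) force per contact
world_forces = [B @ xi for B, xi in zip(bases, x)]
constraints = [sum(world_forces) + g == 0]            # static force balance
for xi in x:
    constraints += [xi[0] >= f_min,                   # keep contact engaged
                    cp.abs(xi[1]) <= mu * xi[0],      # pyramidal (linearized)
                    cp.abs(xi[2]) <= mu * xi[0]]      # friction cone
objective = cp.Minimize(sum(cp.sum_squares(xi) for xi in x))
cp.Problem(objective, constraints).solve()
print([xi.value for xi in x])
```
In a tactile-feedback loop, measured contact locations, normals, and slip signals would update the constraint data at each control step before re-solving.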
Questions:
- How are tactile signals mapped into QP constraints/objective — linearization choices?
- How fast is the QP solved and is it real-time at control loop rates?
- How robust is the method to unexpected external disturbances (bumps, pushes)?
- How do you estimate friction coefficients or do you avoid explicit friction estimates?
- How do you switch between manipulation vs hold/grasp modes?
- Any stability guarantees under contact switching?
2.13 FunGrasp: Functional Grasping for Diverse Dexterous Hands
Short summary: FunGrasp focuses on task-oriented / functional grasping (e.g., grasping scissors by holes) by retargeting single RGBD human functional grasp demonstrations to different robot hands and training RL policies with sim-to-real techniques and privileged information. (OpenReview)
Questions:
- How do you define & evaluate “functional correctness” vs geometric/grasp metrics?
- How robust is one-shot transfer from a single RGBD human image to unseen objects?
- What retargeting errors are typical and how are they corrected during policy training?
- Which sim2real tricks mattered most for real deployment?
- Does the method work for safety-critical tools (blades, needles)? Any constraints?
- How is the dataset of human functional grasps curated / annotated?
2.14 Suction Leap-Hand: Suction Cups on a Multi-fingered Hand Enables Embodied Dexterity and In-Hand Teleoperation
Short summary: Describes a practical hardware add-on: mounting suction cups on the fingertips and palm of a multi-fingered dexterous hand, enabling new manipulation capabilities (adhesive in-hand manipulations) and improved teleoperation for challenging in-hand tasks. (OpenReview)
Questions:
- How do suction cups change the control strategy (grasp forces, rolling/sliding actions)?
- What materials/porosities of objects break suction assumptions?
- Any tradeoffs in using suction vs frictional finger pads (speed, robustness)?
- How is suction controlled (binary vs continuous vacuum) and integrated with finger force control?
- How safe is teleoperation when using suction for delicate tasks?
- Were there tasks that suction enabled for the robot but that humans could not demonstrate (or vice versa)?
2.15 Tactile Memory with Soft Robot: Tactile Retrieval-based Contact-rich Manipulation with a Soft Wrist
Short summary: Introduces a tactile retrieval/memory system for contact-rich manipulation leveraging a soft wrist; uses stored tactile patterns to retrieve similar contact episodes to guide control in new situations. (OpenReview)
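A small sketch of what embedding-based retrieval of tactile episodes could look like with cosine similarity; the flat in-memory database and stored corrective actions are assumptions, not the paper's system.
```python
# Illustrative tactile episode retrieval via cosine similarity over embeddings.
import numpy as np

class TactileMemory:
    def __init__(self):
        self.keys, self.episodes = [], []

    def add(self, embedding, episode):
        self.keys.append(embedding / (np.linalg.norm(embedding) + 1e-8))
        self.episodes.append(episode)   # e.g. the corrective action sequence used

    def retrieve(self, query, k=1):
        q = query / (np.linalg.norm(query) + 1e-8)
        sims = np.stack(self.keys) @ q          # cosine similarity to all keys
        top = np.argsort(-sims)[:k]
        return [(self.episodes[i], float(sims[i])) for i in top]

# Usage: store flattened tactile windows (or learned embeddings) together with
# the control snippet that resolved that contact, then query at run time.
mem = TactileMemory()
mem.add(np.random.rand(64), episode={"action": "retract_and_realign"})
matches = mem.retrieve(np.random.rand(64), k=1)
```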
Questions:
- How are tactile episodes indexed and retrieved (embedding, similarity metric)?
- How does the soft wrist affect contact patterns compared to rigid wrists?
- Does retrieval generalize across different objects or only similar contacts?
- How is timeliness handled — retrieving past episodes quickly enough for closed-loop correction?
- How much memory/storage is required for the tactile database as it scales?
- What are failure modes when retrieval returns poor matches?
2.16 FLASH: Flow-Based Language-Annotated Grasp Synthesis for Dexterous Hands
Short summary: FLASH is a flow-matching model that generates language-conditioned, physically plausible dexterous grasps conditioned on hand & object point clouds and a text instruction; trained on a curated, language-annotated grasp dataset and shows generalization to novel prompts. (OpenReview)
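As a reference point for the generative model, here is a minimal conditional flow-matching training step over grasp parameters (linear interpolation path, constant target velocity); the grasp parameterization and conditioning interface are assumptions, not FLASH's code.
```python
# Sketch of a conditional flow-matching training step for grasp parameters
# (e.g. wrist pose + joint angles), conditioned on fused object/text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspVelocityField(nn.Module):
    def __init__(self, grasp_dim=28, cond_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, grasp_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: (B, grasp_dim) ground-truth grasps; cond: point-cloud + text embedding."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1)                 # time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                             # constant target velocity
    return F.mse_loss(model(x_t, t, cond), target_v)

# At inference, integrate dx/dt = model(x, t, cond) from t=0 to t=1 starting
# from Gaussian noise (e.g. a few Euler steps) to sample a grasp.
```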
Questions:
- How do you ensure generated grasps are physically admissible (no interpenetration, stable contact forces)?
- How is language embedded and aligned with geometric affordances? Any failure examples with ambiguous language?
- How large and diverse is the FLASH-drive dataset, and what annotation quality controls exist?
- How does flow-matching compare to diffusion for grasp generation here?
- Can the model propose alternative grasps ranked by task suitability?
- How does this integrate with downstream control for closing the loop (grasp execution)?