Model-based RL + intrinsic motivation for the TriFinger robot
How difficult is it to train a robot to learn manipulation skills in the real world?
Abstract
In many real-world robotics applications, autonomous agents often encounter environments where task specifications are limited or missing. In such cases, a promising idea is to endow the agent with a sense of curiosity, enabling it to explore and acquire essential skills for future tasks without supervision. This intrinsic motivation, akin to children's natural exploratory behavior, is vital for effective exploration in reinforcement learning (RL) systems. Yet transferring these techniques from simulation to real-world robotic platforms remains challenging, especially since prior research predominantly relies on proprioceptive state information that is only accessible in simulation environments.
Our goal is to transfer these methods to the real world by adapting them to work with image observations. To this end, we established a perception pipeline that produces reliable object-centric representations, which are crucial for applying RL algorithms: established object detection models predict the state information the RL system needs, such as object position and orientation. This work marks progress in adapting intrinsically motivated RL techniques for real robotic systems.
Project Overview
Traditional reinforcement learning requires explicit task specifications, but many real-world robotic applications lack clear objectives. Intrinsically motivated exploration offers a solution by enabling robots to autonomously develop skills without supervision, similar to how children naturally explore their environment through curiosity. The prior work Curious Exploration using Epistemic Uncertainty via Structured World Models (CEE-US) leverages epistemic uncertainty in an ensemble of Graph Neural Networks to guide exploration in multi-object environments: it learns a structured dynamics model of the environment while steering exploration toward states with high intrinsic reward.
While CEE-US showed promising results in simulation with impressive combinatorial generalization capabilities, deploying such techniques on physical robots presents significant challenges in perception. The project focused on creating a robust perception pipeline that could provide reliable state estimation for the curiosity-driven learning algorithms in the real world.
Technical Implementation
The perception system developed for the TriFinger robot platform consisted of:
- YOLOv5 Object Detection: A fine-tuned object detector provided bounding boxes around cube objects with high accuracy across the three camera views of the TriFinger platform.
- 6D Pose Estimation: The CubeCornerPredictor neural network processed inputs from the object detector to estimate the precise corner keypoints and pose of the cube, enabling accurate tracking of position and orientation in 3D space.
- Real-time Performance: The pipeline was optimized to achieve real-time processing with multi-object tracking capabilities, meeting the latency requirements for effective robot control.
- Gym Environment Interface: The RealTriFingerEnv component provided a standardized interface between the perception system, the RL algorithm, and the TriFinger platform, converting camera observations into structured state representations and managing bidirectional communication with the robot hardware.
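As an illustration of the pose-estimation step, the sketch below recovers a cube's rotation and translation by aligning a canonical set of cube corners with predicted corner positions using the Kabsch (orthogonal Procrustes) algorithm. It assumes the corner keypoints are available in 3D (e.g., after triangulating across the three camera views); the actual CubeCornerPredictor interface may differ, and the cube size and all names here are illustrative.

```python
import numpy as np

def cube_pose_from_corners(predicted, canonical):
    """Estimate rotation R and translation t with predicted_i ≈ R @ canonical_i + t,
    given matching 3D corner sets of shape (8, 3), via the Kabsch algorithm."""
    predicted = np.asarray(predicted, dtype=float)
    canonical = np.asarray(canonical, dtype=float)
    p_mean, c_mean = predicted.mean(axis=0), canonical.mean(axis=0)
    # Cross-covariance of the centered point sets:
    H = (canonical - c_mean).T @ (predicted - p_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation (det = +1):
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_mean - R @ c_mean
    return R, t

# Canonical corners of a cube with edge length 0.065 m (hypothetical size),
# expressed in the cube's own frame:
half = 0.065 / 2
CANONICAL = np.array([[sx * half, sy * half, sz * half]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
```

Applying a known rotation and translation to `CANONICAL` and feeding the result back through `cube_pose_from_corners` recovers that same transform, which makes the function easy to sanity-check.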
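The Gym-style interface can be sketched as a thin wrapper that queries the perception pipeline each step and assembles an object-centric observation for the RL algorithm. The class and method names below (other than the standard reset/step convention) are illustrative stand-ins, not the project's actual RealTriFingerEnv implementation.

```python
import numpy as np

class TriFingerPerceptionEnv:
    """Minimal Gym-style wrapper sketch: perception in, actions out.

    `robot` and `perception` are assumed interfaces: the robot exposes
    joint_positions(), apply_action(a), and reset(); the perception
    pipeline exposes cube_pose() returning (position, orientation).
    """

    def __init__(self, robot, perception, horizon=1000):
        self.robot = robot
        self.perception = perception
        self.horizon = horizon
        self.t = 0

    def _observe(self):
        pos, orn = self.perception.cube_pose()  # object-centric state
        joints = self.robot.joint_positions()   # proprioceptive state
        return np.concatenate([joints, pos, orn])

    def reset(self):
        self.robot.reset()
        self.t = 0
        return self._observe()

    def step(self, action):
        self.robot.apply_action(action)
        self.t += 1
        obs = self._observe()
        reward = 0.0  # intrinsic reward is computed by the learner, not the env
        done = self.t >= self.horizon
        return obs, reward, done, {}
```

Keeping reward computation out of the environment reflects the intrinsic-motivation setting: the curiosity signal comes from the learner's own world model, not from a task-specific reward function.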
Keypoint Prediction Sample Results
One Cube:

Two Cubes:

Learned Dynamics Model Accuracy in Simulation

Final Presentation
For more information, please check out this presentation: