Brenna D. Argall, Sonia Chernova, Manuela Veloso, Brett Browning (2008), A survey of robot learning from demonstration, Robotics and Autonomous Systems

From Control Systems Technology Group

A summary of “A Survey of Robot Learning From Demonstration”

The paper presents a formal framework for LfD (Learning from Demonstration). The LfD problem is modeled as follows: the world consists of states S and actions A, and transitions between states under an action are governed by a probabilistic transition function T(s’|s, a) : S x A x S -> [0, 1]. The learner has access to an observed state Z through a mapping M: S -> Z, and a policy pi: Z -> A selects actions based on observations of the world state. In some cases M = I (the identity), meaning the learner observes the world state directly. A demonstration d_j in D is represented by k_j pairs of observations and actions: d_j = {(z^i_j, a^i_j)}, where z^i_j in Z and a^i_j in A, i = 0 … k_j.
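The notation above can be made concrete with a small Python sketch. It stores a demonstration as a list of (observation, action) pairs and derives a policy pi: Z -> A by nearest-neighbour lookup. The nearest-neighbour rule, the type names, and the toy data are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import Callable, List, Tuple

Observation = Tuple[float, ...]   # an element z of Z
Action = str                      # an element a of A
Demonstration = List[Tuple[Observation, Action]]  # d_j = {(z^i_j, a^i_j)}


def nearest_neighbor_policy(demos: List[Demonstration]) -> Callable[[Observation], Action]:
    """Derive pi: Z -> A by returning the action of the closest demonstrated observation."""
    pairs = [pair for d in demos for pair in d]  # pool all demonstrations in D

    def pi(z: Observation) -> Action:
        def sq_dist(pair):
            # squared Euclidean distance between a stored observation and z
            return sum((u - v) ** 2 for u, v in zip(pair[0], z))
        return min(pairs, key=sq_dist)[1]

    return pi


# Hypothetical 1-D demonstration: the observation is the gripper's distance to the box.
d_0: Demonstration = [((1.0,), "approach"), ((0.5,), "approach"), ((0.0,), "grasp")]
pi = nearest_neighbor_policy([d_0])
```

Calling pi((0.1,)) looks up the closest recorded observation, here (0.0,), and returns its action.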

The survey covers both batch learning and interactive approaches. Its running example is moving a box from one spot to another: pick up the box, relocate the box, and place the box.

The dataset consists of state-action pairs, whose content depends on the robot’s sensors and on the task being taught. Correspondence between teacher and robot is defined with respect to two mappings, each of which is either the identity I(z, a) or a non-trivial mapping g(z, a): the record mapping (I or g_r(z, a): are the exact states/actions experienced by the teacher recorded in the dataset?) and the embodiment mapping (I or g_a(z, a): are the states/actions in the dataset exactly what the robot would observe/execute?). In the method used, there is no record mapping, but there is an embodiment mapping.

Teleoperation: the robot is operated by the teacher while its sensors record the states of the operation.

Shadowing: the robot records the execution using its own sensors while trying to match or mimic the teacher. This can be through sensors on the teacher or through external observation.
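The two mappings can be sketched as plain functions applied in sequence to each pair of the teacher's execution. This is a toy illustration only: the scalar pairs, the function names, and the halving embodiment mapping are hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # (observation z, action a), both scalars in this toy example


def identity(pair: Pair) -> Pair:
    """I(z, a): the pair passes through unchanged."""
    return pair


def scale_reach(pair: Pair) -> Pair:
    """Hypothetical embodiment mapping: the robot arm is half as long as the teacher's."""
    z, a = pair
    return (z * 0.5, a * 0.5)


def build_dataset(teacher_execution: List[Pair],
                  record_map: Callable[[Pair], Pair],
                  embodiment_map: Callable[[Pair], Pair]) -> List[Pair]:
    """Apply the record mapping, then the embodiment mapping, to each teacher pair."""
    return [embodiment_map(record_map(p)) for p in teacher_execution]


teacher = [(1.0, 0.2), (0.8, 0.4)]
both_identity = build_dataset(teacher, identity, identity)
with_embodiment_map = build_dataset(teacher, identity, scale_reach)
```

When both mappings are the identity, the robot learns from exactly what the teacher experienced; any other mapping changes the data before the learner sees it.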

The paper goes on to describe how various machine learning algorithms can be applied to the data. It notes specifically that neural networks are a poor fit for updating policies, because they can lose mappings learned at the very beginning. Finally, the system model approach is a form of reinforcement learning, where a policy is derived from the transition model T and the reward function R(s). The expected cumulative future reward is represented by associating each state s with a value V(s), given by the following equation:


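V^pi(s) = ∫_a pi(s, a) ∫_{s’} T(s’|s, a) [R(s) + gamma V^pi(s’)]

where V^pi(s) is the value of state s under policy pi, pi(s, a) is the probability that pi selects action a in state s, and gamma is a discount factor on future rewards. The reward function can either be defined manually (true reinforcement learning) or learned from the demonstrations.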
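For a small discrete model, the value of each state under a fixed policy can be computed by iterating a Bellman backup. The sketch below assumes a deterministic policy and discrete states, so the expectation reduces to V(s) = R(s) + gamma * sum_{s’} T(s’|s, pi(s)) V(s’). The three-state model, its probabilities, and all names are hypothetical.

```python
from typing import Dict, Tuple

State = str
Action = str

# Hypothetical model: T[(s, a)] maps next states s' to probabilities T(s'|s, a).
T: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("far", "approach"): {"near": 0.9, "far": 0.1},
    ("near", "grasp"): {"holding": 0.8, "near": 0.2},
    ("holding", "stay"): {"holding": 1.0},
}
R: Dict[State, float] = {"far": 0.0, "near": 0.0, "holding": 1.0}  # reward R(s)
pi: Dict[State, Action] = {"far": "approach", "near": "grasp", "holding": "stay"}


def evaluate_policy(T, R, pi, gamma: float = 0.9, tol: float = 1e-10) -> Dict[State, float]:
    """Iterate V(s) <- R(s) + gamma * sum_s' T(s'|s, pi(s)) V(s') until it stops changing."""
    V = {s: 0.0 for s in R}
    while True:
        new_V = {
            s: R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
            for s in R
        }
        if max(abs(new_V[s] - V[s]) for s in R) < tol:
            return new_V
        V = new_V


V = evaluate_policy(T, R, pi)
```

States closer to the rewarded "holding" state end up with higher values, which is what lets a policy be derived from V together with the transition model.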

An alternative to mapping states to actions is to represent the desired behaviour as a plan. A plan often does not spell out every pre- and post-condition in the same way; instead, it adds annotations or intentions to the actions as additional, more abstract information. The robot is then left to contribute more itself by solving small sub-problems (for example: how do I move my arm down? Answer: by using joints x and y in a particular way).
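A plan for the box task could be sketched as a sequence of steps, each with pre-conditions and effects over symbolic facts. This is a minimal illustration, not the representation used in the paper: the Step class, the fact names, and the execute helper are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Step:
    name: str
    preconditions: FrozenSet[str]          # facts that must hold before the step
    adds: FrozenSet[str]                   # facts made true by the step
    removes: FrozenSet[str] = frozenset()  # facts made false by the step


def execute(plan: List[Step], state: FrozenSet[str]) -> FrozenSet[str]:
    """Apply each step in order, failing if a pre-condition is not met."""
    for step in plan:
        missing = step.preconditions - state
        if missing:
            raise ValueError(f"{step.name}: unmet pre-conditions {sorted(missing)}")
        state = (state - step.removes) | step.adds
    return state


plan = [
    Step("pick_up", frozenset({"box_at_start", "gripper_empty"}),
         adds=frozenset({"holding_box"}),
         removes=frozenset({"box_at_start", "gripper_empty"})),
    Step("relocate", frozenset({"holding_box"}), adds=frozenset({"at_goal"})),
    Step("place", frozenset({"holding_box", "at_goal"}),
         adds=frozenset({"box_at_goal", "gripper_empty"}),
         removes=frozenset({"holding_box"})),
]
final_state = execute(plan, frozenset({"box_at_start", "gripper_empty"}))
```

How each step is physically carried out (which joints move, and how) is deliberately left out of the plan; that is the part the robot solves on its own.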