One of the core problems in reinforcement learning (RL) is estimating the long-term reward of a given policy. In many real-world applications such as healthcare, robotics and dialogue systems, running a new policy on users or robots can be costly or risky. This gives rise to the need for off-policy, or counterfactual, estimation: estimating the long-term reward of a given policy using data previously collected by another policy (e.g., the one currently deployed). This talk will describe some recent advances on this problem, for which many standard estimators suffer from variance that grows exponentially with the horizon (known as "the curse of horizon"). Our approach is based on a dual linear program formulation of the long-term reward, and can be extended to estimate confidence intervals.
Lihong Li is a Senior Principal Scientist at Amazon. He obtained a PhD degree in Computer Science from Rutgers University. After that, he held research positions at Yahoo!, Microsoft and Google, before joining Amazon. His main research interests are in reinforcement learning, including contextual bandits, and other related problems in AI. His work is often inspired by applications in recommendation, advertising, Web search and conversational systems.