We study online learning in adversarial bandit problems under a partial observability model called off-policy feedback.