Understanding Manipulation Contexts By Vision And Language For Robotic Vision