Closing The Loop Between Language And Vision For Embodied Agents