Multimodal Learning For Vision And Language