Structured Models For Vision-And-Language Reasoning