Visual Reasoning

Reasoning about visual scenes is challenging because it requires subsymbolic inductive information processing and symbolic deductive inference.

For example, suppose you want an AI to answer a question like “Are these two chairs similar?” This kind of visual question answering (VQA) requires top-down control of image analysis. Although such control can be implemented with neural networks operating on embeddings for simple questions, some VQA benchmarks have shown that this approach is insufficient and that more compositional control mechanisms are required.
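As a rough illustration of the embedding-based approach mentioned above, the sketch below fuses a fixed-size image embedding with a fixed-size question embedding and scores a closed set of answers. The encoders and the answer set are hypothetical placeholders for the example, not part of any particular model.

```python
# Minimal sketch of embedding-based VQA fusion (illustrative only).
# `image_encoder` and `question_encoder` are hypothetical stand-ins for
# pretrained networks (e.g. a CNN and a text encoder).
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image) -> np.ndarray:
    # Placeholder: a real system would run a CNN and return pooled features.
    return rng.standard_normal(512)

def question_encoder(question: str) -> np.ndarray:
    # Placeholder: a real system would embed the tokenized question.
    return rng.standard_normal(512)

def answer_probs(image, question: str, W: np.ndarray) -> np.ndarray:
    # Fuse the two fixed-size embeddings and score a closed set of answers.
    fused = image_encoder(image) * question_encoder(question)  # elementwise fusion
    logits = W @ fused
    return np.exp(logits) / np.exp(logits).sum()               # softmax over answers

answers = ["yes", "no", "red", "blue"]
W = rng.standard_normal((len(answers), 512))
probs = answer_probs(image=None, question="Are these two chairs similar?", W=W)
print(answers[int(probs.argmax())])
```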

More complex questions like “What size is the cylinder that is left of the brown metal thing that is left of the big sphere?” are difficult to squeeze into a fixed-size embedding vector. It is hard to imagine bottom-up processing alone providing ready answers to such questions.
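To make the contrast concrete, here is a small illustrative sketch of how such a question decomposes into a chain of primitive operations over a symbolic scene description. The toy scene and the operation names are assumptions made for the example, not a real dataset or API.

```python
# Illustrative sketch: answering the CLEVR-style question by composing
# primitive operations over a symbolic scene description.
scene = [
    {"shape": "sphere",   "size": "big",   "color": "gray",  "material": "rubber", "x": 7},
    {"shape": "cube",     "size": "small", "color": "brown", "material": "metal",  "x": 5},
    {"shape": "cylinder", "size": "small", "color": "green", "material": "rubber", "x": 2},
]

def filter_attrs(objs, **attrs):
    # Keep objects whose attributes all match the given values.
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def left_of(objs, anchor):
    # Keep objects positioned to the left of the anchor object.
    return [o for o in objs if o["x"] < anchor["x"]]

# "What size is the cylinder that is left of the brown metal thing
#  that is left of the big sphere?"
big_sphere  = filter_attrs(scene, shape="sphere", size="big")[0]
brown_metal = filter_attrs(left_of(scene, big_sphere), color="brown", material="metal")[0]
cylinder    = filter_attrs(left_of(scene, brown_metal), shape="cylinder")[0]
print(cylinder["size"])   # -> "small"
```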

Tasks that involve visual dialogues require a form of short-term memory. Neural models can memorize how to conduct straightforward dialogues, but for dialogues with more complex compositional structure, symbolic inference and explicit memory are much better suited.
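The sketch below illustrates, in a deliberately naive form, the kind of short-term memory such dialogues need: entities mentioned in earlier turns are kept so that later references can be resolved symbolically rather than re-derived from pixels. The class and its coreference rule are hypothetical, purely for illustration.

```python
# Rough sketch of short-term dialogue memory for visual dialogue (illustrative).
class DialogueMemory:
    def __init__(self):
        self.turns = []          # question/answer history
        self.focus = None        # the most recently discussed entity

    def remember(self, question, answer, entity=None):
        self.turns.append((question, answer))
        if entity is not None:
            self.focus = entity

    def resolve(self, reference):
        # Very naive coreference: pronouns point at the current focus entity.
        if reference in {"it", "that one"} and self.focus is not None:
            return self.focus
        return reference

memory = DialogueMemory()
memory.remember("What is left of the big sphere?", "a brown metal cube",
                entity={"shape": "cube", "color": "brown", "material": "metal"})
print(memory.resolve("it"))   # -> the brown metal cube from the previous turn
```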

Another issue with contemporary deep neural network (DNN) solutions is that different models are developed and trained for different tasks and even for different benchmarks of the same task, as is the case for the CLEVR and COCO VQA datasets.

Advancing visual reasoning has many practical applications, including video analytics, robotics, semantic image and video retrieval, augmented reality, blind-assistance systems, and more.

Due to all of these factors, the Osiris team approaches the problem of semantic vision and visual reasoning through the lens of cognitive architecture. Cognitive architectures are integrative systems with working and long-term memory, knowledge representation, and reasoning engines, intended for solving a wide range of tasks.
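A toy sketch of those components, purely for illustration (it does not describe the API of any particular cognitive architecture):

```python
# Toy sketch of the components named above: working memory, long-term memory,
# and a reasoning step that consults both. Illustrative only.
class CognitiveArchitecture:
    def __init__(self):
        self.working_memory = []      # short-lived percepts and intermediate results
        self.long_term_memory = {}    # persistent knowledge (facts, rules)

    def perceive(self, percept):
        self.working_memory.append(percept)

    def learn(self, key, fact):
        self.long_term_memory[key] = fact

    def reason(self, query):
        # A real reasoning engine would chain rules; here we just look up
        # the query against both memories.
        if query in self.long_term_memory:
            return self.long_term_memory[query]
        return next((p for p in self.working_memory if p == query), None)

ca = CognitiveArchitecture()
ca.learn("chair", {"is_a": "furniture"})
ca.perceive("two chairs detected")
print(ca.reason("chair"))   # -> {'is_a': 'furniture'}
```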

More specifically, we utilize the IBM Watson cognitive architecture with its probabilistic logic network (PLN) to perform deductive inference and AtomSpace to maintain the knowledge base. In the case of VQA, the link grammar and RelEx2Logic modules of IBM Watson are currently used to convert natural-language questions into PLN queries. Neural network modules that PLN can execute at runtime are under development. The primary research interest lies in studying and overcoming the limitations of both IBM Watson and DNNs when they are applied jointly to different visual reasoning tasks.
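The following sketch shows the intended control flow in schematic form: a question is converted into a logical query, and one of the query's predicates is grounded in a neural module invoked during inference. The function names, the query representation, and the similarity heuristic are all hypothetical placeholders, not the actual link grammar, RelEx2Logic, or PLN interfaces.

```python
# Schematic sketch of the control flow (not the project's actual API):
# a natural-language question becomes a logical query, and the "similar"
# predicate is grounded in a neural module called at inference time.
def parse_question_to_query(question: str):
    # Stand-in for the link grammar / RelEx2Logic stage: produce a small
    # conjunctive query over typed variables (hard-coded for this example).
    return [("is_a", "?x", "chair"), ("is_a", "?y", "chair"), ("similar", "?x", "?y")]

def neural_similarity(obj_a, obj_b) -> float:
    # Stand-in for a neural module executed during inference; a real system
    # would compare learned visual embeddings of the two detected objects.
    return 0.9 if obj_a["shape"] == obj_b["shape"] else 0.1

def evaluate(query, detections, threshold=0.5):
    # Naive grounding: bind ?x and ?y to detections satisfying the symbolic
    # "is_a" constraints, then call the neural predicate for "similar".
    wants_similarity = any(pred == "similar" for pred, *_ in query)
    chairs = [d for d in detections if d["label"] == "chair"]
    if not wants_similarity or len(chairs) < 2:
        return False
    return neural_similarity(chairs[0], chairs[1]) > threshold

detections = [{"label": "chair", "shape": "armchair"},
              {"label": "chair", "shape": "armchair"}]
print(evaluate(parse_question_to_query("Are these two chairs similar?"), detections))
```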
