Six new projects funded by LG AI Research
Six projects led by CSE faculty have been funded for investigation by LG AI Research. The North American LG AI Research Center, LG’s first artificial intelligence (AI) research base outside of South Korea, opened in March of this year and has facilitated this new research support. The center is led by Honglak Lee, Chief Scientist of AI at LG and a faculty member in CSE.
Following is a summary of each of the new projects.
Structured Representation Learning for Reinforcement Learning in Entity-centric Environments; Led by Prof. Honglak Lee; $180,000
While reinforcement learning agents have been able to master complex tasks such as video games, Go, and even controlling a nuclear reactor, these methods often rely on learning monolithic functions that produce a single high-dimensional vector for an agent to use for reasoning. We argue that treating the world as a single high-dimensional vector leads to low sample efficiency and poor generalization performance. Instead, the agent benefits from incorporating the underlying structure of the world (e.g., entities or their relationships) into a structured representation of the environment. If the agent can discover such structure in the environment, this enables entity-centric reasoning, learning, and skill discovery.
We propose to learn structured state representations that empower an agent to plan with a task-dependent subset of its state variables, i.e., to perform task-dependent partial planning. We will build on two recent works by a PhD student in my research team: one showing that learning multiple state “modules” enables capturing entity-centric information, and another showing that learning an object-centric graph-neural-network transition model with contrastive learning yields a latent space that discovers object-centric information such as categories (e.g., is this a toaster?) and properties (e.g., is it on?). Two limitations of this prior work (Carvalho et al.) are that it was very sample-inefficient (requiring billions of samples of experience) and that it relied on having an objectness detector. We will combine these insights by developing a novel architecture for learning a structured transition model that can discover entity-centric latent state factors.
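As a purely illustrative aside, the sketch below shows one way an object-factored ("slot"-based) transition model could be trained with a contrastive objective, in the spirit of the object-centric graph-neural-network work cited above. The slot factorization, network sizes, and hinge-style loss are assumptions made for the example, not the project's actual architecture.

```python
# Minimal sketch (not the project's implementation) of a contrastive,
# object-factored transition model. Shapes, the slot count K, and the
# hinge-style contrastive loss are illustrative assumptions.
import torch
import torch.nn as nn

class SlotTransitionModel(nn.Module):
    """Predicts per-slot latent updates from the current slots and an action."""
    def __init__(self, num_slots: int, slot_dim: int, action_dim: int):
        super().__init__()
        self.num_slots = num_slots
        # Each slot sees a pooled summary of all slots -- a cheap stand-in for a
        # full graph-neural-network message-passing step.
        self.delta = nn.Sequential(
            nn.Linear(slot_dim * 2 + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, slots: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, D), action: (B, A)
        context = slots.mean(dim=1, keepdim=True).expand_as(slots)       # (B, K, D)
        act = action.unsqueeze(1).expand(-1, self.num_slots, -1)         # (B, K, A)
        return slots + self.delta(torch.cat([slots, context, act], -1))  # (B, K, D)

def contrastive_transition_loss(model, slots_t, action, slots_t1, margin=1.0):
    """Predicted next slots should be close to the true next slots and at least
    `margin` away from negatives drawn by shuffling the batch."""
    pred = model(slots_t, action)
    pos = ((pred - slots_t1) ** 2).sum(dim=(1, 2))
    neg = ((pred - slots_t1[torch.randperm(slots_t1.size(0))]) ** 2).sum(dim=(1, 2))
    return (pos + torch.relu(margin - neg)).mean()

# Toy usage with random tensors standing in for encoded observations.
model = SlotTransitionModel(num_slots=4, slot_dim=16, action_dim=6)
z_t, z_t1, a_t = torch.randn(8, 4, 16), torch.randn(8, 4, 16), torch.randn(8, 6)
contrastive_transition_loss(model, z_t, a_t, z_t1).backward()
```

In a full agent, the slot tensors would come from a learned encoder over raw observations rather than random data.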
Compositional Task Generalization toward solving real-world problems; Led by Prof. Honglak Lee; $180,000
To apply compositional task graph inference methods to real-world domains, we argue that these methods must have strong transfer learning capabilities for better generalization performance, where prior knowledge of compositional task structure can be quickly transferred to new tasks; e.g., task structure learned from baking recipes via large language models can be transferred to baking a cake in simulation. We propose two broad thrusts to tackle transfer learning of compositional task information: (i) How do we effectively leverage prior compositional task knowledge for new tasks? That is, how can we retrieve task knowledge from demonstration data, language instructions, or pretrained language models to infer a task graph prior? (ii) How do we effectively adapt our prior task knowledge to the current task given new observations? Task structure may differ across tasks and may need to be updated based on new observations; for example, an agent may need to adapt its task knowledge for baking a cookie if different cooking ingredients or appliances are available.
To address these problems of transferring compositional task structure, we will tackle the following challenges: learning effective task priors from language and pretrained language models, and adapting those language task priors to a new task with visually grounded prompts.
Although LMs have strong generalization capabilities, they must still be finetuned to produce output in the desired format. Finetuning and other recent advances in NLP, such as prompt tuning and adapters, can be used to adapt the LM to produce such output. To obtain finetuning data, we propose to create a dataset from (i) known simulation cases and (ii) annotations on the Wikihow dataset. For (i), several known subtask graphs have been created and labeled in prior work; we can obtain ground-truth SLL text by transforming known subtask graphs. For (ii), we propose to manually annotate the Wikihow dataset with ground-truth subtasks and subtask graph edges.
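As a concrete but purely illustrative example of the kind of finetuning target that could be derived from a known subtask graph, the sketch below flattens a toy graph into text; the textual format and the example recipe graph are invented for illustration and are not the project's annotation scheme.

```python
# Illustrative only: one possible way to flatten a known subtask graph into a
# text target for language-model finetuning. The format and the toy "bake a
# cake" graph are made up for this example.
from typing import Dict, List

def subtask_graph_to_text(subtasks: List[str], edges: Dict[str, List[str]]) -> str:
    """Render a subtask graph as 'subtask <- prerequisite, ...' lines, which a
    finetuned LM could be trained to emit given a natural-language description."""
    lines = []
    for task in subtasks:
        prereqs = edges.get(task, [])
        lines.append(f"{task} <- {', '.join(prereqs) if prereqs else 'none'}")
    return "\n".join(lines)

subtasks = ["preheat oven", "mix batter", "pour batter", "bake"]
edges = {
    "pour batter": ["mix batter"],
    "bake": ["preheat oven", "pour batter"],
}
print(subtask_graph_to_text(subtasks, edges))
```

Pairs of (instruction text, flattened graph) produced this way would then form the finetuning set.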
Knowledge-grounded Scientific Reasoning; Led by Prof. Lu Wang; $180,000
Teaching machines to reason like humans has received significant attention over the past decades. This project aims to empower LLMs with knowledge-grounded reasoning skills. We focus on the scientific domain, which has been shown to be more challenging than domains that require only general or commonsense knowledge (Gao et al., 2022). We will develop a novel framework that supports just-in-time knowledge retrieval for use in reasoning chain generation and answer derivation.
Concretely, this project will make impacts in two research thrusts to address the aforementioned challenges, along with a new benchmark and novel evaluation methods. Major outcomes of this project include:
- A new reasoning framework equipped with knowledge indexing and retrieval abilities to support grounded inference, analysis, and content creation, trained with reinforcement learning and sample-efficient iterative prompting (a minimal retrieval-in-the-loop sketch follows this list).
- New decoding methods that allow the inclusion of knowledge constraints that branch multi-step reasoning processes.
- A new reasoning benchmark with long scientific articles to support model evaluation and future research in this important domain.
- New evaluation methods that measure models’ true reasoning ability beyond fact memorization.
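To make the first outcome above concrete, the following is a minimal, hypothetical sketch of how just-in-time retrieval could be interleaved with reasoning-chain generation. The `retrieve` and `generate_step` callables are placeholders, not components of the proposed framework or of any particular library.

```python
# Hypothetical sketch of "just-in-time" retrieval interleaved with
# reasoning-chain generation; the retriever and generator are placeholders.
from typing import Callable, List, Tuple

def grounded_reasoning(question: str,
                       retrieve: Callable[[str], List[str]],
                       generate_step: Callable[[str], str],
                       max_steps: int = 5) -> Tuple[List[str], str]:
    """Alternate between retrieving evidence for the latest step and generating
    the next reasoning step, then derive a final answer."""
    chain: List[str] = []
    for _ in range(max_steps):
        query = chain[-1] if chain else question     # retrieve for the newest step
        evidence = retrieve(query)
        prompt = (f"Question: {question}\n"
                  f"Evidence: {' '.join(evidence)}\n"
                  f"Steps so far: {' '.join(chain)}\n"
                  "Next step:")
        step = generate_step(prompt)
        chain.append(step)
        if step.lower().startswith("answer:"):       # generator signals completion
            return chain, step.split(":", 1)[1].strip()
    return chain, chain[-1] if chain else ""

# Trivial stand-ins for the retriever and generator.
print(grounded_reasoning("What is the boiling point of substance X?",
                         retrieve=lambda q: ["(retrieved passage)"],
                         generate_step=lambda p: "Answer: (final answer)"))
```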
Multimodal Language Models for Coherent Physical Commonsense Reasoning; Led by Prof. Joyce Chai; $164,677
Despite recent progress in AI, physical commonsense reasoning (PCR) remains a notoriously difficult problem for embodied AI agents. While large-scale pre-trained language models (LMs) have generated excitement over their high performance on many NLP tasks, they still struggle to reason about entities and events in the physical world in such tasks. As physical commonsense is often acquired from the perception and experience of the physical world, recent advances in multimodal LMs and Vision-Language Pre-training (VLP), as well as image generation from text, provide exciting new research opportunities.
As such, this project will develop methods to tackle PCR by taking advantage of multimodal representations. Following previous work, we will use NLP tasks as a proxy for PCR. The objective of our research is to learn multimodal representations that can capture physical commonsense knowledge through a notion of physical causality and imagination, and evaluate these representations on natural language understanding (NLU) and reasoning. While physical causality enables us to capture and track the dynamics of the physical world, imagination allows us to anticipate and reason about the physical world through visual simulation.
To serve this objective, the proposed work will focus on the following three efforts: (1) understand the role of existing multimodal models in PCR, (2) learn representations through physical causality and imagination, and (3) evaluate learned representations on PCR tasks.
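As a toy illustration of effort (1), the snippet below probes a pretrained, language-only GPT-2 on an invented PIQA-style physical-commonsense question by comparing candidate likelihoods; in the proposed work, multimodal models would take the place of this language-only stand-in, and the example question is not drawn from any project dataset.

```python
# Illustrative probe of a pretrained LM on a PIQA-style physical commonsense
# question: pick the candidate solution with the lower per-token negative
# log-likelihood. GPT-2 is a language-only stand-in; the question is invented.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_neg_log_likelihood(text: str) -> float:
    """Per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

goal = "To keep a drink cold on a long hike,"
choices = ["freeze the water bottle the night before.",
           "wrap the water bottle in a blanket next to a campfire."]
scores = [avg_neg_log_likelihood(f"{goal} {c}") for c in choices]
print("Model prefers choice", scores.index(min(scores)) + 1)
```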
Self-supervised Concept Discovery in Uncurated Multimodal Data; Led by Prof. Justin Johnson; $180,000
This project aims to build a system that takes a piece of natural-language text as input and generates a novel 3D shape that matches the text. The 3D shape may be encoded as a point cloud, triangle mesh, or implicit surface. We are inspired by recent progress in text-to-image generation, which has demonstrated that large, high-capacity conditional generative models trained on large amounts of paired (image, text) data can generate high-quality images conditioned on natural language text. While impressive, these models can only output 2D images and do not take into account the 3D structure of the world. We aim to overcome this limitation and build models that can generate full 3D shapes from natural language prompts.
This project will involve several key technical challenges. We expect to develop novel model architectures for generative models that output 3D shapes. We also expect to develop techniques for training with non-paired data. In the text-to-image domain, large quantities of paired (image, text) data can be harvested from the internet. We do not expect that paired (text, 3D shape) data can be acquired at a comparable scale, so we plan to develop techniques that can train jointly on modest amounts of (text, 3D shape) data and also make use of large quantities of (text, image) data that can be collected at scale.
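As a rough, hypothetical illustration of the modeling problem (not the proposed architecture), the sketch below maps a text embedding and a noise vector to a fixed-size point cloud; the dimensions, point count, and MLP design are placeholders, and the text embedding is assumed to come from some pretrained text encoder.

```python
# Toy sketch of a text-conditioned point-cloud generator: text embedding plus
# noise in, an N x 3 point cloud out. All sizes are illustrative placeholders.
import torch
import torch.nn as nn

class TextToPointCloud(nn.Module):
    def __init__(self, text_dim: int = 512, noise_dim: int = 128, num_points: int = 1024):
        super().__init__()
        self.num_points = num_points
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_points * 3),  # xyz coordinates for every point
        )

    def forward(self, text_emb: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([text_emb, noise], dim=-1))
        return out.view(-1, self.num_points, 3)  # (batch, N, 3) point cloud

# Random vectors stand in for a real text embedding (e.g., from a pretrained
# text encoder) and the sampled noise.
generator = TextToPointCloud()
points = generator(torch.randn(2, 512), torch.randn(2, 128))
print(points.shape)  # torch.Size([2, 1024, 3])
```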
Knowledge-grounded Personalized Dialog Generation; Led by Prof. Rada Mihalcea and Prof. David Jurgens in the School of Information
This project proposes to develop knowledge-enhanced personalized text generation for dialogue that uses commonsense, domain, personality, and cultural knowledge, as well as conversational context, to generate more accurate and more personalized dialogue. The driving hypothesis is that multiple forms of knowledge (commonsense, domain, social, and cultural) can all be represented in a unified framework, and that this unification will allow a dialogue system to effectively incorporate such knowledge into its responses. A key aim of this proposal is to build socially and culturally aware dialogue systems that are able to effectively communicate with the diverse, intersectional identities of the broader population.
The researchers will also conduct human evaluations, asking annotators to indicate their preferences between the best-performing models from the generative settings and a model without knowledge enhancement. They plan to evaluate each model response using four metrics: fluency, indicating whether the sentence is grammatically correct and natural; coherence, indicating whether the response is on topic and relevant to the dialogue history; reflectiveness, indicating whether the response is a reflection of the previous statement (a metric particularly relevant for the counseling dialogue dataset); and consistency, indicating whether the model reflects the same underlying persona in theme and content.
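A hypothetical sketch of how such per-response judgments might be recorded and averaged across annotators is shown below; the 1-5 scale and field names are assumptions, not the project's annotation protocol.

```python
# Hypothetical record and aggregation of human judgments on the four metrics;
# the rating scale and field names are assumptions.
from dataclasses import dataclass, fields
from statistics import mean
from typing import Dict, List

@dataclass
class Judgment:
    fluency: int         # grammatically correct and natural (e.g., 1-5)
    coherence: int       # on topic and relevant to the dialogue history
    reflectiveness: int  # reflects the previous statement (counseling dataset)
    consistency: int     # same underlying persona in theme and content

def aggregate(judgments: List[Judgment]) -> Dict[str, float]:
    """Average each metric across annotators for a single model response."""
    return {f.name: mean(getattr(j, f.name) for j in judgments) for f in fields(Judgment)}

print(aggregate([Judgment(5, 4, 3, 4), Judgment(4, 4, 4, 5)]))
```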