CLunch Archive

Here are the past talks given at CLunch.

CMU

May 6, 2024

Talking to Robots

How do we instruct robots to perform actions in the world? How much is conveyed in language vs inferred from context -- whether embodied or social? How do we build agents that then ask questions when confused? In this talk, I won't answer any of these questions, but I'll do my best to outline several pieces of work from the lab that try and lay the groundwork for exploring these larger issues, both within simulated and physical robots.

Leonie Weissweiler

LMU Munich

April 22, 2024

Could we be overestimating the linguistic capabilities of LLMs?

The evaluation of the linguistic capabilities of LLMs requires not only a target phenomenon, but also labelled natural data at scale or the means to create it artificially, which should be uncontaminated, ideally include languages other than English, and rely on implicit, rather than explicit, knowledge of language. These conditions are especially challenging to satisfy for the rare and complex phenomena that remain as challenges for state-of-the-art models. In this talk, I will present several evaluations of the morphological, syntactic, and semantic capabilities of LLMs, demonstrate strategies for gathering or creating data and setups to push the boundaries of current evaluation strategies, and show how these can be used to identify remaining LLM linguistic weaknesses.

Ana Marasović

University of Utah

April 15, 2024

Challenges in Fostering (Dis)Trust in AI

What factors enable people to trust trustworthy models and distrust untrustworthy models? Broadly, (dis)trust can be derived from two sources: (1) intrinsic, which stems from understanding a model's inner workings or reasoning, and (2) extrinsic, which is based on observing a model's external behaviors. Evaluation benchmarks created by AI researchers can foster extrinsic (dis)trust in a given contract, but they must be properly constructed. Only then can they ensure that a model, to pass the test, must truly uphold the intended contract. I will overview the challenges of constructing valid evaluations. On the other hand, explainable AI (XAI) aims to provide insights into a model’s reasoning, thus fostering intrinsic (dis)trust. XAI is not without its challenges, which I will discuss towards the end of my talk.

Diyi Yang

Stanford University

April 8, 2024

Human-AI Interaction in the Age of Large Language Models

Large language models have revolutionized the way humans interact with AI systems, transforming a wide range of fields and disciplines. In this talk, we discuss several approaches to enhancing human-AI interaction using LLMs. The first one looks at training people in conflict resolution skills via LLM-based simulation and feedback. The second part develops parameter-efficient learning techniques for adapting LLMs to low-resource languages and dialects towards accessible human-AI interaction. These different works demonstrate how human-AI interaction via LLMs can empower individuals and foster positive change.

Zhou Yu

Columbia University

April 1, 2024

Conversational AI beyond ChatGPT

ChatGPT amazed the general public with its ability to follow novel instructions. However, there is still a gap between ChatGPT and fundamental human conversation abilities. This talk describes two works toward filling this gap through better conversational planning and strategies. The first work, LLM-Augmenter, proposes a general framework that aligns LLM capabilities with user task intents through reinforcement learning planning. The second work demonstrates that a chatbot with advanced self-disclosure conversational strategies is more likable and more convincing.

Hyunwoo Kim

AI2

March 25, 2024

Theory of Mind and LLMs: What it is and Why it is important

Last year, debates about whether large language models (LLMs) demonstrate theory of mind capabilities sparked considerable interest in the AI field. Theory of mind refers to the ability to attribute mental states to others, a key aspect of human social reasoning. This includes understanding others' beliefs, desires, intentions, and thoughts, all of which play a significant role in social interactions. In this talk, I will delve deeper into the following questions: "Do LLMs have a theory of mind?", "What are essential criteria for evaluating theory of mind in LLMs?", and "Why is theory of mind important in AI systems?" More concretely, this talk will discuss important theoretical foundations from psychology and examine why theory of mind can be critical in addressing privacy concerns in LLMs.

Koustuv Saha

UIUC

March 18, 2024

Measuring Wellbeing in Situated Contexts with Social Media and Multimodal Sensing: Promises and Perils

A core aspect of our social lives is often embedded in the communities we are situated in. Our shared experiences and social ties intertwine our situated context with our wellbeing. A better understanding of wellbeing can help devise timely support provisions. However, traditional forms of wellbeing measurements have limitations, motivating an increasing interest in supporting wellbeing through passive sensing technologies. In parallel, social media platforms enable us to connect and express our personal and social lives with others. Given its ubiquity, social media can be considered a “passive sensor” to obtain naturalistic data, which can also be combined with various multimodal sensing to comprehensively measure wellbeing. However, wellbeing sensing technologies can lead to unintended outcomes and cause harms. Therefore, despite the potential, are we ready to deploy these wellbeing sensing technologies in the real world yet? In this talk, Koustuv Saha will present theory-driven computational and causal methods for leveraging social media in concert with complementary multisensor data to examine wellbeing, particularly in situated communities such as college campuses and workplaces. He will also interrogate the meaningfulness of the data and inferences and reflect on how these approaches can potentially be misinterpreted or misused without additional considerations. To bridge the gap between the theoretical promise and practical utility, he will present the importance of evaluating the needs, benefits, and harms of wellbeing sensing technologies in practice. This talk will propel the vision toward questioning the underlying assumptions and toward the responsible design and deployment of wellbeing sensing technologies (if at all) for situated communities and the future of work.

Eunsol Choi

University of Texas at Austin

March 11, 2024

Knowledge-Rich Language Systems in a Dynamic World

Natural language provides an intuitive and powerful interface to access knowledge at scale. Modern language systems draw information from two rich knowledge sources: (1) information stored in their parameters during massive pretraining and (2) documents retrieved at inference time. Yet, we are far from building systems that can reliably provide information from such knowledge sources. In this talk, I will discuss paths for more robust systems. In the first part of the talk, I will present a module for scaling retrieval-based knowledge augmentation. We learn a compressor that maps retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also filters irrelevant or incorrect information. In the second half of the talk, I will discuss the challenges of updating knowledge stored in model parameters and propose a method to prevent models from reciting outdated information by identifying facts that are prone to rapid change. I will conclude my talk by proposing an interactive system that can elicit information from users when needed.

Jeffrey (Young-Min) Cho

University of Pennsylvania

February 26, 2024

Impact of Response Length on LLM-Generated Dialogue Quality and User Perception

Large Language Models are often used as conversational agents, even though they are not predominantly trained on dialogue datasets. Consequently, their responses often diverge from those in natural human conversation, tending towards verbosity or, less frequently, brevity. In this paper, we study the impact of optimizing response length on the quality of a dialogue system. Our findings reveal that GPT produces responses that are longer than those of humans, and these are unexpectedly favored, even over human-generated responses, due to their richer informational content and perceived greater empathy. However, for applications such as voicebots, shorter responses could be preferred. To generate responses that match those from humans in length, we introduce RULER, a supervised model that leverages historical conversational data to guide the generation of responses of appropriate length. We find that RULER responses are judged to be of higher quality than those from humans, in spite of being comparable in length.

Shreya Havaldar

University of Pennsylvania

February 26, 2024

Evaluating Multicultural Behavior of LLMs

Multilingual LLMs like GPT-4 and Gemini are linguistically fluent (i.e. they generate fluent non-English text), but not necessarily culturally fluent (i.e. they appropriately reflect the social norms, emotions, and behaviors of users from different cultures). While it is important for us to make these LLMs better at cultural adaptation, we lack proper methods to evaluate the multicultural behavior of these models. Focusing on emotion, I present techniques grounded in cultural psychology to evaluate how well LLMs understand emotional subjectivity across cultures. Despite the fact that emotions are experienced and expressed differently across the world, we find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs struggle with cultural adaptation, and that developing proper techniques to evaluate this is an important problem for the NLP community.

Sunny Rai

University of Pennsylvania

February 26, 2024

Extracting Cross-Cultural Social Norms using Moral Emotions

In this talk, I will present a culture-agnostic approach to norm discovery, using moral emotions, shame and pride, to identify examples of normative expectations and extract corresponding social norms. These norms can be used for designing culturally aware NLP systems and achieving pluralistic values in LLMs.

Ben Zhou

University of Pennsylvania

February 19, 2024

Towards Generalizable and Controllable Reasoning in NLP and AI Systems

Advancements in natural language processing (NLP) have spurred a wave of innovation. Still, the reliability and generalizability of language models (LMs) remain areas of concern, blocking them from complex reasoning scenarios or sensitive topics. This talk presents works on augmenting models with experiential knowledge and symbolic reasoning to refine controllable reasoning, improve abduction skills, and bolster model generalizability. We will also examine the limitations of current semantic-based reasoning methods and highlight the integration of symbolic techniques to construct more transparent and explainable decision-making processes. Through synthesizing empirical evidence and theoretical insights, we propose pragmatic pathways toward responsible and trustworthy NLP applications in mission-critical environments.

Sihao Chen

University of Pennsylvania

February 19, 2024

Propositional Text Representation Learning in the era of LLMs

In an era where most NLP problems are solved in a text-in-text-out, end-to-end fashion, do meaning representations of text still matter? The answer is yes, and the benefit it brings might surprise you! I will introduce our recent line of work, where we rethink and redefine the use of propositions in modern NLP. I will discuss the benefit of propositional text representation learning in LLM-related applications such as hallucination detection, attribution for generated text, and retrieval-augmented generation.

William Wang

UCSB

February 12, 2024

Principles of Reasoning: Compositional and Collaborative Generative AI

A majority of existing research in large language models and generative AI systems focuses on scaling and engineering. In this talk, I argue that we need a principled understanding of the science of generative AI, in particular, to understand the emergent ability of large language models. First, I present a Bayesian latent variable approach to enhancing in-context learning in large language models (LLMs) through optimal demonstration selection, demonstrating substantial improvements across various text classification tasks. Second, I argue that modern generative AI systems must be modular and collaborative to solve complex reasoning problems. We introduce Logic-LM, an in-context framework that synergizes LLMs with symbolic solvers, significantly boosting logical problem-solving abilities. We will also elaborate on how to build in-context neuro-symbolic solutions to improve the compositionality in text-to-image systems. Our observations indicate that the future of generative AI is compositional and collaborative, as opposed to a single-model system.

Sebastian Gehrmann

Bloomberg

February 5, 2024

Evaluation in the age of Large Language Models

New language models are being developed at a rapid pace. While these models have incredible new abilities, we still mostly follow the same old paradigms when it comes to evaluating the language that these models produce. As a result, claims about their performance rely either on anecdotal evidence or on experiments on anglo-centric corpora with flawed metrics. We thus can’t systematically answer the question that lies at the core of natural language generation research: how good is a system that produces natural language and where does it fail? I will discuss the deliberations of languages, datasets, metrics, and human evaluations that are required to address this problem. I will also connect these insights to broader trends in the industry and how they affect development of new products.

Andrew Zhu

University of Pennsylvania

January 29, 2024

Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications

Language model applications are becoming increasingly popular and complex, often including features like tool usage and retrieval augmentation. However, existing frameworks for such applications are often opinionated, deciding for developers how their prompts ought to be formatted and imposing limitations on customizability and reproducibility. To solve this we present Kani: a lightweight, flexible, and model-agnostic open-source framework for building language model applications. Kani helps developers implement a variety of complex features by supporting the core building blocks of chat interaction: model interfacing, chat management, and robust function calling. All Kani core functions are easily overridable and well documented to empower developers to customize functionality for their own needs. Kani thus serves as a useful tool for researchers, hobbyists, and industry professionals alike to accelerate their development while retaining interoperability and fine-grained control.

Bryan Li

University of Pennsylvania

January 29, 2024

This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models

Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages—a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context.

Alyssa Hwang

University of Pennsylvania

January 29, 2024

Developing Grounded Intuition of Large Language Models

Large language models in the current era of natural language processing have shown unprecedented performance on increasingly complex tasks, leading to challenges in evaluating models and understanding their limits. Recent studies have turned to example-driven qualitative analysis to gain a better "intuition" of how LLMs respond to intricate, inventive requests. In this work, I propose a new methodology to systematize and substantiate this style of qualitative evaluation in techniques from the social sciences. Using GPT-Vision and scientific images as a case study, I will walk through the qualitative evaluation method, theoretical social science background, and resulting insights---intuition of the model's capabilities grounded in empirical evidence---to show how this method can be used for any generative model. I welcome feedback on adapting the preprint for a conference submission.

Hongming Zhang

Tencent AI Lab, Bellevue

December 11, 2023

Advancing Information Retrieval in the Age of Large Language Models

This presentation delves into the evolving landscape of information retrieval (IR) in the context of Large Language Models (LLMs). LLMs have showcased remarkable language comprehension abilities, yet they face challenges in rapidly assimilating private or real-time data, crucial for applications like virtual assistants. This talk will start with an overview of conventional IR techniques, including dense retrieval. Building upon this, I will present our recent work in proposition-level information retrieval, which enhances the granularity and precision of existing IR systems. In the end, I will discuss how to broaden the scope of IR to create a virtual assistant system.

Fei Wang

University of Southern California

November 27, 2023

Robust and Context-Faithful Language Understanding with (Large) Language Models

Large language models (LLMs) have achieved remarkable success in various language understanding tasks. However, their deployment in real-world scenarios raises significant accountability concerns. In this presentation, I will introduce our recent work on enhancing contextual faithfulness and robustness of LLMs. First, LLMs often make unfaithful predictions based on entity mentions or parametric knowledge, ignoring the context. I will present causality-driven approaches, including training-time and in-context causal intervention, to mitigate entity bias for both black-box and white-box LLMs. Second, LLMs may capture various unreliable prediction shortcuts, some of which could be unknown. I will demonstrate how to address this issue by proactively mitigating biases in the attention module without needing to identify the specific cause of the bias. Finally, I will outline future directions for advancing accountable and responsible LLMs.

Alexander Rush

Cornell Tech

November 20, 2023

Inverting Language Models

As language models enter production environments, their intermediate states are used for a myriad of downstream applications such as search, prompting, and document comparison. In this talk, I discuss the feasibility of language model inversion. Specifically, we are interested in how much information language models contain about their inputs. We investigate the problem in two scenarios: recovering text inputs from the embeddings produced by sentence embedders, and recovering them from the next-token probability outputs of language models. In many cases, our methods are able to fully recover exact textual inputs given just intermediate states. I'll discuss the security implications of these findings, as well as what this tells us about compression, embedding, and language modeling applications.

Greg Durrett

UT Austin

November 13, 2023

Making LLMs Right

Large language models (LLMs) like ChatGPT have been criticized for their propensity to 'hallucinate' and state incorrect information. However, the errors these systems make are not random: there are certain capabilities, like summarization, that they can do quite reliably, and others, like arithmetic, that are fundamentally unreliable. In this talk, I argue that paying attention to this divide in capabilities allows us to make LLMs more correct. First, I will discuss how we use LLMs as building blocks in systems that can do sound reasoning over natural language. For instance, we can use them to translate a natural language problem definition into a formal specification; alternatively, we can break a reasoning problem down into steps that are easily checkable. I will present our new dataset, MuSR, consisting of tasks like murder mysteries that feature challenging reasoning embedded in narratives. Second, I will discuss how we can figure out post-hoc whether LLMs' generations are right. Our approach is inspired by human fact-checking: first, dissect an LLM's 'claim' into pieces, then explain whether those pieces are right or wrong. Finally, I will discuss ongoing work on how to integrate this error detection capability into LLMs to improve the state of the art.

Mohit Iyyer

University of Massachusetts Amherst

November 6, 2023

Evaluating and detecting long-form LLM-generated text

Progress in NLP methodology over the last thirty years has been driven by benchmarks, from the Penn Treebank to GLUE. Benchmarks are useful because they provide a standard task, dataset, and means of evaluation that any researcher can use to quickly and easily demonstrate the value of their method. However, in the current age of LLMs, I argue that benchmarking is becoming increasingly obsolete. Beyond challenges such as data contamination, the dubious scientific validity of prompt engineering, and usage of closed-source APIs, each of which is critical in its own right, there exist fundamental issues with how to formulate real-world tasks into benchmarks that can rank LLMs based on the much-desired 'single score'. I highlight these issues using some of my lab's recent work on tasks such as long-form question answering, book-length summarization, and literary translation. Next, I'll pivot to a different problem that plagues not only evaluation but also society as a whole: the rapid proliferation of LLM-generated text. Detecting such text is important not only for combating malicious use cases such as academic plagiarism, but also for ensuring that LLMs of the future are not just pretrained on text generated by their inferior predecessors. I outline several attacks (e.g., paraphrasing, translation, cropping) against existing LLM-generated text detectors such as watermarking and describe a retrieval-based approach that is more robust to these attacks but comes with issues of its own.

Vivek Gupta

University of Pennsylvania

October 30, 2023

Inference and Reasoning for Semi-structured Tables

Understanding semi-structured tabular data, which is ubiquitous in the real world, requires an understanding of the meaning of text fragments and the implicit connections between them. We believe such data could be used to investigate how individuals and machines reason about semi-structured data. First, we present the InfoTabS dataset, which consists of human-written textual predictions based on tables collected from Wikipedia’s infoboxes. Our research demonstrates that the semi-structured, multi-domain, and heterogeneous nature of the premises prompts complicated, multi-faceted reasoning, offering a modeling challenge for traditional modeling techniques. Second, we analyze these challenges in depth and develop simple, effective preprocessing strategies to overcome them. Third, despite accurate NLI predictions, we demonstrate through rigorous probing that the existing model does not reason with the provided tabular facts. To address this, we suggest a two-stage evidence extraction and tabular inference technique for enhancing model reasoning and interpretability. We also investigate efficient methods for enhancing tabular inference datasets with semi-automatic data augmentation and pattern-based pre-training. Lastly, to ensure that tabular reasoning models work in more than one language, we introduce XInfoTabS, a cost-effective pipeline for translating tables. In the near future, we plan to test the tabular reasoning model for temporal changes, especially for dynamic tables where information changes over time.

Najoung Kim

Boston University

October 23, 2023

Entity Tracking in Language Models

Keeping track of how states of entities change as a text or dialog unfolds is a key prerequisite to discourse understanding. We propose a behavioral task testing to what extent a language model can infer the final state of an entity given a natural language description of the initial state and a series of state-changing operations, following a set of desiderata we lay out for measuring nontrivial entity tracking capacity. Our evaluations of several language models reveal that only (1) in-context learning with models trained on large amounts of code, or (2) finetuning a model directly on the entity tracking task lead to nontrivial entity tracking behavior. This suggests that language models can learn to track entities but pretraining on text corpora alone does not make this capacity surface. In light of these results, I will end with brief discussions of ongoing work regarding the role of code pretraining and probing for latent representations of entity states.

Yoon Kim

MIT

October 16, 2023

Large Language Models & Symbolic Structures

Over the past decade, the field of NLP has shifted from a pipelined approach (wherein intermediate symbolic structures such as parse trees are explicitly predicted and utilized for downstream tasks) to an end-to-end approach wherein pretrained large language models (LLMs) are adapted to various downstream tasks via finetuning or prompting. What role (if any) can symbolic structures play in the era of LLMs? In the first part of the talk, we will see how latent symbolic structures can be used to guide LLM-based neural machine translation systems to improve translation of low-resource languages, and also enable the use of new translation rules during inference. In the second part, we will see how expert-derived grammars can be used to control LLMs via prompting for tasks such as semantic parsing where the output structure must obey strict domain-specific constraints.

Martha Palmer

University of Colorado

October 9, 2023

Uniform Meaning Representations

This talk will discuss symbolic representations of sentences in context, focusing on abstract meaning representations (AMR), examining their capability for capturing certain aspects of meaning. A focus will be how AMRs can be expanded to encompass figurative language, the recovery of implicit arguments, and relations between events. These examples will be in English, and indeed some features of AMR are English-centric. Uniform Meaning Representations, a multi-sentence annotation scheme that is revising AMRs to make them more suitable for other languages, especially low-resource languages, will be introduced. UMRs include more formal logical scope, number, tense, aspect, and modality, as well as temporal relations. The talk will conclude with a discussion of ways in which these meaning representations can be enriched even more, by mapping to Wikidata Qnodes, and their potential for improving the explainability of large language models.

Frank Ferraro

University of Maryland, Baltimore County

October 2, 2023

Deconstructing Complex Events through Modeling Uncertainty, States, and Outcomes

Situations that people experience or describe---from global, macro-level events like "financial instability" to everyday, micro-level ones like "going to dinner"---can be complex, with uncertainty in what might happen next, participants' actions affecting one another, and overlapping events contributing to an outcome. Developing a computational understanding of these situations is not straightforward. In this talk, I will present methods for learning how to encode richer information about those descriptions. These approaches are a fruitful way of improving modeling performance, predicting what events might happen next, and identifying how the different participants in a situation are affected. First, I will examine how event descriptions can be augmented with structural and semantic hierarchy, while accounting for uncertainty in both. Second, I will look at how we can get models to reason about implicit states of participants in events, and reason about changes to these states as they relate to the broader situation. Finally, I will consider how we can characterize complex events by looking at participant-specific, state-based outcomes.

Josh Ludan

University of Pennsylvania

September 18, 2023

Concept Bottleneck Models for Text Classification

In this work, we apply the concept bottleneck framework to text classification. Traditional explainability methods for text classification often focus on local interpretability, highlighting which tokens are relevant but failing to explain their role in the decision-making process. Our system addresses this by offering concept-level explanations. For example, it can break down a restaurant review rating into individual measurements of concepts such as "ambiance" and "food quality," which linearly combine to yield the final prediction. We utilize large language models to generate and measure these concepts in a zero-shot manner, requiring no external guidance or information beyond the original task dataset. Our system is evaluated on twelve diverse text datasets, ranging from hate speech classification to movie review sentiment, and shows generally positive results. This includes an 89.3% accuracy in measuring machine-defined concepts when evaluated against human annotations as a gold standard. Overall, this work presents a promising direction for creating interpretable text classification systems.

Alyssa Hwang

University of Pennsylvania

September 18, 2023

Rewriting the Script: Adapting Text Instructions for Voice Interaction

Voice assistants have sharply risen in popularity in recent years, but their use has been limited mostly to simple applications like music, hands-free search, or control of internet-of-things devices. What would it take for voice assistants to guide people through more complex tasks? In our work, we study the limitations of the dominant approach voice assistants take to complex task guidance: reading aloud written instructions. Using recipes as an example, we observe twelve participants cook at home with a state-of-the-art voice assistant. We learn that the current approach leads to nine challenges, including obscuring the bigger picture, overwhelming users with too much information, and failing to communicate affordances. Instructions delivered by a voice assistant are especially difficult because they cannot be skimmed as easily as written instructions. Alexa in particular did not surface crucial details to the user or answer questions well. We draw on our observations to propose eight ways in which voice assistants can “rewrite the script”—summarizing, signposting, splitting, elaborating, volunteering, reordering, redistributing, and visualizing—to transform written sources into forms that are readily communicated through spoken conversation. We conclude with a vision of how modern advancements in natural language processing can be leveraged for intelligent agents to guide users effectively through complex tasks.

Hao Wang

Rutgers University

April 19th, 2023

Bayesian Deep Learning: From Single-Domain Reasoning to Infinite-Domain Adaptation

While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks that involve inference, reasoning, and planning require an even higher level of intelligence. The past few years have seen major advances in many perception tasks using deep learning models. In terms of higher-level inference, however, probabilistic graphical models, with their ability to expressively describe properties of variables and various probabilistic relations among variables, are still more powerful and flexible. To achieve integrated intelligence that involves both perception and inference, we have been exploring along a research direction, which we call Bayesian deep learning, to tightly integrate deep learning and Bayesian models within a principled probabilistic framework. In this talk, I will present the proposed unified framework and some of our recent work on Bayesian deep learning with various applications including recommendation, social network analysis, interpretable healthcare, domain adaptation, and representation learning.

Sunny Rai

University of Pennsylvania

April 12th, 2023

Investigating Racial Heterogeneity in Language Markers of Depression

The racial and ethnic differences in the manifestation of depression are well documented. However, the effect of these differences on computational models for mental disorders trained on online language is relatively unexplored. This work analyzes the interaction between race and linguistic features correlated with PHQ-9 score. Our experiments reveal that the pronoun "I", widely used as an indicator of depression, has a significant interaction with race, correlating with PHQ-9 scores for White but not for Black individuals. Various open vocabulary topics correlated with PHQ-9 demonstrate a contradictory trend for their usage by White and Black individuals when depressed. A linear regression machine learning model trained on White individuals predicts depression in White individuals with a Pearson r of 0.39 (p < 0.05) but returns an insignificant correlation for depression scores in Black individuals, indicating its inefficacy in diagnosing depression for the Black population. Interestingly, a model trained on Black individuals predicts depression in both racial groups, albeit with different performances (r = 0.355 for Black and r = 0.338 for White). The results advocate the urgent need to validate computational mental health models on minority populations before deployment.

Nanyun (Violet) Peng

University of California, Los Angeles

April 7th, 2023

Controllable Text Generation For Open-World Creativity

Recent advances in large auto-regressive language models have demonstrated strong results in generating natural language and significantly improved performance for applications such as machine translation and summarization. However, when the generation tasks are open-ended and the content is under-specified, or there are format or cross-modal association constraints, existing techniques struggle to generate long-term coherent and creative content that follows format constraints. This happens because autoregressive language models are only trained to predict the next word, and it is hard to impose structural or content controls/constraints on the model. In this talk, I will present our recent work on creative generation, including poetry and melody-to-lyrics generation, which highlights the importance of controllable text generation beyond the prevalent auto-regressive formulation. We propose a novel insertion-based generation model and a controllable decoding-time algorithm to steer models to better conform to constraints.

Roy Schwartz

Hebrew University of Jerusalem

March 29th, 2023

Spurious Correlations: Challenges, Solutions, and Opportunities

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances, culminating in a recent proposal to eliminate single-word correlations altogether. In this talk, I will identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signals encoding common sense and world knowledge. I will highlight several alternatives to dataset balancing, focusing on a surprising proposal: in order to mitigate biases in models, one needs to amplify them in our training sets.

Yoav Artzi

Cornell University (Cornell Tech)

March 22nd, 2023

Learning and Reasoning in Natural Language Interaction

Natural language is first and foremost an instrument of interaction, where interlocutors produce and comprehend language to relay information to accomplish their intents. This talk focuses on challenges and opportunities that arise from this interactive nature of language. The response of participants to the language they comprehend can form a strong learning signal for the party that produced the language. Did I achieve my intent? In the first part, I will show how to use this signal to learn to produce natural language instructions. I will then discuss the problem of language-conditioned reinforcement learning, where benchmark development has been hindered because computing rewards requires resolving language semantics. I will describe a new approach to address this challenge. Finally, core to linguistic interaction is the use of abstraction to communicate concepts in a generalizable way. I will describe a new resource to study this phenomenon, and show how it sheds light on the generalization abilities of language-and-vision pre-trained models.

Daniel Fried

Carnegie Mellon University (Language Technologies Institute)

March 15th, 2023

Using Language Strategically in Context

As NLP systems interact with people in a widening range of world contexts, it is increasingly important to model pragmatic aspects of language: the goals that underlie language use, and the effects that language has on people. Across a diverse range of task-oriented settings, we've found that reasoning about language as a strategic action allows NLP models to interact more successfully with human partners. First, I'll describe a procedure for pragmatically generating and interpreting instructions. We train listener and speaker models that imitate how people interpret and produce language in grounded contexts. We use these models to (1) predict how a person might interpret language from the system and (2) resolve ambiguity by reasoning about what goal might have made a person say what they did. These procedures make interaction with human partners more successful in settings including visually-grounded instruction following and interactive preference learning. I'll also give an overview of work with the FAIR Diplomacy team on CICERO, an agent that achieves human-level performance in the dialogue and strategy board game Diplomacy. CICERO integrates LLMs with a strategic planner: choosing mutually beneficial plans for itself and its partners, and generating dialogue in pursuit of these plans. When deployed in an anonymous online Diplomacy league with human partners, CICERO ranked in the top 10% of participants who played more than one game.

Niranjan Balasubramanian

Stony Brook University

March 6th, 2023

What ails multi-step reasoning and how to fix it.

Multi-step reasoning has seen much empirical progress on many datasets recently, especially in Question Answering. However, training and evaluating on typical crowdsourced datasets is problematic because of the potential for shortcut reasoning based on artifacts. What can we do about this? In this three part talk, I will first show how we can formalize and measure disconnected reasoning, a type of bad multihop reasoning. I will then discuss how we can construct new datasets using a bottom-up construction process, which allows us to better control for desired properties in the resulting dataset. In the third part, I will briefly present how synthetically generated data can be used to teach a broad range of multihop skills in a reliable manner and how to improve reliable multi-step reasoning in open-domain QA settings.

Graham Neubig

Carnegie Mellon University (Language Technologies Institute)

March 1st, 2023

Is My NLP Model Working? The Answer is Harder Than You Think

As natural language processing now permeates many different applications, its practical use is unquestionable. However, at the same time NLP is still imperfect, and errors cause everything from minor inconveniences to major PR disasters. Better understanding when our NLP models work and when they fail is critical to the efficient and reliable use of NLP in real-world scenarios. So how can we do so? In this talk I will discuss two issues: automatic evaluation of generated text, and automatic fine-grained analysis of NLP system results, which are some first steps towards a science of NLP model evaluation.

Alan Ritter

Georgia Tech

February 24th, 2023

Towards Cost Efficient Use of Pre-Trained Language Models

Large language models are leading to breakthroughs in a variety of applications, from information extraction systems that are accurate and robust, to human-like conversational assistants. In this talk I will analyze when the benefits of training a new model outweigh the computational costs, in the context of domain adaptation. Conventional wisdom holds that data annotation is expensive, so computational methods that leverage freely available unlabeled data can present an economical alternative when adapting to a new domain. The talk will examine this assumption in the context of pretraining-based domain adaptation, which requires significant GPU/TPU resources for each new domain. We frame domain adaptation as a consumer choice problem: given a fixed budget, what combination of annotation and pre-training leads to maximum utility? In the second part of the talk, I will discuss recent work on in-context learning for anaphora resolution. I will show that resolving anaphora in scientific protocols is a challenging task for in-context learning, then present a new method, MICE (Mixtures of In-Context Experts), and demonstrate how it can accurately resolve multiple-antecedent anaphora in paragraphs describing chemical synthesis procedures. MICE enables accurate few-shot anaphora resolution by ensembling hundreds of prompts that are created from only a handful of training examples. Finally, I will discuss applications of NLP on chemical synthesis protocols and show a demo of a system that can help chemists more efficiently find experimental details described in the literature.

Julian Michael

NYU

February 15th, 2023

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

I will present the results of the NLP Community Metasurvey (https://nlpsurvey.net). This was a questionnaire that we ran from May to June 2022 which elicited the opinions of NLP researchers on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies. For example, respondents are split almost exactly in half on questions about: the importance of artificial general intelligence, whether language models understand language, and the necessity of linguistic structure and inductive bias for solving NLP problems. In addition, the survey posed "meta-questions," asking respondents to predict the distribution of survey responses. This allows us not only to gain insight on the spectrum of beliefs held by NLP researchers, but also to uncover false sociological beliefs where the community’s predictions don’t match reality. We find such mismatches on a wide range of issues. Among other results, the community greatly overestimates its own belief in the usefulness of benchmarks and the potential for scaling to solve real-world problems, while underestimating its own belief in the importance of linguistic structure, inductive bias, and interdisciplinary science. Our hope is that this can provide context for the NLP research community to have more informed and self-aware discussions of these complex issues. In this talk, I will walk through our results and open the floor for such a discussion.

Karl Stratos

Rutgers CS

February 1st, 2023

Retrieval-Augmented Models for Natural Language Processing

Prompting large pretrained language models has been enormously successful in solving a wide class of natural language tasks. In this approach, a task is formatted in some natural language template to "prompt" the model to generate the correct answer (e.g., "Q: Why is the sky blue? A: "). While surprisingly effective, it often generates false and unverifiable claims, limiting real-world applications. In this talk, I will advocate an alternative approach based on retrieval. Instead of naively generating answers, the model must first retrieve a piece of evidence from a knowledge base (e.g., Wikipedia). By having an explicit knowledge retrieval step, the model is forced to return factually accurate and verifiable claims. It can also use a new knowledge base at test time, thus is capable of zero-shot learning. I will focus on the task of entity retrieval and linking. I will first present a technique based on hard negative mining to make entity retrieval more robust (NAACL 2021). I will then build on the retrieval framework to present a novel paradigm for entity linking (ICLR 2022).

Gail Weiss

EPFL

January 25, 2023

Thinking Like Transformers

Transformers - the purely attention based NN architecture - have emerged as a powerful tool in sequence processing. But how does a transformer think? When we discuss the computational power of RNNs, or consider a problem that they have solved, it is easy for us to think in terms of automata and their variants (such as counter machines and pushdown automata). But when it comes to transformers, no such intuitive model is available. In this talk I will present a programming language, RASP (Restricted Access Sequence Processing), which we hope will serve the same purpose for transformers as finite state machines do for RNNs. In particular, we will identify the base computations of a transformer and abstract them into a small number of primitives, which are composed into a small programming language. We will go through some example programs in the language, and discuss how a given RASP program relates to the transformer architecture.
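
To give a feel for what "abstracting a transformer into a few primitives" can look like, here is a plain-Python imitation of two attention-style operations, a selector and an aggregation, used to reverse a sequence. This is a hedged sketch only: it is not RASP syntax, and the function names, argument order, and the rule for handling non-numeric values are assumptions made for illustration.

```python
# Plain-Python imitation of attention-style primitives (not actual RASP syntax).

def select(keys, queries, predicate):
    """Build a boolean matrix: row i marks which key positions query i attends to."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector, values):
    """For each query row, combine the selected values.

    A single selected value is returned as-is; several numeric values are
    averaged, mimicking uniform attention; no selection yields None.
    """
    out = []
    for row in selector:
        chosen = [v for v, keep in zip(values, row) if keep]
        if not chosen:
            out.append(None)
        elif len(chosen) == 1:
            out.append(chosen[0])
        else:
            out.append(sum(chosen) / len(chosen))
    return out


def reverse(tokens):
    """Reverse a sequence: position i attends to position n - 1 - i."""
    n = len(tokens)
    indices = list(range(n))
    flipped = [n - 1 - i for i in indices]
    flip = select(indices, flipped, lambda k, q: k == q)
    return aggregate(flip, tokens)


print(reverse(list("hello")))
```

Each `select` plays the role of an attention pattern and each `aggregate` the role of the weighted average an attention head computes, which is the correspondence the talk explores.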

Soroush Vosoughi

Dartmouth College

December 12, 2022

Prosocial Language Models

Large-scale language models (e.g., BERT, GPT-3) have revolutionized the field of natural language processing (NLP). Such pre-trained models show close-to-human-level performance on diverse tasks with little or no training data. The success of such models is at least partially due to their large size (most have hundreds or even thousands of millions of parameters) and the large datasets used for their pre-training (typically collected from the web). However, these same attributes lead to these models reflecting the biases and the antisocial attitudes on the web. These attitudes are a significant bottleneck for using these models in real-world settings, especially for social applications. In my lab, we develop methods for post hoc (i.e., inference time) mitigation of such antisocial attitudes. Post hoc mitigation allows us to avoid retraining the models (which is costly and intractable) while enforcing prosocial attitudes during inference. In this talk, I will review some of our recent work for making language models less biased and more aligned with human moral values through inference-time mitigation.

Ben Van Durme

Johns Hopkins University

December 5, 2022

Embracing Uncertainty

I will discuss a series of projects on collecting labels with uncertainty. Time allowing, I will touch on model calibration and downstream tasks. Modern Artificial Intelligence rests heavily on probabilistic models for classification. This usually means categorical nominal assignment: discrete labels given to inputs at prediction time. For example, an object captured in an image either was or was not truly a "cat", or some text describes an event that we might describe as a "TRANSACTION". While concepts like cats and transactions can be real in the world, this does not mean agents can be certain of these truths. In practice, human agents (annotators) are forced to choose from a label set without reflecting uncertainty in their decisions, and modelers then force artificial agents to do the same. Blurry images and ambiguous texts lead humans to have uncertain beliefs, while modern neural frameworks trained on discrete labels make predictions with high confidence. Let's instead embrace uncertainty as part of agent (task) design.

Smaranda Muresan

Columbia University

November 28, 2022

Text Generation: The Curious Case of Figurative Language and Argumentation

Large-scale language models based on transformer architectures, such as GPT-3 or BERT, have advanced the state of the art in Natural Language Understanding and Generation. However, even though these models have shown impressive performance for a variety of tasks, they often struggle to model implicit and/or non-compositional meaning, such as figurative language and argumentative text. In this talk, I will present some of our work on text generation models for figurative language and argumentation. There are two main challenges we have to address to make progress in this space: 1) the need to model common sense and/or connotative knowledge required for these tasks; and 2) the lack of large training datasets. I will discuss our proposed theoretically-grounded knowledge-enhanced text generation models for figurative language such as metaphor and for argument reframing. If time permits I will share our recent efforts of using a model-in-the-loop approach for building datasets for figurative language understanding modeled as an entailment task with explanation generation.

Yulia Tsvetkov

University of Washington

November 21, 2022

Interpretation as Weak Supervision for Data-Efficient NLP

Deep learning is typically associated with an abundance of data. But there are scenarios when pre-collected data will never be enough. For example, language on social media is constantly evolving and pretrained language models cannot adapt to rapid language change, dialects, and sociolects, no matter how large pretraining/annotated datasets are. Other examples of constantly evolving and therefore always-low-resource language domains include scientific articles, expert notes, and even news. In this talk, I will advocate for using model interpretability methods to dynamically procure data annotations in such low resource scenarios. In the first part, I will show how instance attribution approaches to model interpretability can identify critical training examples to improve the robustness and adaptability of hate speech classifiers. In the second part, I'll show how self-explaining models can be used for entity and keyphrase extraction in scientific articles. I'll conclude with more ideas for this new paradigm of using approaches to interpreting neural networks as an intrinsic component in low-resource NLP systems and not only as a tool to present explanations to humans.

Rotem Dror

University of Pennsylvania

November 14, 2022

Standards for Experiment Design and Evaluation in Natural Language Processing

In this job-talk-like seminar, I will present selected works from my Ph.D. and postdoctoral research. In the first part of the talk, I will overview three papers that cover practices to compare two NLP models and decide which is better based on experiment practices that are prevalent in NLP, such as conducting experiments with multiple datasets and deep neural network models. In the second part of the talk, I will dive into the intriguing world of evaluation of text-generation applications, where I will discuss how to determine which automatic evaluation metrics are appropriate.

Hannaneh Hajishirzi

University of Washington

November 7, 2022

Toward Robust, Multi-Task Natural Language Processing

Recent advances in deep learning algorithms and large-scale datasets are spurring progress in many Natural Language Processing (NLP) tasks, including question answering. Nevertheless, these models cannot scale up when task-annotated training data are scarce. This talk presents my lab's work toward building general-purpose models in NLP and how to systematically evaluate them. I present a new meta-dataset – called super-Natural Instructions – that includes a variety of NLP tasks and their descriptions to evaluate cross-task generalization. Then, I introduce a new meta training approach that can solve more than 1600 NLP tasks only from their descriptions and a few examples. Finally, I present a series of works on robust fine-tuning methods and on how to edit models with arithmetic over task vectors.

Rui Zhang

Penn State University

October 31, 2022

Semantic Parsing in the Era of Large Language Models

Semantic parsing is the task of translating natural language sentences into meaning representations such as SQL queries and logic forms. Traditional semantic parsing research relies on careful data curation, heavy feature engineering, and task-specific model architectures. Despite their success, these approaches are typically not generalizable across different tasks and meaning representations, limiting systematic and compatible research. In this talk, I will provide a brief overview of recent progress in unified and efficient paradigms for semantic parsing with the help of large language models (i.e., UnifiedSKG). Then, I will describe our recent work on cross-lingual semantic parsing using text-to-text language models (i.e., XSemPLR) and retrieval-augmented in-context learning (i.e., XRICL), and two new datasets to challenge the reasoning abilities of large language models on tables (i.e., MultiHiertt) and first-order logic (i.e., FOLIO). I will conclude with some future directions for semantic parsing developments.

Xiang (Lorraine) Li

Allen Institute for AI (AI2) & University of Pittsburgh

October 24, 2022

Probabilistic Commonsense Knowledge in Language

Commonsense knowledge is critical to achieving artificial general intelligence. This shared common background knowledge is implicit in all human communication, facilitating efficient information exchange and understanding. However, commonsense research is hampered by the immense quantity of such knowledge, which makes an explicit categorization impossible. Furthermore, a plumber could repair a sink in a kitchen or a bathroom, indicating that common sense reveals a probable assumption rather than a definitive answer. To align with these properties of common sense, we want to model and evaluate such knowledge in a human-like way using probabilistic abstractions and principles. This talk will introduce a probabilistic model representing commonsense knowledge using a learned latent space of geometric embeddings -- probabilistic box embeddings. Using box embeddings makes it possible to handle commonsense queries with intersections, unions, and negations in a way similar to Venn diagram reasoning. Meanwhile, existing evaluations do not reflect the probabilistic nature of commonsense knowledge. To fill in the gap, I will discuss a method of retrieving commonsense related question answer distributions from human annotators and a novel method of generative evaluation. We utilize these approaches in two new commonsense datasets (ProtoQA, Commonsense frame completion). The combination of modeling and evaluation methods based on probabilistic principles sheds light on how commonsense knowledge can be incorporated into artificial intelligence models in the future.
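
The Venn-diagram intuition behind box embeddings can be sketched with fixed, hand-written boxes. This is only an illustrative simplification under stated assumptions: the boxes below are "hard" axis-aligned rectangles with made-up coordinates rather than the learned probabilistic boxes the talk describes, and the concept names and helper functions are hypothetical.

```python
# Toy "hard" box sketch: a concept is an axis-aligned box (lower corner, upper
# corner), its volume stands in for probability mass, and overlap gives
# conditional probability, P(a | b) = vol(a ∩ b) / vol(b).

def volume(box):
    """Product of side lengths; an empty box has volume 0."""
    lower, upper = box
    result = 1.0
    for lo, hi in zip(lower, upper):
        result *= max(hi - lo, 0.0)
    return result


def intersect(a, b):
    """Intersection of two boxes (may be empty, in which case volume is 0)."""
    lower = [max(x, y) for x, y in zip(a[0], b[0])]
    upper = [min(x, y) for x, y in zip(a[1], b[1])]
    return (lower, upper)


def conditional(a, b):
    """P(a | b) under the volume-as-probability reading; assumes vol(b) > 0."""
    return volume(intersect(a, b)) / volume(b)


# Made-up coordinates for illustration only.
sink = ([0.0, 0.0], [4.0, 2.0])
kitchen = ([2.0, 0.0], [6.0, 2.0])
garage = ([10.0, 0.0], [12.0, 2.0])

print(conditional(kitchen, sink))  # overlap 4.0 / sink volume 8.0
print(conditional(garage, sink))   # disjoint boxes
```

Unions and negations follow the same pattern through inclusion-exclusion on volumes, which is what makes the representation resemble Venn diagram reasoning.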

Jacob Andreas

MIT

October 17, 2022

Toward Natural Language Supervision

In the age of deep networks, "learning" almost invariably means "learning from examples". Image classifiers are trained with large datasets of images, machine translation systems with corpora of translated sentences, and robot policies with rollouts or demonstrations. When human learners acquire new concepts and skills, we often do so with richer supervision, especially in the form of language---we learn new concepts from exemplars accompanied by descriptions or definitions, and new skills from demonstrations accompanied by instructions. In natural language processing, recent years have seen a number of successful approaches to learning from task definitions and other forms of auxiliary language-based supervision. But these successes have been largely confined to tasks that also involve language as an input and an output---what will it take to make language-based training useful for the rest of the machine learning ecosystem? In this talk, I'll present two recent applications of natural language supervision to tasks outside the traditional domain of NLP: using language to guide visuomotor policy learning and inductive program synthesis. In these applications, natural language annotations reveal latent compositional structure in the space of programs and plans, helping models discover reusable abstractions for perception and interaction. This kind of compositional structure is present in many tasks beyond policy learning and program synthesis, and I'll conclude with a brief discussion of how these techniques might be more generally applied.

Chenhao Tan

University of Chicago

October 3, 2022

Towards Human-Centered Explanations of AI Predictions

Explanations of AI predictions are considered crucial for human-AI interactions such as model debugging and model-assisted decision making, but it remains an open question what makes effective AI explanations. In this talk, I will highlight the distinction between emulation and discovery tasks, which shapes the answers to this question. In emulation tasks, humans provide groundtruth labels and the goal of AI is to emulate human intelligence. Although it is intuitive to think that humans can provide valid explanations in this case, I argue that humans may not be able to provide "good" explanations. Despite the growing efforts in building datasets of human explanations, caution is required to use such human explanations for evaluation or as supervision signals. In contrast, in discovery tasks, humans may not necessarily know the groundtruth label. While human-subject experiments are increasingly used to evaluate whether explanations improve human decisions, human+AI rarely outperforms AI alone. I will discuss the importance of identifying human strengths and AI strengths, and present our initial efforts in decision-focused summarization. I will conclude with future directions for developing effective human-centered explanations.

William Wang

UC Santa Barbara

September 26, 2022

Self-Supervised Language-and-Vision Reasoning

A key challenge for Artificial Intelligence research is to go beyond static observational data and consider more challenging settings that involve dynamic actions and incremental decision-making. In this talk, I will introduce our work on visually-grounded language reasoning via the studies of vision-and-language navigation. In particular, I will emphasize three benefits of self-supervised learning: (1) it improves generalization in unseen environments; (2) it creates adversarial counterfactuals to augment observational data; (3) it enables transfer learning for challenging settings. I will also briefly introduce other reasoning problems my group has been working on recently.

Michael Strube

HITS & Heidelberg University

September 12, 2022

Generalizability and Robustness in Coreference Resolution

In the last ten years we have seen considerable improvements in the performance of coreference resolvers, from about 60 points F1-measure to more than 80 since the CoNLL shared tasks 2011 and 2012. These improvements are mostly due to new machine learning techniques, in particular neural coreference resolvers. However, while these improvements have been reported on the CoNLL data, it is not clear whether these improvements hold on datasets in other genres, domains, and languages. In this talk I report on a series of experiments -- done by PhD students in my research group -- testing the generalizability and robustness of coreference resolvers. Our experiments indicate that the results reported by modern machine learning based systems are not stable across genres and domains. However, the rule-based system by Lee et al. (2013), which won the CoNLL shared task 2011, is still competitive in these setups. A possible conclusion is that neural coreference resolvers should be equipped with more linguistic knowledge to make them more robust. To test generalizability, the field should evaluate not only on the CoNLL/OntoNotes data but also on different domains, genres, and languages, and in downstream tasks.

Su Lin Blodgett

Microsoft Research Montréal

April 25, 2022

Towards Equitable Language Technologies

Language technologies are now ubiquitous. Yet the benefits of these technologies do not accrue evenly to all people, and they can be harmful; they can reproduce stereotypes, prevent speakers of “non-standard” language varieties from participating fully in public discourse, and reinscribe historical patterns of linguistic discrimination. In this talk, I will take a tour through the rapidly emerging body of research examining bias and harm in language technologies and offer some perspective on the many challenges of this work. I will discuss some recent efforts to understand language-related harms in their sociohistorical contexts, and to investigate NLP resources developed for one such harm—stereotyping—touching on the complexities of deciding what these resources ought to measure, and how they ought to measure it.

Esin Durmus

Stanford University

April 18, 2022

On the Evaluation and Mitigation of Faithfulness Errors in Abstractive Summarization

Despite recent progress in abstractive summarization, systems still generate unfaithful summaries, i.e. summaries that contain information that is not supported by the input. There has been a lot of effort to develop methods to measure and improve faithfulness errors. In this talk, I will first introduce some of the proposed methods to measure faithfulness of summarization systems. Then, I will present a spurious correlate: i.e., extractiveness of the summary, that potentially influences how we should evaluate the faithfulness of these systems. In particular, I will describe our work that proposes a method to measure and improve faithfulness by accounting for the extractiveness of summarization systems. Furthermore, I will discuss the importance of accounting for spurious correlations (such as extractiveness, perplexity, and length) in designing effective evaluation frameworks for text generation.

Maarten Sap

Allen Institute for AI (AI2)

April 11, 2022

Detecting and Rewriting Socially Biased Language

Language has the power to reinforce stereotypes and project social biases onto others, either through overt hate or subtle biases. Accounting for this toxicity and social bias in language is crucial for natural language processing (NLP) systems to be safely and ethically deployed in the world. In this talk, I will first discuss subjectivity challenges in binary hate speech detection, by examining perceptions of offensiveness of text depending on reader attitudes and identities. Through an online study, we find several correlates between over- or under-detecting text as toxic based on political leaning, attitudes about racism and free speech. Then, as an alternative to binary hate speech detection, I will present Social Bias Frames, a new structured formalism for distilling biased implications of language. Using a new corpus of 150k structured annotations, we show that models can learn to reason about high-level offensiveness of statements, but struggle to explain why a statement might be harmful. Finally, I will introduce PowerTransformer, an unsupervised model for controllable debiasing of text through the lens of connotation frames of power and agency. With this model, we show that subtle gender biases in how characters are portrayed in stories and movies can be mitigated through automatic rewriting. I will conclude with future directions for better reasoning about toxicity and social biases in language.

Allyson Ettinger

University of Chicago

April 4, 2022

"Understanding" and prediction: Controlled examinations of meaning sensitivity in pre-trained models

In recent years, NLP has made what appears to be incredible progress, with performance even surpassing human performance on some benchmarks. How should we interpret these advances? Have these models achieved language "understanding"? Operating on the premise that "understanding" will necessarily involve the capacity to extract and deploy meaning information, in this talk I will discuss a series of projects leveraging targeted tests to examine NLP models' ability to capture meaning in a systematic fashion. I will first discuss work probing model representations for compositional meaning, with a particular focus on disentangling compositional information from encoding of lexical properties. I'll then explore models' ability to extract and deploy meaning information during word prediction, applying tests inspired by psycholinguistics to examine what types of information models encode and access for anticipating words in context. In all cases, these investigations apply tests that prioritize control of unwanted cues, so as to target the desired meaning capabilities with greater precision. The results of these studies suggest that although models show a good deal of sensitivity to word-level information, and to a number of semantic and syntactic distinctions, they show little sign of capturing higher-level compositional meaning, of capturing logical impacts of meaning components like negation, or of retaining access to robust representations of meaning information conveyed in prior context. I will discuss potential implications of these findings with respect to the goals of achieving "understanding" with currently dominant pre-training paradigms.

Heng Ji

University of Illinois at Urbana-Champaign

March 28, 2022

Information Surgery: Faking Multimedia Fake News for Real Fake News Detection

We are living in an era of information pollution. The dissemination of falsified information can cause chaos, hatred, and trust issues among humans, and can eventually hinder the development of society. In particular, human-written disinformation, which is often used to manipulate certain populations, has had a catastrophic impact on multiple events, such as the 2016 US Presidential Election, Brexit, the COVID-19 pandemic, and Russia’s recent assault on Ukraine. Hence, we are in urgent need of a defense mechanism against human-written disinformation. While there has been a lot of research and many recent advances in neural fake news detection, many challenges remain. In particular, the accuracy of existing techniques at detecting human-written fake news is barely above random. In this talk I will present our recent attempts at tackling four unique challenges in the frontline of combating fake news written by both machines and humans: (1) Define a new task on knowledge element level misinformation detection based on cross-media knowledge extraction and reasoning to make the detector more accurate and explainable; (2) Generate training data for the detector based on knowledge graph manipulation and knowledge graph guided natural language generation; (3) Use Natural Language Inference to ensure the fake information cannot be inferred from the rest of the real document; (4) Propose the first work to generate propaganda for more robust detection of human-written fake news.

Omer Levy

Tel Aviv University

March 21, 2022

SCROLLS: Standard CompaRison Over Long Language Sequences

NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.

Jordan Boyd-Graber

University of Maryland

March 14, 2022

Manchester vs. Cranfield: Why do we have computers answering questions from web search data and how can we do it better?

In this talk, I'll argue that the intellectual nexus of computers searching through the web to answer questions comes from research undertaken in two mid-century English university towns: Manchester and Cranfield. After reviewing the seminal work of Cyril Cleverdon and Alan Turing and explaining how it shaped today's information and AI age, I'll argue that these represent two competing visions for how computers should answer questions: either exploration of intelligence (Manchester) or serving the user (Cranfield). However, regardless of which paradigm you adhere to, I argue that the ideals for those visions are not fulfilled in modern question answering implementations: the human (Ken Jennings) vs. computer (Watson) competition on Jeopardy! was rigged, other evaluations don't show which system knows more about a topic, the training and evaluation data don't reflect the background of users, and the annotation scheme for training data is incomplete. After outlining our short-term solutions to these issues, I'll then discuss a longer-term plan to achieve the goals of both the Manchester and Cranfield paradigms.

Tal Linzen

New York University

February 28, 2022

Causal analysis of the syntactic representations used by Transformers

The success of artificial neural networks in language processing tasks has underscored the need to understand how they accomplish their behavior, and, in particular, how their internal vector representations support that behavior. The probing paradigm, which has often been invoked to address this question, relies on the (typically implicit) assumption that if a classifier can decode a particular piece of information from the model's intermediate representation, then that information plays a role in shaping the model's behavior. This assumption is not necessarily justified. Using the test case of everyone's favorite syntactic phenomenon - English subject-verb number agreement - I will present an approach that provides much stronger evidence for the *causal* role of the encoding of a particular linguistic feature in the model's behavior. This approach, which we refer to as AlterRep, modifies the internal representation in question such that it encodes the opposite value of that feature; e.g., if BERT originally encoded a particular word as occurring inside a relative clause, we modify the representation to encode that it is not inside the relative clause. I will show that the conclusions of this method diverge from those of the probing method. Finally, if time permits, I will present a method based on causal mediation analysis that makes it possible to draw causal conclusions by applying counterfactual interventions to the *inputs*, contrasting with AlterRep which intervenes on the model's internal representations.

Maria Ryskina

Carnegie Mellon University

February 21, 2022

Learning Computational Models of Non-Standard Language

Non-standard linguistic items, such as novel words or creative spellings, are common in domains like social media and pose challenges for automatically processing text from these domains. To build models capable of processing such innovative items, we need to not only understand how humans reason about non-standard language, but also be able to operationalize this knowledge to create useful inductive biases. In this talk, I will present empirical studies of several phenomena under the umbrella of non-standard language, modeled at levels of granularity ranging from individual users to entire dialects. First, I will show how idiosyncratic spelling preferences reveal information about the user, with an application to the bibliographic task of identifying typesetters of historical printed documents. Second, I will discuss the common patterns in user-specific orthographies and demonstrate that incorporating these patterns helps with unsupervised conversion of idiosyncratically romanized text into the native orthography of the language. In the final part of the talk, I will focus on word emergence in a dialect as a whole and present a diachronic corpus study modeling the language-internal and language-external factors that drive neology.

Spencer Caplan

Swarthmore College

February 14, 2022

On the importance of baselines: Communicative efficiency and the statistics of words in natural language

Is language designed for communicative and functional efficiency? G. K. Zipf (1949) famously argued that shorter words are more frequent because they are easier to use, thereby resulting in the statistical law that bears his name. Yet, G. A. Miller (1957) showed that even a monkey randomly typing at a keyboard, and intermittently striking the space bar, would generate “words” with similar statistical properties. Recent quantitative analyses of human language lexicons (Piantadosi et al., 2012) have revived Zipf's functionalist hypothesis. Ambiguous words tend to be short, frequent, and easy to articulate in language production. Such statistical findings are commonly interpreted as evidence for pressure for efficiency, as the context of language use often provides cues to overcome lexical ambiguity. In this talk, I update Miller's monkey thought experiment to incorporate empirically motivated phonological and semantic constraints on the creation of words. I claim that the appearance of communicative efficiency is a spandrel (in the sense of Gould & Lewontin, 1979), as lexicons formed without the context of language use or reference to communication or efficiency exhibit comparable statistical properties. Furthermore, the updated monkey model provides a good fit for the growth trajectory of English as recorded in the Oxford English Dictionary. Focusing on the history of English words since 1900, I show that lexicons resulting from the monkey model provide a better embodiment of communicative efficiency than the actual lexicon of English. I conclude by arguing that the kind of faulty logic underlying the study of communicative efficiency crops up quite commonly within NLP -- evaluation metrics, and appropriate baselines, need to be carefully considered before any claims (cognitive or otherwise) can safely be made on their basis.
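
Miller's baseline is easy to reproduce. The sketch below simulates a "monkey" pressing keys uniformly at random, with the space bar acting as a word boundary, and reports the mean frequency of the resulting "words" by length. The alphabet size, keystroke count, and function names are illustrative assumptions, and this is the plain random-typing setup rather than the updated, phonologically constrained model described in the talk.

```python
import random
from collections import Counter, defaultdict


def monkey_lexicon(n_keystrokes=200_000, alphabet="abcd", seed=0):
    """Type random keys (letters plus space) and count the resulting 'words'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    keys = alphabet + " "      # every key, including space, is equally likely
    text = "".join(rng.choice(keys) for _ in range(n_keystrokes))
    return Counter(text.split())  # split() drops empty strings between spaces


def mean_frequency_by_length(counts):
    """Average token frequency of the observed word types at each word length."""
    by_length = defaultdict(list)
    for word, freq in counts.items():
        by_length[len(word)].append(freq)
    return {length: sum(f) / len(f) for length, f in sorted(by_length.items())}


if __name__ == "__main__":
    stats = mean_frequency_by_length(monkey_lexicon())
    for length in (1, 2, 3):
        print(length, round(stats[length], 1))
```

Even though nothing in this process refers to communication or efficiency, shorter "words" come out far more frequent than longer ones, which is why a random baseline of this kind is needed before reading such a pattern as evidence of functional pressure.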

Peter Clark

Allen Institute for AI (AI2)

February 7, 2022

Systematic Reasoning and Explanation over Natural Language

Recent work has shown that transformers can be trained to reason *systematically* with natural language (NL) statements, answering questions with answers implied by a set of provided facts and rules, and even generating proofs for those conclusions. However, these systems required all the knowledge to be provided explicitly as input. In this talk, I will describe our current work on generalizing this to real NL problems, where the system produces faithful, entailment-based proofs for its answers, including materializing its own latent knowledge as needed for those proofs. The resulting reasoning-supported answers can then be inspected, debugged, and corrected by the user, offering new opportunities for interactive problem-solving dialogs, and taking a step towards "teachable systems" that can learn from such dialogs over time.

Sihao Chen, Liam Dugan, Xingyu Fu

University of Pennsylvania

January 31, 2022

Mini Talks

The three talks this week include "Characterizing Media Presentation Biases and Polarization with Unsupervised Open Entity Relation Learning" (Sihao Chen), "Are humans able to detect boundaries between human-written and machine-generated text?" (Liam Dugan) and "There’s a Time and Place for Reasoning Beyond the Image" (Xingyu Fu).

Jonathan Berant

Tel Aviv University

December 6, 2021

Zero-shot learning and out-of-distribution generalization: two sides of the same coin

Recent advances in large pre-trained language models have shifted the NLP community’s attention to new challenges: (a) training models with zero, or very few, examples, and (b) generalizing to out-of-distribution examples. In this talk, I will argue that the two are intimately related, and describe ongoing (read, new!) work in those directions. First, I will describe a new pre-training scheme for open-domain question answering that is based on the notion of “recurring spans” across different paragraphs. We show this training scheme leads to a zero-shot retriever that is competitive with DPR (which trains on thousands of examples), and is more robust w.r.t. the test distribution. Second, I will focus on compositional generalization, a particular type of out-of-distribution generalization setup where models need to generalize to structures that are unobserved at training time. I will show that the view that seq2seq models categorically do not generalize to new compositions is false, and present a more nuanced analysis, which elucidates the conditions under which models struggle to compositionally generalize.

He He

New York University

November 30, 2021

Out-of-distribution generalization in NLP

Real-world NLP models must work well when the test distribution differs from the training distribution. While we have made great progress in natural language understanding thanks to large-scale pre-training, current models still take shortcuts and rely on spurious correlations in specific datasets. In this talk, I will discuss the role of pre-training and data in model robustness to distribution shifts. In particular, I will describe how pre-trained models avoid learning spurious correlations, when data augmentation helps and hurts, and how large language models can be leveraged to improve few-shot learning.

Yue Yang

University of Pennsylvania

November 22, 2021

Investigate Procedural Events in a Multimodal Fashion

Recently, there has been growing attention to studying procedural events, but most existing work focuses on text. We use multimodality as a tool to probe procedural knowledge. This talk will introduce two projects: 1) Visual Goal-Step Inference using wikiHow -- Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. 2) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval -- Schemas are structured representations of complex tasks that can aid artificial intelligence by allowing models to break down complex tasks into intermediate steps. We propose a novel system that induces schemas from web videos and generalizes schemas for unseen tasks to improve video retrieval performance.

Marjorie McShane

Rensselaer Polytechnic Institute

November 15, 2021

Toward Broad and Deep Language Understanding for Intelligent Systems

The early vision of AI included the goal of endowing intelligent systems with human-like language processing capabilities. This proved harder than expected, leading the vast majority of natural language processing practitioners to pursue less ambitious, shorter-term goals. Whereas the utility of human-like language processing is unquestionable, its feasibility is quite justifiably questioned. In this talk, I will not only argue that some approximation of human-like language processing is possible, I will present a program of R&D that is working on making it a reality. This vision, as well as progress to date, is described in the book Linguistics for the Age of AI (MIT Press, 2021).

Daphne Ippolito

University of Pennsylvania

November 1, 2021

Language Models Memorize their Training Data; Dataset Deduplication Helps

Large neural language models are capable of memorizing their training data. First, I will discuss why this memorization is bad and the subtleties involved in studying harmful memorization tendencies. Then, I will go over some early results on the circumstances under which GPT-Neo, a popular public language model, exhibits memorization. Finally, I will describe our recent paper on deduplicating training data and discuss how models trained on deduplicated data memorize less, are more efficient to train, and possibly generalize better. I will also examine the problem of train-test leakage in existing popular datasets.
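
To make the idea of deduplication concrete, here is a minimal sketch that drops exact duplicates after light normalization (lowercasing and whitespace collapsing). It is deliberately the simplest possible variant, and the function names are illustrative; it does not handle near-duplicates or repeated substrings, which the work described in the talk may address with more sophisticated methods.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse runs of whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)  # preserve the original, un-normalized text
    return kept
```

Hashing keeps memory use proportional to the number of distinct documents rather than their total length, which matters at corpus scale.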

Samuel Bowman

NYU

October 25, 2021

Overclaiming in NLP Is a Serious Problem. Underclaiming May Be Worse.

In an effort to avoid reinforcing widespread hype about the capabilities of state-of-the-art language technology systems, researchers have developed practices in framing and citation that serve to deemphasize the field's successes, even at the cost of making misleadingly strong claims about the limits of our best systems. This is a problem, though, and it may be more serious than it looks: It limits our ability to mitigate short-term harms from NLP deployments and it limits our ability to prepare for the potentially-enormous impacts of more distant future systems. This paper urges researchers to be careful about these claims, and suggests some research directions that will make it easier to avoid or rebut them.

Diyi Yang

Georgia Tech

October 18, 2021

Socially Aware Language Technologies: Theory, Method, and Practice

Natural language processing (NLP) has had increasing success and produced extensive industrial applications. Despite being sufficient to enable these applications, current NLP systems often ignore the social part of language, e.g., who says it, in what context, for what goals. In this talk, we take a closer look at social factors in language via a new theory taxonomy and its interplay with computational methods via two lines of work. The first one studies hate speech and racial bias by introducing a benchmark corpus on implicit hate speech and computational models on detecting and explaining latent hatred in language. The second part demonstrates how more structures of conversations can be utilized to generate better summaries for everyday interaction. We conclude by discussing several open-ended questions about how to build socially aware language technologies.

Hangfeng He

University of Pennsylvania

October 4, 2021

Incidental Supervision for Natural Language Understanding

It is labor-intensive to acquire human annotations for natural language understanding (NLU) tasks because annotation can be complex and often requires significant linguistic expertise. Therefore, it is important to investigate how to get supervision from indirect signals and improve one's target task. In this talk, we focus on improving NLU by exploiting incidental supervision signals. Specifically, our goal is to first provide a better understanding of incidental signals, and then design more efficient algorithms to collect, select, and use incidental signals for NLU tasks. This problem is challenging because of the intrinsic differences between incidental supervision signals and target tasks. In addition, the complicated properties of natural language, such as variability and ambiguity, make the problem more challenging. Our contribution to this line of work so far is in three directions. First, we show how to exploit information from cheap signals to help other tasks. Specifically, we retrieve distributed representations from question-answering (QA) pairs to help various downstream tasks. Second, in order to facilitate selecting appropriate incidental signals for a given target task, we propose a unified informativeness measure to quantify the benefits of various incidental signals. Finally, we design efficient algorithms to exploit specific types of incidental signals, including a new weighted training algorithm that improves the sample efficiency of learning from cross-task signals. In the future, we plan to further investigate the usage of incidental signals for NLU tasks by better understanding the properties of natural language. Specifically, we propose to work on reasoning in natural language, and study the benefit of the structure in NLU tasks.

Tom Hope

Allen Institute for AI

September 27, 2021

Harnessing Scientific Literature for Boosting Discovery and Innovation

In the year 1665, the first academic journal was published. Fast forward to today, there are millions of scientific papers coming out every year. This explosion of knowledge represents an opportunity to accelerate innovation with automated systems that scour the literature for solutions and inspirations. However, it also creates information overload and isolated “research bubbles” that limit discovery and sharing, slowing down scientific progress and cross-fertilization. In this talk, I will present our work toward addressing these large-scale challenges for the future of science. In the first part of the talk, I will overview our core approach which consists of identifying key “building blocks” of scientific thought, formalizing and structuring them into computational representations that power creative innovation systems we construct. These include systems that surface inspirations, recommend novel authors, enable search for challenges, hypotheses and causal relations, and tools for exploration and visualization of collaboration networks. The second part of the talk will consist of a dive into our new work -- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts (AKBC 2021) -- motivated by some of the applications above. We present a new task of cross-document coreference with a referential hierarchy over mention clusters, including a new challenging dataset and models. Finally, if time permits, I will discuss our recent paper --- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (AKBC 2021), where we integrate language models and graph embeddings to boost biomedical link prediction with applications in drug discovery.

Bryan Li, Weiqiu You, Qing Lyu (Veronica)

University of Pennsylvania

September 20, 2021

Mini Talks

Our mini talks include "Careful with Context: A Critique of Methods for Commonsense Inference" presented by Bryan Li, "Zero-shot Image Classification with Text using Pretrained Embedding" presented by Weiqiu You, and "Is 'my favorite new movie' 'my favorite movie'? Probing the Understanding of Recursive Noun Phrases" presented by Qing Lyu (Veronica).

Danqi Chen

Princeton University

May 3, 2021

Learning Representations for Dense Retrieval

Dense retrieval has become a new paradigm to retrieve relevant text information in open-domain question answering and other knowledge-intensive NLP tasks. Compared to sparse, non-trainable vector space models, dense retrieval holds great promise in better capturing semantic relationships (e.g., synonyms and paraphrases) between the query and retrieved text units. However, training dense vector models from limited labeled data and scaling them to a large text corpus remains challenging. In this talk, I will discuss two recent studies: (1) Dense Passage Retriever (DPR), a simple and effective method that allows learning a dense retriever from a small number of question-answer pairs. It greatly outperforms BM25 and can be used with an extractive or generative reader model for QA and other tasks. (2) DensePhrases, which builds an index of dense representations of all the phrases at the Wikipedia scale. We can directly run retrieval at phrase level and obtain extreme runtime efficiency and competitive performance. DensePhrases can also be used as a dense knowledge base.

Veronica Perez-Rosas

University of Michigan

April 26, 2021

Natural Language Processing for Enhanced Mental Healthcare

In recent years, there has been an increasing need for psychotherapy to address a wide variety of behavioral and mental health issues. This need has become even more prominent during the ongoing pandemic as COVID-19 related concerns have increased mental distress. Developing computational methods that gain a better understanding of mental health conversations can help practitioners to improve the quality of care. In this talk, I will first describe work on identifying conversational behaviors that lead to successful counseling interactions. Next, I will present ongoing work on developing a counseling dialog generation system that can assist counselors while acquiring and improving counseling skills. In particular, I will describe a counseling dialog system that provides language feedback to counseling trainees using the pretrained transformer architecture and context augmentation techniques inspired by traditional strategies used during counseling training.

Nate Chambers

United States Naval Academy

April 19, 2021

Extracting from Adversarial Text with a Visual Character-Based Model: extracting phone numbers from human trafficking ads

Adversarial text is written with obfuscated words and characters for the purpose of fooling machine-learned extractors. Illicit domains like human trafficking often employ such techniques. This talk will address the challenge of extracting phone numbers from this noisy text, such as "3w?n7_callme28tree(?nE)_573", but more broadly the talk will discuss the NLP challenge of dealing with unicode characters in any domain. With very little available training data for human trafficking, how can today's neural models learn to generalize to the diversity of noise available to an adversarial writer? This talk will present a couple of solutions to this challenge, focusing on character-based neural models that use NLP architectures like LSTMs and CRFs, but that also draw inspiration from the vision community to perform image recognition of the characters with CNNs. I'll first present results from our Best Paper Award at the Workshop on Noisy User-generated Text, exploring extraction from short text snippets, and then show simple steps to expand it to full document extraction.

Ivan Vulic

Cambridge University

April 12, 2021

Cross-Lingual Transfer in Low-Data Regimes: On Some Achievements, Trends, and Challenges

A key challenge in cross-lingual NLP is developing general language-independent architectures that will be equally applicable to any language. However, this ambition is hindered by the large variation in 1) structural and semantic properties of the world’s languages, as well as 2) raw and task data scarcity for many different languages, tasks, and domains. As a consequence, existing language technology is still largely limited to a handful of resource-rich languages. In this talk, we introduce and discuss a range of recent techniques and breakthroughs that aim to deal with such large cross-language variations and low-data regimes efficiently. We cover a range of cutting-edge approaches including adapter-based models for cross-lingual transfer, contextual parameter generation and hypernetworks, learning in few-shot and zero-shot scenarios, and typologically driven learning and source selection. Finally, this talk demonstrates that low-resource languages, despite very positive research trends and results achieved in recent years, still lag behind major languages, and outlines several key challenges for future research in this area.

Lillian Lee

Cornell University

April 5, 2021

Discussion Dynamics: Early prediction of controversy; content removal as a moderation strategy

Elizabeth Clark

University of Washington

March 29, 2021

Where NLG Meets People: Text Generation Models and Evaluation for Human-Machine Collaboration

Natural language generation (NLG) models' ability to generate long, fluent texts has enabled progress and new applications across many NLG subfields, but it also poses challenges for model evaluation. In this talk, I will discuss how we can use NLG models in a collaborative setting to offer suggestions to people as they perform a creative writing task. I will present a "machine-in-the-loop" framework for machine-writer collaboration and show how it can be used to improve NLG models. I will also discuss the challenge of evaluating long, fluent passages of generated text and introduce Sentence Mover's Similarity, a metric for automatically evaluating multi-sentence text. Finally, I will discuss the role of human evaluations in NLG and propose directions for collecting better human evaluations for current NLG models.

Abigail See

Stanford University

March 22, 2021

Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations

In this talk I will present Chirpy Cardinal, an open-domain dialogue agent built by the Stanford NLP team in the 2019-2020 Alexa Prize competition. Building an open-domain socialbot that talks to real people is challenging – such a system must meet multiple user expectations such as broad world knowledge, conversational style, and emotional connection. Our socialbot engages users on their terms – prioritizing their interests, feelings and autonomy. As a result, our socialbot provides a responsive, personalized user experience, capable of talking knowledgeably about a wide variety of topics, as well as chatting empathetically about ordinary life. Neural generation plays a key role in achieving these goals, providing the backbone for our conversational and emotional tone. Chirpy Cardinal ultimately won 2nd place in the competition, with a 3.6/5.0 average customer rating. In this talk I will cover the technical details of the bot, analysis of its strengths and weaknesses, unexpected findings during the competition, and future work.

Kellie Webster

Google

March 15, 2021

Best Practices for using Natural Language Models: A Case Study from Gendered Correlations

Natural language processing has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, using masked language modeling. The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream. I will present two works in this direction, “Measuring and Reducing Gendered Correlations in Pre-trained Models” and "Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation". In the first, we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models: (i) It is important to measure for unintended correlations; (ii) Be careful even when making seemingly innocuous configuration changes; and (iii) There are opportunities for general mitigations. In the second, we explore how to leverage the rich representations in BERT to improve gendered pronoun accuracy in machine translation.

David Bamman

University of California, Berkeley

March 8, 2021

Modeling the Spread of Information within Novels

Understanding the ways in which information flows through social networks is important for questions of influence--including tracking the spread of cultural trends and disinformation and measuring shifts in public opinion. Much work in this space has focused on networks where nodes, edges and information are all directly observed (such as Twitter accounts with explicit friend/follower edges and retweets as instances of propagation); in this talk, I will focus on the comparatively overlooked case of information propagation in *implicit* networks--where we seek to discover single instances of a message passing from person A to person B to person C, only given a depiction of their activity in text. Literature in many ways presents an ideal domain for modeling information propagation described in text, since it depicts a largely closed universe in which characters interact and speak to each other. At the same time, it poses several wholly distinct challenges--in particular, both the length of literary texts and the subtleties involved in extracting information from fictional works pose difficulties for NLP systems optimized for other domains. In this talk, I will describe our work in measuring information propagation in these implicit networks, and detail an NLP pipeline for discovering it, focusing in detail on new datasets we have created for tagging characters and their coreference in text. This is joint work with Matt Sims, Olivia Lewke, Anya Mansoor, Sejal Popat and Sheng Shen.

Greg Durrett

UT Austin

March 1, 2021

Addressing the Paradox of Flexible but Reliable Text Generation

Text generation is a paradox. We want our generation models to imitate patterns in training data, but also have the flexibility to work in new settings and behave in new ways. We want our models to say creative things, but also be reliable and factual with respect to their inputs. How can we achieve these dual goals with a single system? Our work focuses on generation systems that are controlled and assessed in fine-grained ways: control mechanisms can help enumerate diverse inputs, which are then assessed according to our desired criteria. I will describe work in paraphrasing and summarization where intermediate syntactic control mechanisms can make our models more expressive. I will then describe how to assess these models' outputs from the standpoint of factuality and grammaticality in a fine-grained way, localizing errors to individual words and dependency arcs. By achieving diversity and then enforcing quality, we can build systems that are simultaneously flexible and reliable enough to handle a range of generation settings.

Ankur Parikh

Google

February 22, 2021

Towards High Precision Text Generation

Despite large advances in neural text generation in terms of fluency, existing generation techniques are prone to hallucination and often produce output that is unfaithful or irrelevant to the source text. In this talk, we take a multi-faceted approach to this problem from 3 aspects: data, evaluation, and modeling. From the data standpoint, we propose ToTTo, a table-to-text dataset with high-quality, annotator-revised references that we hope can serve as a benchmark for high precision text generation tasks. While the dataset is challenging, existing n-gram based evaluation metrics are often insufficient to detect hallucinations. To this end, we propose BLEURT, a fully learnt end-to-end metric based on transfer learning that can quickly adapt to measure specific evaluation criteria, and a model based on confidence decoding to mitigate hallucinations. Finally, I will discuss GEM, a living benchmark for generation that is the result of a large collaboration among many institutions, and will be an ACL 2021 workshop this year.

Barlas Oguz

Facebook AI

February 15, 2021

Dense Retrieval for Question Answering

Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this talk, we discuss recent work which shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. We also discuss extensions to the multi-hop setting, where we can outperform competing approaches with 10x less computation.
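As a rough illustration of the retrieval step described above, the sketch below ranks passages by the inner product between a question embedding and each passage embedding. The toy vectors stand in for the outputs of the two encoders, and the function names are hypothetical; this is a minimal sketch, not the system presented in the talk.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(question_vec, passage_vecs, k):
    """Return the indices of the k passages with the highest inner product."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: dot(question_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, standing in for learned encoder outputs).
passages = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.7, 0.0, 0.6]]
question = [1.0, 0.0, 0.5]

best = top_k(question, passages, 2)
```

In practice the passage embeddings would be precomputed and searched with an approximate nearest-neighbor index rather than sorted exhaustively.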

Liang Huang

Oregon State University/Baidu Research USA

February 8, 2021

Fighting COVID-19 using Parsing Algorithms and Grammar Formalisms

To defeat the current COVID-19 pandemic, a messenger RNA (mRNA) vaccine has emerged as a promising approach thanks to its rapid and scalable production and non-infectious and non-integrating properties. However, designing an mRNA sequence to achieve high stability and protein yield remains a challenging problem due to the exponentially large search space (e.g., there are 2.4 x 10^632 possible mRNA sequence candidates for the spike protein of SARS-CoV-2). We describe two on-going efforts for this problem, both using linear-time algorithms inspired by my earlier work in natural language parsing. On one hand, the Eterna OpenVaccine project from Stanford Medical School takes a crowd-sourcing approach to let game players all over the world design stable sequences. To evaluate sequence stability (in terms of free energy), they use LinearFold from my group (2019) since it’s the only linear-time RNA folding algorithm available (which makes it the only one fast enough for COVID-scale genomes). On the other hand, we take a computational approach to directly search for the optimal sequence in this exponentially large space via dynamic programming. It turns out this problem can be reduced to a classical problem in formal language theory and computational linguistics (intersection between CFG and DFA), which can be solved in O(n^3) time, just like lattice parsing for speech. In the end, we can design the optimal mRNA vaccine candidate for SARS-CoV-2 spike protein in just about 10 minutes. This talk is dedicated to the memory of my PhD advisor Aravind Joshi who taught me that linguistics and biology share the same mathematical foundations.

Lara Martin

University of Pennsylvania

January 25, 2021

Dungeons and Discourse: Using Computational Storytelling & Speech to Look at Natural Language Use

Although we are currently riding a technological wave of personal assistants, many of these agents still struggle to communicate appropriately. In particular, these systems lack coherence, the ability to adapt to novel situations, creativity, emotional understanding, and collaboration. My work focuses on creating open-world storytelling systems and developing agents that leverage speech understanding to communicate with humans more effectively. In this talk, I look at how tabletop roleplaying games such as Dungeons & Dragons can be used as motivation for how to improve conversational systems and understand how people communicate.

Rotem Dror

University of Pennsylvania

December 7, 2020

Statistical Significance Testing for Natural Language Processing

Data-driven experimental analysis has become the main evaluation tool of Natural Language Processing (NLP) algorithms. In fact, in the last decade, it has become rare to see an NLP paper, particularly one that proposes a new algorithm, that does not include extensive experimental analysis, and the number of involved tasks, datasets, domains, and languages is constantly growing. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: If we, as a community, rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we better be sure that our results are not coincidental. In this talk, I will go through the main chapters of the book in the title (https://www.morganclaypool.com/doi/abs/10.2200/S00994ED1V01Y202002HLT045) and answer the following questions: How to choose a valid statistical test for your experiments? How to perform statistical analysis when experimenting with multiple datasets? How to compare deep neural models in a statistically valid manner? And some more surprises...
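To make the question of comparing two systems concrete, here is a minimal sketch of one commonly used test, a paired approximate randomization (permutation) test over per-example scores. It is only an illustration of the general idea; whether this test is appropriate depends on the metric and data, and the function name and defaults are assumptions rather than anything taken from the book.

```python
import random


def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean difference between paired scores.

    Under the null hypothesis the two systems are exchangeable, so the sign
    of each per-example difference is flipped at random.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

A call such as `paired_permutation_test(system_a_correct, system_b_correct)` takes two lists of per-example scores on the same test set; a small p-value suggests the observed difference is unlikely to be coincidental.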

Dragomir R. Radev

Yale University

November 30, 2020

Closing the Loop in Natural Language Interfaces to Relational Databases: Parsing, Dialogue, and Generation

Natural Language is a very efficient method of communication among humans. However, when users want to talk to their computers, translating this NL to computer actions is a very challenging task. One possible way for such human-computer interaction is to translate NL sentences to database queries and then to convert the output of these queries back to NL. In order for such an approach to work, one needs to address several challenges: the lack of annotated question-query pairs, the discourse issues present in multi-turn questions, and the issues that arise in a dialogue context. In this presentation, I will talk about recent work on natural language interfaces to databases. As part of the Yale Spider project, we have developed three new datasets and launched three matching shared tasks. Spider is a collection of 10,181 manually created natural language questions on databases from 138 domains, and the 5,693 database queries that correspond to them. SParC (Semantic Parsing in Context) consists of 4,298 coherent sequences of questions and the matching queries. Finally, CoSQL consists of 3k Wizard-of-Oz (WoZ) dialogues and a total of 30k turns, and their translations to SQL. I will then introduce GraPPa, a pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We used GraPPa to obtain SOTA performance on four popular fully supervised and weakly supervised table semantic parsing benchmarks. Joint work with Tao Yu, Rui Zhang, Victoria Lin, Caiming Xiong, and many others.

Ellie Pavlick

Brown University/Google AI

November 9, 2020

You can lead a horse to water...: Representing vs. Using Features in Neural NLP

A wave of recent work has sought to understand how pretrained language models work. Such analyses have resulted in two seemingly contradictory sets of results. On one hand, work based on "probing classifiers" generally suggests that SOTA language models contain rich information about linguistic structure (e.g., parts of speech, syntax, semantic roles). On the other hand, work which measures performance on linguistic "challenge sets" shows that models consistently fail to use this information when making predictions. In this talk, I will present a series of results that attempt to bridge this gap. Our recent experiments suggest that the disconnect is not due to catastrophic forgetting nor is it (entirely) explained by insufficient training data. Rather, it is best explained in terms of how "accessible" features are to the model following pretraining, where "accessibility" can be quantified using an information-theoretic interpretation of probing classifiers.

Vivek Srikumar

University of Utah

November 2, 2020

Where Neural Networks Fail: The Case for a Little Help from Knowledge

Today's dominant paradigm for modeling complex linguistic tasks calls for training neural networks by minimizing loss on massive datasets. While the agenda is undeniably successful, we may not have the luxury of annotated data for every task or domain of interest. Reducing dependence on labeled examples may require us to rethink how we supervise models. In this talk, I will discuss some failures of today's end-to-end trained neural networks. In particular, I will focus on two phenomena---societal stereotypes implicitly present in their decisions, and their inability to perform complex reasoning---both due to the models' inability to internalize knowledge about the world. Following this, I will describe our work on using knowledge to inform neural networks without introducing additional parameters. Declarative rules stated in logic can be systematically compiled into computation graphs that augment the structure of neural models, and also into regularizers that can use labeled or unlabeled examples. I will present experiments involving text understanding and semantic role labeling, which show that such declaratively constrained neural networks can successfully internalize the information in the rules, providing an easy-to-use mechanism for supervising neural networks that does not involve data annotation.

Liang Huang

Oregon State University/Baidu Research USA

October 26, 2020

Simultaneous Translation: Breakthrough and Recent Progress

Simultaneous interpretation (i.e., translating concurrently with the source language speech) is widely used in many scenarios including multilateral organizations (UN/EU), international summits (APEC/G-20), legal proceedings, and press conferences. However, it is well known to be one of the most challenging tasks for humans due to the simultaneous perception and production in two languages. As a result, there are only a few thousand professional simultaneous interpreters world-wide, and each of them can only sustain for 15-30 minutes in each turn. On the other hand, simultaneous translation (either speech-to-text or speech-to-speech) is also notoriously difficult for machines and has remained one of the holy grails of AI. A key challenge here is the word order difference between the source and target languages. For example, if you simultaneously translate German (an SOV language) to English (an SVO language), you often have to wait for the sentence-final German verb. Therefore, most existing "real-time" translation systems resort to conventional full-sentence translation, causing an undesirable latency of at least one sentence, rendering the audience largely out of sync with the speaker. There have been efforts towards genuine simultaneous translation, but with limited success. Recently, we discovered a much simpler and surprisingly effective approach to simultaneous (speech-to-text) translation by designing a "prefix-to-prefix" framework tailored to simultaneity requirements. This is in contrast with the "sequence-to-sequence" framework which assumes the availability of the full input sentence. Our approach results in the first simultaneous translation system that achieves reasonable translation quality with controllable latency and was successfully deployed in many commercial products. Since 2019, our work has attracted renewed interest in this long-standing problem which was once thought to be out of reach.
I will also discuss our efforts towards the ultimate goal of simultaneous speech-to-speech translation, and conclude with a list of remaining challenges. (Part of this talk was given as an ACL 2019 Keynote, but this CLunch talk will cover more recent progress.)
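One simple way to picture a prefix-to-prefix schedule is a fixed-lag policy: the system first reads k source words, then alternates between writing one target word and reading one more source word. The abstract does not name a specific policy, so the wait-k style schedule below is an assumption used purely for illustration, and the function name is hypothetical.

```python
def wait_k_schedule(source_len, target_len, k):
    """Source-prefix length available when each target word is written.

    For the t-th target word (1-indexed), the decoder has read
    min(k + t - 1, source_len) source words.
    """
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


# With a 5-word source and k = 2, the first target word is written after
# only 2 source words have been read.
schedule = wait_k_schedule(source_len=5, target_len=5, k=2)
```

A larger k gives the model more context per word at the cost of higher latency, which is one way to read "controllable latency" above.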

Rui Zhang

Penn State University

October 19, 2020

Building Robust Conversational Question Answering Systems Over Databases of Tabular Data

A vast amount of information is stored in relational databases consisting of tables. These databases provide fundamental frameworks of data systems for business in various domains. In real-world applications, users would like to interact with databases for information requests just like talking to a human. However, querying databases requires proficiency in the SQL query language syntax and the knowledge of underlying table structures. Consequently, despite the enormous popularity of relational databases, the ability to retrieve information from these databases is still limited for many ordinary users. In this talk, I will describe some completed and ongoing efforts to build conversational question answering systems over databases of tabular data that are (1) robust to user queries by handling different types of user inputs, (2) conversational and interactive by conversing with users in a dialog setting with reasoning ability over multi-turn contexts of interaction history, (3) explainable and verifiable by generating natural language explanations of system predicted SQL queries and execution results for user verification and feedback, (4) transferable and adaptable by quickly adapting to different domains and scenarios of databases.

Daniel Deutsch

University of Pennsylvania

October 12, 2020

Ongoing Work on Summarization Evaluation Metrics

In this talk, I will provide an overview of two ongoing works on summarization evaluation metrics. The first work analyzes the extent to which ROUGE and BERTScore actually measure the information overlap between two summaries. I show that they largely do not and propose an alternative method of comparing summarization systems which does and is interpretable. The second work focuses on using QA to evaluate summaries. After proposing a new QA-based metric, I benchmark its performance on current datasets, identify performance bottlenecks, and estimate its upper-bound performance, concluding QA is a promising future research direction.

Mike Lewis

Facebook AI Research

October 5, 2020

Modelling Language and the World

Much recent progress in NLP has been driven by training language models on large unlabelled datasets. I will argue that language modelling requires both linguistic and world knowledge, but that these can be disentangled and modelled separately. First, I will describe kNN-LM, which shows how converting a language model into a nearest neighbor classifier can give large gains in performance, by giving the model access to facts in the training set during inference. I will then introduce MARGE, a new approach to pre-training sequence-to-sequence models with an unsupervised paraphrasing objective. This objective emphasises learning to paraphrase over memorizing facts. MARGE performs well on classification, generation and retrieval tasks in many languages, without supervision in some cases, making it arguably the most broadly applicable pre-trained model to date.
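The nearest-neighbor idea can be sketched as follows: neighbors retrieved from a datastore of (context, next token) pairs vote for the next token, weighted by distance, and the result is mixed with the base language model's distribution. The abstract does not spell out the formula, so the softmax-over-negative-distance weighting and the interpolation weight `lam` below are assumptions, and all names are hypothetical.

```python
import math


def knn_distribution(neighbors):
    """Turn retrieved (distance, next_token) pairs into a token distribution.

    Closer neighbors receive exponentially larger weight.
    """
    weights = {}
    for distance, token in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }


# Toy example: both retrieved neighbors continue with "Paris".
p_lm = {"Paris": 0.5, "London": 0.5}
p_knn = knn_distribution([(0.0, "Paris"), (0.0, "Paris")])
mixed = interpolate(p_lm, p_knn, lam=0.5)
```

The base model is left unchanged; only the datastore supplies the extra facts at inference time.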

Dan Hopkins

University of Pennsylvania

September 21, 2020

The Polarization and Nationalization of American State Party Platforms, 1918-2017

The role of U.S. state political parties has changed substantially in recent decades. One common supposition is that contemporary state parties are increasingly polarized and nationalized, meaning that the Democratic and Republican parties adopt similar positions nationwide. Yet, the relationship between these shifts, the mechanisms underpinning them, and the extent to which they have unfolded similarly across states and issue areas remain open questions. We introduce a data set of 2,041 state party platforms to measure nationalization and polarization between 1918 and 2018. Applying tools from automated and manual content analysis, we find that there is a dramatic divergence in the topics covered in Democratic and Republican platforms starting in the early 1990s, at virtually the same time as federal-level rhetorical polarization. During this same period, the differences across states in platforms decreased and social issues became more prominent, suggesting a tight connection between polarization, nationalization, and social issues such as abortion.

Mohit Iyyer

University of Massachusetts Amherst

September 14, 2020

Towards interactive story generation

Story generation is difficult to computationally formalize and evaluate, and there are many important questions to ask when tackling the problem. What should we consider as the base unit of a story (e.g., a sentence? a paragraph? a chapter?) What kind of data should we use to train these models (novels? short stories? overly simplistic mechanically-turked paragraphs?) Is any model architecture currently capable of producing long-form narratives that have some semblance of coherent discourse structure, such as plot arcs and character development? When evaluating the outputs of our models, can we do better than just asking people to rate the text based on vaguely defined properties such as "enjoyability"? In this talk, I'll discuss my lab's ongoing work on story generation by introducing a new dataset and evaluation method that we hope will spur progress in this area. I'll then describe practical challenges (slow inference, unsecure models) that we face when deploying our models in real-world author-facing settings, along with some solutions we have developed to combat these challenges.

Matt Gardner

Allen Institute for Artificial Intelligence

March 5, 2020

NLP Evaluations That We Believe In

With all of the modeling advancements in recent years, NLP benchmarks have been falling over left and right: "human performance" has been reached on SQuAD 1 and 2, GLUE and SuperGLUE, and many commonsense datasets. Yet no serious researcher actually believes that these systems understand language, or even really solve the underlying tasks behind these datasets. To get benchmarks that we actually believe in, we need to both think more deeply about the language phenomena that our benchmarks are targeting, and make our evaluation sets more rigorous. I will first present ORB, an Open Reading Benchmark that collects many reading comprehension datasets that we (and others) have recently built, targeting various aspects of what it means to read. I will then present contrast sets, a way of creating non-iid test sets that more thoroughly evaluate a model's abilities on some task, decoupling training data artifacts from test labels.

Mohammad Sadegh Rasooli

University of Pennsylvania

February 27, 2020

Cross-Lingual Transfer of Natural Language Processing Systems

Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages. In this talk, we present different methods for transfer of syntactic and semantic dependency parsers. We propose an annotation projection method that performs well in scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct model transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we present an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. We also propose a method for cross-lingual transfer of dependency parsing based on multi-task learning by leveraging supervised syntactic information in the target language of interest. Finally, we introduce our current efforts for learning cross-lingual representations using information from different modalities especially from images in the massively multilingual image dataset (MMID).

Zhiting Hu

Carnegie Mellon University

February 20, 2020

Connecting the Dots between Learning Paradigms

Continued research has created a diverse set of learning algorithms for ingesting distinct forms of experience (e.g. data, cost, knowledge constraints). However, it is often challenging for practitioners to choose or adapt solutions from such a bewildering marketplace of algorithms, as it could demand deep ML expertise and bespoke innovations. This talk will present an attempt to systematize several paradigms of algorithms for both a unifying understanding and new systematic methodologies of creating ML solutions. I will show that some of the popular algorithms in supervised learning, constraint-driven learning, reinforcement learning, etc, indeed share a common succinct formulation, showing that different forms of experience can be used for learning in the same way. The unifying representation of algorithms allows us to methodically exchange solutions between paradigms, and learn from combinations of experience jointly, for complex problems such as text and image generation.

Nitish Gupta

University of Pennsylvania

February 13, 2020

Neural Module Networks for Reasoning over Text

Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. In this talk, I will outline the challenges in learning these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. Then, I will present how we extend NMNs by (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. In conclusion, I will present methods for achieving interpretability in such compositional neural models and challenges for future research.

Noam Slonim

IBM

February 6, 2020

Project Debater – How Persuasive can a Computer be?

Project Debater is the first AI system that can meaningfully debate a human opponent. The system, an IBM Grand Challenge, is designed to build coherent, convincing speeches on its own, as well as provide rebuttals to the opponent’s main arguments. In 2019, Project Debater competed against Harish Natarajan, who holds the world record for most debate victories, in an event held in San Francisco that was broadcast live world-wide. In this talk I will tell the story of Project Debater, from conception to a climactic final event, describe its underlying technology, and discuss how it can be leveraged for advancing decision making and critical thinking.

Jay-Yoon Lee

Carnegie Mellon University

January 30, 2020

Injecting output constraints into neural NLP models in a model agnostic way

The talk discusses a particular method of injecting constraints into neural models, primarily for natural language processing (NLP) tasks. While neural models have set the new state of the art performance in many tasks from vision to NLP, they often fail to learn simple rules necessary for well-formed structures unless there is an immense amount of training data. The talk claims that not all the aspects of the model have to be learned from the data itself and injecting simple knowledge/constraints into the neural models can help low-resource tasks as well as improving state-of-the-art models. The talk focuses on the structural knowledge of the output space and injects knowledge of correct or preferred structures as an objective to the model in a model-agnostic way, i.e. without modification to the model structure. The first benefit of focusing on the knowledge of output space is that it is intuitive as we can directly enforce outputs to satisfy logical/linguistic constraints. Another advantage of structural knowledge is that it often does not require a labeled dataset. Focusing on the example of Semantic Role Labeling and its constraints related to the syntactic parse tree, the talk showcases the efficacy of the proposed inference algorithm and the proposed semi-supervised learning.

Nick Montfort

Massachusetts Institute of Technology

January 23, 2020

Lean Computer-Generated Poetry as Exploration of Language, Culture, and Computation

Computational poetics is a compelling area of NLP. Poetry has helped to constitute cultures for millennia and its composition is considered one of the most human activities. On the generation side, computational poetics involves the production of poetic language, potentially with meter, rhyme and other forms of musicality, metaphors and their cousins, narrative aspects, and intertextual references. Essentially, the main objective of computationally generated poetry is being culturally and individually resonant for at least some readers or listeners in some cultures. There are a wide variety of approaches, some of which seek to model human creativity, as in the computational creativity community. Work in the area is undertaken by academic researchers, poets and artists, and programmers seeking amusement and diversion during events such as NaNoGenMo (National Novel Generation Month), which accommodates the generation of all sorts of large-scale literature, including poetry. In my talk, I will introduce my own practice as a computational poet, which does not involve developing general models of human creativity. My practice is often considered experimental and sometimes conceptual; it is not, in any case, expressive, that is, mainly concerned with my experiences or with conveying my emotions. Rather, I consider myself a situated and embodied explorer of language, culture, and computation. My means of exploration is the development of computational poetry. My practice involves writing programs that are usually small and simple, based on specific unusual lexicons and combinatorial techniques. As part of inquiring about computation, my work connects with platform studies and deals with specifics of particular computers and programming languages. 
As I share and discuss some of my specific computational poems, I will describe how this type of NLG work touches on questions of language and thought as studied in, for instance, linguistics, cognitive science, and conventional poetics.

Adam Poliak

Johns Hopkins University

December 10, 2019

Sentence-level Semantic Inference: From Diverse Phenomena to Applications

Many NLP tasks involve understanding meaning at the sentence level. In order to analyze such models, we should decompose sentence-level semantic understanding into a diverse array of smaller, more-focused, fine-grained types of reasoning. This will help improve our understanding of the sentence-level reasoning capabilities of our NLP systems. In this talk, we will focus on Natural Language Inference (NLI), the task of determining if one sentence (hypothesis) can likely be inferred from another (context/premise). NLI has traditionally been used to evaluate how well different models understand language and the relationship between texts. We investigate whether 10 recent NLI datasets require models to reason about both texts, or if the datasets contain biases or statistical irregularities that allow a model to correctly label a context-hypothesis pair by only looking at a hypothesis. In the most popular dataset that we consider, a hypothesis-only model outperforms the majority baseline by over 2x. We will also discuss our recently released dataset, the Diverse NLI Collection (DNC), that can be used to shed light on a model’s ability to capture or understand a diverse array of semantic phenomena that are important to Natural Language Understanding. We will demonstrate how a variant of the DNC has been used to evaluate whether a Neural Machine Translation encoder captures semantic phenomena related to translation. With the remaining time, we will discuss how lessons from these studies can be applied to real-world use cases of sentence-level semantic inference. This talk is based on work that has appeared at NAACL, ACL, StarSem, and EMNLP.

Yoav Artzi

Cornell University

December 3, 2019

Robot Control and Collaboration in Situated Instruction Following

I will present two projects studying the problem of learning to follow natural language instructions. I will present new datasets, a class of interpretable models for instruction following, learning methods that combine the benefits of supervised and reinforcement learning, and new evaluation protocols. In the first part, I will discuss the task of executing natural language instructions with a robotic agent. In contrast to existing work, we do not engineer formal representations of language meaning or the robot environment. Instead, we learn to directly map raw observations and language to low-level continuous control of a quadcopter drone. In the second part, I will propose the task of learning to follow sequences of instructions in a collaborative scenario, where both the user and the system execute actions in the environment and the user controls the system using natural language. To study this problem, we build CerealBar, a multi-player 3D game where a leader instructs a follower, and both act in the environment together to accomplish complex goals. The two projects were led by Valts Blukis, Alane Suhr, and collaborators.

Hangfeng He

University of Pennsylvania

November 19, 2019

Distributed Semantic Representations from Question-Answering Signals

Human annotations, especially those from experts, are costly for many natural language processing (NLP) tasks. One emerging approach is to use natural language to annotate natural language, but it is challenging to get supervision effectively from annotations that are very different from the target task. This paper studies the case where the annotations are in the format of question answering (QA). We propose a novel approach to retrieve two types of semantic representations from QA, using which we can consistently improve on a suite of tasks. This work may have pointed out an alternative way to supervise NLP tasks.

Shuai Tang

University of California, San Diego

November 12, 2019

Revisiting post-processing for word embeddings

Word embeddings learnt from large corpora have been adopted in various applications in natural language processing and served as the general input representations to learning systems. Recently, a series of post-processing methods have been proposed to boost the performance of word embeddings on similarity comparison and analogy retrieval tasks, and some have been adapted to compose sentence representations. The general hypothesis behind these methods is that by enforcing the embedding space to be more isotropic, the similarity between words can be better expressed. We view these methods as an approach to shrink the covariance/gram matrix, which is estimated by learning word vectors, towards a scaled identity matrix. By optimising an objective in the semi-Riemannian manifold with Centralised Kernel Alignment (CKA), we are able to search for the optimal shrinkage parameter, and provide a post-processing method to smooth the spectrum of learnt word vectors which yields improved performance on downstream tasks.
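
The shrinkage view described in this abstract can be illustrated with a minimal sketch. It assumes the usual linear form, in which the covariance matrix is blended with a scaled identity matrix whose scale is the average variance; the function names and the fixed `alpha` are hypothetical, and the talk's CKA-based search for the optimal shrinkage parameter is not reproduced here.

```python
def covariance(vectors):
    """Covariance matrix (dividing by n) of a list of equal-length vectors."""
    n = len(vectors)
    d = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]


def shrink_towards_identity(cov, alpha):
    """Blend cov with (trace(cov) / d) * I.

    alpha = 0 leaves cov unchanged; alpha = 1 gives a fully isotropic matrix.
    alpha is a hand-picked illustration value, not a learned parameter.
    """
    d = len(cov)
    scale = sum(cov[i][i] for i in range(d)) / d  # average variance
    return [[(1 - alpha) * cov[i][j] + (alpha * scale if i == j else 0.0)
             for j in range(d)]
            for i in range(d)]


# Toy 2-dimensional "embeddings" (made-up numbers for illustration only).
toy_vectors = [[0.0, 0.0], [2.0, 1.0], [4.0, 2.0]]
toy_cov = covariance(toy_vectors)
toy_shrunk = shrink_towards_identity(toy_cov, alpha=0.5)
```

In this form the trace (total variance) is preserved while the off-diagonal entries are scaled by `1 - alpha`, which is one way to read "more isotropic".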

Daniel Deutsch

University of Pennsylvania

October 29, 2019

A General-Purpose Algorithm for Constrained Sequential Inference

Inference in structured prediction involves finding the best output structure for an input, subject to certain constraints. Many current approaches use sequential inference, which constructs the output in a left-to-right manner. However, there is no general framework to specify constraints in these approaches. We present a principled approach for incorporating constraints into sequential inference algorithms. Our approach expresses constraints using an automaton, which is traversed in lock-step during inference, guiding the search to valid outputs. We show that automata can express commonly used constraints and are easily incorporated into sequential inference. When it is more natural to represent constraints as a set of automata, our algorithm uses an active set method for demonstrably fast and efficient inference. We experimentally show the benefits of our algorithm on constituency parsing and semantic role labeling. For parsing, unlike unconstrained approaches, our algorithm always generates valid output, incurring only a small drop in performance. For semantic role labeling, imposing constraints using our algorithm corrects common errors, improving F1 by 1.5 points. These benefits increase in low-resource settings. Our active set method achieves a 5.2x relative speed-up over a naive approach.
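
As a rough illustration of traversing an automaton in lock-step with sequential inference, the sketch below runs greedy left-to-right decoding and only allows tokens for which the automaton has a transition from its current state. The BIO-tag automaton, the scores, and all names are hypothetical; the active set method for multiple automata and the search procedure used in the talk are not shown.

```python
def constrained_greedy_decode(step_scores, transitions, start_state, accepting):
    """Greedy decoding that follows a deterministic automaton.

    step_scores: list of dicts mapping token -> score, one dict per step.
    transitions: dict mapping (state, token) -> next state; missing keys
                 mean the token is not allowed in that state.
    """
    state = start_state
    output = []
    for scores in step_scores:
        # Keep only the tokens the automaton permits from the current state.
        allowed = {tok: s for tok, s in scores.items()
                   if (state, tok) in transitions}
        if not allowed:
            raise ValueError("no valid continuation from state %r" % (state,))
        best = max(allowed, key=allowed.get)
        output.append(best)
        state = transitions[(state, best)]
    if state not in accepting:
        raise ValueError("decoding ended in non-accepting state %r" % (state,))
    return output


# Hypothetical automaton for BIO tags: "I" may not follow "O" or start a sequence.
bio_transitions = {
    ("outside", "O"): "outside",
    ("outside", "B"): "inside",
    ("inside", "B"): "inside",
    ("inside", "I"): "inside",
    ("inside", "O"): "outside",
}

# Made-up per-step scores; unconstrained greedy decoding would pick I, I, O.
toy_scores = [
    {"O": 0.1, "B": 0.3, "I": 0.6},
    {"O": 0.2, "B": 0.1, "I": 0.7},
    {"O": 0.5, "B": 0.1, "I": 0.4},
]

decoded = constrained_greedy_decode(
    toy_scores, bio_transitions, "outside", {"outside", "inside"}
)
```

Greedy search is used here only to keep the example short; it can reach a dead end that beam search or backtracking would avoid.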

Daniel Deutsch

University of Pennsylvania

October 29, 2019

Summary Cloze: A New Task for Content Selection in Topic-Focused Summarization

A key challenge in topic-focused summarization is determining what information should be included in the summary, a problem known as content selection. In this work, we propose a new method for studying content selection in topic-focused summarization called the summary cloze task. The goal of the summary cloze task is to generate the next sentence of a summary conditioned on the beginning of the summary, a topic, and one or more reference documents. The main challenge is deciding what information in the references is relevant to the topic and partial summary and should be included in the summary. Although the cloze task does not address all aspects of the traditional summarization problem, the narrower scope of the task allows us to collect a large-scale dataset of nearly 500k summary cloze instances from Wikipedia. We report experimental results on this new dataset using various extractive models and a two-step abstractive model that first extractively selects a small number of sentences and then abstractively summarizes them. Our results show that the topic and partial summary help the models identify relevant content, but the task remains a significant challenge.

Ben Zhou

University of Pennsylvania

October 29, 2019

"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding

Understanding time is crucial for understanding events expressed in natural language. Because people rarely say the obvious, it is often necessary to have commonsense knowledge about various temporal aspects of events, such as duration, frequency, and temporal order. However, this important problem has so far received limited attention. This paper systematically studies this temporal commonsense problem. Specifically, we define five classes of temporal commonsense, and use crowdsourcing to develop a new dataset, MCTACO, that serves as a test set for this task. We find that the best current methods used on MCTACO are still far behind human performance, by about 20%, and discuss several directions for improvement. We hope that the new dataset and our study here can foster more future research on this topic.

Katharina Kann

New York University

October 22, 2019

Neural Networks for Morphological Generation in the Minimal-Resource Setting

As languages other than English are moving more and more into the focus of natural language processing, accurate handling of morphology is increasing in importance. This talk presents neural network-based approaches to morphological generation, casting the problem as a character-based sequence-to-sequence task. First, we will generally discuss how to successfully train neural sequence-to-sequence models for this. Then, since many morphologically rich languages only have limited resources, the main part of the talk will focus on how to overcome the challenges that limited amounts of annotated training data pose to neural models. The approaches covered in this talk include multi-task learning, cross-lingual transfer learning, and meta-learning.

Jithin Pradeep

The Vanguard Group

October 15, 2019

ArSI - Artificial Speech Intelligence - An end to end automatic speech recognition using Attention plus CTC

Shi Yu

The Vanguard Group

October 15, 2019

A Financial Service Chatbot based on Deep Bidirectional Transformers

Christopher Lynn

University of Pennsylvania

October 8, 2019

Human information processing in complex networks

Humans communicate using systems of interconnected stimuli or concepts -- from language and music to literature and science -- yet it remains unclear how, if at all, the structure of these networks supports the communication of information. Although information theory provides tools to quantify the information produced by a system, traditional metrics do not account for the inefficient and biased ways that humans process this information. Here we develop an analytical framework to study the information generated by a system as perceived by a human observer. We demonstrate experimentally that this perceived information depends critically on a system's network topology. Applying our framework to several real networks, we find that they communicate a large amount of information (having high entropy) and do so efficiently (maintaining low divergence from human expectations). Moreover, we show that such efficient communication arises in networks that are simultaneously heterogeneous, with high-degree hubs, and clustered, with tightly-connected modules -- the two defining features of hierarchical organization. Together, these results suggest that many real networks are constrained by the pressures of information transmission, and that these pressures select for specific structural features.

Dan Goldwasser

Purdue University

October 1, 2019

Joint Models for Social, Behavioral and Textual Information

Understanding natural language communication often requires context, such as the speakers' backgrounds and social conventions. However, when it comes to computationally modeling these interactions, we typically ignore their broader context and analyze the text in isolation. In this talk, I will review ongoing work demonstrating the importance of holistically modeling behavioral, social and textual information. I will focus on several NLP problems, including political discourse analysis on Twitter, partisan news detection and open-domain debate stance prediction, and discuss how jointly modeling text and social behavior can help reduce the supervision effort and provide a better representation for language understanding tasks.

Robert Shaffer

University of Pennsylvania

September 24, 2019

Similarity Inference for Legal Texts

Quantifying similarity between pairs of documents is a ubiquitous task. Both researchers and members of the public frequently use document-level pairwise similarity measures to describe or explore unfamiliar corpora, or to test hypotheses regarding diffusion of ideas between authors. High-level similarity measures are particularly useful when dealing with legal or political corpora, which often contain long, thematically diverse, and specialized language that is difficult for non-experts to interpret. Unfortunately, though similarity estimation is a well-studied problem in the context of short documents and document excerpts, less attention has been paid to the problem of similarity inference for long documents.

Reno Kriz

University of Pennsylvania

September 17, 2019

Comparison of Diverse Decoding Methods from Conditional Language Models

While conditional language models have greatly improved in their ability to output high-quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to, within a given-sized candidate list, cover as much of the space of high-quality outputs as possible, leading to improvements for tasks that re-rank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. We conduct an extensive survey of decoding-time strategies for generating diverse outputs from conditional language models. We also show how diversity can be improved without sacrificing quality by over-sampling additional candidates, then filtering to the desired number.
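
The over-sample-then-filter idea mentioned at the end of this abstract might look roughly like the sketch below. It assumes candidates have already been sampled and scored by some model, and it filters them with a greedy farthest-first rule over Jaccard distance on word sets; that filtering rule, the function names, and the example data are assumptions for illustration, not the method evaluated in the talk.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def filter_diverse(candidates, k):
    """Reduce an over-sampled candidate list to k mutually dissimilar outputs.

    candidates: list of (text, score) pairs, higher score = more likely.
    Starts from the highest-scoring candidate, then repeatedly adds the
    candidate whose nearest already-selected neighbour is farthest away.
    """
    if not candidates or k <= 0:
        return []
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = [remaining.pop(0)]
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c[0], s[0]) for s in selected),
        )
        remaining.remove(best)
        selected.append(best)
    return [text for text, _ in selected]


# Made-up over-sampled candidates with made-up log-probability scores.
toy_candidates = [
    ("the cat sat", -1.0),
    ("the cat sat down", -1.2),
    ("a dog ran away", -2.0),
    ("the cat sits", -1.5),
]
top_two = filter_diverse(toy_candidates, k=2)
```

This rule favours diversity only after the first pick; a practical filter would likely also weigh each candidate's score so that quality is not traded away.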

Daphne Ippolito

University of Pennsylvania

September 17, 2019

Detecting whether Text is Human- or Machine-Generated

With the advent of generative models with a billion parameters or more, it is now possible to automatically generate vast amounts of human-sounding text. But just how human-like is this machine-generated text? Intuitively, shorter amounts of machine-generated text are harder to detect, but exactly how many words can a machine generate and still fool both humans and trained discriminators? We investigate how the choices of sampling strategy and text sequence length impact discriminability from human-written text, using both automatic detection methods and human judgement.