CLunch Archive

Here are the past talks given at CLunch.

Hao Wang

Rutgers University

April 19th, 2023

Bayesian Deep Learning: From Single-Domain Reasoning to Infinite-Domain Adaptation

While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks that involve inference, reasoning, and planning require an even higher level of intelligence. The past few years have seen major advances in many perception tasks using deep learning models. In terms of higher-level inference, however, probabilistic graphical models, with their ability to expressively describe properties of variables and various probabilistic relations among variables, are still more powerful and flexible. To achieve integrated intelligence that involves both perception and inference, we have been exploring a research direction, which we call Bayesian deep learning, that tightly integrates deep learning and Bayesian models within a principled probabilistic framework. In this talk, I will present the proposed unified framework and some of our recent work on Bayesian deep learning with various applications, including recommendation, social network analysis, interpretable healthcare, domain adaptation, and representation learning.

Sunny Rai

University of Pennsylvania

April 12th, 2023

Investigating Racial Heterogeneity in Language Markers of Depression

The racial and ethnic differences in the manifestation of depression are well documented. However, the effect of these differences on computational models for mental disorders trained on online language is relatively unexplored. This work analyzes the interaction between race and linguistic features correlated with PHQ-9 score. Our experiments reveal that the pronoun "I", widely used as an indicator of depression, has a significant interaction with race, correlating with PHQ-9 scores for White but not for Black individuals. Various open-vocabulary topics correlated with PHQ-9 show opposite trends in their usage by White and Black individuals when depressed. A linear regression model trained on White individuals predicts depression in White individuals with a Pearson r of 0.39 (p < 0.05) but returns an insignificant correlation for depression scores in Black individuals, indicating its inefficacy in diagnosing depression for the Black population. Interestingly, a model trained on Black individuals predicts depression in both racial groups, albeit with different performance (r = 0.355 for Black and r = 0.338 for White). The results underscore the urgent need to validate computational mental health models on minority populations before deployment.
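The cross-group evaluation described above can be pictured with a short sketch: fit a linear regression from language features to PHQ-9 scores on one group, then report the Pearson correlation of its predictions within each group. The feature matrices and scores below are random placeholders (the study's real features were derived from online language), so only the procedure, not the numbers, is meaningful.

```python
# Schematic of the cross-group evaluation: train on one group's language
# features, then correlate predictions with PHQ-9 scores in both groups.
# All arrays are random stand-ins for the study's actual data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_white, y_white = rng.normal(size=(200, 10)), rng.normal(size=200)
X_black, y_black = rng.normal(size=(200, 10)), rng.normal(size=200)

# Train on one group (a real study would use held-out data / cross-validation).
model = LinearRegression().fit(X_white, y_white)

for name, X, y in [("White", X_white, y_white), ("Black", X_black, y_black)]:
    r, p = pearsonr(model.predict(X), y)
    print(f"{name}: r = {r:.3f} (p = {p:.3g})")
```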

Nanyun (Violet) Peng

University of California, Los Angeles

April 7th, 2023

Controllable Text Generation For Open-World Creativity

Recent advances in large auto-regressive language models have demonstrated strong results in generating natural language and have significantly improved performance for applications such as machine translation and summarization. However, when generation tasks are open-ended and the content is under-specified, or when there are format or cross-modal association constraints, existing techniques struggle to generate long-term coherent and creative content that follows format constraints. This happens because auto-regressive language models are trained only to predict the next word, and it is hard to impose structural or content controls/constraints on the model. In this talk, I will present our recent work on creative generation, including poetry and melody-to-lyrics generation, which highlights the importance of controllable text generation beyond the prevalent auto-regressive formulation. We propose a novel insertion-based generation model and a controllable decoding-time algorithm to steer models to better conform to constraints.

Roy Schwartz

Hebrew University of Jerusalem

March 29th, 2023

Spurious Correlations: Challenges, Solutions, and Opportunities

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances, culminating in a recent proposal to eliminate single-word correlations altogether. In this talk, I will show that despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signals encoding common sense and world knowledge. I will highlight several alternatives to dataset balancing, focusing on a surprising proposal: in order to mitigate biases in models, one needs to amplify them in our training sets.

Yoav Artzi

Cornell University (Cornell Tech)

March 22nd, 2023

Learning and Reasoning in Natural Language Interaction

Natural language is first and foremost an instrument of interaction, where interlocutors produce and comprehend language to relay information to accomplish their intents. This talk focuses on challenges and opportunities that arise from this interactive nature of language. The response of participants to the language they comprehend can form a strong learning signal for the party that produced the language. Did I achieve my intent? In the first part, I will show how to use this signal to learn to produce natural language instructions. I will then discuss the problem of language-conditioned reinforcement learning, where benchmark development has been hindered because computing rewards requires resolving language semantics. I will describe a new approach to address this challenge. Finally, core to linguistic interaction is the use of abstraction to communicate concepts in a generalizable way. I will describe a new resource to study this phenomenon, and show how it sheds light on the generalization abilities of language-and-vision pre-trained models.

Daniel Fried

Carnegie Mellon University (Language Technologies Institute)

March 15th, 2023

Using Language Strategically in Context

As NLP systems interact with people in a widening range of world contexts, it is increasingly important to model pragmatic aspects of language: the goals that underlie language use, and the effects that language has on people. Across a diverse range of task-oriented settings, we've found that reasoning about language as a strategic action allows NLP models to interact more successfully with human partners. First, I'll describe a procedure for pragmatically generating and interpreting instructions. We train listener and speaker models that imitate how people interpret and produce language in grounded contexts. We use these models to (1) predict how a person might interpret language from the system and (2) resolve ambiguity by reasoning about what goal might have made a person say what they did. These procedures make interaction with human partners more successful in settings including visually-grounded instruction following and interactive preference learning. I'll also give an overview of work with the FAIR Diplomacy team on CICERO, an agent that achieves human-level performance in the dialogue and strategy board game Diplomacy. CICERO integrates LLMs with a strategic planner: choosing mutually beneficial plans for itself and its partners, and generating dialogue in pursuit of these plans. When deployed in an anonymous online Diplomacy league with human partners, CICERO ranked in the top 10% of participants who played more than one game.

Niranjan Balasubramanian

Stony Brook University

March 6th, 2023

What ails multi-step reasoning and how to fix it.

Multi-step reasoning has seen much empirical progress on many datasets recently, especially in Question Answering. However, training and evaluating on typical crowdsourced datasets is problematic because of the potential for shortcut reasoning based on artifacts. What can we do about this? In this three-part talk, I will first show how we can formalize and measure disconnected reasoning, a type of bad multihop reasoning. I will then discuss how we can construct new datasets using a bottom-up construction process, which allows us to better control for desired properties in the resulting dataset. In the third part, I will briefly present how synthetically generated data can be used to teach a broad range of multihop skills in a reliable manner, and how to improve multi-step reasoning in open-domain QA settings.

Graham Neubig

Carnegie Mellon University (Language Technology Institute)

March 1st, 2023

Is My NLP Model Working? The Answer is Harder Than You Think

As natural language processing now permeates many different applications, its practical use is unquestionable. However, at the same time NLP is still imperfect, and errors cause everything from minor inconveniences to major PR disasters. Better understanding when our NLP models work and when they fail is critical to the efficient and reliable use of NLP in real-world scenarios. So how can we do so? In this talk I will discuss two issues: automatic evaluation of generated text, and automatic fine-grained analysis of NLP system results, which are some first steps towards a science of NLP model evaluation.

Alan Ritter

Georgia Tech

February 24th, 2023

Towards Cost Efficient Use of Pre-Trained Language Models

Large language models are leading to breakthroughs in a variety of applications, from information extraction systems that are accurate and robust, to human-like conversational assistants. In this talk I will analyze when the benefits of training a new model outweigh the computational costs, in the context of domain adaptation. Conventional wisdom holds that data annotation is expensive, so computational methods that leverage freely available unlabeled data can present an economical alternative when adapting to a new domain. The talk will examine this assumption in the context of pretraining-based domain adaptation, which requires significant GPU/TPU resources for each new domain. We frame domain adaptation as a consumer choice problem: given a fixed budget, what combination of annotation and pre-training leads to maximum utility? In the second part of the talk, I will discuss recent work on in-context learning for anaphora resolution. I will show that resolving anaphora in scientific protocols is a challenging task for in-context learning, then present a new method, MICE (Mixtures of In-Context Experts), and demonstrate how it can accurately resolve multiple-antecedent anaphora in paragraphs describing chemical synthesis procedures. MICE enables accurate few-shot anaphora resolution by ensembling hundreds of prompts that are created from only a handful of training examples. Finally, I will discuss applications of NLP to chemical synthesis protocols and show a demo of a system that can help chemists more efficiently find experimental details described in the literature.

Julian Michael


February 15th, 2023

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

I will present the results of the NLP Community Metasurvey, a questionnaire that we ran from May to June 2022 to elicit the opinions of NLP researchers on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies. For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, whether language models understand language, and the necessity of linguistic structure and inductive bias for solving NLP problems. In addition, the survey posed "meta-questions," asking respondents to predict the distribution of survey responses. This allows us not only to gain insight into the spectrum of beliefs held by NLP researchers, but also to uncover false sociological beliefs, where the community’s predictions don’t match reality. We find such mismatches on a wide range of issues. Among other results, the community greatly overestimates its own belief in the usefulness of benchmarks and the potential for scaling to solve real-world problems, while underestimating its own belief in the importance of linguistic structure, inductive bias, and interdisciplinary science. Our hope is that this can provide context for the NLP research community to have more informed and self-aware discussions of these complex issues. In this talk, I will walk through our results and open the floor for such a discussion.

Karl Stratos

Rutgers CS

February 1st, 2023

Retrieval-Augmented Models for Natural Language Processing

Prompting large pretrained language models has been enormously successful in solving a wide class of natural language tasks. In this approach, a task is formatted in some natural language template to "prompt" the model to generate the correct answer (e.g., "Q: Why is the sky blue? A: "). While surprisingly effective, it often generates false and unverifiable claims, limiting real-world applications. In this talk, I will advocate an alternative approach based on retrieval. Instead of naively generating answers, the model must first retrieve a piece of evidence from a knowledge base (e.g., Wikipedia). By having an explicit knowledge retrieval step, the model is forced to return factually accurate and verifiable claims. It can also use a new knowledge base at test time, and is thus capable of zero-shot learning. I will focus on the task of entity retrieval and linking. I will first present a technique based on hard negative mining to make entity retrieval more robust (NAACL 2021). I will then build on the retrieval framework to present a novel paradigm for entity linking (ICLR 2022).

Gail Weiss


January 25, 2023

Thinking Like Transformers

Transformers, the purely attention-based NN architecture, have emerged as a powerful tool in sequence processing. But how does a transformer think? When we discuss the computational power of RNNs, or consider a problem that they have solved, it is easy for us to think in terms of automata and their variants (such as counter machines and pushdown automata). But when it comes to transformers, no such intuitive model is available. In this talk I will present a programming language, RASP (Restricted Access Sequence Processing), which we hope will serve the same purpose for transformers as finite state machines do for RNNs. In particular, we will identify the base computations of a transformer and abstract them into a small number of primitives, which are composed into a small programming language. We will go through some example programs in the language, and discuss how a given RASP program relates to the transformer architecture.
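As a flavor of what RASP abstracts, here is a toy Python emulation of its two central primitives: select (build an attention pattern from a predicate over key and query positions) and aggregate (average values under that pattern). This is our illustrative sketch, not the official RASP interpreter, and it uses RASP's uniform-attention idealization.

```python
def select(keys, queries, predicate):
    """Attention-like selection matrix: entry [q][k] is True iff
    predicate(keys[k], queries[q]) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=0.0):
    """Average the selected values for each query position
    (RASP's uniform-attention idealization)."""
    out = []
    for row in selector:
        chosen = [v for v, sel in zip(values, row) if sel]
        out.append(sum(chosen) / len(chosen) if chosen else default)
    return out

# Toy RASP-style program: fraction of 'a's seen so far at each position.
tokens = list("abaab")
positions = range(len(tokens))
causal = select(positions, positions, lambda k, q: k <= q)  # attend to prefix
is_a = [1.0 if t == "a" else 0.0 for t in tokens]
print(aggregate(causal, is_a))  # [1.0, 0.5, 0.667, 0.75, 0.6] (rounded)
```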

Soroush Vosoughi

Dartmouth College

December 12, 2022

Prosocial Language Models

Large-scale language models (e.g., BERT, GPT-3) have revolutionized the field of natural language processing (NLP). Such pre-trained models show close-to-human-level performance on diverse tasks with little or no training data. The success of such models is at least partially due to their large size (most have hundreds of millions or even billions of parameters) and the large datasets used for their pre-training (typically collected from the web). However, these same attributes lead to these models reflecting the biases and antisocial attitudes found on the web. These attitudes are a significant bottleneck for using these models in real-world settings, especially for social applications. In my lab, we develop methods for post hoc (i.e., inference-time) mitigation of such antisocial attitudes. Post hoc mitigation allows us to avoid retraining the models (which is costly and often intractable) while enforcing prosocial attitudes during inference. In this talk, I will review some of our recent work on making language models less biased and more aligned with human moral values through inference-time mitigation.

Ben Van Durme

Johns Hopkins University

December 5, 2022

Embracing Uncertainty

I will discuss a series of projects on collecting labels with uncertainty. Time allowing, I will touch on model calibration and downstream tasks. Modern Artificial Intelligence rests heavily on probabilistic models for classification. This usually means categorical nominal assignment: discrete labels given to inputs at prediction time. For example, an object captured in an image either was or was not truly a "cat", or some text describes an event that we might describe as a "TRANSACTION". While concepts like cats and transactions can be real in the world, this does not mean agents can be certain of these truths. In practice, human agents (annotators) are forced to choose from a label set without reflecting uncertainty in their decisions, and modelers then force artificial agents to do the same. Blurry images and ambiguous texts lead humans to have uncertain beliefs, while modern neural frameworks trained on discrete labels make predictions with high confidence. Let's instead embrace uncertainty as part of agent (task) design.
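One concrete way to embrace that uncertainty in task design is to train against annotator label distributions rather than forced single labels. The toy numbers below are invented; they just show that the usual cross-entropy machinery accepts soft targets unchanged.

```python
import numpy as np

def cross_entropy(target, pred):
    """Cross-entropy of a predicted distribution against a target one."""
    return float(-np.sum(target * np.log(pred)))

pred = np.array([0.6, 0.3, 0.1])   # model's predicted label distribution
hard = np.array([1.0, 0.0, 0.0])   # annotator forced to pick a single label
soft = np.array([0.7, 0.2, 0.1])   # annotators' actual uncertainty

print("loss vs forced label:      ", cross_entropy(hard, pred))
print("loss vs label distribution:", cross_entropy(soft, pred))
```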

Smaranda Muresan

Columbia University

November 28, 2022

Text Generation: The Curious Case of Figurative Language and Argumentation

Large-scale language models based on transformer architectures, such as GPT-3 or BERT, have advanced the state of the art in Natural Language Understanding and Generation. However, even though these models have shown impressive performance on a variety of tasks, they often struggle to model implicit and/or non-compositional meaning, such as figurative language and argumentative text. In this talk, I will present some of our work on text generation models for figurative language and argumentation. There are two main challenges we have to address to make progress in this space: 1) the need to model common sense and/or connotative knowledge required for these tasks; and 2) the lack of large training datasets. I will discuss our proposed theoretically-grounded, knowledge-enhanced text generation models for figurative language such as metaphor and for argument reframing. If time permits, I will share our recent efforts to use a model-in-the-loop approach for building datasets for figurative language understanding, modeled as an entailment task with explanation generation.

Yulia Tsvetkov

University of Washington

November 21, 2022

Interpretation as Weak Supervision for Data-Efficient NLP

Deep learning is typically associated with an abundance of data. But there are scenarios when pre-collected data will never be enough. For example, language on social media is constantly evolving, and pretrained language models cannot adapt to rapid language change, dialects, and sociolects, no matter how large pretraining/annotated datasets are. Other examples of constantly evolving and therefore always-low-resource language domains include scientific articles, expert notes, and even news. In this talk, I will advocate for using model interpretability methods to dynamically procure data annotations in such low-resource scenarios. In the first part, I will show how instance attribution approaches to model interpretability can identify critical training examples to improve the robustness and adaptability of hate speech classifiers. In the second part, I'll show how self-explaining models can be used for entity and keyphrase extraction in scientific articles. I'll conclude with more ideas for this new paradigm, in which neural network interpretation methods serve as an intrinsic component of low-resource NLP systems, not only as a tool for presenting explanations to humans.

Rotem Dror

University of Pennsylvania

November 14, 2022

Standards for Experiment Design and Evaluation in Natural Language Processing

In this job-talk-like seminar, I will present selected works from my Ph.D. and postdoctoral research. In the first part of the talk, I will overview three papers that cover sound practices for comparing two NLP models and deciding which is better, under experimental setups prevalent in NLP such as experiments with multiple datasets and deep neural network models. In the second part of the talk, I will dive into the intriguing world of evaluating text-generation applications, where I will discuss how to determine which automatic evaluation metrics are appropriate.

Hannaneh Hajishirzi

University of Washington

November 7, 2022

Toward Robust, Multi-Task Natural Language Processing

Recent advances in deep learning algorithms and large-scale datasets are spurring progress in many Natural Language Processing (NLP) tasks, including question answering. Nevertheless, these models cannot scale up when task-annotated training data are scarce. This talk presents my lab's work toward building general-purpose models in NLP and how to systematically evaluate them. I present a new meta-dataset – called Super-NaturalInstructions – that includes a variety of NLP tasks and their descriptions to evaluate cross-task generalization. Then, I introduce a new meta-training approach that can solve more than 1,600 NLP tasks from only their descriptions and a few examples. Finally, I present a series of work on robust fine-tuning methods and how to edit models with arithmetic over task vectors.

Rui Zhang

Penn State University

October 31, 2022

Semantic Parsing in the Era of Large Language Models

Semantic parsing is the task of translating natural language sentences into meaning representations such as SQL queries and logic forms. Traditional semantic parsing research relies on careful data curation, heavy feature engineering, and task-specific model architectures. Despite their success, these approaches are typically not generalizable across different tasks and meaning representations, limiting systematic and compatible research. In this talk, I will provide a brief overview of recent progress in unified and efficient paradigms for semantic parsing with the help of large language models (i.e., UnifiedSKG). Then, I will describe our recent work on cross-lingual semantic parsing using text-to-text language models (i.e., XSemPLR) and retrieval-augmented in-context learning (i.e., XRICL), and two new datasets that challenge the reasoning abilities of large language models on tables (i.e., MultiHiertt) and first-order logic (i.e., FOLIO). I will conclude with some future directions for semantic parsing research.

Xiang (Lorraine) Li

Allen Institute for AI (AI2) & University of Pittsburgh

October 24, 2022

Probabilistic Commonsense Knowledge in Language

Commonsense knowledge is critical to achieving artificial general intelligence. This shared background knowledge is implicit in all human communication, facilitating efficient information exchange and understanding. However, commonsense research is hampered by the immense quantity of such knowledge, which defies explicit categorization. Furthermore, a plumber could repair a sink in a kitchen or a bathroom, indicating that common sense reveals a probable assumption rather than a definitive answer. To align with these properties of commonsense knowledge, we want to model and evaluate it in a human-like way, using probabilistic abstractions and principles. This talk will introduce a probabilistic model representing commonsense knowledge using a learned latent space of geometric embeddings -- probabilistic box embeddings. Box embeddings make it possible to handle commonsense queries with intersections, unions, and negations, in a way similar to Venn diagram reasoning. Meanwhile, existing evaluations do not reflect the probabilistic nature of commonsense knowledge. To fill this gap, I will discuss a method for eliciting commonsense-related question-answer distributions from human annotators, as well as a novel method of generative evaluation. We utilize these approaches in two new commonsense datasets (ProtoQA, Commonsense frame completion). The combination of modeling and evaluation methods based on probabilistic principles sheds light on how commonsense knowledge can be incorporated into artificial intelligence models in the future.
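To make the Venn-diagram analogy concrete, here is a toy version of box-embedding reasoning: each concept is an axis-aligned box, conjunction is box intersection, and conditional probabilities are ratios of volumes. The boxes below are invented for illustration, and real trained models use smoothed min/max operations so that volumes stay differentiable.

```python
import numpy as np

def volume(lo, hi):
    """Volume of an axis-aligned box; empty boxes get volume 0."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def intersect(lo1, hi1, lo2, hi2):
    """The intersection of two boxes is again a box."""
    return np.maximum(lo1, lo2), np.minimum(hi1, hi2)

# Hypothetical 2-d boxes for "plumber repairs a sink" and "in a kitchen".
plumber = (np.array([0.1, 0.1]), np.array([0.8, 0.9]))
kitchen = (np.array([0.0, 0.3]), np.array([0.6, 1.0]))

inter = intersect(*plumber, *kitchen)
print("P(kitchen | plumber) ~", volume(*inter) / volume(*plumber))
```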

Jacob Andreas


October 17, 2022

Toward Natural Language Supervision

In the age of deep networks, "learning" almost invariably means "learning from examples". Image classifiers are trained with large datasets of images, machine translation systems with corpora of translated sentences, and robot policies with rollouts or demonstrations. When human learners acquire new concepts and skills, we often do so with richer supervision, especially in the form of language: we learn new concepts from exemplars accompanied by descriptions or definitions, and new skills from demonstrations accompanied by instructions. In natural language processing, recent years have seen a number of successful approaches to learning from task definitions and other forms of auxiliary language-based supervision. But these successes have been largely confined to tasks that also involve language as an input and an output. What will it take to make language-based training useful for the rest of the machine learning ecosystem? In this talk, I'll present two recent applications of natural language supervision to tasks outside the traditional domain of NLP: using language to guide visuomotor policy learning and inductive program synthesis. In these applications, natural language annotations reveal latent compositional structure in the space of programs and plans, helping models discover reusable abstractions for perception and interaction. This kind of compositional structure is present in many tasks beyond policy learning and program synthesis, and I'll conclude with a brief discussion of how these techniques might be more generally applied.

Chenhao Tan

University of Chicago

October 3, 2022

Towards Human-Centered Explanations of AI Predictions

Explanations of AI predictions are considered crucial for human-AI interactions such as model debugging and model-assisted decision making, but it remains an open question what makes effective AI explanations. In this talk, I will highlight the distinction between emulation and discovery tasks, which shapes the answers to this question. In emulation tasks, humans provide groundtruth labels and the goal of AI is to emulate human intelligence. Although it is intuitive to think that humans can provide valid explanations in this case, I argue that humans may not be able to provide "good" explanations. Despite the growing efforts in building datasets of human explanations, caution is required to use such human explanations for evaluation or as supervision signals. In contrast, in discovery tasks, humans may not necessarily know the groundtruth label. While human-subject experiments are increasingly used to evaluate whether explanations improve human decisions, human+AI rarely outperforms AI alone. I will discuss the importance of identifying human strengths and AI strengths, and present our initial efforts in decision-focused summarization. I will conclude with future directions for developing effective human-centered explanations.

William Wang

UC Santa Barbara

September 26, 2022

Self-Supervised Language-and-Vision Reasoning

A key challenge for Artificial Intelligence research is to go beyond static observational data and consider more challenging settings that involve dynamic actions and incremental decision-making. In this talk, I will introduce our work on visually-grounded language reasoning via studies of vision-and-language navigation. In particular, I will emphasize three benefits of self-supervised learning: (1) it improves generalization in unseen environments; (2) it creates adversarial counterfactuals to augment observational data; and (3) it enables transfer learning for challenging settings. I will briefly introduce other reasoning problems my group has been working on recently.

Michael Strube

HITS & Heidelberg University

September 12, 2022

Generalizability and Robustness in Coreference Resolution

In the last ten years we have seen considerable improvements in the performance of coreference resolvers, from about 60 points F1 to more than 80 since the CoNLL shared tasks in 2011 and 2012. These improvements are mostly due to new machine learning techniques, in particular neural coreference resolvers. However, while these improvements have been reported on the CoNLL data, it is not clear whether they hold on datasets in other genres, domains, and languages. In this talk I report on a series of experiments -- done by PhD students in my research group -- testing the generalizability and robustness of coreference resolvers. Our experiments indicate that the results reported by modern machine-learning-based systems are not stable across genres and domains. However, the rule-based system by Lee et al. (2013), which won the CoNLL 2011 shared task, is still competitive in these setups. A possible conclusion is that neural coreference resolvers should be equipped with more linguistic knowledge to make them more robust. To test generalizability, the field should evaluate not only on the CoNLL/OntoNotes data but on different domains, genres, and languages, and in downstream tasks.

Su Lin Blodgett

Microsoft Research Montréal

April 25, 2022

Towards Equitable Language Technologies

Language technologies are now ubiquitous. Yet the benefits of these technologies do not accrue evenly to all people, and they can be harmful; they can reproduce stereotypes, prevent speakers of “non-standard” language varieties from participating fully in public discourse, and reinscribe historical patterns of linguistic discrimination. In this talk, I will take a tour through the rapidly emerging body of research examining bias and harm in language technologies and offer some perspective on the many challenges of this work. I will discuss some recent efforts to understand language-related harms in their sociohistorical contexts, and to investigate NLP resources developed for one such harm—stereotyping—touching on the complexities of deciding what these resources ought to measure, and how they ought to measure it.

Esin Durmus

Stanford University

April 18, 2022

On the Evaluation and Mitigation of Faithfulness Errors in Abstractive Summarization

Despite recent progress in abstractive summarization, systems still generate unfaithful summaries, i.e., summaries that contain information not supported by the input. There has been a lot of effort to develop methods to measure and reduce faithfulness errors. In this talk, I will first introduce some of the proposed methods to measure the faithfulness of summarization systems. Then, I will present a spurious correlate, the extractiveness of the summary, that potentially influences how we should evaluate the faithfulness of these systems. In particular, I will describe our work that proposes a method to measure and improve faithfulness by accounting for the extractiveness of summarization systems. Furthermore, I will discuss the importance of accounting for spurious correlates (such as extractiveness, perplexity, and length) in designing effective evaluation frameworks for text generation.

Maarten Sap

Allen Institute for AI (AI2)

April 11, 2022

Detecting and Rewriting Socially Biased Language

Language has the power to reinforce stereotypes and project social biases onto others, either through overt hate or subtle biases. Accounting for this toxicity and social bias in language is crucial for natural language processing (NLP) systems to be safely and ethically deployed in the world. In this talk, I will first discuss subjectivity challenges in binary hate speech detection, by examining perceptions of the offensiveness of text depending on reader attitudes and identities. Through an online study, we find that over- or under-detection of text as toxic correlates with political leaning and with attitudes about racism and free speech. Then, as an alternative to binary hate speech detection, I will present Social Bias Frames, a new structured formalism for distilling biased implications of language. Using a new corpus of 150k structured annotations, we show that models can learn to reason about the high-level offensiveness of statements, but struggle to explain why a statement might be harmful. Finally, I will introduce PowerTransformer, an unsupervised model for controllable debiasing of text through the lens of connotation frames of power and agency. With this model, we show that subtle gender biases in how characters are portrayed in stories and movies can be mitigated through automatic rewriting. I will conclude with future directions for better reasoning about toxicity and social biases in language.

Allyson Ettinger

University of Chicago

April 4, 2022

"Understanding" and prediction: Controlled examinations of meaning sensitivity in pre-trained models

In recent years, NLP has made what appears to be incredible progress, with performance even surpassing human performance on some benchmarks. How should we interpret these advances? Have these models achieved language "understanding"? Operating on the premise that "understanding" will necessarily involve the capacity to extract and deploy meaning information, in this talk I will discuss a series of projects leveraging targeted tests to examine NLP models' ability to capture meaning in a systematic fashion. I will first discuss work probing model representations for compositional meaning, with a particular focus on disentangling compositional information from encoding of lexical properties. I'll then explore models' ability to extract and deploy meaning information during word prediction, applying tests inspired by psycholinguistics to examine what types of information models encode and access for anticipating words in context. In all cases, these investigations apply tests that prioritize control of unwanted cues, so as to target the desired meaning capabilities with greater precision. The results of these studies suggest that although models show a good deal of sensitivity to word-level information, and to a number of semantic and syntactic distinctions, they show little sign of capturing higher-level compositional meaning, of capturing logical impacts of meaning components like negation, or of retaining access to robust representations of meaning information conveyed in prior context. I will discuss potential implications of these findings with respect to the goals of achieving "understanding" with currently dominant pre-training paradigms.

Heng Ji

University of Illinois at Urbana-Champaign

March 28, 2022

Information Surgery: Faking Multimedia Fake News for Real Fake News Detection

We are living in an era of information pollution. The dissemination of falsified information can cause chaos, hatred, and trust issues among humans, and can eventually hinder the development of society. In particular, human-written disinformation, which is often used to manipulate certain populations, has had a catastrophic impact on multiple events, such as the 2016 US Presidential Election, Brexit, the COVID-19 pandemic, and Russia’s recent assault on Ukraine. Hence, we are in urgent need of a defense mechanism against human-written disinformation. While there has been a lot of research and many recent advances in neural fake news detection, many challenges remain. In particular, the accuracy of existing techniques at detecting human-written fake news is barely above random. In this talk I will present our recent attempts at tackling four unique challenges on the front line of combating fake news written by both machines and humans: (1) defining a new task of knowledge-element-level misinformation detection, based on cross-media knowledge extraction and reasoning, to make the detector more accurate and explainable; (2) generating training data for the detector based on knowledge graph manipulation and knowledge-graph-guided natural language generation; (3) using Natural Language Inference to ensure the fake information cannot be inferred from the rest of the real document; and (4) proposing the first work to generate propaganda for more robust detection of human-written fake news.

Omer Levy

Tel Aviv University

March 21, 2022

SCROLLS: Standard CompaRison Over Long Language Sequences

NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.

Jordan Boyd-Graber

University of Maryland

March 14, 2022

Manchester vs. Cranfield: Why do we have computers answering questions from web search data and how can we do it better?

In this talk, I'll argue that the intellectual nexus of computers searching the web to answer questions comes from research undertaken in two mid-century English university towns: Manchester and Cranfield. After reviewing the seminal work of Cyril Cleverdon and Alan Turing and explaining how that work shaped today's information and AI age, I'll argue that these represent two competing visions for how computers should answer questions: either exploration of intelligence (Manchester) or serving the user (Cranfield). However, regardless of which paradigm you adhere to, I argue that the ideals of those visions are not fulfilled in modern question answering implementations: the human (Ken Jennings) vs. computer (Watson) competition on Jeopardy! was rigged, other evaluations don't show which system knows more about a topic, the training and evaluation data don't reflect the background of users, and the annotation scheme for training data is incomplete. After outlining our short-term solutions to these issues, I'll then discuss a longer-term plan to achieve the goals of both the Manchester and Cranfield paradigms.

Tal Linzen

New York University

February 28, 2022

Causal analysis of the syntactic representations used by Transformers

The success of artificial neural networks in language processing tasks has underscored the need to understand how they accomplish their behavior, and, in particular, how their internal vector representations support that behavior. The probing paradigm, which has often been invoked to address this question, relies on the (typically implicit) assumption that if a classifier can decode a particular piece of information from the model's intermediate representation, then that information plays a role in shaping the model's behavior. This assumption is not necessarily justified. Using the test case of everyone's favorite syntactic phenomenon, English subject-verb number agreement, I will present an approach that provides much stronger evidence for the *causal* role of the encoding of a particular linguistic feature in the model's behavior. This approach, which we refer to as AlterRep, modifies the internal representation in question such that it encodes the opposite value of that feature; e.g., if BERT originally encoded a particular word as occurring inside a relative clause, we modify the representation to encode that it is not inside the relative clause. I will show that the conclusions of this method diverge from those of the probing method. Finally, if time permits, I will present a method based on causal mediation analysis that makes it possible to draw causal conclusions by applying counterfactual interventions to the *inputs*, contrasting with AlterRep, which intervenes on the model's internal representations.
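Schematically, an AlterRep-style intervention takes a hidden vector and a trained linear probe for the feature of interest, and pushes the vector to the opposite side of the probe's decision boundary before the model continues its computation. The numpy sketch below is our simplified rendering (a single vector and a fixed margin), not the paper's exact recipe.

```python
import numpy as np

def alter_rep(h, w, b, margin=1.0):
    """Move h across the probe hyperplane w.h + b = 0 so that it sits
    `margin` units on the opposite side, changing only the component
    of h along the probe direction."""
    w_unit = w / np.linalg.norm(w)
    signed_dist = h @ w_unit + b / np.linalg.norm(w)
    target = -np.sign(signed_dist) * margin   # flip the encoded feature
    return h + (target - signed_dist) * w_unit

rng = np.random.default_rng(1)
h = rng.normal(size=16)           # a hidden state from some layer
w, b = rng.normal(size=16), 0.1   # parameters of a trained linear probe

h_flipped = alter_rep(h, w, b)
print("probe score before:", h @ w + b)          # e.g. "inside relative clause"
print("probe score after: ", h_flipped @ w + b)  # now the opposite sign
```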

Maria Ryskina

Carnegie Mellon University

February 21, 2022

Learning Computational Models of Non-Standard Language

Non-standard linguistic items, such as novel words or creative spellings, are common in domains like social media and pose challenges for automatically processing text from these domains. To build models capable of processing such innovative items, we need to not only understand how humans reason about non-standard language, but also be able to operationalize this knowledge to create useful inductive biases. In this talk, I will present empirical studies of several phenomena under the umbrella of non-standard language, modeled at the levels of granularity ranging from individual users to entire dialects. First, I will show how idiosyncratic spelling preferences reveal information about the user, with an application to the bibliographic task of identifying typesetters of historical printed documents. Second, I will discuss the common patterns in user-specific orthographies and demonstrate that incorporating these patterns helps with unsupervised conversion of idiosyncratically romanized text into the native orthography of the language. In the final part of the talk, I will focus on word emergence in a dialect as a whole and present a diachronic corpus study modeling the language-internal and language-external factors that drive neology.

Spencer Caplan

Swarthmore College

February 14, 2022

On the importance of baselines: Communicative efficiency and the statistics of words in natural language

Is language designed for communicative and functional efficiency? G. K. Zipf (1949) famously argued that shorter words are more frequent because they are easier to use, thereby resulting in the statistical law that bears his name. Yet, G. A. Miller (1957) showed that even a monkey randomly typing at a keyboard, and intermittently striking the space bar, would generate “words” with similar statistical properties. Recent quantitative analyses of human language lexicons (Piantadosi et al., 2012) have revived Zipf's functionalist hypothesis. Ambiguous words tend to be short, frequent, and easy to articulate in language production. Such statistical findings are commonly interpreted as evidence for pressure for efficiency, as the context of language use often provides cues to overcome lexical ambiguity. In this talk, I update Miller's monkey thought experiment to incorporate empirically motivated phonological and semantic constraints on the creation of words. I claim that the appearance of communicative efficiency is a spandrel (in the sense of Gould & Lewontin, 1979), as lexicons formed without the context of language use or reference to communication or efficiency exhibit comparable statistical properties. Furthermore, the updated monkey model provides a good fit for the growth trajectory of English as recorded in the Oxford English Dictionary. Focusing on the history of English words since 1900, I show that lexicons resulting from the monkey model provide a better embodiment of communicative efficiency than the actual lexicon of English. I conclude by arguing that the kind of faulty logic underlying the study of communicative efficiency crops up quite commonly within NLP -- evaluation metrics, and appropriate baselines, need to be carefully considered before any claims (cognitive or otherwise) can safely be made on their basis.
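Miller's thought experiment is easy to reproduce: the simulation below types characters uniformly at random, hitting the space bar with a fixed probability, and the resulting "words" already show the Zipf-like pattern of short, frequent types. The alphabet size and space probability are illustrative choices, not values from the talk.

```python
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghij"  # 10 letters, purely illustrative
P_SPACE = 0.2            # chance of striking the space bar

def monkey_text(n_chars):
    """Type n_chars keys uniformly at random, space bar included."""
    return "".join(
        " " if random.random() < P_SPACE else random.choice(ALPHABET)
        for _ in range(n_chars)
    )

counts = Counter(monkey_text(2_000_000).split())

# Short "words" are exponentially more likely, so frequency falls off
# roughly as a power law in rank -- with no communicative pressure at all.
for rank, (w, c) in enumerate(counts.most_common(8), start=1):
    print(f"rank {rank}: {w!r} (length {len(w)}) count {c}")
```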

Peter Clark

Allen Institute for AI (AI2)

February 7, 2022

Systematic Reasoning and Explanation over Natural Language

Recent work has shown that transformers can be trained to reason *systematically* with natural language (NL) statements, answering questions with answers implied by a set of provided facts and rules, and even generating proofs for those conclusions. However, these systems required all the knowledge to be provided explicitly as input. In this talk, I will describe our current work on generalizing this to real NL problems, where the system produces faithful, entailment-based proofs for its answers, including materializing its own latent knowledge as needed for those proofs. The resulting reasoning-supported answers can then be inspected, debugged, and corrected by the user, offering new opportunities for interactive problem-solving dialogs, and taking a step towards "teachable systems" that can learn from such dialogs over time.

Sihao Chen, Liam Dugan, Xingyu Fu

University of Pennsylvania

January 31, 2022

Mini Talks

The three talks this week include "Characterizing Media Presentation Biases and Polarization with Unsupervised Open Entity Relation Learning" (Sihao Chen), "Are humans able to detect boundaries between human-written and machine-generated text?" (Liam Dugan) and "There’s a Time and Place for Reasoning Beyond the Image" (Xingyu Fu).

Jonathan Berant

Tel Aviv University

December 6, 2021

Zero-shot learning and out-of-distribution generalization: two sides of the same coin

Recent advances in large pre-trained language models have shifted the NLP community’s attention to new challenges: (a) training models with zero, or very few, examples, and (b) generalizing to out-of-distribution examples. In this talk, I will argue that the two are intimately related, and describe ongoing (read, new!) work in those directions. First, I will describe a new pre-training scheme for open-domain question answering that is based on the notion of “recurring spans” across different paragraphs. We show this training scheme leads to a zero-shot retriever that is competitive with DPR (which trains on thousands of examples), and is more robust w.r.t. the test distribution. Second, I will focus on compositional generalization, a particular type of out-of-distribution generalization setup where models need to generalize to structures that are unobserved at training time. I will show that the view that seq2seq models categorically do not generalize to new compositions is false, and present a more nuanced analysis, which elucidates the conditions under which models struggle to compositionally generalize.

He He

New York University

November 30, 2021

Out-of-distribution generalization in NLP

Real-world NLP models must work well when the test distribution differs from the training distribution. While we have made great progress in natural language understanding thanks to large-scale pre-training, current models still take shortcuts and rely on spurious correlations in specific datasets. In this talk, I will discuss the role of pre-training and data in model robustness to distribution shifts. In particular, I will describe how pre-trained models avoid learning spurious correlations, when data augmentation helps and hurts, and how large language models can be leveraged to improve few-shot learning.

Yue Yang

University of Pennsylvania

November 22, 2021

Investigate Procedural Events in a Multimodal Fashion

Recently, there has been growing attention to studying procedural events, but most work focuses on text alone. We use multimodal signals as a tool to probe procedural knowledge. This talk will introduce two projects: 1) Visual Goal-Step Inference using wikiHow -- understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. 2) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval -- schemas are structured representations of complex tasks that can aid artificial intelligence by allowing models to break down complex tasks into intermediate steps. We propose a novel system that induces schemas from web videos and generalizes them to unseen tasks to improve video retrieval performance.

Marjorie McShane

Rensselaer Polytechnic Institute

November 15, 2021

Toward Broad and Deep Language Understanding for Intelligent Systems

The early vision of AI included the goal of endowing intelligent systems with human-like language processing capabilities. This proved harder than expected, leading the vast majority of natural language processing practitioners to pursue less ambitious, shorter-term goals. Whereas the utility of human-like language processing is unquestionable, its feasibility is quite justifiably questioned. In this talk, I will not only argue that some approximation of human-like language processing is possible, I will present a program of R&D that is working on making it a reality. This vision, as well as progress to date, is described in the book Linguistics for the Age of AI (MIT Press, 2021).

Daphne Ippolito

University of Pennsylvania

November 1, 2021

Language Models Memorize their Training Data; Dataset Deduplication Helps

Large neural language models are capable of memorizing their training data. First, I will discuss why this memorization is bad and the subtleties involved in studying harmful memorization tendencies. Then, I will go over some early results on the circumstances under which GPT-Neo, a popular public language model, exhibits memorization. Finally, I will describe our recent paper on deduplicating training data and discuss how models trained on deduplicated data memorize less, are more efficient to train, and possibly generalize better. I will also examine the problem of train-test leakage in existing popular datasets.
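As a minimal illustration of the deduplication step, the sketch below drops exact duplicates of normalized documents by hashing; the actual deduplication work additionally removes long repeated substrings across documents using suffix arrays, which this toy version does not attempt.

```python
import hashlib

def dedup(docs):
    """Keep the first occurrence of each (case/whitespace-normalized) doc."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the cat  sat.", "A dog ran."]
print(dedup(docs))  # whitespace/case variants collapse to one copy
```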

Samuel Bowman


October 25, 2021

Overclaiming in NLP Is a Serious Problem. Underclaiming May Be Worse.

In an effort to avoid reinforcing widespread hype about the capabilities of state-of-the-art language technology systems, researchers have developed practices in framing and citation that serve to deemphasize the field's successes, even at the cost of making misleadingly strong claims about the limits of our best systems. This is a problem, though, and it may be more serious than it looks: it limits our ability to mitigate short-term harms from NLP deployments, and it limits our ability to prepare for the potentially enormous impacts of more distant future systems. This talk urges researchers to be careful about these claims and suggests some research directions that will make it easier to avoid or rebut them.

Diyi Yang

Georgia Tech

October 18, 2021

Socially Aware Language Technologies: Theory, Method, and Practice

Natural language processing (NLP) has had increasing success and produced extensive industrial applications. But despite being sufficient to enable these applications, current NLP systems often ignore the social part of language, e.g., who says it, in what context, and for what goals. In this talk, we take a closer look at social factors in language via a new theory taxonomy and their interplay with computational methods via two lines of work. The first studies hate speech and racial bias, introducing a benchmark corpus on implicit hate speech and computational models for detecting and explaining latent hatred in language. The second demonstrates how the structure of conversations can be utilized to generate better summaries of everyday interaction. We conclude by discussing several open-ended questions about how to build socially aware language technologies.

Hangfeng He

University of Pennsylvania

October 4, 2021

Incidental Supervision for Natural Language Understanding

It is labor-intensive to acquire human annotations for natural language understanding (NLU) tasks because annotation can be complex and often requires significant linguistic expertise. Therefore, it is important to investigate how to get supervision from indirect signals to improve a target task. In this line of work, we focus on improving NLU by exploiting incidental supervision signals. Specifically, our goal is to first provide a better understanding of incidental signals, and then design more efficient algorithms to collect, select, and use incidental signals for NLU tasks. This problem is challenging because of the intrinsic differences between incidental supervision signals and target tasks. In addition, complicated properties of natural language, such as variability and ambiguity, make the problem more challenging. Our contribution to this line of work so far is in three directions. First, we show how to exploit information from cheap signals to help other tasks. Specifically, we retrieve distributed representations from question-answering (QA) pairs to help various downstream tasks. Second, in order to facilitate selecting appropriate incidental signals for a given target task, we propose a unified informativeness measure to quantify the benefits of various incidental signals. Finally, we design efficient algorithms to exploit specific types of incidental signals, including a new weighted training algorithm that improves the sample efficiency of learning from cross-task signals. In the future, we plan to further investigate the usage of incidental signals for NLU tasks by better understanding the properties of natural language. Specifically, we propose to work on reasoning in natural language and to study the benefit of structure in NLU tasks.

Tom Hope

Allen Institute for AI

September 27, 2021

Harnessing Scientific Literature for Boosting Discovery and Innovation

In the year 1665, the first academic journal was published. Fast forward to today, and there are millions of scientific papers coming out every year. This explosion of knowledge represents an opportunity to accelerate innovation with automated systems that scour the literature for solutions and inspirations. However, it also creates information overload and isolated “research bubbles” that limit discovery and sharing, slowing down scientific progress and cross-fertilization. In this talk, I will present our work toward addressing these large-scale challenges for the future of science. In the first part of the talk, I will overview our core approach, which consists of identifying key “building blocks” of scientific thought, then formalizing and structuring them into computational representations that power the creative innovation systems we construct. These include systems that surface inspirations, recommend novel authors, enable search for challenges, hypotheses, and causal relations, and provide tools for exploration and visualization of collaboration networks. The second part of the talk will consist of a dive into our new work -- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts (AKBC 2021) -- motivated by some of the applications above. We present a new task of cross-document coreference with a referential hierarchy over mention clusters, including a new challenging dataset and models. Finally, if time permits, I will discuss our recent paper, Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (AKBC 2021), where we integrate language models and graph embeddings to boost biomedical link prediction, with applications in drug discovery.

Bryan Li, Weiqiu You, Qing Lyu (Veronica)

University of Pennsylvania

September 20, 2021

Mini Talks

Our mini talks include "Careful with Context: A Critique of Methods for Commonsense Inference" presented by Bryan Li, "Zero-shot Image Classification with Text using Pretrained Embedding" presented by Weiqiu You, and "Is 'my favorite new movie' 'my favorite movie'? Probing the Understanding of Recursive Noun Phrases" presented by Qing Lyu (Veronica).

Danqi Chen

Princeton University

May 3, 2021

Learning Representations for Dense Retrieval

Dense retrieval has become a new paradigm for retrieving relevant text in open-domain question answering and other knowledge-intensive NLP tasks. Compared to sparse, non-trainable vector space models, dense retrieval holds great promise for better capturing semantic relationships (e.g., synonyms and paraphrases) between the query and retrieved text units. However, training dense vector models from limited labeled data and scaling them to a large text corpus remains challenging. In this talk, I will discuss two recent studies: (1) Dense Passage Retriever (DPR), a simple and effective method that allows learning a dense retriever from a small number of question-answer pairs. It greatly outperforms BM25 and can be used with an extractive or generative reader model for QA and other tasks. (2) DensePhrases, which builds an index of dense representations of all the phrases at the Wikipedia scale. We can directly run retrieval at the phrase level and obtain extreme runtime efficiency with competitive performance. DensePhrases can also be used as a dense knowledge base.
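The core of the DPR recipe can be sketched in a few lines: questions and passages are embedded by separate encoders, similarity is an inner product, and training uses in-batch negatives, i.e., every other gold passage in the batch serves as a negative. The embeddings below are random stand-ins for encoder outputs, so only the loss and retrieval mechanics are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8                   # batch size and embedding dim (illustrative)
q = rng.normal(size=(B, d))   # question-encoder outputs
p = rng.normal(size=(B, d))   # gold-passage-encoder outputs

# In-batch negatives: score every question against every passage, then
# apply cross-entropy with the gold passage (the diagonal) as the target.
scores = q @ p.T
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(f"in-batch negative loss: {loss:.3f}")

# At inference, the same inner product ranks a precomputed passage index.
index = rng.normal(size=(1000, d))
top5 = np.argsort(index @ q[0])[::-1][:5]
print("top-5 passage ids for question 0:", top5)
```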

Veronica Perez-Rosas

University of Michigan

April 26, 2021

Natural Language Processing for Enhanced Mental Healthcare

In recent years, there has been an increasing need for psychotherapy to address a wide variety of behavioral and mental health issues. This need has become even more prominent during the ongoing pandemic as COVID-19 related concerns have increased mental distress. Developing computational methods that gain a better understanding of mental health conversations can help practitioners to improve the quality of care. In this talk, I will first describe work on identifying conversational behaviors that lead to successful counseling interactions. Next, I will present ongoing work on developing a counseling dialog generation system that can assist counselors while acquiring and improving counseling skills. In particular, I will describe a counseling dialog system that provides language feedback to counseling trainees using the pretrained transformer architecture and context augmentation techniques inspired by traditional strategies used during counseling training.

Nate Chambers

United States Naval Academy

April 19, 2021

Extracting from Adversarial Text with a Visual Character-Based Model: extracting phone numbers from human trafficking ads

Adversarial text is written with obfuscated words and characters for the purpose of fooling machine-learned extractors. Illicit domains like human trafficking often employ such techniques. This talk will address the challenge of extracting phone numbers from this noisy text, such as "3w?n7_callme28tree(?nE)_573", but more broadly the talk will discuss the NLP challenge of dealing with Unicode characters in any domain. With very little available training data for human trafficking, how can today's neural models learn to generalize to the diversity of noise available to an adversarial writer? This talk will present a couple of solutions to this challenge, focusing on character-based neural models that use NLP architectures like LSTMs and CRFs, but that also draw inspiration from the vision community to perform image recognition of the characters with CNNs. I'll first present results from our Best Paper Award at the Workshop on Noisy User-generated Text (W-NUT), exploring extraction from short text snippets, and then show simple steps to expand it to full-document extraction.

Ivan Vulic

Cambridge University

April 12, 2021

Cross-Lingual Transfer in Low-Data Regimes: On Some Achievements, Trends, and Challenges

A key challenge in cross-lingual NLP is developing general language-independent architectures that will be equally applicable to any language. However, this ambition is hindered by the large variation in 1) structural and semantic properties of the world’s languages, as well as 2) raw and task data scarcity for many different languages, tasks, and domains. As a consequence, existing language technology is still largely limited to a handful of resource-rich languages. In this talk, we introduce and discuss a range of recent techniques and breakthroughs that aim to deal with such large cross-language variations and low-data regimes efficiently. We cover a range of cutting-edge approaches including adapter-based models for cross-lingual transfer, contextual parameter generation and hypernetworks, learning in few-shot and zero-shot scenarios, and typologically driven learning and source selection. Finally, this talk demonstrates that low-resource languages, despite very positive research trends and results achieved in recent years, still lag behind major languages, and outlines several key challenges for future research in this area.

Lillian Lee

Cornell University

April 5, 2021

Discussion Dynamics: Early prediction of controversy; content removal as a moderation strategy

Elizabeth Clark

University of Washington

March 29, 2021

Where NLG Meets People: Text Generation Models and Evaluation for Human-Machine Collaboration

Natural language generation (NLG) models' ability to generate long, fluent texts has enabled progress and new applications across many NLG subfields, but it also poses challenges for model evaluation. In this talk, I will discuss how we can use NLG models in a collaborative setting to offer suggestions to people as they perform a creative writing task. I will present a "machine-in-the-loop" framework for machine-writer collaboration and show how it can be used to improve NLG models. I will also discuss the challenge of evaluating long, fluent passages of generated text and introduce Sentence Mover's Similarity, a metric for automatically evaluating multi-sentence text. Finally, I will discuss the role of human evaluations in NLG and propose directions for collecting better human evaluations for current NLG models.

Abigail See

Stanford University

March 22, 2021

Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations

In this talk I will present Chirpy Cardinal, an open-domain dialogue agent built by the Stanford NLP team in the 2019-2020 Alexa Prize competition. Building an open-domain socialbot that talks to real people is challenging – such a system must meet multiple user expectations such as broad world knowledge, conversational style, and emotional connection. Our socialbot engages users on their terms – prioritizing their interests, feelings and autonomy. As a result, our socialbot provides a responsive, personalized user experience, capable of talking knowledgeably about a wide variety of topics, as well as chatting empathetically about ordinary life. Neural generation plays a key role in achieving these goals, providing the backbone for our conversational and emotional tone. Chirpy Cardinal ultimately won 2nd place in the competition, with a 3.6/5.0 average customer rating. In this talk I will cover the technical details of the bot, analysis of its strengths and weaknesses, unexpected findings during the competition, and future work.

Kellie Webster


March 15, 2021

Best Practices for using Natural Language Models: A Case Study from Gendered Correlations

Natural language processing has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, using masked language modeling. The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream. I will present two works in this direction, “Measuring and Reducing Gendered Correlations in Pre-trained Models” and "Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation". In the first, we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models: (i) It is important to measure for unintended correlations; (ii) Be careful even when making seemingly innocuous configuration changes; and (iii) There are opportunities for general mitigations. In the second, we explore how to leverage the rich representations in BERT to improve gendered pronoun accuracy in machine translation.

David Bamman

University of California, Berkeley

March 8, 2021

Modeling the Spread of Information within Novels

Understanding the ways in which information flows through social networks is important for questions of influence--including tracking the spread of cultural trends and disinformation and measuring shifts in public opinion. Much work in this space has focused on networks where nodes, edges and information are all directly observed (such as Twitter accounts with explicit friend/follower edges and retweets as instances of propagation); in this talk, I will focus on the comparatively overlooked case of information propagation in *implicit* networks--where we seek to discover single instances of a message passing from person A to person B to person C, only given a depiction of their activity in text. Literature in many ways presents an ideal domain for modeling information propagation described in text, since it depicts a largely closed universe in which characters interact and speak to each other. At the same time, it poses several wholly distinct challenges--in particular, both the length of literary texts and the subtleties involved in extracting information from fictional works pose difficulties for NLP systems optimized for other domains. In this talk, I will describe our work in measuring information propagation in these implicit networks, and detail an NLP pipeline for discovering it, focusing in detail on new datasets we have created for tagging characters and their coreference in text. This is joint work with Matt Sims, Olivia Lewke, Anya Mansoor, Sejal Popat and Sheng Shen.

Greg Durrett

UT Austin

March 1, 2021

Addressing the Paradox of Flexible but Reliable Text Generation

Text generation is a paradox. We want our generation models to imitate patterns in training data, but also have the flexibility to work in new settings and behave in new ways. We want our models to say creative things, but also be reliable and factual with respect to their inputs. How can we achieve these dual goals with a single system? Our work focuses on generation systems that are controlled and assessed in fine-grained ways: control mechanisms can help enumerate diverse inputs, which are then assessed according to our desired criteria. I will describe work in paraphrasing and summarization where intermediate syntactic control mechanisms can make our models more expressive. I will then describe how to assess these models' outputs from the standpoint of factuality and grammaticality in a fine-grained way, localizing errors to individual words and dependency arcs. By achieving diversity and then enforcing quality, we can build systems that are simultaneously flexible and reliable enough to handle a range of generation settings.

Ankur Parikh


February 22, 2021

Towards High Precision Text Generation

Despite large advances in neural text generation in terms of fluency, existing generation techniques are prone to hallucination and often produce output that is unfaithful or irrelevant to the source text. In this talk, we take a multi-faceted approach to this problem from three aspects: data, evaluation, and modeling. From the data standpoint, we propose ToTTo, a tables-to-text dataset with high-quality, annotator-revised references that we hope can serve as a benchmark for high-precision text generation. While the dataset is challenging, existing n-gram-based evaluation metrics are often insufficient to detect hallucinations. To this end, we propose BLEURT, a fully learnt end-to-end metric based on transfer learning that can quickly adapt to measure specific evaluation criteria, as well as a model based on confidence decoding to mitigate hallucinations. Finally, I will discuss GEM, a living benchmark for generation that is the result of a large collaboration among many institutions and will be an ACL 2021 workshop this year.

Barlas Oguz

Facebook AI

February 15, 2021

Dense Retrieval for Question Answering

Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this talk, we discuss recent work which shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system by a large margin -- 9%-19% absolute in terms of top-20 passage retrieval accuracy -- and helps our end-to-end QA system establish a new state of the art on multiple open-domain QA benchmarks. We also discuss extensions to the multi-hop setting, where we can outperform competing approaches with 10x less computation.
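
For concreteness, top-k passage retrieval accuracy (the metric quoted above, with k=20) can be computed as in the following hedged sketch; the string-containment answer check is a simplification of the actual evaluation:

    def top_k_accuracy(ranked_passages, answers, k=20):
        """Fraction of questions whose top-k retrieved passages contain an answer.

        ranked_passages: per-question lists of passages, best first.
        answers: per-question lists of acceptable answer strings.
        """
        hits = 0
        for passages, golds in zip(ranked_passages, answers):
            if any(g.lower() in p.lower() for p in passages[:k] for g in golds):
                hits += 1
        return hits / len(answers)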

Liang Huang

Oregon State University/Baidu Research USA

February 8, 2021

Fighting COVID-19 using Parsing Algorithms and Grammar Formalisms

To defeat the current COVID-19 pandemic, a messenger RNA (mRNA) vaccine has emerged as a promising approach thanks to its rapid and scalable production and non-infectious and non-integrating properties. However, designing an mRNA sequence to achieve high stability and protein yield remains a challenging problem due to the exponentially large search space (e.g., there are 2.4 x 10^632 possible mRNA sequence candidates for the spike protein of SARS-CoV-2). We describe two ongoing efforts on this problem, both using linear-time algorithms inspired by my earlier work in natural language parsing. On one hand, the Eterna OpenVaccine project from Stanford Medical School takes a crowd-sourcing approach to let game players all over the world design stable sequences. To evaluate sequence stability (in terms of free energy), they use LinearFold from my group (2019), since it is the only linear-time RNA folding algorithm available (which makes it the only one fast enough for COVID-scale genomes). On the other hand, we take a computational approach to directly search for the optimal sequence in this exponentially large space via dynamic programming. It turns out this problem can be reduced to a classical problem in formal language theory and computational linguistics (intersection between a CFG and a DFA), which can be solved in O(n^3) time, just like lattice parsing for speech. In the end, we can design the optimal mRNA vaccine candidate for the SARS-CoV-2 spike protein in about 10 minutes. This talk is dedicated to the memory of my PhD advisor Aravind Joshi, who taught me that linguistics and biology share the same mathematical foundations.
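
The size of that search space follows directly from the degeneracy of the genetic code: each amino acid is encoded by one to six synonymous codons, so candidate counts multiply along the protein. A small illustrative computation (the full ~1,273-residue spike protein gives the ~2.4 x 10^632 figure above):

    # Number of synonymous codons per amino acid in the standard genetic code.
    CODON_CHOICES = {
        'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
        'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
        'T': 4, 'W': 1, 'Y': 2, 'V': 4,
    }

    def num_mrna_candidates(protein):
        """Count the distinct mRNA sequences coding for a given protein."""
        n = 1
        for aa in protein:
            n *= CODON_CHOICES[aa]
        return n

    # Even the first 13 residues of the spike protein already admit
    # tens of millions of codings.
    print(num_mrna_candidates("MFVFLVLLPLVSS"))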

Lara Martin

University of Pennsylvania

January 25, 2021

Dungeons and Discourse: Using Computational Storytelling & Speech to Look at Natural Language Use

Although we are currently riding a technological wave of personal assistants, many of these agents still struggle to communicate appropriately. In particular, these systems lack coherence, the ability to adapt to novel situations, creativity, emotional understanding, and collaboration. My work focuses on creating open-world storytelling systems and developing agents that leverage speech understanding to communicate with humans more effectively. In this talk, I look at how tabletop roleplaying games such as Dungeons & Dragons can be used as motivation for how to improve conversational systems and understand how people communicate.

Rotem Dror

University of Pennsylvania

December 7, 2020

Statistical Significance Testing for Natural Language Processing

Data-driven experimental analysis has become the main evaluation tool of Natural Language Processing (NLP) algorithms. In fact, in the last decade, it has become rare to see an NLP paper, particularly one that proposes a new algorithm, that does not include extensive experimental analysis, and the number of involved tasks, datasets, domains, and languages is constantly growing. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we, as a community, rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental. In this talk, I will go through the main chapters of the book in the title and answer the following questions: How do you choose a valid statistical test for your experiments? How do you perform statistical analysis when experimenting with multiple datasets? How do you compare deep neural models in a statistically valid manner? And some more surprises...
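
As one concrete example from this literature, a paired bootstrap test compares two systems on the same test set by resampling examples with replacement; the sketch below is illustrative rather than the book's exact procedure, and metric, sys_a, sys_b, and gold are assumed inputs:

    import random

    def paired_bootstrap(metric, sys_a, sys_b, gold, n_samples=10000, seed=0):
        """Fraction of bootstrap resamples in which system A fails to beat B.

        metric(outputs, references) -> float, higher is better; sys_a, sys_b,
        and gold are aligned per-example lists. A small returned value
        suggests A's observed advantage is unlikely to be coincidental.
        """
        rng = random.Random(seed)
        n = len(gold)
        losses = 0
        for _ in range(n_samples):
            sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
            a = [sys_a[i] for i in sample]
            b = [sys_b[i] for i in sample]
            g = [gold[i] for i in sample]
            if metric(a, g) <= metric(b, g):
                losses += 1
        return losses / n_samples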

Dragomir R. Radev

Yale University

November 30, 2020

Closing the Loop in Natural Language Interfaces to Relational Databases: Parsing, Dialogue, and Generation

Natural language is a very efficient method of communication among humans. However, when users want to talk to their computers, translating this NL to computer actions is a very challenging task. One possible way for such human-computer interaction is to translate NL sentences to database queries and then to convert the output of these queries back to NL. In order for such an approach to work, one needs to address several challenges: the lack of annotated question-query pairs, the discourse issues present in multi-turn questions, and the issues that arise in a dialogue context. In this presentation, I will talk about recent work on natural language interfaces to databases. As part of the Yale Spider project, we have developed three new datasets and launched three matching shared tasks. Spider is a collection of 10,181 manually created natural language questions on databases from 138 domains, and the 5,693 database queries that correspond to them. SParC (Semantic Parsing in Context) consists of 4,298 coherent sequences of questions and the matching queries. Finally, CoSQL consists of 3k Wizard-of-Oz (WoZ) dialogues with a total of 30k turns, and their translations to SQL. I will then introduce GraPPa, a pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We used GraPPa to obtain SOTA performance on four popular fully supervised and weakly supervised table semantic parsing benchmarks. Joint work with Tao Yu, Rui Zhang, Victoria Lin, Caiming Xiong, and many others.

Ellie Pavlick

Brown University/Google AI

November 9, 2020

You can lead a horse to water...: Representing vs. Using Features in Neural NLP

A wave of recent work has sought to understand how pretrained language models work. Such analyses have resulted in two seemingly contradictory sets of results. On one hand, work based on "probing classifiers" generally suggests that SOTA language models contain rich information about linguistic structure (e.g., parts of speech, syntax, semantic roles). On the other hand, work which measures performance on linguistic "challenge sets" shows that models consistently fail to use this information when making predictions. In this talk, I will present a series of results that attempt to bridge this gap. Our recent experiments suggest that the disconnect is not due to catastrophic forgetting nor is it (entirely) explained by insufficient training data. Rather, it is best explained in terms of how "accessible" features are to the model following pretraining, where "accessibility" can be quantified using an information-theoretic interpretation of probing classifiers.

Vivek Srikumar

University of Utah

November 2, 2020

Where Neural Networks Fail: The Case for a Little Help from Knowledge

Today's dominant paradigm for modeling complex linguistic tasks calls for training neural networks by minimizing loss on massive datasets. While the agenda is undeniably successful, we may not have the luxury of annotated data for every task or domain of interest. Reducing dependence on labeled examples may require us to rethink how we supervise models. In this talk, I will discuss some failures of today's end-to-end trained neural networks. In particular, I will focus on two phenomena---societal stereotypes implicitly present in their decisions, and their inability to perform complex reasoning---both due to the models' inability to internalize knowledge about the world. Following this, I will describe our work on using knowledge to inform neural networks without introducing additional parameters. Declarative rules stated in logic can be systematically compiled into computation graphs that augment the structure of neural models, and also into regularizers that can use labeled or unlabeled examples. I will present experiments involving text understanding and semantic role labeling, which show that such declaratively constrained neural networks can successfully internalize the information in the rules, providing an easy-to-use mechanism for supervising neural networks that does not involve data annotation.

Liang Huang

Oregon State University/Baidu Research USA

October 26, 2020

Simultaneous Translation: Breakthrough and Recent Progress

Simultaneous interpretation (i.e., translating concurrently with the source language speech) is widely used in many scenarios including multilateral organizations (UN/EU), international summits (APEC/G-20), legal proceedings, and press conferences. However, it is well known to be one of the most challenging tasks for humans due to the simultaneous perception and production in two languages. As a result, there are only a few thousand professional simultaneous interpreters world-wide, and each of them can only sustain 15-30 minutes per turn. On the other hand, simultaneous translation (either speech-to-text or speech-to-speech) is also notoriously difficult for machines and has remained one of the holy grails of AI. A key challenge here is the word order difference between the source and target languages. For example, if you simultaneously translate German (an SOV language) to English (an SVO language), you often have to wait for the sentence-final German verb. Therefore, most existing "real-time" translation systems resort to conventional full-sentence translation, causing an undesirable latency of at least one sentence, rendering the audience largely out of sync with the speaker. There have been efforts towards genuine simultaneous translation, but with limited success. Recently, we discovered a much simpler and surprisingly effective approach to simultaneous (speech-to-text) translation by designing a "prefix-to-prefix" framework tailored to simultaneity requirements. This is in contrast with the "sequence-to-sequence" framework, which assumes the availability of the full input sentence. Our approach results in the first simultaneous translation system that achieves reasonable translation quality with controllable latency and was successfully deployed in many commercial products. Since 2019, our work has attracted renewed interest in this long-standing problem, which was once thought to be out of reach. I will also discuss our efforts towards the ultimate goal of simultaneous speech-to-speech translation, and conclude with a list of remaining challenges. (Part of this talk was given as an ACL 2019 keynote, but this CLunch talk will cover more recent progress.)
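
The prefix-to-prefix framework is instantiated by policies such as wait-k (from the group's ACL 2019 work), where the decoder starts after the first k source words and then emits one target word per additional source word. A schematic sketch, with translate_prefix standing in for a prefix-to-prefix trained model (an assumption here, expected to return None when the translation is complete):

    def wait_k_decode(source_stream, translate_prefix, k=3):
        """Schematic wait-k simultaneous decoding loop."""
        src, tgt = [], []
        for word in source_stream:
            src.append(word)
            if len(src) >= k:                     # after k source words,
                out = translate_prefix(src, tgt)  # emit one target word per read
                if out is not None:
                    tgt.append(out)
                    yield out
        # Source exhausted: flush the remaining target words.
        out = translate_prefix(src, tgt)
        while out is not None:
            tgt.append(out)
            yield out
            out = translate_prefix(src, tgt)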

Rui Zhang

Penn State University

October 19, 2020

Building Robust Conversational Question Answering Systems Over Databases of Tabular Data

A vast amount of information is stored in relational databases consisting of tables. These databases provide fundamental frameworks of data systems for business in various domains. In real-world applications, users would like to interact with databases for information requests just like talking to a human. However, querying databases requires proficiency in the SQL query language syntax and knowledge of the underlying table structures. Consequently, despite the enormous popularity of relational databases, the ability to retrieve information from these databases is still limited for many ordinary users. In this talk, I will describe some completed and ongoing efforts to build conversational question answering systems over databases of tabular data that are (1) robust to user queries, handling different types of user inputs; (2) conversational and interactive, conversing with users in a dialog setting and reasoning over multi-turn contexts of interaction history; (3) explainable and verifiable, generating natural language explanations of system-predicted SQL queries and execution results for user verification and feedback; and (4) transferable and adaptable, quickly adapting to different domains and scenarios of databases.

Daniel Deutsch

University of Pennsylvania

October 12, 2020

Ongoing Work on Summarization Evaluation Metrics

In this talk, I will provide an overview of two ongoing works on summarization evaluation metrics. The first analyzes the extent to which ROUGE and BERTScore actually measure the information overlap between two summaries. I show that they largely do not, and propose an alternative, interpretable method of comparing summarization systems that does measure information overlap. The second work focuses on using QA to evaluate summaries. After proposing a new QA-based metric, I benchmark its performance on current datasets, identify performance bottlenecks, and estimate its upper-bound performance, concluding that QA is a promising future research direction.

Mike Lewis

Facebook AI Research

October 5, 2020

Modelling Language and the World

Much recent progress in NLP has been driven by training language models on large unlabelled datasets. I will argue that language modelling requires both linguistic and world knowledge, but that these can be disentangled and modelled separately. First, I will describe kNN-LM, which shows how converting a language model into a nearest neighbor classifier can give large gains in performance, by giving the model access to facts in the training set during inference. I will then introduce MARGE, a new approach to pre-training sequence-to-sequence models with an unsupervised paraphrasing objective. This objective emphasises learning to paraphrase over memorizing facts. MARGE performs well on classification, generation and retrieval tasks in many languages, without supervision in some cases, making it arguably the most broadly applicable pre-trained model to date.
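
To make the kNN-LM idea concrete, the sketch below interpolates a base LM's next-word distribution with one induced by the nearest stored (context vector, next word) pairs; the datastore, distance, and mixture weight follow the published method in spirit, but all names and defaults here are illustrative:

    import numpy as np

    def knn_lm_probs(p_lm, context_vec, keys, values, vocab_size, k=8, lam=0.25):
        """Interpolate an LM's next-word distribution with a kNN distribution.

        p_lm: base LM distribution over the vocabulary (shape [vocab_size]).
        keys/values: datastore of training context vectors and the word
        that followed each context (as vocabulary ids).
        """
        dists = np.linalg.norm(keys - context_vec, axis=1)   # L2 distances
        nn = np.argsort(dists)[:k]                           # k nearest contexts
        weights = np.exp(-(dists[nn] - dists[nn].min()))     # stable softmax
        weights /= weights.sum()
        p_knn = np.zeros(vocab_size)
        for w, idx in zip(weights, nn):
            p_knn[values[idx]] += w                          # aggregate by word
        return lam * p_knn + (1 - lam) * p_lm                # kNN-LM mixture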

Dan Hopkins

University of Pennsylvania

September 21, 2020

The Polarization and Nationalization of American State Party Platforms, 1918-2017

The role of U.S. state political parties has changed substantially in recent decades. One common supposition is that contemporary state parties are increasingly polarized and nationalized, meaning that the Democratic and Republican parties adopt similar positions nationwide. Yet, the relationship between these shifts, the mechanisms underpinning them, and the extent to which they have unfolded similarly across states and issue areas remain open questions. We introduce a data set of 2,041 state party platforms to measure nationalization and polarization between 1918 and 2018. Applying tools from automated and manual content analysis, we find that there is a dramatic divergence in the topics covered in Democratic and Republican platforms starting in the early 1990s, at virtually the same time as federal-level rhetorical polarization. During this same period, the differences across states in platforms decreased and social issues became more prominent, suggesting a tight connection between polarization, nationalization, and social issues such as abortion.

Mohit Iyyer

University of Massachusetts Amherst

September 14, 2020

Towards interactive story generation

Story generation is difficult to computationally formalize and evaluate, and there are many important questions to ask when tackling the problem. What should we consider as the base unit of a story (e.g., a sentence? a paragraph? a chapter?) What kind of data should we use to train these models (novels? short stories? overly simplistic mechanically-turked paragraphs?) Is any model architecture currently capable of producing long-form narratives that have some semblance of coherent discourse structure, such as plot arcs and character development? When evaluating the outputs of our models, can we do better than just asking people to rate the text based on vaguely defined properties such as "enjoyability"? In this talk, I'll discuss my lab's ongoing work on story generation by introducing a new dataset and evaluation method that we hope will spur progress in this area. I'll then describe practical challenges (slow inference, insecure models) that we face when deploying our models in real-world author-facing settings, along with some solutions we have developed to combat these challenges.

Matt Gardner

Allen Institute for Artificial Intelligence

March 5, 2020

NLP Evaluations That We Believe In

With all of the modeling advancements in recent years, NLP benchmarks have been falling over left and right: "human performance" has been reached on SQuAD 1 and 2, GLUE and SuperGLUE, and many commonsense datasets. Yet no serious researcher actually believes that these systems understand language, or even really solve the underlying tasks behind these datasets. To get benchmarks that we actually believe in, we need to both think more deeply about the language phenomena that our benchmarks are targeting, and make our evaluation sets more rigorous. I will first present ORB, an Open Reading Benchmark that collects many reading comprehension datasets that we (and others) have recently built, targeting various aspects of what it means to read. I will then present contrast sets, a way of creating non-iid test sets that more thoroughly evaluate a model's abilities on some task, decoupling training data artifacts from test labels.

Mohammad Sadegh Rasooli

University of Pennsylvania

February 27, 2020

Cross-Lingual Transfer of Natural Language Processing Systems

Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, also known as parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages. In this talk, we present different methods for transfer of syntactic and semantic dependency parsers. We propose an annotation projection method that performs well in scenarios for which a large amount of in-domain parallel data is available. We also propose a method that combines annotation projection and direct model transfer, and that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we present an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. We also propose a method for cross-lingual transfer of dependency parsing based on multi-task learning by leveraging supervised syntactic information in the target language of interest. Finally, we introduce our current efforts for learning cross-lingual representations using information from different modalities, especially from images in the massively multilingual image dataset (MMID).

Zhiting Hu

Carnegie Mellon University

February 20, 2020

Connecting the Dots between Learning Paradigms

Continued research has created a diverse set of learning algorithms for ingesting distinct forms of experience (e.g. data, cost, knowledge constraints). However, it is often challenging for practitioners to choose or adapt solutions from such a bewildering marketplace of algorithms, as it could demand deep ML expertise and bespoke innovations. This talk will present an attempt to systematize several paradigms of algorithms for both a unifying understanding and new systematic methodologies of creating ML solutions. I will show that some of the popular algorithms in supervised learning, constraint-driven learning, reinforcement learning, etc, indeed share a common succinct formulation, showing that different forms of experience can be used for learning in the same way. The unifying representation of algorithms allows us to methodically exchange solutions between paradigms, and learn from combinations of experience jointly, for complex problems such as text and image generation.

Nitish Gupta

University of Pennsylvania

February 13, 2020

Neural Module Networks for Reasoning over Text

Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. In this talk, I will outline the challenges in learning these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. Then, I will present how we extend NMNs by (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. In conclusion, I will present methods for achieving interpretability in such compositional neural models and challenges for future research.

Noam Slonim


February 6, 2020

Project Debater – How Persuasive can a Computer be?

Project Debater is the first AI system that can meaningfully debate a human opponent. The system, an IBM Grand Challenge, is designed to build coherent, convincing speeches on its own, as well as provide rebuttals to the opponent’s main arguments. In 2019, Project Debater competed against Harish Natarajan, who holds the world record for most debate victories, in an event held in San Francisco that was broadcast live worldwide. In this talk I will tell the story of Project Debater, from conception to a climactic final event, describe its underlying technology, and discuss how it can be leveraged for advancing decision making and critical thinking.

Jay-Yoon Lee

Carnegie Mellon University

January 30, 2020

Injecting output constraints into neural NLP models in a model agnostic way

The talk discusses a method for injecting constraints into neural models, primarily for natural language processing (NLP) tasks. While neural models have set new state-of-the-art performance in many tasks from vision to NLP, they often fail to learn simple rules necessary for well-formed structures unless there is an immense amount of training data. The talk argues that not all aspects of the model have to be learned from the data itself, and that injecting simple knowledge/constraints into neural models can help low-resource tasks as well as improve state-of-the-art models. The talk focuses on structural knowledge of the output space and injects knowledge of correct or preferred structures as an objective, in a model-agnostic way, i.e., without modification to the model structure. One benefit of focusing on knowledge of the output space is that it is intuitive: we can directly enforce outputs to satisfy logical/linguistic constraints. Another advantage of structural knowledge is that it often does not require a labeled dataset. Using the example of semantic role labeling and its constraints related to the syntactic parse tree, the talk showcases the efficacy of the proposed inference algorithm and the proposed semi-supervised learning.
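
One simple way to realize this idea (a sketch, not necessarily the talk's exact formulation) is to add a term to the training objective that penalizes the probability mass the model places on constraint-violating outputs; all names and shapes below are illustrative:

    import torch

    def constrained_loss(logits, gold, violates, alpha=1.0):
        """Cross-entropy plus a soft penalty on constraint-violating labels.

        logits: [batch, num_labels]; gold: [batch] gold label ids;
        violates: [batch, num_labels] 0/1 mask marking labels that would
        break a known output constraint for each example.
        """
        ce = torch.nn.functional.cross_entropy(logits, gold)
        probs = torch.softmax(logits, dim=-1)
        penalty = (probs * violates).sum(dim=-1).mean()  # mass on bad outputs
        return ce + alpha * penalty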

Nick Montfort

Massachusetts Institute of Technology

January 23, 2020

Lean Computer-Generated Poetry as Exploration of Language, Culture, and Computation

Computational poetics is a compelling area of NLP. Poetry has helped to constitute cultures for millennia and its composition is considered one of the most human activities. On the generation side, computational poetics involves the production of poetic language, potentially with meter, rhyme and other forms of musicality, metaphors and their cousins, narrative aspects, and intertextual references. Essentially, the main objective of computationally generated poetry is being culturally and individually resonant for at least some readers or listeners in some cultures. There are a wide variety of approaches, some of which seek to model human creativity, as in the computational creativity community. Work in the area is undertaken by academic researchers, poets and artists, and programmers seeking amusement and diversion during events such as NaNoGenMo (National Novel Generation Month), which accommodates the generation of all sorts of large-scale literature, including poetry. In my talk, I will introduce my own practice as a computational poet, which does not involve developing general models of human creativity. My practice is often considered experimental and sometimes conceptual; it is not, in any case, expressive, that is, mainly concerned with my experiences or with conveying my emotions. Rather, I consider myself a situated and embodied explorer of language, culture, and computation. My means of exploration is the development of computational poetry. My practice involves writing programs that are usually small and simple, based on specific unusual lexicons and combinatorial techniques. As part of inquiring about computation, my work connects with platform studies and deals with specifics of particular computers and programming languages. As I share and discuss some of my specific computational poems, I will describe how this type of NLG work touches on questions of language and thought as studied in, for instance, linguistics, cognitive science, and conventional poetics.

Adam Poliak

Johns Hopkins University

December 10, 2019

Sentence-level Semantic Inference: From Diverse Phenomena to Applications

Many NLP tasks involve understanding meaning at the sentence level. In order to analyze such models, we should decompose sentence-level semantic understanding into a diverse array of smaller, more focused, fine-grained types of reasoning. This will help improve our understanding of the sentence-level reasoning capabilities of our NLP systems. In this talk, we will focus on Natural Language Inference (NLI), the task of determining if one sentence (hypothesis) can likely be inferred from another (context/premise). NLI has traditionally been used to evaluate how well different models understand language and the relationship between texts. We investigate whether 10 recent NLI datasets require models to reason about both texts, or if the datasets contain biases or statistical irregularities that allow a model to correctly label a context-hypothesis pair by only looking at the hypothesis. In the most popular dataset that we consider, a hypothesis-only model outperforms the majority baseline by over 2x. We will also discuss our recently released dataset, the Diverse NLI Collection (DNC), which can be used to shed light on a model’s ability to capture or understand a diverse array of semantic phenomena that are important to Natural Language Understanding. We will demonstrate how a variant of the DNC has been used to evaluate whether a Neural Machine Translation encoder captures semantic phenomena related to translation. With the remaining time, we will discuss how lessons from these studies can be applied to real-world use cases of sentence-level semantic inference. This talk is based on work that has appeared at NAACL, ACL, *SEM, and EMNLP.
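
A hypothesis-only baseline of the kind used in this work can be sketched as follows; the toy data is illustrative, and the real experiments use full NLI corpora. Any accuracy above the majority class here must come from artifacts in the hypotheses alone, since the premise is deliberately withheld:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in data; real experiments use NLI datasets such as SNLI.
    hypotheses = ["A man is sleeping.", "A man is outdoors.", "Nobody is present."]
    labels = ["contradiction", "entailment", "contradiction"]

    # Note that the premise never appears in the features.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(hypotheses, labels)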

Yoav Artzi

Cornell University

December 3, 2019

Robot Control and Collaboration in Situated Instruction Following

I will present two projects studying the problem of learning to follow natural language instructions. I will present new datasets, a class of interpretable models for instruction following, learning methods that combine the benefits of supervised and reinforcement learning, and new evaluation protocols. In the first part, I will discuss the task of executing natural language instructions with a robotic agent. In contrast to existing work, we do not engineer formal representations of language meaning or the robot environment. Instead, we learn to directly map raw observations and language to low-level continuous control of a quadcopter drone. In the second part, I will propose the task of learning to follow sequences of instructions in a collaborative scenario, where both the user and the system execute actions in the environment and the user controls the system using natural language. To study this problem, we build CerealBar, a multi-player 3D game where a leader instructs a follower, and both act in the environment together to accomplish complex goals. The two projects were led by Valts Blukis, Alane Suhr, and collaborators.

Hangfeng He

University of Pennsylvania

November 19, 2019

Distributed Semantic Representations from Question-Answering Signals

Human annotations, especially those from experts, are costly for many natural language processing (NLP) tasks. One emerging approach is to use natural language to annotate natural language, but it is challenging to get supervision effectively from annotations that are very different from the target task. This work studies the case where the annotations are in the format of question answering (QA). We propose a novel approach to retrieve two types of semantic representations from QA, using which we can consistently improve on a suite of tasks. This work points to an alternative way to supervise NLP tasks.

Shuai Tang

University of California, San Diego

November 12, 2019

Revisiting post-processing for word embeddings

Word embeddings learnt from large corpora have been adopted in various applications in natural language processing and serve as general input representations to learning systems. Recently, a series of post-processing methods have been proposed to boost the performance of word embeddings on similarity comparison and analogy retrieval tasks, and some have been adapted to compose sentence representations. The general hypothesis behind these methods is that by enforcing the embedding space to be more isotropic, the similarity between words can be better expressed. We view these methods as an approach to shrink the covariance/gram matrix, which is estimated when learning word vectors, towards a scaled identity matrix. By optimising an objective in the semi-Riemannian manifold with Centralised Kernel Alignment (CKA), we are able to search for the optimal shrinkage parameter, and we provide a post-processing method to smooth the spectrum of learnt word vectors, which yields improved performance on downstream tasks.
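
A hedged sketch of this shrinkage view: shrink the covariance of the word vectors toward a scaled identity and rescale the embedding coordinates accordingly. The talk selects the shrinkage parameter by optimizing CKA on a semi-Riemannian manifold; here alpha is left as a free parameter for illustration:

    import numpy as np

    def smooth_spectrum(E, alpha=0.5):
        """Post-process word embeddings by shrinking their covariance spectrum.

        E: [num_words, dim] matrix of word vectors; alpha in [0, 1] is the
        shrinkage weight toward a scaled identity.
        """
        E = E - E.mean(axis=0)                   # center the embeddings
        C = (E.T @ E) / len(E)                   # covariance estimate
        vals, vecs = np.linalg.eigh(C)           # C = V diag(vals) V^T
        c = np.trace(C) / C.shape[0]             # scale of the identity target
        vals_s = (1 - alpha) * vals + alpha * c  # shrunken, smoother spectrum
        scale = np.sqrt(vals_s / np.maximum(vals, 1e-12))
        # Rescale coordinates so the covariance becomes the shrunken estimate.
        return E @ vecs @ np.diag(scale) @ vecs.T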

Daniel Deutsch

University of Pennsylvania

October 29, 2019

A General-Purpose Algorithm for Constrained Sequential Inference

Inference in structured prediction involves finding the best output structure for an input, subject to certain constraints. Many current approaches use sequential inference, which constructs the output in a left-to-right manner. However, there is no general framework to specify constraints in these approaches. We present a principled approach for incorporating constraints into sequential inference algorithms. Our approach expresses constraints using an automaton, which is traversed in lock-step during inference, guiding the search to valid outputs. We show that automata can express commonly used constraints and are easily incorporated into sequential inference. When it is more natural to represent constraints as a set of automata, our algorithm uses an active set method for demonstrably fast and efficient inference. We experimentally show the benefits of our algorithm on constituency parsing and semantic role labeling. For parsing, unlike unconstrained approaches, our algorithm always generates valid output, incurring only a small drop in performance. For semantic role labeling, imposing constraints using our algorithm corrects common errors, improving F1 by 1.5 points. These benefits increase in low-resource settings. Our active set method achieves a 5.2x relative speed-up over a naive approach.
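
The core mechanism can be sketched as follows: the automaton state is advanced in lock-step with left-to-right decoding, and candidate tokens with no outgoing transition are pruned. Greedy search is used here for brevity (the paper's setting is more general), and all callables are illustrative stand-ins:

    def constrained_greedy_decode(score_next, dfa_next, accepting, start,
                                  max_len=20):
        """Greedy sequential inference restricted to strings a DFA accepts.

        score_next(prefix) -> {token: score} from the underlying model.
        dfa_next(state, token) -> next DFA state, or None if no transition.
        accepting(state) -> True if decoding may stop in this state.
        """
        prefix, state = [], start
        for _ in range(max_len):
            scores = score_next(prefix)
            legal = {t: s for t, s in scores.items()
                     if dfa_next(state, t) is not None}  # lock-step traversal
            if not legal:
                break
            token = max(legal, key=legal.get)
            prefix.append(token)
            state = dfa_next(state, token)
            if accepting(state):  # a real system would also weigh model scores
                break
        return prefix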

Daniel Deutsch

University of Pennsylvania

October 29, 2019

Summary Cloze: A New Task for Content Selection in Topic-Focused Summarization

A key challenge in topic-focused summarization is determining what information should be included in the summary, a problem known as content selection. In this work, we propose a new method for studying content selection in topic-focused summarization called the summary cloze task. The goal of the summary cloze task is to generate the next sentence of a summary conditioned on the beginning of the summary, a topic, and one or more reference documents. The main challenge is deciding what information in the references is relevant to the topic and partial summary and should be included in the summary. Although the cloze task does not address all aspects of the traditional summarization problem, the more narrow scope of the task allows us to collect a large-scale dataset of nearly 500k summary cloze instances from Wikipedia. We report experimental results on this new dataset using various extractive models and a two-step abstractive model that first extractively selects a small number of sentences and then abstractively summarizes them. Our results show that the topic and partial summary help the models identify relevant content, but the task remains a significant challenge.

Ben Zhou

University of Pennsylvania

October 29, 2019

"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding

Understanding time is crucial for understanding events expressed in natural language. Because people rarely say the obvious, it is often necessary to have commonsense knowledge about various temporal aspects of events, such as duration, frequency, and temporal order. However, this important problem has so far received limited attention. This paper systematically studies this temporal commonsense problem. Specifically, we define five classes of temporal commonsense, and use crowdsourcing to develop a new dataset, MCTACO, that serves as a test set for this task. We find that the best current methods used on MCTACO are still far behind human performance, by about 20%, and discuss several directions for improvement. We hope that the new dataset and our study here can foster more future research on this topic.

Katharina Kann

New York University

October 22, 2019

Neural Networks for Morphological Generation in the Minimal-Resource Setting

As languages other than English are moving more and more into the focus of natural language processing, accurate handling of morphology is increasing in importance. This talk presents neural network-based approaches to morphological generation, casting the problem as a character-based sequence-to-sequence task. First, we will generally discuss how to successfully train neural sequence-to-sequence models for this. Then, since many morphologically rich languages only have limited resources, the main part of the talk will focus on how to overcome the challenges that limited amounts of annotated training data pose to neural models. The approaches covered in this talk include multi-task learning, cross-lingual transfer learning, and meta-learning.

Jithin Pradeep

The Vanguard Group

October 15, 2019

ArSI - Artificial Speech Intelligence - An end-to-end automatic speech recognition system using attention plus CTC

Shi Yu

The Vanguard Group

October 15, 2019

A Financial Service Chatbot based on Deep Bidirectional Transformers

Christopher Lynn

University of Pennsylvania

October 8, 2019

Human information processing in complex networks

Humans communicate using systems of interconnected stimuli or concepts -- from language and music to literature and science -- yet it remains unclear how, if at all, the structure of these networks supports the communication of information. Although information theory provides tools to quantify the information produced by a system, traditional metrics do not account for the inefficient and biased ways that humans process this information. Here we develop an analytical framework to study the information generated by a system as perceived by a human observer. We demonstrate experimentally that this perceived information depends critically on a system's network topology. Applying our framework to several real networks, we find that they communicate a large amount of information (having high entropy) and do so efficiently (maintaining low divergence from human expectations). Moreover, we show that such efficient communication arises in networks that are simultaneously heterogeneous, with high-degree hubs, and clustered, with tightly-connected modules -- the two defining features of hierarchical organization. Together, these results suggest that many real networks are constrained by the pressures of information transmission, and that these pressures select for specific structural features.
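
A standard information-theoretic quantity in this line of work is the entropy rate of a random walk on the network, which the sketch below computes; the paper's perceived-information measure additionally models human biases and inefficiencies, which are omitted here:

    import numpy as np

    def random_walk_entropy(A):
        """Entropy rate (in bits) of a random walk on an undirected network.

        A: symmetric adjacency matrix. The walker moves to a uniformly
        random neighbor; the stationary distribution is degree-proportional.
        """
        deg = A.sum(axis=1)
        P = A / deg[:, None]   # row-stochastic transition matrix
        pi = deg / deg.sum()   # stationary distribution over nodes
        row_H = [-sum(p * np.log2(p) for p in row if p > 0) for row in P]
        return float(pi @ np.array(row_H))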

Dan Goldwasser

Purdue University

October 1, 2019

Joint Models for Social, Behavioral and Textual Information

Understanding natural language communication often requires context, such as the speakers' backgrounds and social conventions; however, when it comes to computationally modeling these interactions, we typically ignore this broader context and analyze the text in isolation. In this talk, I will review ongoing work demonstrating the importance of holistically modeling behavioral, social and textual information. I will focus on several NLP problems, including political discourse analysis on Twitter, partisan news detection and open-domain debate stance prediction, and discuss how jointly modeling text and social behavior can help reduce the supervision effort and provide a better representation for language understanding tasks.

Robert Shaffer

University of Pennsylvania

September 24, 2019

Similarity Inference for Legal Texts

Quantifying similarity between pairs of documents is a ubiquitous task. Both researchers and members of the public frequently use document-level pairwise similarity measures to describe or explore unfamiliar corpora, or to test hypotheses regarding diffusion of ideas between authors. High-level similarity measures are particularly useful when dealing with legal or political corpora, which often contain long, thematically diverse, and specialized language that is difficult for non-experts to interpret. Unfortunately, though similarity estimation is a well-studied problem in the context of short documents and document excerpts, less attention has been paid to the problem of similarity inference for long documents.

Reno Kriz

University of Pennsylvania

September 17, 2019

Comparison of Diverse Decoding Methods from Conditional Language Models

While conditional language models have greatly improved in their ability to output high-quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to cover, within a given-sized candidate list, as much of the space of high-quality outputs as possible, leading to improvements for tasks that re-rank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high-likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. We conduct an extensive survey of decoding-time strategies for generating diverse outputs from conditional language models. We also show how diversity can be improved without sacrificing quality by over-sampling additional candidates, then filtering down to the desired number.
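
The over-sample-then-filter strategy can be sketched as follows, with sample and distance as illustrative stand-ins for a conditional language model's sampler and a sequence dissimilarity (e.g., 1 minus a BLEU-style overlap); the greedy farthest-point selection is one simple filtering choice:

    def oversample_then_filter(sample, distance, n_final=5, n_draws=50):
        """Draw many candidates, then greedily keep a diverse subset.

        sample() -> (sequence, score); distance(a, b) -> dissimilarity
        between two sequences.
        """
        pool = sorted((sample() for _ in range(n_draws)), key=lambda c: -c[1])
        kept = [0]  # seed with the highest-scoring draw
        while len(kept) < min(n_final, len(pool)):
            rest = [i for i in range(len(pool)) if i not in kept]
            # Add the candidate least similar to everything selected so far.
            best = max(rest, key=lambda i: min(distance(pool[i][0], pool[j][0])
                                               for j in kept))
            kept.append(best)
        return [pool[i][0] for i in kept]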

Daphne Ippolito

University of Pennsylvania

September 17, 2019

Detecting whether Text is Human- or Machine-Generated

With the advent of generative models with a billion parameters or more, it is now possible to automatically generate vast amounts of human-sounding text. But just how human-like is this machine-generated text? Intuitively, shorter amounts of machine-generated text are harder to detect, but exactly how many words can a machine generate and still fool both humans and trained discriminators? We investigate how the choices of sampling strategy and text sequence length impact discriminability from human-written text, using both automatic detection methods and human judgement.