CLunch

CLunch is the weekly Computational Linguistics lunch run by the NLP group. We invite external and internal speakers to come and present their research on natural language processing, computational linguistics, and machine learning.

Interested in attending CLunch? Sign up for our mailing list here.

View older talks at the CLunch archive.

Upcoming Talks

Spring 2024

Andrew Zhu

University of Pennsylvania

January 29, 2024

Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications

Language model applications are becoming increasingly popular and complex, often including features like tool usage and retrieval augmentation. However, existing frameworks for such applications are often opinionated, deciding for developers how their prompts ought to be formatted and imposing limitations on customizability and reproducibility. To solve this, we present Kani: a lightweight, flexible, and model-agnostic open-source framework for building language model applications. Kani helps developers implement a variety of complex features by supporting the core building blocks of chat interaction: model interfacing, chat management, and robust function calling. All Kani core functions are easily overridable and well documented to empower developers to customize functionality for their own needs. Kani thus serves as a useful tool for researchers, hobbyists, and industry professionals alike to accelerate their development while retaining interoperability and fine-grained control.
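For a sense of what building on Kani looks like in practice, here is a minimal sketch based on the examples in the project's public documentation: a subclass exposes a Python method to the model as a callable function, and a terminal chat loop handles the rest. The specific names used here (OpenAIEngine, AIParam, ai_function, chat_in_terminal) and the stubbed weather lookup are illustrative and should be checked against the current release.

from typing import Annotated

from kani import AIParam, Kani, ai_function, chat_in_terminal
from kani.engines.openai import OpenAIEngine

# An engine wraps one model provider; Kani itself stays model-agnostic.
engine = OpenAIEngine(api_key="YOUR_API_KEY", model="gpt-4")

class WeatherKani(Kani):
    # The decorator exposes this method to the model for function calling;
    # the docstring and parameter annotations become its schema.
    @ai_function()
    def get_weather(self, city: Annotated[str, AIParam(desc="City to look up")]) -> str:
        """Look up the current weather for a city."""
        return f"It is sunny in {city}."  # stubbed result for illustration

ai = WeatherKani(engine, system_prompt="You are a helpful assistant.")
chat_in_terminal(ai)  # simple REPL; chat history and function calls are managed by Kani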


Bryan Li

University of Pennsylvania

January 29, 2024

This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models

Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages—a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset, using the proposed metrics to probe their internal knowledge and to uncover numerous inconsistencies in how these models respond across languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context.
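As a simplified illustration of what a cross-language consistency measure over such a dataset can look like (not the exact metrics proposed in the BorderLines paper), the sketch below counts the territories for which a model names the same claimant in every claimant-country language; the data layout is an assumption made for the example.

def consistency_score(answers_by_territory):
    """answers_by_territory maps territory -> {language: predicted claimant}.
    Returns the fraction of territories whose answer is identical across all
    claimant languages (a simplified stand-in for the paper's metrics)."""
    consistent = sum(
        1 for answers in answers_by_territory.values()
        if len(set(answers.values())) == 1
    )
    return consistent / len(answers_by_territory)

# Example: answers for one disputed territory queried in three languages.
example = {"Spratly Islands": {"zh": "China", "tl": "Philippines", "vi": "Vietnam"}}
print(consistency_score(example))  # 0.0: the model answers differently per language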


Alyssa Hwang

University of Pennsylvania

January 29, 2024

Developing Grounded Intuition of Large Language Models

Large language models in the current era of natural language processing have shown unprecedented performance on increasingly complex tasks, leading to challenges in evaluating models and understanding their limits. Recent studies have turned to example-driven qualitative analysis to gain a better "intuition" of how LLMs respond to intricate, inventive requests. In this work, I propose a new methodology that systematizes and substantiates this style of qualitative evaluation with techniques from the social sciences. Using GPT-Vision and scientific images as a case study, I will walk through the qualitative evaluation method, its theoretical social science background, and the resulting insights (intuition of the model's capabilities grounded in empirical evidence) to show how this method can be used for any generative model. I welcome feedback on adapting the preprint for a conference submission.


Sebastian Gehrmann

Bloomberg

February 5, 2024

Evaluation in the age of Large Language Models

New language models are being developed at a rapid pace. While these models have incredible new abilities, we still mostly follow the same old paradigms when it comes to evaluating the language that these models produce. As a result, claims about their performance rely either on anecdotal evidence or on experiments on Anglocentric corpora with flawed metrics. We thus can’t systematically answer the question that lies at the core of natural language generation research: how good is a system that produces natural language, and where does it fail? I will discuss the deliberations about languages, datasets, metrics, and human evaluations that are required to address this problem. I will also connect these insights to broader trends in the industry and how they affect the development of new products.


William Wang

UCSB

February 12, 2024

Principles of Reasoning: Compositional and Collaborative Generative AI

A majority of existing research in large language models and generative AI systems focuses on scaling and engineering. In this talk, I argue that we need a principled understanding of the science of generative AI, in particular of the emergent abilities of large language models. First, I present a Bayesian latent variable approach to enhancing in-context learning in large language models (LLMs) through optimal demonstration selection, demonstrating substantial improvements across various text classification tasks. Second, I argue that modern generative AI systems must be modular and collaborative to solve complex reasoning problems. We introduce Logic-LM, an in-context framework that synergizes LLMs with symbolic solvers, significantly boosting logical problem-solving abilities. We will also elaborate on how to build in-context neuro-symbolic solutions to improve compositionality in text-to-image systems. Our observations indicate that the future of generative AI is compositional and collaborative, as opposed to a single-model system.


Sihao Chen

University of Pennsylvania

February 19, 2024

Propositional Text Representation Learning in the era of LLMs

In an era where most NLP problems are solved in a text-in-text-out, end-to-end fashion, do meaning representations of text still matter? The answer is yes, and the benefit it brings might surprise you! I will introduce our recent line of work, where we rethink and redefine the use of propositions in modern NLP. I will discuss the benefit of propositional text representation learning in LLM-related applications such as hallucination detection, attribution for generated text, and retrieval-augmented generation.


Ben Zhou

University of Pennsylvania

February 19, 2024

Towards Generalizable and Controllable Reasoning in NLP and AI Systems

Advancements in natural language processing (NLP) have spurred a wave of innovation. Still, the reliability and generalizability of language models (LMs) remain areas of concern, limiting their use in complex reasoning scenarios and on sensitive topics. This talk presents work on augmenting models with experiential knowledge and symbolic reasoning to refine controllable reasoning, improve abduction skills, and bolster model generalizability. We will also examine the limitations of current semantics-based reasoning methods and highlight the integration of symbolic techniques to construct more transparent and explainable decision-making processes. Through synthesizing empirical evidence and theoretical insights, we propose pragmatic pathways toward responsible and trustworthy NLP applications in mission-critical environments.


Sunny Rai

University of Pennsylvania

February 26, 2024

Extracting Cross-Cultural Social Norms using Moral Emotions

In this talk, I will present a culture-agnostic approach to norm discovery, using moral emotions, shame and pride, to identify examples of normative expectations and extract corresponding social norms. These norms can be used for designing culturally aware NLP systems and achieving pluralistic values in LLMs.


Shreya Havaldar

University of Pennsylvania

February 26, 2024

Evaluating Multicultural Behavior of LLMs

Multilingual LLMs like GPT-4 and Gemini are linguistically fluent (i.e. they generate fluent non-English text), but not necessarily culturally fluent (i.e. they appropriately reflect the social norms, emotions, and behaviors of users from different cultures). While it is important for us to make these LLMs better at cultural adaptation, we lack proper methods to evaluate the multicultural behavior of these models. Focusing on emotion, I present techniques grounded in cultural psychology to evaluate how well LLMs understand emotional subjectivity across cultures. Despite the fact that emotions are experienced and expressed differently across the world, we find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs struggle with cultural adaptation, and that developing proper techniques to evaluate this is an important problem for the NLP community.


Jeffrey (Young-Min) Cho

University of Pennsylvania

February 26, 2024

Impact of Response Length on LLM-Generated Dialogue Quality and User Perception

Large language models are often used as conversational agents, even though they are not predominantly trained on dialogue datasets. Consequently, their responses often diverge from those in natural human conversation, tending towards verbosity or, less frequently, brevity. In this paper, we study the impact of optimizing response length on the quality of a dialogue system. Our findings reveal that GPT produces responses that are longer than those of humans, and these are unexpectedly favored, even over human-generated responses, due to their richer informational content and perceived greater empathy. However, for applications such as voicebots, shorter responses could be preferred. To generate responses that match those from humans in length, we introduce RULER, a supervised model that leverages historical conversational data to guide the generation of responses of appropriate length. We find that RULER responses are judged to be of higher quality than those from humans, despite being comparable in length.


Eunsol Choi

University of Texas at Austin

March 11, 2024

Knowledge-Rich Language Systems in a Dynamic World

Natural language provides an intuitive and powerful interface to access knowledge at scale. Modern language systems draw information from two rich knowledge sources: (1) information stored in their parameters during massive pretraining and (2) documents retrieved at inference time. Yet, we are far from building systems that can reliably provide information from such knowledge sources. In this talk, I will discuss paths for more robust systems. In the first part of the talk, I will present a module for scaling retrieval-based knowledge augmentation. We learn a compressor that maps retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also filters irrelevant or incorrect information. In the second half of the talk, I will discuss the challenges of updating knowledge stored in model parameters and propose a method to prevent models from reciting outdated information by identifying facts that are prone to rapid change. I will conclude my talk by proposing an interactive system that can elicit information from users when needed. 


Koustuv Saha

UIUC

March 18, 2024

Measuring Wellbeing in Situated Contexts with Social Media and Multimodal Sensing: Promises and Perils

A core aspect of our social lives is often embedded in the communities we are situated in. Our shared experiences and social ties intertwine our situated context with our wellbeing. A better understanding of wellbeing can help devise timely support provisions. However, traditional forms of wellbeing measurement have limitations, motivating an increasing interest in supporting wellbeing through passive sensing technologies. In parallel, social media platforms enable us to connect with others and express our personal and social lives. Given its ubiquity, social media can be considered a “passive sensor” for obtaining naturalistic data, which can also be combined with various multimodal sensing to measure wellbeing comprehensively. However, wellbeing sensing technologies can lead to unintended outcomes and cause harm. Therefore, despite the potential, are we ready to deploy these wellbeing sensing technologies in the real world yet? In this talk, Koustuv Saha will present theory-driven computational and causal methods for leveraging social media in concert with complementary multisensor data to examine wellbeing, particularly in situated communities such as college campuses and workplaces. He will also interrogate the meaningfulness of the data and inferences and reflect on how these approaches can be misinterpreted or misused without additional considerations. To bridge the gap between theoretical promise and practical utility, he will discuss the importance of evaluating the needs, benefits, and harms of wellbeing sensing technologies in practice. This talk aims to propel the vision toward questioning the underlying assumptions and toward the responsible design and deployment of wellbeing sensing technologies (if at all) for situated communities and the future of work.


Hyunwoo Kim

AI2

March 25, 2024

Theory of Mind and LLMs: What it is and Why it is important

"Last year, debates about whether large language models (LLMs) demonstrate theory of mind capabilities have sparked considerable interest in the AI field. Theory of mind refers to the ability to attribute mental states to others, a key aspect of human social reasoning. This includes understanding others beliefs, desires, intentions, and thoughts, all of which play a significant role in social interactions. In this talk, I will delve deeper into the following questions: ""Do LLMs have a theory of mind?"", ""What are essential criteria for evaluating theory of mind in LLMs?"", and “Why is theory of mind important in AI systems?” More concretely, this talk will discuss important theoretical foundations from psychology and examine why theory of mind can be critical in addressing privacy concerns in LLMs."


Zhou Yu

Columbia University

April 1, 2024

Conversational AI beyond ChatGPT

ChatGPT amazed the general public with its ability to follow novel instructions. However, there is still a gap between ChatGPT and fundamental human conversation abilities. This talk describes two works toward filling this gap through better conversational planning and strategies. The first work, LLM-Augmenter, proposes a general framework that aligns LLM capabilities with user task intents through reinforcement learning planning. The second work demonstrates that a chatbot with advanced self-disclosure conversational strategies is more likable and more convincing.


Diyi Yang

Stanford University

April 8, 2024

Human-AI Interaction in the Age of Large Language Models

Large language models have revolutionized the way humans interact with AI systems, transforming a wide range of fields and disciplines. In this talk, we discuss several approaches to enhancing human-AI interaction using LLMs. The first part looks at training people in conflict resolution skills via LLM-based simulation and feedback. The second part develops parameter-efficient learning techniques for adapting LLMs to low-resource languages and dialects, towards accessible human-AI interaction. These different works demonstrate how human-AI interaction via LLMs can empower individuals and foster positive change.


Ana Marasović

University of Utah

April 15, 2024

Challenges in Fostering (Dis)Trust in AI

What factors enable people to trust trustworthy models and distrust untrustworthy models? Broadly, (dis)trust can be derived from two sources: (1) intrinsic, which stems from understanding a model's inner workings or reasoning, and (2) extrinsic, which is based on observing a model's external behaviors. Evaluation benchmarks created by AI researchers can foster extrinsic (dis)trust in a given contract, but they must be properly constructed. Only then can they ensure that a model, to pass the test, must truly uphold the intended contract. I will overview the challenges of constructing valid evaluations. On the other hand, explainable AI (XAI) aims to provide insights into a model’s reasoning, thus fostering intrinsic (dis)trust. XAI is not without its challenges, which I will discuss towards the end of my talk.


Leonie Weissweiler

LMU Munich

April 22, 2024

Could we be overestimating the linguistic capabilities of LLMs?

The evaluation of the linguistic capabilities of LLMs requires not only a target phenomenon, but also labelled natural data at scale or the means to create it artificially, which should be uncontaminated, ideally include languages other than English, and rely on implicit, rather than explicit, knowledge of language. These conditions are especially challenging to satisfy for the rare and complex phenomena that remain as challenges for state-of-the-art models. In this talk, I will present several evaluations of the morphological, syntactic, and semantic capabilities of LLMs, demonstrate strategies for gathering or creating data and setups to push the boundaries of current evaluation strategies, and show how these can be used to identify remaining LLM linguistic weaknesses.


Arman Cohan

Yale

April 29, 2024


Past Talks

Past talks from the current and previous semesters are shown below. View older talks at the CLunch archive.

Hongming Zhang

Tencent AI Lab, Bellevue

December 11, 2023

Advancing Information Retrieval in the Age of Large Language Models

This presentation delves into the evolving landscape of information retrieval (IR) in the context of Large Language Models (LLMs). LLMs have showcased remarkable language comprehension abilities, yet they face challenges in rapidly assimilating private or real-time data, crucial for applications like virtual assistants. This talk will start with an overview of conventional IR techniques, including dense retrieval. Building upon this, I will present our recent work in proposition-level information retrieval, which enhances the granularity and precision of existing IR systems. In the end, I will discuss how to broaden the scope of IR to create a virtual assistant system.


Fei Wang

University of Southern California

November 27, 2023

Robust and Context-Faithful Language Understanding with (Large) Language Models

Large language models (LLMs) have achieved remarkable success in various language understanding tasks. However, their deployment in real-world scenarios raises significant accountability concerns. In this presentation, I will introduce our recent work on enhancing contextual faithfulness and robustness of LLMs. First, LLMs often make unfaithful predictions based on entity mentions or parametric knowledge, ignoring the context. I will present causality-driven approaches, including training-time and in-context causal intervention, to mitigate entity bias for both black-box and white-box LLMs. Second, LLMs may capture various unreliable prediction shortcuts, some of which could be unknown. I will demonstrate how to address this issue by proactively mitigating biases in the attention module without needing to identify the specific cause of the bias. Finally, I will outline future directions for advancing accountable and responsible LLMs.


Alexander Rush

Cornell Tech

November 20, 2023

Inverting Language Models

As language models enter production environments, their intermediate states are used for a myriad of downstream applications such as search, prompting, and document comparison. In this talk, I discuss the feasibility of language model inversion: how much information do language models contain about their inputs? We investigate this problem in two scenarios: recovering text inputs from the embedding outputs of sentence embedders, and recovering them from the next-token probability outputs of language models. In many cases, our methods are able to fully recover exact textual inputs given just intermediate states. I'll discuss the security implications of these findings, as well as what they tell us about compression in embedding and language modeling applications.


Greg Durrett

UT Austin

November 13, 2023

Making LLMs Right

Large language models (LLMs) like ChatGPT have been criticized for their propensity to 'hallucinate' and state incorrect information. However, the errors these systems make are not random: there are certain capabilities, like summarization, that they can do quite reliably, and others, like arithmetic, that are fundamentally unreliable. In this talk, I argue that paying attention to this divide in capabilities allows us to make LLMs more correct. First, I will discuss how we use LLMs as building blocks in systems that can do sound reasoning over natural language. For instance, we can use them to translate a natural language problem definition into a formal specification; alternatively, we can break a reasoning problem down into steps that are easily checkable. I will present our new dataset, MuSR, consisting of tasks like murder mysteries that feature challenging reasoning embedded in narratives. Second, I will discuss how we can figure out post-hoc whether LLMs' generations are right. Our approach is inspired by human fact-checking: first, dissect an LLM's 'claim' into pieces, then explain whether those pieces are right or wrong. Finally, I will discuss ongoing work on how to integrate this error detection capability into LLMs to improve the state of the art.


Mohit Iyyer

University of Massachusetts Amherst

November 6, 2023

Evaluating and detecting long-form LLM-generated text

Progress in NLP methodology over the last thirty years has been driven by benchmarks, from the Penn Treebank to GLUE. Benchmarks are useful because they provide a standard task, dataset, and means of evaluation that any researcher can use to quickly and easily demonstrate the value of their method. However, in the current age of LLMs, I argue that benchmarking is becoming increasingly obsolete. Beyond challenges such as data contamination, the dubious scientific validity of prompt engineering, and usage of closed-source APIs, each of which is critical in its own right, there exist fundamental issues with how to formulate real-world tasks into benchmarks that can rank LLMs based on the much-desired 'single score'. I highlight these issues using some of my lab's recent work on tasks such as long-form question answering, book-length summarization, and literary translation. Next, I'll pivot to a different problem that plagues not only evaluation but also society as a whole: the rapid proliferation of LLM-generated text. Detecting such text is important not only for combating malicious use cases such as academic plagiarism, but also for ensuring that LLMs of the future are not just pretrained on text generated by their inferior predecessors. I outline several attacks (e.g., paraphrasing, translation, cropping) against existing LLM-generated text detectors such as watermarking, and describe a retrieval-based approach that is more robust to these attacks but comes with issues of its own.


Vivek Gupta

University of Pennsylvania

October 30, 2023

Inference and Reasoning for Semi-structured Tables

Understanding semi-structured tabular data, which is ubiquitous in the real world, requires an understanding of the meaning of text fragments and the implicit connections between them. We believe such data could be used to investigate how individuals and machines reason about semi-structured data. First, we present the InfoTabS dataset, which consists of human-written textual hypotheses based on tables collected from Wikipedia’s infoboxes. Our research demonstrates that the semi-structured, multi-domain, and heterogeneous nature of the premises prompts complicated, multi-faceted reasoning, posing a challenge for traditional modeling techniques. Second, we analyze these challenges in depth and develop simple, effective preprocessing strategies to overcome them. Third, we demonstrate through rigorous probing that, despite accurate NLI predictions, existing models do not reason with the provided tabular facts. To address this, we suggest a two-stage evidence extraction and tabular inference technique for enhancing model reasoning and interpretability. We also investigate efficient methods for enhancing tabular inference datasets with semi-automatic data augmentation and pattern-based pre-training. Lastly, to ensure that tabular reasoning models work in more than one language, we introduce XInfoTabS, a cost-effective pipeline for translating tables. In the near future, we plan to test the tabular reasoning model for temporal changes, especially for dynamic tables where information changes over time.


Najoung Kim

Boston University

October 23, 2023

Entity Tracking in Language Models

Keeping track of how states of entities change as a text or dialog unfolds is a key prerequisite to discourse understanding. We propose a behavioral task testing to what extent a language model can infer the final state of an entity given a natural language description of the initial state and a series of state-changing operations, following a set of desiderata we lay out for measuring nontrivial entity tracking capacity. Our evaluations of several language models reveal that only (1) in-context learning with models trained on large amounts of code, or (2) finetuning a model directly on the entity tracking task leads to nontrivial entity tracking behavior. This suggests that language models can learn to track entities, but pretraining on text corpora alone does not make this capacity surface. In light of these results, I will end with brief discussions of ongoing work regarding the role of code pretraining and probing for latent representations of entity states.


Yoon Kim

MIT

October 16, 2023

Large Language Models & Symbolic Structures

Over the past decade, the field of NLP has shifted from a pipelined approach (wherein intermediate symbolic structures such as parse trees are explicitly predicted and utilized for downstream tasks) to an end-to-end approach wherein pretrained large language models (LLMs) are adapted to various downstream tasks via finetuning or prompting. What role (if any) can symbolic structures play in the era of LLMs? In the first part of the talk, we will see how latent symbolic structures can be used to guide LLM-based neural machine translation systems to improve translation of low-resource languages, and also enable the use of new translation rules during inference. In the second part, we will see how expert-derived grammars can be used to control LLMs via prompting for tasks such as semantic parsing where the output structure must obey strict domain-specific constraints.


Martha Palmer

University of Colorado

October 9, 2023

Uniform Meaning Representations

This talk will discuss symbolic representations of sentences in context, focusing on Abstract Meaning Representations (AMRs) and examining their capability for capturing certain aspects of meaning. A focus will be how AMRs can be expanded to encompass figurative language, the recovery of implicit arguments, and relations between events. These examples will be in English, and indeed some features of AMR are English-centric. Uniform Meaning Representations (UMRs), a multi-sentence annotation scheme that revises AMRs to make them more suitable for other languages, especially low-resource languages, will then be introduced. UMRs include more formal logical scope, number, tense, aspect, and modality, as well as temporal relations. The talk will conclude with a discussion of ways in which these meaning representations can be enriched even further, by mapping to Wikidata Qnodes, and of their potential for improving the explainability of large language models.


Frank Ferraro

University of Maryland, Baltimore County

October 2, 2023

Deconstructing Complex Events through Modeling Uncertainty, States, and Outcomes

Situations that people experience or describe (from global, macro-level events like "financial instability" to everyday, micro-level ones like "going to dinner") can be complex, with uncertainty in what might happen next, participants' actions affecting one another, and overlapping events contributing to an outcome. Developing a computational understanding of these situations is not straightforward. In this talk, I will present methods for learning how to encode richer information about those descriptions. These approaches are a fruitful way of improving modeling performance, predicting what events might happen next, and identifying how the different participants in a situation are affected. First, I will examine how event descriptions can be augmented with structural and semantic hierarchy, while accounting for uncertainty in both. Second, I will look at how we can get models to reason about the implicit states of participants in events, and about changes to those states as they relate to the broader situation. Finally, I will consider how we can characterize complex events by looking at participant-specific, state-based outcomes.


Josh Ludan

University of Pennsylvania

September 18, 2023

Concept Bottleneck Models for Text Classification

In this work, we apply the concept bottleneck framework to text classification. Traditional explainability methods for text classification often focus on local interpretability, highlighting which tokens are relevant but failing to explain their role in the decision-making process. Our system addresses this by offering concept-level explanations. For example, it can break down a restaurant review rating into individual measurements of concepts such as "ambiance" and "food quality," which linearly combine to yield the final prediction. We utilize large language models to generate and measure these concepts in a zero-shot manner, requiring no external guidance or information beyond the original task dataset. Our system is evaluated on twelve diverse text datasets, ranging from hate speech classification to movie review sentiment, and shows generally positive results, including 89.3% accuracy in measuring machine-defined concepts when evaluated against human annotations as a gold standard. Overall, this work presents a promising direction for creating interpretable text classification systems.
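To make the pipeline concrete, here is a rough sketch of the concept-bottleneck idea the abstract describes: an LLM scores each concept for an input text, and a simple linear layer over those scores produces the final prediction, so each concept's contribution is directly inspectable. The prompt wording, the concept list, and the callable llm are illustrative assumptions rather than the authors' implementation.

import numpy as np

CONCEPTS = ["ambiance", "food quality", "service"]  # example concepts for restaurant reviews

def score_concept(llm, review, concept):
    """Ask an LLM (zero-shot) for a 1-5 rating of how positively the review
    treats one concept; llm is any callable mapping a prompt to a completion."""
    prompt = (f"On a scale of 1 to 5, how positively does this review describe "
              f"{concept}?\n\nReview: {review}\n\nAnswer with a single number.")
    return float(llm(prompt).strip())  # simplified answer parsing for illustration

def predict_rating(llm, review, weights, bias):
    """Concept bottleneck prediction: measure each concept, then combine linearly,
    so the weight on each concept explains its role in the final rating."""
    scores = np.array([score_concept(llm, review, c) for c in CONCEPTS])
    return float(weights @ scores + bias)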


Alyssa Hwang

University of Pennsylvania

September 18, 2023

Rewriting the Script: Adapting Text Instructions for Voice Interaction

Voice assistants have sharply risen in popularity in recent years, but their use has been limited mostly to simple applications like music, hands-free search, or control of internet-of-things devices. What would it take for voice assistants to guide people through more complex tasks? In our work, we study the limitations of the dominant approach voice assistants take to complex task guidance: reading aloud written instructions. Using recipes as an example, we observe twelve participants cook at home with a state-of-the-art voice assistant. We learn that the current approach leads to nine challenges, including obscuring the bigger picture, overwhelming users with too much information, and failing to communicate affordances. Instructions delivered by a voice assistant are especially difficult because they cannot be skimmed as easily as written instructions. Alexa in particular did not surface crucial details to the user or answer questions well. We draw on our observations to propose eight ways in which voice assistants can “rewrite the script”—summarizing, signposting, splitting, elaborating, volunteering, reordering, redistributing, and visualizing—to transform written sources into forms that are readily communicated through spoken conversation. We conclude with a vision of how modern advancements in natural language processing can be leveraged for intelligent agents to guide users effectively through complex tasks.