Colocated with the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025)
215 - San Miguel, Albuquerque Convention Center
Recent advances in Natural Language Processing, and the emergence of pretrained Large Language Models (LLMs) in particular, have led to significant breakthroughs in language understanding, generation, and interaction, and to increasing use of these models in real-world tasks. However, these advancements come with risks, including potential breaches of privacy, the propagation of bias, copyright violations, and vulnerability to adversarial manipulation. The demand for trustworthy NLP solutions is pressing, as the public, policymakers, and organizations seek assurances that NLP systems protect data confidentiality, operate fairly, and adhere to ethical principles.
This year, we are excited to host our TrustNLP workshop at NAACL 2025, aimed at fostering discussions on these pressing challenges and driving the development of solutions that prioritize trustworthiness in NLP technologies. The workshop aspires to bring together researchers from various fields to engage in meaningful dialogue on key topics such as fairness and bias mitigation, transparency and explainability, privacy-preserving NLP methods, and the ethical deployment of AI systems. By providing a platform for sharing innovative research and practical insights, this workshop seeks to bridge the gaps between these interconnected objectives and establish a foundation for a more comprehensive and holistic approach to trustworthy NLP.
The program below is in Mountain Daylight Time (MDT).
Time | Event
9:00-9:10 am | Opening Address
9:10-9:50 am | Keynote 1: Mor Geva
9:50-10:30 am | Keynote 2: Eric Wallace
10:30-11:00 am | Coffee Break
11:00-11:40 am | Keynote 3: Niloofar Mireshghallah
11:40-12:30 pm | Trusted AI Challenge -- securely advance LLMs that code (Prasoon Goyal)
12:30-1:30 pm | Lunch Break
1:30-3:00 pm | In-person + Virtual Poster Session
2:45-3:30 pm | Industrial-Academic Panels (Hybrid) (Moderator: Aram Galstyan)
3:00-4:00 pm | Coffee Break
4:00-5:25 pm | Best and Spotlight Paper Presentations
4:25-5:30 pm | Closing Remarks
We invite papers which focus on different aspects of safe and trustworthy language modeling. Topics of interest include (but are not limited to):
All submissions undergo double-blind peer review (with author names and affiliations removed) by the program committee and are assessed based on their relevance to the workshop themes.
All submissions are handled through OpenReview. To submit, use the submission link.
Submitted manuscripts must be at most 8 pages for full papers and 4 pages for short papers. Please follow NAACL submission policies. Both full and short papers can have unlimited pages for references and appendices. Please note that at least one author of each accepted paper must register for the workshop and present the paper. Template files can be found here.
We also ask authors to include a limitations section and a broader impact statement, following guidelines from the main conference.
If your paper has been reviewed by ACL, EMNLP, EACL, or ARR and its average rating (either the average soundness or excitement score) is higher than 2.5, it qualifies for the fast track. In the appendix, please include the reviews and a short statement discussing which parts of the paper have been revised.
NAACL workshops are traditionally archival. To allow dual submission of work, we are also including a non-archival track. If accepted, these submissions will still be presented at the workshop. A reference to the paper will be hosted on the workshop website (if desired) but will not be included in the official proceedings. Please submit through OpenReview and indicate that this is a cross submission at the bottom of the submission form. You can also skip this step and inform us of your non-archival preference after the reviews. Papers accepted to the Findings of NAACL 2025 may also be submitted non-archivally to the workshop (link TBD).
Papers that have been accepted at or are under review at other venues may be submitted to the workshop but will not be included in the proceedings.
No anonymity period will be required for papers submitted to the workshop, per the latest updates to the ACL anonymity policy. However, submissions must still remain fully anonymized.
Mor Geva, Assistant Professor (Senior Lecturer) at Tel Aviv University and a Research Scientist at Google
Mor Geva is an Assistant Professor (Senior Lecturer) at the School of Computer Science and AI at Tel Aviv University and a Research Scientist at Google. Her research focuses on understanding the inner workings of large language models, to increase their transparency and efficiency, control their operation, and improve their reasoning abilities. Mor completed a Ph.D. in Computer Science at Tel Aviv University and was a postdoctoral researcher at Google DeepMind and the Allen Institute for AI. She was nominated as an MIT Rising Star in EECS (2021) and received multiple awards, including Intel's Rising Star Faculty Award (2024), an EMNLP Best Paper Award (2024), an EACL Outstanding Paper Award (2023), and the Dan David Prize for Graduate Students in the field of AI (2020).
Talk title: Into the Gap Between What Language Models Say and What They Know
Abstract: Alignment efforts to make large language models (LLMs) trustworthy and safe are often easy to bypass, as it is possible to steer models away from their safe behavior to generate biased, harmful, or incorrect information. This raises the question of what information LLMs capture in their hidden representations versus in the text they generate. In this talk, we will tackle this question from a mechanistic interpretability point of view. We will show that it is possible to estimate how knowledgeable a model is about a given subject only from its hidden representations, using a simple and lightweight probe, called KEEN. While KEEN correlates with model factuality, question-answering performance, and hedging behavior, it reveals a gap between the model’s inner knowledge and the knowledge it expresses in its outputs. Next, we will consider the problem of unlearning and leverage “parametric knowledge traces” for evaluation. We will see that while existing unlearning methods succeed at standard behavioral evaluations, they fail to erase the concept from the model parameters and instead suppress its generation during inference, leaving the model vulnerable to adversarial attacks.
Eric Wallace, Research Scientist at OpenAI
Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building trustworthy, secure, and private machine learning models. He did his PhD work at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.
Talk title: Making “GPT-Next” Robust
Abstract: I’ll talk about three recent directions from OpenAI to make our next generation of models more responsible, trustworthy, and secure. First, I will do a deep dive into chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.
Niloofar Mireshghallah, Postdoctoral Scholar at the University of Washington
Niloofar Mireshghallah is a postdoctoral scholar at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She received her Ph.D. from the CSE department of UC San Diego in 2023. Her research interests are privacy in machine learning, natural language processing and generative AI, and law. She is a recipient of the National Center for Women & IT Collegiate Award in 2020, a finalist for the Qualcomm Innovation Fellowship in 2021, and a recipient of the 2022 Rising Stars in Adversarial ML award and Rising Stars in EECS.
Talk title: A False Sense of Privacy: Semantic Leakage, Non-literal Copying, and Other Privacy Concerns in LLMs
Abstract: The reproduction of training data by large language models has significant privacy and copyright implications, with concerns ranging from exposing medical records to violating intellectual property rights. While current evaluations and mitigation methods focus primarily on verbatim copying and explicit data leakage, we demonstrate that these provide a false sense of safety, operating only at a surface level. In this talk, we show how building evaluations and red-teaming efforts solely around verbatim reproduction can be misleading: surface-level sanitization, while removing direct identifiers, still poses risks of re-identification through inference, and although aligned models show fewer direct regurgitations, they still reproduce non-literal content by generating series of events that are substantially similar to original works. Looking ahead, our findings highlight the need to shift toward more dynamic benchmarks that can capture these nuanced forms of information leakage, while developing protection methods that address both literal and semantic reproduction of content.
Organizers
Program Committee
If you are interested in reviewing submissions, please fill out this form.
Please contact us at trustnlp24naaclworkshop@googlegroups.com.