TrustNLP: Fifth Workshop on Trustworthy Natural Language Processing

Colocated with the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025)

215 - San Miguel, Albuquerque Convention Center

Virtual site

About

Recent advances in Natural Language Processing, and the emergence of pretrained Large Language Models (LLMs) in particular, have led to significant breakthroughs in language understanding, generation, and interaction, and to the increasing use of these models in real-world tasks. However, these advances come with risks, including potential breaches of privacy, the propagation of bias, copyright violations, and vulnerability to adversarial manipulation. The demand for trustworthy NLP solutions is pressing, as the public, policymakers, and organizations seek assurances that NLP systems protect data confidentiality, operate fairly, and adhere to ethical principles.

This year, we are excited to host our TrustNLP workshop at NAACL 2025, aimed at fostering discussions on these pressing challenges and driving the development of solutions that prioritize trustworthiness in NLP technologies. The workshop aspires to bring together researchers from various fields to engage in meaningful dialogue on key topics such as fairness and bias mitigation, transparency and explainability, privacy-preserving NLP methods, and the ethical deployment of AI systems. By providing a platform for sharing innovative research and practical insights, this workshop seeks to bridge the gaps between these interconnected objectives and establish a foundation for a more comprehensive and holistic approach to trustworthy NLP.

Program

The program below is given in Mountain Daylight Time (MDT).

Time           | Event
9:00-9:10 am   | Opening Address
9:10-9:50 am   | Keynote 1: Mor Geva
9:50-10:30 am  | Keynote 2: Eric Wallace
10:30-11:00 am | Coffee Break
11:00-11:40 am | Keynote 3: Niloofar Mireshghallah
11:40-12:30 pm | Trusted AI Challenge: securely advance LLMs that code (Prasoon Goyal)
12:30-1:30 pm  | Lunch Break
1:30-3:00 pm   | In-person + Virtual Poster Session
2:45-3:30 pm   | Industrial-Academic Panels (Hybrid; Moderator: Aram Galstyan)
3:00-4:00 pm   | Coffee Break
4:00-5:25 pm   | Best and Spotlight Paper Presentations
4:25-5:30 pm   | Closing Remarks

Call for Papers

Topics

We invite papers that focus on different aspects of safe and trustworthy language modeling. Topics of interest include (but are not limited to):

  • Secure, Faithful & Trustworthy Generation with LLMs
  • Data Privacy Preservation and Data Leakage Issues in LLMs
  • Red-teaming, backdoor or adversarial attacks and defenses for LLM safety
  • Fairness, LLM alignment, Human Preference Elicitation, Participatory NLP
  • Toxic Language Detection and Mitigation
  • Explainability and Interpretability of LLM generation
  • Robustness of LLMs
  • Mitigating LLM Hallucinations & Misinformation
  • Fairness and Bias in multi-modal generative models: Evaluation and Treatments
  • Industry applications of Trustworthy NLP
  • Culturally-Aware and Inclusive LLMs
We also welcome contributions that draw upon interdisciplinary knowledge to advance Trustworthy NLP. This may include working with, synthesizing, or incorporating knowledge across areas of expertise, sociopolitical systems, cultures, or norms.

Important Dates

  • February 7, 2025 (extended from January 30): Workshop Paper Due Date (Direct Submission via OpenReview)
  • February 20, 2025: Workshop Paper Due Date (Fast-Track)
  • March 1, 2025: Notification of Acceptance
  • March 10, 2025: Deadline for relevant NAACL Findings papers to be submitted non-archivally (Direct submission via form, link)
  • March 10, 2025: Camera-ready Papers Due
  • April 8, 2025: Pre-recorded Video Due
  • Saturday, May 3, 2025: TrustNLP Workshop Day

Submission Information

All submissions undergo double-blind peer review (author names and affiliations removed) by the program committee and will be assessed based on their relevance to the workshop themes.

All submissions go through OpenReview. To submit, use the submission link.

Submitted manuscripts may be up to 8 pages for full papers and 4 pages for short papers, with unlimited pages for references and appendices in both cases. Please follow NAACL submission policies. Please note that at least one author of each accepted paper must register for the workshop and present the paper. Template files can be found here.

We also ask authors to include a limitations section and a broader impact statement, following the guidelines from the main conference.

Fast-Track Submission

If your paper has been reviewed by ACL, EMNLP, EACL, or ARR and its average rating is higher than 2.5 (either the average soundness or the average excitement score), it is eligible for fast-track submission. In the appendix, please include the reviews and a short statement describing which parts of the paper have been revised.

Non-Archival Option

NAACL workshops are traditionally archival. To allow dual submission of work, we are also offering a non-archival track. Authors of accepted non-archival submissions will still participate in the workshop and present their work. A reference to the paper will be hosted on the workshop website (if desired) but will not be included in the official proceedings. Please submit through OpenReview and indicate that the paper is a cross submission at the bottom of the submission form. You can also skip this step and inform us of your non-archival preference after the reviews. Papers accepted to the Findings of NAACL 2025 may also be submitted non-archivally to the workshop (link TBD).

Policies

Papers that have been accepted at or are under review for other venues may be submitted to the workshop but will not be included in the proceedings.

No anonymity period will be required for papers submitted to the workshop, per the latest updates to the ACL anonymity policy. However, submissions must still remain fully anonymized.

Info for Participants

To attend the workshop, please register through NAACL 2025.

Speakers


Mor Geva, Assistant Professor (Senior Lecturer) at Tel Aviv University and a Research Scientist at Google

Mor Geva is an Assistant Professor (Senior Lecturer) at the School of Computer Science and AI at Tel Aviv University and a Research Scientist at Google. Her research focuses on understanding the inner workings of large language models, to increase their transparency and efficiency, control their operation, and improve their reasoning abilities. Mor completed a Ph.D. in Computer Science at Tel Aviv University and was a postdoctoral researcher at Google DeepMind and the Allen Institute for AI. She was nominated as an MIT Rising Star in EECS (2021) and received multiple awards, including Intel's Rising Star Faculty Award (2024), an EMNLP Best Paper Award (2024), an EACL Outstanding Paper Award (2023), and the Dan David Prize for Graduate Students in the field of AI (2020).

Talk title: Into the Gap Between What Language Models Say and What They Know

Abstract: Alignment efforts to make large language models (LLMs) trustworthy and safe are often easy to bypass, as it is possible to steer models away from their safe behavior to generate biased, harmful, or incorrect information. This raises the question of what information LLMs capture in their hidden representations versus in the text they generate. In this talk, we will tackle this question from a mechanistic interpretability point of view. We will show that it is possible to estimate how knowledgeable a model is about a given subject only from its hidden representations, using a simple and lightweight probe, called KEEN. While KEEN correlates with model factuality, question-answering performance, and hedging behavior, it reveals a gap between the model’s inner knowledge and the knowledge it expresses in its outputs. Next, we will consider the problem of unlearning and leverage “parametric knowledge traces” for evaluation. We will see that while existing unlearning methods succeed at standard behavioral evaluations, they fail to erase the concept from the model parameters and instead suppress its generation during inference, leaving the model vulnerable to adversarial attacks.

Eric Wallace, Research Scientist at OpenAI

Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building trustworthy, secure, and private machine learning models. He did his PhD work at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.

Talk title: Making “GPT-Next” Robust

Abstract: I’ll talk about three recent directions from OpenAI to make our next-generation of models more responsible, trustworthy, and secure. First, I will do a deep dive into chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.


Niloofar Mireshghallah, Post-doctoral Scholar at University of Washington

Niloofar Mireshghallah is a post-doctoral scholar at the Paul G. Allen Center for Computer Science & Engineering at the University of Washington. She received her Ph.D. from the CSE department at UC San Diego in 2023. Her research interests are privacy in machine learning, natural language processing and generative AI, and law. She is a recipient of the National Center for Women & IT Collegiate Award (2020), a finalist for the Qualcomm Innovation Fellowship (2021), and a recipient of the 2022 Rising Stars in Adversarial ML award and the Rising Stars in EECS award.

Talk title: A False Sense of Privacy: Semantic Leakage, Non-literal Copying, and Other Privacy Concerns in LLMs

Abstract: The reproduction of training data by large language models has significant privacy and copyright implications, with concerns ranging from exposing medical records to violating intellectual property rights. While current evaluations and mitigation methods focus primarily on verbatim copying and explicit data leakage, we demonstrate that these provide only a surface-level, false sense of safety. In this talk, we show how building evaluations and red-teaming efforts solely around verbatim reproduction can be misleading: surface-level sanitization, while removing direct identifiers, still poses risks of re-identification through inference, and although aligned models show fewer direct regurgitations, they still reproduce non-literal content by generating series of events that are substantially similar to original works. Looking ahead, our findings highlight the need to shift toward more dynamic benchmarks that can capture these nuanced forms of information leakage, while developing protection methods that address both literal and semantic reproduction of content.

Committee

Organizers

Program Committee

Interested in reviewing for TrustNLP?

If you are interested in reviewing submissions, please fill out this form.

Questions?

Please contact us at trustnlp24naaclworkshop@googlegroups.com.