Colocated with the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025)
215 - San Miguel, Albuquerque Convention Center
Recent advances in Natural Language Processing, and the emergence of pretrained Large Language Models (LLMs) in particular, have led to significant breakthroughs in language understanding, generation, and interaction, and to increasing use of these models in real-world tasks. However, these advancements come with risks, including potential breaches of privacy, the propagation of bias, copyright violations, and vulnerability to adversarial manipulation. The demand for trustworthy NLP solutions is pressing, as the public, policymakers, and organizations seek assurances that NLP systems protect data confidentiality, operate fairly, and adhere to ethical principles.
This year, we are excited to host our TrustNLP workshop at NAACL 2025, aimed at fostering discussions on these pressing challenges and driving the development of solutions that prioritize trustworthiness in NLP technologies. The workshop aspires to bring together researchers from various fields to engage in meaningful dialogue on key topics such as fairness and bias mitigation, transparency and explainability, privacy-preserving NLP methods, and the ethical deployment of AI systems. By providing a platform for sharing innovative research and practical insights, this workshop seeks to bridge the gaps between these interconnected objectives and establish a foundation for a more comprehensive and holistic approach to trustworthy NLP.
The program below is in Mountain Daylight Time (MDT).
Time | Event
9:00-9:10 am | Opening Address
9:10-9:50 am | Keynote 1: Mor Geva
9:50-10:30 am | Keynote 2: Eric Wallace
10:30-11:00 am | Coffee Break
11:00-11:40 am | Keynote 3: Niloofar Mireshghallah
11:40-12:30 pm | Trusted AI Challenge -- securely advance LLMs that code (Prasoon Goyal)
12:30-1:30 pm | Lunch Break
1:30-3:00 pm | In-person + Virtual Poster Session
2:45-3:30 pm | Industrial-Academic Panels (Hybrid) (Moderator: Aram Galstyan)
3:00-4:00 pm | Coffee Break
4:00-5:25 pm | Best and Spotlight Paper Presentations
4:25-5:30 pm | Closing Remarks
We invite papers which focus on different aspects of safe and trustworthy language modeling. Topics of interest include (but are not limited to):
All submissions undergo double-blind peer review (with author names and affiliations removed) by the program committee and are assessed based on their relevance to the workshop themes.
All submissions are handled through OpenReview. To submit, use the submission link.
Submitted manuscripts must be at most 8 pages for full papers and 4 pages for short papers. Please follow NAACL submission policies. Both full and short papers can have unlimited pages for references and appendices. Please note that at least one author of each accepted paper must register for the workshop and present the paper. Template files can be found here.
We also ask authors to include a limitations section and a broader impact statement, following guidelines from the main conference.
If your paper has been reviewed by ACL, EMNLP, EACL, or ARR and its average rating (either the average soundness or excitement score) is higher than 2.5, it qualifies for the fast track. In the appendix, please include the reviews and a short statement discussing which parts of the paper have been revised.
NAACL workshops are traditionally archival. To allow dual submission of work, we are also including a non-archival track. If accepted, these submissions will still be presented at the workshop. A reference to the paper will be hosted on the workshop website (if desired) but will not be included in the official proceedings. Please submit through OpenReview and indicate that this is a cross submission at the bottom of the submission form. You can also skip this step and inform us of your non-archival preference after the reviews. Papers accepted to the Findings of NAACL 2025 may also be submitted non-archivally to the workshop (link TBD).
Papers that have been accepted at or are under review at other venues may be submitted to the workshop but will not be included in the proceedings.
No anonymity period will be required for papers submitted to the workshop, per the latest updates to the ACL anonymity policy. However, submissions must still remain fully anonymized.
Mor Geva, Assistant Professor (Senior Lecturer) at Tel Aviv University and a Research Scientist at Google
Mor Geva is an Assistant Professor (Senior Lecturer) at the School of Computer Science and AI at Tel Aviv University and a Research Scientist at Google. Her research focuses on understanding the inner workings of large language models, to increase their transparency and efficiency, control their operation, and improve their reasoning abilities. Mor completed a Ph.D. in Computer Science at Tel Aviv University and was a postdoctoral researcher at Google DeepMind and the Allen Institute for AI. She was nominated as an MIT Rising Star in EECS (2021) and received multiple awards, including Intel's Rising Star Faculty Award (2024), an EMNLP Best Paper Award (2024), an EACL Outstanding Paper Award (2023), and the Dan David Prize for Graduate Students in the field of AI (2020).
Talk title: Into the Gap Between What Language Models Say and What They Know
Abstract: Alignment efforts to make large language models (LLMs) trustworthy and safe are often easy to bypass, as it is possible to steer models away from their safe behavior to generate biased, harmful, or incorrect information. This raises the question of what information LLMs capture in their hidden representations versus in the text they generate. In this talk, we will tackle this question from a mechanistic interpretability point of view. We will show that it is possible to estimate how knowledgeable a model is about a given subject only from its hidden representations, using a simple and lightweight probe, called KEEN. While KEEN correlates with model factuality, question-answering performance, and hedging behavior, it reveals a gap between the model’s inner knowledge and the knowledge it expresses in its outputs. Next, we will consider the problem of unlearning and leverage “parametric knowledge traces” for evaluation. We will see that while existing unlearning methods succeed at standard behavioral evaluations, they fail to erase the concept from the model parameters and instead suppress its generation during inference, leaving the model vulnerable to adversarial attacks.
Eric Wallace, Research Scientist at OpenAI
Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building trustworthy, secure, and private machine learning models. He did his PhD work at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.
Talk title: Making “GPT-Next” Robust
Abstract: I’ll talk about three recent directions from OpenAI to make our next generation of models more responsible, trustworthy, and secure. First, I will do a deep dive into chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.
Niloofar Mireshghallah, Postdoctoral Scholar at the University of Washington
Niloofar Mireshghallah is a postdoctoral scholar at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She received her Ph.D. from the CSE department of UC San Diego in 2023. Her research interests are privacy in machine learning, natural language processing and generative AI, and law. She is a recipient of the National Center for Women & IT Collegiate Award in 2020, a finalist for the Qualcomm Innovation Fellowship in 2021, and a recipient of the 2022 Rising Stars in Adversarial ML award and Rising Stars in EECS.
Talk title: A False Sense of Privacy: Semantic Leakage, Non-literal Copying, and Other Privacy Concerns in LLMs
Abstract: The reproduction of training data by large language models has significant privacy and copyright implications, with concerns ranging from exposing medical records to violating intellectual property rights. While current evaluations and mitigation methods focus primarily on verbatim copying and explicit data leakage, we demonstrate that these provide a false sense of safety, operating only at a surface level. In this talk, we show how building evaluations and red-teaming efforts solely around verbatim reproduction can be misleading: surface-level sanitization, while removing direct identifiers, still poses risks of re-identification through inference, and although aligned models show fewer direct regurgitations, they still reproduce non-literal content by generating series of events that are substantially similar to original works. Looking ahead, our findings highlight the need to shift toward more dynamic benchmarks that can capture these nuanced forms of information leakage, while developing protection methods that address both literal and semantic reproduction of content.
Organizers
Program Committee
If you are interested in reviewing submissions, please fill out this form.
Please contact us at trustnlp24naaclworkshop@googlegroups.com.