TrustNLP: Sixth Workshop on Trustworthy Natural Language Processing

Colocated with the 2026 Annual Conference of the Association for Computational Linguistics (ACL 2026)

San Diego, California, United States

About

With the rapid advances in AI, empowered by large language models (LLMs) and natural language processing (NLP) techniques, AI systems that interact directly with users and facilitate daily tasks are becoming increasingly widespread. In particular, the recent development of agentic models allows users to communicate directly with AI for complex tasks such as coding, web surfing, information seeking, and deep research. These models integrate NLP techniques with computer vision, systems engineering, and other social and physical sciences, expanding the boundaries of what AI systems can accomplish and making NLP systems omnipresent in our everyday life. This makes the development of reliable, responsible, ethical, and safe AI increasingly important.

This year, we are excited to host our TrustNLP workshop at ACL 2026, inviting participants and papers that focus on developing models that are explainable, fair, privacy‑preserving, causal, and robust. We have also secured sponsorship from major companies in the field, including Meta, Capital One, and Amazon. We will use the funding to promote diversity, participation, and mentoring, furthering our mission.

Keynote Talks

Invited Talk 1

Adversarial Arena: Driving Innovation in Responsible AI through Interactive Competition

Michael Johnston — Applied Science Manager, Responsible AI, Amazon AGI

July 4, 2026, 9:05 AM — Room: Harbor G

As AI systems become increasingly capable, it is critical that techniques ensuring their safety and alignment to human values keep up with the exponential pace of innovation. At the same time, there is increasing concern from an evaluation science perspective that static benchmarks fail to capture the true capabilities of models with respect to both utility and safety. Also, while high-quality, diverse data is critical for training, it is rare and expensive to create, especially for new tasks and multi-turn conversations. In this talk, I will illustrate how these three issues can be addressed through an 'Adversarial Arena' approach to driving research and data creation, where different approaches are evaluated through interactive competition. I will draw on examples and learnings from the Amazon Nova AI Challenge: Trusted AI, an international AI competition now in its second year. In the challenge, competing teams build either secure coding agents or automated red teaming bots, and their creations face off in a series of tournaments, setting in motion a continuous flywheel of innovation and data generation.

Michael Johnston is Applied Science Manager in the Responsible AI team in Amazon AGI. Michael has over 30 years of experience in artificial intelligence and machine learning and research contributions spanning NLP, dialog, multimodality, fusion of human and artificial intelligence, and trustworthy AI. Before joining Amazon, he was VP of Research and Innovation at Interactions Corporation, and earlier held positions at AT&T Labs Research, Oregon Graduate Institute, Brandeis University, and Apple. Michael has over 60 U.S. patents, and has published over 80 scientific papers. He has designed and overseen multiple international challenge competitions in artificial intelligence, including the Alexa Prize, the Amazon Trusted AI Challenge, and Amazon Nova AI Challenge: Trusted Software Agents.

Invited Talk 2

Toward Trustworthy Language Models through Interpretability and Control

Lilly Weng — UC San Diego

July 4, 2026, 11:00 AM — Room: Harbor G

The open exchange of ideas, the freedom of thought and expression, and respectful scientific debate are central to the aims and goals of an ACL conference. These require a community and an environment that recognizes the inherent worth of every person and group, that fosters dignity, understanding, and mutual respect, and that embraces diversity. For these reasons, ACL is dedicated to providing a harassment-free experience for participants at our events and in our programs. Harassment and hostile behavior are unwelcome at any ACL conference. This includes: speech or behavior (including in public presentations and on-line discourse) that intimidates, creates discomfort, or interferes with a person's participation or opportunity for participation in the conference. We aim for ACL conferences to be an environment where harassment in any form does not happen, including but not limited to: harassment based on race, gender, religion, age, color, national origin, ancestry, disability, sexual orientation, or gender identity. Harassment includes degrading verbal comments, deliberate intimidation, stalking, harassing photography or recording, inappropriate physical contact, and unwelcome sexual attention. It is the responsibility of the community as a whole to promote an inclusive and positive environment for our scholarly activities. In addition, any participant who experiences harassment or hostile behavior may contact any current member of the ACL Board or contact Priscilla Rasmussen, who is usually available at the registration desk of the conference. Please be assured that if you approach us, your concerns will be kept in strict confidence, and we will consult with you on any actions taken. The ACL board members are listed at: https://www.aclweb.org/portal/about. The full policy and its implementation are defined at: https://www.aclweb.org/adminwiki/index.php?title=AntiHarassment_Policy

Program

Saturday, July 4, 2026

09:00 – 09:05
Opening Remarks
09:05 – 09:50
Keynote 1 – Michael Johnston
09:50 – 10:30
Poster Session (continues during the break)
  • Evaluating Cross-Lingual Behavior and Consistency of Multimodal Large Language Models
    Hao Wang, Pinzhi Huang and Daisuke Kawahara
  • Responsible Federated LLMs via Safety Filtering and Constitutional AI
    Eunchung Noh and Jeonghun Baek
  • Uncertainty-Aware Proxy Attribute Reasoning for Reliable Media Bias Detection
    Chin-Po Chen, Jeng-Lin Li and Ming-Ching Chang
  • Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
    Zvi Topol
  • ChatbotManip: a Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
    Jack Luigi Henry Contro, Simrat Deol, Martim Brandao and Yulan He
  • Controllable Pareto Trade-off between Fairness and Accuracy
    Yongkang Du, Jieyu Zhao, Yijun Yang and Tianyi Zhou
  • What are They Thinking? Delineation, Probing, and Tracking of Concepts in LLMs
    Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali and Jonathan Rose
  • Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
    Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr and Sai Praneeth Karimireddy
  • Teaching People LLM's Errors and Getting it Right
    Nathan Stringham, Fateme Hashemi Chaleshtori, Xinyuan Yan, Zhichao Xu, Bei Wang and Ana Marasovic
  • Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
    Subramanyam Sahoo, Vinija Jain, Aman Chadha and Divya Chaudhary
  • KoLegalQA: A Korean Legal QA Dataset for Trustworthy and Explanation-Grounded Legal AI
    Yongtae Lee, Surin Lee, Sumin Kim, S M Wahidur Rahman and Heung-No Lee
  • On the Non-Identifiability of Steering Vectors in Large Language Models
    Sohan Venkatesh and Ashish Mahendran Kurapath
  • Authorization-First Retrieval: Enforcing Least Privilege in Multi-Agent RAG Systems
    Rohith Namboothiri
  • PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
    Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin and Xuebing Zhou
  • Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents
    Jihye Kim
  • Purdah and Patriarchy: Evaluating and Mitigating South Asian Biases in Open-Ended Multilingual LLM Generations
    Mamnuya Rinki, Chahat Raj, Anjishnu Mukherjee and Ziwei Zhu
  • Ghost Context: Measuring Cross-Context Interference in Long-Context Language Models
    Rohith Namboothiri
  • Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models
    John Timothy Halloran
  • ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
    Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Austin Zhang, Karthik Konaraddi, Rishita Mote and Wei Niu
  • Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability
    Yucheng Du
  • Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
    Arth Singh
  • Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
    Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef Van Genabith, Patrick Schramowski and Simon Ostermann
  • CARE: A Conformal Safety Layer for Medical Summarization
    Suhana Bedi, Bridget Lin, Anson Zhou, Jenelle A Jindal, Chloe O'Connell Stanwyck, Sanmi Koyejo and Nigam Shah
  • Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
    Zhiyu Xue, Zimo Qi, Guangliang Liu, Bocheng Chen and Ramtin Pedarsani
  • A Systematic Taxonomy of Failure Modes in Retrieval-Augmented Generation Systems
    Anupama Garani
  • Improving the Faithfulness of LLM-based Abstractive Summarization with Span-level Unlikelihood Training
    Sicong Huang, Qianqi Yan, Shengze Wang and Ian Lane
  • Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
    Jinhwa Kim and Ian Harris
  • MASCOT: Towards Trustworthy Multi-Agent Socio-Collaborative Companion Systems
    Yiyang Wang, Yiqiao Jin, Alex Cabral and Josiah Hester
  • Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search
    Devang Kulshreshtha, Hang Su, Chinmay Hegde and Haohan Wang
  • The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust
    Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister and Hamid Palangi
  • Lexical Familiarity Predicts Processing Depth for Nonliteral Language in Large Language Models
    Lang-Ching Yeh, Yu-Chieh Wang and Shu-Kai Hsieh
  • Did You Forget What I Asked? Prospective Memory Failures in Large Language Models
    Avni Mittal
  • Difficulty Perception in the Reasoning of LLMs
    Quang Minh Nguyen, Uzair Ahmed and Taegyoon Kim
  • Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples
    Phillip Howard, Xin Su and Kathleen C. Fraser
  • Don't Want Your LLM to Recommend Nuclear Strike? Try Asking It in Japanese
    Rian Touchent
  • Toward Dialect-Aware Safety Evaluation for Arabic Large Language Models
    Wajdi Zaghouani
10:30 – 11:00
Break
11:00 – 11:45
Keynote 2 – Lilly Weng
11:45 – 12:25
Oral Session
  • Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
    Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schuetze, Sebastian Möller and Vera Schmitt
  • ClaimCLAIRE: A Trust-Aware Multi-Component Fact-Checking Agent for Open-World Claims
    Xinman Liu and Mayank Sharma
  • Fairness Failure Modes of Multimodal LLMs
    Canyu Chen, Anglin Cai, Joan Nwatu, Yale Li, Jessica Hullman, Rada Mihalcea, Kathleen McKeown and Manling Li
12:25 – 12:30
Closing Remarks

Call for Papers

Topics

We invite papers that focus on different aspects of safe and trustworthy language modeling. Topics of interest include (but are not limited to):

  • Privacy‑Preserving Model Training
  • Unlearning and Model Editing
  • Fairness and Bias: Evaluation and Treatments
  • Model Explainability and Interpretability
  • Culturally‑Aware and Inclusive LLMs
  • Accountability, Safety, and Robustness
  • Red‑teaming, backdoor or adversarial attacks and defenses for LLM safety
  • Ethics, Social responsibility, and Dual‑use
  • Causal Inference and Fair ML
  • Secure, Faithful, Safe, and Trustworthy Data/Language Generation
  • Hallucination and Unqualified Suggestion
  • Toxic Language Detection and Mitigation
  • Industry applications of Trustworthy NLP

We welcome contributions that also draw upon interdisciplinary knowledge to advance Trustworthy NLP. This may include working with, synthesizing, or incorporating knowledge across expertise, sociopolitical systems, cultures, or norms.

Important Dates

  • March 5, 2026: Workshop Paper Due Date (Direct Submission via OpenReview)
  • April 10, 2026: Workshop Paper Due Date (Fast‑Track)
  • April 10, 2026: Deadline for relevant ACL Findings papers to submit as non‑archival
  • April 28, 2026: Notification of acceptance
  • May 12, 2026: Camera‑ready papers due
  • June 4, 2026: Pre‑recorded video due
  • July 4, 2026: Workshop date

Submission Information

All submissions undergo double‑blind peer review (with author names and affiliations removed) by the program committee, and they will be assessed based on their relevance to the workshop themes.

All standard submissions go through the OpenReview platform. To submit, use this submission link.

Submitted manuscripts must be 8 pages long for full papers and 4 pages long for short papers. Please follow ACL submission policies. Both full and short papers can have unlimited pages for references and appendices. Please note that at least one of the authors of each accepted paper must register for the workshop and present the paper.

Template files can be found here.

We also ask authors to include a limitations section and a broader impact statement, following guidelines from the main conference.

Fast‑Track Submission

If your paper has been reviewed by ACL, EMNLP, EACL, or ARR and its average rating (either the average soundness score or the average excitement score) is higher than 2.75, it qualifies for fast‑track submission. In the appendix, please include the reviews and a short statement describing which parts of the paper have been revised.

Fast-Track submissions go through the OpenReview platform. To submit, use this submission link.

Non‑Archival Option

ACL workshops are traditionally archival. To allow dual submission of work, we are also including a non‑archival track. If accepted, authors of these submissions will still participate in the workshop and present their work. A reference to the paper will be hosted on the workshop website (if desired), but the paper will not be included in the official proceedings. Please submit through OpenReview and indicate that this is a cross‑submission at the bottom of the submission form. You can also skip this step and inform us of your non‑archival preference after the reviews. Papers accepted to the Findings of ACL 2026 may also be submitted to the workshop as non‑archival.

Policies

Papers that are accepted or under review at other venues may be submitted to the workshop but will not be included in the proceedings.

No anonymity period will be required for papers submitted to the workshop, per the latest updates to the ACL anonymity policy. However, submissions must still remain fully anonymized.

Committee

Organizers

Interested in reviewing for future editions of TrustNLP?

If you are interested in reviewing submissions, please fill out this form.

Questions?

Please contact us at trustnlpworkshoporganizers@gmail.com.