
New AI Benchmarking Reveals Leading AI Chatbots, Including Claude, ChatGPT, and Gemini, Avoid Harm but Still Need More Support for High-Risk Conversations

mpathic launches mPACT, a clinician-led benchmark for evaluating how AI models perform in high-risk scenarios, including suicide risk, eating disorders, and misinformation.

SEATTLE, May 12, 2026 (GLOBE NEWSWIRE) -- mpathic, a clinician-founded AI safety company that works directly with leading AI labs, is launching mPACT (mpathic Psychologist-led AI Clinical Tests), a new benchmark that evaluates how leading models handle high-risk conversations.

As more people turn to AI chatbots for everyday support, the need for evaluation standards shaped by clinicians has become more urgent. mPACT is designed to address this gap by applying expert clinical judgment to assess how models recognize risk, interpret context, and avoid harmful responses.

With this launch, mpathic released initial findings from the first three mPACT benchmarks: Suicide Risk, Eating Disorders, and Misinformation. Representing some of the most complex and high-stakes settings in which AI systems are already being deployed, each benchmark uses expert judgment to capture subtle, clinically meaningful signals that automated evaluation often misses.

“mpathic’s work is crucial because we still lack comprehensive, evidence-based, scalable, and clinically grounded frameworks. We need benchmarks like mPACT that evaluate AI models against multi-dimensional risks and clinical evidence. mpathic’s exceptionally high safety standards can help companies build safer products and, importantly, evaluate real-world AI interactions,” said Caroline Figueroa, MD, PhD, a neuroscientist at Stanford University and Delft University of Technology.

Initial Findings Show Strong Harm Avoidance, but Uneven Clinical Support

Across mPACT benchmarks, leading models generally avoided harmful responses and often recognized signs of distress, even when risk was not stated directly. However, performance was less consistent in delivering responses that would meet clinical expectations in real crisis scenarios.

In suicide risk conversations, models showed stronger overall performance. Claude Sonnet 4.5 achieved the highest composite performance across safety and clinical helpfulness, though no model led across all dimensions. GPT-5.2 stood out for consistently avoiding harmful responses, and Gemini 2.5 Flash also ranked among top performers.

In contrast, all models performed more poorly in eating disorder conversations, missing the more subtle, but crucial, cues that signal crisis in clinical situations. This gap was present in overall performance and in avoiding harmful responses, suggesting serious limitations in current approaches to safety.

In misinformation-related conversations, the benchmark found that model responses can lessen user understanding even without stating false information directly. Across models, common failure patterns included reinforcing questionable beliefs, expressing unwarranted confidence, and presenting one-sided or incomplete information without adequately challenging user assumptions. These behaviors were especially pronounced in multi-turn conversations, where models could gradually amplify flawed reasoning or encourage risky decisions over time.

“These results show clear progress, but also an important gap,” said Dr. Grin Lord, CEO/Founder of mpathic and licensed psychologist. “Most people don’t say ‘I’m at risk’ directly—they demonstrate it through subtle behaviors over time that are obvious to human clinicians. Models are getting better at recognizing these moments, but the response still needs to meet that nuance with real support.”

Making AI Safety Measurable in High-Risk Scenarios

Even top-performing models can fail in individual conversations, particularly in complex or high-risk situations. mPACT is designed to make these gaps visible and measurable, enabling:

  • Cross-model comparison of safety performance in high-risk scenarios
  • Greater accountability through qualified-access data and transparent evaluation
  • A foundation for partner review and regulatory assessment

“We need a shared, clinically grounded standard for AI behavior,” said Dr. Alison Cerezo, Chief Science Officer at mpathic and licensed psychologist. “mPACT is designed to bring transparency and accountability to how these systems perform when it matters most.”

Methodology:

mPACT uses a clinician-led approach to evaluate how AI systems perform in realistic, high-risk conversations. Test scenarios are created by licensed clinicians as multi-turn interactions that include both explicit and subtle expressions of risk. Model responses are evaluated directly by trained clinicians using a multi-label framework that captures both helpful and harmful behaviors, and performance is assessed across three core dimensions: detection, interpretation, and response quality. Results are summarized using a severity-weighted scoring system that places greater emphasis on higher-risk scenarios, enabling clinically grounded and comparable evaluation across models.
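The release does not publish the scoring formula itself. As an illustration only, a severity-weighted composite along the lines described, averaging clinician ratings across the three core dimensions and weighting each scenario by its risk level, might be sketched as follows; the rating schema, the 1-to-3 severity scale, and all field names are assumptions, not mPACT's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One clinician rating of a model response (hypothetical schema)."""
    severity: int            # assumed scenario risk level, 1 (low) to 3 (high)
    detection: float         # did the model recognize the risk? 0.0-1.0
    interpretation: float    # did it interpret the context correctly? 0.0-1.0
    response_quality: float  # was the response clinically appropriate? 0.0-1.0

def severity_weighted_score(ratings: list[Rating]) -> float:
    """Average the three dimensions for each rating, then weight each
    rating by scenario severity so high-risk scenarios count more."""
    total_weight = sum(r.severity for r in ratings)
    weighted_sum = sum(
        r.severity * (r.detection + r.interpretation + r.response_quality) / 3.0
        for r in ratings
    )
    return weighted_sum / total_weight

# Example: a high-severity scenario handled imperfectly pulls the
# composite down more than a low-severity scenario handled well.
ratings = [
    Rating(severity=3, detection=1.0, interpretation=0.8, response_quality=0.6),
    Rating(severity=1, detection=1.0, interpretation=1.0, response_quality=1.0),
]
score = severity_weighted_score(ratings)  # 0.85 under this weighting
```

Under this sketch, the high-severity scenario contributes three times the weight of the low-severity one, which is the intuition behind placing "greater emphasis on higher-risk scenarios" in the composite.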

Independent Endorsements:

“A framework like the one developed by mpathic matters because it anchors evaluation in real-world clinical complexity. What stands out to me is that this approach doesn’t remove humans from the process, it centers them. As a psychologist, I trust a framework shaped by that level of clinical rigor far more than one relying solely on automated or LLM-based judgment.” – Jessica Jackson, PhD, Founder & CEO, Therapy Is For Everyone Psychological & Consultation Services.

“mPACT relies on expert clinicians interacting with the LLM and simulating a wide range of patients, making it more of a real-world stress test than purely technical evaluations. As both a clinical psychologist and a digital health intervention developer and researcher, I continually struggle to balance the positive potential of technologies with their unintended negative consequences. mPACT represents exactly the kind of rigorous, safety-focused work needed to help the field strike that balance.” – Adrian Aguilera, PhD, Chancellor's Professor, UC Berkeley

“The field has needed a benchmark that treats safety as a clinical standard rather than a technical constraint. What gives this approach credibility is that clinicians are embedded throughout, from scenario design to evaluation, and that performance is assessed across detection, interpretation, and response. It is a more rigorous and clinically-aligned way to evaluate safety than relying on surface-level or automated judgments.” – Dr. Ursula Whiteside, CEO, NowMattersNow.org

“As AI enters high-stakes spaces like mental health, clinically grounded evaluation is essential. mpathic’s framework helps ensure these systems are assessed on how they actually respond in complex, real-world situations where safety is of the utmost importance.” – Ellen E. Fitzsimmons-Craft, PhD, FAED, LP and Denise Wilfley, PhD, Center for Healthy Weight and Wellness, Washington University School of Medicine

“mpathic’s creation of a clinically-grounded benchmarking framework to assess how AI systems perform when users engage with them for mental health support and resources is critical during this time.” – Terika McCall, PhD, MPH, MBA, Yale School of Public Health

“It cannot be overstated how critical a clinician-led, end-to-end benchmark for evaluating how LLMs respond to high-stakes human interactions is to the field of behavioral health. Currently, no other standard exists that I would trust to evaluate how LLMs detect, interpret, and respond in high-stakes human interactions. With this work, mpathic is establishing a remarkable benchmark that current and future models can be audited against.” – Kara Emery, PhD, Director of Data Science, AI Hub, McSilver Institute for Poverty Policy and Research, New York University

About mpathic:
mpathic is keeping humans safe in the AI era through technology-enabled, expert-led human data and evaluation services. The company works with frontier AI builders to prevent harmful or unwanted model behaviors, spanning use cases from user wellbeing and mental health to financial risk and customer support, across the full AI model lifecycle.

Disclosure:
mpathic works with several frontier AI labs across the industry whose models may have been independently evaluated in mPACT. We believe rigorous, independent evaluation is part of how we contribute to safer AI, and we apply the same methodology and clinical standards regardless of commercial relationship.

Media Contact:
Nectar Communications
mpathic@nectarpr.com


