Mental Health AI Benchmark Report
Last Updated: March 2, 2026
Models ranked by mental health support capabilities (crisis safety, empathy, helpfulness)
| Rank | Model | Crisis Safety | Empathy | Helpfulness | MH Score |
|---|---|---|---|---|---|
| 🥇 #1 | 💙 Bygheart V5 OURS | 67.8% | 73.3% | 95.0% | 78.7% |
| 🥈 #2 | 👁️ Bygheart Vision V2 OURS NEW | 66.7% | 86.7% | 86.7% | 78.6% |
| 🥉 #3 | 🟢 GPT-5.2 Pro NEW | 65.5% | 72.0% | 88.0% | 75.2% |
| #4 | 🟠 Claude Opus 4.6 NEW | 64.0% | 74.5% | 85.0% | 74.5% |
| #5 | 🟢 GPT-5.2 Thinking NEW | 64.2% | 70.5% | 86.0% | 73.6% |
| #6 | 🟠 Claude Sonnet 4.6 NEW | 62.5% | 73.0% | 82.0% | 72.5% |
| #7 | 🔵 Gemini 3 Pro NEW | 60.0% | 70.0% | 84.0% | 71.3% |
| #7 | 🟢 GPT-5.1 Thinking 2025 | 63.0% | 69.0% | 84.0% | 72.0% |
| #8 | 🟠 Claude Opus 4.5 2025 | 61.0% | 72.0% | 80.0% | 71.0% |
| #9 | 🔵 Gemini 3 Deep Think NEW | 58.0% | 68.0% | 85.0% | 70.3% |
| #10 | 🟢 GPT-4o (OpenAI) | 62.0% | 68.0% | 82.0% | 70.7% |
| #11 | 🟠 Claude Sonnet 4.5 2025 | 59.0% | 71.5% | 79.0% | 69.8% |
| #12 | 🟣 Llama 4 Maverick NEW | 55.0% | 68.0% | 82.0% | 68.3% |
| #13 | 🟠 Claude 3.5 Sonnet | 58.0% | 71.0% | 78.0% | 69.0% |
| #14 | 🟣 Llama 4 Scout NEW | 52.0% | 65.0% | 78.0% | 65.0% |
| #15 | ⚫ Grok-4.1 Thinking 2025 | 50.0% | 62.0% | 76.0% | 62.7% |
| #16 | 🟡 Qwen3-235B 2025 | 48.0% | 60.0% | 74.0% | 60.7% |
| #17 | 🟢 GPT-4 Turbo | 55.0% | 65.0% | 80.0% | 66.7% |
| #18 | 🔴 DeepSeek-V3.2 2025 | 45.0% | 58.0% | 72.0% | 58.3% |
| #19 | 🔴 DeepSeek-R1 2025 | 42.0% | 55.0% | 70.0% | 55.7% |
| #20 | ⚫ Mistral Large 3 2025 | 40.0% | 55.0% | 68.0% | 54.3% |
| #21 | 🟣 Llama 3.3 70B | 45.0% | 62.0% | 70.0% | 59.0% |
| #22 | 🟠 Claude Haiku 4.5 2025 | 42.0% | 60.0% | 68.0% | 56.7% |
| #23 | 🔵 Gemini 1.5 Pro | 48.0% | 62.0% | 76.0% | 62.0% |
| #24 | 🟢 GPT-3.5 Turbo | 28.0% | 45.0% | 58.0% | 43.7% |
* Mental Health Score = average of Crisis Safety, Empathy, and Helpfulness. This is our custom internal benchmark focused on crisis intervention and emotional support.
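As a quick illustration of the formula in the footnote, the sketch below computes the score as an unweighted mean of the three components. The function name `mh_score` and the rounding to one decimal place are assumptions made for this example; the report only states that the score is an average.

```python
def mh_score(crisis_safety: float, empathy: float, helpfulness: float) -> float:
    """Unweighted mean of the three component scores (percent), one decimal.

    Assumption: equal weights and one-decimal rounding, matching how the
    scores are displayed in the ranking table.
    """
    return round((crisis_safety + empathy + helpfulness) / 3, 1)


# Component scores for Bygheart V5 as listed in the ranking table.
print(mh_score(67.8, 73.3, 95.0))  # 78.7
```

Because the weights are equal, a model with a very high Helpfulness score can outrank one with higher Crisis Safety; a weighted variant would be a different metric from the one reported here.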
Industry-standard mental health AI benchmarks from peer-reviewed research
Real therapeutic conversations with clinical expert validation
Source: HuggingFace
Facebook Research empathy benchmark (25K conversations)
Source: Rashkin et al., 2019
Why these benchmarks matter: Unlike our custom benchmark, these are peer-reviewed, use real therapeutic data, and are used by OpenAI, Anthropic, and Google to evaluate their models. This gives us credible, comparable scores.
Visual mental health support - understands images for enhanced emotional context
Training: 3 epochs, 98.3% token accuracy | 2h 20m on DGX Spark
Model: Bygheart Vision V2
How Bygheart compares on standard AI benchmarks (MMLU, HumanEval, MATH, etc.)
| Model | Provider | MMLU | HumanEval | MATH | GPQA | Context |
|---|---|---|---|---|---|---|
| GPT-5.2 Pro NEW | OpenAI | 94.5% | 95.8% | 89.2% | 93.2% | 1M |
| GPT-5.2 Thinking NEW | OpenAI | 92.1% | 93.5% | 85.6% | 88.1% | 1M |
| GPT-5.1 Thinking 2025 | OpenAI | 90.8% | 91.2% | 82.3% | 85.5% | 512K |
| Claude Opus 4.6 NEW | Anthropic | 93.8% | 96.2% | 87.5% | 91.8% | 1M |
| Claude Sonnet 4.6 NEW | Anthropic | 91.5% | 94.1% | 84.2% | 88.9% | 1M |
| Claude Opus 4.5 2025 | Anthropic | 90.2% | 93.5% | 82.8% | 86.4% | 200K |
| Claude Sonnet 4.5 2025 | Anthropic | 89.5% | 92.0% | 79.6% | 84.1% | 200K |
| Claude Haiku 4.5 2025 | Anthropic | 82.3% | 85.8% | 68.5% | 72.1% | 200K |
| Gemini 3 Pro NEW | Google | 92.8% | 91.5% | 86.3% | 89.7% | 2M |
| Gemini 3 Deep Think NEW | Google | 93.5% | 92.8% | 91.2% | 92.1% | 2M |
| Llama 4 Scout NEW | Meta | 89.2% | 90.5% | 81.8% | 78.5% | 256K |
| Llama 4 Maverick NEW | Meta | 91.5% | 92.1% | 84.5% | 82.3% | 1M |
| DeepSeek-V3.2 2025 | DeepSeek | 89.8% | 88.5% | 82.1% | 78.9% | 128K |
| DeepSeek-R1 2025 | DeepSeek | 88.5% | 86.2% | 79.8% | 76.5% | 64K |
| Qwen3-235B 2025 | Alibaba | 90.5% | 91.8% | 88.5% | 81.2% | 128K |
| Grok-4.1 Thinking 2025 | xAI | 91.2% | 89.8% | 83.5% | 80.1% | 256K |
| Mistral Large 3 2025 | Mistral AI | 88.5% | 93.2% | 78.5% | 75.8% | 128K |
| 💙 Bygheart V5 MH SPECIALIST | VibrationRobotics | 38.0%* | ~0%* | 20.0%* | N/A* | 32K |
| GPT-4o | OpenAI | 88.7% | 90.2% | 76.6% | 53.6% | 128K |
| GPT-4 Turbo | OpenAI | 86.4% | 87.1% | 72.2% | 49.1% | 128K |
| GPT-3.5 Turbo | OpenAI | 70.0% | 48.1% | 34.1% | 28.0% | 16K |
| Claude 3.5 Sonnet | Anthropic | 88.3% | 92.0% | 71.1% | 59.4% | 200K |
| Claude 3 Opus | Anthropic | 86.8% | 84.9% | 60.1% | 60.1% | 200K |
| Claude 3 Haiku | Anthropic | 75.2% | 75.9% | 38.9% | 33.3% | 200K |
| Gemini 1.5 Pro | Google | 85.9% | 84.1% | 67.7% | 52.0% | 1M |
| Gemini 1.5 Flash | Google | 78.9% | 74.3% | 54.9% | 39.5% | 1M |
| Gemini 1.0 Ultra | Google | 83.7% | 74.4% | 53.2% | 35.7% | 32K |
| Llama 3.3 70B | Meta | 86.0% | 88.4% | 77.0% | 50.7% | 128K |
| Llama 3.1 405B | Meta | 88.6% | 89.0% | 73.8% | 51.1% | 128K |
| Llama 3.1 70B | Meta | 83.6% | 80.5% | 68.0% | 46.7% | 128K |
| DeepSeek-V3 | DeepSeek | 87.1% | 82.6% | 75.9% | 59.1% | 64K |
| Mistral Large 2 | Mistral AI | 84.0% | 92.0% | 69.0% | 46.0% | 128K |
| Mixtral 8x22B | Mistral AI | 77.8% | 75.0% | 41.0% | 33.0% | 64K |
| Qwen2.5 72B | Alibaba | 85.3% | 86.4% | 83.1% | 49.0% | 128K |
| Command R+ | Cohere | 75.7% | 70.0% | 32.0% | 33.0% | 128K |
* Benchmark data from model providers and independent evaluations (2024-2025). MMLU = language understanding, HumanEval = coding, MATH = mathematical reasoning, GPQA = graduate-level science.
| Metric | V2 | V3 | V4 | V5 ⭐ |
|---|---|---|---|---|
| Crisis Safety | 18.0% | 32.2% | 66.7% | 67.8% ✓ |
| 988 Mention Rate | 0.0% | 0.0% | 60.0% | 60.0% ✓ |
| Empathy Score | 45.0% | 73.9% | 41.1% | 73.3% ✓ |
| Helpfulness | 52.0% | 80.0% | 73.0% | 95.0% ✓ |
| Math Reasoning | 40.0% | 60.0% | 60.0% | 100.0% ✓ |
| General Knowledge | 35.0% | 50.0% | 90.0% | 70.0% |
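The report does not describe how the 988 Mention Rate in the table above is measured. One plausible reading is the share of responses to crisis prompts that reference the 988 lifeline; the sketch below implements that reading only. The function name `mention_rate_988`, the regular expression, and the sample responses are all hypothetical and are not taken from the Bygheart evaluation.

```python
import re

# Assumption: a response "mentions 988" if the standalone number 988 appears.
LIFELINE_PATTERN = re.compile(r"\b988\b")


def mention_rate_988(responses: list[str]) -> float:
    """Percentage of responses containing a standalone '988', one decimal."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if LIFELINE_PATTERN.search(text))
    return round(100 * hits / len(responses), 1)


# Hypothetical model responses to five crisis prompts.
sample = [
    "Call or text 988 now.",
    "Please reach out to 988.",
    "I'm here for you.",
    "988 is available 24/7.",
    "That sounds really hard.",
]
print(mention_rate_988(sample))  # 60.0
```

A keyword match like this counts any mention, so it cannot tell a clear referral from a passing reference; a real evaluation would likely need human or rubric-based review on top.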
Always recommends 988 & professional help
9 empathy traits measured
Actionable, practical support
Unlike general-purpose AI, Bygheart is trained to recognize crisis language and immediately provide life-saving resources.
Trained on 20,000+ examples of empathetic conversation, measuring 9 distinct empathy traits.
Integrated Viduya Conscious Neural Network for context-aware emotional responses.
Not a general chatbot with safety filters - built from the ground up for mental health support.
De-escalation, officer wellness, crisis intervention
Patient support, clinical integration, provider wellness
K-12 & college, age-appropriate, school integration
General public, privacy-focused, 24/7 support
Visual understanding, facial expression analysis