Current dermatology models see a significant drop in diagnostic accuracy in real-world settings.
AI Quick Take
- Top-3 diagnostic accuracy for open-weight models drops to 1.50%–13.35% in real-world clinical settings.
- Incorporating clinical context boosts performance, but sensitivity to consultation quality remains a challenge.
Recent research posted on arXiv evaluates the real-world capabilities of multimodal large language models (MLLMs) in dermatology. The study focuses on four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) alongside the commercial model GPT-4.1. Although these models have shown promising results on public dermatology benchmarks, their real-world applicability raises significant concerns.
Using three dermatology datasets and a multi-site cohort of over 5,800 cases, the researchers evaluated two tasks: differential diagnosis generation and severity-based triage. While public benchmarks indicated a top-3 diagnostic accuracy of 26.55% for the best open-weight model and 42.25% for GPT-4.1, performance plummeted in clinical settings. Using images alone, open-weight models scored between 1.50% and 13.35% in top-3 diagnostic accuracy, with GPT-4.1 reaching 24.65%.
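Top-3 accuracy counts a case as correct whenever the ground-truth diagnosis appears anywhere in the model's three highest-ranked differential candidates. A minimal sketch of the metric, using invented example diagnoses (not cases from the study):

```python
def top_k_accuracy(ranked_predictions, ground_truths, k=3):
    """Fraction of cases whose true diagnosis appears in the top-k ranked predictions."""
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

# Hypothetical ranked differential lists for three cases
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "dysplastic nevus", "seborrheic keratosis"],
    ["acne vulgaris", "rosacea", "perioral dermatitis"],
]
truths = ["eczema", "basal cell carcinoma", "rosacea"]

print(top_k_accuracy(preds, truths, k=3))  # 2 of 3 hits -> ~0.667
```

Because a case counts as a hit even when the true diagnosis is ranked third, top-3 accuracy is a more forgiving measure than top-1; the study's low real-world scores are notable precisely because they persist under this laxer criterion.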
The study also found that adding clinical context improved performance substantially: top-3 accuracy rose to as high as 28.75% for open-weight models, while GPT-4.1 reached 38.93%. However, the models were highly sensitive to the quality of the input consultation data, meaning that errors or omissions in a consultation can drastically degrade outcomes. Additionally, although the models achieved moderate sensitivity in severity-based triage (above 60%), their unreliability raises questions about readiness for routine clinical deployment.
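In a triage setting, sensitivity is the fraction of genuinely urgent cases the model actually flags as urgent (recall on the positive class). A hedged illustration with made-up triage labels, not data from the paper:

```python
def sensitivity(y_true, y_pred, positive="urgent"):
    """Recall on the positive class: true positives / all actual positives."""
    true_positives = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_positives = sum(t == positive for t in y_true)
    return true_positives / actual_positives if actual_positives else 0.0

# Hypothetical ground-truth severity labels vs. model triage decisions
y_true = ["urgent", "routine", "urgent", "urgent", "routine"]
y_pred = ["urgent", "urgent", "routine", "urgent", "routine"]

print(sensitivity(y_true, y_pred))  # catches 2 of 3 urgent cases -> ~0.667
```

A sensitivity just above 60%, as reported for these models, would mean missing roughly one in three urgent cases, which is why the authors question clinical readiness despite the "moderate" score.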
This research highlights a crucial gap between model performance on benchmarks and actual clinical applicability. Stakeholders, including healthcare providers, developers, and researchers, must weigh these discrepancies before integrating MLLMs into dermatology practice. The findings also underscore the importance of high-quality consultation data for model accuracy, which is critical for patient safety. As demand for AI in healthcare continues to grow, understanding these limitations will shape future development and deployment strategies for MLLMs.