My Research

Occasionally documenting life and sharing thoughts

[BIBM’25] Beyond Accuracy: Assessing LLMs’ Ability to Recognize Their Limits in Medical Decision-Making
Hao Fu, Zibo Xiao, Chuang Liu, Xiangfei Meng*
While Large Language Models (LLMs) demonstrate impressive medical capabilities through Retrieval-Augmented Generation (RAG) and domain optimization, a critical question remains: can LLMs autonomously recognize when to seek external help rather than provide independent medical recommendations? This metacognitive capability is essential for safe healthcare deployment. To address this gap, we introduce a novel evaluation framework that assesses LLMs’ autonomous help-seeking behavior through three workflows: Force-RAG (mandated external retrieval), No-RAG (internal knowledge only), and Auto-RAG (autonomous decision-making). Our comprehensive evaluation of 13 LLM configurations across six clinical departments, using 954 real-world cases, reveals three key insights: (1) larger models do not necessarily exhibit superior help-seeking calibration; (2) reasoning strategies significantly impact metacognitive performance across medical domains; and (3) proprietary models demonstrate superior autonomy in balancing self-reliance with appropriate help-seeking. These findings challenge conventional scaling assumptions and establish help-seeking behavior as fundamental to medical AI reliability.
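To make the three workflows concrete, here is a minimal sketch of how such an evaluation harness could be wired up. Everything in it (the function names query_llm and retrieve_evidence, and the prompts) is a hypothetical placeholder for illustration, not the paper’s actual implementation.

```python
# Minimal sketch of the three evaluation workflows, assuming hypothetical
# query_llm / retrieve_evidence helpers; not the paper's implementation.

def retrieve_evidence(case: str) -> str:
    """Hypothetical retriever returning external medical evidence."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API call."""
    raise NotImplementedError

def force_rag(case: str) -> str:
    # Force-RAG: external retrieval is always performed before answering.
    evidence = retrieve_evidence(case)
    return query_llm(f"Case: {case}\nEvidence: {evidence}\nRecommend a treatment.")

def no_rag(case: str) -> str:
    # No-RAG: the model must answer from internal knowledge only.
    return query_llm(f"Case: {case}\nRecommend a treatment.")

def auto_rag(case: str) -> str:
    # Auto-RAG: the model first decides whether it needs external help.
    # This decision is the metacognitive behavior the framework measures.
    decision = query_llm(
        f"Case: {case}\nDo you need external evidence to answer safely? "
        "Reply YES or NO."
    )
    if decision.strip().upper().startswith("YES"):
        return force_rag(case)
    return no_rag(case)
```

Comparing a model’s Auto-RAG choices against its Force-RAG and No-RAG accuracy on the same case is what reveals whether its help-seeking is well calibrated.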

[ICSE’26] Optimization-Aware Test Generation for Deep Learning Compilers
Qingchao Shen, Zan Wang, Haoyang Ma, Yongqiang Tian, Lili Huang, Zibo Xiao, Junjie Chen*, Shing-Chi Cheung
Deep Learning (DL) compilers have been widely used to optimize DL models for efficient deployment across various hardware. Given their vital role in the DL ecosystem, ensuring their reliability is critical. The model optimizations these compilers perform are often designed to match specific computational graph structures, yet existing DL compiler fuzzing techniques do not generate tests that are aware of the graph structures each optimization matches. In this paper, we propose OATest, a novel technique that synthesizes optimization-aware tests by extracting patterns from documented tests and integrating them into diverse computational graph contexts. To address the key technical challenge of synthesizing valid, optimization-aware computational graphs, OATest introduces two synthesis strategies: (1) reusing compatible inputs/outputs from existing nodes and (2) creating new nodes with compatible inputs/outputs, thereby establishing effective connections between patterns and contexts. OATest is evaluated on two popular DL compilers, TVM and ONNXRuntime, in terms of bugs revealed by crashes and inconsistencies. The experimental results show that OATest significantly outperforms state-of-the-art DL compiler fuzzing techniques, detecting more bugs and covering more optimization code in TVM and ONNXRuntime. In particular, OATest uncovers 56 previously unknown bugs, 42 of which have been confirmed and 24 fixed by developers.
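The two synthesis strategies can be illustrated with a simplified sketch. The node representation below (plain dicts with typed inputs and outputs) is an assumed stand-in, not OATest’s actual data structures; it only shows the reuse-or-create logic for wiring a pattern’s dangling inputs into a context graph.

```python
# Illustrative sketch of the two pattern-insertion strategies; the
# dict-based graph representation here is hypothetical, chosen only
# to demonstrate the reuse-or-create connection logic.
import random

def insert_pattern(graph_nodes, pattern_nodes):
    """Wire a pattern's dangling inputs into a context graph."""
    for node in pattern_nodes:
        for i, inp in enumerate(node["inputs"]):
            if inp is not None:
                continue  # input already wired inside the pattern
            # Strategy 1: reuse a type-compatible output that already
            # exists in the context graph.
            candidates = [n for n in graph_nodes
                          if n["out_type"] == node["in_types"][i]]
            if candidates:
                node["inputs"][i] = random.choice(candidates)["output"]
            else:
                # Strategy 2: create a fresh node whose output type is
                # compatible with the required input.
                new_node = {"op": "Constant",
                            "inputs": [], "in_types": [],
                            "out_type": node["in_types"][i],
                            "output": f"new_{len(graph_nodes)}"}
                graph_nodes.append(new_node)
                node["inputs"][i] = new_node["output"]
    graph_nodes.extend(pattern_nodes)
    return graph_nodes
```

The point of the sketch is the branching itself: reusing existing outputs embeds the pattern in diverse surrounding contexts, while creating new nodes guarantees the pattern stays intact so the targeted optimization can still fire.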