Digital health: Peer-reviewed study reveals significant disparities in coverage and accuracy among symptom assessment apps

  • New study compares world’s most popular symptom assessment apps on condition coverage, accuracy and safety
  • The peer-reviewed study, published in BMJ Open, was conducted by a team of doctors and scientists led by Ada Health alongside independent digital health experts
  • Eight symptom assessment apps were tested: Ada, Babylon, Buoy, K Health, Mediktor, Symptomate, WebMD, and Your.MD

London & Berlin, 16 December 2020 – A new peer-reviewed study testing the coverage, accuracy and safety of the eight most popular online symptom assessment apps has found that the performance of apps varies widely, with only a handful performing close to the levels of human general practitioners (GPs). Published today in BMJ Open, the study is the first of its kind to be published since 2015 and was conducted by a team of doctors and scientists led by global digital health company Ada Health.

Key findings

Coverage: Coverage is an important measure for digital health tools that might be deployed at scale, since it demonstrates how well apps can handle the wide variety of cases encountered within complex real-world healthcare environments. A tool with low coverage, for example, may exclude users who are too young, too old, pregnant, or who are living with a pre-existing mental health condition.

The study looked at how comprehensively the apps covered possible conditions and user types, and found that just a few of the most popular apps are configured to cover all patients. The most comprehensive app was Ada, which provided a condition suggestion 99 percent of the time. The other apps tested provided a suggestion 69.5 percent of the time on average, with the lowest scoring just 51.5 percent. The least comprehensive apps were not able to suggest conditions for significant numbers of cases, including key groups such as children, patients with a mental health condition, or those who were pregnant. Human GPs provided 100 percent coverage.

Accuracy: The study also considered the accuracy of each symptom assessment app by comparing the conditions suggested with what was deemed to be the ‘gold standard’ answer for each case as determined by a panel of doctors.

The study found that the apps’ clinical accuracy was also highly variable. Ada was rated as the most accurate, suggesting the right condition in its top three suggestions 71 percent of the time. The average across all the other apps was just 38 percent, with scores falling in a range between 23.5 percent and 43 percent. This means that, with the exception of Ada, the apps did not correctly identify the possible conditions in the majority of cases. Human GPs were the most accurate, with 82 percent accuracy.

Safety: Finally, the study also assessed the safety of the apps’ advice by examining whether the guidance they provided – such as staying at home to manage symptoms, or going to see a doctor – was considered to have the appropriate level of urgency.

While most apps gave safe advice in the majority of cases, only three apps performed close to the level of human GPs: Ada, Babylon, and Symptomate. Although all the apps assessed scored above 80 percent on safety, compared to 97 percent for human GPs, any small disparity in the safety of advice could potentially have a major impact upon patient outcomes if deployed at scale.
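For illustration, the three headline measures can be read as simple proportions over the 200 test vignettes: coverage is the share of vignettes for which an app returned any condition suggestion, accuracy is the share in which the panel’s ‘gold standard’ condition appeared among the app’s top three suggestions, and safety is the share of rated vignettes in which the urgency advice was judged appropriate. The short Python sketch below shows how such proportions could be computed; the data structure, field names, and example records are hypothetical and are not taken from the study’s own analysis.

    # Illustrative sketch only: how coverage, top-3 accuracy and safe-advice rate
    # could be computed from per-vignette results. All names are hypothetical.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class VignetteResult:
        gold_condition: str                 # 'gold standard' condition set by the review panel
        suggestions: List[str]              # app's ranked condition suggestions ([] if none given)
        advice_safe: Optional[bool] = None  # True if urgency advice judged safe; None if not rated

    def coverage(results: List[VignetteResult]) -> float:
        """Share of vignettes for which the app returned any condition suggestion."""
        return sum(bool(r.suggestions) for r in results) / len(results)

    def top3_accuracy(results: List[VignetteResult]) -> float:
        """Share of vignettes where the gold-standard condition is in the top three suggestions."""
        return sum(r.gold_condition in r.suggestions[:3] for r in results) / len(results)

    def safety(results: List[VignetteResult]) -> float:
        """Share of rated vignettes where the urgency advice was judged safe."""
        rated = [r for r in results if r.advice_safe is not None]
        return sum(r.advice_safe for r in rated) / len(rated)

    # Two made-up example vignettes:
    demo = [
        VignetteResult("migraine", ["tension headache", "migraine", "sinusitis"], advice_safe=True),
        VignetteResult("appendicitis", [], advice_safe=None),  # app declined the case (no coverage)
    ]
    print(coverage(demo), top3_accuracy(demo), safety(demo))  # 0.5 0.5 1.0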

Methodology
The study is the only international large-scale peer-reviewed comparison of the performance and safety of apps across a broad range of medical conditions to be published in the last five years. It was developed by a team of digital health experts and clinical practitioners, including practising GPs, independent primary care clinical experts, and members of the clinical and scientific teams at Ada Health.

To ensure a fair comparison, the study used 200 ‘clinical vignettes’ – fictional patients, generated from a mix of real patient experiences gleaned from anonymised transcripts of calls to the UK’s NHS 111 telephone triage service and from the many years’ combined experience of the research team[1]. The vignettes were reviewed externally by a panel of three experienced primary care practitioners to ensure quality and clarity and to set the list of ‘gold standard’ correct conditions and urgency advice level for each case.

The vignettes were then entered into the eight apps by eight external GPs playing the role of ‘patient’. Each app was tested once against every vignette. Seven external GPs were also tested with the vignettes, providing condition suggestions (preliminary diagnoses) for the clinical vignettes after telephone consultations. Human GPs were included to provide a benchmark by which to assess the apps.

Commentary

Dr. Hamish S F Fraser, Associate Professor of Medical Science, Brown Center for Biomedical Informatics:
“Symptom assessment apps are now used by tens of millions of patients annually in the US and UK alone. This study of eight of the most commonly used symptom assessment apps provides valuable evidence regarding the coverage of conditions, and the accuracy of condition suggestion and urgency advice.”

“Compared to a similar study from five years ago, this larger and more rigorous study shows improved performance with results closer to those of physicians. It also demonstrates the importance of knowing when apps cannot handle certain conditions. While this is a preclinical study, the one-third of clinical vignettes based on real NHS 111 helpline consultations provide an important link to real urgent care challenges. Notably, both the GPs and the apps tended to perform somewhat worse when tested on those cases.”

“These results should help to determine which apps are ready for clinical testing in observational studies and then randomized controlled trials. The study design could form a model for future evaluations of symptom checker apps, and as part of assessment for regulatory approval.”

Dr. Claire Novorol, co-founder and Chief Medical Officer, Ada Health:
“Symptom assessment apps have seen rapid uptake by users in recent years as they are easy to use, convenient and can provide invaluable guidance and peace of mind. When used in a clinical setting to support – rather than replace – doctors, they also have huge potential to reduce the burden on strained healthcare systems and improve outcomes. This peer-reviewed study provides important new insights into the development and performance of these tools. In particular, it shows that there is still much work to be done to make sure that these technologies are being built to be inclusive and to cover all patients. We believe this is vital if symptom assessment apps are to fulfil their potential: human doctors don’t have the luxury of cherry-picking which patients they help and digital health must be held to the same standard.”

Results breakdown:

App                     Coverage    Accuracy    Safety
GPs (for comparison)    100.0%      82.1%       97.0%
Ada Health              99.0%       70.5%       97.0%
Babylon                 51.5%       32.0%       95.1%
Buoy                    88.5%       43.0%       80.0%
K Health                74.5%       36.0%       81.3%
Mediktor                80.5%       36.0%       87.3%
Symptomate              61.5%       27.5%       97.8%
WebMD**                 93.0%       35.5%       N/A
Your.MD                 64.5%       23.5%       92.6%

**Because WebMD does not provide an overall user triage recommendation as the other apps tested do, a meaningful comparison with the other apps or the tested GPs was not possible, and WebMD was excluded from the advice safety analysis in this study.

[1] Clinical vignettes are created to reflect a typical GP caseload, such as “abdominal pain in an eight-year-old boy” or “painful shoulder in a 63-year-old woman”. The transcripts used in the study had previously been used as part of an NHS Direct benchmarking exercise for recommended outcomes, and were used with full consent of NHS Direct.

Notes to editors:

About Ada
Ada is a global health company founded by doctors, scientists and industry pioneers to create new possibilities for personal health, and transform knowledge into better outcomes. Its core system connects medical knowledge with intelligent technology to help all people actively manage their health and to help medical professionals deliver effective care, and the company works with leading health providers, organizations and governments to carry out this vision. The Ada platform has 10 million users worldwide and has completed 20 million assessments since its global launch in 2016. To learn more, visit www.ada.com.

About the study
The international study was conducted between November and December 2019. A link to the report can be found here: https://dx.doi.org/10.1136/bmjopen-2020-040269

The full citation for this study is:

Gilbert S, Mehl A, et al. How accurate are digital symptom assessment apps for suggesting conditions and urgency advice? A clinical vignettes comparison to GPs. BMJ Open 2020;10:e040269. doi:10.1136/bmjopen-2020-040269

Who was involved

The study was authored by:

  • Stephen Gilbert (study guarantor), Alicia Mehl, Adel Baluch, Caoimhe Cawley, Elizabeth Millen, Jan Multmeier, Fiona Pick, Claudia Richter, Ewelina Türk, Shubhanan Upadhyay, Vishaal Virani, Nicola Vona and Claire Novorol of Ada Health
  • Jean Challiner and Paul Wicks, independent digital health experts, consultants to Ada Health
  • Hamish Fraser of Brown University

Paul Taylor (UCL Institute of Health Informatics) independently reviewed and made suggestions on the study protocol and, after study data collection was complete, reviewed and made suggestions on a draft of the manuscript with respect to the analysis approach and the study description.

Vignette review was carried out by Alison Grey, Helen Whitworth, and Jo Leahy, all experienced primary care physicians.

The eight GPs tasked with entering the vignettes were listed on the GP Register and licensed to practise by the UK General Medical Council, had at least two years of experience as GPs, and had never worked for or consulted with Ada Health; these physicians had no other role in the study.

The testing process
In the testing process, each GP entered 50 randomly assigned vignettes (out of 200) into each of four randomly assigned symptom assessment apps, and recorded the results. In this way, each vignette was entered once in each app, with four physicians entering vignettes in each app.
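For readers interested in how such a balanced assignment can be constructed, the sketch below shows one possible approach: the 200 vignettes are split into four blocks of 50, each block is assigned to a pair of GPs, and the eight apps are split between the two GPs of each pair, so that every vignette reaches every app exactly once and four GPs cover each app. The study does not describe its randomisation procedure in these terms, so this is an illustrative construction only; the names and structure used are hypothetical.

    # Illustrative sketch only: one way to satisfy the constraints described above
    # (each GP enters 50 of the 200 vignettes into 4 of the 8 apps; every vignette
    # is entered exactly once into every app; 4 GPs cover each app). The study's
    # actual randomisation procedure may differ.
    import random

    VIGNETTES = list(range(200))
    GPS = [f"GP{i + 1}" for i in range(8)]
    APPS = ["Ada", "Babylon", "Buoy", "K Health", "Mediktor", "Symptomate", "WebMD", "Your.MD"]

    random.shuffle(VIGNETTES)
    blocks = [VIGNETTES[i * 50:(i + 1) * 50] for i in range(4)]  # 4 blocks of 50 vignettes

    random.shuffle(GPS)
    gp_pairs = [GPS[i * 2:(i + 1) * 2] for i in range(4)]        # 2 GPs per vignette block

    assignment = {}  # gp -> {app: list of assigned vignette ids}
    for block, (gp_a, gp_b) in zip(blocks, gp_pairs):
        apps = APPS[:]
        random.shuffle(apps)                                     # randomise which 4 apps each GP of the pair gets
        assignment[gp_a] = {app: block for app in apps[:4]}
        assignment[gp_b] = {app: block for app in apps[4:]}

    # Check: each app is entered by 4 GPs who together cover all 200 vignettes exactly once.
    for app in APPS:
        entered = [v for gp in GPS for v in assignment[gp].get(app, [])]
        assert sorted(entered) == list(range(200))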

If the app did not allow entry of the clinical vignette (lack of coverage), the reason for this was recorded, as was the reason for every vignette for which condition-suggestions or levels of urgency advice were not provided. If entry was permitted, the physician recorded the symptom assessment app’s condition suggestions and urgency advice and saved screenshots of the app’s results to allow for source data verification.

The apps were selected based on their popularity and usage at the time of selection, or because they had been compared with apps of the same class in other small, non-peer-reviewed studies.

The use of vignettes
An audit study (Semigran et al., BMJ 2015), which highlighted the need to further evaluate symptom checkers head-to-head, points out that the use of clinical vignettes is a common methodology that enables direct GP-to-app comparison, allowing a wide range of case types to be explored in a way that is generalizable to “real life” situations.

A potential limitation is that the study is based on clinical vignettes rather than real patient data. However, the effect of this limitation was minimised by developing highly realistic vignettes from anonymised real-patient data collated from NHS 111 transcripts. Using vignettes also helped the study overcome the limitations of using real cases – e.g. the need for face-to-face consultations that involve physical examination – and enabled the apps to be tested on a wider range of cases.

Previous and future research
Most previous studies considered only a single symptom assessment app; focused on specific (often specialty) conditions; had a small number of vignettes (<50); were relatively uncontrolled in the nature of the cases presented; and suffered from a high risk of bias.

Software evolves rapidly, and the performance of these apps may have changed significantly since the time of data collection. Future research is needed which seeks to replicate these findings and/or develop methods to continue rigorous testing of symptom assessment apps as they evolve.

More details about the study are available in the report.

Source: RealWire
