An artificial intelligence (AI) computer algorithm performed on par with, and in some cases exceeded, radiologists in reading mammograms in a case-control study of 8,805 women undergoing routine screening.

The algorithm – from the company Lunit, which was not involved in the study – had an area under the curve of 0.956 for detection of pathologically confirmed breast cancer.

When operating at a specificity of 96.6%, the sensitivity was 81.9% for the algorithm, 77.4% for first-reader radiologists, and 80.1% for second-reader radiologists. Combining the algorithm with first-reader radiologists identified more cases than combining first- and second-reader radiologists.

These findings were published in JAMA Oncology.

The study’s authors wrote that the algorithm results are a “considerable” achievement because, unlike the radiologists, the algorithm had no access to prior mammograms or information about hormonal medications or breast symptoms.

“We believe that the time has come to evaluate AI CAD [computer-aided detection] algorithms as independent readers in prospective clinical studies,” Mattie Salim, MD, of Karolinska Institute/Karolinska University Hospital in Stockholm, and colleagues wrote.

“The authors are to be commended for providing data that support this next critical phase of discovery,” Constance Dobbins Lehman, MD, PhD, of Massachusetts General Hospital and Harvard Medical School, both in Boston, wrote in a related editorial. She added that “it is time to move beyond simulation and reader studies and enter the critical phase of rigorous, prospective clinical evaluation.”
 

Study rationale and details

Routine mammograms save lives, but the workload for radiologists is high, and the quality of assessments varies widely, Dr. Salim and colleagues wrote. There are also problems with access in areas with few radiologists.

To address these issues, academic and commercial researchers have worked hard to apply AI – specifically, deep neural networks – to computer programs that read mammograms.

For this study, the investigators conducted the first third-party external validation of three competing algorithms. The algorithms were not named in the report, but after the study was published, Lunit announced that its algorithm was the top performer. The other two, which performed less well, remain anonymous.

The investigators compared the algorithms’ assessments with the original radiology reports for 739 women who were diagnosed with breast cancer within 12 months of their mammogram and 8,066 women with negative mammograms who remained cancer free at a 2-year follow-up.

The women, aged 40-74 years, had conventional two-dimensional imaging read by two radiologists at the Karolinska University Hospital during 2008-2015. The subjects’ median age at screening was 54.5 years.

The algorithms gave a prediction score between 0 and 1 for each breast, with 1 denoting the highest level of cancer suspicion. To enable a comparison with the binary decisions of the radiologists, the output of each algorithm was dichotomized (normal or abnormal) at a cut point defined by the mean specificity of the first-reader radiologists, 96.6%.
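As a rough illustration of that dichotomization step, the sketch below (in Python, using made-up scores rather than the study's data) picks the cut point as the quantile of scores among cancer-free exams that matches a target specificity, then reads off sensitivity on the cancer exams. The function name and simulated score distributions are purely illustrative assumptions.

```python
import numpy as np

def threshold_at_specificity(scores_negative, target_specificity=0.966):
    """Pick the score cut point so that the target fraction of cancer-free
    exams falls at or below it; exams scoring above it are called 'abnormal'."""
    # The specificity-matched threshold is the corresponding quantile
    # of scores in the cancer-free group.
    return np.quantile(scores_negative, target_specificity)

# Illustrative (made-up) scores, not study data
rng = np.random.default_rng(0)
neg_scores = rng.beta(1, 8, size=8066)   # exams that remained cancer free
pos_scores = rng.beta(4, 3, size=739)    # exams later diagnosed with cancer

cut = threshold_at_specificity(neg_scores, 0.966)
sensitivity = np.mean(pos_scores > cut)
specificity = np.mean(neg_scores <= cut)
print(f"cut point {cut:.3f}: sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
```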

At a specificity of 96.6%, the sensitivity was 81.9% for the Lunit algorithm, 67.0% for one anonymous algorithm (AI-2), 67.4% for the other anonymous algorithm (AI-3), 77.4% for first-reader radiologists, and 80.1% for second-reader radiologists.

The investigators also ran their analysis at a cut point of 88.9% specificity. The sensitivity was 88.6% for the Lunit algorithm, 80.0% for AI-2, and 80.2% for AI-3.

“This can be compared with the Breast Cancer Surveillance Consortium benchmarks of 86.9% sensitivity at 88.9% specificity,” the authors wrote.

The most potent screening strategy was combining the Lunit algorithm with the first reader, which increased cancer detection by 8% but came at the cost of a 77% increase in abnormal assessments.

“More true-positive cases would likely be found, but a much larger proportion of false-positive examinations would have to be handled in the ensuing consensus discussion,” the authors wrote. “[A] cost-benefit analysis is required ... to determine the economic implications of adding a human reader at all.”
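One simple way to picture the combined algorithm-plus-first-reader strategy is a "recall if either flags the exam" rule. The sketch below is an assumed illustration of that logic, not the study's analysis code, and the example calls are invented.

```python
import numpy as np

def combined_reader(ai_abnormal, reader_abnormal):
    """Recall the exam if either the algorithm or the human reader flags it.
    Both inputs are boolean arrays with one entry per screening exam."""
    return ai_abnormal | reader_abnormal

# Illustrative boolean calls for a handful of exams (not study data)
ai_calls     = np.array([True, False, True, False, False])
reader_calls = np.array([False, False, True, True, False])

print(combined_reader(ai_calls, reader_calls))
# -> [ True False  True  True False]
# Under this rule sensitivity can only rise, but so does the number of
# abnormal assessments sent to consensus discussion -- the trade-off the
# authors quantify as 8% more cancers detected vs. 77% more abnormal reads.
```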

The team noted that the Lunit algorithm was trained on images of South Korean women acquired on GE equipment.

“Although we do not have ethnic descriptors of our study population, the vast majority of women in Stockholm are White, and all images in our study were acquired on Hologic equipment,” the authors wrote. “In training AI algorithms for mammographic cancer detection, matching ethnic and equipment distributions between the training population and the clinical test population may not be of highest importance.”

As for why the Lunit algorithm outperformed the other two algorithms, one explanation may be that the Lunit algorithm was trained on more mammograms – 72,000 cancer and 680,000 normal images (vs. 10,000 cancer and 229,000 normal images for AI-2; 6,000 cancer and 106,000 normal images for AI-3).

As for next steps, the investigators are planning a prospective clinical study to see how AI works as an independent reviewer of mammograms in a day-to-day clinical environment, both as a third reviewer and to help select women for follow-up MRI.

The current study was funded by the Stockholm County Council. The investigators disclosed financial relationships with the Swedish Research Council, the Swedish Cancer Society, Stockholm City Council, Collective Minds Radiology, and Pfizer. Dr. Lehman’s institution receives grants from GE Healthcare.

SOURCE: Salim M et al. JAMA Oncol. 2020 Aug 27. doi: 10.1001/jamaoncol.2020.3321.
