Real-time near infrared artificial intelligence using scalable non-expert crowdsourcing in colorectal surgery
This study was approved by The Ohio State University Institutional Review Board (IRB #OSU2021H0218). All patients provided written informed consent.
Video source and frame sampling
Surgical videos were obtained from a prospective clinical trial evaluating the utility of real-time laser speckle contrast imaging for perfusion assessment in colorectal surgery (IRB #OSU2021H0218). For the train dataset, source video clips were not prefiltered, and frames were extracted at regular intervals (1 frame per second and 1 frame per 30 seconds) to create a diverse set of training data and eliminate frame selection bias. For the test dataset, clips were extracted when the surgeon was assessing perfusion of the colon, and frames were extracted at 1 frame per second to minimize frame selection bias. The final video and frame counts are presented in Supplementary Table 1.
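As an illustration of this sampling step, the following is a minimal sketch of interval-based frame extraction using OpenCV; the function name, file paths, and fallback frame rate are illustrative assumptions rather than the exact extraction pipeline used in the study.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, interval_s: float = 1.0) -> int:
    """Save one frame every `interval_s` seconds from `video_path` into `out_dir`."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_s)), 1)   # number of source frames between samples
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved = 0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                       # fixed interval, no manual frame selection
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths): 1 frame per second for test clips,
# 1 frame per 30 s for full-length train videos.
# extract_frames("case_01.mp4", "frames/case_01", interval_s=1.0)
```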
Crowdsourced annotations
Crowdsourced annotations of bowel and abdominal wall were obtained using a gamified crowdsourcing platform (Centaur Labs, Boston, MA) that employs continuous performance monitoring and performance-based incentivization8. This methodology differs from standard crowdsourcing platforms such as Amazon's Mechanical Turk, which do not allow for such continuous performance monitoring and incentivization9. Previous implementations of crowdsourced annotation in surgical computer vision have typically utilized only the majority vote crowdsourcing parameter5.
Annotation instructions were developed to require as little specialized surgical knowledge as possible while following surgical data science best practices10. The instructions given to crowdsourced workers (CSW) included 13 training steps for each task, with 11 and 14 example annotations of abdominal wall and bowel, respectively (Fig. 1a). Four experts (two senior surgical trainees and two trained surgeons) provided expert annotations used to calculate training scores (TS) and running scores (RS). CSW were required to achieve a minimum TS, measured by intersection-over-union (IoU) against 10 expert annotations, before performing any annotations. An RS was calculated by intermittently testing the CSW in the same fashion. Annotations from CSW with a sufficient TS and RS were used in consensus generation. A minimum of 5 annotations (n) was required to generate the consensus crowdsourced annotation, and the majority vote parameter (MV) included only pixels annotated by 4 or more CSW for bowel and 2 or more for abdominal wall. A difficulty index (DI) was calculated for each frame using IoU, with values between 0 and 1 and higher values indicating greater difficulty (Supplementary Eq. (2), Methods). Quality assurance (QA) was performed by experts (two surgical trainees) on randomly selected frames above the difficulty review threshold (RT) of 0.4 (Fig. 1b).
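The pixel-level majority vote can be sketched as follows. This is an assumed implementation for illustration (the platform's internal consensus logic is not public), with the vote thresholds taken from the MV parameters described above.

```python
import numpy as np

def consensus_mask(masks: list[np.ndarray], min_votes: int) -> np.ndarray:
    """Keep only pixels annotated by at least `min_votes` of the individual CSW masks."""
    votes = np.sum(np.stack(masks).astype(np.uint8), axis=0)  # per-pixel vote count
    return votes >= min_votes

# Example usage with n = 5 individual boolean masks per frame:
# bowel_consensus = consensus_mask(bowel_masks, min_votes=4)   # bowel threshold
# wall_consensus  = consensus_mask(wall_masks,  min_votes=2)   # abdominal wall threshold
```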
SegFormer B3 framework and model training
SegFormer is a semantic segmentation framework developed in partnership with NVIDIA and Caltech. It was selected for the real-time implementation for its powerful yet efficient semantic segmentation capabilities, achieved by unifying transformers with lightweight multilayer perceptron decoders11.
Using the SegFormer B3 framework, we trained two versions of Bowel.CSS. Bowel.CSS was trained on the entire crowdsource-annotated 27,000-frame dataset (78 surgical videos). A second model, Bowel.CSS-deployed, was trained on a subset of the train dataset (3500 frames from 11 surgical videos) and optimized for real-time segmentation of bowel. This model was deployed in real time as part of an AI-assisted multimodal imaging platform (Methods).
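A hedged sketch of fine-tuning a SegFormer B3 model for this segmentation task with the Hugging Face transformers implementation is shown below; the checkpoint name, label set, optimizer settings, and placeholder data are illustrative assumptions, not the authors' exact training configuration.

```python
import numpy as np
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# Assumed label set: bowel and abdominal wall segmented against background.
id2label = {0: "background", 1: "bowel", 2: "abdominal_wall"}
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b3",                                  # SegFormer B3 encoder weights (assumed checkpoint)
    num_labels=len(id2label),
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)
processor = SegformerImageProcessor(do_reduce_labels=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)    # illustrative learning rate

# Placeholder data: in practice, replace with batches of surgical frames and consensus masks.
frames = [np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)]
masks = [np.random.randint(0, 3, (512, 512), dtype=np.uint8)]
train_batches = [(frames, masks)]

model.train()
for batch_frames, batch_masks in train_batches:
    inputs = processor(images=batch_frames, segmentation_maps=batch_masks, return_tensors="pt")
    outputs = model(pixel_values=inputs["pixel_values"], labels=inputs["labels"])
    outputs.loss.backward()          # cross-entropy on logits upsampled to label resolution
    optimizer.step()
    optimizer.zero_grad()
```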
Train and test dataset crowdsourced annotations and demographics
Train dataset frames (n = 27,000) were annotated by 206 CSW, yielding 250,000 individual annotations and 54,000 consensus annotations of bowel and abdominal wall. Of these CSW, 3% (7/206) identified as MDs and 1% (2/206) identified as surgical MDs. Test dataset frames (n = 510) were annotated by 48 CSW, yielding 5100 individual annotations and 1020 consensus annotations; 4% (2/48) of these CSW identified as MDs and none identified as surgical MDs (Fig. 1c, e, Supplementary Table 1, Methods).
To further characterize the “unknown” CSW demographics in this study’s crowdsource user population, Supplementary Table 3 presents CSW demographics for the entire annotation platform in 2022. The majority (59.7%) were health science students, and the most commonly listed reason for participating in crowdsource annotation was “to improve my skills” (57.3%). This supports the conclusion that most users on this platform are non-physicians and are not full-time annotators.
Crowdsource vs expert hours saved
A primary goal of using crowdsourced annotations is to reduce reliance on expert time, which is both rate-limiting and expensive. The average time for the three domain experts to complete a frame annotation of bowel and abdominal wall in the test dataset was 120.3 s. Using this average annotation time and the frame totals of 27,000 and 510, crowdsourcing saved an estimated 902 expert hours in the train dataset and 17 expert hours in the test dataset (assuming experts would not otherwise have been required to annotate the test dataset for this study).
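The hour estimates follow directly from the per-frame annotation time; a minimal check of the arithmetic:

```python
# Worked check of the expert-hours estimate using the figures reported above.
seconds_per_frame = 120.3
train_hours = 27_000 * seconds_per_frame / 3600   # ~902.3 h
test_hours = 510 * seconds_per_frame / 3600       # ~17.0 h
print(round(train_hours), round(test_hours))      # 902 17
```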
Annotation comparison statistics
The pixel-level agreement of both crowdsourced and Bowel.CSS annotations with expert annotations was assessed using accuracy, sensitivity, specificity, IoU, and F1 scores (Supplementary Fig. 1, Supplementary Eq. (1)). These are accepted metrics for measuring the accuracy of segmentation annotations in computer vision and surgical data science12.
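These pixel-level metrics can be computed from the confusion-matrix counts of a predicted mask and a reference mask. The sketch below uses the standard definitions and is assumed, not confirmed, to correspond to Supplementary Eq. (1).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, ref: np.ndarray) -> dict:
    """Pixel-level agreement metrics for two boolean masks of identical shape."""
    tp = np.sum(pred & ref)      # true positives
    tn = np.sum(~pred & ~ref)    # true negatives
    fp = np.sum(pred & ~ref)     # false positives
    fn = np.sum(~pred & ref)     # false negatives
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "iou":         tp / (tp + fp + fn) if tp + fp + fn else float("nan"),
        "f1":          2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan"),
    }
```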
Difficulty index
Difficulty of the annotation task was measured per frame using a difficulty index (DI), defined in Supplementary Eq. (2), which utilizes the average inter-annotator agreement between the individual CSW annotations and the crowdsourced consensus annotation as measured by IoU. This index is supported by evidence that lower inter-annotator agreement is an indicator of higher annotation difficulty when other factors such as domain expertise, annotation expertise, instructions, platform, and source material are held constant13,14. DI values range from 0 (100% inter-annotator agreement) to 1 (0% inter-annotator agreement). Values closer to 0 indicate easier frames, particularly when the annotation target is not visible and the “no finding” annotation is used, since “no finding” annotations are in 100% agreement. Values closer to 1 indicate harder frames with less agreement among the CSW.
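Based on this description, the DI can be sketched as one minus the mean IoU between each individual CSW mask and the consensus mask. The exact formulation is given in Supplementary Eq. (2), so the handling of empty “no finding” masks below is an assumption consistent with the text rather than the authors' definition.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.sum(a | b)
    return 1.0 if union == 0 else np.sum(a & b) / union   # empty vs empty: perfect agreement

def difficulty_index(individual_masks: list[np.ndarray], consensus: np.ndarray) -> float:
    """DI = 1 - mean IoU of individual CSW masks against the consensus mask."""
    agreement = np.mean([iou(m, consensus) for m in individual_masks])
    return 1.0 - agreement    # 0 = full agreement (easy frame), 1 = no agreement (hard frame)
```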
The DI was 0.09 for bowel and 0.12 for abdominal wall in the train dataset, compared with 0.18 for bowel and 0.12 for abdominal wall in the test dataset. The train dataset included full surgical videos, whereas the test dataset included only clips of surgeons assessing perfusion of the bowel; this led to a higher proportion of “no finding” annotations of bowel (22%) and abdominal wall (32%) in the train dataset versus 2.4% and 11%, respectively, in the test dataset. Because “no finding” annotations have low difficulty indices, the median difficulty of the train dataset was lower.
Real-time deployment of near infrared artificial intelligence
Advanced near infrared physiologic imaging modalities, such as indocyanine green fluorescence angiography and laser speckle contrast imaging, reveal levels of tissue perfusion beyond what is visible in standard white light imaging. These technologies are used in colorectal resections to ensure adequate perfusion of the colon and rectum during reconstruction, with the goal of reducing complications and improving patient outcomes. Subjective interpretation of physiologic imaging can be challenging and is dependent on user experience.
Bowel.CSS was developed to mask the physiologic imaging data to only those tissues relevant to the surgeon during colorectal resection and reconstruction, assisting with interpretation of the visual signal. The model's output was the bowel label only, and it was deployed in real time on a modified research unit of a commercially available advanced physiologic imaging platform for laparoscopic, robotic, and open surgery.
Bowel.CSS-deployed successfully segmented the bowel in real time at 10 frames per second during 2 colorectal procedures. Because the intraoperative labels were not saved from the procedures, 10-s clips from each procedure were sampled at 1 FPS (20 frames total), beginning when the surgeon activated the intraoperative AI model, to evaluate intraoperative performance. To assess accuracy, the model outputs of Bowel.CSS and Bowel.CSS-deployed were compared to annotations by one of three surgical experts (1 trainee and 2 board-certified surgeons) on these 20 frames using standard computer vision metrics (Supplementary Table 3).
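An illustrative real-time inference loop capped at roughly 10 frames per second is sketched below; the capture source, model path, device handling, bowel label index, and overlay step are assumptions about a generic deployment, not the vendor platform's actual implementation.

```python
import time
import cv2
import numpy as np
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical path to the fine-tuned Bowel.CSS-deployed weights.
model = SegformerForSemanticSegmentation.from_pretrained("path/to/bowel_css_deployed").to(device).eval()
processor = SegformerImageProcessor()

cap = cv2.VideoCapture(0)                  # placeholder for the laparoscopic video feed
with torch.no_grad():
    while True:
        start = time.time()
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = processor(images=rgb, return_tensors="pt").to(device)
        logits = model(**inputs).logits                       # (1, num_labels, H/4, W/4)
        upsampled = torch.nn.functional.interpolate(
            logits, size=frame.shape[:2], mode="bilinear", align_corners=False
        )
        pred = upsampled.argmax(dim=1)[0].cpu().numpy()
        bowel_mask = (pred == 1).astype(np.uint8)             # assumes label 1 = bowel
        # ... mask the near-infrared perfusion overlay with `bowel_mask` here ...
        time.sleep(max(0.0, 0.1 - (time.time() - start)))     # cap loop at ~10 frames per second
```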
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.