Introduction
Colorectal cancer (CRC) was once considered a rare disease in sub-Saharan Africa (SSA), but decades of globalisation have changed this narrative. Currently, CRC is the fifth most common cancer in SSA, and while CRC incidence and mortality are decreasing in some high-income countries, rates in SSA are on the rise.1 Because CRC develops from a benign precursor polyp over several years, early detection is critical to either prevent malignancy or detect it at an early stage when it is highly curable. Moreover, curative surgery has been shown to improve survival in an SSA setting.2 Unfortunately, more than 60% of patients in SSA present with stage 4 CRC, which has a <1% 5-year survival rate.3–5 In contrast, almost 40% of patients in the USA present with stage 1 CRC, which has a 5-year survival rate of 90%.6 7 Widespread population-based CRC screening programmes and tools (eg, faecal immunochemical test (FIT), colonoscopy) have improved early detection in high-income countries, but SSA-specific data, tools and screening programmes are currently lacking. There is an urgent need to develop more efficient approaches to CRC screening and early detection that do not rely heavily on trained healthcare personnel or specialised resources (eg, endoscopy, pathology), which are often scarce in low- and middle-income countries (LMICs).
Recent technological advances and developments in artificial intelligence (AI) and machine learning (ML) methods have the potential to transform global health, particularly for early detection and diagnosis of CRC in SSA. Researchers are collecting enormous volumes of data, and while data science applications are largely underdeveloped in Africa, many enabling factors are already in place. Developments in cloud computing, substantial investments in digitising health information, and robust mobile phone penetration have given many places in SSA the basics needed to initiate meaningful AI/ML applications.8 Businesses in SSA have already embraced technological change, leapfrogging high-income countries in the proliferation of mobile banking (eg, M-PESA, one of the first mobile banking systems for those in Africa with limited or no access to banks).9 Furthermore, intergovernmental agencies have convened high-profile meetings discussing the development and democratisation of AI solutions to address specific global challenges.10 11 The United Nations has highlighted the centrality of AI to achieving its Sustainable Development Goals.2 The National Institutes of Health in the United States has invested about US$74.5 million over 5 years to advance data science, catalyse innovation and spur health discoveries across Africa under its new Harnessing Data Science for Health Discovery and Innovation in Africa (DS-I Africa) programme.11 Given these resources and investments, the impact of AI/ML applications on healthcare in SSA is imminent.
Herein, we discuss how AI/ML tools could be leveraged to conduct population-based surveillance and improve the early diagnosis and prognosis of CRC in SSA. We highlight limitations to the currently available CRC screening programmes and tools in the SSA setting and provide two examples of potential AI/ML approaches: (1) Multianalyte Assays with Algorithmic Analysis (MAAA) for population-based surveillance and early detection and (2) pattern recognition and computer vision algorithms to guide diagnostic recommendations and prognosis. While CRC is the use case, we also discuss how current initiatives around data science capacity in Africa offer a platform to scale such AI-based solutions to other potential high impact areas such as maternal, newborn, and child health and the growing burden of non-communicable diseases (eg, other cancers, diabetes, cardiovascular disease) in Africa. Lastly, we highlight how these innovative solutions have the potential to impact health outcomes in high-income countries through reciprocal innovation.12–15
LIMITATIONS TO CURRENT CRC SCREENING TOOLS IN SSA
Screening programmes and policies around CRC prevention and detection are lacking in SSA. Furthermore, data on disease aetiology and prevalence are sparse, leaving practitioners with a limited knowledge base on the disease in their communities and inadequate access to evidence-based tools for screening and early detection. These limitations are understandable given the burden of infectious diseases that has historically afflicted SSA. However, as SSA experiences the epidemiological shift from infectious diseases to non-communicable diseases, such as CRC, the aetiology of the disease and the solutions to address the emerging CRC epidemic require SSA-specific data and approaches. Extrapolation of cancer screening recommendations from high-income countries to SSA is often inappropriate due to differences in demographics, disease epidemiology and resources. For example, average-risk screening for CRC has typically been recommended from age 50; however, the US Preventive Services Task Force, the American Cancer Society and the US Multisociety Task Force on Colorectal Cancer have recently recommended lowering the starting age to 45 years.16–18 In SSA, estimates from available data indicate that 19%–38% of CRC diagnoses are in persons <40 years of age, a stark contrast to the 1%–7% reported in high-income countries.19–21 The higher risk of early development of CRC, coupled with the recent lowering of the screening age, highlights the evolving epidemiology of CRC in younger adults and the need to tailor screening approaches to capture this cohort, particularly in SSA. There is urgency to address this need given that Africa’s population is projected to double by 2050, reaching nearly 2.5 billion (23% of the global population) with more than half of its population <25 years of age.22
Currently, several modalities exist for CRC screening and early detection. Colonoscopy can be used for CRC detection and intervention (eg, polyp removal), but SSA has limited endoscopic services. Mwachiro et al23 recently reported an overall endoscopy capacity in East Africa of 1.2 endoscopists, 1.2 gastroscopes and 0.9 colonoscopes per 100 000 people, values that are 1% to 10% of those in resource-rich countries. Non-invasive screening tests include faecal occult blood testing, FIT and stool-based DNA tests17 24; however, adoption of stool-based approaches remains suboptimal in both high-income countries and SSA.25–28 In addition, questions remain about the impact of high ambient temperature and endemic parasitic infection, as well as the practicality and cost-effectiveness of these approaches in SSA.29–31 Regardless, endoscopy is still needed for diagnosis and prognosis. Thus, early detection strategies that direct these limited services to those at the highest risk are paramount. With growing investments in technologies (eg, electronic health records and cloud computing) in SSA, the existing and expanding infrastructure can be leveraged to employ novel AI/ML methods to develop and validate surveillance tools that identify populations at highest risk for CRC in a more individualised or precise manner, as described below.
AI and ML approaches
MAAA as a population-based surveillance and early detection tool
Laboratory studies, such as complete blood counts (CBC) and comprehensive metabolic panels (CMP), are standard diagnostic tests ordered by clinicians, even in LMICs. These tests often contain subtle diagnostic clues; however, interpretation of laboratory studies is routinely subject to human error. Presymptomatic longitudinal CBC patterns may be imperceptible to clinicians but would be readily detectable by statistical algorithms or ‘prediction models,’ often referred to as Multianalyte Assays with Algorithmic Analysis (MAAA).32 Currently, proprietary MAAA exist that were built and validated in high-income countries; these MAAA use CBC and demographic data to identify patients at high risk of CRC.33–36 Similarly, we have developed a MAAA prediction model in a US cohort using longitudinal and single timepoint laboratory studies and patient characteristics (accepted to Digestive Disease Week 2022). Initially, we set out to develop and compare multiple MAAA to predict luminal GI tract cancers in a retrospective cohort of patients (n=148 158 with 1025 GI tract cancers) who had at least 2 CBCs within 2 years. Predictor variables included age, sex, race, body mass index, and the individual components of the CBC and the CMP. To incorporate longitudinal features, summary statistics (ie, maximum, minimum, slope and total variation) were calculated for each component of each subject’s CBC. Data were split into 70% training and 30% validation sets for analysis. For the 3-year prediction of GI tract cancers, the longitudinal random forest model performed the best with an area under the receiver operating characteristic curve (AUROC) of 0.750 (95% CI 0.729 to 0.771) and Brier score of 0.116, compared with the longitudinal logistic regression with an AUROC of 0.735 (95% CI 0.713 to 0.757) and Brier score of 0.205. The longitudinal logistic regression and random forest models outperformed the single timepoint logistic regression at 3 years, which had an AUROC of 0.683 (95% CI 0.665 to 0.701).
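The longitudinal summary statistics and the two performance measures named above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the least-squares definition of slope, the definition of total variation as the sum of absolute successive differences, and the example haemoglobin values are all assumptions made for illustration.

```python
from statistics import mean


def longitudinal_summary(times, values):
    """Summarise repeated measurements of one analyte for one patient.

    Assumed definitions (not taken from the article): slope is the
    least-squares slope of value against time; total variation is the
    sum of absolute differences between successive measurements.
    """
    if len(times) != len(values) or len(values) < 2:
        raise ValueError("need at least two paired measurements")
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    return {
        "max": max(values),
        "min": min(values),
        "slope": sxy / sxx if sxx else 0.0,
        "total_variation": sum(abs(b - a) for a, b in zip(values, values[1:])),
    }


def auroc(labels, scores):
    """Chance that a random case outscores a random non-case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return mean((p - y) ** 2 for y, p in zip(labels, probabilities))


# Hypothetical haemoglobin values (g/dL) at 0, 6, 12 and 18 months.
features = longitudinal_summary([0, 6, 12, 18], [14.0, 13.5, 13.0, 12.0])
print(features)

# Hypothetical outcomes (1 = cancer) and predicted risks for four patients.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(brier_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A lower Brier score indicates better-calibrated probabilities, whereas a higher AUROC indicates better ranking of cases above non-cases, which is why both are reported side by side.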
These findings are limited in that the MAAA predicts GI tract cancer, not CRC specifically, although just over half of patients with GI tract cancers had CRC (53.5%, n=548/1025). To date, this approach has not been validated in a low resource setting or SSA, where demographics and disease aetiology may differ, and longitudinal laboratory studies may not be readily available. In addition, CBC and CMP baselines likely vary across genetically diverse populations and can be influenced by the prevalence of infectious and chronic conditions, including other malignancies and genetic conditions such as sickle-cell disease that have distinct prevalence in different populations. While these previous studies provide proof of concept for the development of MAAA for CRC screening in SSA, it would be essential to develop and compare models that incorporate longitudinal and cross-sectional laboratory data to determine the performance and optimal specificity or sensitivity for the target populations.
CRC is particularly amenable to MAAA-guided early detection strategies for multiple reasons. First, CRC is highly curable when diagnosed at an early stage.6 Second, CRC is highly vascular and can produce very subtle chronic occult blood loss, which could be detected before symptoms develop using data from routine longitudinal CBCs and CMP and ML-based methods.34 35 37 Indeed, patients in SSA tend to present with late-stage CRC, being diagnosed after clinical presentation with symptoms.6 7 Third, MAAA can be tailored to local needs. For example, the decision threshold of a MAAA can be set to favour sensitivity or specificity, with corresponding changes in positive and negative predictive values, based on target populations (eg, age groups), resource availability, or sequential testing approaches (eg, MAAA and then FIT). Similar work has been done in other settings where resources were limited, particularly during COVID-19, where FIT-based quantitative screening thresholds were used to direct patients for endoscopic services.38 39 Finally, the costs and resources required of the patient and healthcare facility/provider are significantly lower because MAAA use routine laboratory tests collected in various clinical settings. This is particularly relevant given the significant improvements in laboratory medicine in Africa driven by efforts to combat HIV/AIDS. In addition to developing a competent workforce and innovative quality improvement programmes that saw more than 1100 laboratories enrolled and 44 accredited to international standards, several regional laboratory networks have also been established to support programme scale-up and disease surveillance.40 This infrastructure can support robust healthcare systems and combat emerging continental and global health threats, like CRC and other cancers.
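To make the tailoring of predictive values concrete, the following Python sketch shows how a decision threshold might be chosen to reach a target sensitivity, and how positive and negative predictive values follow from sensitivity, specificity and prevalence through Bayes' rule. The function names and all numbers are hypothetical and are not taken from any of the cited MAAA.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


def threshold_for_sensitivity(labels, scores, target):
    """Highest cut-off at which 'score >= cut-off' flags at least the
    target fraction of true cases (labels equal to 1)."""
    case_scores = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    if not case_scores or not 0 < target <= 1:
        raise ValueError("need at least one case and a target in (0, 1]")
    for flagged, score in enumerate(case_scores, start=1):
        if flagged / len(case_scores) >= target:
            return score


# Hypothetical risk scores: four patients with cancer, three without.
labels = [1, 1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.5, 0.3, 0.1]
cutoff = threshold_for_sensitivity(labels, scores, target=0.75)
print("cut-off for 75% sensitivity:", cutoff)

# The same hypothetical test (80% sensitive, 90% specific) at two prevalences.
for prevalence in (0.01, 0.10):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(prevalence, round(ppv, 3), round(npv, 3))
```

Because the positive predictive value falls sharply when prevalence is low, a programme with scarce endoscopy capacity might prefer a stricter threshold or a second-stage test (eg, FIT) before referral.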
However, studies have shown that even where diagnostic tests are available, they are not optimally used in managing patient care, and tools to bridge the diagnostic–treatment divide are needed.41 42 MAAAs offer one approach to help bridge this gap and can be coupled with tools ranging from simple paper-based aids (eg, nomograms) to more complex mobile app-based tools or lightweight, field-deployed (cloud-based) Laboratory Information Systems designed for use in LMICs.43 44 In addition to the use of MAAAs as a tool for CRC diagnosis, the approach could be adapted for the prediction of CRC prognosis and treatment outcomes as both the CBC and CMP profiles of patients have been associated with disease stage, metastasis and treatment outcomes.45–47
AI-based algorithms in pathology for early diagnosis and prognosis
After screening, accurate and timely diagnosis is critical to identifying appropriate treatment plans in cancer management. While CRC is diagnosed via clinicopathological assessment by a pathologist, the availability of such expertise and resources in SSA is minimal. A 2012 survey of 33 African countries found that 31 (94%) had fewer than one pathologist for every 500 000 people, and many had fewer than one pathologist for every 1 million people.48 These densities are less than 10% of those in high-income countries, like the USA, which had one pathologist for every 20 600 people in 2010. In addition to the lack of trained pathologists, access to immunohistochemical (IHC) reagents required for accurate and definitive diagnosis remains a significant hurdle. Unlike in infectious diseases, H&E-stained slides often do not suffice to make a precise diagnosis. Thus, a lack of efficient and reliable pathology services leads to delays and inaccurate reporting of results, which contributes to patients receiving inappropriate treatment. Patients may be prescribed medications that are expensive yet ineffective and sometimes even harmful in treating their cancer type. Recent advances in AI-based computer vision and pattern recognition algorithms that use routine H&E-stained whole slide imaging offer transformative tools well suited for early cancer diagnosis and prognosis in SSA.
Pattern recognition algorithms aim to detect abnormalities in cell and tissue samples faster, more accurately and more consistently. In clinical care, these tools can assist pathologists in diagnostic recommendations by pre-screening an image and identifying potentially problematic areas, including subtle features that may not be readily apparent to the eye. For example, the VIPR (Vectorising spatially-Invariant Pattern Recognition) algorithm and software is a fully operational application suite developed by the Data Visualisation Core of the National Institute of Diabetes and Digestive and Kidney Diseases’ Kidney Precision Medicine Project.49–51 VIPR uses semisupervised and unsupervised, pixel-level classification of digital whole slide image content, which allows for extremely high-throughput analysis of entire libraries of whole slide imagery. VIPR differs from conventional pattern recognition software by basing its core search on a series of concentric, pattern-matching rings rather than the more typical rectangular or square blocks. This approach takes advantage of the continuous symmetry of the rings, allowing for the recognition of features independent of rotation. By making use of massively parallel computational platforms to achieve the necessary speed and performance, VIPR performs direct integration of vectorised image data with other classes of patient data (eg, lab values, clinical phenotypic features, clinical course, outcomes), thus allowing for a more global assessment of health status and the biological potential of any given malignancy. The pixel-level precision and consistency of whole slide image classification exceed what is possible using subject matter expertise alone. Moreover, it has demonstrated high reproducibility across different fields of view of a single slide, different slides in the same case, and different cases entirely.49–51
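The reason ring-shaped sampling is insensitive to rotation can be shown with a toy Python sketch. This is not the VIPR implementation: the descriptor (mean intensity per concentric ring), the function names and the synthetic image are assumptions chosen only to illustrate the principle. Rotating a patch about its centre moves pixels around each ring but never between rings, so the per-ring summary does not change.

```python
import math


def ring_descriptor(image, centre_y, centre_x, n_rings):
    """Mean pixel value in each concentric ring around a centre pixel.

    A pixel belongs to ring r when the integer part of its distance
    from the centre equals r.
    """
    sums = [0] * n_rings
    counts = [0] * n_rings
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            dy, dx = y - centre_y, x - centre_x
            ring = math.isqrt(dy * dy + dx * dx)
            if ring < n_rings:
                sums[ring] += value
                counts[ring] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def rotate90(image):
    """Rotate a square image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


# Synthetic 7 x 7 integer "patch" with no rotational symmetry of its own.
patch = [[(3 * y + 5 * x * x + y * x) % 11 for x in range(7)] for y in range(7)]
rotated = rotate90(patch)

print(patch != rotated)  # the pixel grids differ
print(ring_descriptor(patch, 3, 3, 4) == ring_descriptor(rotated, 3, 3, 4))
```

A square-block descriptor that compares pixels position by position would treat the rotated patch as a different pattern, which is the limitation the ring-based search is described as avoiding.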
The VIPR tool was initially developed to interrogate tissue from patients with acute kidney injury or chronic kidney disease to define disease subgroups and identify critical cells, pathways, and targets for novel therapies. It has since proved to be highly effective for cancer detection and classification in colon cancer (figure 1) as well as haematology, breast cancer and lymphoma.49–51 Because VIPR has been designed as a turn-key system for automated objective assessment of H&E slides for disease diagnosis, it is suitable for deployment in settings where pathologists alone can effectively incorporate the tool into clinical workflow, without the need for the immediate response from an image analysis expert. Once histologically distinct regions are identified or ‘prescreened,’ image analysis algorithms can then be used to mine individual regions and aggregate them to predict malignant transformation, as described below.
Following disease detection from histopathology, disease grading and IHC classification are critical to classifying various subtypes of cancer and thus determining appropriate treatment. Another rapidly advancing area is the use of computer vision and deep learning to digitally phenotype histological slides to better understand treatment response and survival.52 These algorithms can complement the clinical interpretation of diseased tissue in which the underlying diagnosis has already been made. This approach was employed in an image analysis and data mining pipeline to identify histological features capable of differentiating between cancer and non-cancer lesions and the malignant transformation potential in gliomas (figure 2).52 Using whole slide imaging data from the Cancer Genome Atlas and companion clinical data for these specimens, we assessed the prognostic relevance of these histological discriminants.53 54 Histopathology image-derived measurements, such as cell morphologies and spatial patterns of cellular organisation, in combination with a bag-of-words (BoW) approach,53 55 were used to identify tissue subregions that have visually distinct properties (eg, nuclear morphology, patterns of spatial organisation) and were associated with time-to-malignant transformation. The BoW approach is akin to clustering image subregions (ie, patches) derived from the whole slide image of the tissue, with the resulting clusters forming a dictionary of visual ‘words’. Importantly, this dictionary achieved an AUROC (through cross-validation) of 0.76 to discriminate surrogates of malignant transformation. While this approach was developed in glioma, it offers one potential strategy to incorporate image features derived from routine H&E-stained slides into prognostic, predictive models of other cancers, such as CRC.
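The BoW idea of clustering patches and describing each slide by its mix of clusters can be sketched in a few lines of Python. This is an illustrative toy rather than the published pipeline: plain k-means stands in for whatever clustering was actually used, the two-dimensional patch features are invented, and the function names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def fit_dictionary(patch_features, initial_centroids, n_iter=10):
    """Plain k-means: the final centroids act as the visual 'words'."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for feature in patch_features:
            groups[nearest(feature, centroids)].append(feature)
        for i, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bow_histogram(patch_features, centroids):
    """Fraction of a slide's patches assigned to each visual word."""
    counts = [0] * len(centroids)
    for feature in patch_features:
        counts[nearest(feature, centroids)] += 1
    total = sum(counts)
    return [count / total for count in counts]


# Invented 2-D patch features (eg, nuclear size and nuclear density).
training_patches = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
words = fit_dictionary(training_patches, [training_patches[0], training_patches[2]])

# One new slide described by the share of its patches in each word.
slide_patches = [(0.2, 0.1), (4.9, 5.2), (5.0, 4.8), (5.3, 5.1)]
print(bow_histogram(slide_patches, words))
```

The resulting fixed-length histogram is what would then be passed, together with clinical covariates, to a prognostic model such as a survival regression.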
In addition to the above approach, deep learning algorithms leveraging popular architectures, such as ResNet, VGGNet, and Inception, are also being adopted in the context of cancer prognosis,56 57 providing a path to a ‘non-feature-engineering’ approach to image recognition and content mining. In tandem with recently developed methods for feature interpretability,58 these tools can be incorporated into clinical workflows. It is worth noting that modern computer vision techniques aim to adjust for multiple biases in data acquisition, image staining and related artefacts, contributing to the development and delivery of robust decision support algorithms.
Digital pathology lab systems and infrastructure are becoming more obtainable in SSA. For example, the VIPR software is open-source, and small, fully remote-operable microscopes capable of high-resolution imaging have become more affordable. Also, while these technologies can be computationally expensive (ie, requiring graphics processing units and storage for gigapixel histopathological scans), the emergence of cloud computing in SSA can transform innovation and efficiency around how data are used. Taken together, one could envision a scalable analytical pipeline that couples operational pattern recognition tools with image analysis algorithms for automated and democratised identification and prediction of CRC from routine H&E histology images. The current development of data science collaboratives in Africa could also facilitate the adoption and deployment of these tools, as well as MAAA-guided models for early detection and diagnosis of CRC as outlined below.
Future directions
In the era of value-based healthcare, AI/ML provides opportunities to improve access to care, reduce wastage, optimise resource utilisation and provide a mechanism for quality assurance of healthcare regarding CRC screening, diagnosis and management. Funding agencies (government, donors or commercial) are more likely to invest in a system whose outputs are easily measured and can be benchmarked against available resources. This is particularly important in SSA, where data-driven management of healthcare delivery is still a challenge. Routine use of AI/ML tools and their dissemination remains rare in high-income countries, not to mention LMICs. Advances in model performance characteristics have accelerated, but despite performing well in silico using retrospective data in a research setting, prediction models (ie, using logistic regression or AI/ML-based methods) rarely leave the exploratory domain for use in clinical or community settings. The development and deployment of AI/ML-based tools in SSA require addressing existing limitations in computing infrastructure and a lack of local data needed to support the creation of effective models. However, solving these problems will not automatically lead to widespread adoption. If we do not directly address the challenges of dissemination and adoption of these prediction models in a way that supports social justice and health equity, data science approaches will have minimal impact on the health of individuals and populations. The issues surrounding the development, deployment and adoption of AI/ML-based tools in LMICs, including SSA, have been extensively described elsewhere.59–61 Examples of some of the challenges and opportunities for leveraging AI-based approaches in SSA are provided in figure 3.
To address these challenges and increase the capacity to use and develop data science approaches in health research and innovation in Africa, the National Institutes of Health (NIH) recently launched a new Common Fund Programme: Harnessing Data Science for Health Discovery and Innovation in Africa (DS-I Africa).11 DS-I Africa builds on prior investments by the NIH Common Fund and its partners in the Medical Education Partnership Initiatives and the Human Health and Heredity in Africa (H3Africa) consortium to form a unique continental ecosystem that could be transformative, leveraging existing expertise to develop data tools and applications that can be shared, adopted, and harmonised globally. Creating a robust network of partnerships across the African continent and in the USA, including numerous national health ministries, non-governmental organisations, corporations and other academic institutions, the DS-I consortium includes seven research hubs (all of which are led by African institutions), seven research training programmes, four projects focused on ethical, legal and social implications of data science, and an open data science platform and coordinating centre.
Figure 4 depicts the synergistic initiatives within the DS-I Africa Consortium and highlights one of the research hubs to demonstrate how the hub aims to function as a scalable and sustainable data science platform in Kenya and within the greater DS-I consortium. The exemplar hub, UtiliZing Health Information for Meaningful Impact in East Africa Through Data Science (UZIMA-DS), will address three critical needs across the translational spectrum of data science: (1) harmonisation of multimodal data sources; (2) leveraging temporal patterns of data to identify trajectories through prediction modelling using AI/ML-based methods; and (3) engaging with key stakeholders to identify pathways for dissemination and sustainability of these models in target communities. While the initial health domains of UZIMA-DS address critical health issues in maternal, newborn and child health and mental health, the hub can serve as a model that can be scaled to other countries and health domains within the greater DS-I consortium.
Lastly, while global health research was traditionally characterised by a unidirectional exchange of innovation and expertise from high-income countries to LMICs, it is now well-recognised that these collaborations have ‘reciprocal value’. Because necessity often drives innovation, health tools that have been researched, developed, and implemented in LMICs can be adapted and adopted to address similar challenges in the USA and other high-income countries through ‘reverse innovation’.13–15 While empirically this is a nascent field, some early successes have been highlighted in areas such as antiretroviral treatment for HIV, cognitive impairment in older adults and mental health.62–65 Given the growing investments in data science infrastructure, the demonstrated openness to embracing technological change (ie, mobile banking proliferation), and the urgent need to develop more efficient approaches to cancer screening and early detection that do not rely heavily on trained healthcare personnel or specialised resources (eg, endoscopy, pathology), SSA is well poised to drive innovative AI-based solutions to augment the utilisation of specialised resources across the globe.
Summary
With the growing resources and investments in AI/ML-based tools in SSA, one could envision a CRC surveillance and diagnosis pipeline that employs MAAA for population-based surveillance and pattern recognition and computer vision algorithms to guide diagnostic recommendations and prognosis. These tools will need to be tailored to local needs based on available resources and testing approaches (eg, sequential testing with MAAA and then FIT) and key stakeholders will need to engage in the codesign of widespread implementation strategies (eg, community-based screening programmes, practitioner education, health policies). Future studies are required to compare the efficacy of these tools to existing CRC surveillance and diagnosis tools (eg, FIT) in SSA populations. Furthermore, these innovative solutions provide opportunities for the adaption and adoption of these approaches in high-income countries. While CRC was used as the use case, these tools could be expanded to other prevalent and emergent cancers (eg, liver, breast and cervical) or other non-communicable diseases that would benefit from lab-based MAAA and computer vision AI-based methods for automated objective assessment of disease diagnosis and prognosis.
Footnotes
AKW and EMW-H are joint first authors.
Twitter @AkbarWaljee, @ulyssesbalis
Contributors All authors were involved in manuscript writing and gave approval of the final version.
Funding Research reported in this publication was supported by the Office Of The Director, National Institutes Of Health (OD), the National Institute Of Biomedical Imaging And Bioengineering (NIBIB), the National Institute Of Mental Health (NIMH) and the Fogarty International Centre (FIC) of the National Institutes of Health under award number U54TW012089 (AA and AW).
Competing interests AGS has consulted for and received research funding from Exact Sciences. AR serves as member for Voxel Analytics and consults for Genophyll and Pact&Health. GHS is a founder of Anza Biotechnologies.
Provenance and peer review Not commissioned; externally peer reviewed.