Original article
Deep learning Radiomics of shear wave elastography significantly improved diagnostic performance for assessing liver fibrosis in chronic hepatitis B: a prospective multicentre study
  1. Kun Wang1,2,
  2. Xue Lu1,
  3. Hui Zhou2,3,
  4. Yongyan Gao4,
  5. Jian Zheng1,5,
  6. Minghui Tong6,
  7. Changjun Wu7,
  8. Changzhu Liu8,
  9. Liping Huang9,
  10. Tian’an Jiang10,
  11. Fankun Meng11,
  12. Yongping Lu12,
  13. Hong Ai13,
  14. Xiao-Yan Xie14,
  15. Li-ping Yin15,
  16. Ping Liang3,
  17. Jie Tian2,3,
  18. Rongqin Zheng1
  1. 1 Guangdong Key Laboratory of Liver Disease Research, Department of Medical Ultrasound, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
  2. 2 CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  3. 3 Department of the Artificial Intelligence Technology, University of Chinese Academy of Sciences, Beijing, China
  4. 4 Department of Interventional Ultrasound, Chinese PLA General Hospital, Beijing, China
  5. 5 Department of Medical Ultrasonics, Third Hospital of Longgang, Shenzhen, China
  6. 6 Functional Examination Department of Children’s Hospital, Lanzhou University Second Hospital, Lanzhou, China
  7. 7 Ultrasound Department, The First Affiliated Hospital of Harbin Medical University, Harbin, China
  8. 8 Ultrasound Department, Guangzhou Eighth People’s Hospital, Guangzhou, China
  9. 9 Department of Ultrasound, Shengjing Hospital of China Medical University, Shenyang, China
  10. 10 Department of Ultrasonography, The First Affiliated Hospital, Medical College of Zhejiang University, Hangzhou, China
  11. 11 Function Diagnosis Center, Beijing Youan Hospital, Affiliated to Capital Medical University, Beijing, China
  12. 12 Ultrasound Department, The Second People’s Hospital of Yunnan Province, Kunming, China
  13. 13 Ultrasound Department, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
  14. 14 Department of Medical Ultrasonics, Institute of Diagnostic and Interventional Ultrasound, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
  15. 15 Department of Ultrasound, Jiangsu Province Hospital of TCM, Affiliated Hospital of Nanjing University of TCM, Nanjing, China
  1. Correspondence to Professor Ping Liang, Department of Interventional Ultrasound, Chinese PLA General Hospital, Beijing 100853, China; liangping301{at}hotmail.com, Professor Jie Tian, CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; jie.tian{at}ia.ac.cn and Professor Rongqin Zheng, Guangdong Key Laboratory of Liver Disease Research, Department of Medical Ultrasound, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510630, China; zhengrq{at}mail.sysu.edu.cn

Abstract

Objective We aimed to evaluate the performance of the newly developed deep learning Radiomics of elastography (DLRE) for assessing liver fibrosis stages. DLRE adopts the radiomic strategy for quantitative analysis of the heterogeneity in two-dimensional shear wave elastography (2D-SWE) images.

Design A prospective multicentre study was conducted to assess its accuracy in patients with chronic hepatitis B, in comparison with 2D-SWE, aspartate transaminase-to-platelet ratio index and fibrosis index based on four factors, by using liver biopsy as the reference standard. Its accuracy and robustness were also investigated by applying different numbers of acquisitions and different training cohorts, respectively. A total of 654 potentially eligible patients were prospectively enrolled from 12 hospitals, and finally 398 patients with 1990 images were included. Analysis of receiver operating characteristic (ROC) curves was performed to calculate the optimal area under the ROC curve (AUC) for cirrhosis (F4), advanced fibrosis (≥F3) and significant fibrosis (≥F2).

Results AUCs of DLRE were 0.97 for F4 (95% CI 0.94 to 0.99), 0.98 for ≥F3 (95% CI 0.96 to 1.00) and 0.85 (95% CI 0.81 to 0.89) for ≥F2, which were significantly better than other methods except 2D-SWE in ≥F2. Its diagnostic accuracy improved as more images (especially ≥3 images) were acquired from each individual. No significant variation of the performance was found if different training cohorts were applied.

Conclusion DLRE shows the best overall performance in predicting liver fibrosis stages compared with 2D-SWE and biomarkers. It is valuable and practical for the non-invasive accurate diagnosis of liver fibrosis stages in HBV-infected patients.

Trial registration number NCT02313649; Post-results.

  • hepatitis B
  • cirrhosis
  • ultrasonography

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

Significance of this study

What is already known on this subject?

  • There are more than 93 million chronic hepatitis B (CHB) carriers in China, and accurate assessment of liver fibrosis is essential for patients with CHB.

  • Liver stiffness measurement (LSM) by two-dimensional shear wave elastography (2D-SWE) is widely applied, but different studies showed great variability of cut-off values for staging liver fibrosis.

  • Radiomics for quantitative analysis of medical images has been proven to be a powerful tool, but its application in 2D-SWE images for classifying liver fibrosis stages in patients with CHB has not been systematically studied.

What are the new findings?

  • The deep learning Radiomics of elastography (DLRE) showed diagnostic efficacy similar to that of liver biopsy for assessing cirrhosis (area under the receiver operating characteristic curve (AUC) 0.97) and advanced fibrosis (AUC 0.98), which was significantly better than that of LSM in 2D-SWE and biomarkers.

  • The diagnostic accuracy of DLRE improved as more 2D-SWE images were acquired from each patient.

  • There was no significant variation in DLRE performance when enough 2D-SWE images were used to train it, regardless of which hospitals the training images came from.

How might it impact on clinical practice in the foreseeable future?

  • As a non-invasive tool, DLRE in 2D-SWE may achieve a better overall diagnostic accuracy than LSM in 2D-SWE for assessing liver fibrosis stages in patients with CHB.

Introduction

HBV infection is a serious problem in China, where more than one-third of the world’s HBV-infected people (approximately 93 million) reside.1 Liver fibrosis is a progressive condition in chronic hepatitis B (CHB), and the accurate assessment of fibrosis is essential for prognosis, surveillance and management of patients with CHB.2

Liver biopsy is considered the reference standard for hepatic fibrosis staging. However, it is invasive and limited by sampling errors, interobserver variability and various potential complications.3 Biomarkers, such as aspartate transaminase-to-platelet ratio index (APRI) and fibrosis index based on four factors (FIB-4), are also used to assess liver fibrosis, but their diagnostic performance remains controversial in HBV-infected patients.4 Recently, liver stiffness measurement (LSM) based on non-invasive ultrasonic imaging technologies has been strongly recommended by many guidelines because of its effectiveness and feasibility in the evaluation of liver fibrosis.2 5

Two-dimensional (2D) shear wave elastography (SWE) is a new LSM technology with many advantages. Compared with transient elastography (TE), its application is not limited by ascites.6 It integrates B-mode imaging and a colour-coded tissue stiffness map in real time, so that non-target structures and artefacts can be effectively avoided to acquire more reliable LSM.7 Furthermore, it can also be used to detect focal liver lesions or assess liver morphological and blood flow changes.8 Therefore, 2D-SWE has been widely applied for the surveillance of HBV-infected patients in more than 400 Chinese hospitals in recent years. Several studies demonstrated that the diagnostic performance of 2D-SWE was comparable with or even better than that of TE or point SWE in assessing liver fibrosis.9 10 However, despite these advantages, LSM of 2D-SWE is still affected by many factors. Important criteria for defining the optimal region of interest (ROI) of LSM, distinguishing reliable from unreliable measurements and controlling the overall image quality are still ambiguous in guidelines. As a consequence, the cut-off of 2D-SWE values for identifying cirrhosis in HBV-infected patients showed great variability, ranging from 10.1 to 11.7 kPa in several studies.11–14 Therefore, the conventional strategy of using 2D-SWE values alone is likely to be insufficient for accurate assessment of liver fibrosis stages.

In contrast, an emerging technology named Radiomics can provide automated quantification of large amounts of image features (termed radiographic phenotypes) from medical images, which has the potential to uncover disease characteristics that cannot be appreciated by the naked eye.15 Radiomics has been proven to be useful in clinical oncology, where CT and/or MR images were acquired for analysis.16 17 We hypothesised that a distinctive radiomic technique might be able to use more valuable information from 2D-SWE images rather than relying on the 2D-SWE value alone, and thus might provide better liver fibrosis staging accuracy.

Only a few studies have applied radiomic methods to ultrasound images for chronic liver disease (CLD) diagnosis.18–21 They all successfully demonstrated the feasibility and potential benefits of using Radiomics for quantitative analysis of ultrasound images. However, these studies had some inherent limitations, such as the lack of liver biopsy as the reference, the lack of a thorough comparison between the proposed radiomic techniques and other well-established methods, the absence of a prospective multicentre design focused on HBV-infected patients, or the use of engineered (hard-coded) features for quantitative analysis, which is suited to relatively small sample sizes. Different from these studies, our study sought to investigate the diagnostic performance of a deep learning method, the convolutional neural network (CNN),22 in 2D-SWE images for liver fibrosis staging in multicentre patients with HBV infection. Deep learning radiomic methods can automatically learn features, held in the hidden layers of neural networks, from imaging data, and thus they do not need object segmentation or hard-coded feature extraction, but their application requires a relatively large amount of imaging data.23

Here, we successfully enrolled 398 patients from 12 hospitals in China, with 1990 2D-SWE images, which we believe were suitable for the application of the deep learning radiomic method. To the best of our knowledge, this is the first prospective multicentre study that applied the deep learning radiomic method to 2D-SWE images for staging liver fibrosis in patients with CHB. Furthermore, in this study, histology obtained from liver biopsy was used as the reference, and 2D-SWE and biomarkers were used for comparison with this new quantitative diagnostic strategy, named deep learning Radiomics of elastography (DLRE).

Patients and methods

Design and overview

This was a multicentre, prospective study. A new diagnostic approach named DLRE was used to assess liver fibrosis stages. Liver histology was used as the reference standard, and DLRE was compared with 2D-SWE, APRI and FIB-4. From January 2015 to January 2016, patients with CHB who provided informed consent to participate in this study were enrolled from 12 Chinese hospitals in different regions. This multicentre study was approved by the ethics committee of the principal investigator’s hospital and is registered at ClinicalTrials.gov (NCT02313649).

Patient enrolments

The inclusion criteria were as follows: (1) HBsAg positive for more than 6 months; (2) older than 18 years; and (3) liver fibrosis stage scheduled for liver biopsy assessment. The exclusion criteria were as follows: (1) accompanied by other liver diseases, including alcoholic CLD, haemochromatosis, autoimmune hepatitis or intrahepatic biliary tract disease; (2) coinfection with HIV or any other viral hepatitis; (3) previous liver transplantation; (4) antiviral treatment in the previous 6 months; (5) unqualified histological samples (length was smaller than 15 mm, or the portal tract number was less than 6); (6) missing important serological results; and (7) unsuccessful 2D-SWE measurements. The demographic and clinical data of the patients (gender, age, height, weight and body mass index (BMI)) were recorded.

Two-dimensional shear wave elastography

Measurements of the 2D-SWE value were obtained by using the Aixplorer US imaging system (SuperSonic Imagine, SSI, France). The protocol for performing 2D-SWE was described in our previous studies,24 and is also recommended by the latest EFSUMB guidelines.6 A B-mode ultrasound scan was first performed, and then 2D-SWE was performed in a well-visualised area that was free of large vessels. The size of the 2D-SWE ROI was 4 cm×3 cm, and it was located 1–2 cm under the liver capsule. A 2 cm diameter circular Q-Box ROI was placed in the 2D-SWE image, and the mean, maximum, minimum and SD of the elasticity within it were automatically calculated and displayed (figure 1A). Five independent 2D-SWE values and the five corresponding 2D-SWE images were obtained from each patient, and the median value was used for statistical analysis. Strict quality control was applied throughout the entire procedure. Only operators who had performed more than 300 abdominal ultrasound scans or more than 50 supervised 2D-SWE examinations were enrolled in this multicentre study, and they were all strictly trained for the 2D-SWE measurement using the uniform procedure.24 Measurements were considered as failed or unqualified when little or no signal was obtained in the 2D-SWE ROI for every acquisition. Two 2D-SWE operators with more than 1 year of 2D-SWE experience and 10 years of ultrasound operating experience were employed as quality controllers for reviewing all 2D-SWE images and excluding unqualified acquisitions.

Figure 1

Illustration of the two-dimensional shear wave elastography (2D-SWE) measurement and the deep learning Radiomics of elastography (DLRE) flow chart. (A) The top shows the 2D-SWE region of interest (ROI) (pseudocolour area), Q-Box (white circle area within 2D-SWE ROI) and DLRE ROI (red square area). The obtained 2D-SWE values are displayed on the right yellow box. The bottom is the corresponding B-mode ultrasound image. (B) An input layer (DLRE ROI) is analysed by using four convolution-pooling procedures (C1-P1 to C4-P4), and then last pooled maps are fully connected with 32 neural nodes to calculate its probability for classification. The neural nodes and other parameters of the convolutional neural network (CNN) model were automatically optimised by using all 2D-SWE images in the training cohort.

Liver biopsy

Liver biopsy was performed in the right lobe of the liver by using a 16 or 18 G needle (Bard Magnum, GA, USA) within 1 week of the 2D-SWE scan. All the biopsy specimens were transported to one centre and examined by two liver pathologists. Each of them had more than 6 years of work experience, and they were both blinded to 2D-SWE and clinical results. Unqualified samples, including those with a length of less than 15 mm and those with fewer than 6 portal tracts, were strictly excluded. Histological staging of fibrosis was based on the METAVIR scoring system, and the grades of ≥F2, ≥F3 and F4 indicated significant fibrosis, advanced fibrosis and cirrhosis, respectively.25
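The three binary classification tasks used throughout the study follow directly from the METAVIR stage. A minimal sketch of that mapping is given here; the names `METAVIR_STAGES`, `TASK_THRESHOLDS` and `binary_labels` are illustrative and do not come from the study.

```python
# Hypothetical helper: turn a METAVIR stage into the three binary targets
# (cirrhosis, advanced fibrosis, significant fibrosis) described in the text.
METAVIR_STAGES = ("F0", "F1", "F2", "F3", "F4")

# Minimum stage index that counts as "positive" for each task.
TASK_THRESHOLDS = {"F4": 4, ">=F3": 3, ">=F2": 2}


def binary_labels(stage):
    """Return {task: 0 or 1} for one histological stage such as 'F3'."""
    level = METAVIR_STAGES.index(stage)  # raises ValueError for unknown stages
    return {task: int(level >= threshold) for task, threshold in TASK_THRESHOLDS.items()}
```

For example, a patient staged F3 is negative for the F4 task and positive for the ≥F3 and ≥F2 tasks.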

Serological examinations

Serological examinations were performed within 1 week of 2D-SWE. The platelet count, fasting blood glucose, aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transpeptidase, total bilirubin, direct bilirubin, indirect bilirubin, albumin and prothrombin activity levels were recorded. Two biomarker models were employed and calculated as: $\mathrm{APRI} = [(\mathrm{AST}/\text{upper limit of normal AST}) \times 100]/\text{platelet count}\ (10^{9}/\mathrm{L})$ and $\text{FIB-4} = [\text{age (years)} \times \mathrm{AST\ (U/L)}]/[\text{platelet count}\ (10^{9}/\mathrm{L}) \times \mathrm{ALT\ (U/L)}^{1/2}]$.26 27
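The two biomarker indices can be computed directly from the recorded serological values. The sketch below implements the standard APRI and FIB-4 definitions; the function and parameter names are illustrative, and the upper limit of normal for AST is laboratory specific, so it is passed in as an argument.

```python
import math


def apri(ast_u_l, ast_upper_limit_u_l, platelets_1e9_per_l):
    """AST-to-platelet ratio index: (AST / ULN of AST) x 100 / platelet count."""
    return (ast_u_l / ast_upper_limit_u_l) * 100.0 / platelets_1e9_per_l


def fib4(age_years, ast_u_l, alt_u_l, platelets_1e9_per_l):
    """FIB-4: (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_1e9_per_l * math.sqrt(alt_u_l))
```

As a worked example with made-up values, an AST of 80 U/L against an upper limit of 40 U/L and a platelet count of 100×10^9/L gives an APRI of 2.0.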

Deep learning Radiomics of elastography

To apply Radiomics, the enrolled patients were randomly divided into a training cohort and a validation (or testing) cohort. The former is used to train the radiomic model and optimise its parameters; the latter is used to validate the performance of the generated model. In the training cohort, to reduce the potential bias caused by unbalanced data for binary classification, a strategy called data augmentation was applied before the training procedure.28 2D-SWE images in the training cohort were augmented through a number of random transformations, which increased the training data pool and reduced overfitting of the generated radiomic model.
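The text does not specify which random transformations were used, so the following is only a sketch of the general idea: the minority class is oversampled with randomly transformed copies until both classes are the same size. The horizontal flip and small horizontal shift, the helper names `augment` and `balance_by_augmentation`, and the representation of an image as a list of rows are all assumptions made for illustration.

```python
import random


def augment(image, rng, max_shift=2):
    """Return a randomly flipped and horizontally shifted copy of a 2-D image.

    The image is a list of equal-length rows. Flip and shift are assumed
    transformations; the study does not list the ones it actually used.
    """
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]  # horizontal flip
    shift = rng.randint(-max_shift, max_shift)
    width = len(out[0])
    shifted = []
    for row in out:
        if shift >= 0:
            shifted.append([0] * shift + row[:width - shift])  # pad on the left
        else:
            shifted.append(row[-shift:] + [0] * (-shift))  # pad on the right
    return shifted


def balance_by_augmentation(images, labels, rng):
    """Oversample the smaller class (labels 0/1) with augmented copies.

    Assumes both classes are present in the input.
    """
    by_class = {0: [], 1: []}
    for image, label in zip(images, labels):
        by_class[label].append(image)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    out_images, out_labels = list(images), list(labels)
    for _ in range(len(by_class[majority]) - len(by_class[minority])):
        out_images.append(augment(rng.choice(by_class[minority]), rng))
        out_labels.append(minority)
    return out_images, out_labels
```

Augmentation of this kind should be applied to the training cohort only, as the text describes, so that the validation images remain untouched.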

In this study, DLRE adopted the CNN method, one of the deep learning radiomic techniques, for the automatic analysis of 2D-SWE images. The three major operations of CNN are the convolution, activation and pooling, and the entire process can be divided into two steps, the forward computation and the back propagation.29 Finally, online supplementary figure 1 defines the termination of the process in building the CNN model. The detailed introduction and the mathematical descriptions of these operations and steps are demonstrated in the online supplementary materials.
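To make the three operations concrete, the sketch below runs one convolution, one activation and one pooling step on a small image in plain Python. It covers only the forward computation of a single feature map; back propagation is not shown. The ReLU activation and the unpadded, stride-1 convolution are assumptions, since the main text does not state these choices.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1).

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k_h) for b in range(k_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Assumed activation: negative responses are set to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

Chaining the three functions on a 6×6 image with a 3×3 kernel yields a 4×4 feature map after convolution and a 2×2 map after pooling.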

To apply DLRE, a square DLRE ROI of 250×250 pixels containing the entire 2D-SWE ROI was manually selected as the input layer (figure 1B), and then the CNN model was triggered. Four hidden layers (convolutional layers) were designed in the CNN, each followed by a max pooling layer that combines the neuron clusters of the prior layer into a single neuron in the next layer. The first hidden layer contained 16 feature maps, and each of the remaining three contained 32 feature maps, which were obtained by applying 16 or 32 convolution filters (3×3 pixels) to the prior layer. The pooling size was 2×2 pixels. At the end, a fully connected layer with 32 nodes was applied to connect every neuron in the fourth pooling layer, so that the binary classification could be calculated in the output layer in the form of probabilities (figure 1B). The DLRE model generated by using the training cohort of this study is available at http://www.casmi.science/index.php/s/WZrE61nXlrZupi9. Some 2D-SWE images of four patients can also be downloaded as examples for testing the DLRE model.
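The layer sizes implied by this description can be traced with a short calculation. The sketch below is a bookkeeping exercise, not the published model: it assumes unpadded stride-1 convolutions, non-overlapping pooling that discards odd remainders, a three-channel (colour) input and a single output unit, none of which are stated in the text. The function name `dlre_shape_walkthrough` is illustrative.

```python
def dlre_shape_walkthrough(input_size=250, in_channels=3,
                           filters=(16, 32, 32, 32), kernel=3, pool=2, fc_nodes=32):
    """Trace feature-map sizes and count trainable weights for the described CNN.

    Assumptions (not from the source): no padding, stride 1, floor division
    in pooling, in_channels=3, one output unit.
    """
    size, channels = input_size, in_channels
    shapes, params = [], 0
    for index, n_filters in enumerate(filters, start=1):
        size = size - kernel + 1                                    # unpadded convolution
        params += kernel * kernel * channels * n_filters + n_filters  # weights + biases
        channels = n_filters
        shapes.append((f"C{index}", size, channels))
        size //= pool                                               # 2x2 max pooling
        shapes.append((f"P{index}", size, channels))
    flattened = size * size * channels
    params += flattened * fc_nodes + fc_nodes                       # fully connected layer
    params += fc_nodes + 1                                          # single output unit
    return shapes, flattened, params
```

Under these assumptions the side length runs 250 → 248 → 124 → 122 → 61 → 59 → 29 → 27 → 13, so the fourth pooled layer holds 32 maps of 13×13 values that feed the 32-node fully connected layer. With padded convolutions the sizes would differ.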

Assessing the overall diagnostic accuracy of DLRE

Two-thirds of the enrolled patients were randomly selected, and their corresponding 2D-SWE images and histological results were used as the training cohort of DLRE. Images were sent directly to the input layer of the CNN model, so that the low-level to high-level features held in the hidden layers of the network were automatically extracted. DLRE then learnt these features to fine-tune its parameters and finally established its classification model for liver fibrosis staging. The 2D-SWE images and histological results of the remaining one-third of patients were used as the validation cohort to evaluate the diagnostic accuracy of DLRE. All five images acquired from each patient were employed in this assessment. The diagnostic accuracy of DLRE was compared with that of 2D-SWE and biomarkers. After that, all enrolled patients with CHB were further divided into subgroups according to their ALT, BMI and inflammation levels. Then, the diagnostic performances of DLRE and 2D-SWE were compared in different subgroups for each fibrosis stage (online supplementary materials).
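A patient-level split of this kind keeps all five images of a patient in the same cohort. The sketch below is one possible way to do it; the function name, the fixed seed and the choice to round the training size up are assumptions. Rounding up happens to reproduce the cohort sizes reported in the Results (266 and 132 patients out of 398).

```python
import math
import random


def split_patients(patient_ids, train_fraction=2 / 3, seed=0):
    """Randomly split patient IDs into (training, validation) lists.

    Splitting by patient rather than by image keeps every image of a
    patient in a single cohort. The seed is arbitrary.
    """
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = math.ceil(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]
```

With five images per patient, 266 training patients correspond to 1330 images and 132 validation patients to 660 images.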

Assessing the diagnostic accuracy versus the number of acquisitions

DLRE was trained by one, three and five 2D-SWE images of each patient in the training cohort, respectively, and then the corresponding three DLRE models were used to assess liver fibrosis stages in the validation cohort. As for using 2D-SWE values, one, three and five measurements of each individual were also separately employed for liver fibrosis classification. For each staging strategy, the diagnostic accuracy of using three images/measurements was compared with that of using one and five images/measurements, respectively (intrastrategy comparison). Moreover, for using the same number of images/measurements, the diagnostic accuracies of these two strategies were also compared in each classification of liver fibrosis stages (interstrategy comparison).
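For the 2D-SWE arm, the effect of the number of acquisitions can be illustrated by summarising only part of a patient's measurements. The sketch below takes the median of the first k values; the text states that the median of five values was used for the main analysis, but taking the first k measurements and using the median for k=1 and k=3 are assumptions. How DLRE combines several images of one patient is not described in this section, so it is not sketched.

```python
from statistics import median


def stiffness_from_k(measurements_kpa, k):
    """Summarise the first k liver stiffness measurements (kPa) by their median."""
    if not 1 <= k <= len(measurements_kpa):
        raise ValueError("k must be between 1 and the number of measurements")
    return median(measurements_kpa[:k])
```

With made-up measurements of 8.1, 9.4, 7.9, 10.2 and 8.8 kPa, the summary value is 8.1 kPa for k=1 and k=3, and 8.8 kPa for k=5.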

Assessing the diagnostic robustness of DLRE

There were 12 Chinese hospitals (coded as A–L) participating in this study. Three different training cohorts were composed of patients enrolled from different combinations of hospitals, whereas patients from the remaining hospitals constituted the three corresponding validation cohorts. These combinations were all random, but in all three cases we still kept about two-thirds of the enrolled patients for training and the rest for validation. Then, the diagnostic robustness of DLRE for liver fibrosis staging was evaluated through these different arrangements. All five 2D-SWE images of each patient were employed in this experiment.
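One way to draw such a hospital-level split is to shuffle the hospitals and add them to the training cohort until roughly two-thirds of the patients are covered. The sketch below does this; the function name, the tolerance of five percentage points and the retry loop are assumptions, and the study does not describe its own procedure beyond the combinations being random.

```python
import random


def hospital_split(patients_per_hospital, train_fraction=2 / 3,
                   tolerance=0.05, seed=0, max_tries=1000):
    """Pick a random set of hospitals holding about train_fraction of patients.

    patients_per_hospital maps a hospital code to its patient count.
    Returns (training hospitals, validation hospitals), both sorted.
    """
    rng = random.Random(seed)
    total = sum(patients_per_hospital.values())
    hospitals = sorted(patients_per_hospital)
    for _ in range(max_tries):
        rng.shuffle(hospitals)
        train, n_train = [], 0
        for hospital in hospitals:
            if n_train / total >= train_fraction - tolerance:
                break  # enough patients collected for training
            train.append(hospital)
            n_train += patients_per_hospital[hospital]
        if abs(n_train / total - train_fraction) <= tolerance:
            validation = [h for h in hospitals if h not in train]
            return sorted(train), sorted(validation)
    raise RuntimeError("no hospital combination met the target fraction")
```

Calling the function with three different seeds gives three training and validation arrangements in which no hospital contributes to both cohorts.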

Statistical analysis

Descriptive statistics were summarised as mean±SD or median and IQR. Comparisons between groups were made with the Student’s t test or Mann-Whitney U test, as appropriate, for quantitative variables and with the χ² test or Fisher’s test for qualitative variables. Area under the receiver operating characteristic (ROC) curve (AUC) was used to estimate the probability of the correct prediction of liver fibrosis stages. Differences between AUCs were compared by using the DeLong test. Sensitivity, specificity, positive and negative predictive values, and positive and negative diagnostic likelihood ratios were calculated. All statistical tests were two sided, and p values less than 0.05 indicated statistical significance. The statistical analyses were performed using SPSS software for Windows, V.20.0 (SPSS) and MedCalc software (V.11.2; 2011 MedCalc Software bvba, Mariakerke, Belgium).
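The reported accuracy measures have simple definitions that can be written out in a few lines. The sketch below is illustrative only and is not the SPSS or MedCalc procedure used in the study: it computes the AUC through its Mann-Whitney interpretation and derives the other measures from a 2×2 table at a given cut-off. The function names are assumptions, and the DeLong comparison of AUCs is not implemented.

```python
def auc_mann_whitney(positive_scores, negative_scores):
    """AUC as the chance that a positive case scores higher than a negative one.

    Ties between a positive and a negative score count as one half.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def diagnostic_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, predictive values and likelihood ratios.

    A case is called positive when its score is >= cutoff. The sketch assumes
    that no denominator is zero (for example, specificity is below 1).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
    }
```

Both functions take the model output (a DLRE probability or a 2D-SWE value in kPa) as the score and the histological label as the reference.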

Results

Baseline characteristics

Between January 2015 and January 2016, 654 potentially eligible patients from 12 Chinese hospitals were prospectively enrolled in this study. Among them, 256 patients were excluded because of coexisting diseases, antiviral treatment or unqualified histological, serological and/or 2D-SWE results. Thus, 398 patients with 1990 2D-SWE images were finally enrolled for analysis (figure 2). The mean liver biopsy length of all patients was 17.7 mm.

Figure 2

The results of the multicentre patient enrolments. In total, 398 out of 654 patients from 12 Chinese hospitals were enrolled in this study. 2D-SWE, two-dimensional shear wave elastography.

After randomisation of these patients, 266 patients with 1330 images were assigned to the training cohort, and the other 132 patients with 660 images composed the validation cohort. Their characteristics are summarised in table 1. Between the training and validation cohorts, there were no significant differences in any baseline characteristic (p>0.05) or in the distribution of patients among the fibrosis stages (p>0.05).

Table 1

Baseline characteristics of patients

Overall diagnostic accuracy of DLRE in comparison with 2D-SWE, APRI and FIB-4

In the training cohort, DLRE demonstrated the highest diagnostic accuracy compared with all other methods for classifying F4, ≥F3 and ≥F2 (figure 3A–C), and the differences in AUCs were all statistically significant (p<0.001, table 2). AUCs of DLRE reached 1.00, 0.99 and 0.99 for the three stratifications, respectively, which were 0.13, 0.18 and 0.25 higher than those of 2D-SWE, which offered the second highest AUCs. The sensitivity and specificity analyses also demonstrated that DLRE was universally better than 2D-SWE and biomarkers (table 2).

Figure 3

Comparison of ROC curves between DLRE, 2D-SWE and biomarkers for the assessment of liver fibrosis stages in training and validation cohorts, respectively. (A, D) F0-F3 versus F4 (F4) in training and validation cohorts. (B, E) F0-F2 versus F3-F4 (≥F3) in training and validation cohorts. (C, F) F0-F1 versus F2-F4 (≥F2) in training and validation cohorts. 2D-SWE, two-dimensional shear wave elastography; APRI, aspartate transaminase-to-platelet ratio index; DLRE, deep learning Radiomics of elastography; FIB-4, fibrosis index based on four factors.

Table 2

Diagnostic performance of DLRE, 2D-SWE, APRI and FIB-4 for the assessment of liver fibrosis stages in training and validation cohorts

In the validation cohort, AUCs of DLRE dropped slightly for the diagnosis of F4 and ≥F3 (figure 3D,E), but they still reached 0.97 and 0.98, which were significantly higher than other methods (p<0.01 or p<0.001, table 2). However, the performance of DLRE for ≥F2 became much poorer than it was in the training cohort (figure 3F). AUC decreased from 0.99 to 0.85. It still demonstrated the highest AUC, and was significantly better than APRI (p<0.001) and FIB-4 (p<0.01), but no significant difference was found between DLRE and 2D-SWE (p>0.05, table 2).

For all 398 patients, the performances of DLRE and 2D-SWE did not show significant differences for ≥F2, ≥F3 and F4 with respect to different ALT and BMI levels (online supplementary tables 1 and 2 and figures 2 and 3). However, for F4, the AUC of 2D-SWE in the non-severe inflammation (A0-2) group was significantly higher than that in the severe inflammation (A3) group (0.88 vs 0.69, p<0.001), whereas no significant difference was found between AUCs of DLRE in different inflammation subgroups (online supplementary table 3 and figure 4).

Diagnostic accuracy versus number of acquisitions: intrastrategy and interstrategy comparison of DLRE and 2D-SWE

For the intrastrategy comparison, when 2D-SWE separately adopted one, three and five stiffness measurements of each patient to assess liver fibrosis stages, its diagnostic accuracy showed no significant variation for classifying F4, ≥F3 and ≥F2 (table 3). Their ROC curves overlapped each other in all three fibrosis staging cases (figure 4), which indicated that the sensitivity and specificity of 2D-SWE had no obvious correlation with the number of acquisitions. This phenomenon was confirmed in both training and validation cohorts (figure 4).

Figure 4

Comparison of receiver operating characteristic (ROC) curves between deep learning Radiomics of elastography (DLRE) and two-dimensional shear wave elastography (2D-SWE) using different number of image acquisitions/measurements (1, 3 and 5) of each patient for the assessment of liver fibrosis stages. (A, D) F0-F3 versus F4 (F4) in training and validation cohorts. (B, E) F0-F2 versus F3-F4 (≥F3) in training and validation cohorts. (C, F) F0-F1 versus F2-F4 (≥F2) in training and validation cohorts.

Table 3

Intrastrategy and interstrategy comparisons of DLRE and 2D-SWE for their relationship of diagnostic accuracy versus the number of image/measurement acquisitions in assessing liver fibrosis stages in training and validation cohorts

However, DLRE demonstrated a very different nature. Its diagnostic accuracy increased as more 2D-SWE images of each individual were added to the training procedure (table 3). This was particularly obvious in the assessment of F4 and ≥F3 (figure 4A,B,D,E), in which significant improvements of AUC were found from using one to three images in both training (F4, AUC: 0.94 vs 1.00, p<0.01; ≥F3, AUC: 0.91 vs 0.96, p<0.05) and validation (F4, AUC: 0.84 vs 0.96, p<0.001; ≥F3, AUC: 0.82 vs 0.95, p<0.01) cohorts. Although there were no statistically significant improvements from using three to five images, AUCs still increased in all cases, unless it already reached 1.00 with three images (table 3). For ≥F2, AUCs of DLRE increased in both cohorts, when more numbers of images were employed, but these increases were not significant.

For the interstrategy comparison, DLRE showed different performances in the training (figure 4A–C) and validation (figure 4D–F) cohorts. In the training cohort, AUCs of DLRE were significantly better than those of 2D-SWE in all stratifications when using the same number of images/measurements (table 3). However, in the validation cohort, DLRE offered accuracy similar to that of 2D-SWE when only one image/measurement from each patient was employed. If more images were adopted, DLRE outperformed 2D-SWE in the stratification of F4 and ≥F3 (all p<0.01), but it did not offer a significantly higher AUC for ≥F2.

Diagnostic robustness of DLRE

Three randomly selected combinations of hospitals were employed to establish three different training cohorts (with similar numbers of patients), so that three DLRE models, each with its own set of parameters, were obtained. For each fibrosis classification in either the training or the validation cohort, the three resulting ROC curves always overlapped each other (figure 5), and no significant differences were found (table 4). This revealed that DLRE demonstrated robust and consistent performance regardless of which hospitals the training data came from, as long as the number of enrolled patients in different training cohorts was fairly constant.

Figure 5

Comparison of receiver operating characteristic (ROC) curves between different combinations of hospitals for training deep learning Radiomics of elastography (DLRE) in the classification of liver fibrosis stages. (A, D) F0-F3 versus F4 (F4) in training (combination of hospitals B, D, G, E, H and J) and validation cohorts. (B, E) F0-F2 versus F3-F4 (≥F3) in training (combination of hospitals A, C and K) and validation cohorts. (C, F) F0-F1 versus F2-F4 (≥F2) in training (combination of hospitals A, G and K) and validation cohorts. Note: three ROC curves completely overlap each other in (A) and (C), as they all reach the optimal profile (area under the receiver operating characteristic curve (AUC)=1).

Table 4

Comparisons using different combinations of hospitals for training DLRE to classify liver fibrosis stages in training and validation cohorts

Discussion

In this multicentre prospective study, the diagnostic accuracy of DLRE, 2D-SWE and biomarkers in assessing liver fibrosis stages was compared against histology in patients with CHB. For assessing cirrhosis and advanced fibrosis, DLRE demonstrated significant improvements compared with 2D-SWE and biomarkers. In the training cohort, AUCs of DLRE reached 1.00 and 0.99, and in the validation cohort, they were 0.97 and 0.98, which indicated that DLRE provided diagnostic efficacy similar to that of the reference standard, liver biopsy. 2D-SWE showed the second highest diagnostic accuracy, with AUCs of 0.87 and 0.81 in the training cohort, and 0.86 and 0.85 in the validation cohort. AUCs of biomarkers were all ≤0.75 in both stratifications and both cohorts. In the assessment of ≥F2, DLRE (AUC: 0.99) still performed significantly better than the other methods (AUC: ≤0.74) in the training cohort. However, its accuracy decreased in the validation cohort (AUC: 0.85), where it did not differ significantly from 2D-SWE but was significantly better than biomarkers.

In order to investigate whether different levels of ALT, BMI and inflammation affected the performances of 2D-SWE and DLRE or not, stratification analysis in subgroups was performed. The results revealed that for F4, the inflammation grade did show significant impact on the performance of 2D-SWE, whereas its impact on that of DLRE was not significant.

These findings suggest that DLRE can be successfully used for the assessment of liver fibrosis stages in patients with CHB, and that it provides diagnostic accuracy comparable to the current reference standard in classifying cirrhosis and advanced fibrosis. Its diagnostic accuracy was higher than that of 2D-SWE, and it may overcome the influence of inflammation on cirrhosis evaluation, which could be a breakthrough in elastography diagnosis.

DLRE was built entirely on the analysis of 2D-SWE images using the Radiomics concept. It uses exactly the same images as 2D-SWE stiffness measurement, but it has two major advantages over 2D-SWE. First, for the manual initiation, the input layer of DLRE contained the entire 2D-SWE ROI, whereas 2D-SWE performed LSM inside the Q-Box, which was only a portion of the 2D-SWE ROI. Therefore, DLRE used the full 2D-SWE ROI (area about 10.5 cm²) rather than just the Q-Box (area about 3.1 cm²) for quantitative analysis. Second, DLRE employed the CNN method to achieve automatic feature extraction and deep learning in 2D-SWE images. Instead of solely measuring the average liver stiffness inside the Q-Box based on shear wave velocities, DLRE quantitatively analysed a large variety of features, extracted from 2D-SWE images by multiple hidden layers, that reflected the heterogeneity of intensity and texture of these images, and used them to classify liver fibrosis stages. This offered a more thorough and comprehensive assessment than using 2D-SWE values as a single diagnostic parameter. As a result, DLRE significantly improved the accuracy in the assessment of cirrhosis and advanced fibrosis.
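As a rough worked check of the two areas quoted in the paragraph above (both approximate, as stated there), the share of the 2D-SWE ROI covered by the Q-Box, and its inverse, are:

```latex
\frac{A_{\text{Q-Box}}}{A_{\text{ROI}}} \approx \frac{3.1\ \text{cm}^2}{10.5\ \text{cm}^2} \approx 0.30,
\qquad
\frac{A_{\text{ROI}}}{A_{\text{Q-Box}}} \approx \frac{10.5\ \text{cm}^2}{3.1\ \text{cm}^2} \approx 3.4
```

In other words, the Q-Box covers roughly 30% of the ROI, so the DLRE input spans about 3.4 times the area used for the Q-Box stiffness measurement.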

In the assessment of significant fibrosis, the performance of DLRE was worse in the validation cohort, even though it was very accurate in the training cohort. Many studies have found that differentiating F0-F1 from F2-F4 is more challenging.7 9 30-32 This is because the heterogeneity of liver fibrosis is more severe in ≥F2 than in ≥F3 and F4, which reduces the accuracy of all classification strategies in general, and DLRE was no exception. One possible way to overcome this challenge is to integrate multiple strategies for fibrosis classification. The current DLRE model still has considerable room for improvement. If DLRE can be further optimised and integrated with other approaches, such as LSM by 2D-SWE and biomarkers, it might achieve better performance in classifying ≥F2. Furthermore, only 16.3% of enrolled patients were in the F0-F1 stage in this study (table 1), a much smaller proportion than in the other stages. The unbalanced data further compromised the efficacy of DLRE. This was probably because the 12 participating hospitals were all high-level teaching hospitals from across China, so their patients were more likely to be in a severe condition. Since CNN requires a larger data volume for more complicated classification, DLRE may achieve better accuracy in assessing significant fibrosis if the F0-F1 sample population can be expanded in future studies.
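To illustrate the scale of this imbalance, the sketch below computes inverse-frequency class weights, one common way of compensating for under-represented classes when training a classifier. This is not something the study reports using. The function name is hypothetical, and the counts are invented solely to reproduce the 16.3% F0-F1 share quoted above.

```python
from typing import Dict


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by N / (K * n_c), where N is the total sample count,
    K the number of classes and n_c the count of class c. Rarer classes
    receive larger weights."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Hypothetical counts chosen only to match a 16.3% F0-F1 share (not study data).
counts = {"F0-F1": 163, "F2-F4": 837}
weights = inverse_frequency_weights(counts)
print({label: round(w, 2) for label, w in weights.items()})
```

With these proportions, each F0-F1 case would need to count roughly five times as much as an F2-F4 case for the two classes to contribute equally, which gives a sense of how under-represented the F0-F1 group is.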

The second finding of our study was that DLRE was highly dependent on data volume. When more 2D-SWE images were acquired from each patient to train the DLRE model, diagnostic accuracy improved systematically in the assessment of all fibrosis stages in both training and validation cohorts. Unlike DLRE, 2D-SWE did not show any significant differences with the increase of data volume, which was consistent with our previous studies.33 The EFSUMB guideline suggests at least three measurements per individual for assessing liver fibrosis using elastography.7 Coincidentally, with three acquisitions per person, our study demonstrated that DLRE significantly improved the diagnostic accuracy for assessing cirrhosis and advanced fibrosis compared with 2D-SWE. Moreover, the results also suggested that five acquisitions might be even better when using DLRE.

Last but not least, DLRE showed remarkable robustness in this multicentre study. When three randomly selected combinations of hospitals were used to build training cohorts, no significant variation was found for classifying liver fibrosis in training and validation cohorts using DLRE, and the diagnostic accuracy (table 4) matched the overall diagnostic accuracy of DLRE (table 2). These findings indicated that DLRE was robust and reliable, which is valuable for clinical generalisation in China. Using data acquired from a limited number of hospitals to train and establish the DLRE model is likely to be sufficient for applying it to assess liver fibrosis stages in other hospitals with consistent accuracy.
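The hospital-level split behind this robustness check can be sketched as follows. The three training combinations are those named in the figure caption. Everything else is an assumption made for illustration: the record layout, the function name, the labelling of the 12 centres as A-L, and the choice to place all remaining hospitals in the validation cohort.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_hospital(
    records: Sequence[Record], training_hospitals: Sequence[str]
) -> Tuple[List[Record], List[Record]]:
    """Put records from the chosen hospitals in the training cohort and
    all other records in the validation cohort."""
    chosen = set(training_hospitals)
    train = [r for r in records if r["hospital"] in chosen]
    valid = [r for r in records if r["hospital"] not in chosen]
    return train, valid


# Made-up records, one per hospital, with centres labelled A-L (assumed).
records = [{"hospital": h, "patient_id": i} for i, h in enumerate("ABCDEFGHIJKL")]

# Training combinations as listed in the figure caption.
combinations = {
    "F4": ["B", "D", "G", "E", "H", "J"],
    ">=F3": ["A", "C", "K"],
    ">=F2": ["A", "G", "K"],
}

for task, hospitals in combinations.items():
    train, valid = split_by_hospital(records, hospitals)
    print(task, len(train), len(valid))
```

Splitting by hospital rather than by individual patient keeps every validation case from a centre the model has not seen, which is the property a claim of cross-hospital generalisation depends on.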

To the best of our knowledge, this is the first prospective study to compare the diagnostic accuracy of deep learning Radiomics on 2D-SWE images, 2D-SWE and biomarkers for liver fibrosis in patients with CHB who underwent liver biopsy. About 2000 images obtained from 12 hospitals were included, which, we believe, also makes this the largest study so far investigating Radiomics for diagnosing liver fibrosis stages with 2D-SWE. Strict quality control was applied to all image acquisition and histological analysis in every individual. Furthermore, this study enrolled only HBV-infected patients, as a single-disease investigation, to eliminate unnecessary interference. The final results showed that applying DLRE for the quantitative analysis of 2D-SWE images offered valuable benefits for diagnosing liver fibrosis in patients with CHB. Once the DLRE model is established, operators only need to perform a standardised selection of the DLRE ROI in the daily workflow of 2D-SWE to conduct such analysis, which makes it easy to apply clinically.

Besides ours, we found only one study that applied Radiomics to 2D-SWE analysis. Gatos et al reported a multicentre study (126 patients) that adopted 35 hard-coded radiomic features extracted from 2D-SWE images to distinguish patients with CLD from healthy people.20 The AUC reached 0.87 for the proposed machine learning method. However, their machine learning was fundamentally different from our deep learning approach, and their method was neither used to assess liver fibrosis stages nor compared with any other diagnostic strategies.

The major limitations of our study were the limited population size, the unbalanced distribution of the patient population and the still developing DLRE method. Future studies need to involve more patients with CHB on a larger scale and to achieve an equal distribution of patients across all fibrosis stages, so that the deep learning model can be better trained. The model itself also needs to be further optimised with better engineering design and further developed with more comprehensive integration of other clinical data, such as serological results. All these efforts can be made so that more of the available data can be thoroughly analysed to enhance the overall performance of DLRE and achieve more accurate non-invasive diagnosis of liver fibrosis stages. Besides these limitations, our study did not investigate the performance of DLRE in classifying patients with CHB from different ethnic populations or patients with other aetiologies (chronic hepatitis C, non-alcoholic fatty liver disease, and so on), nor its efficacy with different commercial 2D-SWE systems, all of which are worthy of further study.

In conclusion, this study demonstrated that DLRE was more accurate than 2D-SWE in assessing cirrhosis and advanced fibrosis, and more accurate than biomarkers in assessing all three liver fibrosis stages in patients with CHB. With more imaging acquisitions from each patient, DLRE provided increased diagnostic accuracy. With different training cohorts, DLRE also showed excellent robustness. All of these findings suggest good potential of DLRE for clinical generalisation. Further studies in larger patient populations with a balanced patient distribution are still needed.

Acknowledgments

We thank Doctor Rongkui Luo and Doctor Jing Zhao from Zhongshan Hospital Fudan University for their help with the pathological diagnoses.

References

Footnotes

  • KW, XL, HZ and YG contributed equally.

  • Contributors Study conception: RZ, JT and PL. Data collection: KW, XL, HZ, YG, JZ, MT, CW, CL, LH, TJ, FM, YL, HA, XYX and LY. Data analysis: KW, XL, HZ and YG. Administrative support: RZ and PL. Manuscript drafting: KW, XL and HZ. All authors read and approved the final version of the manuscript.

  • Funding The work is supported by the National Key Research and Development Program of China under Grant No 2017YFA0205200, the National Natural Science Foundation of China under Grant Nos 81227901, 61231004, 61671449 and 61401462, and Beijing Municipal Science and Technology Commission No Z161100002616022.

  • Competing interests None declared.

  • Patient consent Obtained.

  • Ethics approval Chinese PLA General Hospital.

  • Provenance and peer review Not commissioned; externally peer reviewed.