Identification of the DNA structure as a double stranded helix consisting of two nucleotide chain molecules was a milestone in modern molecular biology. This discovery has led to a rapidly accumulating body of genomic information. The dramatic developments in genetic analysis during the past decade have placed genomics at the forefront of life science research. Large multinational projects (for example, the human genome project) have generated a host of sequence, mapping, and expression data. At the same time, new technologies for automated sequencing, data analysis, genotyping, expression, and protein analysis have been developed. An important result is identification of the genetic aetiology of many monogenic diseases and the enormous progress in the exploration of polygenic disorders.
This article reviews the possible applications and future developments of genomic technologies in disorders of the gastrointestinal tract.
Development of genomic technology
In recent years, significant progress in the elucidation of the pathophysiology and aetiology of gastrointestinal disorders has been made. Novel experimental therapies which specifically target single steps in disease relevant regulatory pathways have been designed and are in different stages of clinical development. Examples are inhibition of the gastric proton pump in peptic ulcer disease1 and anti-tumour necrosis factor α therapy which has been developed for Crohn's disease and rheumatoid arthritis.2-4 This process was made possible by technological advances in cell biology, biochemistry, and microbiology and their application to disorders of the gastrointestinal tract. While this methodological approach has produced substantial advances in our understanding of gastrointestinal pathophysiology and increased available treatment options, there are currently several new challenges that call for the development of novel research techniques.
- The pursuit of hypothesis driven functional research limits the scope of discovery to known genes and pathways and may therefore hamper the elucidation of novel genes and regulatory mechanisms.
- Exploration of signalling cascades and their role in disease pathophysiology has reached an enormous degree of complexity. Therefore, distinction between primary and secondary events is impossible in most cases. Even the exact mechanisms of a successful therapeutic application of targeted immune therapies (for example, therapy of Crohn's disease with tumour necrosis factor binding monoclonal antibodies) are not completely clear.5 6 Application of exploratory techniques based on a hypothesis free approach using high throughput (“parallel”) methodologies may help to establish a hierarchy of known and new players in signalling pathways and to identify key events in the mechanism of anti-inflammatory therapeutics.
- A genetic—or familial—component has been identified in many gastrointestinal disorders, including gastrointestinal malignancies, inflammatory bowel disease, susceptibility to Helicobacter pylori related pathology, pancreatitis, bile stone formation, and others.7 8
Progress in the different fields outlined above may rely to different extents on the availability of genomic information and technology.
Methodological problems in the systematic exploration of disease pathophysiology
Taking a cell biological hypothesis to a specific test has for a long time been the only possible approach.9 10 The published experimental results from these studies are always biased towards a positive result with suppression of negative data. In addition, in vitro experimental systems are often highly sophisticated and results may not be fully applicable to the complex in vivo situation. Two recent examples may illustrate this situation: leukotrienes were regarded for a long time as the driving forces for chronic intestinal inflammation.11-13 Convincing and reproducible data were generated that showed upregulated leukotriene production which paralleled the activity of colonic inflammation in inflammatory bowel disease. It was also assumed that the anti-inflammatory action of glucocorticoids is mainly due to induction of lipocortin, a protein which blocks phospholipase A2 and therefore inhibits generation of leukotrienes.14 15 It took more than a decade until clinical trials with zileuton and other effective inhibitors of leukotriene synthesis showed that a dramatic reduction in leukotriene B4 did not result in a marked improvement in intestinal inflammation in ulcerative colitis.16 17 The role of leukotriene B4 in disease pathophysiology has since been re-evaluated. It now appears that the main anti-inflammatory action of glucocorticoids is by inhibition of activation of the transcription factor nuclear factor kappa B.18-21 Although inhibition of nuclear factor kappa B activation explains many of the pathophysiological and clinical findings in the analysis of glucocorticoid action, it is not clear if another, more important, mechanism of action may be discovered in future.
The reason for using genomic technology
A close interaction with cell biological and immunological agendas is expected to be promoted through a systematic hypothesis free analysis of gene expression and protein interactions. This article therefore describes examples of genomic technologies that allow the highly parallel evaluation of large numbers (hundreds to several tens of thousands) of known and unknown genes (in the form of ESTs—expressed sequence tags). This methodology therefore offers the possibility of a hypothesis free approach and may identify new targets for further in depth exploration and therapeutic interventions.
The search for genetic susceptibility factors in gastrointestinal disorders requires the use of technological approaches that are conceptually very different from pathophysiology based research. The most attractive potential of genomic exploration is differentiation between primary and secondary events in disease pathophysiology, which is extremely difficult (with the likely exception of infectious disorders) using the analytical methods of cell biology and immunology. The high degree of complexity and redundancy of regulatory cascades as well as the obvious disadvantages of hypothesis driven research efforts appear to be the limiting factors in the discovery of novel pathways. The redundancy of physiological regulation is clearly demonstrated by a series of genetically altered, gene deficient (“knock out”) mice which all develop a similar (for example, colitis-like) phenotype.22 23 Therefore, it appears unlikely that a primary cause of a complex disease can be identified by a “trial and error” algorithm which is used in most cases to dissect pathophysiology. The positional gene identification approach on the other hand—as demonstrated for example by identification of the haemochromatosis gene—has the potential to uncover entirely novel and unexpected molecules which will influence the pathophysiological understanding of the disease process.24
Tools for a genomic exploration of pathophysiology
A critical idea in the progress of the human genome project was the introduction of systematic sequencing of cDNA libraries.25 The public databases (http://www.ncbi.nlm.nih.gov) currently contain over 1.6 million of these fragments, which have been named expressed sequence tags (ESTs) (see glossary in table 1). Only 10–15% of all human genes have been defined so far. The availability of EST databases facilitates the identification of cDNA fragments ultimately resulting in the cloning of potentially novel genes. However, EST libraries also carry a large redundancy with many overlapping or partially identical fragments. Therefore, techniques have been developed to group ESTs into “UniGene” clusters (table 1) (representing only sequences which are unique in the library). These techniques are based on computational (in silico) methods using sequence homologies (http://www.ncbi.nlm.nih.gov, http://www.tigr.org) or on oligonucleotide fingerprinting analyses (see below).
The availability of mapping approaches that do not need polymorphic markers (that is, radiation hybrid mapping) has allowed a genomic localisation of the ESTs and the assembly of genome wide transcript maps.27 An example of the EST representation of a known gene (STAT6) and its position on the radiation hybrid gene map is detailed in figs 1 and 2. Some limitations of the EST based approach are instantly evident from fig 1: (i) the 5′ ends of genes are underrepresented in these databases and (ii) the complete coding mRNA sequence can rarely be derived from the EST data alone.
The aim of genomic expression analysis techniques is measurement of gene expression of (ideally) all human genes in a parallel and automated fashion. Hybridisation based techniques require in general certain scientific and technological steps.
- Generation of cDNA libraries representing the genes of interest.
- Clustering of the libraries to identify a non-redundant or “UniGene” set of transcripts.
- Representation of the identified transcripts on a hybridisation carrier, for instance through spotting of representative polymerase chain reaction products on membranes or glass slides28-30 or the use of representative oligonucleotides on a carrier.31
- Determination of the relative abundance (that is, expression) of the corresponding mRNAs (for instance, from diseased tissues) by complex cDNA hybridisations.
- Image and data analysis.
Alternatives to hybridisation based methods are representational cloning approaches such as serial analysis of gene expression (SAGE) (see table 1).32 These methods rely on identification of oligomers characteristic for each transcript that are identified by sequencing random linear assemblies of characteristic sequence elements. Particularly for identification of rare transcripts, this method requires large amounts of sequence to be generated for each experiment. A central repository for SAGE sequence data has been established at the National Center for Biotechnology Information (NCBI: http://www.ncbi.nlm.nih.gov/SAGE). It is not yet clear what importance this technique will attain relative to the hybridisation based expression analysis techniques.33 34
In this report we will focus on hybridisation based methods for expression analysis.
GENERATION OF LIBRARIES
For generation of an appropriate cDNA library, high quality mRNA has to be prepared from appropriate tissues. The libraries in the EST sequencing projects have been derived from mainly non-gastrointestinal organs and tissues (for example, brain, liver, lymphatic cells, muscle). However, genes relevant for gastrointestinal pathology may not be adequately represented. Until now only a few libraries have been generated from gastrointestinal organs. Therefore, collaborative projects to generate intestine and liver specific libraries have been started. In the first step, reverse transcribed cDNA molecules from tissue have to be randomly fragmented and inserted into the appropriate bacterial vectors. The size for a tissue specific cDNA library is typically a few hundred thousand clones. In order to reproducibly use the individual clones from a library, single colonies can be chosen in an automated way from agar plates and transferred into 384 well plates. The randomly distributed colonies are identified using image analysis software. The software selects clones to be picked on the basis of user specified criteria, including the roundness of colonies, their closeness to neighbouring clones, and others. Clone positions are identified, and coordinates are transferred to a robotic manipulation system which uses an automated arm to “pick” colonies. A single stainless steel tip is fired into the colony which sticks to the pin and is subsequently delivered into a microtitre plate. Current systems are capable of picking and inoculating approximately 3000 clones per hour into 384 well microtitre plates prefilled with growth media. The microtitre format of the libraries allows long term storage, analysis, and subsequent standardised retrieval of individual clones on the basis of the microtitre plate address.
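The selection step described above can be illustrated with a minimal sketch. It assumes that image analysis has already produced a centre position, area, and perimeter for each colony. The `Colony` fields, the roundness measure 4πA/P², both thresholds, and the row by row well numbering are illustrative assumptions and not the criteria of any particular picking robot.

```python
import math
from dataclasses import dataclass


@dataclass
class Colony:
    # Hypothetical output of the image analysis step (units: mm, mm^2).
    x: float
    y: float
    area: float
    perimeter: float


def roundness(colony):
    """Isoperimetric ratio 4*pi*A/P^2: 1.0 for a perfect circle, lower otherwise."""
    return 4 * math.pi * colony.area / (colony.perimeter ** 2)


def select_colonies(colonies, min_roundness=0.8, min_distance=2.0):
    """Keep colonies that are round enough and far enough from every neighbour."""
    picked = []
    for i, colony in enumerate(colonies):
        if roundness(colony) < min_roundness:
            continue
        nearest = min(
            (math.hypot(colony.x - other.x, colony.y - other.y)
             for j, other in enumerate(colonies) if j != i),
            default=math.inf,
        )
        if nearest >= min_distance:
            picked.append(colony)
    return picked


def well_address(index, plate_rows=16, plate_cols=24):
    """Map a running clone index to (plate number, well) on 384 well plates."""
    plate, offset = divmod(index, plate_rows * plate_cols)
    row, col = divmod(offset, plate_cols)
    return plate + 1, f"{chr(ord('A') + row)}{col + 1}"


colonies = [
    Colony(0.0, 0.0, math.pi, 2 * math.pi),    # isolated, circular
    Colony(10.0, 0.0, math.pi, 2 * math.pi),   # circular but touching a neighbour
    Colony(10.5, 0.0, math.pi, 2 * math.pi),
    Colony(20.0, 20.0, 1.0, 8.0),              # isolated but irregular
]
for index, colony in enumerate(select_colonies(colonies)):
    print(well_address(index), colony)
```

With these hypothetical inputs only the first colony passes both criteria. The plate address returned by `well_address` is what would later allow standardised retrieval of an individual clone.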
CLUSTERING OF LIBRARIES
The libraries derived as outlined above will contain many redundant clones (that is, fragments of cDNA representing different overlapping or identical parts of the same transcript; example shown in fig 1). As a prerequisite for efficient use in later experiments, this redundancy has to be eliminated. This can be achieved by direct sequencing of cDNA clones, by oligonucleotide fingerprinting, cross hybridisations of the library on itself, or a combination of these methods. The datasets generated by all methods can be used to produce a non-redundant set of cDNA clones from the complete cDNA expression library (for example, a “UniGene” representation of the library). At the same time, information on the relative abundance of each transcript is generated because highly expressed transcripts will be represented in the library by a larger number of clones. The information generated by oligonucleotide fingerprinting techniques allows the matching of clones to the theoretical (“in silico”) fingerprints of database sequences. Compared with EST sequencing, oligonucleotide fingerprinting can be conducted at 10-fold lower cost per clone. Therefore, much larger libraries can be analysed, which will result in the identification and accurate quantification of the expression level of a higher number of rarely expressed genes.35 In addition, the more complete sequence coverage increases the chance to detect internal sequence variants (mutations and alternative splice forms), and simplifies the identification of related sequences. Since known sequences can be directly identified, based on the match of the observed to the predicted fingerprint, and clones from clusters representing unknown genes can be partially sequenced, the combination of fingerprinting and EST sequencing currently represents the most effective approach to identifying new genes.
Analysis of both EST sequencing and oligonucleotide fingerprinting data requires complex statistics and results depend on the quality of the library and the experimental data. Therefore, “non-redundant” clusters of EST sequences (as shown in an example from the NCBI UniGene set in fig 1) may in reality represent multiple genes or may be parts of transcripts represented in other clusters also.
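The clustering idea can be sketched in a few lines under strong simplifying assumptions. Here a fingerprint is taken to be the set of probe oligomers found in a clone, similarity is measured with the Jaccard index, and clones are grouped by a greedy single pass against the first member of each cluster. The probe set, sequences, and threshold are invented for illustration; real analyses work from noisy hybridisation signals and use considerably more elaborate statistics, as noted above.

```python
def in_silico_fingerprint(sequence, probes):
    """Return the set of probe oligomers that occur as substrings of the sequence."""
    return frozenset(p for p in probes if p in sequence)


def jaccard(a, b):
    """Shared probes divided by all probes seen in either fingerprint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_clones(fingerprints, threshold=0.5):
    """Greedy clustering: a clone joins the first cluster whose representative
    (its first member) is similar enough, otherwise it founds a new cluster."""
    clusters = []  # list of (representative fingerprint, [clone ids])
    for clone_id, fingerprint in fingerprints.items():
        for representative, members in clusters:
            if jaccard(fingerprint, representative) >= threshold:
                members.append(clone_id)
                break
        else:
            clusters.append((fingerprint, [clone_id]))
    return clusters


# Hypothetical probes and clone sequences (single strand only, for brevity).
probes = ["ACGT", "GGCC", "TTAA", "CAGT"]
clones = {
    "c1": "ACGTGGCCTTAA",
    "c2": "GGCCTTAAC",
    "c3": "CAGTCAGT",
    "c4": "TTACGTGGCCTTAA",
}
fingerprints = {cid: in_silico_fingerprint(seq, probes) for cid, seq in clones.items()}
clusters = cluster_clones(fingerprints)

# Cluster size serves as a crude estimate of transcript abundance in the library.
for representative, members in clusters:
    print(sorted(representative), members, len(members))
```

Picking one clone per cluster gives a non-redundant set, and the cluster sizes (here 3 and 1) illustrate how abundance information falls out of the same analysis. The caveat in the preceding paragraph applies directly: a threshold that is too low merges distinct genes, and one that is too high splits a single transcript across clusters.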
SPOTTING ON HYBRIDISATION CARRIERS
To generate an expression array, DNA from all representative clones forming the non-redundant UniGene set has to be transferred to a filter or glass slide. Prior to spotting, the re-arrayed library is amplified by polymerase chain reaction rather than spotting the clones directly on the filters. This eliminates variations in the DNA amount which could be caused by different growth kinetics of the clones. DNA is spotted automatically in a full robotic set up from microtitre plates on nylon membranes or glass slides. An example of a spotting robot is demonstrated in fig 3. An alternative to attaching full length polymerase chain reaction products to a solid support is an oligonucleotide representation of each clone which can be generated and attached to a silicon based support medium.31 The latter approach is used for some commercial libraries including the ones manufactured by Affymetrix.
For expression analysis in diseased tissues or cell cultures, mRNA is reverse transcribed to cDNA which is then radioactively or fluorescently labelled and hybridised onto the array. Typical experiments require between 10 and 50 μg of total tissue RNA per experiment. This amount of RNA can for instance be obtained from 2–5 typical colonic biopsies. The results of the hybridisation are read by either direct exposure of x ray films, phosphor imaging, or fluorescent scanning.
These experiments involve hybridising reverse transcribed mRNA from a particular tissue, cell, probe, patient sample, etc., onto the UniGene array. The signals for each gene are then analysed to determine the relative abundance in the samples of mRNA derived cDNAs corresponding to each gene. Internal controls need to include back hybridisations with short oligomers, appearance of housekeeping genes in the array, and presence of control clones (for example, Arabidopsis thaliana) in the array. Controls also need to include the admixture of control RNA (for example, from A thaliana) to the tissue RNA probe.
Figure 4 shows a typical example from a complex hybridisation. The magnified view is from an array of 35 000 UniGene clusters from the IMAGE library.
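One way the spiked-in control RNA can be used is sketched below, as an assumption rather than a description of any specific analysis package: every spot signal is divided by the mean signal of the control spots on the same array, and the normalised values of two hybridisations are then compared as log2 ratios. The clone names and intensities are invented for illustration.

```python
import math


def normalise(signals, control_ids):
    """Scale each spot by the mean intensity of the spiked-in control spots
    and return the scaled values for the non-control spots only."""
    control_mean = sum(signals[c] for c in control_ids) / len(control_ids)
    return {
        clone: value / control_mean
        for clone, value in signals.items()
        if clone not in control_ids
    }


def log2_ratios(sample, reference, control_ids):
    """log2(sample / reference) for clones with positive signals on both arrays."""
    s = normalise(sample, control_ids)
    r = normalise(reference, control_ids)
    return {
        clone: math.log2(s[clone] / r[clone])
        for clone in s
        if clone in r and s[clone] > 0 and r[clone] > 0
    }


controls = {"spike1", "spike2"}  # hypothetical A thaliana control clones
reference = {"spike1": 100.0, "spike2": 100.0, "geneA": 200.0, "geneB": 50.0}
sample = {"spike1": 200.0, "spike2": 200.0, "geneA": 1600.0, "geneB": 100.0}

print(log2_ratios(sample, reference, controls))
# → {'geneA': 2.0, 'geneB': 0.0}
```

In this made-up example the second array is twice as bright overall, so the raw doubling of `geneB` disappears after normalisation, whereas `geneA` remains fourfold higher.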
BIOINFORMATICS FOR EXPRESSION ANALYSIS
Images obtained by these methods are highly complex and require the use of sophisticated analysis software, which is available from different academic and commercial sources (for example, from GPC-AG under www.gpc-ag.com). Image analysis still requires substantial human interaction until expression data for each clone are obtained. The statistical analysis and evaluation of expression data from the signals obtained from thousands of clones (that is, the definition and differentiation of clusters) is the subject of intensive statistical research.36 In particular, the statistical problems of false positive results (that is, due to multiple testing) are not yet adequately resolved.
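The scale of the multiple testing problem is easy to illustrate: if none of 35 000 clones were truly differentially expressed, a per clone significance threshold of 0.05 would still be expected to flag about 1750 of them by chance. The sketch below shows two generic corrections, the Bonferroni adjustment and the Benjamini-Hochberg step-up procedure. Neither is presented here as the method used by the authors or as a resolution of the problem; the p values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only tests with p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate: find the largest
    rank k with p_(k) <= k / m * alpha and reject the k smallest p values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Hypothetical per-clone p values from a two-group comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print(sum(p <= 0.05 for p in p_values))       # uncorrected: 5 clones
print(sum(bonferroni(p_values)))              # Bonferroni: 1 clone
print(sum(benjamini_hochberg(p_values)))      # Benjamini-Hochberg: 2 clones
```

The Bonferroni correction is the more conservative of the two; which error rate is appropriate for an exploratory expression screen is a judgment that depends on how the candidate genes will subsequently be verified.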
The term cluster analysis (first used by Tryon37) actually encompasses a number of different classification algorithms. The question is how to organise observed data into meaningful structures—that is, to develop taxonomies. A powerful exploratory data analysis tool is most likely the two way joining procedure38 which is useful in circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters. An example is the identification of clusters of patients that are similar with regard to particular clusters of similar measures of expression of certain genes. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus the resulting structure (clusters) is by nature not homogeneous. This may seem confusing at first and, compared with other clustering methods including joining (tree clustering and K-means clustering), two way joining is probably the least commonly used. However, two way joining may be the most powerful among these exploratory data analysis tools.
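The idea of organising both genes and patients at once can be approximated, for illustration only, by running ordinary joining (tree clustering) separately on the rows and on the columns of an expression matrix. This is not the two way joining procedure itself, which forms blocks of cases and variables simultaneously; it is a simpler stand-in, using average linkage and Euclidean distance as assumed choices, on an invented 4 × 4 matrix.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, n_clusters):
    """Average-linkage joining: repeatedly merge the two closest clusters
    until n_clusters remain. Returns lists of row indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Hypothetical expression matrix: rows are genes, columns are patients.
expression = [
    [9.0, 8.5, 1.0, 1.2],
    [8.8, 9.1, 0.9, 1.1],
    [1.0, 1.3, 7.9, 8.2],
    [0.8, 1.1, 8.4, 8.0],
]

gene_clusters = agglomerate(expression, 2)
patient_clusters = agglomerate(transpose(expression), 2)
print(gene_clusters)     # → [[0, 1], [2, 3]]
print(patient_clusters)  # → [[0, 1], [2, 3]]
```

Reading the two results together gives the kind of statement described above: patients 0 and 1 resemble each other with regard to genes 0 and 1, and patients 2 and 3 with regard to genes 2 and 3. With real data the blocks are far less clean, which is where the interpretive difficulty mentioned in the text arises.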
APPLICATION OF EXPRESSION ANALYSIS
Expression analysis can be applied to any regulatory phenomenon both in cell lines and cell cultures. Initial experiments have included the response of defined cells to certain stimuli such as serum on fibroblasts39 or interferon on a fibrosarcoma cell line.40 The complexity of the systems tackled has increased to aging processes in mouse muscle41 or to T cell activation in mice.42 Gastrointestinal tissues have been investigated in comparison with cancer (normal colon versus colon cancer).36 A pilot study of differential expression in inflammatory bowel disease tissue has been published as an abstract.43 We anticipate that a broader and systematic use will be made of these techniques in the coming years for exploration of gastrointestinal diseases. The potential applications include differential characterisation of expression patterns describing disease pathophysiology but also exploration of mechanisms of drug action. In longitudinal studies with short sampling intervals, repeated sampling, which is possible in the gastrointestinal tract with limited invasiveness, can be used to characterise expression patterns during the course of disease.
SYSTEMATIC INVESTIGATION OF PROTEIN INTERACTION WITH PROTEIN ARRAYS
A logical extension from expression analysis into cell function is complete analysis of all proteins, the proteome. Because of the enormous number of proteins which are potentially involved in the cause or pathophysiology of complex disorders, an automated approach, with very high throughput, is required for protein analysis. Until recently, it was not possible to analyse proteins using the same high density arrayed approach as for cDNA based expression studies. However, novel tools have been developed to generate protein arrays. Recombinant proteins are clonally expressed from arrayed cDNA libraries allowing direct connection from DNA sequence information on individual clones to protein products, and back again, on a whole genome level. This approach, which has been pioneered by Cahill and coworkers, makes translated gene products directly amenable to high throughput experimentation, generates a direct link between protein, expression, and sequence data, and facilitates additional generation of gene product-antibody catalogues.28
Proteomic research is highly dependent on the quality and extent of databases which are currently being further extended (examples include the SWISS-PROT, TrEMBL, and PROSITE databases). Database access is bundled through systems such as ExPASy (Expert Protein Analysis System, http://www.expasy.ch). ExPASy proteomic tools use annotations of the SWISS-PROT entries which reflect post-translational modifications as well as sequence variants which are important for prediction. The SWISS-MODEL server and its front end, the Swiss-PdbViewer, are protein modelling tools also suitable for non-expert use. Currently, more than 65 000 model structures have been modelled from over 200 000 known protein sequences.
Future applications in our laboratories will include identification of new autoantigens by screening of sera for novel antibody specificities and identification of new transcription factors using labelled random sequences from inflammation gene promoter regions.
Many pharmacological interventions produce response rates of only approximately 30–50%.2 44 45 Clinical characteristics are often not sufficient to predict a favourable drug response. One can therefore hypothesise that genetic predisposition may play a role in this situation. This genetic claim is, however, very difficult to substantiate in the absence of a molecular lead finding. The main difficulty is that heritability cannot be established with the typical methods of genetic epidemiology (familial clustering or concordance in monozygotic twins) as is possible for other polygenic traits. Systematic linkage approaches are not possible because the relevant families cannot be recruited. On the other hand, the signalling pathways of many drugs are often known in detail or can be explored by expression analysis. Thus corresponding genes can be specifically evaluated. In principle, the drug responder trait may be linked to one of the disease genes or, alternatively, may represent genes which are completely distinct from the actual disease genes. Because the therapeutic principle of most drugs is pathophysiology and not aetiology based, the latter scenario is more likely. A recent example of this approach is the association of variants in the ALOX5 promoter and the response to an investigational leukotriene synthetase inhibitor in asthma.45 A systematic investigation of a drug responder requires examination of all genes encoding relevant signalling molecules. At this point, expression analysis technology, as described above, may also be used to identify additional molecules involved in the mechanisms of drug response.
Genomic technologies to examine the genetic aetiology of gastrointestinal disorders
Genetic predisposition is suspected to play an important role in the aetiology of many gastrointestinal disorders. The spectrum of genetic determination spans from clearly inherited diseases such as haemochromatosis (classic Mendelian trait leading to high penetrance) to polygenic and highly complex disorders. A polygenic predisposition is suspected for the development of complications in infectious disorders such as hepatitis and H pylori induced gastroduodenal inflammation. Many clinically important gastrointestinal diseases including inflammatory bowel disease, coeliac disease, and diabetes mellitus have to be placed in the middle of this spectrum.46-48 It is assumed that genetic predisposition conveys the basic susceptibility but that epigenetic factors (for example, lifestyle or environment) are needed to a different extent as triggers for the actual manifestation of disease. The epidemiology of these “complex” disorders is characterised by familial clustering and increased concordance of the phenotype in monozygotic twins. Once heritability of a disease is established it will allow the use of a systematic genetic analysis in the search for the molecular variants that convey the predisposition. Systematic genome wide linkage analyses have resulted in the definition of genomic susceptibility regions in a number of these disorders which likely contain one or multiple disease causing mutations.49 The technology for these large scale experiments has been available for less than a decade. The next experimental step, identification of specific risk mutations in genes within the large linkage intervals, is much more complicated and has not yet found a resounding and widely applicable solution.
Current approaches include systematic candidate gene analysis50 and linkage disequilibrium studies in susceptibility intervals.51 Several requirements for successful execution of these experiments are only now becoming available through the progress of the human genome project. These include: (i) information on all genes in a genomic region which will be available with the complete genomic sequence of the human genome; (ii) the ability to perform large scale mutation detection experiments is expanded due to recent technical advancements in the design of automated sequencing machines (automated loading, capillary arrays instead of manually cast gels, HPLC based heteroduplex analysis); (iii) efficient methods for high throughput single nucleotide polymorphism typing methods such as Taqman or mass spectroscopy (MALDI-MS, matrix assisted laser desorption/ionisation-mass spectroscopy); and (iv) novel bioinformatics and statistical analysis methods that may allow more powerful gene focused analyses than the conventional association designs.
Researchers have established replicated linkage regions and association leads for a number of these disorders which may prove instrumental in successful positional gene identification.49
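For orientation, the conventional association design mentioned above reduces, in its simplest allelic form, to a 2 × 2 comparison of allele counts between patients and controls. The sketch below computes the Pearson chi-square statistic, its p value for one degree of freedom, and the odds ratio. The allele counts are invented, and the test is shown only as a baseline; it is not one of the newer gene focused methods referred to in the text.

```python
import math


def allele_association(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2 x 2 table of
    allele counts: (risk allele, other allele) in cases and in controls."""
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 100 case chromosomes and 100 control chromosomes.
chi2, p_value, odds_ratio = allele_association((60, 40), (40, 60))
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.2f}")
```

When many single nucleotide polymorphisms across a susceptibility interval are typed, each such test adds to the multiple testing burden, which is one reason replication in independent cohorts is needed.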
The technologies outlined above in our view represent tools that will have a major impact on gastroenterology based research in the future. However, these methods are still under development and have shortcomings that need to be resolved. This applies to all of the genomic technologies discussed. Major unclear issues include: procedures for standardisation and data analysis from the highly complex expression analysis experiments (applying to both SAGE or hybridisation techniques); conformational properties and degree of representation of the native protein structures on the protein arrays; and the appropriate experimental algorithm for the identification of actual risk genes for complex disorders. As of now, much of the work involved in finding these genes is of a highly exploratory nature.
Despite these problems, genomic technologies hold substantial potential for novel and important discoveries in gastrointestinal disorders. At the same time, genomic tools will not replace biochemical, cell biological, and immunological methods. For example, unknown transcripts which are identified through expression screening will have to be characterised. Full length mRNA, subcellular localisation, regulation, and precise interaction with other molecules will have to be determined. Ultimately, one wants to identify and validate single critical molecules that regulate large parts of signalling cascades and can therefore serve as successful drug targets.
The degree to which the apparent complexity of the pathophysiological regulation mechanisms can be reduced is not clear at this point. Researchers may also be left with systems of multiple genes and gene variants which are too complex to identify or predict the important drug targets or prognostic markers. Statistical and bioinformatics tools and methods to handle the enormous complexity will need to be developed. Recognised patterns are prone to false positivity because of their complexity and the multiple testing involved. Therefore, consistent replication and verification of any results in independent and well characterised secondary cohorts and experimental set ups (for pathophysiology) is imperative. We anticipate that genomic technologies will prompt a closer cooperation of clinical and molecular expertise. Ultimately, only a reliable and prospective clinical characterisation that matches the level of molecular sophistication will allow the robust definition and clinical use of a molecular epidemiology of complex gastrointestinal disorders.
This work was supported by grants from Deutsche Forschungsgemeinschaft (SCHR512/1–3, SCHR512/5–1, SFB 415), BMBF (Competence Network-IBD), and MFG.
Leading articles express the views of the author and not those of the editor and editorial board.
- Abbreviations used in this paper:
- EST, expressed sequence tag
- IMAGE, integrated molecular analysis of genomes and their expression
- NCBI, National Center for Biotechnology Information
- SAGE, serial analysis of gene expression