Article - Issue 7, February 2001
Some engineering implications of the Human Genome Project
Professor Richard I. Kitney FREng
The aim of this article is to provide a brief overview of the importance of certain areas of engineering to the Human Genome Project and to the future of molecular biology. An appropriate starting point for discussing the human genome in this context is the work of Francis Crick and James Watson, whose paper on the double-helix structure of DNA was published in 1953. The model they constructed of the double helix was, in every sense, physical, comprising chemistry laboratory clamps and other bits and pieces of hardware. (This model is now on display in the Science Museum in London.)
Almost 50 years later, the whole area of molecular biology is developing at an increasingly rapid rate. Today it is one of the most important areas of basic medical science. It is no accident that this rapid development in molecular biology and, specifically, genomics has been paralleled by the development of computer technology. In 1953 there were only a handful of computers available; indeed, one estimate attributed to IBM at the time suggested that only ten computers would ever be needed worldwide! The contrast with the millions of computers in use today is spectacular. The key breakthrough now is the availability of supercomputer performance at relatively low cost, which enables genes to be sequenced by means of millions of calculations. Computer technology is, in reality, the foundation for much of the progress on the Human Genome Project.
Although the first phase of the Project – the sequencing of the human genome – was completed last year, there is still a very long way to go. One distinguished research worker described the present state of affairs as having clearly identified all the components of a Boeing 747-400, but not having any kind of plan telling us how to assemble the aircraft. We know what the components look like but not what they do. Hence, there is much work to be done in bridging the gap between what is known about the human genome and what is known about human physiology and pathophysiology, leading to diagnosis and treatment. It is therefore likely to be many years before total gene therapy becomes truly effective.
At the summer 2000 World Conference on Medical Physics and Biomedical Engineering, Dr Francis S. Collins (Head of the National Institutes of Health Human Genome Program) stressed the importance of engineering in the creation of both software and devices. One of the key areas of activity in the second phase of the Human Genome Project (likely to last for 20 years) will be data mining. Computers and software will be crucially important in bridging the gap between gene sequencing, a true physiological understanding of the human body and the application of this knowledge to clinical medicine.
This article concentrates on the principles involved in five key areas of engineering that are central to the future of genomics and molecular biology: databases, data transmission, data compression, distributed computing and display.
Database technology is essential in any information infrastructure and particularly in the context of genomics. At its simplest level database technology is needed in order to store and retrieve pieces of information (bottom left of diagram). If more detailed information needs to be extracted – simple forms of data mining – a relational database with a database management system (DBMS) can be used (top left of diagram). These databases can be interrogated by means of simple questions relating to alphanumeric data. Such databases are now widely available at relatively low cost and go under the general heading of Structured Query Language or SQL databases. This type of database allows very detailed combinations of information to be compiled from the raw information that is stored.
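To illustrate the kind of interrogation involved, here is a minimal sketch using Python's built-in sqlite3 module; the table layout and the gene records are invented for illustration, not taken from a real genome database.

```python
import sqlite3

# An in-memory relational database with a hypothetical table of gene records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, chromosome TEXT, length_bp INTEGER)")
conn.executemany("INSERT INTO genes VALUES (?, ?, ?)", [
    ("TP53", "17", 19149),
    ("BRCA1", "17", 81189),
    ("CFTR", "7", 188703),
])

# Compile a detailed combination of information from the raw records:
# all genes stored against chromosome 17, longest first.
rows = conn.execute(
    "SELECT symbol, length_bp FROM genes "
    "WHERE chromosome = '17' ORDER BY length_bp DESC"
).fetchall()
print(rows)  # [('BRCA1', 81189), ('TP53', 19149)]
```

The query language, rather than bespoke retrieval code, does the work of combining the stored records.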
With the advent of the far more detailed information produced by the Human Genome Project (for example, image data), it has become necessary to employ other types of database. Fortunately, the need for new database structures is not a problem which is confined to genomics. These databases (right-hand column of diagram) treat each piece of information, whether it is a string of numbers or an image, as an object or entity. Any object can be stored and retrieved. The management systems associated with these databases are known as ORDBMS (Object Relational Database Management Systems). As shown in the diagram, the ORDBMS can either simply retrieve data or ask more complex questions. The latter case is important, for example in the automatic retrieval of data to obtain gene sequence matches.
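The object-style storage described above can be sketched in the same way; here sqlite3 stands in for a full ORDBMS, and the 'image' is simply a placeholder byte string rather than real 2-D image data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, payload BLOB)")

# Any entity, whether a string of numbers or an image, is stored whole
# as an opaque object and retrieved whole.
image_bytes = bytes(range(256))  # hypothetical stand-in for image data
conn.execute("INSERT INTO objects VALUES (?, ?)", ("scan_001", image_bytes))

(retrieved,) = conn.execute(
    "SELECT payload FROM objects WHERE name = 'scan_001'"
).fetchone()
assert retrieved == image_bytes  # the object comes back byte-for-byte
```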
Because of the enormous amounts of data generated in gene sequencing, the process of retrieval and comparison may involve a number of databases at different locations. Some of these databases may simply contain alphanumeric data, whilst others may store 2-D and 3-D image data. Different types of information relating to the same sequence may be stored in two or more databases. At best, the retrieval of a range of information relating to the same sequence can be time consuming. At worst, the wrong information can be combined because of errors in the retrieval process. This problem is now being addressed by a new type of software known as a DataBlade, which can store information about the location of related data across a number of databases at different locations.
The key to any integrated information system which is geographically distributed is the use of telecommunications technology. Because of the costs involved, it is imperative that commercial telecommunications systems are used. There are currently a number of standard types of network available, from standard telephone lines through ISDN (128 kbit/s per line) to ATM (155 Mbit/s). Broadly speaking, the cost to the user of such systems is a function of the speed at which data is transferred. Currently, much of the information which is retrieved from databases at different locations is transmitted on a so-called ‘store and forward’ basis; for example, data is sent overnight when transmission is quicker.
However, much broader bandwidth networks are now being implemented, for example CALREN 2 (California Research and Education Network). CALREN 2 is run by a consortium of telecommunications companies, industrial companies and universities in California; its network links various sites between the San Francisco and Los Angeles areas at a bandwidth of 1 Gbit/s.
The difference in transmission rates is staggering. A conventional 64 Mbit chest X-ray can be transmitted via CALREN 2 from Stanford Medical Center to the USC Medical Center in 60 milliseconds; the same image would take 67 minutes to transmit via an ordinary telephone line. Similarly, a typical set of 80 MR brain scans will take 80 milliseconds instead of 87 minutes. Such bandwidths will become common throughout the US and EU within a few years and will allow large amounts of complex data from gene databases (at almost any location in Europe or the US) to be retrieved and displayed virtually in real-time.
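The quoted times are easy to check from the link speeds. In the sketch below the CALREN 2 rate is taken from the text, while the roughly 16 kbit/s effective telephone-line throughput is an assumption chosen to match the article's figures.

```python
def transfer_time_s(size_bits, rate_bits_per_s):
    """Idealised transfer time, ignoring protocol overheads and latency."""
    return size_bits / rate_bits_per_s

XRAY_BITS = 64e6   # 64 Mbit chest X-ray, as quoted in the text
CALREN2 = 1e9      # 1 Gbit/s
PHONE = 16e3       # assumed effective telephone-line throughput

print(transfer_time_s(XRAY_BITS, CALREN2))     # 0.064 s, i.e. tens of ms
print(transfer_time_s(XRAY_BITS, PHONE) / 60)  # ~66.7 minutes
```

The ratio between the two links, not the absolute figures, is the point: five orders of magnitude separate real-time retrieval from overnight store-and-forward.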
Compression software is employed in two principal areas: in the transmission of data and the storage of information in a database. Compression software is normally divided into two types, lossy and lossless. As its name implies, when a ‘lossy’ algorithm is used the data is irreversibly compressed. In many areas of application this is not a problem, but generally in medical and biological applications it is unacceptable. Consequently, lossless compression software is normally used in these application areas. Here data can be compressed for the purposes of transmission or storage and then decompressed without any loss of information. Because of the large amounts of data involved, lossless compression algorithms are very important. (Most algorithms of this type give a compression ratio of about 4:1.)
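A lossless round trip can be demonstrated with Python's zlib module (an implementation of the DEFLATE algorithm); the sequence data here is invented, and the ratio achieved on real genomic data will differ from this highly repetitive toy case.

```python
import zlib

data = b"ATGCGTAC" * 1000        # hypothetical repetitive sequence data
packed = zlib.compress(data, 9)  # maximum compression level
restored = zlib.decompress(packed)

assert restored == data          # no information lost in the round trip
print(len(data) / len(packed))   # compression ratio achieved
```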
Many of the tasks associated with gene sequencing and drug design are computationally intensive. It is likely that the amount of computation which is required in the future for a particular application will exceed the capabilities of local computers. Circumstances such as these are driving the development of a distributed computer environment, in which data is automatically transferred from a local computer to a supercomputer (which may be at a remote site), the calculations are performed there and the results transferred back to the local computer. An important example of a protocol which addresses this type of problem is CORBA (Common Object Request Broker Architecture).
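As a minimal sketch of the pattern (not of CORBA itself), Python's built-in XML-RPC machinery can stand in for an object request broker: a calculation is registered at the 'remote' end, and the local end sends data out and receives the result back. The GC-content function and sequence are invented for illustration.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def gc_content(seq):
    """'Remote' calculation: fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# The 'supercomputer' end: register the calculation as a remotely callable object.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(gc_content, "gc_content")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The local end: data goes out over the network, the result comes back.
port = server.server_address[1]
proxy = ServerProxy(f"http://localhost:{port}")
result = proxy.gc_content("ATGCGCGT")
print(result)  # 0.625
```

In a CORBA deployment the broker additionally handles location, language and platform independence; the request/response shape of the interaction is the same.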
In medical information technology, the display of information usually refers to the use of special types of user interface. The situation has been vastly simplified by the advent of web-browser technology which has allowed data and information to be displayed on a wide range of computers. It is important to understand that web-browser technology can be implemented on Intranets as well as on the Internet. Hence it is possible, for example, to use a distributed computing environment to process data and to use a much simpler, less powerful, computer to display the data. The figure here shows a crystal structure: the detailed and computationally intensive 3-D processing has been carried out by a remote supercomputer and only display and image manipulation is done locally. With a broadband network of the type described in the last section this is feasible even for large amounts of data.
Areas of application
The areas of engineering I have briefly described feed directly into BioInformatics applications for the Human Genome Project. About 50 Terabytes of data relating to the Human Genome have already been acquired. (One Terabyte equals 1 million Megabytes.) Celera Genomics, one of the leading companies in the field, estimates that the amount of information being processed is equivalent to 80,000 compact discs, which would take up almost one kilometre of shelf space. Clearly the key issue is to make sense of all this information by effectively mining data from the database. A large number of companies are already working to achieve this.
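The figures above can be cross-checked with simple arithmetic. In this sketch the 650 MB disc capacity and roughly 1 cm of shelf width per jewel case are assumptions; the 50 Terabytes and the 1 million MB per Terabyte come from the text.

```python
TOTAL_MB = 50 * 1_000_000  # 50 Terabytes at 1 million MB per Terabyte
CD_MB = 650                # assumed capacity of one compact disc
CASE_CM = 1.0              # assumed shelf width of one jewel case

discs = TOTAL_MB / CD_MB
shelf_km = discs * CASE_CM / 100 / 1000  # cm -> m -> km

print(round(discs))        # 76923 discs
print(round(shelf_km, 2))  # 0.77 km of shelving
```

Both results sit close to the article's round figures of 80,000 discs and almost one kilometre of shelf space.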
One of the main drivers in genomics is the change which has occurred in drug company strategy. In 1975 a large drug company spent on average 5% of its profits on R&D; today the figure is 22%. The average time taken for a drug to reach the market from concept stage is now eleven years (with international patents only having a lifespan of nineteen years). The average development cost for a single drug is currently around £200 million. It is estimated that it might be possible with some drugs to achieve £350 million extra sales by getting a drug to market one year early. Computer-based techniques such as 3-D visualisation are perceived as the way to achieve this. A significant amount of time can be saved in drug development using such methods. One example of this approach is the work of scientists who in the mid 1990s were researching drugs to treat bone tumours. Using computer-based BioInformatics techniques they were able to obtain gene sequence matches in a few weeks which would previously have taken months or even years using traditional wet laboratory methods.
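As a toy illustration of computational sequence matching, the standard library's difflib can rank candidate sequences against a probe; the sequences and names below are invented, and real sequence-alignment tools are far more sophisticated than this similarity ratio.

```python
import difflib

# Hypothetical stored sequences and a probe to match against them.
database = {
    "candidate_A": "ATGGCGTACGTTAGCCTA",
    "candidate_B": "TTTTAAAACCCCGGGGTT",
    "candidate_C": "ATGGCGTTCGTTAGCCTA",
}
probe = "ATGGCGTACGTTAGC"

def similarity(name):
    # Ratio of matching characters between probe and stored sequence.
    return difflib.SequenceMatcher(None, probe, database[name]).ratio()

best = max(database, key=similarity)
print(best)  # candidate_A
```

Scaled to millions of stored sequences, it is exactly this kind of exhaustive comparison that makes the computing infrastructure described earlier indispensable.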
The traditional approach to drug design has been broad spectrum; that is, drugs are used to target a particular problem in a broad range of patients. This situation is now rapidly changing. Current thinking is that it will be possible in the future to design drugs which can literally be customised according to the individual patient's genetic make-up. The customisation of drugs will be based on genetic differences between individuals known as single-nucleotide polymorphisms or SNPs. Protein interaction maps will be key elements in this development. This, in turn, requires advanced database technology and data processing, as well as signal and image processing, visualisation and virtual reality techniques.
When Frederick Sanger revealed for the first time the complete genetic information of a micro-organism in 1977, he was paving the way for the Human Genome Project. As Francis Collins pointed out in his lecture to the 2000 International Congress on Medical Physics and Biomedical Engineering, ‘human genetics, molecular biology and the whole genomics revolution will continue to require major input from engineering’.
Useful web sites
http://www.ncbi.nlm.nih.gov National Center for Biotechnology Information
http://www.sciam.com/explorations/2000/051500chrom21 Mapping Chromosome 21
http://www.pbs.org/wgbh/aso/databank/entries/do53dn.html Watson and Crick