Living Language

 

Everything’s coming up roses!  Or more precisely, Arabidopsis thaliana, a somewhat nondescript white flower selected by biologists as the model in the plant kingdom for genetic research.  Related to broccoli and cauliflower, A. thaliana is the most studied plant in human history; every week new papers are published about its properties.  In an astounding leap forward, the journal Nature just announced that the entire genome for A. thaliana has been sequenced, a first for the vegetal world.  As plants go, it was a relatively easy task; the complete genome runs to 120 MB of information, compared to 1.6 GB for wheat and a hefty 3 GB for humanity.  What makes this discovery just a bit different from the ever-increasing flow of genetic revelations is that, in another first, Nature has announced that all genomic information presented in its pages – and on its website – will be published in GEML, or Gene Expression Markup Language, a lingua franca defining a common standard for the bits of life.

 

Why the need for standards?  We need only look back at the roots of the current information age to understand the power of a common standard.  Back in 1990, a young and naive Tim Berners-Lee went to a hypertext conference at Versailles, hoping to garner support for a still nascent set of standards for hypertext information interchange.  He found a community – if you could call it that – of squabbling companies, each with its own “correct” approach to hypertext, none of them able to work with the others.  Hypertext had been around for nearly thirty years – since Ted Nelson began to work on Project Xanadu – but had gone nowhere, because this “insanely great” idea had inspired only competitiveness, avarice, and arrogance.  Berners-Lee left the conference disappointed, but he succeeded in convincing the powers-that-be at CERN, the giant European particle physics laboratory, to release his HTML and HTTP specifications freely, as an open standard.  Thus was the World Wide Web born.  It succeeded because it provided a common platform to answer the unprecedented pent-up demand to use computers and their ever-expanding networks as shared resources.

 

Similarly, as an early researcher in virtual reality, I saw an industry wither and die because every tiny little company – including my own – wanted to “rule the world.”  Although the principles behind virtual reality were well understood, and common ground should have been easy to create, no commercial enterprise could see any value in “sharing” their secrets.  We all assumed the market would be monolithic, controlled by a Microsoft-like enterprise which would define VR.  Killed by too many ambitions, virtual reality died on the vine, but out of the decaying remains, Tony Parisi and I wrote a standard language for virtual reality information, known as VRML, and contributed that specification to the Web community, knowing they had the right idea – about sharing ideas.  Although many people think VRML is dead, it’s actually very widely used as a common interchange format for three-dimensional information, and as part of the new MPEG-4 standard for multimedia, its reach truly is global, and growing.  Such is the power of standards.

 

Which brings us back to GEML.  It’s a DTD (Document Type Definition) for the common expression of genetic information.  Those of you who have done any Web design are likely familiar with another DTD – HTML – and its “tags”, those little bits of formatting information enclosed by the “< >” symbols.  In HTML there are tags such as TITLE (which gives a page its title), B (for bold), IMG (for images) and so forth.  GEML has its own tags, which define the kinds of data that interest geneticists.  Here’s a bit of example GEML:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE project SYSTEM "GEMLPattern.dtd">
<project name="Hsapiens-421205160837"
   date="07-12-1999 12:43:48"
   by="jzsmith"
   company="JZSmith Technologies">
   <pattern name="Hsapiens-421205160837">
      <reporter
         name="T89593"
         systematic_name="T89593"
         active_sequence="TACAGTGTCAGAATTAACTGTAGTC"
         start_coord="201">
         <feature number="6879">
            <position x="0.340707" y="0.508374" units="inches" />
         </feature>
         <gene primary_name="T89593" systematic_name="T89593">
            <accession database="n/a" id="T89593" />
         </gene>
      </reporter>
   </pattern>
</project>

       

 

While all of this is fairly unreadable – even by geneticists – it is easily read by a computer, and it might even look vaguely familiar if you’ve taken a peek at raw HTML.  The reporter tag records a sequence of bases (the four nucleotides, A, C, G and T, that spell out DNA) – TACAGTGTCAGAATTAACTGTAGTC – identified in a particular section of a gene.  The tags nested inside it say that the sequence is feature 6879, found at a specific position, and that the gene’s name is “T89593”.  GEML can also reference the database of genomic data from which this gene has been extracted, identify the species (in this case, Homo sapiens), and even name the company which lays claim to the gene.
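To make "easily read by a computer" concrete, here is a minimal sketch in Python that pulls the reporter details out of the sample using the standard library's xml.etree.ElementTree.  It assumes a trimmed copy of the sample with the DOCTYPE line left out (so no DTD file is needed) and the pattern tag properly closed.  The function name read_reporters and the dictionary keys are illustrative choices, not part of GEML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the article's sample. The DOCTYPE line is omitted so the
# parser does not need GEMLPattern.dtd (an assumption made for this sketch).
GEML_SAMPLE = """<?xml version="1.0"?>
<project name="Hsapiens-421205160837" by="jzsmith" company="JZSmith Technologies">
  <pattern name="Hsapiens-421205160837">
    <reporter name="T89593" systematic_name="T89593"
              active_sequence="TACAGTGTCAGAATTAACTGTAGTC" start_coord="201">
      <feature number="6879">
        <position x="0.340707" y="0.508374" units="inches" />
      </feature>
      <gene primary_name="T89593" systematic_name="T89593">
        <accession database="n/a" id="T89593" />
      </gene>
    </reporter>
  </pattern>
</project>
"""


def read_reporters(geml_text):
    """Return one dict per <reporter> element (hypothetical helper)."""
    root = ET.fromstring(geml_text)
    reporters = []
    for reporter in root.iter("reporter"):
        feature = reporter.find("feature")
        position = feature.find("position") if feature is not None else None
        gene = reporter.find("gene")
        reporters.append({
            "name": reporter.get("name"),
            "sequence": reporter.get("active_sequence"),
            "feature_number": int(feature.get("number")) if feature is not None else None,
            "x": float(position.get("x")) if position is not None else None,
            "y": float(position.get("y")) if position is not None else None,
            "gene": gene.get("primary_name") if gene is not None else None,
        })
    return reporters


if __name__ == "__main__":
    for record in read_reporters(GEML_SAMPLE):
        print(record["name"], record["feature_number"], record["sequence"])
```

A dozen lines of parsing are enough because the tags already say what each value means, and that is the whole argument for a shared format.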

 

There’s an immediate need for GEML’s standards-based genetic descriptions.  Although the human genome has been sequenced in its entirety (the work is presently being checked for errors, a process expected to continue for at least the next year), it remains little more than a bag of bits.  The four bases do describe the proteins which comprise us, but finding the protein definitions in a long sequence of bases is a little more difficult than finding a needle in a haystack.  Genomics companies identify promising sequences (immediately patenting them) without actually knowing if they define a “true” gene, one which describes a unique protein.  Many of the identified sequences overlap, claimed by multiple companies, but scientists have only the vaguest of ideas which sequences are meaningful and which are just genetic junk.

 

GEML, together with some clever computer programs, could help scientists greatly accelerate the process of winnowing the chaff from the grain of our genetics, allowing them to share their complementary (and often conflicting) databases of identified gene sequences to produce a more accurate map of ourselves.  Craig Venter, CEO of Celera Genomics, has openly speculated that mapping the human genome onto gene sequences could take the next 50 years; with GEML this estimate could easily be cut in half, provided that geneticists in competitive commercial organizations find it more profitable to share what they’ve learned than to keep it hoarded away and hidden from view.
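As a rough illustration of what one of those "clever computer programs" might do once databases share a format, the Python sketch below takes sequence records from several sources and reports which sequences are claimed by more than one of them.  Everything in it is an assumption made for illustration: the source names LabA and LabB and the second and third sequences are invented placeholders, not real data.  It only matches identical sequences; spotting partial overlaps would need proper sequence alignment, which is beyond this sketch.

```python
from collections import defaultdict


def claims_by_sequence(databases):
    """Map each sequence to the sorted list of sources that list it.

    `databases` maps a source name to a dict of {record name: sequence}.
    """
    claims = defaultdict(set)
    for source, records in databases.items():
        for sequence in records.values():
            claims[sequence].add(source)
    return {seq: sorted(sources) for seq, sources in claims.items()}


def contested(databases):
    """Return only the sequences that appear in more than one source."""
    return {
        seq: sources
        for seq, sources in claims_by_sequence(databases).items()
        if len(sources) > 1
    }


if __name__ == "__main__":
    # Invented placeholder records; only the first sequence comes from the
    # article's sample, and the source names are hypothetical.
    databases = {
        "LabA": {"T89593": "TACAGTGTCAGAATTAACTGTAGTC", "A2": "GGGTTTAAACCC"},
        "LabB": {"X1": "TACAGTGTCAGAATTAACTGTAGTC", "X2": "ACGTACGT"},
    }
    for sequence, sources in contested(databases).items():
        print(sequence, "claimed by", ", ".join(sources))
```

The comparison is trivial only because both sources describe their records the same way; with incompatible private formats, most of the effort would go into translation.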

 

The year 2001 is to the genomics industry what the year 1991 was to informatics.  The pieces are all in place for an incredible explosion in discovery, creativity and wealth.  But they’re locked behind the prison walls of fear.  Each genomics company rapidly sequences base pairs, hoping to identify a critical gene that will lead to a cure for cancer, or Alzheimer’s, or the common cold.   GEML could be the HTML of biology, a Rosetta stone, granting humanity unrestricted access to the stuff that we are made of. 

 

GEML isn’t alone.  It has a competitor, another DTD known as CellML, used to define the complex interactions which take place within cells.   CellML takes an integrated approach to describing all of the processes within a living cell – its genes, proteins, enzymes and chemical reactions, the pathways and connections between each part of the whole.  CellML seems well suited to the kinds of work that supercomputers do – creating simulations of incredibly complex systems – while GEML only defines the genetics that create the cell.

 

Neither GEML nor CellML may be the final word in this convergence between biology and information.  And, despite Metcalfe’s Law – which holds that the value of a network grows roughly as the square of the number of people using it – the CEOs of the genomics companies are at least a little afraid that if knowledge spreads too widely, their hard-earned advantages will slip away like water through their fingers.

 

Toward the end of Weaving the Web, Tim Berners-Lee’s tale of the birth of the Web, he touts the importance of designing a Web that’s machine readable.  While he agrees it’s well and good to have billions of web pages that humans can read, he argues that once computers can talk to each other – in their own languages – the power of the Web grows enormously, because many tasks can be automated, and sped up tremendously.   GEML is the foundation for an unprecedented acceleration in the search for genes that could unlock the secrets of aging, disease, and human development.  Perhaps Venter and the burgeoning crowd of genomics CEOs should follow the example of Berners-Lee, who gave everything away, and gained the whole world.

 

Mark Pesce is the author of The Playful World: How Technology is Transforming Our Imagination, recently published by Ballantine Books.