Transtronics, Inc.



Genetic differences and information theory


How should we measure and understand genetic differences in a meaningful way?


OK, my curiosity has me again - I often read things and find that instead of understanding better I have ended up with many more questions than answers.

I was listening to a lecture (celebrating Dawkins' book, The Selfish Gene) and the professor said that of the information in our chromosomes about 2% codes for proteins, perhaps 3% is used for control, and the remaining 95% is thought to be made up of remnants of retroviruses and empty "LINE" sequences that don't do anything.

Couple that with a typical quote about the difference between human and chimp genes, this one from National Geographic:

"The goal is to answer the basic question: What makes us humans?" said Eichler.
 
Eichler and his colleagues found that the human and chimp sequences differ by only 1.2 percent in terms of single-nucleotide changes to the genetic code. But 2.7 percent of the genetic difference between humans and chimps are duplications, in which segments of genetic code are copied many times in the genome.
 
"If genetic code is a book, what we found is that entire pages of the book duplicated in one species but not the other," said Eichler. "This gives us some insight into the genetic diversity that's going on between chimp and human and identifies regions that contain genes that have undergone very rapid genomic changes."

And this one:
The new estimate could be a little misleading, said Saitou Naruya, an evolutionary geneticist at the National Institute of Genetics in Mishima, Japan. "There is no consensus about how to count numbers or proportion of nucleotide insertions and deletions," he said.
 
So I still don't think I have a good feel for the magnitude of the difference.

I have an interest in how to measure the difference between two sets of data - it is of key importance in data compression algorithms.

A reversal of data makes little real difference, and an insertion or deletion that merely shifts the position of other data is not much of a true difference either. So how do they measure these gene differences in a meaningful way?

When they say that humans vary from chimps by only a few percent, that could mean a lot of different things. Are they only looking at the 5% that counts (that would make sense to me), or are they overstating it by looking at all the junk DNA? I don't think they are looking at mitochondrial DNA. How do they deal with reversals and relocations? Are they only looking at the DNA that codes for protein amino acid sequences? How can this difference be expressed in a meaningful way?


On Linux-based computers there is a command called diff that creates a difference file. While this command deals with insertions and deletions well, it is not so good at reversals or rearrangements of blocks. Related to this program is one called rsync (http://samba.anu.edu.au/rsync/).
rsync was originally written by Andrew Tridgell (http://en.wikipedia.org/wiki/Andrew_Tridgell) as the basis of his PhD thesis. He is a key person and in some ways more important to Linux than Linus. (BTW - for a bit of self-reference, I use rsync to transfer updates of this web site!)
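
The point about diff coping with insertions better than with reversals or block moves can be tried out on a toy example. This is only a sketch: it uses Python's standard difflib module as a stand-in for diff (it is not the same program), and the short "DNA" string and the names similarity, inserted, flipped and moved are made up for illustration.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio in [0, 1]; 1.0 means identical."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

# Made-up sequence, not real genetic data.
original = "ACGTTGCAAGGCTTAACCGGTTAACCGGAATT"

inserted = original[:10] + "TTT" + original[10:]  # small insertion
flipped = original[::-1]                          # whole sequence reversed
moved = original[16:] + original[:16]             # two halves swapped

for name, seq in [("insertion", inserted),
                  ("reversal", flipped),
                  ("block move", moved)]:
    print(f"{name:10s} similarity = {similarity(original, seq):.2f}")
```

In this toy case the insertion scores as nearly identical, while the reversal and the block move score much lower, even though in each case the change can be described in a handful of words.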

He was solving a problem that often comes up with computer files that exist in different versions - let's say a text file. Instead of sending the entire file, he came up with a system that breaks the file into blocks, creates a hash of each block (a mathematical method that identifies a block of data with a type of checksum), compares the hashes on both ends, and then transmits only the differences (the differences are compressed as well). This allows files to be 'synchronized' without sending the whole thing, thus speeding up the process. Other attempts to create compact differences use more complex algorithms at the expense of processor time.
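
The block-and-hash idea described above can be sketched in a few lines. This is a heavily simplified illustration and not rsync's actual implementation: the real program uses a cheap rolling checksum alongside a stronger hash, much larger blocks, and compression of the literal data. The block size and the function names block_hashes, make_delta and apply_delta are assumptions made up for this example.

```python
import hashlib

BLOCK = 8  # hypothetical block size, kept tiny so the demo is easy to follow

def _digest(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def block_hashes(old: bytes) -> dict:
    """Receiver side: map the hash of each fixed-size block to its index."""
    table = {}
    for i in range(0, len(old), BLOCK):
        table.setdefault(_digest(old[i:i + BLOCK]), i // BLOCK)
    return table

def make_delta(new: bytes, table: dict) -> list:
    """Sender side: emit ('copy', block_index) or ('data', literal_bytes)."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + BLOCK]
        idx = table.get(_digest(chunk))
        if idx is not None:
            if literal:  # flush pending literal bytes first
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", idx))
            pos += len(chunk)
        else:
            literal.append(new[pos])  # no match here, slide one byte
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta: list) -> bytes:
    """Receiver side: rebuild the new file from old blocks plus literals."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value * BLOCK:(value + 1) * BLOCK]
        else:
            out += value
    return bytes(out)

old = b"The quick brown fox jumps over the lazy dog. " * 4
new = old[:50] + b"INSERTED " + old[50:]

delta = make_delta(new, block_hashes(old))
literal_bytes = sum(len(v) for kind, v in delta if kind == "data")
print(f"file is {len(new)} bytes, literal bytes sent: {literal_bytes}")
```

Because the sender checks for a known block at every byte offset, an insertion that shifts everything after it costs only a few literal bytes. A reversed block, on the other hand, would not match any hash and would be sent in full.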

The important point here is how to measure the actual meaningful difference. It seems to me that looking only at the 5% and then finding the best compression of the difference data would give us a representation that has meaning. A fraction formed from the best compression of the difference over the best compression of the useful data would then be the way to go.
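
Here is a rough sketch of what such a fraction might look like in practice. Several things are assumptions on my part: zlib stands in for "the best compression" (which no real compressor achieves), the compressed difference is approximated as the extra bytes needed to compress b once a has already been seen, and the sequences are random made-up data rather than real genes. The names csize, difference_fraction and mutate are hypothetical.

```python
import random
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))

def difference_fraction(a: bytes, b: bytes) -> float:
    """Extra compressed bytes needed for b once a is known, relative to b alone."""
    extra = max(csize(a + b) - csize(a), 0)
    return extra / csize(b)

def mutate(seq: bytes, rate: float, rng: random.Random) -> bytes:
    """Replace roughly `rate` of the positions with a randomly chosen base."""
    out = bytearray(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(b"ACGT")
    return bytes(out)

rng = random.Random(1)
a = bytes(rng.choice(b"ACGT") for _ in range(4000))          # made-up sequence
close = mutate(a, 0.01, rng)                                  # about 1% changed
unrelated = bytes(rng.choice(b"ACGT") for _ in range(4000))   # independent data

print("close relative:", round(difference_fraction(a, close), 3))
print("unrelated     :", round(difference_fraction(a, unrelated), 3))
```

A closely related sequence should give a fraction well below that of an unrelated one. Note that zlib only looks back about 32 KB, so a compressor with a much larger window would be needed for anything genome sized.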

I've also heard estimates of how much information our genes represent, and again I don't know whether these estimates cover both the real data and the junk data. Is it 700MB, or 250MB compressed? Does this include all the junk? Is there a difference in how well the junk genetic information compresses?
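
As a back-of-the-envelope check on where a number like 700MB could come from, here is a small calculation. It assumes a round figure of about 3 billion bases for the whole genome and 2 bits per base (four possible letters), both approximations that I am supplying here and that are not taken from the estimates quoted above. The name raw_size_mb is made up.

```python
def raw_size_mb(bases: int, bits_per_base: float = 2.0) -> float:
    """Uncompressed size in megabytes (10**6 bytes) at a fixed number of bits per base."""
    return bases * bits_per_base / 8 / 1_000_000

GENOME_BASES = 3_000_000_000  # assumed round figure, not an exact count

whole = raw_size_mb(GENOME_BASES)              # everything, junk included
useful = raw_size_mb(GENOME_BASES * 5 // 100)  # only the 5% mentioned above

print(f"whole genome: {whole:.1f} MB")   # 750.0 MB
print(f"the 5% only : {useful:.1f} MB")  # 37.5 MB
```

Under these assumptions the uncompressed whole genome lands in the same ballpark as the 700MB figure, which suggests that figure includes the junk. It says nothing about how well either part compresses.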

Transtronics, Inc.
3209 W. 9th Street
Lawrence, KS 66049
USA

Ph: (785) 841 3089
FAX: (785) 841 0434
Email: inform@xtronics.com
WEB: http://xtronics.com
(C) Copyright 1994-2007, Transtronics, Inc. All rights reserved
Transtronics® is a registered trademark of Transtronics, Inc.