Welcome to my substack! Let’s dive straight into a biological question of interest to many human beings - how similar are we really to other apes, such as chimpanzees? This is a multi-layered concept with a long history but in this post I’ll stick to where I have expertise - comparing genome sequences.
Numbers given for the similarity of human and chimpanzee genomes across the internet range fairly widely - we’ll see some reasons why in this post. Commonly given figures in the popular science literature include 96% (e.g. here) and 98.8% (e.g. here). Comparing the latter figure with findings from a new paper has led to recent claims of a debunked ‘icon of evolution’.
A recent study published in Nature (Yoo et al. 2025) - which produced new high quality genome assemblies for a range of non-human ape genomes - has been interpreted as providing a new calculation for pairwise similarity between the genomes of humans and various other apes. It has gotten some attention from people skeptical of evolutionary theory e.g. here and in podcast form here, which led me to write this post. Evolution critic Dr. Casey Luskin summarises his perspective “For now, we can safely say that this latest study shows that the human and chimp genomes are at least 14.9 percent different” and “I suspect that this radical finding has implications — not just for science, but also for human exceptionalism, for the reliability of heavily marketed talking points, and more — that people will be discussing for a long time.” I’ll explain why I believe that commentary overstates the differences and their significance and in the process hopefully help you understand a little more about the fascinating world of comparative genomics.
Calculating similarity between sequences
Sentences and genomes have some relevant similarities - as well as important differences which we’ll set aside for now. To get into the topic of genome similarity let’s start by comparing two sentences, which happen to be “pangrams”; that is, they use all letters of the alphabet at least once each (thanks Wikipedia). The choice of the first sentence is inspired by the fox family living on my street and regularly waking us up over the last couple of weeks.
“The quick brown fox jumps over the lazy dog”
“Pack my box with five dozen liquor jugs”
None of the words are shared. All of the letters are shared but they’re in a more or less completely different order, so it seems reasonable to me to say these sentences are something like 0% identical. If we treat these as arbitrary sequences of characters (i.e. don’t worry about word units or meanings), there are some small similarities, but none more than two letters long. Whether we ought to consider short matches within words as relevant will depend on the intended use of our similarity measure. Now, how about these two?
“The lazy dog sat on the mat”
“The pink lazy dog sat on the mat”
Using the concept of a “gap difference”, which we’ll come back to soon, it can be said that these sequences differ by four letters (the additional word “pink”), so they are 4/25 = 16% different (counting the 25 letters of the longer sentence and ignoring spaces). So far, so good, I think. But now, consider these:
“The lazy dog sat on the mat”
“The lazy lazy dog sat on the mat”
Perhaps the second is a typo, or perhaps lazy is repeated for emphasis. Either way, the “gap difference”, defined as the proportion of subsequences lacking a 1:1 alignment, is the same as in the pink case. Using this measure here seems a bit off to me, however, when considered in terms of functional information. Of course, “information” in biology is a fraught concept. One way to state the difficulty is to say that copies - whether of words or genes - don’t really add information. I think that is more or less right, and that copy number variants should be distinguished from other kinds of differences - as they typically are in technical reports of genome comparisons. Interestingly, in other contexts, opponents of evolution are keen to make the claim that copy number changes don’t change information content (e.g. here), but this idea seems to be quietly forgotten when comparing human and chimpanzee genomes.
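To make the arithmetic for the two sentence pairs above concrete, here is a minimal sketch in Python. It is an illustration only: the helper names `letters` and `gap_difference` are my own, and I am using the standard library’s `difflib.SequenceMatcher` as a stand-in for a real sequence aligner, which it is not.

```python
from difflib import SequenceMatcher


def letters(sentence):
    """Keep only the letters, lowercased, so spaces and case are ignored."""
    return "".join(ch.lower() for ch in sentence if ch.isalpha())


def gap_difference(a, b):
    """Fraction of the longer sentence's letters left without a matched partner.

    Toy definition for this post; not the pipeline used in any genome paper.
    """
    a, b = letters(a), letters(b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    longer = max(len(a), len(b))
    return (longer - matched) / longer


base = "The lazy dog sat on the mat"
print(gap_difference(base, "The pink lazy dog sat on the mat"))  # 4 of 25 letters unmatched
print(gap_difference(base, "The lazy lazy dog sat on the mat"))  # also 4 of 25
```

Both comparisons come out at 4/25, even though one inserts a new word and the other merely repeats an existing one. The measure itself cannot tell the two cases apart.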
Imagine being asked to compare the texts of two books on similar topics and to give a measure of how similar they are, at the level of individual letters. With some thought, you could come up with several ways to do this. There are established technical measures of string similarity, e.g. Hamming distance and Levenshtein distance, and a range of approaches for aligning strings (both different algorithms and different scoring systems), each of which will result in a different percentage identity in the resulting alignment. Clearly the concept of “percentage identity” between two very long sequences is not straightforward.
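As a rough illustration of how much the choice of measure matters, here is a small Python sketch of the two distances named above, applied to the “pink” sentence pair with spaces removed. The padding of the shorter string with “-” characters is my own device so that Hamming distance (which needs equal lengths) can be applied at all; it is not a standard procedure.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


short = "thelazydogsatonthemat"
longer = "thepinklazydogsatonthemat"

print(levenshtein(short, longer))                              # edits needed
print(hamming(short + "-" * (len(longer) - len(short)), longer))  # position-by-position
```

Levenshtein distance sees the four inserted letters and nothing else. The position-by-position Hamming comparison is thrown out of register after “the”, so it reports nearly every position as different. Same sentences, very different “percentage identity”.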
Recent Nature paper and some conclusions drawn
The recent Nature paper mentioned above reports in supplementary data a “gap divergence” or “gap difference” of 12.5-13.3% between human and chimpanzee genomes (specifically non-sex chromosomes). A gap difference is counted for every stretch of sequence in, e.g., the human genome for which a 1:1 alignment with the chimpanzee genome isn’t successfully calculated. Note that a pairwise “alignment” is a data structure showing which parts of two sequences are similar to each other, and which could not be paired (more background on this here). In the next two sections I’ll pull out a few relevant numbers to explore the meaning of this, but if you don’t care too much about the details don’t worry, I’ll bring it together in a summary at the end.
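For readers who like to see the data structure, here is a toy sketch in Python of a pairwise alignment written as two equal-length rows, with “-” marking positions that could not be paired. The function name, the example sequences and the exact denominators are my own assumptions for illustration; the paper’s actual calculation works on whole-genome alignments and may define its terms differently.

```python
def alignment_stats(ref_row, query_row, gap="-"):
    """Summarise a gapped pairwise alignment from the reference's point of view.

    gap_fraction:      share of reference bases with no partner in the query
    mismatch_fraction: share of paired bases where the two letters differ
    Assumes the reference row contains at least one paired base.
    """
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have equal length")
    ref_bases = unpaired = paired = mismatches = 0
    for r, q in zip(ref_row, query_row):
        if r == gap:
            continue  # extra sequence in the query; not counted against the reference here
        ref_bases += 1
        if q == gap:
            unpaired += 1
        else:
            paired += 1
            mismatches += r != q
    return {
        "gap_fraction": unpaired / ref_bases,
        "mismatch_fraction": mismatches / paired,
    }


# Hypothetical 10-base example: two reference bases unpaired, one mismatch at the end.
print(alignment_stats("ACGTACGTAC",
                      "ACGTA--TAG"))
```

The point of separating the two numbers is that they answer different questions: one asks how much sequence could not be paired at all, the other asks how different the paired sequence is.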
If you delve into the data in the supplementary tables, you’ll find some interesting context for the numbers reported. Consider Supplementary Table III.19 which compares the gold standard T2T-CHM13 human genome to other genomes including another human genome reference, GRCh38 (data for autosomes, in sheet 19 of the supplementary tables Excel file and similar data in sheet 17):
Query     Aligned(%)   Identical(%)   Aligned 1:1(%)   Identical 1:1(%)
GRCh38    93.81        93.67          87.08            86.96
Chimp     91.47        90.22          86.10            84.95
Bonobo    91.48        90.22          85.98            84.83
Gorilla   90.89        89.33          85.44            83.99
When you compare different human genome assemblies to each other (autosomes only) you can only get 93.81% of the sequence to align, and only 93.67% comes out as identical. This is presumably due to a combination of (i) the T2T-CHM13 assembly containing more genetic material than the other assemblies and (ii) repetitive regions which aligners can’t cope well with. The first issue isn’t a major problem for the T2T human vs non-human ape comparisons, as the ape assemblies are said in the paper to be of similar quality to the human assembly - but the latter issue is important.
The previous paper reporting the T2T (telomere to telomere) human genome (Nurk et al. 2022) states that it includes 4.5% more genetic material than the GRCh38 reference over the autosomes and X chromosome (see Table 1). Comparing this to the table above, it seems to leave a proportion that doesn’t align due to technical rather than biological issues, but how much is unclear. It seems to me that any accurate estimate of genuine human-chimp genome differences would need to take into account the proportion that won’t align even in human-human comparisons. I’m not sure of the best way to deal with this, so I would be glad to see any published discussions of how/whether to take the intra-species pangenome and non-aligning regions into account when doing cross-species comparisons.
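To put the table’s numbers side by side, here is a short Python sketch that simply subtracts each value from 100. The dictionary keys and the `complement` helper are my own naming. This is only arithmetic on the figures quoted above, not a proposed correction method.

```python
# Values copied from the table above (autosomes, % of T2T-CHM13).
table = {
    "GRCh38":  {"aligned": 93.81, "identical": 93.67, "aligned_1to1": 87.08, "identical_1to1": 86.96},
    "Chimp":   {"aligned": 91.47, "identical": 90.22, "aligned_1to1": 86.10, "identical_1to1": 84.95},
    "Bonobo":  {"aligned": 91.48, "identical": 90.22, "aligned_1to1": 85.98, "identical_1to1": 84.83},
    "Gorilla": {"aligned": 90.89, "identical": 89.33, "aligned_1to1": 85.44, "identical_1to1": 83.99},
}


def complement(rows):
    """Percentage NOT covered by each measure, i.e. 100 minus the table value."""
    return {
        query: {name: round(100 - value, 2) for name, value in cols.items()}
        for query, cols in rows.items()
    }


not_covered = complement(table)
for query, cols in not_covered.items():
    print(f"{query:8s}", cols)
```

Read this way, the “identical 1:1” column leaves about 15% of the human reference unaccounted for in the chimpanzee comparison, but it also leaves about 13% unaccounted for when one human assembly is compared with another. How much of that human-human shortfall is technical rather than biological is exactly the open question raised above.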
Further important context when thinking about comparative genomics
Biology is a very ‘high context’ science - properly understanding a particular biological fact, e.g. concerning the function of a given protein, requires a lot of background knowledge. There are two important classes of facts that I think people considering numbers presented for genome similarities should know before trying to draw any conclusions about human uniqueness or similar concepts. (1) Human genomes differ from each other by non-trivial amounts, albeit likely not as much as those of most other primate species do; there is no one standard human genome. (2) Species which creationists (of the various flavours available) widely accept as sharing a common ancestor have genomes which differ substantially. The human-chimp genome difference doesn’t seem special when compared to differences within other taxonomic families.
On point (1), human genomes differ, but by how much of course depends on how we count differences. Approximately 2% of the genome lies within regions which differ by complex structural variants within the human population, according to a review by Miga & Wang (2021) - this number is probably higher now with full length assemblies. A more recent paper (Liao et al. 2023) suggests that on average 4.4% of sequences in pairwise human genome comparisons are either not assembled or can’t be aligned. The T2T genome which was the main human reference for the study of interest was sequenced not from normal human tissue but from the product of a failed pregnancy, called a hydatidiform mole, which came from a European individual. Other T2T sequences are now available from Han Chinese individuals (He et al. 2023), and they are reported to differ from the T2T-CHM13 genome and from each other by a remarkable 300+ Mb each (roughly 9% of the genome!). I expect other alignment methods would bring this down a lot, but as far as I can tell, our current understanding is that random individuals across the human population will on average differ by at least tens of millions of base pairs of structural variants (largely in repeat regions). These will show up as “gap differences” in the alignments.
On point (2), skeptics of evolution typically accept common descent within a genus or family, with the taxonomic family or higher being the level most often accepted. Species within a family, however, have genomes with levels of divergence similar to those seen between humans and chimpanzees - as expected, given that humans and chimps are both in the family Hominidae. This is seen by comparing the large scale properties of mouse versus rat genomes (family: Muridae) and human versus chimp genomes (Thybert et al. 2018). Similar comparisons can be seen for lion, tiger, and house cat genomes (Cho et al. 2013, Samaha et al. 2021).
Summary of the science:
I don’t think that simply counting the regions between two genomes that either align 1:1 or that are exactly identical and reporting this as a percentage of the total sequence is an accurate representation of genome similarity. Some context is needed if the aim is understanding rather than a “gotcha” response against the “98.8% identical” claim (which is, like the ~85% similar claim, false or misleading without giving the necessary context). Some sequence regions are highly similar between the two genomes but are very repetitive so cannot be aligned with the algorithms standardly used. Many of the putatively different sequence regions differ substantially across the human population as well. Some regions in one genome are exactly identical to regions in the other (whether we are comparing two human genomes or a human genome with a chimpanzee genome) but were duplicated or copied multiple times in the lines of descent connecting the two via a common ancestor so don’t have 1:1 alignments. A total genome difference of on the order of 10% is what is expected across a mammalian taxonomic family and there’s no indication yet that I’m aware of that human-chimpanzee differences are different to what is seen among other mammalian species that diverged on the order of 10 million years ago. In my view, that’s the key point which is liable to be drowned out in discussions of percentages which miss the clear patterns built into nature - if we focus on the leaves we can miss the tree.
Finally, my worldview, in case you’re wondering: I’m a Christian - I’ve come to think that Darwin was right on the big picture of evolution, and I share many of the views of early Christian supporters of Darwin like Asa Gray and Charles Kingsley. I’m working out various details, and hope to write them up. For now, thanks for reading my first substack post! Feel free to comment if you have any response, particularly if I got something wrong - and if you’d like to read more along these lines, be sure to subscribe!
A fascinating read - I need to update my lectures
Aren't you making false accusations and misrepresenting what Dr. Casey Luskin has been writing about this scientific issue? I fear it's not a very nice way for you to launch your substack.