
As the technology underpinning genome studies continues to improve, an over-reliance on Eurocentric datasets holds research back, asserts Neil Ward. It’s time for greater diversity.
Genomics has emerged as one of the most powerful approaches for understanding human health and disease. Yet, while our ability to accurately sequence and analyse DNA has grown exponentially, the diversity of the data underpinning research progress has not kept pace.
More than 90% of publicly available genome-wide association study (GWAS) data [1] comes from participants of European descent, even though Europeans make up less than a fifth of the global population. A 2025 study [2] found that 86% of large genomic studies include only one ancestry group, which severely limits how far their results can be applied beyond the population studied. This imbalance has far-reaching consequences, from skewed drug development pipelines to missed diagnoses in communities whose genetic profiles aren’t represented in datasets.
The impact is already visible. Genetic variants common in many non-European populations – such as African, Arab or South Asian groups – often go undetected or are misclassified [3], simply because they don’t appear in the datasets researchers use to interpret DNA. The result is ineffective treatments, and entire populations excluded from the benefits of genomic innovation.
To move forward, we need a shift in how genetic research is designed that recognises inclusion as both a scientific and social imperative. Increasingly, researchers are stepping up to fill the gap with pangenomes.
Single reference vs pangenomes
When researchers analyse an individual’s genome, they typically compare it to a reference genome – a single composite DNA sequence used as a baseline. This method has powered much of modern genomics, helping to identify genetic variants linked to disease, predict how people will respond to drugs, and guide diagnosis and treatment decisions.
But there’s a problem. The most widely used reference genome, GRCh38 [4], is a composite of fragments of DNA from multiple individuals, primarily of African and Northern European descent. As a result, GRCh38 captures only a narrow slice of global genetic diversity. Variants common in other ancestries may appear rare or be missed altogether – not because they are unimportant but because they’re simply not present in the reference dataset.
Pangenomes offer an alternative. Built from the genomes of many individuals rather than condensed into a single sequence, they represent both the core genes shared across humanity and the variable genes found only in certain populations. This fuller picture enables more accurate variant interpretation, especially for communities historically left out of genomic research.
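The core-versus-variable distinction can be pictured with a toy sketch. It treats each genome as a plain set of gene names, which is a deliberate simplification: real pangenomes are built from sequence data, not gene lists. The function name, individuals and gene names below are all hypothetical and chosen only for illustration.

```python
def split_core_and_variable(genomes):
    """Split genes into those shared by everyone and those found in only some.

    genomes: dict mapping an individual's label to a set of gene names
             (a toy stand-in for real sequence data).
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)        # every gene seen in anyone
    core = set.intersection(*gene_sets)  # genes present in every individual
    variable = pan - core                # genes present in some, but not all
    return core, variable


# Hypothetical example data
toy_genomes = {
    "individual_A": {"gene1", "gene2", "gene3"},
    "individual_B": {"gene1", "gene2", "gene4"},
    "individual_C": {"gene1", "gene2", "gene3", "gene5"},
}

core, variable = split_core_and_variable(toy_genomes)
print("core:", sorted(core))
print("variable:", sorted(variable))
```

In this toy data, a reference built from individual_A alone would contain neither gene4 nor gene5, which is the kind of blind spot a single reference creates.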
Solving the puzzle
Building a comprehensive picture of the human genome has never been straightforward. Until recently, even the most widely used human reference genome, GRCh38, covered only about 92% of the genome. The missing 8% included some of the most complex and repetitive regions of human DNA, often referred to as ‘dark’ regions, which earlier sequencing technologies either misassembled or missed [5]. That final 8% [6] was only sequenced in 2022, by the Telomere-to-Telomere (T2T) Consortium. The breakthrough was made possible by a fundamental advance in technology.
Whereas earlier short-read sequencing technologies fragmented DNA into tiny pieces of around 150 bases in order to read a sample, the advent of long-read sequencing enabled scientists to capture much longer, continuous stretches of DNA. This means long-read technologies, such as HiFi, are significantly more effective at spanning repetitive sequences, resolving complex structural variants and accurately reconstructing difficult genomic regions.
The difference is like trying to assemble a jigsaw puzzle from scattered, identical blue tiles (short reads) versus working with recognisable sections of sky, sea and clouds (long reads).
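The jigsaw analogy can be made concrete with a toy sketch. It builds a made-up sequence containing a 30-base repeat and asks what fraction of error-free reads of a given length occur only once in that sequence, and so could be placed unambiguously. The sequence, the read lengths and the function name are illustrative assumptions, far smaller than real data, and the exact-match counting is a simplification of how real aligners and assemblers work.

```python
from collections import Counter


def unique_read_fraction(genome, read_length):
    """Fraction of all possible error-free reads that occur exactly once."""
    reads = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)


# Hypothetical 50-base "genome": unique flanks around a 30-base repeat
left_flank = "GATTACAGGC"
repeat = "AT" * 15
right_flank = "CCGTAGCTTA"
toy_genome = left_flank + repeat + right_flank

short_fraction = unique_read_fraction(toy_genome, 6)   # shorter than the repeat
long_fraction = unique_read_fraction(toy_genome, 40)   # longer than the repeat

print(f"short reads placed uniquely: {short_fraction:.0%}")
print(f"long reads placed uniquely: {long_fraction:.0%}")
```

Reads shorter than the repeat that fall entirely inside it look identical to one another, like the blue tiles. Reads longer than the repeat always include some flanking sequence, so each one can be placed.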
The completion of the human reference genome with long-read technologies demonstrates the importance of using such advanced sequencing to build pangenomes.
If it took long reads to finish sequencing even a single, mostly European genome, the technology is essential for capturing the full breadth of human variation across diverse populations.
The Arab dimension
The Arab human pangenome, recently published in Nature Communications [7], offers a compelling case study in why population-specific genomic references matter. Although Arab populations represent nearly 6% of the global population, they’ve been largely absent from genomic research. Countries such as the UAE have invested heavily in sequencing initiatives, but until now the resulting data were being compared to reference genomes that didn’t reflect Arab ancestry.
Using long-read sequencing, the study’s authors have now generated a haplotype-resolved pangenome from 53 individuals across eight countries in the Middle East and North Africa. Participants were chosen to span both the Gulf and North African regions, reflecting a broad cross-section of Arab ancestries.
The researchers uncovered more than 111 million base pairs of previously unsequenced DNA, including 235,000 structural variants that were unique to Arab individuals – revealing just how much has been missing from standard references. Additionally, the study found 883 duplicated genes that were present in every individual studied and potentially linked to recessive disease. A duplicated gene is one that appears more than once in the genome, which can confuse variant interpretation and disrupt gene function.
Having accurate data on complex variants and duplications brings researchers closer to distinguishing which are benign and which may be pathogenic. This insight is valuable for understanding recessive disease in Arab populations and paves the way for improved diagnostic outcomes and better informed care for patients of Arab ancestry.
Pangenomes need partnership
The Arab pangenome study is one of a growing number showing that inclusive pangenomes are both scientifically valuable and clinically essential. Such high-quality datasets are only made possible by using sequencing technologies capable of capturing the full complexity of human DNA.
But building pangenomes for every population can’t happen in isolation. It requires sustained investment, international research partnerships, and a commitment to genomic equity. Governments, funders and industry must work together to ensure that no group is left behind. Only then can genomic medicine truly deliver on its promise, for everyone, everywhere.
References:
[1] https://gwasdiversitymonitor.com/
[2] https://genebites.org/2025/02/17/diversity-in-genetic-data-where-is-everyone/
[3] https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.660428/full
[4] https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/
[5] https://www.nih.gov/news-events/nih-research-matters/first-complete-sequence-human-genome
[6] https://www.science.org/doi/10.1126/science.abj6987
[7] https://www.nature.com/articles/s41467-025-61645-w
Neil Ward is VP and general manager EMEA at PacBio