How Viral DNA in Your Blood May Influence the Severity of Autoimmune Diseases and COVID-19

Nasim Azizi

Using whole-genome sequencing in a large Japanese cohort, researchers uncovered intriguing links between viruses that circulate in the bloodstream or are integrated into the human genome, such as anelloviruses and endogenous HHV-6, and diseases such as lupus, rheumatoid arthritis, and COVID-19.

What if a piece of viral DNA lurking in your blood could determine the likelihood and severity of a disease you may develop over time? A recent study helps us get closer to answering this question by analyzing two viruses in the human blood and genome, anellovirus and eHHV-6B, and their role in autoimmune diseases and COVID-19.

Currently, there are gaps in our understanding of the relationship between viral infection and autoimmune diseases. Viruses often exist within human blood without causing symptoms.1 For instance, anelloviruses have been observed in the blood of 8% of healthy individuals,2 and eHHV-6, a virus integrated within the genome, exists in 1% of humans; together, such viruses make up the human ‘virome’.3 To explore potential links between the human virome and immune responses, large-scale studies need to be carried out. A study by Sasa et al. set out to explore how two specific viruses in humans, eHHV-6 and anellovirus, contribute to the pathogenesis of five autoimmune diseases: psoriasis vulgaris (PV), systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), pulmonary alveolar proteinosis (PAP), and multiple sclerosis (MS), with the addition of COVID-19.4 The results revealed that patients with eHHV-6B have a higher risk of SLE and PAP, while high loads of anellovirus in the blood are strongly associated with RA, SLE and COVID-19 severity. This study has uncovered important aspects of the relationship between the human virome and both autoimmune and infectious diseases, providing a better understanding of these conditions and of the potential clinical role of these viruses as biomarkers.

Researchers further investigated this connection by analyzing the association of eHHV-6 and anelloviruses with certain autoimmune diseases and COVID-19 in a cohort of over 6300 Japanese individuals and healthy controls. They used whole-genome sequencing to examine each individual’s genome for the presence of eHHV-6 or anellovirus. The results seemed to confirm their suspicions: eHHV-6B, a type of eHHV-6 virus, is much more common in SLE and PAP patients than in healthy controls. They also noticed that SLE patients with eHHV-6B showed more severe symptoms, pointing to a significant correlation between this virus and SLE severity.

Figure 1: The workflow of the study. An overview of the study by Sasa et al. on 6321 Japanese patients with PV, RA, SLE, PAP, MS or COVID-19, along with healthy controls. Researchers used long-read sequencing and genome mapping to detect the presence of eHHV-6 and anellovirus. Figure adapted from Sasa, N. et al., 2025.4

Another intriguing observation from this study was the role of eHHV-6B in immunity to HHV-6B virus. According to Sasa et al., eHHV-6B triggers immune responses against HHV-6B. Since eHHV-6B originated from a virus but is now a part of the human genome, it may act as both virus and self. Therefore, eHHV-6B may be a heritable viral infection that the immune system responds to. This occurrence is called ‘endoimmunity’ 4.

Moving on to anellovirus, high levels of this virus were seen in patients with SLE, RA, and COVID-19. Interestingly, while the proportion of COVID-19 patients with anellovirus was similar to that of the controls, the viral load of anellovirus, which is the amount of the virus in the patient’s blood, was much higher in individuals with COVID-19. Additionally, most of the COVID-19 cases carrying anellovirus were individuals with severe symptoms. The increased load of anellovirus may be a consequence of COVID-19 or of the effect of its treatments on the immune system. Alternatively, a high viral load of anellovirus may contribute to weakened immune responses and the development of this disease. The researchers also observed that anellovirus prevalence is higher in patients with SLE and RA, further supporting the hypothesis that the human virome may play a role in autoimmune diseases.
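
The distinction drawn above between prevalence (how many people carry the virus) and viral load (how much virus each carrier has) can be made concrete with a small sketch. The function name and the numbers below are hypothetical and are not taken from Sasa et al.; the loads are in arbitrary units and serve only to show that two groups can share the same prevalence while differing sharply in load.

```python
def prevalence_and_mean_load(loads):
    """Return (prevalence, mean load among carriers) for a list of per-person loads.

    A person counts as a carrier when their load is greater than zero.
    """
    if not loads:
        raise ValueError("at least one sample is required")
    carriers = [value for value in loads if value > 0]
    prevalence = len(carriers) / len(loads)
    mean_load = sum(carriers) / len(carriers) if carriers else 0.0
    return prevalence, mean_load


# Hypothetical toy data (arbitrary units), NOT values from the study.
control_loads = [0, 0, 2, 4]
covid_loads = [0, 0, 20, 40]

control_summary = prevalence_and_mean_load(control_loads)
covid_summary = prevalence_and_mean_load(covid_loads)
print("controls:", control_summary)
print("COVID-19:", covid_summary)
```

In this toy example half of each group carries the virus, yet the carriers in the second group have a tenfold higher average load, which is the kind of pattern the paragraph describes.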

These findings help us understand that, even though there is a small number of people who have high loads of eHHV-6B and anellovirus, they may have a significant influence on disease risk and clinical outcomes. This impact is more notable when compared to other genetic or environmental factors.5–7

Therefore, eHHV-6B and anellovirus can potentially play a role as biomarkers for these diseases, helping us move closer to a personalized medicine approach for them. Furthermore, by fully understanding the influence that these viruses and the mentioned diseases such as SLE, PAP, RA and COVID-19 may have on the immune response, we can develop targeted therapies, improving prevention and treatment strategies.

This approach can be expanded to a broader scope, and potentially inspire similar studies on other diseases and their links to viral infections. By applying this perspective, researchers could uncover previously unknown connections between the virome and diseases beyond the ones studied here, offering new insights into how viruses influence immune responses and disease progression. By examining diseases from this new perspective, we can deepen our understanding of the complex interplay between viruses, the immune system, and human health.

While this study offers valuable insights, it has some limitations. Its focus on a single ethnic group means the findings may not apply to a broader, more diverse population. Additionally, although the study included many participants, its short duration leaves questions about the long-term effects unanswered. To build on these findings, researchers need to conduct extended studies, particularly longitudinal ones, to fully understand the role of anellovirus and eHHV-6B in autoimmune diseases and COVID-19. Future research could help uncover how these viruses influence immune responses and potentially pave the way for new treatment strategies.

References

1.         Haynes, M. & Rohwer, F. The Human Virome. in Metagenomics of the Human Body (ed. Nelson, K. E.) 63–77 (Springer New York, New York, NY, 2011). doi:10.1007/978-1-4419-7089-3_4.

2.         Moustafa, A. et al. The blood DNA virome in 8,000 humans. PLOS Pathog. 13, e1006292 (2017).

3.         Liu, X. et al. Endogenization and excision of human herpesvirus 6 in human genomes. PLOS Genet. 16, e1008915 (2020).

4.         Sasa, N. et al. Blood DNA virome associates with autoimmune diseases and COVID-19. Nat. Genet. 57, 65–79 (2025).

5.         Okada, Y. et al. A Genome-Wide Association Study Identified AFF1 as a Susceptibility Locus for Systemic Lupus Erythematosus in Japanese. PLoS Genet. 8, e1002455 (2012).

6.         Sakaue, S. et al. Genetic determinants of risk in autoimmune pulmonary alveolar proteinosis. Nat. Commun. 12, 1032 (2021).

7.         Ogawa, K. et al. Next-generation sequencing identifies contribution of both class I and II HLA genes on susceptibility of multiple sclerosis in Japanese. J. Neuroinflammation 16, 162 (2019).

Revealing the Hidden Genetic Diversity within Human Segmental Duplications

Priyal Bhavsar

Recent sequencing technology advances have allowed for a genome-wide representation of the structural diversity of human segmental duplications, a widely understudied variation due to their size and sequence similarity.

Between the monumental release of the first draft of the human genome in 2001 and the generation of the first gapless sequence of the genome 21 years later, significant insights about the diversity of segmental duplications (SDs) have been revealed1. SDs are homologous DNA sequences greater than 1 kb with more than 90% sequence identity that are repeated in multiple locations in the genome in variable copy numbers2. These SDs lead to structural variations such as deletions and duplications in the human genome through a process known as non-allelic homologous recombination2. Increasing knowledge about human SDs, such as copy number differences, location and structure of the duplication, and variation between African and non-African populations, has been credited to advancements in DNA sequencing technologies and alignment algorithms2. Because of the implications of SDs for human diseases and for our understanding of genomic evolution and diversity, the results of population genetics surveys of SDs are an important piece in completing the entire pangenome puzzle (Figure 1)3.
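
The working definition of an SD given above (longer than 1 kb, more than 90% sequence identity) can be written as a simple check. This is only a minimal sketch: the function name and the way identity is computed (matching bases divided by aligned length) are assumptions made for illustration, not the procedure used by Jeong et al.

```python
def is_segmental_duplication(aligned_length_bp, matching_bases):
    """Apply the thresholds quoted in the text to one pairwise alignment.

    aligned_length_bp: length of the aligned region in base pairs
    matching_bases:    number of identical bases within that region
    """
    if aligned_length_bp <= 0:
        return False
    identity = matching_bases / aligned_length_bp
    # Thresholds from the definition: > 1 kb and > 90% identity.
    return aligned_length_bp > 1000 and identity > 0.90


# Hypothetical alignments, for illustration only.
print(is_segmental_duplication(5000, 4800))  # long and 96% identical
print(is_segmental_duplication(800, 800))    # identical but too short
print(is_segmental_duplication(5000, 4000))  # long but only 80% identical
```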

One such piece is a recent genome-wide analysis of the population genetic diversity of autosomal human SDs from African and non-African samples reported by Jeong et al. They investigated SD copy numbers, gene content, intrachromosomal (SDs positioned on the same chromosome) versus interchromosomal (SDs positioned on different chromosomes) distribution, and sequence patterns between populations2,4. These findings further support the completeness of population-specific human genome reference sequences, understandings of disease-associated SD variations and further research into the functional roles and expression of genes within copy number variable SDs.

Figure 1 | Rendition of a human pangenome reference sequence. A) The currently used human reference genome has some missing information about repetitive regions and segmental duplications. B) Recent advances in long-read DNA sequencing technology, reading longer regions of DNA with high accuracy, have allowed for the generation of complete human genome sequences including missing sequence information about segmental duplications. The pangenome reference aims to provide this complete picture of the genome, representing different versions of the human genome sequence at the same time, while capturing the diversity from different human populations. Figure adapted from Leja et al., 20235.

Previously estimated to account for about 5% of the genome, the proportion of SDs within the latest telomere-to-telomere (T2T) gapless sequence of the genome has now increased to about 7%6. SD regions of the genome contain the majority of copy number variable genes, that is, genes that differ in the number of duplicated DNA segments among individuals’ genomes. These genes have been implicated in cardiovascular, neurological, immune and autoimmune diseases, as well as the evolution of the human frontal cortex, the development of colour vision and our adaptation to high-starch diets2.

Recently, Jeong et al. aimed to present a population genetics survey of human SDs by analyzing DNA sequences of ethnically diverse samples against human genome reference assemblies, which are computational representations of the genome sequence2. In recent years, these human genome reference assemblies have incorporated high-fidelity (HiFi) long-read sequencing data as part of the Human Pangenome Reference Consortium (HPRC) and Human Genome Structural Variation Consortium2. Specifically, PacBio HiFi sequencing technology has allowed long segments of DNA to be read with high accuracy, helping researchers uncover genetic information with great precision7. In this study, SDs were identified in samples by generating HiFi long-read sequencing data and confirming their presence through Illumina short-read sequencing2. SDs were mapped to the T2T human reference genome (T2T-CHM13) to determine their novelty, and analyzed against the human genome assemblies to determine their variability between African and non-African populations2.

Through this study, Jeong and colleagues have contributed to advancements in our knowledge of human genomic architecture, and set a strong foundation for further investigations. Jeong et al. reported a population-level overview of SDs and found that African populations presented with higher copy numbers for many duplicated gene families (related to immunity, drug detoxification and environmental interactions) compared to non-African populations2. The genomes of African samples also showed significantly more intrachromosomal SDs2. The authors reported the identification of 183 novel protein-coding genes within SD regions, enriched for functions related to immunity2. These results further confirm the increased genetic diversity and greater population substructure within African populations. Providing one of the first overviews of a pangenome approach to SD classification, the findings of this study open many doors for further clinical applications, population-specific therapeutic strategies, personalized genome analysis tools and understanding overall evolutionary mechanisms for SD regions.

Although the small sample size of the study poses a limitation in capturing the overall structural polymorphism and genetic diversity of human SDs, current efforts by the HPRC to move away from a single human reference genome will bridge this gap in the future3. As more human population genomes are completely T2T sequenced, further information about the role of SDs in the recombination of gene-poor acrocentric short arms of chromosomes will also be revealed. Another limitation of the study was that the samples used to identify novel genes in SD regions were not the same as those behind the genome reference assemblies they were compared against. The functionality of these novel genes can be better determined through comparisons between the same samples.

Due to the length and sequence similarity of SD regions, understanding the functional consequences of variations in SDs has been difficult with traditional genotyping and sequencing techniques2. Many genome-wide studies of transcription, regulation and association therefore exclude these SDs. Future research should aim to study the role of SDs in identifying population-specific regions of the genome more prone to mutations, and the accurate identification of population-specific structural variations. The functionality of variants and genes within SDs, and how they contribute to phenotypic diversity and disease predisposition, should also be investigated. The 183 novel protein-coding and copy number variable genes identified in this study should also be functionally tested to reveal information about their transcription patterns and tissue specificity. Overall, the analysis of genome reference sequences from diverse and underrepresented global populations through growing multi-omics, long-read sequencing, and transcriptomics approaches will provide pieces to the puzzle of a complete pangenome representation for humans.

References

1.         Human Genome Project Fact Sheet. https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project.

2.         Jeong, H. et al. Structural polymorphism and diversity of human segmental duplications. Nat. Genet. 1–12 (2025) doi:10.1038/s41588-024-02051-8.

3.         Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

4.         Abdullaev, E. T., Umarova, I. R. & Arndt, P. F. Modelling segmental duplications in the human genome. BMC Genomics 22, 496 (2021).

5.         Scientists release a new human “pangenome” reference. National Institutes of Health (NIH) https://www.nih.gov/news-events/news-releases/scientists-release-new-human-pangenome-reference (2023).

6.         Telomere-to-Telomere. https://www.genome.gov/about-genomics/telomere-to-telomere.

7.         Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).

Untangling Disease Effects on Gene Expression in the Human Brain

Anushka Deshmukh

A new study shows how brain disease alters gene expression, uncovering hidden genetic patterns and pointing to new therapeutic targets.

Neuroscience has long been focused on understanding how genetic variation influences brain traits and disease progression. Genetic variation plays a huge role in determining susceptibility to neurological diseases such as Alzheimer’s disease (AD), Parkinson’s disease (PD), multiple sclerosis (MS), schizophrenia, and cognitive decline1. As of 2023, an estimated 6.7 million Americans aged 65 and older are living with Alzheimer’s dementia, with this number projected to reach 13.8 million by 20602.  However, uncovering the specific genetic pathways involved is challenging due to the heterogeneity of brain tissue and the difference in disease manifestations1,3,4. Most studies rely on brain tissue samples from individuals with neurological diseases, but this can skew results since the disease itself changes how genes are expressed5.

To address this issue, Haglund et al. analyzed how brain diseases alter gene expression quantitative trait loci (eQTLs), regions of the genome where genetic variation influences gene activity, which are valuable tools for connecting these variants to their functional outcomes5. Using single-nuclei RNA sequencing (snRNA-seq), a technique that measures gene activity in individual brain cells, they identified disease-dependent regulatory changes that would otherwise be masked in bulk-tissue analyses. The study workflow (Figure 1) outlines this approach, highlighting the integration of snRNA-seq, eQTL mapping across brain cell types, and Mendelian randomization (MR), a method that uses genetic variants as natural experiments to test whether gene activity causes disease, to identify causal gene-trait associations and prioritize therapeutic targets. While previous studies assumed that eQTL effects remain consistent across healthy and disease states, this study revealed that disease states can significantly alter genetic regulation, which can lead to biased conclusions5–7. By comparing data from diseased and healthy brains, they demonstrated the importance of using disease-free samples to draw accurate conclusions about genetic regulation in the central nervous system (CNS)5.

Figure 1: Overview of study workflow. This diagram shows how researchers analyzed nearly 2.3 million individual brain cells to explore how genetic differences affect brain diseases. The authors mapped expression quantitative trait loci (eQTLs) across eight brain cell types, evaluated disease-dependent effects, and applied Mendelian randomization (MR) and colocalization analysis. They looked at how genes are turned on or off in healthy vs. diseased brains and used MR to find which genes may actually cause brain conditions. Figure taken from Haglund et al.5

Analyzing over 2.3 million single-cell profiles from 391 individuals, the study identified nearly 14,000 genes with eQTL effects across eight brain cell types. Surprisingly, between 16.7% and 40.8% of these eQTLs exhibited disease-dependent allelic effects, meaning that genetic regulation of gene expression changed significantly in the disease state. For example, specific genetic variants affecting microglial genes, which are the brain’s immune cells, were influenced predominantly by AD, while others showed altered regulation in PD and MS. This shows that if scientists study only diseased brain tissue, they might draw the wrong conclusions because the disease can distort how genes behave. Some gene effects may even appear reversed, making it harder to tell which genetic changes are actually causing the disease. Adjusting for disease states in analyses does not fully correct for these effects5,8. Instead, using healthy brain data provides a clear picture of the baseline regulatory effects of genetic variants, which is essential for accurately identifying genetic risk factors and potential therapeutic targets.
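
To see how a disease-dependent allelic effect can look, including the sign reversal mentioned above, here is a minimal sketch that fits a separate straight line of expression against genotype (number of alternate alleles: 0, 1 or 2) in each group. The data are hypothetical toy values and the per-group slope is a deliberate simplification; it is not the statistical model used by Haglund et al.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (the per-allele effect)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("genotypes must vary to estimate a slope")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical toy data: genotype dosage and expression of one gene.
genotype = [0, 1, 2, 0, 1, 2]
control_expression = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
disease_expression = [3.0, 2.0, 1.0, 3.0, 2.0, 1.0]

control_effect = ols_slope(genotype, control_expression)
disease_effect = ols_slope(genotype, disease_expression)
print("per-allele effect in controls:", control_effect)
print("per-allele effect in disease: ", disease_effect)
```

In this toy case the same variant raises expression in controls and lowers it in disease, so pooling the two groups would hide the effect entirely.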

By isolating data from 183 disease-free brains, serving as control samples, the researchers identified 91 gene-trait colocalizations undetectable in the larger mixed datasets. Colocalization is a method that determines whether a shared genetic variant drives both gene expression and a trait, in this case, susceptibility to a CNS disorder9,10. One notable example is the identification of novel gene-trait links for MS, including genes like PEX13 in excitatory neurons. PEX13 helps control how cells manage waste and stress and has been linked to cell damage in the nervous system. Its role in MS had not been identified before this study. This underscores how disease-free data can improve our ability to detect critical disease mechanisms.

To infer causality, the study applied MR, a method that uses natural genetic variation to mimic a randomized experiment to test whether gene activity actually contributes to disease5,10. Using control brain data, researchers identified 140 causal gene-trait associations across 26 CNS phenotypes. Among these, genes such as EGFR (linked to AD) and GPNMB (linked to PD) emerged as potential therapeutic targets. EGFR inhibitors are already used in cancer treatments, and since increased expression of EGFR was linked to higher Alzheimer’s risk, these could be repurposed for neurodegenerative diseases11. Similarly, GPNMB could be a potential therapeutic target and biomarker for PD progression. Notably, these findings were validated using UK Biobank plasma protein data, reinforcing their potential clinical relevance12.
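
The core idea of MR can be illustrated with the simplest estimator, the single-variant Wald ratio: the effect of a variant on the trait divided by its effect on gene expression. This is a sketch of the general technique under the assumption of one valid genetic instrument; the function name and the summary statistics are hypothetical, and the MR pipeline actually used in the study may differ.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument Wald ratio estimate of a causal effect.

    beta_exposure: variant effect on the exposure (e.g. gene expression)
    beta_outcome:  variant effect on the outcome (e.g. disease risk)
    se_outcome:    standard error of beta_outcome

    The standard error uses the common first-order approximation that
    ignores uncertainty in beta_exposure.
    """
    if beta_exposure == 0:
        raise ValueError("the variant must affect the exposure")
    estimate = beta_outcome / beta_exposure
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error


# Hypothetical summary statistics, for illustration only.
estimate, standard_error = wald_ratio(
    beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.02
)
print("estimated effect of expression on trait:", estimate)
print("approximate standard error:", standard_error)
```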

These findings have far-reaching implications for neuroscience and genomics. First, they emphasize the complex nature of genetic regulation in the brain, particularly in the context of disease. By showing that eQTL effects can change depending on the disease state, the study highlights the importance of separating disease and healthy samples in genetic analyses. Second, the study shows the power of single-cell technologies to resolve cell-type-specific effects, uncovering regulatory relationships that would remain hidden in bulk-tissue analyses. Third, the combination of MR and plasma proteomics offers a promising framework for identifying peripheral biomarkers that predict CNS disease outcomes.

Future research should expand to larger, more diverse datasets, including brains affected by other conditions such as traumatic injury or psychiatric disorders. Additionally, testing interventions that target genes like EGFR or GPNMB in animal models could validate their potential for drug development. The development of blood-based biomarkers informed by this research could revolutionize how CNS diseases are diagnosed and treated. As snRNA-seq and similar single-cell technologies continue to evolve, these methods could help decode the genetic basis of psychiatric conditions like depression or autism, areas where bulk-tissue studies have often fallen short. Finally, the study raises intriguing questions about the “Achilles’ heel” hypothesis: whether certain genetic variants predispose individuals to disease only under specific pathological conditions. Exploring this phenomenon could enhance our understanding of gene-environment interactions and their role in disease susceptibility.

By disentangling the effects of brain disease on gene expression, this study sets a new standard for interpreting eQTL data and prioritizing therapeutic targets. Its innovative use of healthy brain data and MR provides a clearer view of the genetic regulation underlying CNS traits. This study not only helps us rethink how to study the brain, but it could also pave the way toward more personalized treatments for neurodegenerative and psychiatric disorders.

References

1.         Misra, M. K., Damotte, V. & Hollenbach, J. A. The immunogenetics of neurological disease. Immunology 153, 399–414 (2018).

2.         2023 Alzheimer’s disease facts and figures. Alzheimers Dement. 19, 1598–1695 (2023).

3.         Wareham, L. K. et al. Solving neurodegeneration: common mechanisms and strategies for new treatments. Mol. Neurodegener. 17, 23 (2022).

4.         Woodward, A. A., Urbanowicz, R. J., Naj, A. C. & Moore, J. H. Genetic heterogeneity: Challenges, impacts, and methods through an associative lens. Genet. Epidemiol. 46, 555–571 (2022).

5.         Haglund, A. et al. Cell state-dependent allelic effects and contextual Mendelian randomization analysis for human brain phenotypes. Nat. Genet. 57, 358–368 (2025).

6.         Wingo, A. P. et al. Integrating human brain proteomes with genome-wide association data implicates new proteins in Alzheimer’s disease pathogenesis. Nat. Genet. 53, 143–146 (2021).

7.         Bryois, J. et al. Cell-type-specific cis-eQTLs in eight human brain cell types identify novel risk genes for psychiatric and neurological disorders. Nat. Neurosci. 25, 1104–1112 (2022).

8.         Porcu, E. et al. Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome. Nat. Commun. 12, 5647 (2021).

9.         Bao, J. et al. Brain-wide genome-wide colocalization study for integrating genetics, transcriptomics and brain morphometry in Alzheimer’s disease. NeuroImage 280, 120346 (2023).

10.       Zuber, V. et al. Combining evidence from Mendelian randomization and colocalization: Review and comparison of approaches. Am. J. Hum. Genet. 109, 767–782 (2022).

11.       Mansour, H. M., Fawzy, H. M., El-Khatib, A. S. & Khattab, M. M. Repurposed anti-cancer epidermal growth factor receptor inhibitors: mechanisms of neuroprotective effects in Alzheimer’s disease. Neural Regen. Res. 17, 1913 (2022).

12.       Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622, 329–338 (2023).

Harnessing the Full Power of Genome Editing

Sofia Edissi

Scientists have conducted the largest functional study of TP53 to date, revealing an accurate, high-throughput method that facilitates variant interpretation and identifies promising therapeutic targets to enhance patient diagnosis and treatment.

Tumour suppressor protein 53, TP53, is a major player in cancer research because of its role as a master regulator of cell-cycle arrest and programmed cell death. Somatic variants of TP53 are observed in approximately 50% of all cancers.1 While scientists have the ability to generate enormous amounts of genetic information, interpreting the clinical significance of variants remains a major obstacle, thereby creating a bottleneck in clinical decision-making. Writing in Nature Genetics, Funk et al.2 use saturation genome editing (SGE) with clustered regularly interspaced short palindromic repeat mediated homology directed repair (CRISPR-HDR) to conduct the largest functional study of TP53 to date, revealing a high-throughput technology to accurately interpret variants. This research is part of a recent influx of publications using SGE to conduct functional studies for variant classification3,4.

In this new era of genomics, scientists can now sequence vast amounts of genetic information, but interpreting its clinical significance remains a major barrier to advancing diagnosis and care.5 As such, there has been a pressing need for researchers to bridge this gap in knowledge by performing studies which evaluate the impact of these variants on disease. While certain prediction software exists, these tools are not sufficient on their own to support classification of a variant as pathogenic or benign, according to the American College of Medical Genetics and Genomics (ACMG) guidelines for variant interpretation.6 Instead, the gold standard approach is to conduct functional studies whereby variants are modeled using cell lines or non-human species. Unfortunately, functional studies come with certain challenges, specifically the need for the test to be fast enough to clear the backlog of variants whose impact on health is unknown.

Funk et al. were able to contribute towards this effort for the interpretation of variants in TP53. To do so, they used a technique known as SGE that makes use of CRISPR-HDR technology to simultaneously analyze all possible single nucleotide variants in a genomic region.7 They also used this technology to introduce short insertions and deletions into the gene. By doing so, they can study how changes in the DNA sequence of TP53 can lead to characteristic features of cancer cells, such as increased proliferation and survival1.
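
What "all possible single nucleotide variants in a genomic region" means can be shown with a short enumeration: every position is changed to each of the three other bases. The sketch below is illustrative only; the function name, the example sequence and the coordinate convention are assumptions and do not reproduce the library design of Funk et al.

```python
def enumerate_snvs(reference, start=1):
    """List every single-nucleotide variant of a reference DNA sequence.

    Returns tuples of (position, reference base, alternate base), with
    positions counted from `start`. Non-ACGT characters are skipped.
    """
    bases = "ACGT"
    variants = []
    for offset, ref_base in enumerate(reference.upper()):
        if ref_base not in bases:
            continue
        for alt_base in bases:
            if alt_base != ref_base:
                variants.append((start + offset, ref_base, alt_base))
    return variants


# Hypothetical 3-base region, for illustration only.
snvs = enumerate_snvs("ACG")
print(len(snvs), "variants")
for variant in snvs:
    print(variant)
```

A region of n unambiguous bases therefore yields 3n single-nucleotide variants, which is why saturation approaches scale to thousands of variants for a single gene.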

The CRISPR-HDR system is used to induce double-stranded DNA breaks at a region of interest, while providing a near-identical DNA template for repair (Figure 1). For mutation-based studies, the DNA template contains a sequence variant the researcher wants to introduce. As a result, the DNA will be repaired with the inclusion of that variant. Using SGE to model more than 9,000 TP53 variants in cancer cells, Funk et al. covered 94.5% of cancer-associated variants in TP53, making this the largest functional study of TP53 to date.2

Figure 1. Workflow of saturation genome editing (SGE) using clustered regularly interspaced short palindromic repeat mediated homology directed repair (CRISPR-HDR). CRISPR-HDR is used to introduce single nucleotide variants, and small insertions and deletions, into the DNA binding domain of TP53 in cancer cells. The CRISPR-Cas9 system cuts the double-stranded DNA at a target region. The DNA is repaired through the HDR pathway, which involves using a near-identical DNA template that includes the specific TP53 variant. The survival of cells with these incorporated variants is analyzed simultaneously to determine which variants provide an advantage for cell survival. Created with BioRender.

This strategy permitted highly accurate and specific separation, based on cell proliferation and survival, of cancer cells with pathogenic TP53 variants from those with benign variants. This allowed researchers to clearly distinguish tumour-associated TP53 variants from those not associated with tumour formation. In fact, they were able to reclassify ~20% of variants previously classified as benign as pathogenic according to ACMG guidelines.6 Not only does this aid patient diagnosis, but it could also facilitate therapeutic interventions for certain variants by restoring the protein to its normal functionality. Moreover, Funk et al. found that TP53 variants which cause slight protein unfolding, resulting in partial loss-of-function (pLOF), are enough to enhance cell proliferation. Excitingly, pLOF variants have strong potential for correction through pharmacological intervention with targeted treatments. Therefore, the findings by Funk et al. reveal the power of SGE to advance TP53 variant interpretation, leading to better diagnosis and improved treatment options for patients.

This study reveals that the gold standard method previously used for TP53 variant classification is insufficient to detect a substantial proportion of pathogenic TP53 variants. These results emphasize the importance of conducting functional studies in the native cellular environment (i.e. cancer cells) to detect all biological variation. While this was not possible with previous methods, it is now achievable using CRISPR-HDR. Additionally, the approach by Funk et al. surpasses the diversity of previous functional studies of TP53, improving the clinical utility of TP53 variant databases for determining disease causation. Their results also uncovered variants suggested as promising targets for pharmacological reactivation of normal TP53 function. These findings will improve variant interpretation for TP53, allowing for improved genetic counselling and advancements in cancer therapy.

Although CRISPR-HDR worked well for this study, it has substantial limitations, including a high frequency of off-target effects and of repair by non-homologous end joining, an alternative repair pathway which joins double-stranded DNA breaks without the use of a template. This is especially true for non-dividing cells, which could make the technique less accessible for studying non-cancerous cells.8 As a result of these limitations, there has been recent interest in CRISPR-prime editing for wide-scale mutation-based studies. Although prime editing outperforms CRISPR-HDR in efficiency, it is mainly limited to single-nucleotide changes and to specific nucleotide changes. CRISPR-HDR, by contrast, is more versatile because it is not limited to specific nucleotide changes and can also introduce insertions and deletions.8,9 Despite its limitations, Funk et al. demonstrate the power and versatility of CRISPR-HDR for wide-scale functional studies. Their research provides evidence that this method is a feasible solution for determining the clinical significance of variants, which is essential for clinical decision-making. Ongoing research will continue to address the limitations of this technology to harness the full power of CRISPR-HDR for genome editing.

References

1. Whibley, C., Pharoah, P. D. P. & Hollstein, M. p53 polymorphisms: Cancer implications. Nature Reviews Cancer vol. 9 95–107 (2009).

2. Funk, J. S. et al. Deep CRISPR mutagenesis characterizes the functional diversity of TP53 mutations. Nat Genet (2025).

3. Buckley, M. et al. Saturation genome editing maps the functional spectrum of pathogenic VHL alleles. Nat Genet 56, 1446–1455 (2024).

4. Sahu, S. et al. Saturation genome editing-based clinical classification of BRCA2 variants. Nature (2025).

5. Burke, W., Parens, E., Chung, W. K., Berger, S. M. & Appelbaum, P. S. The Challenge of Genetic Variants of Uncertain Clinical Significance: A Narrative Review. Annals of Internal Medicine vol. 175 994–1000 (2022).

6. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: A joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine 17, 405–424 (2015).

7. Findlay, G. M., Boyle, E. A., Hause, R. J., Klein, J. C. & Shendure, J. Saturation editing of genomic regions by multiplex homology-directed repair. Nature 513, 120–123 (2014).

8. Liao, H., Wu, J., VanDusen, N. J., Li, Y. & Zheng, Y. CRISPR-Cas9-mediated homology-directed repair for precise gene editing. Molecular Therapy Nucleic Acids vol. 35 (2024).

9. Gould, S. I. et al. High-throughput evaluation of genetic variants with prime editing sensor libraries. Nat Biotechnol (2024).

Closing the Gaps: T2T Assembly Uncovers Hidden Functional Genomics

Weier Fan

The discovery of novel paralogues of WASHC1 and GPRIN2 in the T2T-CHM13 assembly highlights the importance of the new assembly and accurate genomic annotation in understanding genetic function.

Since its creation in 2013, the GRCh38 genome assembly has been the standard reference genome for scientists and researchers to compare and study genomic variations within the human population.1 However, the recently published T2T-CHM13 genome assembly from the Telomere-to-Telomere consortium is a new “gapless” assembly that fills in the 8% of the human genome missing from the GRCh38 reference.1 A recent study by Cerdán-Vélez and Tress highlights the discovery of novel WASHC1 and GPRIN2 paralogues (genes that arose from duplication within the same genome) uncovered by the new assembly, shedding new light on these missing regions and their functionality.2

The newly assembled T2T-CHM13 added 1,956 gene predictions, which are stretches of DNA that, based on their patterns, might be genes. About 140 of these are similar to genes known to contain instructions for making proteins.1 This may open up discoveries in human biology, evolution and even disease that could not have been made with the previous, incomplete assemblies.

However, it is unclear how many of these 140 genes produce proteins, or what the functions of those proteins are. Cerdán-Vélez and Tress set out to investigate the protein-coding status of two genes: WASH1-20p13 (LOC124908094), which shares a similar sequence with WASHC1, and GPRIN2L (LOC124900631), which is closely related to GPRIN2. These kinds of related genes, known as paralogues, arise from gene duplication events and may have similar or diverging functions. The researchers used multiple lines of evidence to study these genes, including proteomic data, evolutionary conservation, and cDNA sequencing.2

The Wiskott-Aldrich syndrome protein and SCAR Homologue (WASH) complex helps control how cells organize and move materials inside them.3,4 It plays a key role in shaping transport pathways by working with another protein complex called Arp2/3.3,4 This Arp2/3 complex builds a network of tiny filaments, like a scaffold, that helps move and sort cargo inside the cell.4 The WASH complex is made up of five subunits, with WASH complex subunit 1 being the main subunit that interacts with the Arp2/3 complex.4 The three main reference databases disagree as to which WASH1 gene encodes this subunit. RefSeq annotates the WASHC1 gene as the only coding gene,5 Ensembl/GENCODE annotates WASHC1 and WASH6P as coding genes,6 and UniProtKB lists the isoforms of WASHC1, WASH2P, WASH3P, WASH4P, and WASH6P as protein-coding.7 This lack of consensus underscores the gaps in our knowledge of the human genome, especially when it comes to correctly identifying which genes actually produce functional proteins.

Cerdán-Vélez and Tress conducted a phylogenetic analysis to reveal that WASHC1 and other paralogues clustered separately from WASH1-20p13 and functional WASH1 genes in primates, as shown in Figure 1C.2 This cross-species conservation supports the functional importance of WASH1-20p13 and raises questions about which gene is the true protein-coding gene for the WASH complex. Understanding the precise gene responsible for encoding each protein is essential, as it helps clarify their roles in cellular functions, disease mechanisms, and the development of targeted therapies. The other WASH1 isoforms annotated in UniProtKB contain various mutations in their amino acid sequences, whereas WASH1-20p13 is the only isoform that maintains the conserved amino acid sequence across vertebrates (seen in Figure 1A).2 WASHC1, originally thought to be the protein-coding gene of the WASH complex, lacked these conserved residues (seen in Figure 1B).2 The authors propose that the conservation of WASH1-20p13 across species provides compelling evidence that it is the functional gene responsible for encoding the WASH complex protein. The conservation of this gene suggests that it plays an important role in basic biological functions; otherwise, it would have changed or been lost over time.

Figure 1. A comparative phylogenetic analysis of the different WASH1 isoforms. (A) The five full-length WASH1 protein isoforms and the number of non-conserved amino acids, consisting of single amino acid variations (SAAVs) and deleted regions, that differ from amino acids conserved across primates, mammals, and tetrapods. WASH1-20p13 is left blank as there was no difference between its amino acids and those conserved across the different species. (B) A comparative analysis of the amino acids between the WASHC1 protein and WASH1-20p13 across regions conserved across vertebrates. The WASHC1 protein is shown to differ in all of the conserved amino acid positions. (C) A phylogenetic tree of great ape and human genes. Genes annotated from the T2T-CHM13 assembly are labelled with their RefSeq name, and the WASH1-20p13 gene (LOC124908094) is highlighted in red. The other WASH1 isoforms branch off into a cluster separate from WASH1-20p13. Figure taken from Cerdán-Vélez and Tress.2

The authors went on to show that the protein produced by WASH1-20p13 captured almost all of the known peptide evidence. By comparing peptide data from a large protein database, PeptideAtlas, they found that WASH1-20p13 captured 47 out of 52 detected peptides.2 Seventeen of those peptides were unique to the WASH1-20p13 gene alone.2 In contrast, the previously assumed coding gene WASHC1 captured only a small portion of these peptides. This strongly suggests that WASH1-20p13 is the true protein-coding gene of the WASH complex.2
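The logic of counting peptides that map to a protein, and those that map to it alone, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the sequences and peptides below are short, made-up strings (not real WASH1 sequences), the names `peptide_support`, `paralogue_A` and `paralogue_B` are hypothetical, and exact substring matching is a simplifying assumption that ignores complications such as isoleucine/leucine ambiguity.

```python
def peptide_support(peptides, proteins):
    """For each protein, count peptides that map to it and peptides unique to it."""
    summary = {name: {"mapped": 0, "unique": 0} for name in proteins}
    for peptide in peptides:
        # A peptide "maps" to a protein if it occurs as an exact substring.
        hits = [name for name, seq in proteins.items() if peptide in seq]
        for name in hits:
            summary[name]["mapped"] += 1
        # A peptide is "unique" when it maps to exactly one protein.
        if len(hits) == 1:
            summary[hits[0]]["unique"] += 1
    return summary


# Hypothetical toy data: two near-identical paralogue sequences.
proteins = {
    "paralogue_A": "MTPVKRLLDSGQWEAHKPLT",
    "paralogue_B": "MTPVKRLLDSGQWEAYKPLS",
}
peptides = ["MTPVK", "LLDSGQWEAHK", "LLDSGQWEAYK", "PLT"]

support = peptide_support(peptides, proteins)
for name, counts in support.items():
    print(name, counts)
```

Shared peptides cannot distinguish between paralogues, which is why the peptides unique to one gene carry most of the weight in this kind of argument.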

The GPRIN2 gene encodes a protein that helps regulate growth in nerve cells.8,9 In the GRCh38 assembly, the GPRIN2 gene was found to have a missing region in the genome.2 The newer T2T-CHM13 assembly added a single gene to this missing region, GPRIN2L, a close paralogue of GPRIN2. The authors showed that GPRIN2L produced six unique peptides in proteomic analysis, while GPRIN2 did not produce any (seen in Figure 2).2 This suggests that GPRIN2L might be more important for certain functions, but it does not rule out a role for GPRIN2 in protein production. These findings help clarify the roles of these genes in development and diseases affecting the nervous system.

Figure 2. Mapping of peptide sequences detected in PeptideAtlas of the two human GPRIN2 proteins. GPRIN2L is highlighted in blue. The colour-coded residues indicate the number of observations that are detected for that protein. Those highlighted in red have >100 observations, orange >20 observations, yellow >10, green >5, and blue >2. The differences between the two sequences are highlighted in yellow. These differences tend to be in areas that are less conserved (yellow, green, blue), displaying a difference between these two proteins. The regions that only correspond to GPRIN2L are most likely regions that support the six unique peptides found only in GPRIN2L. Figure adapted from Cerdán-Vélez and Tress.2

Cerdán-Vélez and Tress’s paper highlights the importance of accurate gene annotation and the necessity of updating these major genomic databases to reflect these newly identified functional paralogues. Misannotation in reference genomes can lead to incorrect conclusions in genetic research and disease studies, potentially wasting resources and delaying the discovery of therapeutic targets. These errors can also hinder the development of effective treatments, impacting progress in personalized medicine. By leveraging a more comprehensive reference genome, researchers can not only confirm the current understanding of gene function and potential disease causation but also uncover previously hidden disease-relevant variants that may have been overlooked in earlier assemblies.

Despite the significant advantages offered by the T2T-CHM13 assembly, the study also has some limitations. The presence of multiple paralogues in subtelomeric regions makes it difficult to distinguish functional from non-functional copies in the T2T-CHM13 assembly.1 Furthermore, while bioinformatics and proteomics analyses provide strong evidence that WASH1-20p13 and GPRIN2L are functional, their biological roles must still be directly confirmed through in vitro or in vivo functional studies. Thus, although the T2T-CHM13 assembly represents a groundbreaking step towards a complete human genome, annotation and functional characterization of these paralogues are still ongoing, and the GRCh38 assembly remains the standard human reference genome for genetic research.

References

  1. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
  2. Cerdán-Vélez, D. & Tress, M. L. The T2T-CHM13 reference assembly uncovers essential Wash1 and GPRIN2 paralogues. Bioinformatics Advances 4, (2024).
  3. Helfer, E. et al. Endosomal recruitment of the wash complex: Active sequences and mutations impairing interaction with the Retromer. Biology of the Cell 105, 191–207 (2013).
  4. Schurr, Y., Reil, L., Spindler, M. et al. The WASH-complex subunit Strumpellin regulates integrin αIIbβ3 trafficking in murine platelets. Sci Rep 13, 9526 (2023).
  5. Sayers, E. W. et al. GenBank 2023 update. Nucleic Acids Research 51, (2022).
  6. Frankish, A. et al. Gencode: Reference annotation for the human and mouse genomes in 2023. Nucleic Acids Research 51, (2022).
  7. Bateman, A. et al. Uniprot: The Universal Protein Knowledgebase in 2023. Nucleic Acids Research 51, (2022).
  8. GPRIN2 G protein regulated inducer of neurite outgrowth 2 [homo sapiens (human)] – gene – NCBI. National Center for Biotechnology Information Available at: https://www.ncbi.nlm.nih.gov/gene/9721. (Accessed: 27th January 2025)
  9. Iida, N. & Kozasa, T. Identification and biochemical analysis of GRIN1 and grin2. Methods in Enzymology 475–483 (2004). 

Targeting the duo responsible for C9orf72 ALS/FTD pathogenesis.

Connie Fierro

RNA-targeting CRISPR systems have potential for modulating expression of RNA species involved in neurodegenerative disease pathogenesis.

The most common cause of amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) is the expansion of a hexanucleotide (GGGGCC) repeat in the chromosome 9 open reading frame 72 (C9orf72) gene. ALS and FTD are characterized by motor dysfunction and cognitive and behavioural impairments, respectively, but these diseases converge on similar mechanisms of RNA-mediated toxicity1. In a new study, Kempthorne and colleagues developed a CRISPR-CasRx system to target and reduce both sense and antisense C9orf72 repeat RNAs in ALS and FTD models2. Their findings hold promise for a future therapeutic strategy to alleviate RNA-mediated toxicity and neurodegeneration in ALS and FTD.

Healthy individuals have between two and eight hexanucleotide repeats, whereas ALS/FTD patients have upwards of 30 repeats. Bidirectional transcription of the C9orf72 repeat region results in the accumulation of sense and antisense RNAs. The sense and antisense RNAs undergo repeat-associated non-ATG (RAN) translation, producing dipeptide repeats (DPRs)3. Unlike normal translation, RAN translation begins in the absence of a start codon and is a known contributor to neurodegenerative disease due to the production of toxic species such as DPRs4. Six reading frames are associated with RAN translation and may be specific to the sense strand, the antisense strand, or both (Figure 1)4. These DPRs disrupt protein homeostasis, exerting toxic effects through an unknown mechanism. Further, DPRs have been identified in the hippocampus, frontal cortex, cerebellum and spinal cord in patients with ALS/FTD3.
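The six reading frames described above can be worked through directly from the GGGGCC repeat. The following Python sketch is an illustration only: it assumes the standard genetic code, reads the repeat as DNA, and uses hypothetical helper names (`reverse_complement`, `repeat_dipeptides`). Each dipeptide is stored as an unordered pair of residues, since poly-AP and poly-PA describe the same repeating polymer.

```python
from itertools import product

# Standard genetic code, built from the usual compact TCAG-ordered table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def repeat_dipeptides(repeat_unit, n_repeats=4):
    """Translate a repeat tract in all three frames; return one residue pair per frame."""
    tract = repeat_unit * n_repeats
    dipeptides = []
    for frame in range(3):
        seq = tract[frame:]
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        peptide = "".join(CODON_TABLE[codon] for codon in codons)
        # The first two residues define the repeating dipeptide.
        dipeptides.append(frozenset(peptide[:2]))
    return dipeptides


SENSE_UNIT = "GGGGCC"
sense = repeat_dipeptides(SENSE_UNIT)
antisense = repeat_dipeptides(reverse_complement(SENSE_UNIT))

print("sense:", ["".join(sorted(d)) for d in sense])
print("antisense:", ["".join(sorted(d)) for d in antisense])
```

Because a hexanucleotide repeat spans exactly two codons, every frame yields a repeating dipeptide, and the glycine-proline pair arises from both strands, so six frames give five distinct DPR species.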

Kempthorne et al. noted the need for therapies targeting both sense and antisense repeat RNAs, as previous clinical trials focusing only on sense RNAs failed2. They engineered an RNA-targeting CRISPR-CasRx system, using guide RNAs (gRNAs) to target the C9orf72 repeat RNAs. Once CRISPR-CasRx is bound, its ribonuclease activity cleaves both sense and antisense RNA to prevent translation into toxic DPRs. This system functions as molecular scissors to eliminate specific RNA sequences. The researchers first confirmed that CRISPR-CasRx could simultaneously target and degrade both sense and antisense repeat RNAs. They reported 99% and 89% degradation, respectively, in a traditional cell line. Carrying out this experiment in a traditional cell line first established the baseline effectiveness of the system.

Figure 1: The negative effects of DPRs within the cell. Six reading frames are associated with RAN translation and are listed as follows: glycine-alanine (GA) and glycine-arginine (GR) DPRs from the sense RNA strand, proline-alanine (PA) and proline-arginine (PR) from the antisense strand and glycine-proline (GP) from both strands. DPRs can be divided into nontoxic, highly toxic and moderately toxic species. The highly toxic DPRs, GR-DPR and PR-DPR, are reported to interfere with RNA metabolism and disrupt non-membrane-bound organelles. The moderately toxic DPR, GA-DPR, is most visible as inclusions in the central nervous system of FTD/ALS patients. GA-DPR leads to reduced dendritic branching, increased cellular stress, proteasome inhibition and apoptosis. Figure from Freibaum and Taylor4.

Using this information, they moved their CRISPR-CasRx system into ALS/FTD patient-derived induced pluripotent stem cells (iPSCs) to assess its effect on endogenous C9orf72 repeat RNAs. They expanded their scope to investigate DPRs and neurotoxicity in the iPSCs after treatment with CRISPR-CasRx. Sense repeat RNA expression decreased by 40% while antisense repeat RNA expression decreased by 73%. Immunoassays detecting GP-DPR and GA-DPR levels revealed a 60% reduction in these toxic proteins. It is important to note that these DPRs pertain to the sense strand and that the researchers were limited by the immunoassays currently available for detecting antisense-specific DPRs. There was no significant reduction in viable cells after transduction of CRISPR-CasRx, which is beneficial when considering therapeutic applications. To assess the effects at the phenotypic level, they used a zebrafish model harbouring 45 hexanucleotide repeats, with a confirmed population of GP-DPRs and a hyperactive behavioural phenotype. Injection of plasmids encoding the CRISPR-CasRx system rescued this hyperactive phenotype by significantly decreasing the amount of GP-DPRs. In a mouse model with 149 hexanucleotide repeats, CRISPR-CasRx and its gRNAs were delivered via neonatal intracerebroventricular (ICV) injection using adeno-associated viruses (AAVs). Antisense-specific gRNAs were not used, as their sequence did not match the mouse model. In the hippocampus, a 50% reduction in sense repeat RNAs was reported, while there was no difference in levels of GP-DPRs.

To overcome the limitation of the previous mouse model, a bacterial artificial chromosome (BAC) mouse was designed to have the full C9orf72 sequence and 500 hexanucleotide repeats. ICV injection of CRISPR-CasRx produced a 20% decrease in sense and antisense repeat RNAs and no change in GP-DPR levels. This modest decrease in repeat RNA levels was assumed to be due to low transduction efficiency when moving from plasmids in the in vitro experiments to AAVs in the in vivo experiments. Taken together, Kempthorne et al. address the gap in antisense repeat RNA research by designing a CRISPR-CasRx system that targets both the sense and antisense strands to decrease levels of repeat RNAs in both cellular and animal models of disease2.

Despite the lack of robust evidence confirming a decrease in DPRs, the results presented by Kempthorne et al. provide a baseline for novel therapeutic strategies targeting both the sense and antisense repeat RNAs2. In-depth RNA sequencing revealed no off-target effects of CRISPR-CasRx, which supports its therapeutic potential. However, FTD and ALS are age-related diseases and neonatal injection of the CRISPR-CasRx system is not feasible, so future research should explore alternative methods of delivery. Alternative routes of AAV administration must be investigated for optimal penetration across the blood-brain barrier (BBB) in older mouse models. Deverman et al. engineered AAV variants that efficiently transduce the central nervous system through intravenous injection6. Future studies should investigate an engineered AAV to improve delivery and enhance transduction efficiency. As ALS/FTD treatments shift toward precision medicine, understanding individual RNA profiles can help tailor therapies to individual patients, which improves their efficacy7. The failure of clinical trials targeting sense repeat RNAs in FTD/ALS highlights the demand for a therapy that addresses both sense and antisense repeat RNAs and DPRs to decrease cellular toxicity and rescue the respective phenotypes.

References

  1. Ling, S.-C., Polymenidou, M. & Cleveland, Don W. Converging Mechanisms in ALS and FTD: Disrupted RNA and Protein Homeostasis. Neuron 79, 416–438 (2013).
  2. Kempthorne, L. et al. Dual-targeting CRISPR-CasRx reduces C9orf72 ALS/FTD sense and antisense repeat RNAs in vitro and in vivo. Nature Communications 16, (2025).
  3. Banez-Coronel, M. & Ranum, L. P. W. Repeat-associated non-AUG (RAN) translation: insights from pathology. Laboratory Investigation 99, 929–942 (2019).
  4. Freibaum, B. D. & Taylor, J. P. The Role of Dipeptide Repeats in C9ORF72-Related ALS-FTD. Frontiers in Molecular Neuroscience 10, (2017).
  5. Gao, J. et al. Gene therapy for CNS disorders: modalities, delivery and translational challenges. Nature reviews. Neuroscience (2024) doi:https://doi.org/10.1038/s41583-024-00829-7.
  6. Deverman, B. E. et al. Cre-dependent selection yields AAV variants for widespread gene transfer to the adult brain. Nature Biotechnology 34, 204–209 (2016).
  7. Tzeplaeff, L., Wilfling, S., Requardt, M. V. & Herdick, M. Current State and Future Directions in the Therapy of ALS. Cells 12, 1523 (2023).

Neural Network Allows for a Comprehensive Method of Assessing Gene Regulation

Nithya Gopalakrishnan

Borzoi is a sequence-based machine-learning model trained on RNA-seq that can make gene expression predictions based on longer stretches of DNA than any prior models.

The field of genomics has evolved in tandem with advances in data analysis and computational processes, allowing researchers to assess complex datasets of gene regulation information1,2. Presently, neural networks and machine-learning models are an exciting development within the field, suggesting that we may soon be able to accurately predict the effects of uncategorized genetic variants on gene function from DNA sequence alone1,3,4. By developing Borzoi, a sequence-based machine-learning model that makes direct use of RNA-seq assay data, Linder et al. have put forward a new approach to predicting sequence coverage, that is, the sequencing reads that map to the reference genome1. This model uses data from multiple species and incorporates a breadth of forms of gene regulation, including splicing, polyadenylation, and transcription1. Borzoi’s efficacy was tested against established and validated computational models across many variant interpretation tasks, including characterizing distal cis-regulatory motifs in tissue-specific datasets and differentiating between benign and pathogenic variants within an individual’s genome1. When compared with established and validated models such as Enformer and Pangolin on analyzing gene regulation at multiple loci, Borzoi performed at an equal or higher level, demonstrating the tool’s utility within genomics going forward1. On the whole, Borzoi utilizes an immense amount of epigenetic data for focused predictions regarding gene expression, which could allow for easier variant interpretation and a heightened knowledge of transcriptional regulation within the human genome1.

Figure 1: A graphical representation of the breadth of uses for RNA-seq, including the workflow for transcriptome construction highlighted in orange, the assembly of epigenetics datasets highlighted in green, and the possible downstream analyses highlighted in brown. Figure adapted from2.

To date, most genomics machine-learning tools, such as Enformer and Pangolin, have been trained to predict transcriptional regulation effects using assays that capture regulatory elements within 2,000 bp of the transcription start site (TSS), a relatively short distance1,3,4. In contrast, the most popular assay for elucidating the effects of transcriptional regulators on gene function, RNA-seq, makes use of much larger stretches of sequence to assess gene expression holistically, including exons, introns, and long untranslated regions (UTRs)1,2,5. Despite this approach’s ubiquity in transcriptomics and its use across more species than other assays, no computational predictor model had been trained directly on RNA-seq coverage prior to the inception of Borzoi. With this tool, predicting gene expression from DNA sequence across multiple forms of genetic regulation has become more sophisticated1.

As a transcriptomics assay, RNA-seq is useful for describing sequence coverage for processed RNAs that have been transcribed, making it a proxy for gene expression2,5. The caveat for this approach is that mammalian gene sequences are often long, with cis-regulatory elements far upstream and downstream of a given gene1,6. This makes training a machine-learning model difficult, as longer sequences mean sacrificing prediction resolution and clarity1. Borzoi was constructed using the established deep learning architecture Enformer, which was originally trained to predict enhancer-promoter interactions based on DNA sequence1,3. To specialize this model, the neural network was trained on tissue-specific data from GTEx, allowing for a localized prediction of differential splicing, polyadenylation, or transcriptional regulation to be made1. Both the TSS and the 3’ UTR are essential for gene regulation, with the latter also playing a role in polyadenylation signals and the differential splicing of different isoforms for many genes1. This prompted Linder et al. to pay specific attention to these regions when looking at RNA-seq data1. Using five GTEx tissues (whole blood, liver, brain, muscle, and esophagus), Borzoi was able to predict tissue-specific differences in gene expression with high significance across five replicates1. The researchers also compared Borzoi’s predictive ability against Enformer, especially when assessing more distal gene regulatory interactions1. Given the long stretches of DNA sequence that comprise RNA-seq data, it follows that Borzoi was able to assess sites almost twice as far away from the TSS as the core Enformer architecture alone could achieve1. In addition, combining multiple forms of epigenetics assays beyond RNA-seq in the model training data led to even higher accuracy, lending further credence to Borzoi’s predictive power1.

Amongst the most significant findings highlighted in this paper is that Borzoi performs variant interpretation tasks to a higher degree of accuracy than Enformer1. This finding is essential when considering Borzoi’s future applications: given that this specific model is trained on data taken across mammalian species and with a tissue-specific focus, Borzoi could be an immensely useful approach to identifying variants of unknown significance in essential genes that are evolutionarily conserved. The sheer amount of RNA-seq and GTEx data available is a major advantage when it comes to model training, as deep neural networks such as Borzoi are computationally intensive and require vast training datasets6. Variant analysis is time-consuming and often requires consulting multiple different assays, and making use of a single, comprehensively trained toolkit such as Borzoi could be a decisive step towards a more streamlined approach to genomic interpretation. A further direction for model validation could be testing its performance on genome-wide association study data as a valuable form of benchmarking for accuracy6,7. There is still much to be improved upon; whether the tool can reduce the false positive prediction rate and increase prediction accuracy across all layers of transcriptional regulation will effectively decide Borzoi’s role in genomic analysis.

References

  1. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat Genet 1–13 (2025) doi:10.1038/s41588-024-02053-6.
  2. Muhammad, I. I., Kong, S. L., Akmar Abdullah, S. N. & Munusamy, U. RNA-seq and ChIP-seq as Complementary Approaches for Comprehension of Plant Transcriptional Regulatory Mechanism. Int J Mol Sci 21, 167 (2019).
  3. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 18, 1196–1203 (2021).
  4. Zeng, T. & Li, Y. I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biology 23, 103 (2022).
  5. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57–63 (2009).
  6. Alharbi, W. S. & Rashid, M. A review of deep learning applications in human genomics using next-generation sequencing data. Human Genomics 16, 26 (2022).
  7. Sasse, A. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 55, 2060–2064 (2023).

Small populations, big impacts: unravelling complicated heritability within bottlenecks

Lathursha Kalaranjan

Bottlenecked subpopulations are influenced by initial population size and time since contraction, which uniquely affect their gene pools and heritability patterns of complex traits. The frequencies of causal alleles are altered compared to larger reference populations, thereby suggesting conditions that are underrepresented in current databases.

Human populations have been shaped by hundreds to thousands of generations’ worth of migration and drift, altering our already complex genetic characteristics. Some populations have grown, and others have found a stable middle ground, but what happens when a population is drastically isolated?

Genome-wide association studies (GWAS) investigate thousands of common single nucleotide polymorphisms (SNPs) across various populations to identify associations between genetic markers and complex traits1,2. Genotype-phenotype associations can be used to assess genetic risks and inform health strategies. The issue is that variant frequencies differ within and between populations. Smaller populations with reduced genetic diversity face additional complexities due to different selection behaviours, resulting in varying allele frequencies and accumulations of deleterious variants3,4. This can skew the results of GWAS, thereby complicating its actionability. Thus, it is crucial to investigate underrepresented populations to understand the relationship between demographic history and the heritability of complex traits.

Isolated or bottlenecked populations face complexities that aren’t necessarily captured by GWAS. These populations include Icelandic, Ashkenazi Jewish, Amish, and South Asian subpopulations5. Current GWAS use reference panels from larger populations, which exhibit differences that don’t reflect such isolates. Thus, Taylor & Lawson employ a simulation-based approach to characterize the evolution of complex traits and understand the genetic response to different environmental conditions1.

Figure 1. Visual demonstration of the computational generation of various bottlenecked populations from the reference population (N = 10,000). Bottlenecked subpopulations were randomly obtained to satisfy different conditions: initial population size (N) and time of contraction (T). Initial population size refers to the number of individuals in a subpopulation at the point of contraction, that is, the founder size. Time of contraction refers to the point at which the subpopulation migrated or split from a larger population, that is, the generation at which separation occurred. The grey barrier denoting the separation simulates real-world effects such as migration, physical isolation, or social substructuring. The primary consequence of the bottlenecking event is a reduction in the gene pool, as visualized. Figure made using BioRender and adapted from Taylor & Lawson, 20241.

Taylor & Lawson generated a reference population of 10,000 individuals, from which subpopulations were generated to satisfy different conditions (Figure 1)1. A random sample of SNPs was selected as causal variants with frequencies assumed to be under selection. Smaller population sizes and older population splits were associated with reduced heritability (Figure 2). Ultimately, bottleneck size and time of population splitting play an influential role in reducing genetic variation1. This simulation provides insights into real-world bottlenecked populations, which experience drastic differences in their gene pools and in their subsequent risks of traits. For example, some South Asian subpopulations experience homozygous effects on genes that may increase the risk of coronary artery disease5. However, not all isolates are equivalent, so the heritability within South Asian subpopulations cannot be applied to Icelandic populations. Therefore, understanding how specific characteristics affect heritability can inform predictive factors for population stratification. Theoretically, this could reshape how GWAS are applied, thereby improving the accuracy of inferences and enabling precise clinical applications by tailoring variant interpretation.
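One way to see why losing variation lowers heritability is through the textbook additive model, in which each causal SNP with allele frequency p and effect size β contributes 2p(1 − p)β² to the genetic variance. The Python sketch below is an illustration under stated assumptions, not the simulation used by Taylor & Lawson: it assumes Hardy-Weinberg proportions, independent loci, purely additive effects and a fixed environmental variance, and all frequencies, effect sizes and function names are hypothetical.

```python
def additive_variance(freqs, effects):
    """Additive genetic variance: sum of 2p(1-p)*beta^2 over causal SNPs."""
    return sum(2 * p * (1 - p) * beta ** 2 for p, beta in zip(freqs, effects))


def narrow_sense_heritability(freqs, effects, env_variance):
    """h^2 = V_A / (V_A + V_E), with V_E treated as a fixed assumption."""
    v_a = additive_variance(freqs, effects)
    return v_a / (v_a + env_variance)


# Hypothetical causal SNPs with equal effect sizes.
effects = [1.0, 1.0, 1.0]
reference_freqs = [0.5, 0.3, 0.1]
# After a bottleneck, drift shifts frequencies and the rarest allele is lost.
bottleneck_freqs = [0.6, 0.1, 0.0]

h2_reference = narrow_sense_heritability(reference_freqs, effects, env_variance=1.0)
h2_bottleneck = narrow_sense_heritability(bottleneck_freqs, effects, env_variance=1.0)
print(f"reference h2:  {h2_reference:.3f}")
print(f"bottleneck h2: {h2_bottleneck:.3f}")
```

A SNP that drifts to a frequency of 0 or 1 contributes nothing to the genetic variance, so in this simplified picture heritability falls as causal variants are lost.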

Figure 2. Major findings regarding the influence of initial population size and time of contraction, based on Taylor & Lawson, 20241. a) Population size was assessed at T=200 generations ago. Lower initial population sizes were associated with reduced heritability of complex traits, due to the reduced gene pool that results in a reduction in genetic variation. b) Contraction time, or the point at which the population split, was assessed at N=1,000 individuals. Older points of contraction were associated with a reduced mean heritability of complex traits, due to the increased loss of SNPs that results in a reduction in genetic variation. Figure made using BioRender.

The initial population size contributes to the differences in allele frequencies which, consequently, influence heritability. Founder events with fewer individuals contributing to a new gene pool can result in potent genetic drift, thereby reducing genetic diversity (Figure 2a)4. This can result in the Wahlund effect, where subpopulations experience excessive homozygosity due to the increased frequencies of recessive alleles4,6. Thus, smaller populations may experience a heightened risk of recessive disorders. A prime example is the Ashkenazi Jewish (AJ) population, which went through a narrow founding event of fewer than 1,000 chromosomes7. Recent studies have found rare variants unique to this population that have not been fully captured by variant databases like gnomAD, suggesting a lack of coverage for bottlenecked populations7. This highlights the need to account for bottlenecks due to the prevalence of rare variants.
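The excess homozygosity associated with the Wahlund effect can be illustrated with a small calculation. The sketch below assumes equally sized subpopulations that are each in Hardy-Weinberg equilibrium and compares the recessive homozygote frequency observed across them with the frequency expected if they formed one randomly mating population. The function name and allele frequencies are hypothetical, not values from the cited studies.

```python
def wahlund_homozygotes(subpop_freqs):
    """Return (observed, expected) recessive-homozygote frequencies.

    observed: average of q^2 across equally sized subpopulations.
    expected: (mean q)^2, as if the pooled population mated at random.
    """
    n = len(subpop_freqs)
    mean_q = sum(subpop_freqs) / n
    observed = sum(q * q for q in subpop_freqs) / n
    expected = mean_q ** 2
    return observed, expected


# Two hypothetical subpopulations with different recessive allele frequencies.
observed, expected = wahlund_homozygotes([0.05, 0.35])
excess = observed - expected  # equals the variance of q across subpopulations
print(f"observed = {observed:.4f}, expected = {expected:.4f}, excess = {excess:.4f}")
```

The excess is never negative, because it equals the variance in allele frequency among subpopulations; the more the subpopulations have drifted apart, the larger the surplus of homozygotes.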

The point at which populations split contributes to the gene pool, which similarly affects heritability of traits. Older populations have a greater number of generations across which SNPs can be lost, therefore reducing the average heritability (Figure 2b). This can elevate frequencies of extreme SNPs due to non-random selection, since individuals from bottlenecks repopulate within a reduced gene pool1. The oldest example of bottlenecking is the out-of-Africa (OOA) bottleneck when human subpopulations migrated out of Africa roughly 2,000 generations ago1. This OOA bottleneck resulted in reduced heterozygosity in non-African populations3,8. This further suggests the importance of considering demographic histories, since the contraction point may stabilize deleterious variants.
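The combined influence of founder size and time since the split can be sketched with the classic drift approximation, in which expected heterozygosity decays by a factor of (1 - 1/(2N)) each generation. This assumes a constant population size and no mutation, migration, or selection, so it is a simplification and not the model used by Taylor & Lawson; the starting heterozygosity of 0.5 is an arbitrary illustrative value.

```python
def expected_heterozygosity(h0, n_individuals, generations):
    """Expected heterozygosity after drift: H_t = H_0 * (1 - 1/(2N))^t."""
    return h0 * (1 - 1 / (2 * n_individuals)) ** generations


# Vary founder size at a fixed split time, then split time at a fixed size.
for n in (100, 1_000, 10_000):
    h = expected_heterozygosity(0.5, n, generations=200)
    print(f"N = {n:>6}, T = 200 generations: H = {h:.3f}")

for t in (50, 200, 2_000):
    h = expected_heterozygosity(0.5, 1_000, generations=t)
    print(f"N =   1000, T = {t:>4} generations: H = {h:.3f}")
```

Under these assumptions, both a smaller founder size and an older split lower the expected heterozygosity, which mirrors the direction of the trends summarized in Figure 2.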

The primary issue with isolated populations is that rare variants undergo strong selection. A severe reduction in population size can normalize mutant alleles or result in complete loss of SNPs4. As mutant traits ascend to relatively high frequencies, variance is reduced, which affects the inferences we can draw from GWAS9. GWAS models use reference panels that are predominantly collected from European individuals and fail to account for isolates or bottlenecks2. Such isolates cannot be adequately grouped with reference populations, considering they exhibit different levels of genotypic variants5. As a result, it is essential to assess the prevalence of rare variants across underrepresented populations to identify underlying biological mechanisms. Clinically, this can inform population-specific panelling to enhance the filtering for variants that are likely, or unlikely, to associate with a given disease2,7.

So, what are the next steps for population-based variant analysis? In recent years, population-specific reference panels have been developed to account for underrepresented populations. One example is the African-specific reference panels, which have been assessed to consider various subpopulations within the continent10. Given what we know about the sheer diversity in Africa, the rectification of large-scale panels and the newfound representation of subpopulations allow for promising actionability from the resulting GWAS8,10.

Future studies can benefit from using empirical data to examine and compare subpopulations to reference populations. Simulation models are useful for forward predictions; however, they fail to account for real-world variables affecting heritability. One example is allele frequencies, which vary with genetic drift and are therefore not known in empirical studies but are assumed in simulations1. Collecting empirical data is a challenging yet imperative feat, as it better defines genetic responses to environmental changes and disease susceptibility. Improving GWAS interpretations to account for various demographic histories allows for accurate actionability of the findings.

The future of population genetics lies in population-scale genotyping projects of underrepresented populations and populations that exhibit unique substructuring. Drawing comparisons between such isolates and a larger reference population may reveal differences in phenotypic prevalence at a genotypic level5. Introducing national biobanks and opening the door for empirical study of underrepresented populations is the future for genetic discovery and clinical applications for personalized treatments.

References

1. Taylor, C. S. & Lawson, D. J. Heritability of complex traits in sub-populations experiencing bottlenecks and growth. J. Hum. Genet. 69, 329–335 (2024).

2. Quick, C. et al. Sequencing and imputation in GWAS: Cost-effective strategies to increase power and genomic coverage across diverse populations. Genet. Epidemiol. 44, 537–549 (2020).

3. Gravel, S. When Is Selection Effective? Genetics 203, 451–462 (2016).

4. Oliver, M. K. & Piertney, S. B. Selection Maintains MHC Diversity through a Natural Population Bottleneck. Mol. Biol. Evol. 29, 1713–1720 (2012).

5. Wall, J. D. et al. South Asian medical cohorts reveal strong founder effects and high rates of homozygosity. Nat. Commun. 14, 3377 (2023).

6. Overall, A. D. J. The Influence of the Wahlund Effect on the Consanguinity Hypothesis: Consequences for Recessive Disease Incidence in a Socially Structured Pakistani Population. Hum. Hered. 67, 140–144 (2008).

7. Lencz, T. et al. High-depth whole genome sequencing of an Ashkenazi Jewish reference panel: enhancing sensitivity, accuracy, and imputation. Hum. Genet. 137, 343–355 (2018).

8. Henn, B. M. et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc. Natl. Acad. Sci. 113, E440–E449 (2016).

9. Simons, Y. B., Bullaughey, K., Hudson, R. R. & Sella, G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biol. 16, (2018).

10. Bentley, A. R., Callier, S. L. & Rotimi, C. N. Evaluating the promise of inclusion of African ancestry populations in genomics. Npj Genomic Med. 5, 1–9 (2020).

Copy-Paste Genes: Revealing a Missing Piece of the Cancer Puzzle

Rohan Khan

Methylation patterns of “copy-paste” DNA elements reveal links to cancer, uncover their impact on nearby DNA, and open new avenues for therapeutic strategies.

The human genome consists not only of genes that encode proteins but also of mobile genetic elements, sequences which can relocate to different parts of the genome. One of these is long interspersed element-1 (L1), which comprises around 17% of total human DNA1. Certain L1s have the ability to “copy-paste” themselves into different genomic locations through a process known as autonomous retrotransposition (Fig. 1)2. While this mobility plays a role in driving genetic variation and human evolution, it is also implicated in disease by disrupting genome stability3. It is understood that epigenetic modifications – chemical changes to DNA that do not alter the sequence – regulate L1 retrotransposition activity, but the specific mechanisms have been unclear4. Now, Lanciano and Philippe et al. delve deeper into this topic, investigating how L1s are epigenetically modified across different cell types5. Their findings provide a deeper insight into L1 regulation, potentially opening the door to new therapeutic strategies that take advantage of L1 epigenetics.

Figure 1 | The autonomous retrotransposition activity of human-specific long interspersed element-1 (L1HS/LINE1). (a) The L1HS DNA segment is transcribed into an mRNA sequence (red) and then (b) translated into two different proteins, ORF1 (blue) and ORF2 (pink), which bind to the mRNA of origin (red). (c) ORF2 then cuts into an unassociated DNA strand, at a new position (d) and primes the reverse transcription of the mRNA into DNA that can be incorporated into the strand. (e) Novel insertions created by this process can disrupt functional gene sequences, drive genomic instability, and ultimately create genetic variants which could be beneficial or harmful. Figure adapted from Gasparotto et al., 20236.

A key epigenetic modification that regulates L1 activity is methylation, a process by which a methyl group is added to certain DNA bases, silencing the corresponding sequence. However, fundamental questions surrounding L1 methylation remain: When are L1s activated in humans? Which cells are L1s active in? And how do L1 modifications impact the genome beyond retrotransposition?

To address these gaps, the researchers analyzed methylation patterns of individual L1s in multiple cell lines such as embryonic cells, stem cells, and cancer cells5. Previously, it was believed that all L1 elements, including those capable of copy-paste retrotransposition, known as L1HS (L1 human-specific elements), were hypomethylated and active in tumor cells7. In fact, this assumption was proposed for use in cancer diagnosis, where L1 hypomethylation would act as a biomarker for tumors8. However, Lanciano and Philippe et al. challenge this view.

Their findings revealed that L1HS elements were hypomethylated – meaning they had lower methylation levels and potentially increased activity – in most stem and embryonic cells. In contrast, they were hypermethylated in cancer cells, likely leading to decreased activity5. This contradicts the hypothesis that L1HS elements are hypomethylated in cancer cells to drive genomic instability by creating novel insertions and further fueling tumor progression9. Instead, these new findings suggest that L1HS hypermethylation may act as a regulatory mechanism for the cancer cells, potentially protecting them from novel L1 insertions that could compromise their survival.

Given this newfound understanding of L1HS methylation in cancer cells, an intriguing question arises: Could reactivating L1HS elements through demethylation serve as a form of cancer therapy?

The researchers conducted further experiments in which demethylation of L1s was induced in cells5. Notably, this did not lead to increased L1 expression5. These results suggest that while methylation is an important part of managing L1 expression, it is not the only factor to consider. Instead, other regulatory mechanisms which control how tightly the DNA is packed, such as histone modification or chromatin accessibility, are also likely at play in suppressing L1 activity.

This finding, that demethylation alone does not increase L1 expression, highlights the intricacies of L1 regulation and indicates that potential therapeutic strategies targeting L1 would need to consider other regulatory mechanisms as well. As these mechanisms become better understood, future therapies could potentially activate L1HS in cancer cells and employ antigens derived from L1HS-encoded proteins to stimulate an immune response against the cancer cells10. Future research should investigate methods to specifically activate L1HS in cancer cells such that L1HS reactivation is cancer cell specific to prevent an increase in L1 insertions that disrupt functional genes.

Another notable finding of this study was that hypomethylation of L1s is accompanied by reduced methylation of surrounding DNA segments up to 300 base pairs from the L1. The researchers referred to this as a “sloping shore” effect, implying a gradual decrease in methylation levels closer to hypomethylated L1 DNA segments (Fig. 2)5. However, the study did not address whether these methylation changes had any functional consequences, leaving open the question of whether the sloping shore has measurable effects on gene expression.
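One simple way to look for such a gradient in per-site methylation calls is to average methylation in fixed-width windows moving away from the element. The sketch below is only an illustration of that idea and not the analysis pipeline used in the study; the function name, the 100 bp window, and the toy methylation values are all hypothetical.

```python
from collections import defaultdict


def mean_methylation_by_distance(cpg_calls, bin_size=100):
    """Average methylation per distance window.

    cpg_calls: iterable of (distance_bp_from_element, methylation_fraction).
    Returns a dict mapping each window start (bp) to its mean methylation.
    """
    totals = defaultdict(lambda: [0.0, 0])  # window start -> [sum, count]
    for distance, fraction in cpg_calls:
        window_start = (distance // bin_size) * bin_size
        totals[window_start][0] += fraction
        totals[window_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}


# Invented toy calls upstream of a hypomethylated element (not real data).
toy_calls = [(10, 0.2), (60, 0.4), (150, 0.6), (220, 0.8), (280, 0.9), (350, 0.9)]
profile = mean_methylation_by_distance(toy_calls)
for window_start, level in profile.items():
    print(f"{window_start:>3}-{window_start + 99} bp: mean methylation {level:.2f}")
```

In a profile like this, a sloping shore would appear as mean methylation that rises with distance from the element and then levels off.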

Figure 2 | “Sloping Shore” Effect of L1 Elements. Through Oxford Nanopore sequencing, it was discovered that the methylation status of L1 elements influences the methylation of DNA segments up to 300 base pairs (bp) upstream of the L1 element. This raises the possibility of L1 methylation patterns influencing the regulation of nearby genes. Methylated bases are represented by black markers and demethylated bases are represented by white markers. Figure adapted from Lanciano et al., 20245.

This effect could offer a further explanation for the aforementioned cancer cell L1 methylation analysis, where L1s without retrotransposition activity were observed to be hypomethylated. Through the sloping shore effect, the hypomethylation of L1s may lead to demethylation of nearby gene promoters, activating or overexpressing certain genes. For example, if an L1 is located near the promoter of an oncogene like KRAS – a regulator of cell proliferation pathways – demethylation of the L1 could lead to reduced methylation of the KRAS promoter, thereby increasing the oncogene’s expression, driving cell proliferation, and ultimately contributing to colorectal cancer. While this is speculative, future research should explore whether sloping-shore induced methylation/demethylation has a significant impact on nearby gene expression.

The work from Lanciano and Philippe et al. provides a refined understanding of the epigenetic regulation of L1 elements. Their research has challenged previous assumptions about the relationship between cancer and L1 methylation and has showcased new functionality of L1s through the sloping shore effect. These findings also raise important questions that future research could tackle. How do environmental factors like diet and toxin exposure influence L1 methylation? Are L1 epigenetic modifications causative of cancer or a consequence? And how can we develop methods to specifically control L1 activity in a targeted manner? Answers to these questions could have notable implications for cancer biology, genome evolution, and potential epigenetic therapies, providing more perspective on the role of copy-paste DNA elements in health and disease.

References

1. Ardeljan, D., Taylor, M. S., Ting, D. T. & Burns, K. H. The human LINE-1 retrotransposon: an emerging biomarker of neoplasia. Clin. Chem. 63, 816–822 (2017).

2. Thomas, C. A., Paquola, A. C. M. & Muotri, A. R. LINE-1 retrotransposition in the nervous system. Annu. Rev. Cell Dev. Biol. 28, 555–573 (2012).

3. Roy, N. et al. Elevated expression of the retrotransposon LINE-1 drives Alzheimer’s disease-associated microglial dysfunction. Acta Neuropathol. (Berl.) 148, 75 (2024).

4. Mobile genomics: tools and techniques for tackling transposons | Philosophical Transactions of the Royal Society B: Biological Sciences. https://royalsocietypublishing.org/doi/10.1098/rstb.2019.0345.

5. Lanciano, S. et al. Locus-level L1 DNA methylation profiling reveals the epigenetic and transcriptional interplay between L1s and their integration sites. Cell Genomics 4, 100498 (2024).

6. Gasparotto, E. et al. Transposable Elements Co-Option in Genome Evolution and Gene Regulation. Int. J. Mol. Sci. 24, 2610 (2023).

7. Baylin, S. B. & Jones, P. A. Epigenetic Determinants of Cancer. Cold Spring Harb. Perspect. Biol. 8, a019505 (2016).

8. Ardeljan, D., Taylor, M. S., Ting, D. T. & Burns, K. H. The Human Long Interspersed Element-1 Retrotransposon: An Emerging Biomarker of Neoplasia. Clin. Chem. 63, 816–822 (2017).

9. Alves, G., Tatro, A. & Fanning, T. Differential methylation of human LINE-1 retrotransposons in malignant cells. Gene 176, 39–44 (1996).

10. Jung, H., Choi, J. K. & Lee, E. A. Immune signatures correlate with L1 retrotransposition in gastrointestinal cancers. Genome Res. 28, 1136–1146 (2018).

From Baby to Adult: 9 Gene Variants that Affect Birthweight

Yusra Khan

A new study uses exome-wide association analyses to identify multiple gene variants linked to birth weight and subsequent health.

Chubby or not, babies are cute. But cuteness aside, does the size of a baby at birth impact their current and future health? A study published in Nature Communications sheds light on this question by identifying rare genetic variants that influence birth weight. In 2025, Kentistou and colleagues conducted the research using whole-exome sequencing data from the UK Biobank, revealing how 9 genes involved in insulin-like growth factor (IGF) signaling, placental function, and fat metabolism contribute to fetal growth1. These findings offer insights not just into birth weight itself but also into the lifelong health risks associated with being born too small or too large, like obesity, diabetes, and cardiovascular disease.

Birth weight is a key indicator of neonatal health and is influenced by genetic and environmental factors1. While previous studies have identified common genetic variants affecting birth weight, this study investigates the role of rare variants, which are mutations that occur in a small percentage of the population but can have significant biological effects. Until now, the specific genetic mechanisms underlying the variation in birth weight remained largely unknown. By filling this research gap, this study provides a starting point for future research into fetal growth patterns and potential interventions to mitigate risks associated with abnormal birth weight.

Researchers identified nine key genes associated with birth weight, classified as affecting the fetus only, the mother only, or both fetus and mother (Figure 1). Among them, IGF1R and PAPPA2 play crucial roles in IGF signaling, a pathway that regulates fetal growth and energy metabolism1. These genes influence the availability of IGF, a hormone essential for normal intrauterine development1. One type of IGF is IGF1, an important regulator of growth in infancy and childhood that promotes linear bone growth, muscle mass, and adipocyte maturation2. PAPPA2, in turn, is a protease that increases IGF1 bioactivity by cleaving IGF1 from its binding proteins2. Disruptions in IGF signaling can lead to growth restriction or overgrowth, influencing birth weight and subsequent health issues.

Variants in PPARG, INHBE, NYNRIN and ACVR1C show that fat metabolism is important for determining birth weight and adult height1. These genes influence how the fetus stores and utilizes fat, potentially explaining variations in neonatal body composition1. PPARG mutations have also been linked to obesity and diabetes in adulthood3. Another study found a high correlation between PPARG expression and glycemic control, leading to adult onset of diabetes3. NYNRIN was hypothesized to be involved in placental development, which has a direct link to the growth of the fetus7. Interestingly, rare loss-of-function (LoF) variants in INHBE and ACVR1C have been linked to favorable body fat distribution in adulthood, suggesting that birth weight may have a lasting impact on metabolic health4. It would be interesting to see whether these genes play a role in phenotypes at other life stages, such as adolescence, and whether these variants are expressed differently as someone ages. Going further, these genes could also serve as biomarkers or as therapeutic targets.

The study also highlights the importance of placental function through the NOS3, NRK, and ADAMTS8 genes. These genes were found to have a concordant effect on birth weight and height1. NOS3 and ADAMTS8 are involved in maternal and fetal blood pressure and have been linked to hypertension in adulthood, while NRK appears to influence placental development1. Variants in the NOS3 gene have a significant effect on cardiovascular developmental processes, specifically increasing the risk of congenital heart disease5. These genes, too, could be explored as biomarkers or therapeutic targets.

One limitation of this study was the lack of diversity, as they only used samples from the UK Biobank and compared results to an Icelandic population. However, the researchers mentioned increasing sample size and diversity beyond the UK Biobank as a priority for future research to assess genetic associations in different ethnic populations worldwide1. A possible route for future research is the interaction between these genetic variants and environmental factors such as maternal diet, stress, and lifestyle. Could personalized prenatal care, based on a mother’s and fetus’s genetics, help mitigate risks associated with abnormal birth weight? Exploring how these genes react to different environmental stimuli could aid in tracking the long-term effects into adulthood.

The significance of this research extends beyond genetics, with implications for pregnancy management and long-term health. Babies born with extremely low or high birth weights are at increased risk for metabolic disorders, cardiovascular disease, and other health complications later in life1. By identifying specific genetic pathways through which rare variants are associated with birth weight, this study lays the groundwork for targeted interventions that could optimize pregnancy outcomes and long-term health. This could be explored through early screening and predictive modeling. One concept this could apply to is the fetal-insulin hypothesis, which states that lower birth weight and type 2 diabetes onset in adulthood are caused by the same genotype6. If a mother has an inheritable mutation that affects insulin function, the fetus will immediately be affected, as insulin secretion and resistance are present from conception6. If doctors can identify high-risk pregnancies based on genetic screening, they could implement early interventions, like nutritional adjustments or closer monitoring, for better chances of reducing risks.

This study marks a significant step forward in understanding the genetic mechanisms of birth weight, connecting fetal genetics, maternal biology, and long-term health. As genetic research advances, the study’s findings have potential for leading to better health outcomes for both mothers and babies.

Figure 1: A summary chart illustrating the 9 genes identified in this study. They are grouped by the affected individual(s): fetus only (A), fetus and mother (B), or mother only (C), and then by gene. Each gene is described with its function and overall outcome. Figure created on BioRender, information taken from Kentistou et al., 2025.

References

  1. Kentistou, K. A. et al. Rare variant associations with birth weight identify genes involved in adipose tissue regulation, placental function and insulin-like growth factor signalling. Nature Communications 16, (2025).
  2. Upners, E. N. et al. Dynamic Changes in Serum IGF-I and Growth During Infancy: Associations to Body Fat, Target Height, and PAPPA2 Genotype. The Journal of Clinical Endocrinology & Metabolism 107, 219–229 (2021).
  3. Darwish, N. M., Gouda, W., Almutairi, S. M., Elshikh, M. S. & Morcos, G. N. B. PPARG expression patterns and correlations in obesity. Journal of King Saud University – Science 34, 102116 (2022).
  4. Deaton, A. M. et al. Rare loss of function variants in the hepatokine gene INHBE protect from abdominal obesity. Nature Communications 13, 4319 (2022).
  5. Yi, K. et al. Association between NOS3 gene polymorphisms and genetic susceptibility to congenital heart Disease: A systematic review and meta-analysis. Cytokine 173, 156415–156415 (2024).
  6. Hughes, A. E., Hattersley, A. T., Flanagan, S. E. & Freathy, R. M. Two decades since the fetal insulin hypothesis: what have we learned from genetics? Diabetologia 64, 717–726 (2021).
  7. Plianchaisuk, A. et al. Origination of LTR Retroelement–Derived NYNRIN Coincides with Therian Placental Emergence. Molecular Biology and Evolution 39, (2022).