Missing gene reads and sequencing in single-cell RNA-seq experiments. Credit: Nature Methods (2023). DOI: 10.1038/s41592-023-02003-s
In 2018, researchers in the Caltech laboratory of Yuki Oka, a professor of biology and a Heritage Medical Research Institute investigator, made a major discovery: They identified a type of neuron, or brain cell, that mediates thirst satiation. But they faced a problem: A recent technique called single-cell RNA sequencing (scRNA-seq) could not find those thirst-related neurons in samples of brain tissue (specifically, from a region called the median preoptic nucleus) that were known to contain them.
“We knew that the gene we had tagged with a genetic marker to label our neurons was expressed in the brain’s median preoptic nucleus, but we didn’t see the gene when we profiled that region of the brain using scRNA-seq,” Oka says. “We heard this from many colleagues: scRNA-seq was missing cell types and gene expression that they knew should be there. We started to wonder why that was.”
Identifying different cell types is crucial to understanding the vast number of functions our bodies perform, from healthy processes such as thirst sensing to cellular dysfunction in disease states. For example, many researchers are currently looking for cell types that may be linked to certain diseases, such as Parkinson’s disease. Identifying the precise cell types involved in such processes is crucial to all such studies.
Now, a collaboration between the Oka lab at the California Institute of Technology and the Allan-Hermann Pool lab at the University of Texas Southwestern Medical Center has shown how to improve a key step in scRNA-seq analysis to recover lost cell types and gene expression data that would otherwise be discarded. A paper describing the work appears in the journal Nature Methods on September 11.
“We have improved the analysis of existing single-cell RNA-seq data, revealing the expression of hundreds or sometimes thousands of genes for individual data sets,” Oka says. “It is important to enable this kind of precision because biological processes are rich and complex. Recent research has identified more than 5,000 different types of neurons in the mouse brain, and the human brain is presumably even more complex. We need our techniques to be as sensitive and comprehensive as possible.”
Understanding gene expression
There are trillions of cells in your body, each performing different functions that enable you to live your life, or in some cases lead to disease. Cells are distinguished from each other according to their function. For example, killer T cells in the immune system seek out and destroy pathogens that cause disease, neurons fire electrical signals that form the basis of brain function, and skin cells pack tightly together to form a barrier against the outside world. Researchers have now identified thousands of distinct cell types, but other unique types likely remain undiscovered.
Although cells can vary in shape and function, most cells in a given organism contain an identical genetic blueprint—the genome. The genome contains instructions on how to do any cellular task. The genes that make up the genome are written into the DNA found in the cell nucleus. Expressed genes are transcribed into RNA, which is transported out of the nucleus to the rest of the cell to carry out its functions.
In any cell (and cell type), only a certain subset of genes is expressed, or turned on, at a given time. These differences in gene expression add up to differences in cell types.
As an analogy, think of a huge library with books sorted into different sections. If you want to build an airplane, you only look at books on aviation and mechanics. If you are interested in other topics, you browse a different range of books. The cells of an organism are no different: While each cell contains the entire “library” of genes, only the genes that relate to the cell’s specialized functions are activated in that cell.
Improving gene expression estimation techniques
scRNA-seq is a powerful technology for identifying cell types. Using this method, the cell is opened and the genetic information expressed within is labeled with a molecular tag that acts as a barcode. scRNA-seq can do this quickly for thousands of cells in a single tissue sample, with each cell receiving its own unique barcode. Computational analysis can then be performed to determine which sets of genes are expressed in individual cells, and computer models can evaluate that data to look for patterns and identify distinct cell types.
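As a rough illustration of the barcoding idea, the Python sketch below tallies reads per cell and per gene. The barcodes, gene names, and the `count_by_cell` helper are hypothetical, and real pipelines also deal with sequencing errors, duplicate molecules, and far larger data sets.

```python
from collections import Counter, defaultdict

# Hypothetical toy reads: (cell barcode, gene the read was assigned to).
reads = [
    ("AAAC", "GeneA"),
    ("AAAC", "GeneB"),
    ("AAAC", "GeneA"),
    ("TTTG", "GeneA"),
    ("TTTG", "GeneC"),
]


def count_by_cell(reads):
    """Group reads by barcode and count how often each gene appears per cell."""
    counts = defaultdict(Counter)
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    return counts


counts = count_by_cell(reads)
for barcode, gene_counts in counts.items():
    print(barcode, dict(gene_counts))
```

Each barcode ends up with its own gene-count profile, which is the kind of table that downstream computer models compare to group cells into types.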
However, one problem with this technique was that some RNA-seq data were not typically included in gene expression estimates, even though they represented expressed genes.
Oka and his colleagues found that the reason is related to a problem with the so-called reference transcriptome to which researchers map sequence data. For example, researchers have studied the mouse genome extensively, labeling, or annotating, it in great detail to create a digital reference, or “transcriptome,” that maps DNA sequences to their corresponding genes.
The researchers found that this annotation should be optimized for scRNA-seq to prevent loss of gene expression information, which can arise if the tail ends of genes are poorly annotated, for example, or if there is extensive overlap between transcripts of neighboring genes. Such complications can prevent the detection of thousands of genes. (These problems are particularly apparent when using high-throughput forms of scRNA-seq, which, to reduce cost, examine only the ends of genes; most atlases created to describe the cellular complexity of our tissues rely on these methods.)
High precision and accuracy are very important when identifying distinct cell types. For example, suppose two cells each express genes A, B, C, and D, but only one of them also expresses gene E. If the sequencing technology cannot capture the expression of gene E, the data indicate that the two cells are identical when in fact they are not.
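To make the annotation problem concrete, here is a deliberately simplified Python sketch, not the authors' method. It assumes a read is counted only when it falls inside exactly one annotated gene. All coordinates, gene names, and the `assign_read` and `count_reads` helpers are hypothetical, and real tools also consider strand, exons, and introns.

```python
def assign_read(position, annotation):
    """Return the gene covering this position, or None if zero or several genes do."""
    hits = [gene for gene, (start, end) in annotation.items()
            if start <= position <= end]
    return hits[0] if len(hits) == 1 else None


def count_reads(positions, annotation):
    """Count reads per gene and tally the reads that had to be discarded."""
    counts = {gene: 0 for gene in annotation}
    discarded = 0
    for position in positions:
        gene = assign_read(position, annotation)
        if gene is None:
            discarded += 1
        else:
            counts[gene] += 1
    return counts, discarded


# Hypothetical annotation: GeneE's end is annotated too short,
# and GeneF overlaps the start of GeneG.
original = {"GeneE": (100, 500), "GeneF": (800, 1200), "GeneG": (1150, 1500)}
# Hypothetical fix: extend GeneE's end and remove the GeneF/GeneG overlap.
optimized = {"GeneE": (100, 600), "GeneF": (800, 1100), "GeneG": (1150, 1500)}

read_positions = [520, 540, 900, 1160, 1180, 1400]

print(count_reads(read_positions, original))
print(count_reads(read_positions, optimized))
```

In this toy setup the original annotation discards four of the six reads and reports no expression for GeneE at all, while the adjusted annotation assigns every read.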
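The two-cell example can be written out in a few lines of Python. This is only an illustration: the `observed` helper and the sets of detectable genes are hypothetical stand-ins for what a given reference annotation lets an analysis see.

```python
# Genes each cell truly expresses, following the example in the text.
cell_1 = {"A", "B", "C", "D", "E"}
cell_2 = {"A", "B", "C", "D"}


def observed(expressed, detectable):
    """Return the part of a cell's expression that the analysis can detect."""
    return expressed & detectable


all_genes = {"A", "B", "C", "D", "E"}
without_e = all_genes - {"E"}  # gene E is lost, e.g. through poor annotation

# With E undetectable, the two cells look the same.
print(observed(cell_1, without_e) == observed(cell_2, without_e))
# With E detectable, they can be told apart.
print(observed(cell_1, all_genes) == observed(cell_2, all_genes))
```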
Led by Pool, a former postdoctoral researcher at Caltech and first author on the study, the team improved the reference transcriptomes for the human and mouse genomes and, over several years, built a computational framework for repairing the reference transcriptomes of other organisms.
“Optimizing reference transcriptomes enables us to see cell types and conditions that we would otherwise be oblivious to,” Pool says. “For example, thanks to our improved reference versions, we are now able to observe the full range of neural populations that sense thirst, satiety and temperature, which we suspected existed in our brain areas of interest but were unable to detect. This approach will also be very useful in revealing new cellular and genetic diversity in current and upcoming cell-type atlases for the brain and other organs.”
The paper is titled “Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references.” In addition to Pool and Oka, Caltech co-authors include former research scientist Sisi Chen and Matt Thomson, assistant professor of computational biology and Heritage Medical Research Institute investigator. Helen Poldsam of the University of Texas Southwestern Medical Center is also a co-author.
More information:
Allan-Hermann Pool et al., Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references, Nature Methods (2023). DOI: 10.1038/s41592-023-02003-s
Journal information:
Nature Methods