Preface

While I wait for my blog articles to be transferred from another platform, I think writing new ones is more important.

Currently I am devoting myself to research, working with a very nice professor, Heyrim Cho :) (University of California, Riverside), and I hope to have a publication by the end of this year. So I am starting this series of blog posts to record and clear up my thoughts. It can also be seen as my psychological journal, and as a witness to and evidence of the article’s birth.

Finally, I hope it can provide some inspiration to readers.

First paper on RNA velocity

We all know RNA-seq, but this approach captures only a static snapshot at a point in time. When we want to analyze time-resolved phenomena such as embryogenesis or tissue regeneration, we need something more.

RNA abundance is a powerful indicator of the state of individual cells. This paper uses this indicator to define a new feature called RNA velocity, the time derivative of the gene expression state. This feature can be directly estimated by distinguishing between unspliced and spliced mRNAs in common single-cell RNA sequencing protocols.
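To make the idea concrete, here is a minimal sketch, not the paper's actual procedure. It assumes the commonly stated model in which unspliced counts u are spliced at a rate rescaled to 1 and spliced counts s are degraded at a rate gamma, so that ds/dt = u - gamma * s. The function names and the toy numbers are made up for illustration, and gamma is fit with a plain least-squares line through the origin over all cells, which is a simplification I chose.

```python
import numpy as np


def estimate_gamma(spliced, unspliced):
    """Least-squares slope through the origin for unspliced ~ gamma * spliced."""
    s = np.asarray(spliced, dtype=float)
    u = np.asarray(unspliced, dtype=float)
    denom = np.dot(s, s)
    # With no spliced counts at all there is nothing to fit; fall back to 0.
    return float(np.dot(s, u) / denom) if denom > 0 else 0.0


def velocity(spliced, unspliced, gamma):
    """ds/dt = u - gamma * s, with the splicing rate rescaled to 1 (assumption)."""
    s = np.asarray(spliced, dtype=float)
    u = np.asarray(unspliced, dtype=float)
    return u - gamma * s


# Toy counts for one gene across five cells (made-up numbers).
s = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
u = np.array([0.5, 1.0, 1.5, 3.0, 1.5])
gamma = estimate_gamma(s, u)
v = velocity(s, u, gamma)  # positive: expression rising; negative: falling
```

Cells that sit above the fitted line (more unspliced mRNA than the steady state predicts) get a positive velocity, and cells below it get a negative one.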

Here I have collected some important points:

There is a lot of theoretical and analytical detail in the original paper, but maybe we don’t need to go deep into it. Basically, RNA velocity reuses information that is already in the reads used to quantify expression. In its demo they use DentateGyrus.loom.

Loom is an efficient file format for large omics datasets. Loom files contain a main matrix, optional additional layers, a variable number of row and column annotations, and sparse graph objects. Under the hood, Loom files are HDF5 and can be opened from many programming languages, including Python, R, C, C++, Java, MATLAB, Mathematica, and Julia.

HDF5 is a file format, and h5py is a Python package for working with it. For more, see this blog: blog link.
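As a rough illustration of "Loom files are HDF5", the sketch below writes a tiny HDF5 file with h5py that mimics the Loom layout described above (a main matrix, layers, row and column annotations) and reads it back. This is a toy file, not a validated Loom file: the gene names, cell IDs, counts, and the helper names write_toy_loom_like and summarize are all made up, and I am assuming the group names matrix, layers, row_attrs and col_attrs from the Loom layout.

```python
import os
import tempfile

import h5py
import numpy as np


def write_toy_loom_like(path):
    """Write a tiny HDF5 file that mimics the Loom group layout (toy data)."""
    with h5py.File(path, "w") as f:
        # Main matrix: rows are genes, columns are cells.
        f.create_dataset("matrix", data=np.array([[1, 0, 2], [0, 3, 1]]))
        # Extra layers have the same shape as the main matrix.
        f.create_dataset("layers/spliced", data=np.array([[1, 0, 1], [0, 2, 1]]))
        f.create_dataset("layers/unspliced", data=np.array([[0, 0, 1], [0, 1, 0]]))
        # Row and column annotations, stored as byte strings.
        f.create_dataset("row_attrs/Gene", data=np.array([b"GeneA", b"GeneB"]))
        f.create_dataset("col_attrs/CellID", data=np.array([b"c1", b"c2", b"c3"]))


def summarize(path):
    """Read back the shape, layer names, and gene names."""
    with h5py.File(path, "r") as f:
        return {
            "shape": f["matrix"].shape,
            "layers": sorted(f["layers"].keys()),
            "genes": [g.decode() for g in f["row_attrs/Gene"][:]],
        }


toy_path = os.path.join(tempfile.mkdtemp(), "toy.loom")
write_toy_loom_like(toy_path)
info = summarize(toy_path)
```

For real Loom files, a dedicated reader is probably the more convenient route, but opening one with h5py like this is a quick way to see what is inside.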

Cellranger

From another paper’s source code I can tell that the original data needs to go through the Cellranger pipeline to produce the files that the velocity code needs.

So what is Cellranger? Cellranger is software from 10x Genomics: a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate feature-barcode matrices, and perform clustering and gene expression analysis. Here I summarized some of the commands (you can also find them on their website):
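As one hedged example, the sketch below only builds (and does not run) a cellranger count command line from Python. The options --id, --transcriptome, --fastqs and --sample are the ones I know from the Cellranger documentation, but required options differ between versions, so check the website for your release. The sample name and paths are hypothetical placeholders, and cellranger_count_cmd is a helper name I made up.

```python
import shlex


def cellranger_count_cmd(run_id, transcriptome, fastqs, sample):
    """Build the argument list for `cellranger count` (nothing is executed)."""
    return [
        "cellranger",
        "count",
        f"--id={run_id}",                  # name of the output folder
        f"--transcriptome={transcriptome}",  # reference package directory
        f"--fastqs={fastqs}",              # folder containing the FASTQ files
        f"--sample={sample}",              # sample name prefix of the FASTQ files
    ]


# Hypothetical sample name and paths, for illustration only.
cmd = cellranger_count_cmd(
    "sample01", "/path/to/refdata", "/path/to/fastqs", "sample01"
)
print(shlex.join(cmd))
```

To actually launch it, the list could be handed to subprocess.run(cmd, check=True) on a machine where Cellranger is installed.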

Then the reads are annotated following some rules (details in the paper).


Supplementary information: BCL (raw base calls from the sequencer) –> FASTQ (sequences plus per-base quality scores; FASTA holds sequences only, e.g. the reference) –> SAM (after read mapping) <–> BAM (compressed binary form of SAM, can be converted back with samtools)
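To show what "sequence plus quality" means in practice, here is a small sketch that parses FASTQ records (four lines each) and decodes the quality string. It assumes the common Phred+33 encoding and simple, unwrapped records; the read in the example is invented, and read_fastq and to_fasta are names I made up.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) ends the parse
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33 (assumption): quality score = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def to_fasta(read_id, seq):
    """FASTA keeps only the identifier and the sequence, dropping the qualities."""
    return f">{read_id}\n{seq}\n"


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
fasta_text = to_fasta(records[0][0], records[0][1])
```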

Some text-based formats and languages: XML, CSS, JSON, JavaScript, Java, SQL, HTML

Different protocols: FTP (for files), HTTP (for websites)

SRA(a database) – SRP(project) – SRX(experiment) – SRS(sample) – SRR(run)

Data Format: SOFT(text), MINiML(XML), TXT

GEO (Gene Expression Omnibus) – GSE (series, the whole project) – GDS (curated dataset) – GSM (sample), GPL (platform information)
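The SRA and GEO prefixes above can be turned into a small lookup, sketched here. The table only covers the prefixes listed in this post, the accession numbers in the usage lines are invented, and classify_accession is a name I made up.

```python
import re

# Prefix -> meaning, limited to the prefixes mentioned in this post.
ACCESSION_TYPES = {
    "SRP": "SRA project",
    "SRX": "SRA experiment",
    "SRS": "SRA sample",
    "SRR": "SRA run",
    "GSE": "GEO series",
    "GDS": "GEO dataset",
    "GSM": "GEO sample",
    "GPL": "GEO platform",
}


def classify_accession(accession):
    """Return what kind of record an accession refers to, or 'unknown'."""
    match = re.fullmatch(r"([A-Z]{3})(\d+)", accession.strip().upper())
    if match is None:
        return "unknown"
    return ACCESSION_TYPES.get(match.group(1), "unknown")


# Invented accession numbers, for illustration only.
kind_run = classify_accession("SRR000001")
kind_series = classify_accession("gse12345")
```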

Velocity.py

In short, for 10x Genomics data, Cellranger gives us the annotated BAM file; then the CLI is used to yield a .loom file with counts divided into spliced/unspliced/ambiguous.
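A quick sanity check on such a file is to see how the counts split across the three layers. The sketch below does this on toy matrices with NumPy; it assumes the layers have already been loaded into arrays, and both the numbers and the helper name layer_fractions are made up.

```python
import numpy as np


def layer_fractions(layers):
    """Return each layer's share of the total counts, given {name: matrix}."""
    totals = {name: float(np.sum(matrix)) for name, matrix in layers.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: total / grand_total for name, total in totals.items()}


# Toy genes x cells count matrices (made-up numbers).
toy_layers = {
    "spliced": np.array([[3, 1], [2, 0]]),
    "unspliced": np.array([[1, 0], [1, 1]]),
    "ambiguous": np.array([[0, 1], [0, 0]]),
}
fractions = layer_fractions(toy_layers)
```

If the unspliced share comes out close to zero, there is little signal for the velocity step to work with, so this is worth checking before going further.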

The next step is to get the velocity features. They provide two versions, one in R and one in Python; here I choose the Python version. There are nine datasets and several analysis pipelines. We first focus on three of them, which I will describe more in Current Research series(2). Go and check it: Current Research series(2)