Preface
While I wait to transfer my blog articles from another platform, I think writing new ones is more important.
I am currently devoting myself to research, working with a very nice professor, Heyrim Cho :) (University of California, Riverside), and I hope to have a publication by the end of this year. So I am starting this series of blog posts to record and clear up my thoughts. It can also be seen as my psychological journal, and as a witness to the paper's birth.
Finally, I hope it can provide some inspiration to readers.
The first paper on RNA velocity
We all know RNA-seq, but this approach captures only a static snapshot at a point in time. When we want to analyze time-resolved phenomena such as embryogenesis or tissue regeneration, we need something more.
RNA abundance is a powerful indicator of the state of individual cells. This paper uses that indicator to define a new feature called RNA velocity: the time derivative of the gene expression state. It can be estimated directly by distinguishing between unspliced and spliced mRNAs in common single-cell RNA sequencing protocols.
Here are some important points I collected from the paper:
- RNA velocity is a high-dimensional vector that predicts the future state of individual cells on a timescale of hours.
- The paper validates its accuracy in the neural crest lineage, demonstrates its use on multiple published datasets and technical platforms, reveals the branching lineage tree of the developing mouse hippocampus, and examines the kinetics of transcription in human embryonic brain.
- The authors expect RNA velocity to greatly aid the analysis of developmental lineages and cellular dynamics, particularly in humans.
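To make the idea of "velocity as a time derivative" concrete, here is a minimal sketch for a single gene. It assumes a simplified steady-state model, ds/dt = u - gamma * s, with the splicing rate fixed to 1, and it fits gamma by least squares through the origin over all cells (the paper's own fitting procedure is more careful than this). The counts and the function names are made up for illustration.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin for u ~ gamma * s."""
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator


def velocity(unspliced, spliced, gamma):
    """Per-cell velocity under the assumed model ds/dt = u - gamma * s."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


def extrapolate(spliced, v, dt=1.0):
    """Predicted spliced abundance after a short time step dt (clipped at 0)."""
    return [max(s + vi * dt, 0.0) for s, vi in zip(spliced, v)]


# Made-up counts for one gene in three cells (assumption, not real data).
spliced = [2.0, 4.0, 4.0]
unspliced = [1.0, 3.0, 1.0]

gamma = fit_gamma(unspliced, spliced)
v = velocity(unspliced, spliced, gamma)
future = extrapolate(spliced, v)

print(gamma)   # fitted degradation rate
print(v)       # zero: steady state, positive: gene going up, negative: going down
print(future)  # extrapolated spliced counts
```

Doing this for every gene gives each cell one velocity value per gene, which is why the velocity is a high-dimensional vector.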
The original paper contains many theoretical and analytical details, but we may not need to go into all of them. In essence, RNA velocity reuses information that is already in the sequencing reads, beyond the expression counts alone. Its demo uses DentateGyrus.loom.
Loom is an efficient file format for large omics datasets. Loom files contain a main matrix, optional additional layers, a variable number of row and column annotations, and sparse graph objects. Under the hood, Loom files are HDF5 and can be opened from many programming languages, including Python, R, C, C++, Java, MATLAB, Mathematica, and Julia.
HDF5 is a file format, and h5py is a Python package for working with it. For more, see this blog: blog link.
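Since a Loom file is just HDF5, h5py can read it directly. The sketch below writes a tiny toy file that imitates the Loom layout (a main matrix with genes as rows and cells as columns, plus layers and row/column attributes) and reads it back. The file name, gene names, cell IDs and layer names are made up, and the toy file is not guaranteed to pass a real Loom validator. It needs h5py and numpy installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical toy file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy.loom")

# Write a tiny file that imitates the Loom layout (rows = genes, columns = cells).
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.array([[1, 0, 3], [0, 2, 5]]))
    f.create_dataset("layers/spliced", data=np.array([[1, 0, 2], [0, 1, 4]]))
    f.create_dataset("layers/unspliced", data=np.array([[0, 0, 1], [0, 1, 1]]))
    f.create_dataset("row_attrs/Gene", data=np.array([b"GeneA", b"GeneB"]))
    f.create_dataset("col_attrs/CellID", data=np.array([b"c1", b"c2", b"c3"]))
    f.create_group("row_graphs")
    f.create_group("col_graphs")

# Read it back the same way you would explore any HDF5 file.
with h5py.File(path, "r") as f:
    matrix = f["matrix"][:]
    genes = [g.decode() for g in f["row_attrs/Gene"][:]]
    cells = [c.decode() for c in f["col_attrs/CellID"][:]]
    layer_names = sorted(f["layers"].keys())

print(matrix.shape, genes, cells, layer_names)
```

For real work, a dedicated reader such as loompy is probably more convenient, but plain h5py is enough to look inside a file.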
Cellranger
From another paper's source code I can tell that the original data needs to go through the cellranger pipeline to produce the files that the velocity code needs.
So what is cellranger? Cellranger is software from 10x Genomics: a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate feature-barcode matrices, and perform clustering and gene expression analysis. Here is a summary of some of its commands (you can also find them on their website):
- cellranger mkfastq: turns base call (BCL) files generated by Illumina sequencers into FASTQ files. It is a wrapper around bcl2fastq.
- cellranger count: takes FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. It can generate count matrices (commonly, rows represent genes and columns represent cells), determine clusters, and perform gene expression analysis.
- cellranger aggr: aggregates outputs from multiple runs of cellranger count, normalizing those runs to the same sequencing depth and then recomputing the counts matrices and analysis on the combined data.
- cellranger reanalyze: takes count matrices produced by cellranger count or cellranger aggr and reruns the dimensionality reduction, clustering, and gene expression algorithms using tunable parameter settings. Output is delivered in standard BAM, MEX, CSV, HDF5 and HTML formats that are augmented with cellular information.
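The count matrix mentioned above is sparse, and the MEX output is a Matrix Market text file that lists only the non-zero entries. Here is a minimal sketch of reading such a file into a dense genes x cells table, using only the standard library. The matrix content is made up, and real files are much larger and come with separate barcode and feature files that this sketch ignores.

```python
import io

# Made-up Matrix Market content: 3 genes, 2 cells, 3 non-zero entries.
MEX_TEXT = """%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 4
3 1 1
2 2 7
"""


def read_mex(handle):
    """Parse a coordinate-format Matrix Market file into a dense list of rows."""
    lines = [line.strip() for line in handle]
    # Drop blank lines and comment/header lines, which start with '%'.
    lines = [line for line in lines if line and not line.startswith("%")]
    n_genes, n_cells, n_entries = map(int, lines[0].split())
    dense = [[0] * n_cells for _ in range(n_genes)]
    for line in lines[1:1 + n_entries]:
        gene, cell, count = line.split()
        # Indices in the file are 1-based.
        dense[int(gene) - 1][int(cell) - 1] = int(count)
    return dense


counts = read_mex(io.StringIO(MEX_TEXT))
umis_per_cell = [sum(column) for column in zip(*counts)]

print(counts)         # rows are genes, columns are cells
print(umis_per_cell)  # total counts in each cell
```

In practice a sparse reader (for example scipy.io.mmread) is a better fit for real data, since a dense table of all genes by all cells would be huge.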
The reads are then annotated following some rules (details in the paper).
Supplementary information: BCL (raw base calls from the sequencer) –> FASTQ (read sequences plus per-base quality scores; FASTA is the sequence-only format, typically used for the reference) –> SAM (after read mapping) <–> BAM (compressed binary form of SAM, which samtools can convert back)
Some text-based formats and languages: XML, CSS, JSON, JavaScript, Java, SQL, HTML
Different protocols: FTP (for transferring files), HTTP (for web pages)
SRA(a database) – SRP(project) – SRX(experiment) – SRS(sample) – SRR(run)
Data Format: SOFT(text), MINiML(XML), TXT
GEO (Gene Expression Omnibus) – GSE (series, the whole project) – GDS (curated dataset) – GSM (sample), GPL (platform information)
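As a small illustration of the FASTQ step in the notes above, here is a sketch of a FASTQ reader. It assumes the simple four-lines-per-record layout and Phred+33 quality encoding (the common Illumina convention). The two reads are made up.

```python
import io

# Two made-up reads in FASTQ format: header, sequence, '+', quality string.
FASTQ_TEXT = """@read1
ACGT
+
IIII
@read2
GGCA
+
!5?I
"""


def parse_fastq(handle):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        # Phred+33: quality score = ASCII code of the character minus 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


records = list(parse_fastq(io.StringIO(FASTQ_TEXT)))
for name, sequence, qualities in records:
    print(name, sequence, qualities)
```

A FASTA file would carry only the name and sequence lines, which is why it cannot store base quality.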
Velocity.py
In short, for 10x Genomics data, Cellranger gives us an annotated BAM file, and the command-line interface (CLI) then yields a .loom file with counts divided into spliced/unspliced/ambiguous.
The next step is to get the velocity features. Two versions are provided, one in R and one in Python. Here I choose the Python version. There are nine datasets and several analysis pipelines. We first focus on three of them, which I will describe further in Current Research series(2). Go and check it out: Current Research series(2)
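To give a feeling for what "divided into spliced/unspliced/ambiguous" can mean, here is a toy classifier. It is only an illustration and NOT the actual annotation rules from the paper: it assumes a read is a list of aligned blocks, a transcript model is a list of exons, a read touching an intron counts as unspliced for that model, a read lying fully inside exons counts as spliced, and disagreement between models makes it ambiguous. All coordinates and names are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def within_any(block, intervals):
    """True if block lies completely inside one of the intervals."""
    return any(iv[0] <= block[0] and block[1] <= iv[1] for iv in intervals)


def introns_of(exons):
    """Introns are the gaps between consecutive exons of one transcript model."""
    exons = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def classify(read_blocks, transcript_models):
    """Toy rule (assumption): label per model, then combine the labels."""
    labels = set()
    for exons in transcript_models:
        introns = introns_of(exons)
        if any(overlaps(b, i) for b in read_blocks for i in introns):
            labels.add("unspliced")
        elif all(within_any(b, exons) for b in read_blocks):
            labels.add("spliced")
    if labels == {"spliced"}:
        return "spliced"
    if labels == {"unspliced"}:
        return "unspliced"
    return "ambiguous"  # models disagree, or the read fits none of them


# Two hypothetical transcript models of the same gene.
model_a = [(100, 200), (300, 400)]  # two exons with an intron at 200-300
model_b = [(100, 400)]              # one long exon covering the same region

print(classify([(180, 200), (300, 320)], [model_a, model_b]))  # junction read
print(classify([(240, 280)], [model_a]))                       # intronic read
print(classify([(240, 280)], [model_a, model_b]))              # models disagree
```

Counting such labels per gene and per cell is what would fill the three layers of the .loom file.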