Hi guys,
Would love some feedback on my RNA-seq workflow. I've got to analyze NGS dataset, and basically, I'm a noob in NGS - a bench researchers, who wants to move closer to bioinformatics.
The goal of the experiment is to find differentially expressed genes between two treatment groups, I've got ~ 10 mil reads per sample, 36 cycles, barcoded.
Here is the workflow:
1. Examined quality in FastQC;
2. Trimmed adapters (I used cutadapt program);
3. Aligned reads to the genome with bowtie: bowtie -v 2 --best -s <ebwt> <reads> <sam>
4. Converted sam to bam:
samtools -view -bS <sam> <bam>
5. Downloaded refseq gene annotation from UCSC in bed format from gene and prediction tracks using UCSC table browser;
6. This step bothers me a bit. As far as I understand UCSC views genes as transcripts, so there are multiple entries listed for the same or closely overlapping genomic coordinates. As the result I'll end up with the same read counts vs basically same gene, which is listed under multiple transcript names. I suspect this could mess with the size factors estimation in DESeq.
So I merged overlapping transcripts in order to avoid this ambiguity in read counting using bedtools software, which I really came to appreciate:
mergeBed -nms -i <bed> > output_bed: -nms option lists all of the merged entries opposite to new set of genomic coordinates
The problem with merging is that some genuinely different genes may have small overlaps and will be merged into one feature. But I hope that this is a minor risk and I'll be able to examine differentially expressed genes manually.
7. Finally, I've obtained read count for each of the merged features using bedtools:
intersectBed -abam <bam> -b refseq_exons_hg19.bed -bed -split -f 1.0 | intersectBed -a stdin -b merged_refseq_hg19.bed | coverageBed -a stdin -b merged_refseq_hg19.bed > file.coverage
This sequence of commands should obtain only those reads which have 100% overlaps with known refseq exons (uploaded from UCSC), intersect those with whole genes and count raw read coverage for each gene;
8. And lastly, I'm going to use DESeq to detect differential expression.
Thanks in advance!
Slava.
Would love some feedback on my RNA-seq workflow. I've got to analyze NGS dataset, and basically, I'm a noob in NGS - a bench researchers, who wants to move closer to bioinformatics.
The goal of the experiment is to find differentially expressed genes between two treatment groups, I've got ~ 10 mil reads per sample, 36 cycles, barcoded.
Here is the workflow:
1. Examined quality in FastQC;
2. Trimmed adapters (I used cutadapt program);
3. Aligned reads to the genome with bowtie: bowtie -v 2 --best -s <ebwt> <reads> <sam>
4. Converted sam to bam:
samtools -view -bS <sam> <bam>
5. Downloaded refseq gene annotation from UCSC in bed format from gene and prediction tracks using UCSC table browser;
6. This step bothers me a bit. As far as I understand UCSC views genes as transcripts, so there are multiple entries listed for the same or closely overlapping genomic coordinates. As the result I'll end up with the same read counts vs basically same gene, which is listed under multiple transcript names. I suspect this could mess with the size factors estimation in DESeq.
So I merged overlapping transcripts in order to avoid this ambiguity in read counting using bedtools software, which I really came to appreciate:
mergeBed -nms -i <bed> > output_bed: -nms option lists all of the merged entries opposite to new set of genomic coordinates
The problem with merging is that some genuinely different genes may have small overlaps and will be merged into one feature. But I hope that this is a minor risk and I'll be able to examine differentially expressed genes manually.
7. Finally, I've obtained read count for each of the merged features using bedtools:
intersectBed -abam <bam> -b refseq_exons_hg19.bed -bed -split -f 1.0 | intersectBed -a stdin -b merged_refseq_hg19.bed | coverageBed -a stdin -b merged_refseq_hg19.bed > file.coverage
This sequence of commands should obtain only those reads which have 100% overlaps with known refseq exons (uploaded from UCSC), intersect those with whole genes and count raw read coverage for each gene;
8. And lastly, I'm going to use DESeq to detect differential expression.
Thanks in advance!
Slava.
Comment