SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Mapping reads to reference genome + count reads of genes cumulonimbus RNA Sequencing 12 10-02-2013 08:07 AM
BWA-MEM: output mapped reads larger than input reads Kennels Bioinformatics 9 09-05-2013 05:27 PM
Is Separose bead protein G sufficient for ChIP-seq analyse? fangou Epigenetics 2 04-26-2013 09:27 PM
duplicate reads in Illumina short, single end reads of RNAseq data inbarpl Bioinformatics 4 05-22-2012 08:36 AM
Vector trimming: are flanking sequences sufficient? sulicon Bioinformatics 1 09-20-2010 07:02 AM

Reply
 
Thread Tools
Old 12-23-2013, 12:22 PM   #1
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default Sufficient Reads

Hello.

I am doing fastQC on a batch of samples and am wondering how many millions of reads suffices?

One sample passed QC with 13 x 10^6 reads (roughly 13 million)

over 20 million reads and of good quality is ideal. but what about 13 million?
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 01:36 PM   #2
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390
Default

That really depends what you're doing with the reads as to whether there are enough reads... More information is required.
Bukowski is offline   Reply With Quote
Old 12-23-2013, 01:45 PM   #3
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

I am doing variant calling, and will convert these files into VCF files, and do SNP calling for analysis of any SNP's.
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 01:46 PM   #4
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

I am not doing any differential expression analysis. I imagine after I get the VCF files, I will then do a pathway analysis.
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 03:56 PM   #5
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

It depends on what genome you're using, how long your reads are and whether they're paired-end or single-end and how even the coverage is. If you're dealing with say a wheat genome (17 GB) then I'd say the number of reads you have is too low. A good guide to go by is that any variations should be supported by a minimum of 10 reads (preferable both in forward and reverse).

Further information is required, but I hope this helps.
EpiBrass is offline   Reply With Quote
Old 12-23-2013, 03:58 PM   #6
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

I am dealing with a human genome (homo sapien) using the reference genome Hg19 from UCSC.

I have reads from 13 million, some in 15 million, and others above 20 million.

if 13 million is too low, then do I sacrifice quality?
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 03:59 PM   #7
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

sorry. I am doing paired end reads, and most sequence lengths are from 30-128 bp in length
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 04:00 PM   #8
EpiBrass
Member
 
Location: Australia

Join Date: Nov 2013
Posts: 16
Default

What is your current quality cut off? I wouldn't go below Phred20.

I'm going to assume your "13/15/20 millions" are different samples which can't be pooled?
EpiBrass is offline   Reply With Quote
Old 12-23-2013, 04:02 PM   #9
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

looking through my QC, most reads are around 17 million. does this suffice? or should I lower my parameters in trimmomatic?
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 04:11 PM   #10
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

1) yes they are different samples, we can not combine the reads
2) I chose phred score of 33
3) I am using illumina clip, and this is for an RNA seq experiment of bone marrow using a truseq prep kit.

here are my parameters,

java -classpath /auto/rcf-proj/sa1/software/Trimmomatic-0.32/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -threads 16 -phred33 931269_R1.fastq.gz 931269_R2.fastq.gz paired_trimmed_931269_R1.fastq.gz unpaired_trimmed_931269_R1.fastq.gz paired_trimmed_931269_R2.fastq.gz unpaired_trimmed_931269_R2.fastq.gz ILLUMINACLIP:/auto/rcf-proj/sa1/acolombo/Target_2013_229/BoneMarrows_PolyA/Sample_931269/TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 HEADCROP:15 SLIDINGWINDOW:4:10 MINLEN:30
arcolombo698 is offline   Reply With Quote
Old 12-23-2013, 04:13 PM   #11
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

I am trimming the adapters using a custom made TruSeq2-PE.fa file as well.
arcolombo698 is offline   Reply With Quote
Old 12-24-2013, 05:46 AM   #12
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390
Default

Quote:
Originally Posted by arcolombo698 View Post
I am dealing with a human genome (homo sapien) using the reference genome Hg19 from UCSC.

I have reads from 13 million, some in 15 million, and others above 20 million.

if 13 million is too low, then do I sacrifice quality?
Is it exome, whole genome, RNA-Seq, smaller targeted capture?

Dan
Bukowski is offline   Reply With Quote
Old 12-24-2013, 08:43 AM   #13
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

Thank you very much for your response

It is RNA-seq experiment.
arcolombo698 is offline   Reply With Quote
Old 12-24-2013, 08:57 AM   #14
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

If you are doing variant calling from RNA-seq data, 13M reads is enough to get sufficient read depth on a subset of the genes. Because the number of transcripts from genes varies 1000-fold, it is very difficult to get high depth from genes that are poorly expressed (and impossible to get high depth from genes that are not expressed). So for any particular number of reads, you will be able to make SNP calls for a particular number of genes, and as the number of reads increase, you'll be able to call SNPs from more genes.

edit: removed phred bit... thought was about parameters for cutting poor quality, not encoding!
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

Last edited by SNPsaurus; 12-24-2013 at 11:17 AM.
SNPsaurus is offline   Reply With Quote
Old 12-24-2013, 09:15 AM   #15
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

note that -phred33 in the trimmomatic parameters refers to the Illumina encoding for the base qualities, and not to the cutoff value.

Last edited by mastal; 12-24-2013 at 10:44 AM.
mastal is offline   Reply With Quote
Old 12-24-2013, 09:28 AM   #16
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

what is the optimal value for the phred score for rna seq data?
arcolombo698 is offline   Reply With Quote
Old 12-24-2013, 11:56 PM   #17
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

The "optimal value" is as high as possible. I think the Illumina limit at the moment is something like 43, but most people will settle for a "good enough" value that is considerably lower (e.g. 20 or 30).
gringer is offline   Reply With Quote
Old 12-25-2013, 05:48 AM   #18
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390
Default

Quote:
Originally Posted by arcolombo698 View Post
Thank you very much for your response

It is RNA-seq experiment.
It's probably good you're not doing too much on DE/transcript reconstruction as I would have said the read numbers were a little low for anything other than gene level expression testing.

As already pointed out, you should be fine to do variant calling from the expressed genes, where there is sufficient depth - but you will want to put some kind of depth threshold for the variant calling (also there will be instances of allele specific expression and presumably RNA editing in the data). With different numbers of reads per sample, obviously some samples will have better coverage than others. For differential expression studies this is not so much of an issue when the data is normalised to the number of mapped reads anyway.

Last edited by Bukowski; 12-25-2013 at 05:58 AM.
Bukowski is offline   Reply With Quote
Old 12-25-2013, 12:03 PM   #19
arcolombo698
Senior Member
 
Location: Los Angeles

Join Date: Nov 2013
Posts: 142
Default

thank you for the response
are there resources recommended that gives insight about understanding sufficient depth of reads for variant calling and other common problems? it seems sufficient to have read lengths of 13M.. i have a sample read with 7M , this seems also low.
these samples have all passed QC with no duplicated sequences (all the repeated sequences have been trimmed out...dont want duplicated sequences)

would appreciate resources on how to properly process VCF , and also how to analyze the VCF results, error checking etc...
arcolombo698 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:33 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO