Hello,
I have some paired-end RNA-Seq reads (length=100) sequenced using Illumina's HiSeq platform and was wondering whether I should trim/filter my reads. I've run FastQC on some of my samples and the images are attached. Here are my questions:
1) The per_base_n_content plots show many Ns in the last 10 or so positions. Is this normal? Do I need to trim the ends of the reads if I use them for differential expression analysis? What if I use them for variant calling? If trimming is necessary, what is a good strategy: trim a fixed number of bases, trim by quality score, or trim only where there's an N and the quality is low?
2) The duplication levels of my samples are all over 50%. I can use Picard to mark duplicates caused by PCR amplification or something else, but how reliable is that? Is it safe to remove the duplicates marked by that software?
3) There are some adapter sequences in my reads but the percentage is low (from 0.1% to <1%). Is it necessary to remove them?
I actually have two independent sets of data sequenced by our NGS core facility. The samples were prepared by different people for different purposes at different times, but both data sets have similar FastQC results. If they both look unusual, could it be that there was something wrong with the core facility?
Thanks,
Sylvia