Thread | Thread Starter | Forum | Replies | Last Post |
Bowtie: How to retain only uniquely mapped reads? | EstherKLather | Bioinformatics | 10 | 03-25-2018 12:39 AM |
retain multihit reads for CNV analysis | jorge | Bioinformatics | 0 | 08-30-2011 01:41 AM |
bfast for unmapped reads | Protaeus | Bioinformatics | 2 | 11-17-2010 03:35 PM |
What are the unmapped reads | beelu | Illumina/Solexa | 1 | 09-09-2010 06:18 AM |
% unmapped reads | bioinfosm | Illumina/Solexa | 8 | 07-05-2010 01:36 AM |
#1 |
Member
Location: US Join Date: Sep 2010
Posts: 14
|
I have heard that it is important for downstream analyses to retain unmapped reads. I am interested to know the reason for this recommendation.
Specifically, I am using BWA + GATK to call SNPs from Illumina data. It is not clear to me if the GATK SNP calling pipeline ever utilizes unmapped reads. We expect a large proportion of unmapped reads, so we could save a lot of disk space by getting rid of them. |
#2 |
Super Moderator
Location: US Join Date: Nov 2009
Posts: 437
|
You don't *have* to retain unmapped reads if you are only calling SNPs. Especially if you are archiving the original FASTQ files, you could remove the unmapped reads from the BAMs...
Last edited by adaptivegenome; 01-12-2012 at 09:07 PM. Reason: typo |
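As a rough illustration of what removing unmapped reads amounts to: in SAM/BAM, a read is marked unmapped by bit 0x4 of its FLAG field. The sketch below filters plain SAM text on that bit. The function name and the toy records are made up for the example; it is not part of any BWA or GATK pipeline.

```python
import io

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def drop_unmapped(sam_lines):
    """Yield header lines and mapped records; skip records with 0x4 set."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        flag = int(line.split("\t", 2)[1])  # FLAG is the second column
        if not (flag & UNMAPPED):
            yield line


# Toy, made-up SAM records for illustration only.
toy_sam = io.StringIO(
    "@SQ\tSN:chr1\tLN:1000\n"
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
)
kept = list(drop_unmapped(toy_sam))
print(len(kept))  # header + r1
```

On real BAM files the usual route is samtools rather than hand-written code, e.g. `samtools view -b -F 4 in.bam > mapped.bam` (file names here are placeholders).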
#3 |
Senior Member
Location: St. Louis Join Date: Dec 2010
Posts: 535
|
If you want to call structural variants at some point, you will probably want to keep the unmapped reads as they could cover breakpoints that prevent them from aligning.
However, if you only want to call SNPs and you are guaranteed not to care about calling anything else, then I agree with genericforms. |
#4 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
So that others can re-run the analysis. Keeping them also tells others what the real source data is. There is other information in the unmapped reads too: often viral or bacterial sequences that may be of interest (e.g. the sample carries a herpesvirus).
A classic example is paired-end RNA-seq. One read of a pair may not map, but you still need it to do paired-end processing; aligners require both mates to be there. Improvements in alignment software for something as tricky as RNA alignment are likely (someday). Another case might be a very wacky indel. Trying to align all the reads to a small area or an alternate genome build using different software might provide insight. |
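To see how many pairs have only one mate mapped, the SAM FLAG bits can be tallied. This is a minimal sketch assuming standard FLAG semantics (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function names and category labels are invented for the example.

```python
from collections import Counter

PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def mapping_status(flag):
    """Classify one record by its own and its mate's mapping state."""
    if flag & UNMAPPED:
        if (flag & PAIRED) and not (flag & MATE_UNMAPPED):
            return "unmapped_mate_mapped"
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "mapped_mate_unmapped"
    return "mapped"


def count_statuses(flags):
    """Tally mapping categories over an iterable of FLAG integers."""
    return Counter(mapping_status(f) for f in flags)


# Made-up FLAG values: two mapped mates, one half-mapped pair,
# and one pair where neither mate mapped.
counts = count_statuses([99, 147, 73, 133, 77, 141])
print(dict(counts))
```

Records in the `unmapped_mate_mapped` category are the ones that get lost if unmapped reads are stripped, leaving their mapped mates without a partner.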
#5 |
Member
Location: Oxford Join Date: Apr 2010
Posts: 51
|
If there are unmapped reads, either the mapper has made a mistake, the reference has gaps, or the sample is different from the reference in some way that the mapper cannot compensate for. The differences may be structural variants, repeats, paralogues of genes, duplications of regions, etc.
If you want a set of conservative SNPs and you don't care about accessing all variation, then that's fine: you don't care about those problematic parts of the genome. If you have some phenotype you are investigating, or you want a complete/sensitive set of variants, then you may be concerned about missing SNPs or more complex variants. In that case you want to keep the unmapped reads so you can work with them (count them, assemble the unmapped reads, assemble ALL the reads, use paired ends to detect structural variants, etc.). |
#6 |
Member
Location: Belgium Join Date: Jun 2011
Posts: 45
|
You can also use the unmapped reads to look for potential contamination. Throw them into an assembler and BLAST the bigger contigs you get out. If you see decent-sized contigs for a viral or bacterial species, you may want to add a contamination filter step to the beginning of the mapping pipeline and see how that changes your results.
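Before the reads can go into an assembler they have to be pulled back out of the alignment file. Here is a minimal sketch of that step on plain SAM text, assuming every record carries SEQ and QUAL; the function name and toy records are hypothetical, and mate suffixes such as /1 and /2 are not handled.

```python
import io

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped
REVERSE = 0x10  # SAM FLAG bit: SEQ is stored reverse-complemented
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def unmapped_to_fastq(sam_lines):
    """Yield one FASTQ record (as a string) per unmapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if not (flag & UNMAPPED):
            continue
        if flag & REVERSE:
            # Restore the orientation the read had in the original FASTQ.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"


# Toy, made-up SAM records for illustration only.
toy_sam = io.StringIO(
    "@SQ\tSN:chr1\tLN:1000\n"
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIHG\n"
    "r3\t20\t*\t0\t0\t*\t*\t0\t0\tAACG\tABCD\n"
)
records = list(unmapped_to_fastq(toy_sam))
print("".join(records), end="")
```

For real BAM files, `samtools fastq -f 4 in.bam > unmapped.fastq` (placeholder file names) does this extraction directly; the resulting FASTQ is what you would hand to the assembler.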
|