SEQanswers

Old 07-24-2014, 03:48 AM   #1
jullee
Member
 
Location: Switzerland

Join Date: Apr 2014
Posts: 19
WHEN to merge data from single library run on multiple lanes?

Hi

Even after reading several posts, I'm still confused about merging data from multiple lanes...

I've got a single library run on four lanes (each library run with several other libraries in a given lane but the paired-end fastq files are already separated by library). My goal is to generate a single consensus sequence of the mitochondria for each library.

My overall pipeline for trimmed reads consists of 1) alignment to the reference with BWA-MEM, 2) converting SAM to BAM, 3) sorting with Picard tools, 4) removing duplicates with Picard tools, 5) removing ambiguous reads with samtools and 6) splitting the BAM file into separate nuclear and mitochondrial BAMs (using samtools).

I'm specifically wondering if there are any problems with merging the resulting mitochondrial bams from running this pipeline separately for each lane? Or should I be merging the data from the four lanes at an earlier step?

Also, I'm really confused about the concept of @RG and @SQ... Is an RG simply the BAM version of SQ? I was thinking of using samtools merge with the -r parameter specified... does this replace the original RGs somehow? How does the RG affect things downstream (for me, I'm eventually generating an mpileup file and then a consensus sequence)?

Thanks in advance for the help!
Old 07-24-2014, 05:44 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You can pretty much always merge data from a single library run on multiple lanes, since they should have the same biases. If you really wanted, you could align them separately and give each lane its own read group (a distinct ID and PU, the platform unit; PL names the sequencing platform and would be the same for every lane), though practically speaking I doubt that's normally all that useful (any weird bias that you'll want to filter on should be library-specific rather than lane-specific, though I'm sure someone has a counterexample).

Unless you have very few reads per sample per lane, it makes little practical difference when you merge things. The only benefit to merging before alignment is that the aligner might get a better estimate of the template length distribution, though unless you multiplex very heavily I doubt you'd see much of a difference.

BTW, both SAM and BAM files can have @SQ and @RG header lines (BAM is just the compressed binary form of SAM). The @SQ lines describe reference sequences (names and lengths). These will be the same in all of your samples. The @RG lines describe read groups, i.e. sample metainformation that you can then keep along with your alignments for post-processing. The nice thing is that you can merge files with different read groups and that information is then preserved (the alignments themselves carry RG auxiliary tags).

Yes, the -r flag will replace read groups with ones inferred from the input file names, which is likely less useful for you.
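
To make the @SQ versus @RG distinction concrete, here is a minimal Python sketch that parses a toy SAM header. The header text is invented for illustration (the names chr1, chrM, lane1, lane2, sampleA and libA are assumptions, not from the thread), and real headers are better read with a dedicated library.

```python
# Toy SAM header: @SQ lines describe reference sequences,
# @RG lines describe read groups. All names are hypothetical.
HEADER = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chrM\tLN:16569",
    "@RG\tID:lane1\tSM:sampleA\tLB:libA\tPU:flowcell1.1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:sampleA\tLB:libA\tPU:flowcell1.2\tPL:ILLUMINA",
])


def parse_header(text):
    """Group header records by type, turning TAG:value fields into dicts."""
    records = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        kind, *fields = line.split("\t")
        entry = dict(field.split(":", 1) for field in fields)
        records.setdefault(kind[1:], []).append(entry)
    return records


header = parse_header(HEADER)

# Reference sequences come from @SQ; they depend only on the reference used.
references = [sq["SN"] for sq in header["SQ"]]

# Read groups come from @RG; here two lanes share one library (LB).
read_groups = {rg["ID"]: rg["LB"] for rg in header["RG"]}

print(references)
print(read_groups)
```

Because both lanes carry the same LB value, a duplicate-marking tool that respects read groups can still treat them as one library after merging.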
Old 07-24-2014, 11:00 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by jullee View Post
My overall pipeline for trimmed reads consists of 1) alignment to the reference with BWA-MEM, 2) converting SAM to BAM, 3) sorting with Picard tools, 4) removing duplicates with Picard tools, 5) removing ambiguous reads with samtools and 6) splitting the BAM file into separate nuclear and mitochondrial BAMs (using samtools).

I'm specifically wondering if there are any problems with merging the resulting mitochondrial bams from running this pipeline separately for each lane? Or should I be merging the data from the four lanes at an earlier step?
You would definitely want to merge the aligned reads (BAM) before doing duplicate removal. Duplicate removal tools need to know about all reads in your data set to work properly. Also, if you are only interested in the mitochondrial reads it pays to reduce your data set to mitochondria only as soon as is practical, to avoid unnecessary computation. Here is what I would do:

- Align each lane to reference.
- Convert output to BAM, sort and index these BAM files.
- Separate nuclear and mito alignments, keeping only uniquely mapped reads in each case (as you described "removing ambiguous reads").
- Merge your mito BAM files.
- Deduplicate your merged, mito BAM file.
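
The steps above can be sketched as a small Python script that only builds and prints the shell commands (it does not run them). Treat it as a rough outline under stated assumptions: the file names, the sample and library labels, the mitochondrial contig name "chrM", the MAPQ cutoff of 20 as a stand-in for "uniquely mapped", and the location of picard.jar are all hypothetical and need adapting to your data and tool versions.

```python
import shlex

q = shlex.quote  # quote arguments so the printed commands are shell-safe


def lane_commands(lane, ref, fq1, fq2, sample, library, mito="chrM", min_mapq=20):
    """Commands for one lane: align + sort, index, extract mito alignments."""
    # bwa mem expects literal "\t" separators in the -R read group string.
    read_group = (
        f"@RG\\tID:{lane}\\tSM:{sample}\\tLB:{library}\\tPU:{lane}\\tPL:ILLUMINA"
    )
    sorted_bam = f"{lane}.sorted.bam"
    mito_bam = f"{lane}.mito.bam"
    commands = [
        # Align and coordinate-sort in one pipe.
        f"bwa mem -R {q(read_group)} {q(ref)} {q(fq1)} {q(fq2)}"
        f" | samtools sort -o {q(sorted_bam)} -",
        # Region queries need an index.
        f"samtools index {q(sorted_bam)}",
        # Assumption: a MAPQ filter approximates "uniquely mapped".
        f"samtools view -b -q {min_mapq} -o {q(mito_bam)} {q(sorted_bam)} {q(mito)}",
    ]
    return commands, mito_bam


def merge_and_dedup(mito_bams, prefix):
    """Commands to merge per-lane mito BAMs, then deduplicate once."""
    merged = f"{prefix}.mito.merged.bam"
    dedup = f"{prefix}.mito.dedup.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    return [
        "samtools merge " + " ".join(q(p) for p in [merged, *mito_bams]),
        f"java -jar picard.jar MarkDuplicates I={q(merged)} O={q(dedup)}"
        f" M={q(metrics)} REMOVE_DUPLICATES=true",
        f"samtools index {q(dedup)}",
    ]


lanes = ["lane1", "lane2", "lane3", "lane4"]
commands, mito_bams = [], []
for lane in lanes:
    cmds, mito_bam = lane_commands(
        lane, "ref.fa", f"{lane}_R1.fastq.gz", f"{lane}_R2.fastq.gz",
        sample="sampleA", library="libA",
    )
    commands.extend(cmds)
    mito_bams.append(mito_bam)
commands.extend(merge_and_dedup(mito_bams, "sampleA"))

print("\n".join(commands))
```

Giving every lane the same LB value but its own ID keeps the lane information in the merged file while still letting the deduplication step see the four lanes as one library.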
Old 07-24-2014, 01:22 PM   #4
jullee
Member
 
Location: Switzerland

Join Date: Apr 2014
Posts: 19

Thanks for the replies.

kmcarr does it hurt to do the removal of duplicates twice? I'm thinking the first round takes care of PCR artifacts (e.g. library effects), the second sequencing artifacts (e.g. lane effects)?

Also do I need to index again after the bam files are merged?

Thanks!
Old 07-24-2014, 01:36 PM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by jullee View Post
Thanks for the replies.

kmcarr does it hurt to do the removal of duplicates twice? I'm thinking the first round takes care of PCR artifacts (e.g. library effects), the second sequencing artifacts (e.g. lane effects)?
No, it really doesn't make sense to run de-duplication twice. If by "lane effects" you mean optical duplicates, it is true that you only need the reads from each lane to detect these, but they are pretty rare. De-duplication is really all about removing PCR duplicates, and PCR duplicates can only be identified if you have all of the reads from a given library together in a single, coordinate-sorted BAM file.
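
A toy Python illustration of that last point, under simplifying assumptions: read pairs are reduced to made-up (chromosome, position, strand, mate position) keys, and the first pair seen for each key is kept. Real tools such as Picard MarkDuplicates use more elaborate criteria for choosing which copy to keep, so this only shows why lane-by-lane de-duplication misses duplicates.

```python
from collections import namedtuple

# Hypothetical read pairs; a duplicate shares chrom, pos, strand and mate_pos.
Pair = namedtuple("Pair", "name lane chrom pos strand mate_pos")

reads = [
    Pair("r1", "lane1", "chrM", 100, "+", 350),
    Pair("r2", "lane2", "chrM", 100, "+", 350),  # same fragment as r1, other lane
    Pair("r3", "lane1", "chrM", 100, "+", 420),  # different mate position: unique
    Pair("r4", "lane3", "chrM", 500, "-", 260),
    Pair("r5", "lane4", "chrM", 500, "-", 260),  # same fragment as r4, other lane
]


def dedup(pairs):
    """Keep the first pair seen for each alignment signature."""
    seen = set()
    kept = []
    for p in pairs:
        key = (p.chrom, p.pos, p.strand, p.mate_pos)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


# De-duplicating each lane on its own, then pooling the survivors.
per_lane = []
for lane in sorted({p.lane for p in reads}):
    per_lane.extend(dedup([p for p in reads if p.lane == lane]))

# Pooling all lanes first, then de-duplicating once.
merged_first = dedup(reads)

print(len(per_lane), len(merged_first))
```

Per-lane de-duplication keeps all five pairs because each duplicate sits in a different lane, while merging first leaves the three distinct fragments.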

Quote:
Originally Posted by jullee View Post
Also do I need to index again after the bam files are merged?

Thanks!
Yes(*), because your merged BAM is a completely new file and needs a new index.

(*) I say yes assuming that you will be performing some task further downstream which relies on having a BAM index present, which is just about any task.
Old 07-24-2014, 10:06 PM   #6
jullee
Member
 
Location: Switzerland

Join Date: Apr 2014
Posts: 19

Thanks kmcarr! Your response was very helpful!