SEQanswers

Old 04-05-2016, 05:44 AM   #1
EcoRInya
Member
 
Location: UK

Join Date: Apr 2016
Posts: 10
Default downsampling read pairs from a bam file

Hi guys,

I am trying to downsample a BAM file containing paired-end reads. Initially I used samtools view -s, providing the fraction of reads I'd like to keep. The caveat is that the output file contains not only intact read pairs but also unpaired single reads (orphans). This is a problem because I want to use the downsampled BAM file with packages that take BED files as input (such as diffReps), and when a BAM file contains a mix of paired and unpaired reads there is no way to convert it to BED 100% correctly.
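For reference, this is roughly the command I have been using (the seed, fraction and file names here are just illustrative):
Code:
# keep ~10% of reads; the integer part of -s seeds the random number generator
samtools view -b -s 42.10 input.bam > downsampled.bam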

Do you have any suggestions on how I could downsample a BAM file while keeping read pairs intact? I have really struggled to find a ready-made solution for this by googling.

Many thanks in advance!
Old 04-05-2016, 07:03 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,602
Default

Take a look at @Brian's post in this thread.

Edit: Looking at @Brian's post again, it seems the pairing information would not be kept. If you have the original fastq files available, you should be able to retrieve the other read of each pair by using repair.sh (from BBMap).
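The usual repair.sh invocation looks roughly like this (one possible way to run it; the file names are illustrative):
Code:
# re-pairs reads by name; reads left without a mate are written to outs=
repair.sh in1=sub_R1.fastq in2=sub_R2.fastq out1=fixed_R1.fastq out2=fixed_R2.fastq outs=singletons.fastq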

Last edited by GenoMax; 04-05-2016 at 07:05 AM.
Old 04-05-2016, 07:47 AM   #3
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 197
Default

seqtk does this and will keep read pairs intact if you pass the same random seed:
https://github.com/lh3/seqtk

Code:
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
Old 04-05-2016, 07:51 AM   #4
EcoRInya
Member
 
Location: UK

Join Date: Apr 2016
Posts: 10
Default

GenoMax, thank you for your reply!
Do you know what happens to the total read count after this repair? I don't quite understand that part. If I simply complete all the read pairs, it will obviously change the total number of reads I end up with and skew the downsampling.
Old 04-05-2016, 07:59 AM   #5
EcoRInya
Member
 
Location: UK

Join Date: Apr 2016
Posts: 10
Default

thank you, fanli!
I am not sure that downsampling the fastq files is a good idea. I use downsampling for normalisation, and if I downsample the initial fastq files I don't know how many reads will come back after alignment, which makes it unsuitable for normalisation. However, I could convert the BAM with uniquely aligned reads (after PCR duplicate removal) back to fastq and use seqtk. I am not sure that is the optimal solution, but I will think about it.
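One possible sketch of that route, assuming a samtools version with the fastq subcommand and seqtk are available (file names and read counts are illustrative):
Code:
# name-sort so mates sit next to each other, then write R1/R2 fastq files
samtools sort -n -o namesorted.bam dedup_unique.bam
samtools fastq -1 R1.fastq -2 R2.fastq -0 /dev/null -s /dev/null namesorted.bam
# subsample both files with the same seed so the pairs stay in sync
seqtk sample -s100 R1.fastq 1000000 > sub_R1.fastq
seqtk sample -s100 R2.fastq 1000000 > sub_R2.fastq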
Old 04-05-2016, 08:02 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,602
Default

"Repair.sh" (re-pair : a trick name) should only recover corresponding reads to the ones that are present in the downsampled file.

I assume you are doing this from the BAM because you only want to sample reads that aligned (your BAM does not have unmapped reads?). If you don't care about the alignments then you could downsample the original fastq files by using reformat.sh or the seqtk method @fanli posted above.
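For instance, a reformat.sh command could look roughly like this (reformat.sh is part of BBMap; the fraction, seed and file names are illustrative and the exact parameter names may vary between BBMap versions):
Code:
# sample ~10% of read pairs, keeping each pair together
reformat.sh in1=R1.fastq in2=R2.fastq out1=sub_R1.fastq out2=sub_R2.fastq samplerate=0.1 sampleseed=42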
Old 04-05-2016, 08:28 AM   #7
EcoRInya
Member
 
Location: UK

Join Date: Apr 2016
Posts: 10
Default

>"Repair.sh" (re-pair : a trick name) should only recover corresponding reads to the ones that are present in the downsampled file.

But if I am doing this for several files, the proportion of incomplete pairs will be random and different for each file. That means that if I add back the mates of the reads in the downsampled file, I will potentially end up with different read counts and skew the downsampling (and the subsequent normalisation).

>I assume you are doing this from the BAM because you only want to sample reads that aligned (your BAM does not have unmapped reads?).

Exactly!

Last edited by EcoRInya; 04-05-2016 at 08:30 AM.
Old 04-05-2016, 10:17 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,602
Default

Can you clarify what exactly you are trying to achieve? Are these separate files/samples, or multiple files from the same sample? Your BAM files don't have unmapped reads?
Old 04-05-2016, 11:14 AM   #9
EcoRInya
Member
 
Location: UK

Join Date: Apr 2016
Posts: 10
Default

Sure! I am downsampling BAM files that contain uniquely aligned reads with PCR duplicates removed. In total I have 6 BAM files (2 conditions, 3 replicates each). For each file I have a normalisation coefficient which I want to use as a downsampling factor, bringing each BAM file to a specific read count. I will be using the resulting downsampled BAM files for different kinds of comparative analysis between the groups, for instance with the diffReps package. It takes BED files as input, and for paired-end reads these BED files should contain the position of the centre of each fragment. To create such a file, all the reads in the BAM file need to be paired, which I have failed to achieve using standard samtools view -s.
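One approach I am considering is to sample read names rather than individual alignments, so that one selected name means one intact pair. A rough sketch, assuming the BAM only contains reads whose mate is also present (file names and the 1,000,000-pair target are illustrative):
Code:
# pick a random set of read names (one name = one pair)
samtools view dedup_unique.bam | cut -f1 | sort -u | shuf -n 1000000 > keep_names.txt
# keep the header plus every alignment whose name was picked, then recompress
samtools view -h dedup_unique.bam \
  | awk 'NR==FNR {keep[$1]=1; next} /^@/ || ($1 in keep)' keep_names.txt - \
  | samtools view -b -o downsampled.bam -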

Tags
bam, diffreps, downsample, paired-end reads
