SEQanswers

03-11-2019, 12:53 AM   #1
deKoch13
Member
 
Location: HD

Join Date: Mar 2019
Posts: 12
Working with BAM files

Hi everybody!

This is my first thread in this forum.
I recently started an internship in a bioinformatics research group. Unfortunately, I have only a little experience with programming and bioinformatics data handling.
I have basic programming skills in Bash, Python and R, but that's it.

My task is to inspect three BAM files (more than 1 million reads). The three BAM files were generated using different methods. I want to find out which reads are present in all of the BAM files, which reads are only in BAM file 1, which reads are missing from BAM file 3, and so on.

Can you give me some advice on how to approach this task? Do you have experience with handling BAM files?

Many greetings!
03-11-2019, 01:19 AM   #2
deKoch13
more details

Maybe I should add some information:
We took one sample and generated the BAM files using three different pipelines.
At the moment, we are only interested in the read names (the first column of the BAM files) and want to find out which reads are present in all BAM files, which are present only in file 1, only in file 2, only in file 3, and so on.
03-11-2019, 06:06 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,909

If your aim is just to find which reads are present in all three files, you could simply extract the names (field 1, as you already noted), run them through sort | uniq in bash, and do a "comm" comparison of the three results.
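
For illustration, the same comparison can also be sketched in Python with sets instead of sort | uniq and comm (comm compares only two files at a time, so three files need several passes). This is a minimal sketch rather than a tested pipeline: the function names are hypothetical, and it assumes each input is a plain-text file with one read name per line.

```python
from itertools import combinations


def load_names(path):
    """Read one read name per line into a set (duplicates collapse automatically)."""
    with open(path) as handle:
        return {line.rstrip("\n") for line in handle if line.strip()}


def partition_names(named_sets):
    """Split read names by the exact combination of files that contain them.

    named_sets maps a label (e.g. "file1") to a set of read names.
    Returns a dict mapping a sorted tuple of labels to the names found in
    exactly those files and in no others.
    """
    labels = sorted(named_sets)
    regions = {}
    for size in range(1, len(labels) + 1):
        for combo in combinations(labels, size):
            # Names shared by every file in this combination ...
            inside = set.intersection(*(named_sets[label] for label in combo))
            # ... minus names that also occur in any file outside it.
            others = [named_sets[label] for label in labels if label not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions
```

With three inputs labelled "file1", "file2" and "file3" (hypothetical labels), `regions[("file1", "file2", "file3")]` would hold the reads present in all three files and `regions[("file1",)]` the reads found only in file 1.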
03-11-2019, 06:22 AM   #4
deKoch13
progress

Thank you for the answer!

I already extracted the read names from all files separately using:

samtools sort -n bam_filename | samtools view - | awk -F "\t" '{print $1}' > output_filename
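
The awk step keeps only the first tab-separated field (QNAME) of each alignment line. A rough Python equivalent for SAM text, such as the output of `samtools view`, might look like the sketch below. The function name is hypothetical. Because the names are collected into a set, a read that appears on several lines (for example both mates of a pair) is counted once, so the order of the input does not matter here.

```python
def read_names(sam_lines):
    """Collect the unique read names (QNAME, field 1) from SAM-format text lines."""
    names = set()
    for line in sam_lines:
        # Header lines start with "@". `samtools view` omits them by default,
        # but skipping them keeps this safe if a header is included.
        if not line.strip() or line.startswith("@"):
            continue
        # Split once on the first tab and keep only the leading field.
        names.add(line.split("\t", 1)[0].rstrip("\n"))
    return names
```

In a script fed by a pipe, this could be called as `read_names(sys.stdin)` (after `import sys`) and the result written out one name per line.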

Now my supervisor has suggested using Python to do the rest of the task...
Or can you recommend another approach?

Greetings
03-11-2019, 06:28 AM   #5
deKoch13

I looked up the "comm" command. It sounds promising, but I am not sure whether it works for data files this large, with more than 1 million reads. Do you have an idea for a smart Python-based solution?

Nevertheless, I will also try it with comm.

Greetings
03-11-2019, 06:37 AM   #6
GenoMax

If this is an assignment, then use what you have to, but comm should work (as long as you have enough RAM available), since you are working with only the read names (and if you are not, then you should be).
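
On the memory question: a million or so read names should fit in a Python set on most machines, but if memory is tight, sorted name lists can also be compared as streams, which is roughly how comm processes its sorted input. The sketch below is an illustration under stated assumptions: each input must already be sorted in plain codepoint order (for example with `LC_ALL=C sort -u`), which may differ from the order that `samtools sort -n` produces. The function names are hypothetical.

```python
import heapq
from itertools import groupby


def tag(lines, label):
    """Yield (name, label) pairs for one sorted input, skipping blank lines."""
    for line in lines:
        name = line.rstrip("\n")
        if name:
            yield name, label


def stream_membership(sorted_inputs):
    """Yield (name, labels), where labels is the tuple of inputs containing name.

    sorted_inputs maps a label to an iterable of lines that is already sorted.
    The inputs are merged lazily, so whole files are never loaded into memory.
    """
    tagged = [tag(lines, label) for label, lines in sorted_inputs.items()]
    merged = heapq.merge(*tagged)  # lazy merge of the sorted (name, label) streams
    for name, group in groupby(merged, key=lambda pair: pair[0]):
        # The set guards against a name repeated within one input.
        yield name, tuple(sorted({label for _, label in group}))
```

Open file handles can be passed in directly as the iterables, and the resulting label tuples can be tallied (for example with `collections.Counter`) to get the size of each overlap.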

All times are GMT -8.