  • Problem when extracting reads from BAM to FASTQ format

    Hello,

    Overall, I need to extract paired and mate reads (Illumina) in FASTQ format from a specific region of a BAM file (generated with Bowtie2). For this purpose I have used SAMtools, bam2fastq and bam2fastx; however, I'm not obtaining all the reads that I expected.

    This is what I am doing:

    I used the following SAMtools commands to generate the BAM files for my mate and paired reads:

    Code:
    $ samtools view -u -F4 alignment.bam 'myregion' > mapped_reads.bam 
    $ samtools view -u -F8 mapped_reads.bam > paired_reads.bam
    $ samtools view -u -f8 mapped_reads.bam > mate_reads.bam
    The read counts I expect for these BAM files (obtained with samtools view -c) are: 2845 mapped reads, 2695 paired reads and 150 mate reads.
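The three filters above test individual bits of the SAM FLAG field: 0x4 means the read itself is unmapped and 0x8 means its mate is unmapped. The sketch below decodes just those two bits for a given FLAG value; the function name `describe_flag` is made up for illustration and is not part of SAMtools.

```shell
# Sketch: decode the two FLAG bits used by -F4, -F8 and -f8.
# 0x4 = read unmapped, 0x8 = mate unmapped (per the SAM specification).
# describe_flag is a hypothetical helper, not a SAMtools command.
describe_flag() {
    flag=$1
    if [ $(( flag & 4 )) -ne 0 ]; then read_state="unmapped"; else read_state="mapped"; fi
    if [ $(( flag & 8 )) -ne 0 ]; then mate_state="unmapped"; else mate_state="mapped"; fi
    echo "read=${read_state} mate=${mate_state}"
}

describe_flag 99    # passes -F4 and -F8 (read and mate both mapped)
describe_flag 73    # passes -F4 and -f8 (read mapped, mate unmapped)
```

Note that bit 0x8 only says whether the mate was mapped at all. It says nothing about where the mate was mapped, so a read can pass -F8 even when its mate aligned outside 'myregion'.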

    The next step is to extract each set of reads into FASTQ format. I used the (TopHat) bam2fastx tool to extract the mate reads:

    Code:
    $ bam2fastx -q -A -o mate_reads.fastq mate_reads.bam
    For the paired reads I used the bam2fastq tool, which generates two FASTQ files, but the read counts don't match the numbers described above. This is the command I used and its output:

    Code:
    $ bam2fastq -o paired_reads#.fastq paired_reads.bam
    
    This looks like paired data from lane 223.
    Output will be in paired_reads_1.fastq and paired_reads_2.fastq
    2695 sequences in the BAM file
    2695 sequences exported
    WARNING: 15 reads could not be matched to a mate and were not exported
    The warning message reports that there are 15 reads that don't have a mate.

    I find no explanation for this result. Should these 15 reads be in the mate_reads.bam file? Is there a problem with the samtools flags I used to extract the reads?
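One way to see which reads bam2fastq cannot pair is to count how often each read name occurs in the file: a read whose mate aligned outside the extracted region appears only once. The sketch below does this on SAM text from standard input. It assumes each read has a single alignment record (no secondary or supplementary alignments), and the name `orphan_names` is hypothetical.

```shell
# Sketch: list read names that occur exactly once in SAM text on stdin.
# Header lines (starting with '@') are skipped; column 1 is the read name.
# Assumes one alignment record per read (an assumption, not checked here).
orphan_names() {
    awk -F'\t' '!/^@/ { count[$1]++ }
                END { for (name in count) if (count[name] == 1) print name }' | sort
}

# Hypothetical usage with the file from the post (requires samtools):
#   samtools view paired_reads.bam | orphan_names > orphans.txt
```

If the names listed this way correspond to the reads in the warning, a likely explanation is that their mates are mapped, so they pass -F8, but lie outside the requested region and were never written to mapped_reads.bam.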

    Any suggestions would be appreciated. Thanks!

    Regards

    Héctor Spitia

  • #2
    Hi,
    I encountered a similar problem: I want to extract a certain region from my final .bam file and recover the reads from this region, so that I can rerun the alignment of the read pairs mapping to this region with different alignment options/parameters.

    I did the following:

    To extract the desired region in bam format:
    Code:
    samtools view -h -b -o LC14.final.90Kregion.bam LC14.final.bam chr2:905,000-906,000
    To recover the two .fastq files with paired reads:
    Code:
    bam2fastq -o paired_reads#.fastq LC14.final.90Kregion.bam
    Output from bam2fastq was:
    Code:
    This looks like paired data from lane 193.
    Output will be in paired_reads_1.fastq and paired_reads_2.fastq
    1735 sequences in the BAM file
    1735 sequences exported
    WARNING: 491 reads could not be matched to a mate and were not exported
    So my two output .fastq files contain 622 reads each (1244 in total) and 491 reads are simply gone. Could this be because I am extracting this region and am therefore missing the mates of reads that map at the edge of the region, since those mates lie outside it? How can I obtain these missing reads as well?

    Cheers,
    Stroehli
    MSc Bioinformatics student at the Free University Berlin, Germany

    • #3
      @Stroehli

      Maybe your aligner discarded reads from the .bam files.
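Another possibility, the one raised in #2, is that the missing mates simply map outside the extracted region, since a region query returns only alignments that overlap it. A rough sketch of one workaround: collect the read names found in the region, then pull every record with those names from the full BAM. The helper name `filter_sam_by_names` and the output file names are hypothetical, and the approach scans the whole file, so it can be slow.

```shell
# Sketch: keep header lines plus every record whose read name (column 1)
# is listed in the file given as $1; SAM text is read from stdin.
# filter_sam_by_names is a hypothetical helper; the name list must not be empty.
filter_sam_by_names() {
    awk -F'\t' 'NR == FNR { keep[$1] = 1; next } /^@/ || ($1 in keep)' "$1" -
}

# Hypothetical usage with the files from #2 (requires a reasonably recent samtools):
#   samtools view LC14.final.90Kregion.bam | cut -f1 | sort -u > names.txt
#   samtools view -h LC14.final.bam | filter_sam_by_names names.txt \
#       | samtools view -b -o region_pairs.bam -
```

Running bam2fastq on the resulting file should then find both mates of each pair, provided the mate was aligned at all.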
