  • Tophat2: prepare unmapped.bam file for input into a tophat run on alternative genome

    I have some paired-end Illumina RNAseq data and have run tophat2 on it against the human genome. I would now like to run tophat2 again to align the unmapped reads against some alternative genomes to check for contamination/infection. To do this I need to convert the unmapped.bam into fastq files.

    To do this I do the following:
    1) Remove any reads without a matching pair
    Code:
    samtools view -f1 -b unmapped.bam > unmapped_paired.bam
    2) Sort the reads according to name
    Code:
    # legacy samtools (0.1.x) takes an output prefix and appends .bam itself
    samtools sort -n unmapped_paired.bam unmapped_paired_sort
    3) Run tophat's bam2fastx to get fastq
    Code:
    bam2fastx -q -Q -A -P -o test unmapped_paired_sort.bam
    Unfortunately this reports an error:
    Code:
    Error: couldn't retrieve both reads for pair HISEQ2500-01:110:H7AGVADXX:1:1101:1336:2967. Perhaps the input file is not sorted by name?
    The problem seems to be that the unmapped.bam file does not have any mate information in the RNEXT column. In any case, three steps just to convert the data back to fastq seems over the top.

    Does anyone have any idea how to fix this problem, or provide a better way to do it?

    Thanks
    Last edited by danielsbrewer; 01-13-2014, 02:51 AM.
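One way to see why bam2fastx complains is to count how often each read name occurs in the file, since a pair can only be extracted if its name appears twice. The following is a minimal sketch and not part of the original post: the function name `paired_names` is made up for illustration, and it assumes SAM text on standard input (for example from `samtools view`).

```shell
# paired_names (hypothetical helper): read SAM records on stdin and print,
# in sorted order, the read names (column 1, QNAME) that occur exactly
# twice, i.e. reads for which both mates are present in the input.
# Header lines (starting with @) are skipped.
paired_names() {
    awk -F '\t' '!/^@/ { n[$1]++ } END { for (q in n) if (n[q] == 2) print q }' | sort
}
```

Assuming samtools is installed, `samtools view unmapped.bam | paired_names > both_mates.txt` would list the reads that still have both mates in the file. Comparing that list with the total number of read names gives a quick idea of how many orphaned mates there are.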

  • #2
    On further examination, it appears that the FLAGS in the unmapped.bam are inaccurate: even after filtering out the reads without the paired flag, there are still reads whose mate is missing. I assume this is because the other read of the pair has been mapped.
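To make the flag issue concrete: `samtools view -f1` only requires the 0x1 bit (the read was paired in sequencing), which says nothing about whether the mate is in the same file. The sketch below decodes the pairing-related bits of a FLAG value using plain shell arithmetic. The function name `explain_flag` is hypothetical; the bit meanings are those of the SAM format.

```shell
# explain_flag (hypothetical helper): print the pairing-related bits that
# are set in a SAM FLAG value given as a decimal number.
explain_flag() {
    flag=$1
    if [ $((flag & 1)) -ne 0 ];   then echo "0x1  read paired"; fi
    if [ $((flag & 4)) -ne 0 ];   then echo "0x4  read unmapped"; fi
    if [ $((flag & 8)) -ne 0 ];   then echo "0x8  mate unmapped"; fi
    if [ $((flag & 64)) -ne 0 ];  then echo "0x40 first in pair"; fi
    if [ $((flag & 128)) -ne 0 ]; then echo "0x80 second in pair"; fi
}

# Flag 69 (64 + 4 + 1): paired and unmapped, but the mate-unmapped bit is
# not set, so the mate was presumably aligned and is not in unmapped.bam.
explain_flag 69
```

A read with flag 69 passes the `-f1` filter even though its mate is elsewhere, which would be consistent with the bam2fastx error. Whether the flags written by tophat2 can be trusted at all is a separate question, as noted above.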


    • #3
      You might want to try the "--no-mixed" option for tophat2 next time.
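For reference, a sketch of what such a run might look like. `--no-mixed` and `-o` are real tophat2 options; the index name, read file names and output directory are placeholders, and the helper `tophat_no_mixed_cmd` is made up here. It only prints the command so it can be inspected before running.

```shell
# tophat_no_mixed_cmd (hypothetical helper): print a tophat2 command line
# that uses --no-mixed, so alignments are reported only when both mates
# of a pair can be aligned.
# Arguments: bowtie2 index prefix, mate 1 fastq, mate 2 fastq, output dir.
tophat_no_mixed_cmd() {
    printf 'tophat2 --no-mixed -o %s %s %s %s\n' "$4" "$1" "$2" "$3"
}

# Placeholder inputs; substitute your own files.
tophat_no_mixed_cmd hg19_index reads_1.fastq reads_2.fastq tophat_no_mixed
```

With this option a pair in which only one mate aligns is not reported as a single-end alignment, so both mates would be expected to end up together in unmapped.bam.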


      • #4
        Yes that would have done the trick. Still playing around with RNAseq data so I am definitely in the learning phase!

        The script in the following looks like it will help:
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


        Just giving it a go now.


          • #6
            bam2fastx libz error

            I too am trying to make a fastq file out of the unmapped reads so that I can run tophat on an alternative genome. I get a different error:

            samtools sort -n unmapped.bam unmapped_sort.bam
            bam2fastx -q -Q -A -o outfile unmapped_sort.bam.bam

            I get this error:
            bam2fastx: /lib64/libz.so.1: no version information available (required by bam2fastx)

            Anyone come across this error before?


            • #7
              One possibility is that you are running older versions of libz/libxml2. Are you able to get the bam2fastx to complete (that "error" is likely a warning) otherwise?


              • #8
                Warning can be ignored

                Originally posted by GenoMax
                One possibility is that you are running older versions of libz/libxml2. Are you able to get the bam2fastx to complete (that "error" is likely a warning) otherwise?
                Hm…sure enough, despite the warning, there is in fact a fastq file produced anyway.

                But when I run the program from the cluster's login node (shame on me, I know) I don't get the error, and I still get the fastq file. Could that be due to different versions of the program running on the login vs. compute nodes? Any idea?


                • #9
                  Originally posted by bpb9
                  But when I run the program from the cluster's login node (shame on me, I know) I don't get the error, and I still get the fastq file. Could that be due to different versions of the program running on the login vs. compute nodes? Any idea?
                  That is certainly a possibility. On large clusters sometimes a few stray nodes don't get updated properly/fully. If you know which node gave you the error let the admins know. They should be able to manually update that node.
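A quick way to compare nodes is to check which libz the dynamic linker resolves for the binary on each host. This is a minimal sketch assuming a Linux system with `ldd` available; the function name `libz_report` is made up for illustration.

```shell
# libz_report (hypothetical helper): print the host name and the libz
# shared library that the given binary resolves to on this host.
libz_report() {
    bin=$1
    echo "host: $(uname -n)"
    # ldd lists the shared libraries the dynamic linker would load.
    ldd "$bin" 2>/dev/null | grep -i 'libz' \
        || echo "no libz dependency reported for $bin"
}
```

Running `libz_report "$(command -v bam2fastx)"` on the login node and again inside a job on a compute node, then comparing the two outputs, should show whether the nodes resolve to different libz files.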


                  • #10
                    Just a note on this general topic, the script fix_tophat_unmapped_reads.py in https://github.com/cbrueffer/misc_bioinf/ fixes various issues in unmapped.bam files that prevent them from being used in downstream tools.


                    • #11
                      It might be a very late answer, but apparently tophat can even accept BAM files as input. I tried it by accident and it works perfectly, with no differences from an alignment run on a fastq file obtained by bam2fastq conversion.
                      If anyone can confirm that I'm not doing anything wrong, it would be nice.
