Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bam file with unmapped reads from another genome than reference

    Hello everyone,

    I started with bioinformatics 2 weeks ago, so maybe my question is a bit too easy for you, but I don't find any answer for it in other threads. I would appreciate a lot if you could help me out.

    This is the case:

    I mapped my paired end reads to my viral reference genome (Nextgenmap) and selected those that were mapped and paired (both pairs mapped only). Now I want to filter out, those reads that map also to human DNA (my contaminant). For that, I took the viral_mapped_paired_end_reads.bam, converted it to fastq (R1 and R2) and mapped it to hg19. In this bam file, I extracted the unmapped paired end. So, I assume that here I have the paired end reads that only mapped to virus before , but not to human now.
    This bam file has the reads that I want, but unmapped to the human reference genome.

    Now, how do I continue? I want to do coverage analysis for the viral genome for example. Can I use the unmapped bam file? or do I need to use the viral_mapped_paired_end_reads.bam and filter out the reads that mapped to human? If so, Is it done by extracting reads IDs? Or the IDs change depending on the reference genome?

    Thanks a lot

  • #2
    I suggest you try BBSplit as in this thread. That will give you one fastq file for human reads, and one for viral reads. Then, map the viral output fastq file to the virus reference with a normal aligner (such as BBMap or Nextgenmap which you used previously). BBMap can output coverage data directly (well, BBSplit can too, actually...) using the covstats or basecov flags, if you want.

    In answer to your last question, read IDs do not change with respect to reference genomes. However, unmapped bam files are not very useful and you can't use them for coverage analysis.

    Comment


    • #3
      Dear Brian, Thanks a lot. I did as you suggested.
      I also used BBMap (pileup.sh) for output coverage. I get the following results
      Avg fold 53035,994
      Length 7904
      Ref_GC 0.0000
      Covered % 100
      Covered bases 7904
      Plus reads 3190731
      Minus reads 3190731
      Read GC 0.395
      Median_fold 22882
      Std 24900.22

      I assume I covered the whole genome. Do you happen to know whey the Ref_GC equals to 0.0000? I calculated GC content % of my reference genome and it is 36.51%. Thanks!

      Comment


      • #4
        That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.

        Comment


        • #5
          Ref_GC

          Originally posted by Brian Bushnell View Post
          That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.
          I wrote the following, but still didn't get it. Maybe something not right?

          /home/sara/bbmap/pileup.sh in=1409_cat_sorted.bam "ref=/home/sara/HPV16Reference.fasta" out=1409.ref_stats1.txt hist=1409.ref1_histogram.txtll

          Thanks

          Comment


          • #6
            That's strange - I tested it and it works fine for me. No reference:

            Code:
            pileup.sh in=mapped.sam.gz stats=covstats.txt
            
            cat covstats.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.0000  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.0000  8.6332  11700973        43491   43228   0.4160  0       0.33
            With reference:
            Code:
            pileup.sh in=mapped.sam.gz stats=covstats2.txt ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
            
            cat covstats2.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.4183  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.4167  8.6332  11700973        43491   43228   0.4160  0       0.33
            I tried various things including adding "hist=" and using "out=" instead of "stats=", and putting quotes around the reference flag, but was unable to replicate this. Are you sure you are looking at the correct output file, rather than an old one?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM
            • seqadmin
              The Impact of AI in Genomic Medicine
              by seqadmin



              Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
              02-26-2024, 02:07 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 03-14-2024, 06:13 AM
            0 responses
            32 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-08-2024, 08:03 AM
            0 responses
            72 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-07-2024, 08:13 AM
            0 responses
            80 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-06-2024, 09:51 AM
            0 responses
            68 views
            0 likes
            Last Post seqadmin  
            Working...
            X