Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Vca80553
    Junior Member
    • Sep 2017
    • 5

    Bam file with unmapped reads from another genome than reference

    Hello everyone,

    I started with bioinformatics 2 weeks ago, so maybe my question is a bit too easy for you, but I don't find any answer for it in other threads. I would appreciate a lot if you could help me out.

    This is the case:

    I mapped my paired end reads to my viral reference genome (Nextgenmap) and selected those that were mapped and paired (both pairs mapped only). Now I want to filter out, those reads that map also to human DNA (my contaminant). For that, I took the viral_mapped_paired_end_reads.bam, converted it to fastq (R1 and R2) and mapped it to hg19. In this bam file, I extracted the unmapped paired end. So, I assume that here I have the paired end reads that only mapped to virus before , but not to human now.
    This bam file has the reads that I want, but unmapped to the human reference genome.

    Now, how do I continue? I want to do coverage analysis for the viral genome for example. Can I use the unmapped bam file? or do I need to use the viral_mapped_paired_end_reads.bam and filter out the reads that mapped to human? If so, Is it done by extracting reads IDs? Or the IDs change depending on the reference genome?

    Thanks a lot
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I suggest you try BBSplit as in this thread. That will give you one fastq file for human reads, and one for viral reads. Then, map the viral output fastq file to the virus reference with a normal aligner (such as BBMap or Nextgenmap which you used previously). BBMap can output coverage data directly (well, BBSplit can too, actually...) using the covstats or basecov flags, if you want.

    In answer to your last question, read IDs do not change with respect to reference genomes. However, unmapped bam files are not very useful and you can't use them for coverage analysis.

    Comment

    • Vca80553
      Junior Member
      • Sep 2017
      • 5

      #3
      Dear Brian, Thanks a lot. I did as you suggested.
      I also used BBMap (pileup.sh) for output coverage. I get the following results
      Avg fold 53035,994
      Length 7904
      Ref_GC 0.0000
      Covered % 100
      Covered bases 7904
      Plus reads 3190731
      Minus reads 3190731
      Read GC 0.395
      Median_fold 22882
      Std 24900.22

      I assume I covered the whole genome. Do you happen to know whey the Ref_GC equals to 0.0000? I calculated GC content % of my reference genome and it is 36.51%. Thanks!

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.

        Comment

        • Vca80553
          Junior Member
          • Sep 2017
          • 5

          #5
          Ref_GC

          Originally posted by Brian Bushnell View Post
          That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.
          I wrote the following, but still didn't get it. Maybe something not right?

          /home/sara/bbmap/pileup.sh in=1409_cat_sorted.bam "ref=/home/sara/HPV16Reference.fasta" out=1409.ref_stats1.txt hist=1409.ref1_histogram.txtll

          Thanks

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            That's strange - I tested it and it works fine for me. No reference:

            Code:
            pileup.sh in=mapped.sam.gz stats=covstats.txt
            
            cat covstats.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.0000  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.0000  8.6332  11700973        43491   43228   0.4160  0       0.33
            With reference:
            Code:
            pileup.sh in=mapped.sam.gz stats=covstats2.txt ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
            
            cat covstats2.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.4183  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.4167  8.6332  11700973        43491   43228   0.4160  0       0.33
            I tried various things including adding "hist=" and using "out=" instead of "stats=", and putting quotes around the reference flag, but was unable to replicate this. Are you sure you are looking at the correct output file, rather than an old one?

            Comment

            Latest Articles

            Collapse

            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM
            • SEQadmin2
              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
              by SEQadmin2


              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
              ...
              06-02-2026, 10:05 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            24 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-09-2026, 11:58 AM
            0 responses
            41 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-05-2026, 10:09 AM
            0 responses
            48 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-04-2026, 08:59 AM
            0 responses
            49 views
            0 reactions
            Last Post SEQadmin2  
            Working...