Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bam file with unmapped reads from another genome than reference

    Hello everyone,

    I started with bioinformatics 2 weeks ago, so maybe my question is a bit too easy for you, but I don't find any answer for it in other threads. I would appreciate a lot if you could help me out.

    This is the case:

    I mapped my paired end reads to my viral reference genome (Nextgenmap) and selected those that were mapped and paired (both pairs mapped only). Now I want to filter out, those reads that map also to human DNA (my contaminant). For that, I took the viral_mapped_paired_end_reads.bam, converted it to fastq (R1 and R2) and mapped it to hg19. In this bam file, I extracted the unmapped paired end. So, I assume that here I have the paired end reads that only mapped to virus before , but not to human now.
    This bam file has the reads that I want, but unmapped to the human reference genome.

    Now, how do I continue? I want to do coverage analysis for the viral genome for example. Can I use the unmapped bam file? or do I need to use the viral_mapped_paired_end_reads.bam and filter out the reads that mapped to human? If so, Is it done by extracting reads IDs? Or the IDs change depending on the reference genome?

    Thanks a lot

  • #2
    I suggest you try BBSplit as in this thread. That will give you one fastq file for human reads, and one for viral reads. Then, map the viral output fastq file to the virus reference with a normal aligner (such as BBMap or Nextgenmap which you used previously). BBMap can output coverage data directly (well, BBSplit can too, actually...) using the covstats or basecov flags, if you want.

    In answer to your last question, read IDs do not change with respect to reference genomes. However, unmapped bam files are not very useful and you can't use them for coverage analysis.

    Comment


    • #3
      Dear Brian, Thanks a lot. I did as you suggested.
      I also used BBMap (pileup.sh) for output coverage. I get the following results
      Avg fold 53035,994
      Length 7904
      Ref_GC 0.0000
      Covered % 100
      Covered bases 7904
      Plus reads 3190731
      Minus reads 3190731
      Read GC 0.395
      Median_fold 22882
      Std 24900.22

      I assume I covered the whole genome. Do you happen to know whey the Ref_GC equals to 0.0000? I calculated GC content % of my reference genome and it is 36.51%. Thanks!

      Comment


      • #4
        That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.

        Comment


        • #5
          Ref_GC

          Originally posted by Brian Bushnell View Post
          That's because you did not specify the reference file with "ref=" when you ran pileup.sh. It's not necessary; you only need it if you want the Ref_GC column to be correct.
          I wrote the following, but still didn't get it. Maybe something not right?

          /home/sara/bbmap/pileup.sh in=1409_cat_sorted.bam "ref=/home/sara/HPV16Reference.fasta" out=1409.ref_stats1.txt hist=1409.ref1_histogram.txtll

          Thanks

          Comment


          • #6
            That's strange - I tested it and it works fine for me. No reference:

            Code:
            pileup.sh in=mapped.sam.gz stats=covstats.txt
            
            cat covstats.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.0000  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.0000  8.6332  11700973        43491   43228   0.4160  0       0.33
            With reference:
            Code:
            pileup.sh in=mapped.sam.gz stats=covstats2.txt ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
            
            cat covstats2.txt
            
            #ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
            chr1    0.0908  249250621       0.4183  8.1241  20249466        76413   74517   0.4179  0       0.33
            chr10   0.0960  135534747       0.4167  8.6332  11700973        43491   43228   0.4160  0       0.33
            I tried various things including adding "hist=" and using "out=" instead of "stats=", and putting quotes around the reference flag, but was unable to replicate this. Are you sure you are looking at the correct output file, rather than an old one?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Best Practices for Single-Cell Sequencing Analysis
              by seqadmin



              While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
              Today, 07:15 AM
            • seqadmin
              Latest Developments in Precision Medicine
              by seqadmin



              Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

              Somatic Genomics
              “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
              05-24-2024, 01:16 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Today, 08:18 AM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Today, 08:04 AM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 06-03-2024, 06:55 AM
            0 responses
            13 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 05-30-2024, 03:16 PM
            0 responses
            27 views
            0 likes
            Last Post seqadmin  
            Working...
            X