Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Large proportion of mapping quality scores of 0 with bwa

    Hello,

    I'm working with RNA-Seq data (100 bp, Single End, Illumina) from Drosophila melanogaster. I'm hoping to map the reads to the transcriptome using bwa, obtain feature (transcript) counts using HTSeq, and perform the differential expression analysis in EdgeR. I'm aware that HTSeq was designed for splicing-aware aligners and I also plan to align my reads to the Genome using Tophat so that I can compare the two approaches. While aligning to the transcriptome with bwa, ~70% of my reads have mapping quality scores of 0. I did not realize this until I used HTSeq got the following output (in addition to 0 reads mapping to any feature):

    __no_feature 6122078
    __ambiguous 0
    __too_low_aQual 16002186
    __not_aligned 1410838
    __alignment_not_unique

    It seems very odd that the majority of my reads (and indeed all of the reads that either didn't have alignments or aligned to no feature) would have a low quality map score. If I understand a quality score of 0 correctly, this means that a read maps to two or more places equally. This occurs regardless if I use bwa mem or bwa aln or if I use different versions of the transcriptome (contiguous protein coding sequence versus mRNA from FlyBase) . Why do all of my reads that align to transcripts map to two or more locations?

    My pipeline is listed below. I used loops for the trim and alignment steps since I have numerous fastq files. Full disclosure, this is the first time I've done this and thus my loops could be erroneous, so I have included them below. The HTSeq output above is for one sam file that was generated via bwa mem, but I get similar results for all other alignment files.

    #Trim and Clean
    #!/bin/bash
    FILES=~/mouchka/raw_reads/fly_raw_reads/*.fastq
    for f in $FILES
    do
    NAME=${f%.*}_CT.fastq
    java -jar /opt/modules/biology/trimmomatic/0.33/bin/trimmomatic-0.33.jar SE -phred33 $f $NAME ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:20 MINLEN:25
    echo $f
    done

    #Align
    bwa index dmel-all-transcript-r6.05.fasta
    #!/bin/bash
    FILES=~/mouchka/trimmed/fly_trim_reads/test/*.fastq
    for f in $FILES
    do
    NAME=${f%.*}_P.sam
    bwa mem -t 4 dmel-all-transcript-r6.05.fasta $f > $NAME
    echo $f
    done

    Sample samtools flagstat ouput:
    23535102 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    22124264 + 0 mapped (94.01% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    HTSeq
    htseq-count -i transcript_id alignment.sam dmel-all-r6.05.gtf

    Any advice would be much appreciated. Thanks so much.

  • #2
    Transcriptomes of organisms with differential splicing will have a lot of multi-mapped reads because the same exons occur in multiple transcripts. If you want to map the reads unambiguously, use a splice-aware aligner such as BBMap and map to the genome.

    Comment


    • #3
      My suggestion is to go for alignment to the genome using the known trascriptome as guide (somthing like tophat2 [options] --GTF myannotation.gtf bwtindex myreads.fq).

      Comparing this result to the alignment to the transcriptome only, as you tried, might still be a good exercise. About the reason why you see so many reads mapping with 0 quality (ambiguously) my bet is that your reference transcriptome is redundant and it includes all the splice variants. For example, if a gene has exons A, B, C and it can produce transcripts with exons AB or AC or CB than all the reads will be ambiguously mapped since each exon appear twice in your reference fasta.

      Edit: Oops, just realized that Brian has already answered...!
      Last edited by dariober; 06-16-2015, 11:16 PM.

      Comment


      • #4
        Thanks for your timely feedback Brian and dariober. I'm certainly planning to align to the genome as well, although I had hoped to do both. I have seen others align reads to the Drosophila transcriptome in the literature, but their bioinformatic methods are sparse. Perhaps they use a modified transcriptome that is less redundant?

        dariober, would you mind clarifying what you mean by using the known transcriptome as a guide? Is the GTF file referenced above the genome or transcriptome? I apologize for my ignorance concerning the various approaches...my previous work was in a non-model organism and although not ideal, the only option was to map the reads to a de novo transcriptome.

        Thanks again!

        Comment


        • #5
          Originally posted by mouchkam View Post
          dariober, would you mind clarifying what you mean by using the known transcriptome as a guide?
          The GTF file you have, dmel-all-r6.05.gtf, has the coordinates of the gene features (exons, cds etc). You pass it to tophat so that the aligner "knows" where reads are expected to map. Have a look at the tophat documentation, which is quite extensive. So if your indexed genome files have prefix (for example) dmel.genome you would run tophat like
          Code:
          tophat2 [more options] -G dmel-all-r6.05.gtf dmel.genome reads.fq

          Comment


          • #6
            Thanks dariboer...this is actually what I did yesterday and I ran into yet another stumbling block. Perhaps you or Brian or others could shed some light on this problem?

            Here is my code for the alignment:

            bowtie2-build dmel-all-chromosome-r6.05.fa dmel-all-chromosome-r6.05

            tophat -p 4 -i 50 -I 5000 -o C8B1_R1_CT.tophat -G dmel-all-r6.05.gtf dmel-all-chromosome-r6.05 C8B1_R1_CT.fastq

            The align summary stats are as follows:
            Input : 23535102
            Mapped : 21714703 (92.3% of input)
            of these: 365295 ( 1.7%) have multiple alignments (5959 have >20)
            92.3% overall read mapping rate.

            HTSeq code:
            htseq-count C8B1_accepted_hits_sorted.sam dmel-all-r6.05.gtf

            But the report said that 20 million of my 23 million hits aligned to no feature! I know this is not possible because the sequencing facility provided a list of count data (via their generic pipeline) and it shows thousands of genes that have counts. I noticed that the features that do have counts are all non-protein coding mRNA. I have a feeling that my coordinates are not lining up between the sam file and the gtf file in HTSeq. But I'm not entirely sure how do deal with this? Do I need to realign with tophat? Or specify a different feature id when using HTSeq? I don't get an error message from tophat when running the alignment, so I think the the 1st column of my gtf file matches the name of the reference sequence in the bowtie index. I tried to confirm this running bowtie2-inspect on my base index, but to be honest, I'm having a really hard time finding the chromosome/contig name in the index names. Here is the code and a few lines of the output.

            bowtie2-inspect -n dmel-all-chromosome-r6.05

            dmel_mitochondrion_genome type=chromosome; loc=dmel_mitochondrion_genome:1..19517; ID=dmel_mitochondrion_genome; dbxref=GB:NC_001709; MD5=61af8db53361cd5744f41f773d21c3d4; length=19517; release=r6.05; species=Dmel;
            211000022279114 type=golden_path_region; loc=211000022279114:1..14983; ID=211000022279114; dbxref=GBS483726,GBS483726,REFSEQ:NW_001845015; MD5=cf55405ff9a66546f5d6dad8cf539944; length=14983; release=r6.05; species=Dmel;
            211000022280270 type=golden_path_region; loc=211000022280270:1..13108; ID=211000022280270; dbxref=GBS483666,GBS483666,REFSEQ:NW_001844955; MD5=366559f8dc9b57982af33ec0aac92804; length=13108; release=r6.05; species=Dmel;
            211000022280187 type=golden_path_region; loc=211000022280187:1..13079; ID=211000022280187; dbxref=GBS483725,GBS483725,REFSEQ:NW_001845014; MD5=aaf75285222c30583ffafc89dbf37b0c; length=13079; release=r6.05; species=Dmel;
            211000022280742 type=golden_path_region; loc=211000022280742:1..12513; ID=211000022280742; dbxref=GBS483677,GBS483677,REFSEQ:NW_001844966; MD5=10425aa97972c5f2aa93c1e01c885505; length=12513; release=r6.05; species=Dmel;
            211000022280763 type=golden_path_region; loc=211000022280763:1..12001; ID=211000022280763; dbxref=GBS483690,GBS483690,REFSEQ:NW_001844979; MD5=e2bbfb2660294c610bf16fb4587f1a2b; length=12001; release=r6.05; species=Dmel;

            Thanks in advance!

            Comment


            • #7
              As an addition to my last post, after doing a more thorough investigation, I noticed that for the Drosophila chromosomes, the 1st column of the gtf lists "3R," where as the index has a much longer name (e.g. 3R type=golden_path_region; loc=3R:1..32079331; ID=3R; dbxref=GB:AE014297,GB:AE014297,REFSEQ:NT_033777; MD5=420540d26d86fbf14185d2f2d68af9c4; length=32079331; release=r6.05; species=Dmel). The tophat manual states that "Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat." So does this mean that I need to go into the gtf and replace the 1st column with the much longer name of the reference sequence from the reference genome?

              Comment


              • #8
                You need to rename them so that they match. The easiest way is to rename it in the reference, like this:

                reformat.sh in=reference.fa out=renamed.fa trd

                "trd" means "trimreaddescription" and will trim everything after the first whitespace.

                Comment


                • #9
                  Got it! Will try this. Thanks!

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  30 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  32 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  53 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X