Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • DarthMapper
    Junior Member
    • Sep 2015
    • 4

    Greetings community! I am only a first year undergrad student, so please bare with my ignorance.

    I have encountered some difficulties trying to use BBMap to align WGS pair end reads belonging to a Sheep to it's pertinent rna.fa file - I'm only interested in the organism's full mRNA sequences, but I do not know where else to get them.

    The problem I'm facing surfaces after verifying Average Coverage and Standard Deviation calculations. While the coverage results are as expected, the standard deviation values are well over 600!

    I will post the command line and the path to the reference data I am using below, to ease out the process of troubleshooting:

    Reference Data: found under NCBI's ftp site, under the following path: genomes/Ovis_aries/RNA/rna.fa.gz
    Here is a link that leads directly to the directory containing the reference file I am using: ftp://ftp.ncbi.nlm.nih.gov/genomes/Ovis_aries/RNA/

    *If anyone else would kindly suggest a better way of obtaining only the full mRNA nucleotide sequences of the organism, please be kind enough to enlighten my poor soul*

    Reads: I am using paired end fastq WGS reads that were used to build the organism's genome assembly. They have already undergone proper quality testing.

    After I create my build using BBMap and my reference file (rna.fa), I proceed to align my reads to the reference build. Here is what my command line looks like:

    bbmap.sh in1=read_1.fastq in2=read_2.fastq out=MappedFile.sam build=1 minratio=0.56

    (I am trying to compare ambiguous values later on, so I need the min ratio. Making the "minratio=<>" parameter closer to 1 does lower the Standard Deviation, but it is still unusually high)

    Cool, up to there everything runs smoothly. I take a look at my alignment results and proceed to calculate coverage and standard deviation values using pileup.sh. Again, here is my command line:

    pileup.sh in=MappedFile.sam out=stats.txt hist=histogram.txt

    After looking at the results, they look something like this:
    Set USE_COVERAGE_ARRAYS to true
    Set USE_BITSETS to false
    Note: Coverage capped at 65535

    Average Coverage: 37.84
    Standard Deviation: 707.47
    Percent scaffolds with any coverage: 80.76
    Percent of reference bases covered: 35.87

    I'd be incredibly thankful if anyone would shed some light into this little hindrance. What am I messing up?

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      Well... I'm not quite sure why you're mapping DNA reads to the transcriptome. If the annotation is accurate, you already have the full mRNA nucleotide sequences in the file "rna.fa.gz" and don't need to generate a sam file at all, nor will WGS help you find the mRNA sequences with any methodology I can think of.

      Are you sure you meant WGS (ie, DNA reads) rather than RNA-seq data? For RNA-seq data, to find full-length mRNAs, you would probably assemble it using an assembler like Trinity, or map it to the genome and use some other tool to guess isoforms from mapping data.

      As for the high standard deviation - BBMap by default will map reads that could map to multiple locations (aka multimapping, ambiguous, or non-uniquely-mapping reads) to the first genomic location they can map to optimally. For this kind of purpose (RNA-seq, or mapping to a transcriptome) I recommend you set "ambig=random" or "ambig=all". Otherwise, for genes with multiple isoforms, all the reads from that gene will map to the first isoform containing the exon in question, leading to a high standard deviation. For mapping RNA to a transcriptome I generally recommend "ambig=random".

      Also, if you ARE mapping DNA to the transcriptome (which, again, is probably not useful), you should set the "local" flag - otherwise reads that span intron/exon boundaries will have a lot of mismatches since there is intronic content in DNA reads, but none in the transcriptome.

      -Brian

      Comment

      • DarthMapper
        Junior Member
        • Sep 2015
        • 4

        I tried using the settings you had recommended, but the standard deviation is still awfully high. Would this be related to the fact that we are trying to map WGS sequences to a transcriptome?

        Here are the results:

        With ambig=random setting and local=t

        Set USE_COVERAGE_ARRAYS to true
        Set USE_BITSETS to false
        Note: Coverage capped at 65535

        Average coverage: 31.27
        Standard deviation: 628.72
        Percent scaffolds with any coverage: 99.94
        Percent of reference bases covered: 75.89

        Time: 2084.076 seconds.

        ----------------------------------------------------------------------------------------------------------------------


        With ambig=random setting, but no local flag

        Set USE_COVERAGE_ARRAYS to true
        Set USE_BITSETS to false
        Note: Coverage capped at 65535

        Average coverage: 31.27
        Standard deviation: 635.37
        Percent scaffolds with any coverage: 99.94
        Percent of reference bases covered: 77.75

        Time: 1935.627 seconds.
        --------------------------------------------------------------------------------------------------------------------
        Thank you again Brian for your last reply!

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          I don't understand why you are aligning DNA reads to a transcriptome. To clarify, are these DNA reads or RNA-seq reads? And can you explain what you're trying to do?

          Comment

          • DarthMapper
            Junior Member
            • Sep 2015
            • 4

            I'm so sorry; I was editing my last reply when you responded.

            I forgot to mention that we are trying to determine copy number via read depth analysis, so we need to know the average and SD read mapping to all genes present in the genome (to compare against the ambiguous genes in question).

            We are using WGS reads, which are DNA, and we are mapping these to the transcriptome.

            Comment

            • HESmith
              Senior Member
              • Oct 2009
              • 512

              You should be mapping to the genome, then extracting the information that corresponds to the transcriptome. I'm not surprised that your standard deviations are wonky. It's probably due to repeat elements in the genome that are under-represented in the transcriptome, which would produce much higher-than-expected coverage. You can check by BLASTing a few examples of your high-coverage regions against the genome, and see how many hits you get.

              Comment

              • DarthMapper
                Junior Member
                • Sep 2015
                • 4

                Well, usually we map the WGS reads to the genome assembly for quality testing. But, how exactly would we extract the information corresponding to the transcriptome? Would I be able to use BBMap for this?

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  I think you can do that with Bedtools and a gene annotation file.

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    One would need BBMap to create the alignment file that can then be used with bedtools

                    Comment

                    • JamesJoyce
                      Junior Member
                      • Oct 2015
                      • 5

                      Hello Brian;
                      I am trying to do something similar to what DarthMapper was suggesting.
                      My PI had instructed me to determine the copy number of certain genes in a single organism by mapping WGS reads to said gene sequences as reference. I then am supposed to compare coverage values of these individual genes with that obtained from mapping the WGS reads to the whole set of mRNA sequences present in the studied organism's transcriptome. The problem is that I am also getting strange standard deviations when using BBMap.
                      What is wrong with my methodology?

                      Comment

                      • Brian Bushnell
                        Super Moderator
                        • Jan 2014
                        • 2709

                        Hi James,

                        Can you give me the exact command line you are using?

                        Also, the complete output printed to the screen would be helpful, as well as the name of the organism.
                        Last edited by Brian Bushnell; 10-04-2015, 05:27 PM.

                        Comment

                        • JamesJoyce
                          Junior Member
                          • Oct 2015
                          • 5

                          Thanks for your reply Brian.
                          I am using BBMap to align WGS reads to a set of mRNA sequences extracted from the organism's NCBI transcriptome. This is being done with different organisms, such as mammals, reptiles, and fungi. So far, I have tested a few mammals, including Felix Catus (cat) and Rattus Norvegicus (Rat). Again, I was instructed to find the copy number of certain genes within a single organism. My PI told me to map WGS sequences to the transcriptome's mRNA sequences as reference before mapping to individual gene sequences, but we stopped once we saw the results.

                          The results for both mammals are very similar, so I will only post one example

                          Here is the commandline I am using and the results:

                          Index Build
                          bbmap.sh ref=mRNA.fa build=1

                          Alignment
                          bbmap.sh build=1 in1=catreads/read_1.fastq in2=catreads/read_2.fastq ambig=random local=t out=mRNA.sam

                          Coverage and Standard Deviation:
                          pileup.sh in=mRNA.sam out=covstats.txt hist=histogram.txt

                          P.S. I have tried to use samtools depth to calculate coverage and STDV with BBMap's sam outputs but it does not let me - even when I have converted the sam's to bam, and sorted the bam. How can this be achieved?

                          CoverageResults
                          Set USE_COVERAGE_ARRAYS to true
                          Set USE_BITSETS to false
                          Note: Coverage capped at 65535

                          Average coverage: 15.29
                          Standard deviation: 384.14
                          Percent scaffolds with any coverage: 99.81
                          Percent of reference bases covered: 69.32

                          Is there any way to lower that STDV?

                          Comment

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            First off, you should expect a high standard deviation in this scenario. If one gene has 10 isoforms and another has 1 isoform, the gene with 1 isoform will have (on average) 10x the coverage of the isoforms from the gene with 10, when ambiguously-mapped reads are placed randomly. You need to use exactly one isoform per gene (preferably the longest).

                            However, it's also possible that there's a problem with the random selection of sites for ambiguously-mapped paired reads. Would you mind trying this (mapping with read 1 only):

                            bbmap.sh build=1 in1=catreads/read_1.fastq ambig=random local=t out=mRNA.sam covstats=covstats.txt hist=histogram.txt

                            Note that bbmap can internally generate coverage information and does not need a second pass from pileup. As for why samtools depth is not working, I have no idea. Does it give some kind of error message?

                            Comment

                            • GenoMax
                              Senior Member
                              • Feb 2008
                              • 7142

                              @Brian: How is this analysis is going to give information about number of copies of a gene to @DarthMapper/@JamesJoyce? Am I missing something obvious?

                              Comment

                              • Brian Bushnell
                                Super Moderator
                                • Jan 2014
                                • 2709

                                Well... it is kind of a strange analysis... and a normal transcriptome file, with many isoforms per gene, will not very useful. But, if you had exactly one transcript per gene in your reference, then theoretically, genes with 2 copies should get double the coverage of genes with 1 copy, if you map uniform WGS data. Using the "local" flag would be helpful because wherever reads go into introns they will not match the transcriptome. Also, pseudogenes will cause confusion. So I'm not sure how much signal there will be compared to the noise, but with those assumptions (one transcript per gene, local alignments, fairly uniform coverage, no pseudogenes), it should work.
                                Last edited by Brian Bushnell; 10-05-2015, 11:07 AM.

                                Comment

                                Latest Articles

                                Collapse

                                • GATTACAT
                                  Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                  by GATTACAT
                                  Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                  07-01-2026, 11:43 AM
                                • SEQadmin2
                                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                  by SEQadmin2


                                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                  Here are nine questions we think about, in roughly the order they matter, before...
                                  06-18-2026, 07:11 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 07-02-2026, 11:08 AM
                                0 responses
                                11 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-30-2026, 05:37 AM
                                0 responses
                                13 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-26-2026, 11:10 AM
                                0 responses
                                20 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-17-2026, 06:09 AM
                                0 responses
                                54 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...