Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Greetings community! I am only a first year undergrad student, so please bare with my ignorance.

    I have encountered some difficulties trying to use BBMap to align WGS pair end reads belonging to a Sheep to it's pertinent rna.fa file - I'm only interested in the organism's full mRNA sequences, but I do not know where else to get them.

    The problem I'm facing surfaces after verifying Average Coverage and Standard Deviation calculations. While the coverage results are as expected, the standard deviation values are well over 600!

    I will post the command line and the path to the reference data I am using below, to ease out the process of troubleshooting:

    Reference Data: found under NCBI's ftp site, under the following path: genomes/Ovis_aries/RNA/rna.fa.gz
    Here is a link that leads directly to the directory containing the reference file I am using: ftp://ftp.ncbi.nlm.nih.gov/genomes/Ovis_aries/RNA/

    *If anyone else would kindly suggest a better way of obtaining only the full mRNA nucleotide sequences of the organism, please be kind enough to enlighten my poor soul*

    Reads: I am using paired end fastq WGS reads that were used to build the organism's genome assembly. They have already undergone proper quality testing.

    After I create my build using BBMap and my reference file (rna.fa), I proceed to align my reads to the reference build. Here is what my command line looks like:

    bbmap.sh in1=read_1.fastq in2=read_2.fastq out=MappedFile.sam build=1 minratio=0.56

    (I am trying to compare ambiguous values later on, so I need the min ratio. Making the "minratio=<>" parameter closer to 1 does lower the Standard Deviation, but it is still unusually high)

    Cool, up to there everything runs smoothly. I take a look at my alignment results and proceed to calculate coverage and standard deviation values using pileup.sh. Again, here is my command line:

    pileup.sh in=MappedFile.sam out=stats.txt hist=histogram.txt

    After looking at the results, they look something like this:
    Set USE_COVERAGE_ARRAYS to true
    Set USE_BITSETS to false
    Note: Coverage capped at 65535

    Average Coverage: 37.84
    Standard Deviation: 707.47
    Percent scaffolds with any coverage: 80.76
    Percent of reference bases covered: 35.87

    I'd be incredibly thankful if anyone would shed some light into this little hindrance. What am I messing up?

    Comment


    • Well... I'm not quite sure why you're mapping DNA reads to the transcriptome. If the annotation is accurate, you already have the full mRNA nucleotide sequences in the file "rna.fa.gz" and don't need to generate a sam file at all, nor will WGS help you find the mRNA sequences with any methodology I can think of.

      Are you sure you meant WGS (ie, DNA reads) rather than RNA-seq data? For RNA-seq data, to find full-length mRNAs, you would probably assemble it using an assembler like Trinity, or map it to the genome and use some other tool to guess isoforms from mapping data.

      As for the high standard deviation - BBMap by default will map reads that could map to multiple locations (aka multimapping, ambiguous, or non-uniquely-mapping reads) to the first genomic location they can map to optimally. For this kind of purpose (RNA-seq, or mapping to a transcriptome) I recommend you set "ambig=random" or "ambig=all". Otherwise, for genes with multiple isoforms, all the reads from that gene will map to the first isoform containing the exon in question, leading to a high standard deviation. For mapping RNA to a transcriptome I generally recommend "ambig=random".

      Also, if you ARE mapping DNA to the transcriptome (which, again, is probably not useful), you should set the "local" flag - otherwise reads that span intron/exon boundaries will have a lot of mismatches since there is intronic content in DNA reads, but none in the transcriptome.

      -Brian

      Comment


      • I tried using the settings you had recommended, but the standard deviation is still awfully high. Would this be related to the fact that we are trying to map WGS sequences to a transcriptome?

        Here are the results:

        With ambig=random setting and local=t

        Set USE_COVERAGE_ARRAYS to true
        Set USE_BITSETS to false
        Note: Coverage capped at 65535

        Average coverage: 31.27
        Standard deviation: 628.72
        Percent scaffolds with any coverage: 99.94
        Percent of reference bases covered: 75.89

        Time: 2084.076 seconds.

        ----------------------------------------------------------------------------------------------------------------------


        With ambig=random setting, but no local flag

        Set USE_COVERAGE_ARRAYS to true
        Set USE_BITSETS to false
        Note: Coverage capped at 65535

        Average coverage: 31.27
        Standard deviation: 635.37
        Percent scaffolds with any coverage: 99.94
        Percent of reference bases covered: 77.75

        Time: 1935.627 seconds.
        --------------------------------------------------------------------------------------------------------------------
        Thank you again Brian for your last reply!

        Comment


        • I don't understand why you are aligning DNA reads to a transcriptome. To clarify, are these DNA reads or RNA-seq reads? And can you explain what you're trying to do?

          Comment


          • I'm so sorry; I was editing my last reply when you responded.

            I forgot to mention that we are trying to determine copy number via read depth analysis, so we need to know the average and SD read mapping to all genes present in the genome (to compare against the ambiguous genes in question).

            We are using WGS reads, which are DNA, and we are mapping these to the transcriptome.

            Comment


            • You should be mapping to the genome, then extracting the information that corresponds to the transcriptome. I'm not surprised that your standard deviations are wonky. It's probably due to repeat elements in the genome that are under-represented in the transcriptome, which would produce much higher-than-expected coverage. You can check by BLASTing a few examples of your high-coverage regions against the genome, and see how many hits you get.

              Comment


              • Well, usually we map the WGS reads to the genome assembly for quality testing. But, how exactly would we extract the information corresponding to the transcriptome? Would I be able to use BBMap for this?

                Comment


                • I think you can do that with Bedtools and a gene annotation file.

                  Comment


                  • One would need BBMap to create the alignment file that can then be used with bedtools

                    Comment


                    • Hello Brian;
                      I am trying to do something similar to what DarthMapper was suggesting.
                      My PI had instructed me to determine the copy number of certain genes in a single organism by mapping WGS reads to said gene sequences as reference. I then am supposed to compare coverage values of these individual genes with that obtained from mapping the WGS reads to the whole set of mRNA sequences present in the studied organism's transcriptome. The problem is that I am also getting strange standard deviations when using BBMap.
                      What is wrong with my methodology?

                      Comment


                      • Hi James,

                        Can you give me the exact command line you are using?

                        Also, the complete output printed to the screen would be helpful, as well as the name of the organism.
                        Last edited by Brian Bushnell; 10-04-2015, 05:27 PM.

                        Comment


                        • Thanks for your reply Brian.
                          I am using BBMap to align WGS reads to a set of mRNA sequences extracted from the organism's NCBI transcriptome. This is being done with different organisms, such as mammals, reptiles, and fungi. So far, I have tested a few mammals, including Felix Catus (cat) and Rattus Norvegicus (Rat). Again, I was instructed to find the copy number of certain genes within a single organism. My PI told me to map WGS sequences to the transcriptome's mRNA sequences as reference before mapping to individual gene sequences, but we stopped once we saw the results.

                          The results for both mammals are very similar, so I will only post one example

                          Here is the commandline I am using and the results:

                          Index Build
                          bbmap.sh ref=mRNA.fa build=1

                          Alignment
                          bbmap.sh build=1 in1=catreads/read_1.fastq in2=catreads/read_2.fastq ambig=random local=t out=mRNA.sam

                          Coverage and Standard Deviation:
                          pileup.sh in=mRNA.sam out=covstats.txt hist=histogram.txt

                          P.S. I have tried to use samtools depth to calculate coverage and STDV with BBMap's sam outputs but it does not let me - even when I have converted the sam's to bam, and sorted the bam. How can this be achieved?

                          CoverageResults
                          Set USE_COVERAGE_ARRAYS to true
                          Set USE_BITSETS to false
                          Note: Coverage capped at 65535

                          Average coverage: 15.29
                          Standard deviation: 384.14
                          Percent scaffolds with any coverage: 99.81
                          Percent of reference bases covered: 69.32

                          Is there any way to lower that STDV?

                          Comment


                          • First off, you should expect a high standard deviation in this scenario. If one gene has 10 isoforms and another has 1 isoform, the gene with 1 isoform will have (on average) 10x the coverage of the isoforms from the gene with 10, when ambiguously-mapped reads are placed randomly. You need to use exactly one isoform per gene (preferably the longest).

                            However, it's also possible that there's a problem with the random selection of sites for ambiguously-mapped paired reads. Would you mind trying this (mapping with read 1 only):

                            bbmap.sh build=1 in1=catreads/read_1.fastq ambig=random local=t out=mRNA.sam covstats=covstats.txt hist=histogram.txt

                            Note that bbmap can internally generate coverage information and does not need a second pass from pileup. As for why samtools depth is not working, I have no idea. Does it give some kind of error message?

                            Comment


                            • @Brian: How is this analysis is going to give information about number of copies of a gene to @DarthMapper/@JamesJoyce? Am I missing something obvious?

                              Comment


                              • Well... it is kind of a strange analysis... and a normal transcriptome file, with many isoforms per gene, will not very useful. But, if you had exactly one transcript per gene in your reference, then theoretically, genes with 2 copies should get double the coverage of genes with 1 copy, if you map uniform WGS data. Using the "local" flag would be helpful because wherever reads go into introns they will not match the transcriptome. Also, pseudogenes will cause confusion. So I'm not sure how much signal there will be compared to the noise, but with those assumptions (one transcript per gene, local alignments, fairly uniform coverage, no pseudogenes), it should work.
                                Last edited by Brian Bushnell; 10-05-2015, 11:07 AM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                27 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                26 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X