Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with the insert size of RNA-seq paired end reads!

    The problem is that I cannot get a huge part of my RNA-seq data aligned!
    These are some samples from the liver, intestine and colon of mice. We are analyzing the expression level of their genes. The data has been sequenced in paired end by NextSeq 500 (2*75). And you can find the BioAnalyzer plot for the reads before sequencing in the attachment.

    If you look at the trace the fragment sizes are distributed between approx. 190-650 bp with an average of somewhere between 300-350 bps. 130pbs of those belong to upper and lower primers, which won't be sequenced. So with 2*75 sequencing length we might have overlap in some cases (I looked in one sample and in that sample 30% of the fragments should have overlapped reads) and inner distance in most of reads. Please see the summary of that in the following table:

    Initial read length Primer length Actual read length Sequenced length 2*75 Overlap length Inner distance
    190 130 60 150 90 0
    300 130 170 150 0 20
    350 130 220 150 0 70
    650 130 520 150 0 370

    Now the problem is the standard deviation and median length options while you want to align data by TopHat2. To solve the problem I thought I could align data first by bowtie2 to get some idea about the inner distance. But unfortunately I cannot get about 50% of my reads aligned.


    Here is the command I use for that:

    Code:
    bowtie2 -q --phred33 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 --n-ceil L,0,0.15 --end-to-end --score-min L,-0.6,-0.6 -I 45 -X 900 -t --met-file bowtie_align_metrix.txt --met-stderr bowtie_stderr.txt --no-unal --al  ~/al --un ~/un --un-conc ~/un_conc --al-conc ~/all_conc -p 8 --non-deterministic -x ~/ref-files/mm37 -1 R1.paired.fq -2 R1.unpaired.fq -S result.sam >& bowtie_log_file

    And here is the statistical result gotten from Bowtie2 aligning:

    Code:
    16852758 reads; of these:
      16852758 (100.00%) were paired; of these:
        8254156 (48.98%) aligned concordantly 0 times
        6476798 (38.43%) aligned concordantly exactly 1 time
        2121804 (12.59%) aligned concordantly >1 times
        ----
        8254156 pairs aligned concordantly 0 times; of these:
          974394 (11.80%) aligned discordantly 1 time
        ----
        7279762 pairs aligned 0 times concordantly or discordantly; of these:
          14559524 mates make up the pairs; of these:
            10751027 (73.84%) aligned 0 times
            3072453 (21.10%) aligned exactly 1 time
            736044 (5.06%) aligned >1 times
    68.10% overall alignment rate
    Time searching: 05:33:24
    Overall time: 05:33:24
    Why this (8254156 (48.98%) aligned concordantly 0 times) happends???????
    Attached Files
    Last edited by rozitaa; 02-24-2015, 05:21 AM.

  • #2
    Why are you using bowtie2? Are you aligning to the transcriptome?

    In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, theses are single-end, but you wouldn't expect paired-end reads to differ that much.

    Comment


    • #3
      Originally posted by dpryan View Post
      Why are you using bowtie2? Are you aligning to the transcriptome?

      In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, theses are single-end, but you wouldn't expect paired-end reads to differ that much.
      No I am aligning to the genome reference! Because later on I want to align it by TopHat. So this is only a way to get idea about inner distance. But I can try star also.
      I have had several experiences with single-end or paired-end with overlap alignment by bowtie2 and always got good results! but this time I doubt it might be the inner size that I cannot get them aligned.

      Comment


      • #4
        Honestly, I wouldn't worry too much about the insert size settings, it doesn't make much difference as far as I've seen. In any case, you'll get much faster results with STAR (and the results are just as reliable).

        Comment


        • #5
          Now I was also checking my sam file and interestingly some of my sam alignment lines don't have all sam fields! like this one:

          @NS500175:21:H2T5HBGXX:1:11101:24788:1114 1:N:0:CTGAAGCT+NGGATAGG%0ACTCCAGTATAAACTACTTTCCATATTCATTGTAAATCACAATGGTTTCCCACAGGCACAAAACAAAGCACAGAAAT%0A+%0AA)AAAFFFFFAAFFFFFFFFFFFAFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFA%0A
          Do you have any idea about this?

          Comment


          • #6
            Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.

            Comment


            • #7
              Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.

              I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).

              Other cause is maybe that your DNA was contaminated. You should try to align your reads to some common source of contamination, like human or mycoplasma.

              Comment


              • #8
                Originally posted by dpryan View Post
                Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.
                I am using version 2.0.6!!!

                Comment


                • #9
                  Originally posted by diego diaz View Post
                  Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.

                  I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).

                  Other cause is maybe that your DNA was contaminated. You should try to align your reads to some common source of contamination, like human or mycoplasma.
                  I attached plots and statistical tables for two different samples one from read one end and read two for the other sample! I don't have any idea what is the cutoff kmer enrichment. Can you please explain this more?
                  Attached Files

                  Comment


                  • #10
                    Originally posted by rozitaa View Post
                    I am using version 2.0.6!!!
                    That is a pretty old version (current as of January 2013).

                    Current is 2.2.4.

                    Comment


                    • #11
                      Have you seen this (post #18 by Brian has some interesting data) http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.

                      Hopefully you had adapter trimmed your data before doing the alignments.

                      Comment


                      • #12
                        kmer is a substring of length k present in a sequence (In this case, DNA sequence)

                        For example,

                        ATTACGAGCGATCGCGCG

                        If we consider kmers of length 5, then from left to the right we have:

                        ATTAC, TTACG, TACGA, and so on.

                        In Bioinformatics, is a common task get the frequency of all possible kmers of a given sequence.

                        During the sequencing protocol, DNA is randomly fragmented (theoretically), then all kmers should have similar frequency (although in reality it is not always the case). If you see some kmers enriched in your reads, this means possibly that you have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because sequencing adapter is the same for all reads.

                        In the plots that you attached I can see a kmer enrichment at the end of reads. At the 5' end is maybe due random priming, and at the 3' end maybe some remains of sequencing adapter, I don't know.

                        The tables shows a observed/expected rate, if the value is greater than 1, it means that the observed frequency is greater than the expected.

                        Hope that helps!

                        Comment


                        • #13
                          Originally posted by rozitaa View Post
                          No I am aligning to the genome reference! Because later on I want to align it by TopHat. So this is only a way to get idea about inner distance. But I can try star also.
                          I have had several experiences with single-end or paired-end with overlap alignment by bowtie2 and always got good results! but this time I doubt it might be the inner size that I cannot get them aligned.
                          Mammals have teeny little exons spread out over 10's-100's of kilobases of the the genome. Mapping RNA (which has the introns spliced out) reads to the genome isn't a good way to determine insert size. And only getting 50% of the reads to map "concordantly" doesn't seem so bad. How is bowtie2 going to handle reads spanning a splice site?

                          If you want to determine your insert sizes, try aligning your reads to a long (spliced) transcript instead of genomic DNA. In my experience with the MiSeq and HiSeq, your sizes will look like all the very shortest library products were sequenced preferentially.

                          --
                          Phillip

                          Comment


                          • #14
                            Originally posted by GenoMax View Post
                            Have you seen this (post #18 by Brian has some interesting data) http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.

                            Hopefully you had adapter trimmed your data before doing the alignments.
                            Well it should be trimmed!!

                            Comment


                            • #15
                              Originally posted by diego diaz View Post
                              kmer is a substring of length k present in a sequence (In this case, DNA sequence)

                              For example,

                              ATTACGAGCGATCGCGCG

                              If we consider kmers of length 5, then from left to the right we have:

                              ATTAC, TTACG, TACGA, and so on.

                              In Bioinformatics, is a common task get the frequency of all possible kmers of a given sequence.

                              During the sequencing protocol, DNA is randomly fragmented (theoretically), then all kmers should have similar frequency (although in reality it is not always the case). If you see some kmers enriched in your reads, this means possibly that you have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because sequencing adapter is the same for all reads.

                              In the plots that you attached I can see a kmer enrichment at the end of reads. At the 5' end is maybe due random priming, and at the 3' end maybe some remains of sequencing adapter, I don't know.

                              The tables shows a observed/expected rate, if the value is greater than 1, it means that the observed frequency is greater than the expected.

                              Hope that helps!
                              Thanks for nice explanation!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              48 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X