Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Double reads for one gene

    Hi all,

    I have performed RNA-seq on 2 samples, a "case" and "control", with the idea being to find differential expression between the two.

    I have noticed, for one gene, that I have loads of "double" sequences mapping, i.e.: the bamfile for the sequence (which I have converted to a GenomicRanges class), has two reads mapping to the same point in the gene, but mapping repeatedly at different points along the gene.

    >countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028) != 0)]
    [1] 2 2 2 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    [38] 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
    [75] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
    [112] 2 1 2

    Whereas for another gene (and I think typically...at least for my top differentially expressed genes) I get:

    >countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151) != 0)]
    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    I have traced these "double mappings" back to the fastq files, they are definitely the same sequences, e.g. for one of the sequences, I found the original reads in the sam and fastq files, here I show an example:


    % grep HWUSI-EAS4752312156373911 s_1_1.qseq_ACAGTGA.fastq -A 1
    @HWUSI-EAS4752312156373911
    GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
    +HWUSI-EAS4752312156373911
    geggggggggdggggfcfdfffaffggggcffff

    % grep HWUSI-EAS4752317017106159551 s_1_1.qseq_ACAGTGA.fastq -A 1
    @HWUSI-EAS4752317017106159551
    GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
    +HWUSI-EAS4752317017106159551
    hhhhhhhhhhhhhhghhhehhfdhhhhhhhhghh

    It seems really weird, like somehow the reads for this gene are being duplicated. Is there a known reason for this?

    To make it clear, it's not the same read is mapping at different positions in the same gene,The reads map to one position only, it's just that 2 reads map to the same position, consistently, so it's almost the opposite problem of a read mapping unambiguously!

    so if you imagine:

    Situation for most genes in sample:
    Code:
    [FONT="Courier New"]DIFFERENT READS:
                 ------
                              ------
                                        ------
                        ------
                                                ------
    
    GENE :------------------------------------------------->[/FONT]
    Situation for this 1 particular gene:

    Code:
    [FONT="Courier New"]DIFFERENT READS:
                              ------
                              ------
                        ------
                        ------
                                      ------
                                      ------
                           ------
                           ------
                   ------
                   ------
    
    GENE :------------------------------------------->[/FONT]
    At least, this is what I think is happening.

    I tested for a load of other genes in the same sample and they all follow the pattern of the first figure, i.e. reads map in different regions, the second figure shows something that only happens for this gene.

    I've written a script that looks to see if this happens for any other genes at all in the genome, but it's taking a while to run! (to be fair my R could probably be more efficient!).

    The other weird thing is it happens in different samples! It happens in all 3 "case" samples, and not at all in the 3 "control" samples, but I can't think of any possible biological reason for it (at least not before going into more detail..I just wanted to see if any one here had encountered it, or if it was a common thing before I did so).

    Any thougts at all would be greatly appreciated. It's very weird.

    My plan otherwise is to count each "double" as one count. Though since it is not expressed in control, it's not such a problem..we can't calculated a fold change anyway.

    Cheers,

  • #2
    My first wild guess is that they look like PCR duplicates.

    Comment


    • #3
      Originally posted by lukas1848 View Post
      My first wild guess is that they look like PCR duplicates.
      But surely PCR duplicates should be seen more randomly across the genome rather than for just one gene?

      Comment


      • #4
        PCR duplicates should be removed by SAMtools anyway. So these might have just slipped through the filtering. Don't ask me why though.

        Comment


        • #5
          First guess is, of course, PCR duplicates.
          Second guess is repeat/low complexity regions within the gene.
          Third guess is a co-expressed homologous gene.
          Fourth guess is gene duplication.

          Comment


          • #6
            Originally posted by TonyBrooks View Post
            But surely PCR duplicates should be seen more randomly across the genome rather than for just one gene?
            Exactly... it's definitely not random. It might happen for other genes (we'll find out when my script finishes) but it doesn't happen "a little bit" for many genes, just (almost) all for one.

            It's like something happens when the mRNA gets sheared, and for a given mRNA it doubles it... or something?

            Comment


            • #7
              Originally posted by GW_OK View Post
              First guess is, of course, PCR duplicates.
              Second guess is repeat/low complexity regions within the gene.
              Third guess is a co-expressed homologous gene.
              Fourth guess is gene duplication.
              Again, if it's duplicates, why might it be particularly happening for this one gene? And why an exact doubling each time? Has anyone seen this before specifically for one gene?

              I'm not sure that it's repeat/low complexity regions, since wouldn't the read map unambiguously to several positions in the gene? Not exactly two reads mapping to separate positions, several times? I blatted a couple of reads and they do map unambiguously to the genome.

              It seems strange that if it is a co-expressed homologous gene, that the reads would be mapping in exactly the same places? The gene length is ~500 bp, and there are ~200 reads mapping, but most of them are mapping twice in the same location, which seems odd to me.

              Also, it seems weird it's so reproducable, it's happening specifically for this gene in different biological replicates
              Last edited by jimineep; 10-27-2011, 07:14 AM. Reason: corrected overuse of the word weird

              Comment


              • #8
                Is it possible that having different isoforms for the same gene is messing up the mapping?

                Comment


                • #9
                  Originally posted by gringer View Post
                  Is it possible that having different isoforms for the same gene is messing up the mapping?
                  How do you mean? You mean there are 2 different isoforms being expressed in the tissue? I don't see how that would lead to the 2x sequence generation?

                  Comment


                  • #10
                    You mentioned "converted to a GenomicRanges class", and I wonder if hits to different isoforms for the same gene might be recorded multiple times. Is there any way you can look at overlaps by isoform('tx_by_isoform'), rather than overlaps by gene ('tx_by_gene')?

                    Comment


                    • #11
                      Interesting point. I will investigate. I get the same result when mapping to exons though (i.e. tx_by_exon).

                      By isoforms do you mean splice variants or something deeper? One thing I can think of against that theory is that the problem can be traced back to the fastq file:

                      % grep HWUSI-EAS4752312156373911 s_1_1.qseq_ACAGTGA.fastq -A 1
                      @HWUSI-EAS4752312156373911
                      GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
                      +HWUSI-EAS4752312156373911
                      geggggggggdggggfcfdfffaffggggcffff

                      % grep HWUSI-EAS4752317017106159551 s_1_1.qseq_ACAGTGA.fastq -A 1
                      @HWUSI-EAS4752317017106159551
                      GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
                      +HWUSI-EAS4752317017106159551
                      hhhhhhhhhhhhhhghhhehhfdhhhhhhhhghh

                      So there is definitiely a pair of identical sequences that map unambiguously to the genome in the fastq file. That said I'm going to go even further back to the qseq file just to triple check!

                      Comment


                      • #12
                        Wouldn't you expect multiple hits to the same gene regions for RNA-seq experiements with a suitably large number of reads? What happens if you grep for the sequence itself (or arbitrary other sequences that have been mapped)?:

                        Code:
                        % grep '^GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA' s_1_1.qseq_ACAGTGA.fastq | wc -l
                        By isoforms do you mean splice variants or something deeper
                        I think they're usually just splice variants. To be more verbose, different transcripts that map to the same gene region. I'll give a rough example. Say you have a gene with two different isoforms:

                        Code:
                        Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
                        Iso_1: [exon 1  ][exon 4  ][exon 5  ]
                        Iso_2: [exon 1  ][exon 3  ][exon 5  ]
                        Let's say you get a cell that has both isoforms expressed at the same time:

                        Code:
                        Iso_1: [exon 1  ][exon 4  ][exon 5  ]
                        [hit]    ----     ----      ----
                        Iso_2: [exon 1  ][exon 3  ][exon 5  ]
                        [hit]      ----    ----  ---- ----
                        And have some way of converting those to binary counts per isoform:

                        Code:
                        Iso_1: [exon 1  ][exon 4  ][exon 5  ]
                        [hit]    1111     1111      1111
                        Iso_2: [exon 1  ][exon 3  ][exon 5  ]
                        [hit]      1111    1111  1111 1111
                        When these are summed up for a single gene, it will look something like this:
                        Code:
                        Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
                        [hit]    112211                     1111  11     1111       1212211
                        [note that the exon mapping will be the same as the gene mapping in terms of summed coverage]

                        Comment


                        • #13
                          Originally posted by gringer View Post
                          When these are summed up for a single gene, it will look something like this:
                          Code:
                          Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
                          [hit]    112211                     1111  11     1111       1212211
                          [note that the exon mapping will be the same as the gene mapping in terms of summed coverage]
                          This still wouldn't explain why there are always two reads mapping exactly to the same position. I'd still say that this is some artifact resulting from the library prep.

                          Comment


                          • #14
                            I feel the same. at first look ,they look like PCR duplicates and hence overamplification of the samples. I would contact the person making the libraries and get details on amount of input DNA. If the run is a PE run, you can safely remove these duplicate rates that have the same start and the end as they are PCR duplicates.

                            Comment


                            • #15
                              I think Gringer has it,

                              I think the problem comes down to the Granges object, which is as so:

                              > tx_by_exon$ENSRNOG00000024028
                              GRanges with 4 ranges and 2 elementMetadata values
                              seqnames ranges strand | exon_id exon_name
                              <Rle> <IRanges> <Rle> | <integer> <character>
                              [1] chr2 [185464227, 185464733] - | 40667 NA
                              [2] chr2 [185464256, 185464351] - | 40669 NA
                              [3] chr2 [185464373, 185464714] - | 40670 NA
                              [4] chr2 [185465839, 185465857] - | 40668 NA

                              and for the tx_by_gene is as follows:

                              > tx_by_gene$ENSRNOG00000024028
                              GRanges with 2 ranges and 2 elementMetadata values
                              seqnames ranges strand | tx_id tx_name
                              <Rle> <IRanges> <Rle> | <integer> <character>
                              [1] chr2 [185464227, 185465857] - | 6823 ENSRNOT00000012174
                              [2] chr2 [185464256, 185464714] - | 6824 ENSRNOT00000068099

                              These ranges are overlapping, so clearly the reads are being counted twice for the different isoforms, which is why I was getting a 2.

                              I believe I have misinterpreted the results, thinking something much weirder was going on than was actually happening. Apologies.

                              Basically I wasn't expecting the two isoforms to coexist in this way, I didn't think that exonsBy would allow exons to overlap in this way & therefore get counted twice. I hadn't realised because all the other genes I had checked had non-overlapping exons.

                              This is interesting, does this therefore mean when someone performs:

                              library(Rsamtools)
                              reads=readBamGappedAlignments("aligned_reads_sorted.bam")
                              counts=countOverlaps(tx_by_gene,reads)

                              to get the overlap between the reads and genes, they could actually count the same read several times? If so they could in theory end up with more reads mapping to genes than there are reads in a bam file?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X