Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat/Cufflinks is sensitive to the read name in the gene expression estimation?

    Hi, all
    Recently, I have found that tophat/cufflinks is sensitive to the read name in the alignment.
    In a project, I have a sequencing paired-end reads, of which the read names are as follows:
    Code:
    Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT/1
    Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT/2
    After mapping these reads to the reference by tophat, the read names in the alignment were like:
    Code:
    Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
    Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT
    Obiously, they were different. So when I only used the proper paired reads in the alignment as the input for Cufflinks, I got over 90% genes of 0 FPKM.
    After manually changing the name of Read 2 to the same as Read 1, I got a proper result with over 70% genes expressed.

    It seems that Tophat/Cufflinks is sensitive to the read name. Dose anyone have experience in this condition?
    Last edited by wisense; 04-24-2014, 04:39 PM. Reason: some syntax error

  • #2
    The sam specification requires read 1 and read 2 to have identical names, so the /1 and /2 will be dropped. BBMap has a flag that will retain them ("keepnames=t"), but bear in mind that the result will be an improper sam file, so some downstream programs may not be able to process it.

    I'm just guessing, but probably, Tophat was treating the reads you renamed as single-ended?

    FYI, those names look odd to me. I'm used to seeing something like this:

    Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT
    Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT

    ...well, that may not be exactly right, but anyway, there's normally a space before the "/1", and everything before the space is identical for both reads (whereas in your case, they are not identical). If the space is missing, and the reads are interleaved rather than in 2 files, BBTools at least will not recognize them as paired (though obviously it would if they were in two files). Were they processed in some way (like removing all whitespace) before you got them?

    P.S. It would be helpful if you gave your Tophat command line or explained whether these reads are in two files or interleaved in one file.
    Last edited by Brian Bushnell; 04-24-2014, 05:20 PM.

    Comment


    • #3
      Originally posted by Brian Bushnell View Post
      The sam specification requires read 1 and read 2 to have identical names, so the /1 and /2 will be dropped. BBMap has a flag that will retain them ("keepnames=t"), but bear in mind that the result will be an improper sam file, so some downstream programs may not be able to process it.

      I'm just guessing, but probably, Tophat was treating the reads you renamed as single-ended?

      FYI, those names look odd to me. I'm used to seeing something like this:

      Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT
      Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT

      ...well, that may not be exactly right, but anyway, there's normally a space before the "/1", and everything before the space is identical for both reads (whereas in your case, they are not identical). If the space is missing, and the reads are interleaved rather than in 2 files, BBTools at least will not recognize them as paired (though obviously it would if they were in two files). Were they processed in some way (like removing all whitespace) before you got them?

      P.S. It would be helpful if you gave your Tophat command line or explained whether these reads are in two files or interleaved in one file.
      Hi, Brian
      In my case, Read1 and Read2 are in two files, and this is my Tophat command line:
      Code:
      tophat -o tophat_out -G zv9.gtf bowtie_index/zv9 read1.fq.gz read2.fq.gz
      In my opinion, after I rename the read 2 in the alignment, the read 1 and read 2 will have the identical name, so Tophat will treat these reads as paired reads. If I don't rename the read 2, Tophat will fail to treat these reads as paired reads.

      If Tophat treat these reads as single-end reads, theoretically the FPKMs of most genes will increase, but in my case, over 90% genes with FPKM equal to 0.

      Comment


      • #4
        Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina... but it's very strange that Tophat does not recognize 2 files as being paired - BBMap certainly will, regardless of the read names:

        bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000

        Comment


        • #5
          Originally posted by Brian Bushnell View Post
          Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina...
          Illumina has not changed the format of its read names in years, since the release of v1.8 in May 2011. The standard format is:

          Code:
          Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT
          Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT
          The definition is:

          Code:
          @<instrument>:<run_number>:<flowcell_ID>:<lane>:<tile>:<x-pos>:<y-pos><space><read>:<is_filtered>:<control_number>:<index_sequence>
          Notice that definition lines have two parts separated by a <space>. The first part is the ID, the second is additional information. Only the ID part is required to be identical for TopHat (and other software) to recognize the reads as mates.

          Originally posted by wisense
          In my opinion, after I rename the read 2 in the alignment, the read 1 and read 2 will have the identical name, so Tophat will treat these reads as paired reads.
          No, after you renamed them they still did not have identical names:

          Code:
          Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]1[/COLOR]:1:0:CGATGT
          Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]2[/COLOR]:1:0:CGATGT
          What apparently has happened is the somewhere along the line someone replaced the required <space> with a ':' in the definition line. This means the entire line is now considered the ID, not just the first part. Most cases I have seen with software having trouble recognizing Illumina IDs for read pairs have been caused by someone has intentionally altering the ID's.

          Comment


          • #6
            Originally posted by kmcarr View Post
            Illumina has not changed the format of its read names in years, since the release of v1.8 in May 2011. The standard format is:

            Code:
            Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT
            Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT
            The definition is:

            Code:
            @<instrument>:<run_number>:<flowcell_ID>:<lane>:<tile>:<x-pos>:<y-pos><space><read>:<is_filtered>:<control_number>:<index_sequence>
            Notice that definition lines have two parts separated by a <space>. The first part is the ID, the second is additional information. Only the ID part is required to be identical for TopHat (and other software) to recognize the reads as mates.



            No, after you renamed them they still did not have identical names:

            Code:
            Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]1[/COLOR]:1:0:CGATGT
            Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]2[/COLOR]:1:0:CGATGT
            What apparently has happened is the somewhere along the line someone replaced the required <space> with a ':' in the definition line. This means the entire line is now considered the ID, not just the first part. Most cases I have seen with software having trouble recognizing Illumina IDs for read pairs have been caused by someone has intentionally altering the ID's.
            Hi, kmcarr
            Thanks for your reply.
            What I have done was like:
            Code:
            Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]1[/COLOR]:1:0:CGATGT
            Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:[COLOR="Red"]1[/COLOR]:1:0:CGATGT
            I have changed the "2" in the read name field of read 2 to "1", so read 1 and read 2 share the identical ID, sorry for the ambigous expression before.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            27 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            30 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            26 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X