Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cufflinks incorrectly specifies read type as single end

    Hi,

    I am using cufflinks to analyze a samfile produced using tophat. The data are 75bp paired-end reads and the samfile includes information about location of mate pairs etc. But cufflinks appears to be confused about the nature of the data. It prints that the read type is single-end, yet still estimates a fragment length distribution (see below). Anyone know what the problem is? Should I just ignore the fact that it says the reads are single-end?

    > Processed 259464 loci. [*************************] 100%
    > Map Properties:
    > Total Map Mass: 286777666.76
    > Read Type: 75bp single-end
    > Fragment Length Distribution: Gaussian (default)
    > Estimated Mean: 209.55
    > Estimated Std Dev: 70.54
    Lindy McBride - Rockefeller University

  • #2
    Hi Lindy,

    Can you post a few lines of the input SAM file you're feeding Cufflinks? You can retrieve it from a BAM file with:

    Code:
    samtools view accepted_hits.bam | tail -n +25 | head -n 10
    It may be that the read names in the alignment are not matching up, or are not being properly reported as paired.


    Andrew

    Comment


    • #3
      Hi Andrew,
      Thanks so much for your response. Here are the first 15 lines from one of the many samfiles that give me this problem. It might easier to read if you cut and paste into a text editor. It looks to me like the 5th and 7th reads are paired

      Thanks!
      Lindy
      @HD VN:1.0 SO:sorted
      @PG ID:TopHat VN:1.0.14 CL:/usr/local/genome/bin/tophat -p 8 -a 12 -m 1 -r 56 -o /mnt/analyses/LVPant/ AaegL1 LVPa2_R1_fastq_prefilter.txt,LVPa2_B_read_1_fastq_prefilter.txt,LVPa3_A_read_1_fastq_prefilter.txt,LVPa3_B_read_1_fastq_prefilter.txt,LVPa4_A_read_1_fastq_prefilter.txt,LVPa4_B_read_1_fastq_prefilter.txt,LVPa5_A_read_1_fastq_prefilter.txt,LVPa5_B_read_1_fastq_prefilter.txt LVPa2_R2_fastq_prefilter.txt,LVPa2_B_read_2_fastq_prefilter.txt,LVPa3_A_read_2_fastq_prefilter.txt,LVPa3_B_read_2_fastq_prefilter.txt,LVPa4_A_read_2_fastq_prefilter.txt,LVPa4_B_read_2_fastq_prefilter.txt,LVPa5_A_read_2_fastq_prefilter.txt,LVPa5_B_read_2_fastq_prefilter.txt
      ILLUMINA-300160:LVPa4_B_read_2:1:6:78:12045:15339#0 137 CH477186.1 702 1 76M * 0 0 CCAAAAAGAGATCAAGATTTATCAAAATGTTATTGTGCAATGGTACAGCCGTTTTGAGAAAGAATTGTTTGGCATC hhhhfffhhhhhhhchhhhhdhhffhhhghhchhehhhghhhfghgchhchhhhhhcfah_fW]chhhhehahhhg NM:i:2 NH:i:4 CC:Z:CH477219.1 CP:i:2212927
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:21:9291:2197#0 137 CH477186.1 934 1 75M * 0 0 GGAACGTCAGCGTAGCAGCTCACTAAACAGCGGTGATCACTCCAGTGCTCAGTGTACCGGACAACATAGTGCCGG bdfchdaefffgeghhghcfhfhghhfhhhhghghhhhghghhhhhhhhhhhhhhhghhhhhhhhhghhhhhhhh NM:i:1 NH:i:3 CC:Z:CH477299.1 CP:i:2381918
      ILLUMINA-300160:LVPa3_A_read_2:1:2:114:7815:7323#0 137 CH477186.1 1102 3 76M * 0 0 GCACGACGATTGGAATCGACCTCAACAAAGTGAACAATTTCGGTCACAGAGTGTGGCCAAGATTCGGTGCCCGCCC d`^`R`feedd[a]c[bb]b]ccc]_fdadaffdfd_Q^a_fdbb]fdWfa[dchafffdaafahghh`_gfhefh NM:i:2 NH:i:2 CC:Z:CH477284.1 CP:i:705773
      ILLUMINA-300160:LVPa3_A_read_2:1:2:5:9651:17717#0 163 CH477186.1 2520 255 76M = 2641 0 TGCAACTGTAGTAGCTATCCCCAAGGCCAAAAAAGACGTAACGCTGCCTACAAACTACCGTCCCATCAGCCTGCTA gfhhhhhghfh_fhhhgahhhhhhhchghhgahhhhchhhhhhhhdhchhfgfgge_ce[bdfdcahgaga]aaaa NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_1:1:8:116:1810:15456#0 99 CH477186.1 2567 255 76M = 2708 0 CTACAAACTACCGTCCCATCAGCCTGCTAAGTAGCCTGAGCAAAGTGTTTGAACGAATCATCTTAAACCTGTTTCA gggggggggggggggggaggdcfffgggggggggggggggfffaeaebefgggggcgggegggdgdfJeddadefg NM:i:1 NH:i:1
      ILLUMINA-300160:LVPa3_A_read_1:1:2:5:9651:17717#0 83 CH477186.1 2641 255 76M = 2520 0 CAAAGCCGGTCACTCCACCAACCACCAGTTAGCACGGATAACCAAAATCGTAAAGGACGGTTTCTCTGCAAGAAAA dhcgegdfddfdfcfce]hfehhghhdgghhhfhhhghhhhhhghhhfhhhhhhhhghhhhhhhhhhh_hhhhhhh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_2:1:8:116:1810:15456#0 147 CH477186.1 2708 255 76M = 2567 0 GCAAGAAAATCGACTGGTATGATTATGCTTGACGTCGAAAAGGCATATAATTCCGTCTGGCAGGACGCGATCATCT dbbghggdadghadghgggefdfdfc_dhhhdhhceghhhhhhhhfgchhhcfhhhhfhhhhhhhhhhhhghhhhh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_2:1:8:10:14657:2302#0 163 CH477186.1 2743 255 76M = 2877 0 CGAAAAGGCATATAATTCCGTCTGGCAGGACGCGATCATCTACAAACTACGTAAATGTAACTGTCCTTTCTACATT hhfhhhhhhhhhhhhhhhhhhhhhhhghhfheghhhhchhhhhhh_hghghhhdfhgfhggghdfahhgghhdcgh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_1:1:8:10:14657:2302#0 83 CH477186.1 2877 255 76M = 2743 0 CTGTATCTGGGAAATTCGACATCCCTTGTGGCGTCCCACAGCGATCAGTGTTGAGCCCAACACTCTACAATGTTTT ggdggdgghhhgggWhfhfhhggcgehhghghhffcaggdhhhchhfhhhghhghgghhhgghhhhhhhghhhhhh NM:i:2 NH:i:1
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:3:14836:3337#0 163 CH477186.1 4492 0 75M = 4603 0 ATTCGCCCCATTGCGGGGTGAATAGAAACATACAACATTTTCAATAATATTTGTACGATATAATTTTTATTTTTT hhhhhhhhhhhhhhhhhcbh`ffdehhhhhfhhhhhhhhhhhehhhhhhhhhhgghhghhhhghhhhhghhghhe NM:i:2 NH:i:5 CC:Z:CH477198.1 CP:i:431872
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:47:7635:10354#0 163 CH477186.1 4495 0 75M = 4606 0 TGCCCCATTGCGGGGTGAATAGAAACATACAACATTTTCATAAATATTTGTACGATATAATTTTTATTTTTTAAC hhhhhhhhhhhhhhh`f]dffffffhhhhfhhhhghhhhhhhhdhhhhhhghhhhhghhghhhhhghhhhhhhhh NM:i:1 NH:i:5 CC:Z:CH477198.1 CP:i:431869
      ILLUMINA-300160:LVPa2_B_read_2:0:1:53:19937:17604#0 137 CH477186.1 4588 0 76M * 0 0 AAAGGAACACGCGACCCATATTTAAGACTAATAAAATTGGATTTTATCCACACGGTTTAAGTTGTACCCATTCAAA hhfhhZ]GSWRXWXZhhagfhhhhcggfghhhhaeegd]e_afffhgghhahhchhhfedcacfa[VaZZ[[[[Q[ NM:i:2 NH:i:19 CC:Z:CH477198.1 CP:i:431775
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:0:8:91:10481:4293#0 137 CH477186.1 4596 0 75M * 0 0 ACGCGACCCATATTTAAGACTAATAAAATTGGATTTTATCCACACGGTTTAAGTTGTACACATTCAAACATGTCG f_[]feae`d^Qb]]c^[caZZZZZZWMWY\aYY\Z_ZUOed]]`VUQTV``WX]d`^Wdb`d^_WUU[QbdZd` NM:i:0 NH:i:16 CC:Z:CH477198.1 CP:i:431768
      Lindy McBride - Rockefeller University

      Comment


      • #4
        From http://samtools.sourceforge.net/SAM1.pdf, the name of both mates in a pair must be the same. So while the two reads you mentioned are certainly paired, they aren't matched up in Cufflinks because the names differ by the number marked with a carrot below. If you preprocess the SAM files to eliminate the mate field given by that number (maybe turning it into an X or something?), the library should be recognized as paired-end.

        Code:
        ILLUMINA-300160:LVPa5_B_read_1:1:8:116:1810:15456#0 99 CH477186.1 2567 255 76M =
                                     ^                                                 
        ILLUMINA-300160:LVPa5_B_read_2:1:8:116:1810:15456#0 147 CH477186.1 2708 255 76M =
                                     ^
        Lemme know how it goes!


        Andrew
        Last edited by polyatail; 12-16-2010, 10:11 PM. Reason: misinfo re: sam specification

        Comment


        • #5
          I have only just now come back to this issue, and your suggestion worked - thanks! I changed both read_1 and read_2 in the read names in my samfile to read_X, and cufflinks now reports my library as paired end.

          I have a second issue now however - less significant than the first. When I run cufflinks with a reference annotation it correctly reports the mean fragment length as ~200 bp. But when I run it without the reference annotation it reports the fragment length as 0 bp (see two outputs below). I assume this is because my samfile was generated by a tophat run which used known junctions from the same reference annotation? It seems like the most likley answer, though I see no reason why cufflinks would not be able to parse the mapping coordinates of each read in a pair and report the correct fragment length distribution regardless of whether a reference annotation is used...

          > cufflinks -G aaegypti.BASEFEATURES_Liverpool-AaegL1.2.gff3 LVPa2_abbr.RX.bam
          [21:13:01] Inspecting reads and determining fragment length distribution.
          > Processed 16360 loci. [*************************] 100%
          > Map Properties:
          > Total Map Mass: 123784.26
          > Read Type: 75bp paired-end
          > Fragment Length Distribution: Empirical (learned)
          > Estimated Mean: 197.41
          > Estimated Std Dev: 13.71
          [21:13:04] Calculating estimated abundances.
          > Processed 16360 loci. [*************************] 100%


          > cufflinks LVPa2_abbr.RX.bam
          [21:05:59] Inspecting reads and determining fragment length distribution.
          > Processed 1389 loci. [*************************] 100%
          > Map Properties:
          > Total Map Mass: 150829.93
          > Read Type: 75bp paired-end
          > Fragment Length Distribution: Gaussian (default)
          > Estimated Mean: 0.02
          > Estimated Std Dev: 2.17
          [21:06:01] Assembling transcripts and estimating abundances.
          > Processed 1389 loci. [*************************] 100%
          Lindy McBride - Rockefeller University

          Comment


          • #6
            It looks like when you didn't provide an annotation, there was not enough available information to determine the fragment length distribution (see http://cufflinks.cbcb.umd.edu/howitworks.html#hdis). I'm not sure, however, why it defaulted to such low values. Try specifying the mean with -m and the std-dev with -s.

            -Adam

            Comment


            • #7
              Hi Adam - Thanks. I will try that. I guess I can run it with the reference to get the values and then rerun it without the reference specifying the values estimated in the first run. I had not seen the "how cufflinks works" page. That is helpful.
              Lindy
              Lindy McBride - Rockefeller University

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              66 views
              0 likes
              Last Post seqadmin  
              Working...
              X