Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat SOLID PE data not mapped in pairs

    Hello,

    this is a problem I have been dealing with now for a long time. I try to map SOLID PE reads using TopHat v 1.1.4 using this command:

    Code:
    tophat --color --quals -r 155 -p 4 -o /data/genetics/ -G /home/schaefer/tophat/Rattus_norvegicus.RGSC3.4.56.tophat.gtf rn4_c /data/genetics/1sort_1.csfasta,/data/genetics/1sort_2.csfasta /data/genetics/analyses/1sort_1.qual,/data/genetics/1sort_2.qual
    All 4 input files are sorted and the reads are named like this:

    Code:
    1sort_1.csfasta
    >1000_1000_1070/1
    T20010011233323011122110001120331100111211132111111
    >1000_1000_1179/1
    T31033102020100001322200230302001331230303321323200
    >1000_1000_1947/1
    T03313223130010310121222000210123010111002233323320
    1sort_2.csfasta
    >1000_1000_1070/2
    G1011001131233111011133111
    >1000_1000_1179/2
    G2222132322022032303223131
    >1000_1000_1947/2
    G0100011002000120101320013
    1sort_1.qual
    >1000_1000_1070/1
    32 24 32 29 22 6 11 29 10 25 14 20 31 20 14 33 32 26 6 5 6 6 32 22 8 33 10 28 21 14 8 8 23 25 14 22 19 26 10 8 14 17 23 23 32 4 32 24 8 32
    >1000_1000_1179/1
    31 33 32 31 33 27 33 33 31 33 33 27 32 32 33 33 32 33 31 32 32 31 33 28 32 33 27 30 31 32 33 22 31 32 29 26 31 22 29 24 32 33 20 16 33 25 33 12 31 29
    >1000_1000_1947/1
    33 33 33 30 33 33 28 33 31 33 26 32 26 24 33 29 33 31 20 8 33 33 30 33 29 31 32 33 29 31 29 32 24 14 17 33 30 28 27 27 33 31 28 33 33 30 31 8 24 12
    1sort_2.qual
    >1000_1000_1070/2
    32 25 33 33 8 31 11 10 6 32 33 33 6 8 8 32 28 22 30 20 32 8 12 10 14
    >1000_1000_1179/2
    30 29 30 33 25 16 29 33 33 26 31 30 23 29 30 25 32 31 25 10 27 30 24 31 29
    >1000_1000_1947/2
    33 33 32 27 32 33 33 23 24 32 33 33 19 16 30 31 27 22 14 31 19 17 20 6 29
    >1000_1000_254/2
    15 16 18 25 31 30 8 31 8 4 10 20 28 32 31 30 4 31 31 12 10 4 17 19 9
    >1000_1000_358/2
    5 32 12 11 7 8 18 9 7 8 4 27 4 6 4 4 4 4 5 6 4 21 4 4 4
    I have also tried using identical names for the reads (without /1 and /2).

    Everything seems to work ok:

    Code:
    [Mon Dec 13 11:52:28 2010] Beginning TopHat run (v1.1.4)
    -----------------------------------------------
    [Mon Dec 13 11:52:29 2010] Preparing output location /data/genetics//
    [Mon Dec 13 11:52:29 2010] Checking for Bowtie index files
    [Mon Dec 13 11:52:29 2010] Checking for reference FASTA file
    [Mon Dec 13 11:52:29 2010] Checking for Bowtie
    	Bowtie version:			 0.12.3.0
    [Mon Dec 13 11:52:29 2010] Checking for Samtools
    	Samtools version:		 0.1.8.0
    [Mon Dec 13 11:52:36 2010] Checking reads
    	min read length: 25bp, max read length: 50bp
    	format:		 fasta
    [Mon Dec 13 11:58:39 2010] Reading known junctions from GTF file
    [Mon Dec 13 12:13:59 2010] Mapping reads against rn4_c with Bowtie
    [Mon Dec 13 15:08:47 2010] Joining segment hits
    [Mon Dec 13 15:35:40 2010] Mapping reads against rn4_c with Bowtie(1/2)
    [Mon Dec 13 17:16:18 2010] Mapping reads against rn4_c with Bowtie(2/2)
    [Mon Dec 13 18:29:19 2010] Searching for junctions via segment mapping
    [Mon Dec 13 20:05:16 2010] Retrieving sequences for splices
    [Mon Dec 13 20:07:55 2010] Indexing splices
    Warning: Encountered reference sequence with only gaps
    Warning: Encountered reference sequence with only gaps
    [Mon Dec 13 20:23:43 2010] Mapping reads against segment_juncs with Bowtie
    [Mon Dec 13 21:20:19 2010] Mapping reads against segment_juncs with Bowtie
    [Mon Dec 13 22:10:15 2010] Joining segment hits
    [Mon Dec 13 22:19:17 2010] Reporting output tracks
    -----------------------------------------------
    Run complete [10:55:26 elapsed]
    The problem arises when I look at the output: The reads are not mapped in pairs, but in single reads of 50 and 25 bp each.

    samtools view reveals:

    Code:
    33_119_59	16	chr1	226	1	25M	*	0	0	GATTTGAGTGCAGAGAGCTGTGTGA	!!#+))))!!)))))))))****FF	NM:i:0	NH:i:3	CC:Z:=	CP:i:70416
    478_1485_267	16	chr1	3060	0	25M	*	0	0	AGACATGGAGAAAGAACACTCCTCC	//6.(BI:<;07A<51!.JNJ::::	NM:i:0	NH:i:5	CC:Z:=	CP:i:73804
    980_603_1244	16	chr1	3884	0	50M	*	0	0	TGTATATGTATATATATATGCATATATATATATATATATATACATACGTA	!!!+/2&!!)-/+++))*"&&!)))**),41**)!!))!!++))))))))	NM:i:0	NH:i:16	CC:Z:=	CP:i:61895914
    All flags are either 16 or 0, which cannot be true for a paired read. Do you have any idea what is going on?

    All the best,
    Seb

  • #2
    You are not saying to tophat that you want to use paired end data. You should remove the "," between the files to do so:

    Code:
    tophat --color --quals -r 155 -p 4 -o /data/genetics/ -G /home/schaefer/tophat/Rattus_norvegicus.RGSC3.4.56.tophat.gtf rn4_c /data/genetics/1sort_1.csfasta /data/genetics/1sort_2.csfasta /data/genetics/analyses/1sort_1.qual /data/genetics/1sort_2.qual

    Comment


    • #3
      great, I tried it and it works!

      thx!

      Comment


      • #4
        TopHat PE data not mapped in pairs - new version of TopHat

        Could you please tell me if this command is still the same using the new version of TopHat (v1.3.1)?

        I ask because we are trying to map SOLiD paired-end RNA-seq reads to the human genome. We are getting 36% - 40% reads mapping and are looking for ways to improve our level of mapping. We also suspect TopHat is not treating our datasets as paired, as in the SAM output column 8 (PNEXT) the value is always 0.

        An example command is:
        Code:
        tophat -p 8 --color --quals  -r 140 --mate-std-dev 30 --library-type fr-secondstrand -G hg19mRNA.gtf /Path/to/Bowtie-0.12.7/index/hg19_c Pair_F3.csfasta,Pair_F5.csfasta Pair_F3.qual,Pair_F5.qual
        Any advice would be much appreciated.

        Thanks
        Helen

        Comment


        • #5
          Just look at the previous post. You have to remove the commas between the file names.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Working...
          X