
  • paired-end reads apparently unseparated

    Hi, this is my first post!

    I've downloaded some ChIP-Seq data from the SRA ( http://www.ncbi.nlm.nih.gov/sra/SRX000425?report=full# ). The originating paper ( http://www.ncbi.nlm.nih.gov/pubmed/18477713 ) says that the reads are SOLiD paired-end, with 25 bp from each end, but the reads themselves are 52 bases long! It seems that the two ends have been ligated and sequenced together.

    I am wondering where the extra two bases come from, but the larger problem is that all the alignment programs out there seem to expect paired-end data as TWO lists of reads, one for each half of the pair. (There are two archives attached to this SRA account, but I'm fairly certain that they're not paired: their read counts add up to the number of reads reported in the paper, and the reported number of matches is more than half that number. Besides, the read IDs don't appear to match, and the reads are too long.)

    What tools can I use to map these reads to a reference genome?

    for reference, here are some sample reads:
    first file:
    Code:
    @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
    T32322133300002330031001022230020232002203222030231
    +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
    !21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
    @SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
    T01212120333223322020022322232232232222022232033230
    +SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
    !,*+*()+*(%'+)%%%&%+&%%'%%%%%%%%%%%%%%%%%%%%'+%%%%%
    @SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
    T32023002222000323202022222323322200222200220003032
    +SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
    !(*)%%%+'%%%*%%%%&%%%%%%%%%%%%%%%%%%%%%%%%%%%+%%%%%
    @SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
    T32021200310022332200020032222332303202203222030030
    +SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
    !9)'+*)4')%&&%)%%('&%%'%'%%%%%%%)%%%%%%%%%%%%+%%%%(
    second file
    Code:
    @SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
    T03231223000321133333031113002130221200322111211011
    +SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
    !:9<3:99<*8;8<)0<;<%-8;2%%3*5%*.8<,1;6;*%&..'%%-*,%
    @SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
    T01032120003210102101003202002003021300310100313323
    +SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
    !<*7-;3291:/*0306/';'6<8&/;13'/,6%5&,''*+3--/+*4&%&
    @SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
    T30000002320022232001023330002000220231323302003320
    +SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
    !,%&'''---+5%%%*-(-2%37''%-&%%+(3-&%*%%'%*&2''%3.%%
    @SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
    T01301202310020020101002322020221212212112020001111
    +SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
    !52/601)1,&3:%5691*-':74),'%%%&%&+(*)&%)'&&,'&)*)*%
    @SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
    T30202333031100120210330331030310222032111001231300
    +SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
    !;<<<;;<<<<<1<<1;<56/<<:9:1;31;/<;%%/89/'99<'08)<%0
    (all the reads start with T?)

  • #2
    The F3 is usually the first end, while the R3 is usually the second end (see the read name). Also, the read is in color space, hence the first T and the digits. Please review literature and this forum surrounding color space (what it is and what it looks like) and color space mapping (see BFAST/BWA/Mosaik/NovoalignCS/SHRiMP).



    • #3
      i grep'd the fastq files and there are no 'R3' strings in there. Why would the reads be 52 bases long when the paper says the run was 25 bases x 2? It seems likely that I failed to extract the reads from the sra-lite files correctly.

      (it's color space, clearly. i'm just confused why the first nucleotide is always T -- inspecting the fastq files i haven't found any reads starting with any other letter. I've read that ABI ligates some sort of primer or something in some cases, which could explain this.)
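The two checks described in this post (looking for 'R3' in the read names and for any leading base other than T) can be sketched in Python. This is a minimal sketch, assuming plain 4-line FASTQ records with headers shaped like the samples in the first post; `tally_reads` is a hypothetical helper name, not part of any toolkit.

```python
from collections import Counter

def tally_reads(fastq_lines):
    """Count the leading character of each sequence line and the trailing
    tag (e.g. F3 or R3) of each read name. Assumes 4-line FASTQ records."""
    records = [ln.strip() for ln in fastq_lines if ln.strip()]
    first_chars = Counter()
    tags = Counter()
    for i, line in enumerate(records):
        if i % 4 == 0:
            # Header: "@SRR015241.1 CLARA_..._34_F3 length=50"
            fields = line.split()
            name = fields[1] if len(fields) > 1 else fields[0]
            tags[name.rsplit("_", 1)[-1]] += 1
        elif i % 4 == 1:
            # Sequence line: record its first character (the primer base)
            first_chars[line[:1]] += 1
    return first_chars, tags
```

Calling `tally_reads(open(path))` on each downloaded file would show at a glance whether any read name ends in something other than F3 and whether any read starts with a base other than T.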



      • #4
        The only way to decode a color space read is using the adapter (did you read up on color space?). If there are 2x25 and each has an adapter added then coincidentally 2x(25+1) = 52, which is the number of bases you observed. I am wondering if they have done some concatenation magic here.

        Has anyone else seen this problem when downloading from SRA? You could try emailing their support, since the above data is confusing.
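To illustrate why the leading adapter base is needed for decoding, here is a minimal Python sketch of the standard SOLiD two-base encoding: with A=0, C=1, G=2, T=3, each color is the XOR of two adjacent bases, so every base is recovered from the previous base and the current color. The function name `decode_colorspace` is hypothetical. The sketch is for illustration only: a single color error corrupts every base after it, which is why mapping is normally done in color space rather than after decoding.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3

def decode_colorspace(read):
    """Decode a color-space read such as 'T0123' into nucleotides.

    The first character is the known primer/adapter base; it anchors the
    decoding but is not part of the sequenced fragment, so it is not returned.
    """
    primer, colors = read[0], read[1:]
    prev = BASES.index(primer)
    decoded = []
    for color in colors:
        if color not in "0123":
            raise ValueError(f"cannot decode color {color!r}")
        prev ^= int(color)  # next base code = previous base code XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)
```

For example, `decode_colorspace("T0123")` walks T -> T -> G -> A -> T and returns `"TGAT"`.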



        • #5
          am contacting SRA. thanks for your help.



          • #6
            i'm pretty sure this paper is mate pair reads and not paired end.
            I know it says paired end, but they are from circularized libraries, which means mate pair. Plus it is from 2008, when the reverse reads were not even around.

            also, i've never seen data in that format before.



            • #7
              Originally posted by SeqAA
              also, i've never seen data in that format before.
              You will not see it again after the SRA shuts down

              The second end of mate-pair data carries the R3 tag (while F5 is the paired-end tag). The first end should be F3, so it looks like either just the first end is present or you've downloaded fragments?



              • #8
                i contacted the SRA, who told me that they only received forward reads of length 50. It looks like i'll have to contact the authors of the paper. I'll post here when I solve the problem.

                thanks for your help, ppl!



                • #9
                  For each read, the first 26 characters are F3; the last 25 characters, once prefixed with a 'G', are R3.



                  • #10
                    sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into dna-space (from colorspace), then separate the forward and back reads.

                    That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! can anyone think of a better way?

                    also, sinaian, is there any documentation for this format of reads?



                    • #11
                      BWA does colorspace fastq like this BUT (IMPORTANT) the target genome must have been indexed (created) using the "-c" parameter, for example "bwa index -c hg1xORwhatever.fasta". If bwa complains, add the parameters "-a bwtsw".

                      Then run bwa using the -c parameter to align. THIS "-c/colorspace" MUST BE DONE ON BOTH THE INDEXING AND ALIGNMENT COMMANDS!!!!!
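The two commands above can be assembled as follows. This is only a sketch, assuming a BWA release that still accepts "-c" on both `bwa index` and `bwa aln`; the helper name `bwa_colorspace_commands` and the file names are hypothetical, and the commands are built as argument lists without being executed.

```python
def bwa_colorspace_commands(reference, reads, use_bwtsw=False):
    """Build the index and alignment commands for color-space reads.

    Both commands carry -c, as the post insists. Returns two argument
    lists suitable for subprocess.run; nothing is run here.
    """
    index_cmd = ["bwa", "index"]
    if use_bwtsw:
        # Only needed if the default indexing algorithm complains
        index_cmd += ["-a", "bwtsw"]
    index_cmd += ["-c", reference]
    # bwa aln writes its .sai output to stdout, so redirect it when running
    aln_cmd = ["bwa", "aln", "-c", reference, reads]
    return index_cmd, aln_cmd
```

With `subprocess.run(aln_cmd, stdout=open("reads.sai", "wb"), check=True)` the alignment output would be captured to a file (the output name is again just an example).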



                      • #12
                        For BWA, you'd have to strip the T, convert 0 to A, 1 to C, 2 to G, 3 to T and anything else to N. That's pseudo-nucleotide representation of color space, which was introduced as a workaround. It has nothing to do with the actual decoding into nucleotide space!
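The substitution described above is simple enough to sketch in a few lines of Python. The function name `to_pseudo_nucleotides` is hypothetical, and the sketch follows only the rule as stated in this post (drop the leading primer letter, then map 0/1/2/3 to A/C/G/T and everything else to N); it does not decode anything into real nucleotide space.

```python
# Mapping from the post: 0 -> A, 1 -> C, 2 -> G, 3 -> T, anything else -> N
_COLOR_TO_PSEUDO = {"0": "A", "1": "C", "2": "G", "3": "T"}

def to_pseudo_nucleotides(read):
    """Convert a color-space read (e.g. 'T0123.') to pseudo-nucleotides."""
    # Strip the leading primer base (T or G) if present
    colors = read[1:] if read[:1].isalpha() else read
    return "".join(_COLOR_TO_PSEUDO.get(color, "N") for color in colors)
```

If the quality line carries a placeholder character for the primer base, as the SRA files in this thread appear to, that first quality character would need to be dropped as well so that sequence and quality stay the same length.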

                        But BWA would map these reads as if they were 50 bp reads, not mate pairs.
                        You'll just have to write a script that separates the colorspace data and writes them to 2 files: one containing the F3
                        Code:
                        @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
                        T3232213330000233003100102
                        +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
                        21(()+%'+%40*.%%**)&%&*&%
                        ...
                        the other containing the R3
                        Code:
                        @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
                        G2230020232002203222030231
                        +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
                        %%&%%%%%%%%%%%%%%%(+%%%%'
                        ...
                        It's probably also necessary to get rid of whitespaces in the lines containing SRR015241.1
                        The T / G primer has no quality so that the quality line is 25 characters instead of 26.
                        You can then map these files separately with either BFAST or Novoalign (should work directly on this format) or BWA (after conversion into pseudo-nucleotides).
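The splitting script described in this post could look roughly like the Python sketch below. It assumes what the thread suggests but does not document: each record is a 'T' primer plus 25 F3 colors followed by 25 R3 colors, the quality line has one leading placeholder character for the primer, and the R3 primer is 'G'. The function names and the underscore-joined read IDs are hypothetical choices made here to avoid whitespace in the header lines.

```python
def split_record(header, seq, qual, f3_colors=25, r3_primer="G"):
    """Split one concatenated record into F3 and R3 FASTQ records."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    fields = header[1:].split()          # drop '@', then split on whitespace
    accession, spot = fields[0], fields[1]
    if spot.endswith("_F3"):
        spot = spot[:-3]                 # re-tag each half below
    cut = 1 + f3_colors                  # primer base + F3 colors
    # The primer has no quality value, so the placeholder qual[0] is dropped
    f3 = (f"@{accession}_{spot}_F3", seq[:cut], "+", qual[1:cut])
    r3 = (f"@{accession}_{spot}_R3", r3_primer + seq[cut:], "+", qual[cut:])
    return f3, r3

def split_fastq(lines):
    """Split all 4-line records; returns (F3 lines, R3 lines)."""
    records = [ln.strip() for ln in lines if ln.strip()]
    f3_out, r3_out = [], []
    for i in range(0, len(records) - 3, 4):
        header, seq, _, qual = records[i:i + 4]
        f3, r3 = split_record(header, seq, qual)
        f3_out.extend(f3)
        r3_out.extend(r3)
    return f3_out, r3_out
```

Writing `"\n".join(f3_out)` and `"\n".join(r3_out)` to two files gives one file per end. Whether a given aligner accepts these read names as a matched pair is an assumption worth checking against its documentation.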



                        • #13
                          ALSO ... bwa provides solid2fastq.pl and solid2fastq2.pl programs for converting to a fastq suitable for input to "bwa aln"



                          • #14
                            Originally posted by glocke
                            sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into dna-space (from colorspace), then separate the forward and back reads.

                            That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! can anyone think of a better way?

                            also, sinaian, is there any documentation for this format of reads?
                            Sorry for the late reply. You'll have to separate the two parts in color space before doing anything else to them. I am not aware of any tool or documentation, and my suggestion is only based on my limited trial and error. Honestly I am STUNNED by the way SRA handles SOLiD paired reads.
