  • Maq - sol2sanger problem - different sizes for the pair?

    Hi, All

    I just used "maq sol2sanger" to convert Illumina's _sequence.txt files to .fastq format. The run was paired-end, so I have the following two txt files:

    s_1_1_sequence.txt ; size 4116883072
    s_1_2_sequence.txt ; size 4116883072

    After sol2sanger conversion, the fastq files don't have the same size:

    s_1_1_sequence.fastq; size 3644668984
    s_1_2_sequence.fastq; size 3644660878

    This is weird. They should come out the same size, right? Besides, for all the other lanes, this conversion produced files of the same size for each pair.

    Can anyone help me answer this question?

    Thanks very much!

    -Cliff

  • #2
    That does look like something has gone wrong.

    Also, assuming you are using FASTQ files from Illumina pipeline 1.3+, then don't use sol2sanger, use ill2sanger (requires a patch to MAQ - search the forum).

    Or BioPerl, or EMBOSS, or an ad-hoc perl script or, ... lots of examples on the forum. My biased suggestion would be to use Biopython, http://news.open-bio.org/news/2009/0...vert-function/

    See also: http://en.wikipedia.org/wiki/FASTQ_format
    Last edited by maubp; 12-07-2009, 11:55 AM. Reason: Typo



    • #3
      Originally posted by cliff View Post
      This is weird. They should come out the same size, right? Besides, for all the other lanes, this conversion produced files of the same size for each pair.
      Have you checked the files? sol2sanger doesn't print the sequence header twice, so

      @seqID
      CGATCGTAGCTAGC
      +seqID
      BBBBBBBBBBBBBB

      becomes

      @seqID
      CGATCGTAGCTAGC
      +
      ##############

      (the scores are completely random in this example ^__^)

      hence you may be missing bytes
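To put a number on this, here is a minimal sketch of the size effect of dropping the repeated header. It is not MAQ's own code: the helper names `strip_repeated_header` and `record_size` are made up for illustration, and it assumes the input uses four-line records whose `+` line repeats the read name, as described above. The read and quality strings are shortened placeholders.

```python
def strip_repeated_header(record_lines):
    """Return a four-line FASTQ record with the '+' line reduced to a bare '+'."""
    header, sequence, plus, quality = record_lines
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    return [header, sequence, "+", quality]


def record_size(record_lines):
    """Size in bytes of an ASCII record written with one newline per line."""
    return sum(len(line) + 1 for line in record_lines)


# Read name taken from the thread; sequence and qualities are placeholders.
original = [
    "@BILLIEHOLIDAY:1:1:3:1204#0/1",
    "GACC",
    "+BILLIEHOLIDAY:1:1:3:1204#0/1",
    "BBBC",
]
stripped = strip_repeated_header(original)
saved = record_size(original) - record_size(stripped)
print("bytes saved for this record:", saved)
```

The saving per record equals the length of the read name, so it depends only on the names and not on the bases or qualities.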



      • #4
        I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

        cliff - how about posting the first few records of each file?
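Beyond eyeballing the first few records, one way to see where two converted files diverge is to total the bytes per line type in each file. The following is a rough sketch under the assumption of strict four-line FASTQ records (no wrapped sequence lines); `summarize_fastq` is a hypothetical helper, not part of MAQ or any library.

```python
def summarize_fastq(lines):
    """Count records and bytes per line type for four-line FASTQ records.

    `lines` is any iterable of bytes lines, for example a file opened
    with mode "rb".
    """
    kinds = ("header", "sequence", "plus", "quality")
    byte_counts = dict.fromkeys(kinds, 0)
    line_count = 0
    suspicious = []  # 1-based line numbers that do not look like FASTQ
    for line in lines:
        kind = kinds[line_count % 4]
        line_count += 1
        byte_counts[kind] += len(line)
        if kind == "header" and not line.startswith(b"@"):
            suspicious.append(line_count)
        elif kind == "plus" and not line.startswith(b"+"):
            suspicious.append(line_count)
    return {
        "records": line_count // 4,
        "complete": line_count % 4 == 0,  # False if the last record is cut off
        "bytes": byte_counts,
        "total_bytes": sum(byte_counts.values()),
        "suspicious_lines": suspicious[:10],
    }
```

It could be run on each file of the pair, for example `with open("s_1_1_sequence.fastq", "rb") as handle: print(summarize_fastq(handle))`, and the two summaries compared. Equal record counts with different sequence or quality byte totals would point at lines of unexpected length rather than at missing reads.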



        • #5
          Originally posted by maubp View Post
          I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

          cliff - how about posting the first few records of each file?
          You're right! On a second read I realize the issue here is not "the sizes differ before and after conversion" but "the paired reads differ in size after conversion"... Whoops!

          d



          • #6
            Thanks for all your replies. Here are the first records of the fastq files:

            1: $ more s_1_1_sequence.fastq

            @BILLIEHOLIDAY:1:1:3:1204#0/1
            GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
            +
            BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/1
            TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
            +
            BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            2: $ more s_1_2_sequence.fastq
            @BILLIEHOLIDAY:1:1:3:1204#0/2
            TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
            +
            BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/2
            GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
            +
            A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>



            • #7
              Or using the [ code ] tags, since otherwise the forum mangles them:

              1: $ more s_1_1_sequence.fastq

              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/1
              GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
              +
              BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/1
              TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
              +
              BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
              2: $ more s_1_2_sequence.fastq
              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/2
              TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
              +
              BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/2
              GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
              +
              A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>
              At first glance, I see nothing amiss with the FASTQ representation. Interestingly, the quality of the forward reads trails off much more quickly than that of the reverse reads.
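The observation about quality trailing off can be checked numerically. In Sanger FASTQ each quality character encodes a Phred score as its ASCII code minus 33, so `%` is Q4 and `B` is Q33. The sketch below decodes a quality string and counts how many bases at the 3' end fall below a threshold; the helper names and the threshold of 5 are illustrative choices, and the two strings are short made-up examples in the style of the records above, not the real reads.

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores Phred + 33 as an ASCII character


def sanger_to_phred(quality):
    """Decode a Sanger FASTQ quality string into a list of Phred scores."""
    return [ord(char) - SANGER_OFFSET for char in quality]


def trailing_low_quality(quality, threshold=5):
    """Count bases at the 3' end whose Phred score is below `threshold`."""
    count = 0
    for score in reversed(sanger_to_phred(quality)):
        if score >= threshold:
            break
        count += 1
    return count


# Illustrative, shortened quality strings (not copied from the real files).
forward_like = "BBBC7239%%%%%%%%"
reverse_like = "BCBB6A92;81?2%%%"

for label, quality in (("forward", forward_like), ("reverse", reverse_like)):
    print(label, "low-quality tail length:", trailing_low_quality(quality))
```

Applied to every record of each file, the same count would show how much earlier the forward reads degrade than the reverse reads.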



              • #8
                Thanks, maubp!

                We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other pairs of fastq files have the same size. Can I just try ill2sanger on the pair whose .fastq sizes differ?

                Thanks~



                • #9
                  Originally posted by cliff View Post
                  Thanks, maubp!

                  We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other pairs of fastq files have the same size. Can I just try ill2sanger on the pair whose .fastq sizes differ?

                  Thanks~
                  This probably won't make any difference to the file size oddity. The difference between sol2sanger and ill2sanger is how they map the quality scores.

                  If your data is from Illumina 1.3 or later, use ill2sanger.

                  If your data is from Solexa 1.0 up to Illumina 1.2, use sol2sanger.
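The difference between the two mappings can be sketched in a few lines. This is an illustration of the standard encodings, not MAQ's source: Illumina 1.3+ stores a Phred score plus 64, so converting to Sanger (Phred plus 33) is a fixed shift, while Solexa/early Illumina stores a Solexa score plus 64, which first has to be converted to Phred with Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The function names are hypothetical.

```python
import math

SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64  # used by both Solexa and Illumina 1.3+ encodings


def ill_to_sanger(char):
    """Illumina 1.3+ character (Phred + 64) to Sanger character (Phred + 33)."""
    phred = ord(char) - ILLUMINA_OFFSET
    return chr(phred + SANGER_OFFSET)


def sol_to_sanger(char):
    """Solexa character (Solexa score + 64) to Sanger character (Phred + 33)."""
    solexa = ord(char) - ILLUMINA_OFFSET
    phred = 10 * math.log10(10 ** (solexa / 10.0) + 1)
    return chr(int(round(phred)) + SANGER_OFFSET)


for char in "Bh":
    print(repr(char), "sol2sanger ->", repr(sol_to_sanger(char)),
          "| ill2sanger ->", repr(ill_to_sanger(char)))
```

The two mappings agree at high scores and differ only at the low end: for example, `B` becomes `%` under the Solexa mapping but `#` under the Illumina 1.3+ mapping, which is consistent with the runs of `%` in the sol2sanger output posted earlier in the thread. Neither mapping changes the length of a quality line, which is why switching tools would not be expected to change the file sizes.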



                  • #10
                    maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

                    Do you know how to install and use this maq-ill2sanger.patch?

                    I'm sorry, I don't have a CS background.
                    Last edited by cliff; 12-11-2009, 11:55 AM.



                    • #11
                      Originally posted by cliff View Post
                      maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

                      Do you know how to install and use this maq-ill2sanger.patch?

                      I'm sorry, I don't have a CS background.
                      There was a discussion on this here:
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                      Basically (and this isn't going to be detailed enough), grab the MAQ source code, use the patch command to make this change, compile MAQ, install MAQ. If you didn't install MAQ in the first place, this might be tricky.

                      --

                      Alternatively, there are non-MAQ options for converting the FASTQ files.

                      If you like Perl, there are plenty of scripts to do this in Perl (some using BioPerl) - search the forum.

                      You could also use the seqret tool from EMBOSS 6.1.0 patch 1 or later.

                      Other options include installing Biopython 1.52 or later, and using a tiny Python script like http://www.biopython.org/wiki/Reading_from_unix_pipes or like this:
                      Code:
                      from Bio import SeqIO

                      # Convert Illumina 1.3+ FASTQ (Phred+64) to Sanger FASTQ (Phred+33).
                      # SeqIO.convert returns the number of records it wrote.
                      count = SeqIO.convert("s_1_1_sequence.txt", "fastq-illumina", "s_1_1_sequence.fastq", "fastq-sanger")
                      print("Converted %i forward reads" % count)
                      count = SeqIO.convert("s_1_2_sequence.txt", "fastq-illumina", "s_1_2_sequence.fastq", "fastq-sanger")
                      print("Converted %i reverse reads" % count)
                      Last edited by maubp; 12-09-2009, 07:13 AM. Reason: Clarity; adding link



                      • #12
                        I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
                        "#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
                        #1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
                        #2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
                        #3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
                        at fastq2bfq.c:137
                        #4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
                        #5 0x00000000004019b9 in _start ()"

                        Any suggestions in solving the problem(s) would be greatly appreciated.

                        Thanks,
                        Harold



                        • #13
                          Originally posted by HESmith View Post
                          I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
                          "#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
                          #1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
                          #2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
                          #3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
                          at fastq2bfq.c:137
                          #4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
                          #5 0x00000000004019b9 in _start ()"

                          Any suggestions in solving the problem(s) would be greatly appreciated.

                          Thanks,
                          Harold
                          Interesting... can you tell me your system configuration (hardware/software)? Also, can you test whether sol2sanger works? ill2sanger is nothing but a variant of sol2sanger, so a segfault should be raised in that case too.



                          • #14
                            As dawe suggested, retry sol2sanger on your newly compiled MAQ to see if that crashes.

                            It would also be worth re-downloading the FASTQ files (from your service provider, collaborator, or wherever you got them from) just in case there was a corruption on transfer. That could explain the file size oddity. It's a long shot though.
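Before re-downloading several gigabytes, a checksum comparison can show whether the local copy matches the source. The sketch below computes an MD5 digest of a file in chunks; it assumes whoever supplied the data can provide (or compute) a checksum of their copy, which the thread does not confirm. `stream_md5` is a hypothetical helper built on Python's standard `hashlib`.

```python
import hashlib
import io


def stream_md5(handle, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Small in-memory demonstration; for a real file, open it with mode "rb".
print(stream_md5(io.BytesIO(b"@seqID\nCGAT\n+\nBBBB\n")))
```

For the files in this thread, usage would look like `with open("s_1_2_sequence.txt", "rb") as handle: print(stream_md5(handle))`, with the result compared against the digest computed on the provider's side.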



                            • #15
                              Hi, maubp

                              I have tried ill2sanger, but still got the same problem.

                              The original txt files for Read 1 and Read 2 of the same lane are the same size, as shown below:

                              4116883072 read1.txt
                              4116883072 read2.txt

                              But, after ill2sanger, the two reads have different sizes:

                              3644668984 read1.fastq
                              3644660878 read2.fastq

                              This problem is exactly the same as what I saw after sol2sanger. And all the other lanes are fine except this one.

                              Do you have thoughts on this?

                              Thanks
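The two converted files reported above differ by 3644668984 - 3644660878 = 8106 bytes. One way to locate such a discrepancy is to walk both files in lockstep and stop at the first record where the mates disagree. This is a sketch only: it assumes four-line records, read names ending in `/1` and `/2`, and (by default) equal read lengths for both mates, as in the records posted earlier. All function names are hypothetical.

```python
from itertools import zip_longest


def read_records(lines):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        block = [next(iterator, None) for _ in range(4)]
        if block[0] is None:
            return
        if None in block:
            raise ValueError("input ends in the middle of a record")
        header, sequence, _plus, quality = (text.rstrip("\r\n") for text in block)
        yield header[1:], sequence, quality


def mate_key(name):
    """Read name without a trailing /1 or /2 mate suffix."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def first_pair_mismatch(forward_lines, reverse_lines, expect_equal_lengths=True):
    """Return (record_number, reason) for the first disagreement, or None."""
    pairs = zip_longest(read_records(forward_lines), read_records(reverse_lines))
    for number, (fwd, rev) in enumerate(pairs, start=1):
        if fwd is None or rev is None:
            return number, "one file has more records than the other"
        if mate_key(fwd[0]) != mate_key(rev[0]):
            return number, "read names differ: %s vs %s" % (fwd[0], rev[0])
        for name, sequence, quality in (fwd, rev):
            if len(sequence) != len(quality):
                return number, "sequence/quality length mismatch in %s" % name
        if expect_equal_lengths and len(fwd[1]) != len(rev[1]):
            return number, "mates have different read lengths"
    return None
```

Both arguments can be open text-mode file handles, for example `first_pair_mismatch(open("read1.fastq"), open("read2.fastq"))`. Running the same check on the original read1.txt and read2.txt would show whether the discrepancy is already present before conversion.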
