  • dejavu2010
    Member
    • Jan 2012
    • 21

    Program to make paired-end files have an equal number of sequences

    I have a PE100 run. The problem is that some tiles of read 2 are corrupted, which left an uneven number of reads in read 1 and read 2. Are there any programs that can match up sequences from both reads and keep only the matched ones for downstream analysis? Thanks

    mike
  • mgg
    Member
    • Nov 2011
    • 12

    #2
    re-pairing PE files

    I remember a nice contribution from kmcarr a while back which can probably help; search for thread 10392. (incidentally I can't recommend my own contribution to that thread - it is hideously slow)

    best

    m
    Last edited by mgg; 02-27-2012, 03:55 AM. Reason: typo correction

    • kmcarr
      Senior Member
      • May 2008
      • 1181

      #3
      Originally posted by mgg
      I remember a nice contribution from kmcarr a while back which can probably help; search for thread 10392. (incidentally I can't recommend my own contribution to that thread - it is hideously slow)

      best

      m
      m,

      Thanks for the acknowledgement. Here is a link to the thread. If you go there you'll see that I just posted an update. Due to a limitation in cdbfasta my method will not work for large input fastq files. The only work-around at the moment is to split the input up into smaller chunks.

      • Hobbe
        Member
        • Apr 2010
        • 29

        #4
        The trimmer Trimmomatic outputs files with intact pairs as well as files with single reads. It should be able to split your files in the way you want, as well as do trimming at the same time if you wish.

        • azneto
          Member
          • Dec 2009
          • 24

          #5
          I wrote a script exactly to tackle that issue. You'll find a copy attached.
          The script will output either one interleaved mate-pair fastq file or two separate fastq files. The unpaired reads are saved in a separate file. The script uses a regular expression to identify the read ID, so let me know if you need help with that. It requires at least as much RAM as the size of the first file. Feel free to use it, and please let me know how we can improve it.
          Adhemar

          • sklages
            Senior Member
            • May 2008
            • 628

            #6
            Maybe of interest as well, PairedreadFinder:
            Usage: PairedreadFinder, Version 1.01. This tool takes two fasta/q files and looks for matching readnames in both files. [OPTION]...

            -h,  --help           displays this help message
            -v,  --version        return program version
            -s1, --source1        input file 1
            -s2, --source2        input file 2
            -f,  --format         input file format
            -t1, --target1        target file 1
            -t2, --target2        target file 2
            -n,  --nr-threads     nr of threads to use (default 1)
            -is, --suffix-ignore  nr of characters to ignore from the END of the readname (in case paired reads are named like /1 /2 it should be set to 2) (default 0)
            -ip, --prefix-ignore  nr of characters to ignore from the BEGINNING of the readname (in case paired reads are named like s_1.. s_2.. it should be set to 3) (default 0)
            from FAR, http://sourceforge.net/apps/mediawik...itle=Main_Page

            Sven
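
To illustrate what the two "ignore" options describe, here is a minimal Python sketch of trimming a fixed number of characters from each end of a read name before comparing mates. It is not PairedreadFinder's code; the function names `strip_name` and `same_pair` and the example read names are made up for illustration.

```python
def strip_name(name, prefix_ignore=0, suffix_ignore=0):
    """Drop a fixed number of characters from the start and end of a read name."""
    end = len(name) - suffix_ignore
    return name[prefix_ignore:end]


def same_pair(name1, name2, prefix_ignore=0, suffix_ignore=0):
    """True when two read names agree once the ignored characters are removed."""
    return (strip_name(name1, prefix_ignore, suffix_ignore)
            == strip_name(name2, prefix_ignore, suffix_ignore))


# Hypothetical names using the /1 and /2 convention: ignore the last 2 characters.
print(same_pair("read7/1", "read7/2", suffix_ignore=2))
# Hypothetical names with an s_1 / s_2 prefix: ignore the first 3 characters.
print(same_pair("s_1_0001", "s_2_0001", prefix_ignore=3))
```

Without the trimming, the two names in each example differ and would never be recognised as mates.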

            • dejavu2010
              Member
              • Jan 2012
              • 21

              #7
              Hi azneto, how do I set up the regular expression for a header like the following?
              @HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG.

              Thanks.

              • azneto
                Member
                • Dec 2009
                • 24

                #8
                Hi dejavu2010.

                @HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG.

                You can use: '^@(\S+)\s[1|2]\S+$'

                This assumes that 1 or 2 appears right after the space character. The pattern reads:
                '@' 'ID' 'space' '1 or 2' '...'

                I'll add this example to the script.
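
For anyone who wants to check the pattern before running the script, here is a small Python sketch that applies it to the header above (without the trailing period). It is only a test harness, not part of the Perl script. One detail worth knowing: inside a character class, `[1|2]` also accepts a literal `|`, so the `tighter` variant below uses `[12]`. That variant and the helper name `read_id` are my own assumptions.

```python
import re

# Header line from the post above, without the trailing period.
header = "@HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG"

# The pattern as suggested in the thread; [1|2] matches '1', '|' or '2'.
suggested = re.compile(r'^@(\S+)\s[1|2]\S+$')
# A slightly tighter variant (an assumption, not taken from the script).
tighter = re.compile(r'^@(\S+)\s[12]\S+$')


def read_id(line, pattern=tighter):
    """Return the captured read ID, or None if the header does not match."""
    match = pattern.match(line.rstrip("\n"))
    return match.group(1) if match else None


mate1 = read_id(header)
# Simulate the read 2 header by swapping the mate number after the space.
mate2 = read_id(header.replace(" 1:N", " 2:N"))
print(mate1)
print(mate1 == mate2)
```

Both mates should yield the same captured ID, which is what lets the two files be matched up.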

                • dejavu2010
                  Member
                  • Jan 2012
                  • 21

                  #9
                  Hi

                  my process got killed

                  perl mergeShuffledFastqSeqs.pl -f1 2044-BH-1_1_sequence.txt -f2 2044-BH-1_2_sequence.txt -r '^@(\S+)\s[1|2]\S+$' -o 2044-BH-1 -t

                  Loading the first file...Killed

                  2044-BH-1_1_sequence.txt is 18 GB and the other file is 17 GB. We have a server with 32 dual-core CPUs and 192 GB of memory. I wonder why it got killed.

                  thx

                  • epistatic
                    Senior Member
                    • Mar 2009
                    • 129

                    #10
                    Picard has a FixMateInformation tool to "Ensure that all mate-pair information is in sync between each read and it's mate pair."


                    If you are in Galaxy this is implemented under Picard as: Paired Read Mate Fixer

                    • azneto
                      Member
                      • Dec 2009
                      • 24

                      #11
                      Hi,
                      It is most probably a memory issue.
                      The script loads only the first file into memory and then matches it against the entries in the second file. You'll have to monitor the memory usage ('top' or 'free -m').
                      I just ran a test and perl used 220 GB of RAM for two 33 GB fastq files.
                      Soon I'll start looking for alternative ways to handle memory in perl in order to improve the script. I'll let you know.
                      -Adhemar
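
One possible way to reduce the memory footprint, sketched here in Python and not taken from mergeShuffledFastqSeqs.pl, is to keep only the read IDs in memory and stream the records themselves. The sketch assumes 4-line FASTQ records, unique IDs, and headers where the ID is the text between '@' and the first whitespace. The names `records`, `read_ids` and `repair` are hypothetical. The outputs keep each input's original order, so the two paired files line up only if the inputs already shared the same relative order.

```python
import re

# Assumption: the read ID is everything between '@' and the first whitespace.
ID_PATTERN = re.compile(r'^@(\S+)')


def records(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return
        match = ID_PATTERN.match(lines[0])
        if match is None:
            raise ValueError("unexpected header: %r" % lines[0])
        yield match.group(1), "".join(lines)


def read_ids(path):
    """Collect the set of read IDs in a file (IDs only, no sequences)."""
    with open(path) as handle:
        return {rid for rid, _ in records(handle)}


def repair(in1, in2, out1, out2, singles):
    """Write shared reads to out1/out2 and the rest to singles; return pair count."""
    ids1 = read_ids(in1)
    with open(in2) as handle:
        shared = {rid for rid, _ in records(handle) if rid in ids1}
    del ids1  # only the intersection is needed from here on
    with open(singles, "w") as orphans:
        for src, dst in ((in1, out1), (in2, out2)):
            with open(src) as fin, open(dst, "w") as fout:
                for rid, record in records(fin):
                    (fout if rid in shared else orphans).write(record)
    return len(shared)
```

It would be called as `repair("r1.fastq", "r2.fastq", "r1.paired.fastq", "r2.paired.fastq", "singles.fastq")`, where all five file names are placeholders. The ID set can still be large for tens of millions of reads, so this trades extra passes over the files for lower memory use instead of removing the memory cost entirely.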

                      • dejavu2010
                        Member
                        • Jan 2012
                        • 21

                        #12
                        Thanks, everybody. I got it resolved.

                        • dejavu2010
                          Member
                          • Jan 2012
                          • 21

                          #13
                          I think Trimmomatic pairs your reads based on input order, not on the lane_position_... combination. I tested one unmatched dataset and it could not handle it.
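
A quick way to see whether two files are still in sync by position, which is what an order-based tool would need, is to walk both files and compare the IDs record by record. This Python sketch is only an illustration and has nothing to do with Trimmomatic's internals. It assumes 4-line FASTQ records and IDs that end at the first whitespace; `header_ids` and `first_mismatch` are made-up names.

```python
from itertools import islice, zip_longest


def header_ids(path):
    """Yield the ID from every 4th line (the header line of each record)."""
    with open(path) as handle:
        for header in islice(handle, 0, None, 4):
            fields = header[1:].split()
            yield fields[0] if fields else ""


def first_mismatch(path1, path2):
    """Return the 1-based index of the first out-of-sync record, or None."""
    pairs = zip_longest(header_ids(path1), header_ids(path2))
    for index, (id1, id2) in enumerate(pairs, start=1):
        if id1 != id2:
            return index
    return None
```

If `first_mismatch` returns a number, every record from that position on is suspect, and the files need to be re-paired by ID before any order-based tool is run on them.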

                          • SES
                            Senior Member
                            • Mar 2010
                            • 275

                            #14
                            Originally posted by kmcarr
                            m,

                            Thanks for the acknowledgement. Here is a link to the thread. If you go there you'll see that I just posted an update. Due to a limitation in cdbfasta my method will not work for large input fastq files. The only work-around at the moment is to split the input up into smaller chunks.
                            The other work-around is to stick with fasta files (smaller index files). If you are mapping to a reference and the quality scores are important then this probably won't help, but for assembly it doesn't matter.

                            Originally posted by sklages
                            Maybe of interest as well, PairedreadFinder:
                            from FAR, http://sourceforge.net/apps/mediawik...itle=Main_Page

                            Sven
                            Has anyone actually used this program and found it to work correctly? I gave it a try but found several bugs. First, it inserted random blank lines in the individual "paired" output files. Second, the individual "paired" files actually differed by more than 800 records with my data, which means that trying to interleave the two files causes them to get all out of order. Of course, this could be due to the data or the user, but I've successfully used other methods with the same data and the usage of the program is quite simple, so I'm a bit skeptical on this one.

                            • kga1978
                              Senior Member
                              • Nov 2010
                              • 100

                              #15
                              Hi All,

                              I have exactly this problem as well, but with fasta files. Anybody know of a program that will work with Fasta or could modify 'mergeShuffledFastqSeqs.pl' so it would work on that format as well?

                              Much appreciated.
