  • Advice on speeding up grep for ddRAD preprocessing

    Hi,
    With my rudimentary Linux skills I wanted to create a script that pre-processes my ddRAD data by extracting only the paired-end sequences that have the correct restriction sites in the correct positions (as STACKS only looks for one).

    The script I came up with, shown below, does work on my test data set. However, when I run it on my large fastq files, the process "grep -A 3 -E -f coordinates1.txt R2.fastq" peaks at ~40 GB of RAM, and I killed it after about 30 minutes. Is there something inherently inefficient about the way grep handles a file of patterns as search terms? Bear in mind there are millions of search terms and millions of lines to search.
    I wonder, would there be a more efficient way of doing this using a loop in Perl?
    Any advice would be appreciated.

    Cheers

    ##################################
    ## Gives cluster coordinates of sequences that have the CATGC restriction site (RS) directly after the first 5 bases (INDEX/BARCODE)

    grep '@M00' -A 1 --no-group-separator R1.fastq | sed 's/^.....//' | grep '\<CATGC' -B 1 | grep 14:3 | cut -c 29-38 > coordinates1.txt


    ### Gives coordinates of PE sequences that contain CATGC in the correct position of R1 and the AATTC RS in the first 5 bases of R2

    grep -A 3 --no-group-separator -E -f coordinates1.txt R2.fastq | grep '\<AATTC' -A 2 -B 1 | grep M00 | cut -c 34-44 > coordinates2.txt

    ### De-multiplexes the raw fastq and pulls out R1 and R2 with the correct RS
    ### (--no-group-separator keeps grep's "--" separator lines out of the output fastq)
    grep -A 3 --no-group-separator -E -f coordinates2.txt R1.fastq > R1_correct_RS.fastq
    grep -A 3 --no-group-separator -E -f coordinates2.txt R2.fastq > R2_correct_RS.fastq


    echo "Number of RS R1:"
    grep -c '\<CATGC' R1_correct_RS.fastq
    echo "Number of RS R2:"
    grep -c '\<AATTC' R2_correct_RS.fastq
    echo "These numbers should match!"

    rm coordinates1.txt
    rm coordinates2.txt

  • #2

    Hi Jackie,

    I don't know much about ddRAD, so I'm not sure how many different restriction sites you're going to be searching for.

    I would take advantage of the fact that Illumina paired-end files are usually both in the same order, and just read through each file (R1.fastq and R2.fastq) once, one line at a time, using a Perl loop (obviously adding the necessary checks so that the script exits with an error message if the two files get out of sync).

    I would put the 4 lines of each fastq record into an array, one for each read of the pair, and do the necessary comparisons using regular expressions. If both reads match the restriction sites you want, print the 4 lines to output files, one for each read of the pair; either way, empty the arrays and start over with the next record, until you've read through the entire files.

    I don't know whether this would run faster, but it shouldn't use up so much memory.
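
    A rough, untested sketch of that idea (the site regexes, barcode length, and file names are assumptions based on your description, so adjust them to your data):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stream R1 and R2 in lockstep, keeping only pairs where R1 has CATGC
    # right after the 5-base barcode and R2 starts with AATTC.
    open(my $r1, '<', 'R1.fastq')            or die "R1.fastq: $!";
    open(my $r2, '<', 'R2.fastq')            or die "R2.fastq: $!";
    open(my $o1, '>', 'R1_correct_RS.fastq') or die $!;
    open(my $o2, '>', 'R2_correct_RS.fastq') or die $!;

    while (1) {
        my @rec1 = map { scalar <$r1> } 1 .. 4;   # header, seq, +, qual
        my @rec2 = map { scalar <$r2> } 1 .. 4;
        last if !defined $rec1[0] && !defined $rec2[0];   # both files finished
        die "R1/R2 out of sync or truncated\n" if grep { !defined } (@rec1, @rec2);

        # Sanity check: paired headers should match up to the first space
        my ($id1) = $rec1[0] =~ /^(\S+)/;
        my ($id2) = $rec2[0] =~ /^(\S+)/;
        die "Header mismatch: $id1 vs $id2\n" if $id1 ne $id2;

        if ($rec1[1] =~ /^.{5}CATGC/ and $rec2[1] =~ /^AATTC/) {
            print $o1 @rec1;
            print $o2 @rec2;
        }
    }

    Because it never holds more than one record per file in memory, RAM use stays constant no matter how big the fastq files are.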

    maria



    • #3
      Ok that sounds more like it...emptying the array after each search. I'll look into it. Thanks.



      • #4
        grep shouldn't be creating an array for the data you're searching through: it looks at each line, evaluates it, and then discards it, so your fastq file can be any size and RAM shouldn't be consumed. I'm not sure off the top of my head, but I believe grep caches the search terms in memory to speed up the search process, so excessively large search-term files might flood the memory and cause the system to start churning. grep should typically be faster than an equivalent Perl app, though with Perl you should be able to construct a search algorithm that doesn't cache the search terms; how fast that would be is another matter.
        In short, rather than breaking up your fastq files, it might be worth looking at breaking up your search terms into separate smaller files, as sketched below. Once you dip below the amount of RAM in the system and stop hitting swap space, you should see a dramatic speed increase. Mind you, this all assumes you have a HUGE number of search terms and relatively little RAM.
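
        Something like this (untested; the chunk size and output name are just for illustration, tune the chunk size to your RAM):

        # Split the pattern file into 500k-line chunks, grep one chunk at a time,
        # and concatenate the hits.
        split -l 500000 coordinates1.txt chunk_
        for f in chunk_*; do
            grep -A 3 --no-group-separator -E -f "$f" R2.fastq
        done > R2_matches.fastq
        rm chunk_*

        Each pass holds only one chunk of patterns in memory, and since each read's header carries a unique coordinate, no record should be emitted twice.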



        • #5
          Originally posted by mdobeson View Post
          grep shouldn't be creating an array for the data you're searching through: it looks at each line, evaluates it, and then discards it, so your fastq file can be any size and RAM shouldn't be consumed. I'm not sure off the top of my head, but I believe grep caches the search terms in memory to speed up the search process, so excessively large search-term files might flood the memory and cause the system to start churning. grep should typically be faster than an equivalent Perl app, though with Perl you should be able to construct a search algorithm that doesn't cache the search terms; how fast that would be is another matter.
          In short, rather than breaking up your fastq files, it might be worth looking at breaking up your search terms into separate smaller files. Once you dip below the amount of RAM in the system and stop hitting swap space, you should see a dramatic speed increase. Mind you, this all assumes you have a HUGE number of search terms and relatively little RAM.

          OK, thanks for the input. Yup, there are around 8 million search terms. I should just let it run for a few hours; I can live with it taking less than 24 hours.

          I also tried the splitting approach, but got discouraged for the same reasons. I'll be patient and see what comes of it.

          Thanks again



          • #6
            If it's hitting swap space and filling all the RAM, it could take a lot more than 24 hours...



            • #7
              I'm not sure, but you might try "fgrep" instead of "grep".
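
              Since your coordinate patterns are fixed strings rather than regular expressions, fgrep (i.e. grep -F) can skip the regex machinery entirely, which is usually faster and lighter on memory for large pattern files. For example (untested):

              fgrep -A 3 --no-group-separator -f coordinates2.txt R1.fastq > R1_correct_RS.fastq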
