
  • #76
    Originally posted by GenoMax View Post
    Tried the following.

    Code:
    $ fastq-dump -F --split-files ./SRR1561197.sra
    @safina: Not sure why you are changing the fastq headers

    Code:
    $ fastq_quality_filter -i SRR1561197_1.fastq -q 28 -p 100 -Q33 -o SRR1561197_1_filt.fastq
    @safina: Note the -Q33 option. This data is almost certainly in Sanger FASTQ format, so you need to add this option (it remains undocumented in fastx_toolkit). I used the latest fastx_toolkit.
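A quick way to sanity-check the -Q33 point is to look at the lowest ASCII code in the quality lines of the raw (unfiltered) file. This is only a sketch: `min_qual_ascii` and `guess_offset` are hypothetical helper names, and it assumes an uncompressed FASTQ file with exactly four lines per record. ASCII codes below 59 cannot occur with a +64 offset, so they point to Phred+33 (Sanger); anything else is left as ambiguous rather than guessed.

```shell
# Hypothetical helper: print the smallest ASCII code found in the quality
# lines (every 4th line) of the first 10000 records of a FASTQ file.
min_qual_ascii() {
    awk 'NR % 4 == 0' "$1" | head -n 10000 | tr -d '\n' |
        od -An -v -tu1 | tr -s ' ' '\n' | grep -v '^$' | sort -n | head -n 1
}

# Hypothetical helper: report 33 when a quality character below ASCII 59
# is present, otherwise report that the offset cannot be decided this way.
guess_offset() {
    min=$(min_qual_ascii "$1")
    [ -n "$min" ] || return 1          # no quality data found
    if [ "$min" -lt 59 ]; then
        echo 33
    else
        echo ambiguous
    fi
}
```

For example, `guess_offset SRR1561197_1.fastq` on the raw file. Running it on a file that has already been filtered with `-q 28 -p 100` would likely print `ambiguous`, because no low-quality characters remain.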

    I chose to use repair.sh from BBMap and did

    Code:
    $ repair.sh in1=SRR1561197_1_filt.fastq in2=SRR1561197_2_filt.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
    Here is where things fell apart.

    @Brian: I get the following error about a Gb into the filtered files.



    Multiple possibilities:

    1. Original sra file from SRA is corrupt
    2. fastx_toolkit is messing up the files in the filter process
    3. Not sure why repair.sh is asking to run itself

    Should try BBDuk to see if that works instead of fastq_filter.
    Originally posted by SES View Post
    I tried the whole process using the commands above and did not find any issues. Here is the script: seqanswers163784.sh (link to a gist, not a direct link). You can fetch that script and run it on your own machine. Here is the output:

    Code:
    ========= pairfq version : 0.14.1 (completion time: Wed Apr  8 12:14:41 EDT 2015)
    Total forward reads (SRR1561197_1_filt_info.fastq)                   :    8492638
    Total reverse reads (SRR1561197_2_filt_info.fastq)                   :   13525478
    Total forward paired reads (SRR1561197_1_filt_info_p.fastq)          :    7105003
    Total reverse paired reads (SRR1561197_2_filt_info_p.fastq)          :    7105003
    Total forward unpaired reads (SRR1561197_1_filt_info_s.fastq)        :    1387635
    Total reverse unpaired reads (SRR1561197_2_filt_info_s.fastq)        :    6420475
    
    Total paired reads                                                   :   14210006
    Total unpaired reads                                                 :    7808110
    
    real	21m14.372s
    user	9m54.612s
    sys	0m19.421s
    This used 5.5g of RAM on my machine, so you should be fine to use it without the --index option. For reference, the only issue was the missing pair information, which was one of my earlier suggestions in this thread, but it appears that modifying the headers and perhaps some other operations messed up the files for @safina. For the commands in the script, you can replace "pairfq" with

    Code:
    curl -sL git.io/pairfq_lite | perl -
    and you'll never need to download any package or update it.
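The numbers in the pairfq report above can be cross-checked with plain shell arithmetic: each input read should end up either paired or unpaired. This is a small sketch using the counts copied from that report; `check_counts` and the variable names are hypothetical.

```shell
# Counts copied from the pairfq --stats report above.
fwd_total=8492638
rev_total=13525478
paired_each=7105003      # forward paired == reverse paired
fwd_single=1387635
rev_single=6420475

# Hypothetical check: paired + unpaired must equal the input total per file.
check_counts() {
    [ $((paired_each + fwd_single)) -eq "$fwd_total" ] &&
    [ $((paired_each + rev_single)) -eq "$rev_total" ]
}

check_counts && echo "counts are consistent"
echo "total paired:   $((2 * paired_each))"            # 14210006
echo "total unpaired: $((fwd_single + rev_single))"    # 7808110
```

The same check applied to your own run is a fast way to spot a step that silently dropped or duplicated reads.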

    EDIT: Just my 2c, but I think fastx still has a place. It is stable, no need to update frequently, and is probably on most workstations. Also, it works very well in a Unix environment because of the single binaries that use one CPU, which allows you to use it on a cluster.
    The headers were not what messed up the files. I still have a problem: if I trim my FASTQ files with the commands below, I get empty files.

    After filtering with these commands:

    ## quality filter
    fastq_quality_filter -i SRR1561197_1.fastq -q 28 -p 100 -Q33 -o SRR1561197_1_filt.fastq
    fastq_quality_filter -i SRR1561197_2.fastq -q 28 -p 100 -Q33 -o SRR1561197_2_filt.fastq


    I then trimmed the reads:

    fastx_trimmer -i SRR1561197_1_filt.fastq -l 100 -f 14 -o SRR1561197_1_filt_trim.fastq
    fastx_trimmer -i SRR1561197_2_filt.fastq -l 100 -f 14 -o SRR1561197_2_filt_trim.fastq


    ## add pair info to reads and remove comment to reduce size
    pairfq addinfo -i SRR1561197_1_filt_trim.fastq -o SRR1561197_1_filt_trim_info.fastq -p 1
    pairfq addinfo -i SRR1561197_1_filt_trim.fastq -o SRR1561197_2_filt_trim_info.fastq -p 2

    ## pair the reads
    time pairfq makepairs -f SRR1561197_1_filt_trim_info.fastq \
    -r SRR1561197_2_filt_trim_info.fastq \
    -fp SRR1561197_1__p.fastq \
    -rp SRR1561197_2__p.fastq \
    -fs SRR1561197_1_s.fastq \
    -rs SRR1561197_2_s.fastq \
    --stats

    I still get all reads in these two files:
    -fs SRR1561197_1_s.fastq
    -rs SRR1561197_2_s.fastq
    Last edited by safina; 04-10-2015, 11:21 PM.
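When every read ends up in the unpaired files, as described above, it can help to inspect the two inputs before the pairing step. The sketch below is not part of pairfq; `count_reads`, `read_ids`, and `shared_ids` are hypothetical helpers, and they assume uncompressed four-line FASTQ records whose IDs may carry a trailing /1 or /2.

```shell
# Hypothetical helper: number of records in a 4-line-per-record FASTQ file.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: print read IDs from the header lines, dropping the
# leading "@", anything after the first space, and a trailing /1 or /2.
read_ids() {
    awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Hypothetical helper: number of read IDs present in both files.
shared_ids() {
    fwd_ids=$(mktemp)
    rev_ids=$(mktemp)
    read_ids "$1" | LC_ALL=C sort -u > "$fwd_ids"
    read_ids "$2" | LC_ALL=C sort -u > "$rev_ids"
    LC_ALL=C comm -12 "$fwd_ids" "$rev_ids" | wc -l | tr -d ' '
    rm -f "$fwd_ids" "$rev_ids"
}
```

For example, `count_reads SRR1561197_1_filt_trim.fastq` shows whether the trimming step produced any reads at all, and `shared_ids SRR1561197_1_filt_trim_info.fastq SRR1561197_2_filt_trim_info.fastq` shows how many IDs the two inputs have in common. A result of 0 from either one means the pairing step has nothing it can match.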



    • #77
      Originally posted by safina View Post
      Thanks for this, but I have a question.
      Why haven't you used the trim command? I need to trim the SRR1561197 reads from the start as well as from the end. After trimming, I get an error in pairfq and it gives me empty files.

      I used the fastx toolkit for trimming as well:

      Code:
      fastx_trimmer -f 14 -l 100 -o SRR1561197_1_filt_trim.fastq
      And when I run pairfq after this, I get empty files and all reads in the unpaired file.
      Fastx_trimmer (as above) followed by repair.sh works fine.

      @SES will need to comment on pairfq question.



      • #78
        Originally posted by GenoMax View Post
        Fastx_trimmer (as above) followed by repair.sh works fine.

        @SES will need to comment on pairfq question.
        That was my bad, one of the files was named incorrectly in the script. I fixed that typo and changed the command to use the curl method so we can rule out the installation being an issue. I ran the script and get the same result I posted above. This command will get the script:

        Code:
        curl -L https://gist.githubusercontent.com/sestaton/09781a5ac8849753d6ed/raw/af767ad46961c19438b9fe95e14ba87270337f6f/seqanswers163784.sh > seqanswers163784.sh
        Then edit the paths to the fastx trimmer and fastq-dump, if necessary, and run the script:

        Code:
        nohup bash seqanswers163784.sh 2>&1 > seqanswers163784.out &
        Or send it to the queuing system; it doesn't matter. That should work on anyone's machine that has those programs installed (pairfq need not be installed). The first few steps take quite a while, but the pairing step should take 10-12 min., depending on the machine.
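Before launching the script, it may be worth confirming that the programs it relies on can be found. The sketch below assumes the tools are on your PATH (the script itself may use explicit paths that you edit instead), and `missing_tools` is a hypothetical helper name.

```shell
# Hypothetical helper: print each named program that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools fastq-dump fastq_quality_filter fastx_trimmer curl perl` prints nothing when all five are available and otherwise lists the ones that are not.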
