  • Removing short reads from paired-end fastqs

    Trimming adapters from two paired read files (with, say, cutadapt) often trims the two members of a given pair by different amounts. If you then remove short reads from each read file independently, the files fall out of sync as soon as one member of a pair is removed but its mate is kept.
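To make the failure mode concrete, here is a minimal sketch of a check for whether two fastq files are still in sync. The function name check_pair_sync is hypothetical (it is not part of the script in this post), and it assumes 4-line fastq records whose mates share the same read ID before the first space or tab, optionally followed by /1 or /2.

```shell
#!/bin/bash
# check_pair_sync is a hypothetical helper, not part of the original post.
# Assumes 4-line fastq records and that mates share the same read ID
# (text before the first space or tab, ignoring a trailing /1 or /2).
check_pair_sync() {
    local r1=$1 r2=$2
    awk -v r2file="$r2" '
        # Reduce a header line to its read ID.
        function read_id(h) { sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h); return h }
        {
            # Read the line of R2 that corresponds to the current line of R1.
            if ((getline other < r2file) <= 0) {
                print "R2 ends before R1 (at line " NR ")"; bad = 1; exit
            }
            # Header lines (1st of every 4) must carry the same read ID.
            if (NR % 4 == 1 && read_id($0) != read_id(other)) {
                print "out of sync at line " NR ": " $0 " vs " other; bad = 1; exit
            }
        }
        END {
            # Any line left in R2 means R2 is longer than R1.
            if (!bad && (getline other < r2file) > 0) { print "R2 is longer than R1"; bad = 1 }
            exit bad + 0
        }
    ' "$r1"
}
```

Calling `check_pair_sync R1.fastq R2.fastq` returns a non-zero exit status and prints the first mismatch if the two files have drifted apart, for example after each was length-filtered on its own.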

    The following script, "nixshorts_PE", removes a read pair from two paired-end fastq files when at least one member of the pair is below a given length. The same method can be used to remove short reads from a single-end file, with some adjustments. Just thought some of you might find this handy.

    Please post improvements to this script if you think of them. Thanks!

    #!/bin/bash

    # This removes reads below a certain length from paired read files in fastq format (e.g., R1 and R2 from the same library)

    # Usage: $ bash nixshorts_PE [input fastqR1] [input fastqR2] [minimum read length to keep]

    # PROCESS:

    #1. Start with inputs
    R1fq=$1
    R2fq=$2
    minlen=$3

    #2. Find all entries with read length less than minimum length and print the line numbers of each such 4-line record, for both R1 and R2 (both lists go into the same file)
    awk -v min="$minlen" 'NR%4==2 && length($0)<min {print (NR-1)"\n"NR"\n"(NR+1)"\n"(NR+2)}' "$R1fq" > temp.lines1
    awk -v min="$minlen" 'NR%4==2 && length($0)<min {print (NR-1)"\n"NR"\n"(NR+1)"\n"(NR+2)}' "$R2fq" >> temp.lines1

    #3. Sort the combined line numbers numerically and collapse redundant entries
    sort -n temp.lines1 | uniq > temp.lines
    rm temp.lines1

    #4. Remove the line numbers recorded in "temp.lines" from both fastqs
    # FILENAME==ARGV[1] is used instead of NR==FNR so that an empty temp.lines (no short reads found) does not cause the fastq itself to be read as the line list, which would leave the output empty
    awk 'FILENAME==ARGV[1]{l[$0];next} !(FNR in l)' temp.lines "$R1fq" > "$R1fq.$minlen"
    awk 'FILENAME==ARGV[1]{l[$0];next} !(FNR in l)' temp.lines "$R2fq" > "$R2fq.$minlen"
    rm temp.lines

    #5. Conclude
    echo "Pairs shorter than $minlen bases removed from $R1fq and $R2fq"
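For the single-end case mentioned above, a minimal sketch follows. The name nixshorts_SE is hypothetical; rather than collecting line numbers as nixshorts_PE does, it filters in a single awk pass, assuming 4-line fastq records, and writes the kept reads to standard output.

```shell
#!/bin/bash
# nixshorts_SE is a hypothetical single-end counterpart, not from the original post.
# Keeps a 4-line fastq record only if its sequence has at least minlen bases,
# matching the "remove reads shorter than minlen" rule of nixshorts_PE.
nixshorts_SE() {
    local fq=$1 minlen=$2
    awk -v min="$minlen" '
        NR % 4 == 1 { header = $0; next }   # line 1 of the record: read ID
        NR % 4 == 2 { seq = $0; next }      # line 2: sequence
        NR % 4 == 3 { plus = $0; next }     # line 3: separator
        # Only line 4 (qualities) reaches this rule; print the record if long enough.
        length(seq) >= min { print header "\n" seq "\n" plus "\n" $0 }
    ' "$fq"
}
```

For example, `nixshorts_SE reads.fastq 20 > reads.fastq.20` would follow the same output naming as the paired-end script.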
