Hello, I was wondering if anyone could help me?
I've been trying to adapter-trim and merge my dataset using SeqPrep, but when I plot the read lengths after merging, most of the reads between 40 and 50 bp are missing. I can't work out why, or whether I'm doing something wrong!
The read length plots resemble this:
(read length plot; image not available)
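Since the plot itself is not shown, here is a minimal sketch of how such a read length distribution could be tabulated from the merged FASTQ. It assumes strict 4-line FASTQ records and an uncompressed file; the function names are illustrative and not part of SeqPrep.

```python
from collections import Counter
from typing import Iterable


def length_histogram(fastq_lines: Iterable[str]) -> Counter:
    """Count reads per length, assuming strict 4-line FASTQ records."""
    hist = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            hist[len(line.strip())] += 1
    return hist


def print_histogram(hist: Counter, width: int = 50) -> None:
    """Print one text bar per read length, scaled to the largest bin."""
    if not hist:
        return
    peak = max(hist.values())
    for length in range(min(hist), max(hist) + 1):
        count = hist.get(length, 0)
        bar = "#" * round(width * count / peak)
        print(f"{length:4d} {count:8d} {bar}")


# Usage (file name taken from the -s option of the SeqPrep command):
# with open("L120_NeutCap_2.qual.merged.fastq") as handle:
#     print_histogram(length_histogram(handle))
```

Printing every length between the minimum and maximum, including empty bins, makes a gap such as the one between 40 and 50 bp visible as a run of short or missing bars.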
I'm running SeqPrep as follows:
SeqPrep -f L120_1.qual.fastq -r L120_2_.qual.fastq -1 L120-R1.qual.unmerged.fastq -2 L120-R2.qual.unmerged.fastq -3 L120_NeutCap_2-R1.qual.discarded.fastq -4 L120_NeutCap_2-R2.qual.discarded.fastq -L 30 -q 15 -A AGATCGGAAGAGCACACGTC -B GGAAGAGCGTCGTGTAGGGA -s L120_NeutCap_2.qual.merged.fastq -E L120_NeutCap_2.qual.readable_alignment.txt -o 10
You'll notice that the first adapter is the standard Illumina one, while the second is a modified one, missing the first 5 bp. You can see both adapters in the reads below if you grep for the sequences…
Read 1, quality trimmed, for L120_2 above:
@HISEQ:268:C8TMGANXX:2:1101:1430:1965 1:N:0:NTCGTCGGNCGCAACG
CAGGCACTCCCTGGAAACTCTAAGGGGCAGTTCTACTCTAGATCGGAAGA
+
A@B0BGGGGGGGCFGGGGGGGGGGGEGGGGGGGGGGCGG@1E@FGD/CEF
@HISEQ:268:C8TMGANXX:2:1101:1457:1992 1:N:0:TTCGTCGGNCGCAACG
CTAGACCGCGAATACACACAAGATCGGAAGAGCACACGTCTGAACTCCAG
+
33<<BGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGBGGGGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1684:1955 1:N:0:TTCGTCGGCCGCAACG
NTGATATGTCCGGAGTGCATCGTATGGCGCTTTCAATGAATTTGAGATCG
+
#3<<@EGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1619:1977 1:N:0:TTCGTCGGCCGCAACG
CGGTGCCATCGAGCCTGTTCTGTCTCATAGTGACCCTAGATCGGAAGAGC
+
33@>@GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1574:1983 1:N:0:TTCGTCGGCCGCAACG
CCATCCTAGTGGGGGGAAATAGATCGGAAGAGCACACGTCTGAACTCCAA
+
<330<E1EFFCGGGGGFGECDGEGGFGBDCDDGEGGGGCD0DDCDG=EBC
Read 2, quality trimmed, for L120_2 above.
@HISEQ:268:C8TMGANXX:2:1101:1430:1965 2:N:0:NTCGTCGGNCGCAACG
AGAGTAGAACTGCCCCNNNNAGTTTCCAGGGAGTGCCTGGGAAGAGCGTC
+
BB@BBGGDFGGGGGGG####==EFGDFFGGGGGGGGGGGGEGGGGGGGGF
@HISEQ:268:C8TMGANXX:2:1101:1457:1992 2:N:0:TTCGTCGGNCGCAACG
TGTGTGTATTCGCGGTCTATGGAAGAGCGTCGTGTAGGGAAAGAGTGTCG
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1684:1955 2:N:0:TTCGTCGGCCGCAACG
CAAATTCATTGAAAGNNNNNTACGATGCACTCCGGACATATCATGGAAGA
+
CCCCCGGGGGGGGGG#####@=EFGGGGGGGGGGGGGGGGGGGGGGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1619:1977 2:N:0:TTCGTCGGCCGCAACG
AGGGTCACTATGAGACAGAACAGGCTCGATGGCACCTGGAAGAGCGTCGT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@HISEQ:268:C8TMGANXX:2:1101:1574:1983 2:N:0:TTCGTCGGCCGCAACG
ATTTCCCCCCACTAGGATGTGGAAGAGCGTCGTGTAGGGAAAGAGTGTCG
+
BCCCCGGGGGDGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGFG
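The grep check described above can be made a little more informative by recording where the adapter starts in each read, since that position is the insert length implied by the adapter. Below is a rough sketch that uses exact string matching only; it is not how SeqPrep itself detects adapters, and `adapter_start` and its `min_overlap` parameter are hypothetical names chosen for illustration.

```python
def adapter_start(read: str, adapter: str, min_overlap: int = 5) -> int:
    """Return the 0-based position where the adapter begins, or -1.

    A full-length exact match is tried first. If that fails, the 3' end
    of the read is checked for a truncated adapter prefix of at least
    `min_overlap` bases. Mismatches and indels are not tolerated.
    """
    pos = read.find(adapter)
    if pos != -1:
        return pos
    first = max(0, len(read) - len(adapter) + 1)
    last = len(read) - min_overlap
    for start in range(first, last + 1):
        if adapter.startswith(read[start:]):
            return start
    return -1


# Adapter sequences as passed to -A and -B in the SeqPrep command.
ADAPTER_A = "AGATCGGAAGAGCACACGTC"
ADAPTER_B = "GGAAGAGCGTCGTGTAGGGA"

# Read pair 1457:1992 from the listing above.
read1 = "CTAGACCGCGAATACACACAAGATCGGAAGAGCACACGTCTGAACTCCAG"
read2 = "TGTGTGTATTCGCGGTCTATGGAAGAGCGTCGTGTAGGGAAAGAGTGTCG"

print(adapter_start(read1, ADAPTER_A), adapter_start(read2, ADAPTER_B))
```

For this pair both mates report position 20, and for pair 1430:1965 both report 39 (a truncated adapter at the 3' end). Tabulating these positions across all reads and comparing them with the merged read lengths would show whether pairs with inserts of 40 to 50 bp are present in the input but lost during merging.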
The only time I've seen such a dip is when I got the adapter sequences wrong in the SeqPrep command. When I corrected them it went away. But I think the adapter sequences are correct, so I can't explain why there's a dip in the read length frequency. Is this a quirk of SeqPrep? Can anyone offer any explanation?
I'd be very grateful for any help!
Many thanks.