Hi!
I am somewhat of a beginner and will soon receive data from a CRISPR screen, so I would like to get familiar with the workflow. For now, I am trying to map and normalize reads from public data.
The reads I am using are obtained from here:
The reference sgRNA sequences corresponding to this study (as far as I can tell) (GeCKOv1 library) were obtained here:
I have made a FASTA file from the GeCKOv1.txt file and indexed it with bowtie 1.0.0 in the following way:
Code:
awk '{print ">"$1"\n"$2}' GeCKOv1.txt > GeCKOv1.fasta
bowtie-build GeCKOv1.fasta GeCKOv1
However, when I attempt to map the reads from the sample SRR2071309.fastq to this reference as follows, 0 reads are mapped:
Code:
bowtie -t GeCKOv1 SRR2071309.fastq SRR2071309.GeCKOv1.map
I suspect that this is because the reads need to be trimmed somehow, since sgRNAs are generally only about 20 bp and the reads are 51 bp. Therefore, I printed the first few lines of the FASTQ file to look for recurring sequences at the beginning or end of the reads:
Code:
@SRR2071309.1 DH1DQQN1:414:H9PTDADXX:1:1101:1913:2190 length=51
ACTGATTCTTGTGGAAAGGACGAAACACCGCGTCGGGATGCACCAGCTCCG
+SRR2071309.1 DH1DQQN1:414:H9PTDADXX:1:1101:1913:2190 length=51
DDDDDDDIIIIEFFIEIII>EEIIIIIIDIIDIIDDDDDDDDDAA@@AAAA
@SRR2071309.2 DH1DQQN1:414:H9PTDADXX:1:1101:1832:2196 length=51
ACTGTTCTTGTGGAAAGGACGAAACACCGTAGATACCCAGAACGCCTTCGT
+SRR2071309.2 DH1DQQN1:414:H9PTDADXX:1:1101:1832:2196 length=51
FFHHCFHJJJIIJJJJJJJJJJJJJJJJJHIIIJJJJJJJJJJJJHHHFFF
@SRR2071309.3 DH1DQQN1:414:H9PTDADXX:1:1101:2186:2239 length=51
ACTGATTCTTGTGGAAAGGACGAAACACCGCACACAGAAGAGCTCCTGGCG
I guessed that "ACTGATTCTTGTGGAAAGGACGAAACACCG" is the sequence that should be trimmed, so I tried cutadapt:
Code:
cutadapt -a ACTGATTCTTGTGGAAAGGACGAAACACCG -o SRR2071309.trimmed.fastq SRR2071309.fastq
However, the output file (around 98% of reads trimmed according to the log) looked like this:
Code:
@SRR2071309.1 DH1DQQN1:414:H9PTDADXX:1:1101:1913:2190 length=51

+SRR2071309.1 DH1DQQN1:414:H9PTDADXX:1:1101:1913:2190 length=51

@SRR2071309.2 DH1DQQN1:414:H9PTDADXX:1:1101:1832:2196 length=51

+SRR2071309.2 DH1DQQN1:414:H9PTDADXX:1:1101:1832:2196 length=51

@SRR2071309.3 DH1DQQN1:414:H9PTDADXX:1:1101:2186:2239 length=51
I am completely lost as to how I should proceed to be able to map these reads to the reference. Is anyone familiar with this?
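One direction that might be worth trying (untested on the full data, and resting on assumptions about the read layout): as far as I understand the cutadapt documentation, -a specifies a 3' adapter, so the adapter and everything after it is removed, which would explain the empty reads; -g is the option for 5' adapters. In the three reads from SRR2071309.fastq the constant sequence sits at the start of each read, and its length varies (read 2 begins ACTGTT... rather than ACTGATT...), so a fixed-length trim would not line up either. The sketch below instead searches for the last bases of the constant region, GAAACACCG, and keeps the 20 bases that follow. The anchor choice, the 20 nt guide length, and the helper name extract_guides are assumptions for illustration, not something confirmed by the study.
Code:
```shell
#!/bin/sh
# Sketch: keep only the putative 20-nt sgRNA from each FASTQ record.
# Assumptions (not confirmed by the study): the guide is the 20 bases
# immediately after the constant anchor GAAACACCG, and every record
# occupies exactly four lines.
# extract_guides is a hypothetical helper name; it reads FASTQ on stdin
# and writes the trimmed FASTQ to stdout.
extract_guides() {
    awk -v anchor="GAAACACCG" -v glen=20 '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            pos = index(seq, anchor)        # 1-based position, 0 if absent
            start = pos + length(anchor)    # first base after the anchor
            # Drop reads without the anchor or with too few bases after it
            if (pos > 0 && length(seq) - start + 1 >= glen) {
                print header
                print substr(seq, start, glen)
                print "+"
                print substr($0, start, glen)   # matching quality characters
            }
        }
    '
}

# Usage: extract_guides < SRR2071309.fastq > SRR2071309.guides.fastq
```
The resulting file could then be mapped with the same bowtie command as before, with SRR2071309.fastq replaced by the trimmed file. Reads lacking the anchor are dropped rather than guessed at.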