Hello,
I am working with a .fastq file. I would like to use TopHat to convert .fastq to BAM files, and then convert the BAM files to count data to find differentially expressed genes. After looking at the .fastq files, I have two questions:
1) An example from my .fastq file is below:
@SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50
T32312021301103013102021230203203000020001100110.00
+SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50
!)?1333=4;;*</&99>(+58@/(9:=64/:;4-55<6453>241&8!.7
This looks different from the example .fastq record posted on Wikipedia:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
I am concerned about the second line of each record. My second line has digits instead of letters, and also contains dots. First, is it okay that it contains digits instead of letters? Second, is it okay that it contains dots?
I have been advised to replace each dot in this line with "N". However, I worry that doing so will also replace dots in other lines (such as the dot in the first line, @SRR452349.7, or a dot in the quality line). So, is it really necessary for me to do so, and if so, is there a safe script to use?
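To make the concern concrete, here is a minimal Python sketch of what a safer replacement might look like. It assumes every record is exactly four lines (header, sequence, "+" line, quality), which matches my example but is an assumption about the whole file. The function name `replace_dots_in_sequences` is hypothetical, and whether the replacement is needed at all is still my open question.

```python
def replace_dots_in_sequences(lines, replacement="N"):
    """Yield FASTQ lines, replacing '.' only on sequence lines.

    Assumes strict four-line records, so the sequence is always the
    second line of each group of four (0-based index 1, 5, 9, ...).
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.replace(".", replacement)
        else:
            # Header, '+' line, and quality line pass through untouched.
            yield line


# The record from my file: note that the header and the quality line
# both contain dots that must not be changed.
example = [
    "@SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50\n",
    "T32312021301103013102021230203203000020001100110.00\n",
    "+SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50\n",
    "!)?1333=4;;*</&99>(+58@/(9:=64/:;4-55<6453>241&8!.7\n",
]

fixed = list(replace_dots_in_sequences(example))
print("".join(fixed), end="")
```

Because the function works on any iterable of lines, an open file handle could be passed in place of `example` and the output written line by line, so a ~15 GB file would not need to fit in memory.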
2) I have several of these .fastq files to compare, and each is about 222629380 lines (~15 GB). I am worried about time constraints, and wonder whether it would be acceptable to use only the first X lines of each .fastq file, with the same X for every file. If so, how could I determine an appropriate value of X? That is, how many lines would be too few to extract?
Also, as I am using TopHat2 to convert the .fastq files (and in so doing, I am aligning the .fastq files with the reference genome), it would not be a problem for me to only include some lines, right?
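For reference, here is a small Python sketch of how I imagine taking the first X reads, again assuming strict four-line records. Under that assumption, any line cutoff has to be a multiple of four, otherwise the last record would be truncated. The helper names `count_reads` and `head_reads` are hypothetical, and the sketch says nothing about which X is statistically adequate.

```python
from itertools import islice

LINES_PER_RECORD = 4  # header, sequence, '+' line, quality


def count_reads(n_lines):
    """Convert a FASTQ line count into a read count."""
    if n_lines % LINES_PER_RECORD != 0:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // LINES_PER_RECORD


def head_reads(lines, n_reads):
    """Return an iterator over the lines of the first n_reads records."""
    return islice(lines, n_reads * LINES_PER_RECORD)


# Each of my files has about 222629380 lines.
total_reads = count_reads(222629380)
print(total_reads)

# Tiny demonstration with two fake records: keep only the first one.
demo = ["@r1\n", "T0123\n", "+\n", "!!!!\n", "@r2\n", "T3210\n", "+\n", "!!!!\n"]
first_record = list(head_reads(demo, 1))
```

With a real file, an open file handle would replace `demo`, and the same number of reads would be taken from every file.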
*If you see any problems with my general pipeline of converting .fastq to BAM files via TopHat2, and then converting BAM files to count tables, please let me know!
Thank you!