Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • hisat2 file reads don't match up

    Hello all. For a few of my samples, I am getting this error that the reads aren't matching up between my two mates:
    Code:
    Error, fewer reads in file specified with -1 than in file specified with -2
    libc++abi.dylib: terminating with uncaught exception of type int
    (ERR): hisat2-align died with signal 6 (ABRT)
    Here is the command that I used for reference:
    Code:
    hisat2 -q -p 8 -x /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/HISAT2_INDEXES/Xenopus_Laevis -1 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R1_001.fastq_sub_sample_0.25 -2 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R2_001.fastq_sub_sample_0.25 -S Sample_10_hisat_results.sam
    It should be noted that these samples have been subsampled using this script that utilizes htseq to subsample mate pairs:
    Code:
    tophat_folders = [os.path.join(root, name)
                 for root, dirs, files in os.walk(os.getcwd()) #does for the current directory in the shell!!!!     
                 for name in files
                 if name.endswith(".fastq")]
    
    #within the sample file, need to go into results folder to get the .fastq file 
    for files in tophat_folders:
    	print(files)
    
    fraction = float(.1)
    
    print("fastq files being ran through")
    for v, w in zip(tophat_folders[::2], tophat_folders[1::2]):
    	in1 = v 
    	in2 = w
    	print("Mate 1 is", v)
    	print("Mate 2 is", w)
    	iter1 = iter( HTSeq.FastqReader( in1 ) )
    	iter2 = iter( HTSeq.FastqReader( in2 ) )
    
    	output1 = in1 + "_sub_sample_" + str(fraction)
    	output2 = in2 + "_sub_sample_" + str(fraction)
    	out1 = open( output1, "w" )
    	out2 = open( output2, "w" )
    
    	for read1, read2 in itertools.izip( iter1, iter2 ):
    	   if random.random() < fraction:
    	      read1.write_to_fastq_file( out1 )
    	      read2.write_to_fastq_file( out2 )
    	      
    	out1.close()
    	out2.close()
    When I ran the above script it executed quite smoothly, but it seems to have had errors in a couple files. When I re-run the subsampling script through samples that get the above error when running through hisat2, it works fine. Any idea on this?

  • #2
    No clue why the script is giving you issues. In the future, though, you might use seqtk to subsample files. It's quite quick (it'll very likely be faster than python) and you can give it a seed, so the subsampling is reproducible.

    Comment


    • #3
      What do you mean by seed?

      Comment


      • #4
        Have a look at wikipedia.

        Comment


        • #5
          Thanks for bringing that to my attention. So currently each time a file is being subsampled it is presumably using a different seed.
          Code:
          random.seed(None)
          However, to fix this and establish a seed for better consistency using this line at the beginning of the script will establish the seed:
          Code:
          random.seed(100)
          My question when using a consistent seed is does it make a difference when subsampling multiple mate pairs of the same sample? Presumably the reads in the mate pairs are random when comparing the pairs against each other.

          Comment


          • #6
            Stated differently, "Suppose I manually specify a seed before going through each file, if the 'i'th read in file 1 is selected will the 'i'th read of other files also be selected?" The answer is yes. Having said that, if you use something pre-existing (e.g., seqtk), then you don't need to spend time debugging your own scripts.

            Comment


            • #7
              Yes, I understand the 'i'th read will be selected from both files if specified with the same seed. Although this is good for reproducibility, what are the other advantages? Why is it important to use the same seed when subsamplng paired-end mates?

              Would it be wise to use this across samples that one is subsampling, or does it not make a difference?

              Comment


              • #8
                If you don't use the same seed for mates then the results will be out of sync.

                It's often convenient to use the same seed for all samples.

                Comment


                • #9
                  If you were using BBMap you could set sampling parameters with the aligner itself and save yourself a step.

                  Code:
                  reads=-1                Set to a positive number N to only process the first N
                                          reads (or pairs), then quit.  -1 means use all reads.
                  samplerate=1            Set to a number from 0 to 1 to randomly select that
                                          fraction of reads for mapping. 1 uses all reads.
                  Any specific reason you are only using a sample of the data?

                  Comment


                  • #10
                    Originally posted by dpryan View Post
                    If you don't use the same seed for mates then the results will be out of sync.

                    It's often convenient to use the same seed for all samples.
                    I am not using mates but paired-end reads (but that shouldn't make too much of a difference). Though I am still unsure how it will be out of sync if we don't use the same seed. I say this because as far as I can tell, the order of reads in each file of the 2 paired-end reads are sorted in a random order to begin with - please correct me if I'm wrong.

                    So please explain why paired reads need to have the same 'random' reads subsampled. I only see this making sense if the paired-end reads were in some kind of order relative to each other.

                    Thanks

                    Comment


                    • #11
                      Originally posted by ronaldrcutler View Post
                      I say this because as far as I can tell, the order of reads in each file of the 2 paired-end reads are sorted in a random order to begin with - please correct me if I'm wrong.
                      That should not be the case, unless something has been done to the files to perturb the order of reads (e.g. using a trimming program that is not PE aware). Most aligners expect the reads to be in the same order in the R1/R2 files (and don't check if that is so). If the reads are out of order in the files then you are likely to get odd results (discordant alignments etc).

                      Comment


                      • #12
                        Originally posted by GenoMax View Post
                        If you were using BBMap you could set sampling parameters with the aligner itself and save yourself a step.

                        Code:
                        reads=-1                Set to a positive number N to only process the first N
                                                reads (or pairs), then quit.  -1 means use all reads.
                        samplerate=1            Set to a number from 0 to 1 to randomly select that
                                                fraction of reads for mapping. 1 uses all reads.
                        Any specific reason you are only using a sample of the data?
                        I will go ahead and check out BBmap, do you know how it compares against hisat2 and tophat2 when mapping?

                        I am subsampling at different fractions of an original DE experiment to determine what is the lowest amount of reads we can use for 3 replicates while maintaining power to get a high percentage of the first 300 DE genes ordered by padj value using 100% of reads.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          Most aligners expect the reads to be in the same order in the R1/R2 files (and don't check if that is so).
                          Ah, and this order should be the same as both forward and reverse reads are sequenced from the 5' end right? I'm also assuming that both of these reads contains the same number of reads.

                          Comment


                          • #14
                            BBMap is splice-aware and should do very well against any published aligner out there. Choice of aligners becomes a personal choice (since they offer so many options it is hard to master them all and switching between aligners becomes a hassle after a time). It is multi-threaded and should be one of the faster aligners (I am biased, so make your own judgement ). @Brian (author of BBMap) participates here regularly and can answer any questions (including advanced usage that sometimes only he knows about).

                            Here is a thread about BBMap aligner. I have tried to collect all things that BBMap suite of programs do here.

                            Comment


                            • #15
                              Originally posted by ronaldrcutler View Post
                              Ah, and this order should be the same as both forward and reverse reads are sequenced from the 5' end right? I'm also assuming that both of these reads contains the same number of reads.
                              Order with respect to the fragment to ensure that R1 and R2 from a specific fragment will be at the same relative location in both files. Sequencing is always in 5'->3' direction.

                              If a PE aware aligner is not used (or PE files are not trimmed together) then one of the files may end up with more reads than the other and the order of reads would no longer be maintained.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X