Difference between revisions of "Bioinformatics"


Revision as of 15:34, 6 June 2017


Note: this page does not attempt to be comprehensive (e.g., if I say "Dolphin analyzes your RNASeq data", I do not mean to imply that it won't analyze any other type of data). There are many documents with much more detail. This is just my attempt to summarize various concepts or issues that I've found confusing.

Dolphin

Dolphin will take your RNASeq fastq files (typically ending in .fastq or .fastq.gz), do some quality checking on them, align the fragments to a reference genome, calculate estimates of gene and transcript abundance, and even do some differential expression calculations (e.g., which genes are significantly up- or down-regulated) - although DEBrowser can do this in more detail.

First you need an account on Dolphin and an account (email hpcc-support@umassmed.edu) on the Green High Performance Compute Cluster (GHPCC), because that is where Dolphin does all its analysis. See the Dolphin Quickstart Guide or the ghpcc wiki on how to do that. Note: once you have a ghpcc account you can log in directly to the ghpcc if you want to manage your directories and file structure (see connecting to the cluster).

A note on disk space: Disk space on the UMass s4s space (a broken link) costs $150/TB/yr. It is not backed up anywhere else. It is good for storing data and results, but not for reading and writing directly from Dolphin (it is too slow, and using it this way negatively impacts other users). Use GHPCC /project/umw_firstname_lastname/ directory space for Dolphin instead. Historically, users got many TBs of /project space for free from UMass, but I am not sure if any is still available. See also UMass IT HPC info. To learn more about UMass s4s space and Amazon Web Services space (Dolphin can back up your data to AWS space, if you have set that up), see here. You also get 50 GB (not much for RNA Seq analysis) of space in your GHPCC home directory (e.g., /home/ab27w/). This space is backed up by IS. /project and /s4s space are not backed up unless you are using Dolphin and have set up the AWS option, but they are supposed to be (?) very reliable.

How much space does Dolphin need? It depends on your input data and what kind of analysis you do. But let's do some sample numbers. Suppose you run an Illumina NextSeq 500 run with 400M "reads". In this context I think a "read" is an entire fragment, so that might be 75 base-pairs (bp) long per fragment. Illumina produces fastq files which will take roughly 3 bytes/bp (one for the base letter: A, G, C, T; one for the quality score on the letter; and one more for ancillary system info). So the fastq input files will be 400M*75*3 = 90 GB of data. This data gets "imported" to Dolphin, which copies it to the import directory and also duplicates it under that directory in its "initial_run/" subdirectory (where it does checksum counts, etc). So you have now used an additional 180 GB of space (you can manually delete the files in the initial_run/ directory outside of Dolphin if you want).

Then, when you run the pipeline analysis on the files, if you have told Dolphin to separately search for rRNA (or piRNA, etc), it finds that RNA and then produces a new data file without that RNA in it. This means that if you run those steps when there is already little of that RNA in the data (e.g., rRNA may already have been purified out), then after the rRNA step you have a new fastq dataset which is almost the same size as the original! This can eat up space quickly. Do NOT specify these steps unless you have sizeable amounts of that type of RNA. The RNA types (e.g., rRNA) will still be correctly found without those extra steps being specified (although it may take longer). So if you specify 3 of these filtering steps, your data will end up getting copied 3x.

The other fairly large datasets will be the output from the alignment step (the *.bam files). I think those will be about the size of the input data set. Note: after running Dolphin you can copy some of these output directories into /s4s space and then tell Dolphin their new locations (by doing ...).
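The sample numbers above can be turned into a rough back-of-the-envelope calculator. This is only a sketch of the arithmetic in this note, not anything Dolphin itself provides: the function name and its parameters are made up for illustration, and it assumes the worst case described above (the original files sit on the same disk, each filtering step writes a nearly full-size copy, and the *.bam output is about the size of the input).

```python
def estimate_dolphin_space_gb(reads, read_length_bp, bytes_per_bp=3,
                              filter_steps=0, include_bam=True):
    """Rough disk estimate in GB (1 GB = 10**9 bytes).

    Hypothetical helper: it only reproduces the arithmetic from this note.
    """
    # Size of the raw fastq data, e.g. 400M reads * 75 bp * 3 bytes/bp = 90 GB.
    fastq_gb = reads * read_length_bp * bytes_per_bp / 1e9

    usage = {
        # Assumption: the files you transferred stay on the same disk.
        "original_fastq": fastq_gb,
        # Dolphin's copy in the import directory.
        "import_copy": fastq_gb,
        # Dolphin's duplicate under initial_run/ (can be deleted by hand).
        "initial_run_copy": fastq_gb,
        # Worst case: each filtering step (rRNA, piRNA, ...) removes almost
        # nothing, so each one writes a nearly full-size fastq copy.
        "filter_step_copies": fastq_gb * filter_steps,
        # Assumption from the note: *.bam output is about the input size.
        "bam_output": fastq_gb if include_bam else 0.0,
    }
    usage["total"] = sum(usage.values())
    return usage


if __name__ == "__main__":
    estimate = estimate_dolphin_space_gb(reads=400_000_000, read_length_bp=75,
                                         filter_steps=3)
    for name, gb in estimate.items():
        print(f"{name:>20}: {gb:7.1f} GB")
```

With the numbers from this note and 3 filtering steps, the worst case comes to several times the size of the raw fastq files, which is why it is worth skipping filtering steps you do not need and deleting initial_run/ copies once the import has been checked.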

There are several ways to get files onto the GHPCC. You can just use your favorite file transfer program. For example, from whichever computer currently holds the files: scp myfile.fastq ab27w@ghpcc06.umassrc.org:/project/umw_lawrence_lifshitz/data/myfile.fastq (see also the ghpcc wiki on transferring data). Dolphin may also be able to import them from your PC via an Excel spreadsheet...

Then you need to import the fastq files to Dolphin. When Dolphin does this, it will copy them to an output ("Process") directory and do some simple checking of their format. Start up Dolphin (http://dolphin.umassmed.edu). There are two ways to import files. In the menu along the left, click on either NGS tracking -> Excel Import (this is a bit tricky but can provide some advantages), or NGS tracking -> fastlane. Dolphin considers this import a type of "run".

Now you want to align and count your data. Go to NGS tracking -> NGS Browser and pick the files ("samples") that you want to analyze. Then click Send to Pipeline. In addition to checking Yes for FastQC, click on Add Additional Pipeline, and add RSEM (and click on RSEM QC with that option to get some quality control information about the run). Then click on Submit Pipeline. Nothing may happen for about a minute (so don't keep clicking), and then you will be told your job is submitted. See the NGS Status Guide for how to check on the status of a job (a "run"). It may take hours to days for your job to run, depending upon its size.

Finally (well, not really), once the job has successfully run, go back to NGS tracking -> NGS Browser (or NGS tracking -> Report Status) to find the files you want to do a differential expression analysis on. Each file ("sample") keeps track of what "runs" have been performed on it (i.e., whether RSEM has been run so that a gene counts file is available). Select the files and either the expected_genes.tsv or expected_transcripts.tsv output files to analyze (see [] for how to do this). Then go to Generate Tables, generate a table with all this data and "Save" it within Dolphin (for future reference). Also save it to a local file on your PC (for future import into DEBrowser). Note: you CAN also send this table directly to DEBrowser, which is fine as long as you don't need to specify batch effects.

DEBrowser

Go to http://debrowser.umassmed.edu (this will pop up automatically if you told Dolphin to send the table directly to DEBrowser). Now see DEBrowser help for how to use it.

Normalization methods/issues.

DEBrowser's Quality Control Inter-quartile Range (IQR) plots use normalized data (which normalization method?). The heat map is the log_2 of normalized data (using the technique edgeR uses, i.e., Trimmed Mean of M-values, TMM). Then all the data in a column is scaled and centered, so that its mean is 0 and its standard deviation is 1.
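To make the "log_2, then scale and center" step concrete, here is a minimal sketch in plain Python. It is not DEBrowser's actual code: it assumes the input values are already normalized (the TMM step is not shown), the function name is made up, the pseudocount of 1 is my own assumption to avoid taking log_2 of zero, and it uses the sample standard deviation.

```python
import math
import statistics


def log2_scale_column(values, pseudocount=1.0):
    """Log2-transform one column, then center and scale it.

    Hypothetical helper: `values` are assumed to be already normalized
    counts; `pseudocount` is an assumption to avoid log2(0).
    """
    logged = [math.log2(v + pseudocount) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1)
    if sd == 0:
        # A constant column cannot be scaled; return it centered only.
        return [0.0 for _ in logged]
    return [(x - mean) / sd for x in logged]


# Made-up normalized values for one column of the heat map.
column = [0.0, 3.0, 15.0, 63.0]
scaled = log2_scale_column(column)
print([round(x, 3) for x in scaled])
print("mean:", round(statistics.mean(scaled), 6),
      "sd:", round(statistics.stdev(scaled), 6))
```

After this step every column has mean 0 and standard deviation 1, so the heat map colors show how a value compares to the rest of its column rather than its absolute size.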

TMM FPKM TPM estimated_counts from RSEM

Resources

http://wiki.umassrc.org/wiki/index.php/Main_Page

https://github.com/Bioconductor-mirror/debrowser

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RSEM

http://dolphin.readthedocs.io/en/master/dolphin-ui/Dolphin-UI.html

http://bioinfo.umassmed.edu/

https://galaxy.umassmed.edu/

Extensive discussion of how to use the DESeq2 package (DEBrowser uses this). It has lots of info: MA plots, dispersion plots, etc. See 2.2.3 Principal component plot, p. 29, Fig. 7 legend: "this type of plot is useful for visualizing the overall effect of experimental covariates and batch effects" (of course, it can be used other ways too). DESeq2 research article.

Bioconductor RNA workflows - a bunch of useful stuff here: the regularized-logarithm transformation, heatmaps, PCA, etc.

The RNA-seqlopedia provides an overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment. This site hits a nice sweet-spot as a how-to manual to actually do stuff.

RNA Seq bioinformatics tools

biocore@umassmed.edu