Difference between revisions of "Bioinformatics"

 
Latest revision as of 16:23, 11 July 2018


Note: this page does not attempt to be comprehensive (e.g., if I say "Dolphin analyzes your RNASeq data", I do not mean to imply that it won't analyze any other type of data). There are many documents with much more detail. This is just my attempt to summarize various concepts or issues that I've found confusing.

Dolphin

Dolphin will take your RNASeq fastq files (typically ending in .fastq or .fastq.gz), do some quality checking on them, align the fragments to a reference genome, calculate estimates of gene and transcript abundance, and even do some differential expression calculations (e.g., which genes are significantly up or down regulated), although DEBrowser can do this in more detail.

First you need to have an account on Dolphin and an account (email hpcc-support@umassmed.edu) on the Green High Performance Compute Cluster (GHPCC), because that is where Dolphin does all its analysis. See the Dolphin Quickstart Guide or the ghpcc wiki on how to do that. Note: once you have a ghpcc account you can log in directly to the ghpcc if you want to manage your directories and file structure (see connecting to the cluster).

A note on disk space: Disk space on the UMASS nl space (note: /nl space used to be different hardware and was called /s4s) is $240/TB/yr. It is not backed up anywhere else. It is good for storing data and results, but not for reading and writing directly from Dolphin (it is too slow, and using it this way negatively impacts other users). GHPCC /project/umw_firstname_lastname/ directory space should be used for Dolphin. /project space is billed at $0.60/GB/yr, but I am not sure if any is still available. See also Umass IT HPC info. To learn more about UMass s4s space and Amazon Web Services space (your data can be backed up to AWS space by Dolphin, if you have set that up), see here. You also get 50 GB (not much for RNA Seq analysis) of space in your GHPCC home directory (eg /home/ab27w/). This space is backed up by IS. /project and /s4s space are not backed up unless you are using Dolphin and have set up the AWS option, but they are supposed to be (?) very reliable.
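To compare the two prices quoted above on the same footing, here is a minimal Python sketch. It uses only the rates given in this note ($240/TB/yr for /nl, $0.60/GB/yr for /project) and assumes decimal units (1 TB = 1000 GB); the function name is made up for illustration, and the rates should be treated as a snapshot rather than current pricing.

```python
# Rates as quoted in the note above; they may well be out of date.
NL_DOLLARS_PER_TB_YEAR = 240.0
PROJECT_DOLLARS_PER_GB_YEAR = 0.60
GB_PER_TB = 1000  # assumption: decimal units


def yearly_cost(size_gb):
    """Return (nl_cost, project_cost) in dollars per year for size_gb of data."""
    nl_cost = size_gb / GB_PER_TB * NL_DOLLARS_PER_TB_YEAR
    project_cost = size_gb * PROJECT_DOLLARS_PER_GB_YEAR
    return nl_cost, project_cost


if __name__ == "__main__":
    nl, project = yearly_cost(500)
    print(f"500 GB for one year: /nl ${nl:.2f}, /project ${project:.2f}")
```

At these rates /project works out to $600/TB/yr, i.e. 2.5 times the /nl price, which is one more reason to move finished results off /project.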

How much space does Dolphin need? It depends on your input data and what kind of analysis you do. But let's work through some sample numbers. Suppose you do an Illumina NextSeq 500 run with 400M "reads". In this context I think a "read" is an entire fragment, so that might be 75 base-pairs (bp) per fragment. Illumina produces fastq files which take roughly 3 bytes/bp (one for the base letter: A, G, C, T; one for the quality score of that letter; and one more for ancillary system info). So the fastq input files will be 400M*75*3 = 90 GB of data.

This data gets "imported" to Dolphin, which copies it (still compressed) to the import directory, and also duplicates it (uncompressed) under that directory in its "initial_run/" subdirectory (where it does checksum counts, etc). So you have now used an additional 180 GB of space (really more, since it has been uncompressed; you can manually delete the files in the initial_run/ directory outside of Dolphin if you want).

Then, when you run the pipeline analysis on the files, if you have told Dolphin to separately search for rRNA (or piRNA, etc), it finds that RNA and then produces a new data file without that RNA in it. But this means that if you tell it to run those steps and there is already little of that RNA in the data (eg, rRNA may already have been purified out), then after the rRNA step you have a new fastq dataset which is almost the same size as the original! This can eat up space quickly. Do NOT specify these steps unless you have sizeable amounts of that type of RNA. The RNA types (eg, rRNA) will still be correctly found without those extra steps being specified (although it may take longer). So if you specify 3 of these filtering steps, your data will end up getting copied 3x.

The other fairly large datasets will be the output from the alignment step (the *.bam files). I think those will be about the size of the input data set.

Note: after running Dolphin you can copy some of these output directories into /nl space and then tell Dolphin their new locations (by going to NGS tracking > NGS Browser, Samples pane, clicking on a sample name, then on the new Sample pane clicking on the Directory Info tab, and changing the "Processed File(s) Directory:" info to be /nl/xxx).
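The back-of-the-envelope numbers above can be put into a small Python sketch. It only restates the rough guesses from this section (3 bytes/bp, two copies made at import, one near-full-size copy per filtering step, bam output about the size of the input); the function names are hypothetical, and it ignores compression, so treat the output as an order-of-magnitude estimate.

```python
def fastq_bytes(n_reads, read_length_bp, bytes_per_bp=3):
    """Rough fastq size: base letter + quality score + ancillary info per bp."""
    return n_reads * read_length_bp * bytes_per_bp


def dolphin_footprint_gb(n_reads, read_length_bp, n_filter_steps=0):
    """Rough disk usage in GB for one run, using this section's guesses."""
    raw_gb = fastq_bytes(n_reads, read_length_bp) / 1e9
    return {
        "raw_fastq": raw_gb,
        # copy in the import directory plus the duplicate in initial_run/
        "import": 2 * raw_gb,
        # worst case: each filtering step (rRNA, piRNA, ...) rewrites nearly all the data
        "filter_steps": n_filter_steps * raw_gb,
        # guess from the text: bam files are about the size of the input
        "bam": raw_gb,
    }


if __name__ == "__main__":
    usage = dolphin_footprint_gb(400_000_000, 75, n_filter_steps=3)
    for name, gb in usage.items():
        print(f"{name:>12}: {gb:6.0f} GB")
```

For the 400M-read example with 3 filtering steps, this gives 180 GB for the import, 270 GB for the filtered copies and 90 GB for the bam files, on top of the original 90 GB.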

You can also delete many of the files Dolphin produces once it is done since they are no longer needed (see my DolphinSpace.txt file). In particular, you can always delete the import/initial_run directory. Many of the other files (eg in the output "process" directory) can be deleted too, since Dolphin saves many of the results elsewhere. Alper says: "So, I would keep only imported fastq.gz, fastq.gz.count and fastq.gz.md5sum files under your import directory folder only and delete the rest. You can always use these files to rerun if you need any intermediate files. Rerunning is less expensive than keeping all these intermediate files. I [Dolphin] keep TPM tables and fastQC results in different directories [that the user doesn't see in their /project space] so you don’t need to keep them in your folders. When you need those TPM, expected count files you can download them from dolphin [lml: although that might be a bit more cumbersome at times than just using scp?]". Also: "You don’t need process/ too unless you need bam files for visualization. Otherwise, you can remove all process directory. They don’t effect dolphin". Alper again: "I don't keep bam alignment files [ie, a copy of bam files is not saved in Dolphin's hidden file space, unlike TPM tables and fastQC results]. So, if you need them, please keep them as well. You can delete the rest. If you wish, you can keep small files too ... like tsv files [eg, process/count/*.tsv]".
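As a cautious way to act on Alper's advice, here is a minimal Python sketch that only lists what would be kept and what could be deleted under an import directory. It deletes nothing. The function name is hypothetical, and the rule it applies (keep only *.fastq.gz, *.fastq.gz.count and *.fastq.gz.md5sum files sitting directly in the import directory) is my reading of the quote above, so check the list before removing anything by hand.

```python
import sys
from pathlib import Path

# File endings Alper suggests keeping in the import directory.
KEEP_SUFFIXES = (".fastq.gz", ".fastq.gz.count", ".fastq.gz.md5sum")


def split_keep_delete(import_dir):
    """Return (keep, delete) lists of files under import_dir. Nothing is removed."""
    root = Path(import_dir)
    keep, delete = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Only files directly in the import directory are kept;
        # anything in subdirectories such as initial_run/ is a delete candidate.
        if path.parent == root and path.name.endswith(KEEP_SUFFIXES):
            keep.append(path)
        else:
            delete.append(path)
    return keep, delete


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    keep, delete = split_keep_delete(target)
    print("keep:")
    for p in keep:
        print("  ", p)
    print("candidates for deletion:")
    for p in delete:
        print("  ", p)
```

Run it with the import directory as the only argument. Remember that bam files live under process/, not the import directory, and are not saved elsewhere by Dolphin, so this sketch says nothing about them.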

You have several ways to get files to the GHPCC. You can just use your favorite file transfer program, e.g., from whichever computer has the files initially: scp myfile.fastq ab27w@ghpcc06.umassrc.org:/project/umw_lawrence_lifshitz/data/myfile.fastq (see also the ghpcc wiki on transferring data). Dolphin may also be able to import them from your PC via an excel spreadsheet...

Then you need to import the fastq files to Dolphin. When Dolphin does this it will copy them to an output ("Process") directory and do some simple checking of their format. Start up Dolphin (http://dolphin.umassmed.edu). There are two ways to import files. Along the left, click on either NGS tracking -> Excel Import (this is a bit tricky but can provide some advantages), or NGS tracking -> fastlane. Dolphin considers this import a type of "run".

Now you want to align and count your data. Go to NGS tracking -> NGS Browser and pick the files ("samples") that you want to analyze. Then click Send to Pipeline. In addition to checking Yes for FastQC, click on Add Additional Pipeline, and add RSEM (and click on RSEM QC with that option to get some quality control information about the run). Then click on Submit Pipeline. Nothing may happen for about a minute (so don't keep clicking), and then you will be told your job is submitted. See the NGS Status Guide for how to check on the status of a job (a "run"). It may take hours to days for your job to run, depending upon its size.

Finally (well, not really), once the job has successfully run, go back to NGS tracking -> NGS Browser (or NGS tracking -> Report Status) to find the files you want to do a differential expression analysis on. Each file ("sample") keeps track of what "runs" have been performed on it (i.e., whether RSEM has been run so that a gene counts file is available). Select the files and either the expected_genes.tsv or expected_transcripts.tsv output files to analyze (see [] for how to do this). Then go to Generate Tables, generate a table with all this data and "Save" it within Dolphin (for future reference). Also save it to a local file on your PC (for future import into DEBrowser). Note: you CAN also send this table directly to DEBrowser, which is fine as long as you don't need to specify batch effects.

DEBrowser

Type http://debrowser.umassmed.edu (or this will pop up automatically if you told Dolphin to send the table directly to DEBrowser). Now see DEBrowser help for how to use it.

Normalization methods/issues.

DEBrowser's Quality Control Inter-quartile Range (IQR) plots use normalized data (which normalization method?). The heat map is the log_2 of normalized data (using the technique EdgeR uses, ie, Trimmed Mean of M-values, TMM). Then all the data in a column is scaled and centered, so that its mean is 0 and its standard deviation is 1.
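To make the "log_2, then scale and center each column" step concrete, here is a minimal Python sketch. It is not DEBrowser's code. It assumes the input is already normalized (the TMM step is not shown), it adds a pseudocount of 1 before taking the log so that zeros don't blow up (my assumption), and it uses the sample standard deviation, as R's scale() does. The function name is hypothetical.

```python
import math
import statistics


def log2_scale_columns(rows, pseudocount=1.0):
    """Take log2 of each value, then center and scale each column to mean 0, sd 1.

    rows: list of rows (e.g. genes), each a list of already-normalized values,
    one per column (e.g. samples). A constant column has sd 0 and would raise
    ZeroDivisionError, so filter such columns out first.
    """
    logged = [[math.log2(v + pseudocount) for v in row] for row in rows]
    scaled_columns = []
    for col in zip(*logged):
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample standard deviation (n - 1)
        scaled_columns.append([(v - mean) / sd for v in col])
    # transpose back to the original row/column layout
    return [list(row) for row in zip(*scaled_columns)]


if __name__ == "__main__":
    normalized = [[0, 3], [1, 7], [3, 15]]
    for row in log2_scale_columns(normalized):
        print(row)
```

After this step, the heat map colors show how far each value sits from its column's average in units of standard deviations, rather than absolute expression levels.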

TMM FPKM TPM estimated_counts from RSEM

resources

http://wiki.umassrc.org/wiki/index.php/Main_Page

https://github.com/Bioconductor-mirror/debrowser

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RSEM

http://dolphin.readthedocs.io/en/master/dolphin-ui/Dolphin-UI.html

http://bioinfo.umassmed.edu/

UMASS Storage - as of 11/13/17 still needs to be updated to reflect the replacement of /s4s with /nl space.

https://galaxy.umassmed.edu/

Extensive discussion of how to use the DESeq2 package DEBrowser uses this. It has lots of info. MA plots, dispersion plots, etc. 2.2.3 Principal component plot. p. 29 Fig. 7 legend: "this type of plot is useful for visualizing the overall effect of experimental covariates and batch effects" (of course, it can be used other ways too). DESeq2 research article.

Bioconductor RNA workflows - a bunch of useful stuff here, the regularized-logarithm transformation, heatmaps, PCA, etc.

The RNA-seqlopedia provides an overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment. This site hits a nice sweet-spot as a how-to manual to actually do stuff.

RNA Seq bioinformatics tools

biocore@umassmed.edu