Difference between revisions of "Bioinformatics"

Revision as of 19:20, 15 May 2017


Note: this page does not attempt to be comprehensive (e.g., if I say "Dolphin analyzes your RNASeq data", I do not mean to imply that it won't analyze any other type of data). There are many documents with much more detail. This page is just my attempt to summarize various concepts or issues that I've found confusing.

Dolphin

Dolphin will take your RNASeq fastq files (typically ending in .fastq or .fastq.gz), do some quality checking on them, align the fragments to a reference genome, calculate estimates of gene and transcript abundance, and even do some differential expression calculations (e.g., which genes are significantly up or down regulated), although DEBrowser can do this in more detail.

First you need an account on Dolphin and an account (email hpcc-support@umassmed.edu) on the Green High Performance Compute Cluster (GHPCC), because that is where Dolphin does all its analysis. See the Dolphin Quickstart Guide or the ghpcc wiki for how to do that. Note: once you have a ghpcc account you can log in directly to the ghpcc if you want to manage your directories and file structure (see connecting to the cluster).

A note on disk space: Disk space on the UMass s4s space (a broken link) costs $150/TB/yr. It is not backed up anywhere else. It is good for storing data and results, but not for reading and writing directly from Dolphin (it is too slow, and using it this way negatively impacts other users). Use GHPCC /project/umw_firstname_lastname/ directory space for Dolphin instead. Historically users got many TBs of /project space for free from UMass, but I am not sure if any is still available. See also UMass IT HPC info. To learn more about UMass s4s space and Amazon Web Services (AWS) space (Dolphin can back up your data to AWS space, if you have set that up), see here. You also get 50 GB (not much for RNASeq analysis) of space in your GHPCC home directory (e.g., /home/ab27w/). This home space is backed up by IS. /project and /s4s space are not backed up unless you are using Dolphin and have set up the AWS option, but they are supposed to be (?) very reliable.

You have several ways to get files to the GHPCC. You can just use your favorite file transfer program. For example, from whichever computer has the files initially, run:
scp myfile.fastq ab27w@ghpcc06.umassrc.org:/project/umw_lawrence_lifshitz/data/myfile.fastq
See also the ghpcc wiki on transferring data. Dolphin may also be able to import them from your PC via an Excel spreadsheet...
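If you have many files to copy, it can help to build the scp command in a script rather than typing it each time. The following is only a minimal sketch: the helper name build_scp_command is made up for illustration, and the user, host and paths are just the example values from the text above. It builds and prints the command but does not run it.

```python
import shlex


def build_scp_command(local_path, user, host, remote_path):
    """Return the scp argument list for copying one local file to the cluster.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    return ["scp", local_path, f"{user}@{host}:{remote_path}"]


# Example values taken from the scp example in the text.
cmd = build_scp_command(
    "myfile.fastq",
    "ab27w",
    "ghpcc06.umassrc.org",
    "/project/umw_lawrence_lifshitz/data/myfile.fastq",
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually start the transfer you could pass the list to
# subprocess.run(cmd, check=True); that is left out here on purpose.
```

Keeping the command as a list (rather than one string) avoids quoting problems if a file name contains spaces.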

Then you need to import the fastq files to Dolphin. When Dolphin does this it will copy them to an output ("Process") directory and do some simple checking of their format. Start up Dolphin (http://dolphin.umassmed.edu). There are two ways to import files. Along the left, click on either NGS tracking -> Excel Import (this is a bit tricky but can provide some advantages) or NGS tracking -> fastlane. Dolphin considers this import a type of "run".

Now you want to align and count your data. Go to NGS tracking -> NGS Browser and pick the files ("samples") that you want to analyze. Then click Send to Pipeline. In addition to checking Yes for FastQC, click on Add Additional Pipeline, and add RSEM (and click on RSEM QC with that option to get some quality control information about the run). Then click on Submit Pipeline. Nothing may happen for about a minute (so don't keep clicking), and then you will be told your job is submitted. See the NGS Status Guide for how to check on the status of a job (a "run"). It may take hours to days for your job to run, depending upon its size.

Finally (well, not really), once the job has successfully run, go back to NGS tracking -> NGS Browser (or NGS tracking -> Report Status) to find the files you want to do a differential expression analysis on. Each file ("sample") keeps track of what "runs" have been performed on it (i.e., whether RSEM has been run so that a gene counts file is available). Select the files and either the expected_genes.tsv or expected_transcripts.tsv output files to analyze (see [] for how to do this). Then go to Generate Tables, generate a table with all this data and "Save" it within Dolphin (for future reference). Also save it to a local file on your PC (for future import into DEBrowser). Note: you CAN also send this table directly to DEBrowser, which is fine as long as you don't need to specify batch effects.

DEBrowser

Go to http://debrowser.umassmed.edu in your browser (this will pop up automatically if you told Dolphin to send the table directly to DEBrowser). Now see DEBrowser help for how to use it.

Normalization methods/issues.

DEBrowser's Quality Control Inter-quartile Range (IQR) plots use normalized data (which normalization method?). The heat map is the log2 of normalized data (using the technique edgeR uses, i.e., Trimmed Mean of M-values, TMM). Then all the data in a column is scaled and centered, so that its mean is 0 and its standard deviation is 1.
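As a rough illustration of the log2 and "scaled and centered" steps described above, here is a minimal sketch for one column of already-normalized values. It is not DEBrowser's actual code. The function name is made up, the example values are invented, and two things are my assumptions: that the sample standard deviation (n-1 denominator) is used, and that all values are positive (the text does not say how zeros are handled before taking the log).

```python
import math
import statistics


def log2_then_scale(values):
    """Take log2 of each value, then center and scale the column.

    Assumptions (not from DEBrowser itself): every value is > 0, and the
    sample standard deviation (n - 1 denominator) is used for scaling.
    """
    logged = [math.log2(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]


# Invented normalized values for one column of the heat map.
column = [1.0, 2.0, 4.0, 8.0, 16.0]
scaled = log2_then_scale(column)

# After scaling, the column has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.mean(scaled), statistics.stdev(scaled))
```

The point of this step is that every column ends up on the same scale, so the heat map colors show relative differences rather than overall abundance.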

TMM, FPKM, TPM, estimated_counts from RSEM
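To make two of these units more concrete, here is a minimal sketch of the usual textbook definitions of FPKM and TPM, computed from fragment counts and gene lengths. The counts and lengths are invented, the function names are mine, and plain gene length is used as a simplification (RSEM works with an effective length). TMM is not shown: it is a between-sample scaling factor rather than a per-gene unit.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million fragments in the sample."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Invented example: three genes in one sample.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

# Genes 1 and 2 have the same count per base, so they get the same FPKM
# and the same TPM even though their raw counts differ.
print(fpkm_values)
print(tpm_values)
```

One practical difference: TPM values always sum to one million within a sample, while the sum of FPKM values varies from sample to sample.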

Resources

http://wiki.umassrc.org/wiki/index.php/Main_Page

https://github.com/Bioconductor-mirror/debrowser

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RSEM

http://dolphin.readthedocs.io/en/master/dolphin-ui/Dolphin-UI.html

http://bioinfo.umassmed.edu/

https://galaxy.umassmed.edu/

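Extensive discussion of how to use the DESeq2 package: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf (DEBrowser uses DESeq2). It has lots of info: MA plots, dispersion plots, etc. See section 2.2.3, Principal component plot; the p. 29 Fig. 7 legend says "this type of plot is useful for visualizing the overall effect of experimental covariates and batch effects" (of course, it can be used other ways too). DESeq2 research article: http://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8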

Bioconductor RNA workflows (http://www.bioconductor.org/help/workflows/rnaseqGene/) - a bunch of useful stuff here: the regularized-logarithm transformation, heatmaps, PCA, etc.


biocore@umassmed.edu