3) Optimised packages and function libraries
Optimised packages using parallelism
caret - Classification And REgression Training
The caret package is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation
as well as other functionality.
http://topepo.github.io/caret/index.html
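As an illustration of the steps listed above, the following is a minimal sketch rather than a recommended workflow. It assumes the caret, doParallel and rpart packages are installed. The built-in iris data, the "rpart" model, the two-worker cluster and the 80/20 split are illustrative choices, not something this page prescribes. caret runs its resampling through foreach, so registering a parallel backend before calling train() should let the folds run on several workers.

```r
# Sketch only: runs if caret, doParallel and rpart are available, otherwise skips.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)

  # Register a small foreach backend (2 workers is an arbitrary choice).
  cl <- parallel::makeCluster(2)
  doParallel::registerDoParallel(cl)

  set.seed(42)
  # Data splitting: stratified 80/20 split of the iris data.
  in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
  training <- iris[in_train, ]
  testing  <- iris[-in_train, ]

  # Model tuning using resampling: 5-fold cross-validation.
  ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
  fit <- train(Species ~ ., data = training, method = "rpart",
               trControl = ctrl, tuneLength = 3)

  # Variable importance estimation and accuracy on the held-out rows.
  print(varImp(fit))
  accuracy <- mean(predict(fit, testing) == testing$Species)
  print(accuracy)

  # Release the workers and fall back to sequential execution.
  parallel::stopCluster(cl)
  foreach::registerDoSEQ()
}
```

The number of workers would normally be matched to the cores requested for the job rather than hard-coded.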
maanova - tools for analyzing Micro Array experiments
Analysis of N-dye microarray experiments using mixed-effects models. Includes analysis of variance, permutation and bootstrap tests, clustering and consensus trees.
http://www.bioconductor.org/packages/release/bioc/html/maanova.html
pvclust - hierarchical clustering with p-values
pvclust is an R package for assessing the uncertainty in hierarchical cluster analysis. For each cluster in a hierarchical clustering, quantities called p-values are calculated via multiscale bootstrap resampling. The p-value of a cluster is a value between 0 and 1 that indicates how strongly the cluster is supported by the data.
http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
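A small sketch of how the p-values described above might be obtained, assuming the pvclust package is installed. The toy matrix, the variable names and the low nboot value are made up for illustration. pvclust clusters the columns of its input. The example keeps parallel = FALSE so that it runs anywhere; setting parallel = TRUE asks pvclust to spread the bootstrap replicates over a cluster of workers.

```r
# Sketch only: runs if pvclust is available, otherwise skips.
if (requireNamespace("pvclust", quietly = TRUE)) {
  set.seed(1)
  # Toy data: 30 observations of 6 variables (hypothetical names v1..v6).
  x <- matrix(rnorm(30 * 6), nrow = 30,
              dimnames = list(NULL, paste0("v", 1:6)))
  # Make v2 a noisy copy of v1 so that one cluster is clearly supported.
  x[, 2] <- x[, 1] + rnorm(30, sd = 0.1)

  # nboot is kept small here for speed; real analyses use far more replicates.
  fit <- pvclust::pvclust(x, method.hclust = "average",
                          method.dist = "correlation",
                          nboot = 100, parallel = FALSE)

  # One approximately unbiased (AU) p-value per cluster (dendrogram edge).
  au <- fit$edges$au
  print(round(au, 2))
}
```

Calling plot(fit) afterwards draws the dendrogram with the p-values attached to each cluster.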
tm - Text Mining infrastructure in R
The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.
http://tm.r-forge.r-project.org/
varSelRF - variable Selection using Random Forests
Variable selection from random forests using both backwards variable elimination (for the selection of small sets of non-redundant variables) and selection based on the importance spectrum (somewhat similar to scree plots; for the selection of large sets of potentially highly correlated variables). The main applications are in high-dimensional data (e.g., microarray data, and other genomics and proteomics applications).
http://ligarto.org/rdiaz/Software/Software.html
bcp - Bayesian Analysis of Change Point Problems
This package provides an implementation of an approximation to the Barry and Hartigan (1993) product partition model for the normal errors change point problem using Markov Chain Monte Carlo. It also extends the methodology to independent multivariate series with an assumed common change point structure.
http://cran.r-project.org/web/packages/bcp/index.html
multtest - Resampling-based multiple hypothesis testing
Non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR). Several choices of bootstrap-based null distribution are implemented (centered, centered and scaled, quantile-transformed). Single-step and step-wise methods are available. Tests based on a variety of t- and F-statistics (including t-statistics based on regression parameters from linear and survival models as well as those based on correlation parameters) are included. When probing hypotheses with t-statistics, users may also select a potentially faster null distribution which is multivariate normal with mean zero and variance covariance matrix derived from the vector influence function. Results are reported in terms of adjusted p-values, confidence regions and test statistic cutoffs. The procedures are directly applicable to identifying differentially expressed genes in DNA microarray experiments.
http://www.bioconductor.org/packages/release/bioc/html/multtest.html
GAMBoost - Generalized linear and additive models by likelihood based boosting
This package provides routines for fitting generalized linear and generalized additive models by likelihood-based boosting, using penalized B-splines.
http://cran.r-project.org/web/packages/GAMBoost/index.html
partDSA - Partitioning using deletion, substitution, and addition moves
partDSA is a novel tool for generating a piecewise constant estimation list of increasingly complex predictors based on an intensive and comprehensive search over the entire covariate space.
http://cran.r-project.org/web/packages/partDSA/index.html
dclone - Data Cloning and MCMC Tools for Maximum Likelihood Methods
Low-level functions for implementing maximum likelihood estimation procedures for complex models using data cloning and Bayesian Markov chain Monte Carlo methods. Sequential and parallel MCMC support for JAGS, WinBUGS and OpenBUGS.
http://dcr.r-forge.r-project.org/
pmclust - Parallel Model-Based Clustering
pmclust aims to apply (unsupervised) model-based clustering to high-dimensional and ultra-large data, especially in a distributed manner. The package employs Rmpi to perform an expectation-gathering-maximization (EGM) algorithm for finite mixture Gaussian models. Unstructured dispersion matrices are assumed in the Gaussian models. By default, the implementation follows the single program multiple data (SPMD) programming model. The code can be executed through Rmpi and is independent of most MPI applications. See the High Performance Statistical Computing (HPSC) website for more information, documents and examples.
harvestr - A Parallel Simulation Framework
Functions for easy simulation and reproducibility.
SPRINT
R and Intel Maths Kernel Libraries (MKL)
The Intel Maths Kernel Libraries provide a set of optimised routines for linear algebra. These are the BLAS and MKL libraries. They can significantly improve execution speed for code dominated by linear algebra operations. A version of R has been built to use these libraries and can be accessed via the "module load R/???" command.
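One rough way to see whether an optimised build helps is to time a BLAS-heavy operation under each R build and compare the results. The sketch below uses only base R. The matrix size is an arbitrary choice, and the timing it prints depends entirely on the machine and the libraries R was linked against, so no expected figure is given.

```r
# Sketch only: time a matrix cross-product, which R hands to the BLAS library.
set.seed(1)
n <- 1000                               # arbitrary size; increase for a longer test
a <- matrix(rnorm(n * n), nrow = n, ncol = n)

timing <- system.time(b <- crossprod(a))  # computes t(a) %*% a
print(timing[["elapsed"]])                # wall-clock seconds for this build

# Recent versions of R list the BLAS and LAPACK libraries in use
# as part of the session information.
print(sessionInfo())
```

Running the same script under the standard build and under the MKL build gives a like-for-like comparison, provided both runs use the same node type and number of cores.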
To be completed when I've done the work
Under construction