# 3) Optimised packages and function libraries

# Optimised packages using parallelism

## caret - *C*lassification *A*nd *RE*gression *T*raining

The **caret** package is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation

as well as other functionality.

http://topepo.github.io/caret/index.html
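
As a rough illustration of where the parallelism comes in: `train()` runs its resampling iterations through **foreach**, so registering a parallel backend lets them run on several cores. The sketch below assumes the **caret**, **doParallel** and **rpart** packages are installed; the two-worker cluster, the `iris` data and the `rpart` model are illustrative choices only.

```r
library(parallel)
library(caret)
library(doParallel)

# Start two worker processes and register them as the foreach backend
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# 5-fold cross-validation; allowParallel = TRUE is the default, shown for clarity
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Shut the workers down and return foreach to sequential execution
stopCluster(cl)
registerDoSEQ()

print(fit$results)
```

On a shared system the number of workers should match the cores requested from the scheduler rather than a hard-coded value.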

## maanova - tools for analyzing Micro Array experiments

Analysis of N-dye microarray experiments using mixed-effects models. The package includes analysis of variance, permutation and bootstrap tests, clustering and consensus trees.

http://www.bioconductor.org/packages/release/bioc/html/maanova.html

## pvclust - hierarchical clustering with *p*-values

**pvclust** is an **R** package for assessing the uncertainty in hierarchical cluster analysis. For each cluster in hierarchical clustering, quantities called *p*-values are calculated via multiscale bootstrap resampling. The *p*-value of a cluster is a value between 0 and 1 that indicates how strongly the cluster is supported by the data.

http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
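
The bootstrap replicates are independent of one another, which is what makes this package a good fit for parallel execution. A minimal sketch, assuming **pvclust** version 2.0 or later (where `pvclust()` accepts a `parallel` argument) and using `mtcars` purely as example data:

```r
library(pvclust)

# pvclust clusters the columns of the data, so each mtcars variable is an item.
# parallel = 2L asks pvclust to create a two-worker cluster itself;
# iseed makes the parallel bootstrap reproducible.
result <- pvclust(scale(mtcars),
                  method.hclust = "average",
                  method.dist = "correlation",
                  nboot = 100,
                  parallel = 2L,
                  iseed = 1)

# Approximately unbiased (AU) p-values, one per cluster (edge) in the dendrogram
print(result$edges$au)
```

The value `nboot = 100` keeps the example quick; a real analysis would normally use many more replicates.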

## tm - *T*ext *M*ining infrastructure in R

The **tm** package offers functionality for managing text documents, abstracts the process of document manipulation and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents, making it easier to work with large, metadata-enriched document sets.

http://tm.r-forge.r-project.org/
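
A typical workflow builds a corpus, applies a chain of transformations and then derives a document-term matrix. The sketch below assumes the **tm** package is installed; the three example sentences are made up for illustration.

```r
library(tm)

# Three short example documents (invented for this sketch)
docs <- c("R makes parallel computing accessible.",
          "Text mining in R relies on the tm package.",
          "Parallel text mining scales to large corpora.")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(docs))

# Apply the same clean-up steps to every document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per document, one column per remaining term
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```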

## varSelRF - *var*iable *Sel*ection using *R*andom *F*orests

Variable selection from random forests using both backwards variable elimination (for the selection of small sets of non-redundant variables) and selection based on the importance spectrum (somewhat similar to scree plots; for the selection of large sets of potentially highly correlated variables). Its main applications are in high-dimensional data (e.g., microarray data and other genomics and proteomics applications).

http://ligarto.org/rdiaz/Software/Software.html
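
The backwards elimination can be sketched on simulated data as follows. This assumes the **varSelRF** package (and its **randomForest** dependency) is installed; the data, in which only the first two variables carry signal, are simulated for illustration, and the tree counts are kept small so the example runs quickly.

```r
library(varSelRF)

set.seed(1)

# 50 samples, 20 noise variables, two classes
x <- matrix(rnorm(50 * 20), ncol = 20)
colnames(x) <- paste0("v", 1:20)
cl <- factor(rep(c("A", "B"), each = 25))

# Shift the first two variables in class B so they become informative
x[cl == "B", 1:2] <- x[cl == "B", 1:2] + 2

# Repeatedly drop the least important 20% of variables and refit the forest
sel <- varSelRF(x, cl, ntree = 500, ntreeIterat = 300, vars.drop.frac = 0.2)

print(sel$selected.vars)
```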

## bcp - Bayesian Analysis of Change Point Problems

This package provides an implementation of an approximation to the Barry and Hartigan (1993) product partition model for the normal errors change point problem using Markov Chain Monte Carlo. It also extends the methodology to independent multivariate series with an assumed common change point structure.

http://cran.r-project.org/web/packages/bcp/index.html

## multtest - Resampling-based multiple hypothesis testing

Non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR). Several choices of bootstrap-based null distribution are implemented (centered, centered and scaled, quantile-transformed). Single-step and step-wise methods are available. Tests based on a variety of t- and F-statistics (including t-statistics based on regression parameters from linear and survival models as well as those based on correlation parameters) are included. When probing hypotheses with t-statistics, users may also select a potentially faster null distribution which is multivariate normal with mean zero and variance covariance matrix derived from the vector influence function. Results are reported in terms of adjusted p-values, confidence regions and test statistic cutoffs. The procedures are directly applicable to identifying differentially expressed genes in DNA microarray experiments.

http://www.bioconductor.org/packages/release/bioc/html/multtest.html

## GAMBoost - Generalized linear and additive models by likelihood based boosting

This package provides routines for fitting generalized linear and generalized additive models by likelihood-based boosting, using penalized B-splines.

http://cran.r-project.org/web/packages/GAMBoost/index.html

## partDSA - Partitioning using deletion, substitution, and addition moves

partDSA is a novel tool for generating a piecewise constant estimation list of increasingly complex predictors based on an intensive and comprehensive search over the entire covariate space.

http://cran.r-project.org/web/packages/partDSA/index.html

## dclone - Data Cloning and MCMC Tools for Maximum Likelihood Methods

Low level functions for implementing maximum likelihood estimating procedures for complex models using data cloning and Bayesian Markov chain Monte Carlo methods. Sequential and parallel MCMC support for JAGS, WinBUGS and OpenBUGS.

http://dcr.r-forge.r-project.org/

## pmclust - Parallel Model-Based Clustering

The **pmclust** package aims to apply (unsupervised) model-based clustering to high-dimensional and ultra-large data, especially in a distributed manner. The package employs Rmpi to perform an expectation-gathering-maximization (EGM) algorithm for finite Gaussian mixture models. Unstructured dispersion matrices are assumed in the Gaussian models. By default the implementation follows the single program multiple data (SPMD) programming model. The code can be executed through Rmpi and is independent of most MPI applications. See the High Performance Statistical Computing (HPSC) website for more information, documents and examples.

## harvestr: A Parallel Simulation Framework

Functions for easy simulation and reproducibility.
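
The underlying problem, getting the same random numbers from a parallel simulation on every run, can be illustrated without **harvestr** itself. The sketch below does not use the harvestr API; it swaps in the base **parallel** package and its L'Ecuyer random number streams, and the function names `simulate_mean` and `run_once` are hypothetical helpers introduced only for this example.

```r
library(parallel)

# One simulation replicate: the mean of 100 standard normal draws
simulate_mean <- function(i) mean(rnorm(100))

# Run eight replicates on a fresh two-worker cluster with a fixed seed
run_once <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # Give each worker its own reproducible random number stream
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, 1:8, simulate_mean))
}

first <- run_once(2024)
second <- run_once(2024)

# Same seed and same number of workers should give identical results
print(identical(first, second))
```

Reproducibility here depends on keeping the number of workers fixed, because `parLapply()` divides the replicates among the workers in a fixed order.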

## SPRINT

# R and Intel Maths Kernel Libraries (MKL)

The Intel Maths Kernel Libraries (MKL) provide a set of optimised routines for linear algebra, including implementations of the BLAS and LAPACK interfaces. Linking R against them can significantly improve the execution speed of matrix-heavy code. A version of R has been built to use these libraries and can be accessed via the "module load R/???" command.
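
One way to check which linear algebra libraries a given R build is using, and to get a rough feel for their speed, is sketched below. It uses only base R and assumes R 3.4.0 or later (for `La_library()` and the `"BLAS"` entry of `extSoftVersion()`); the matrix size is an arbitrary choice.

```r
# Report which BLAS and LAPACK libraries this R session is linked against
print(extSoftVersion()["BLAS"])
print(La_library())

# Time a dense matrix product, which R hands off to the BLAS library
set.seed(1)
n <- 1000
a <- matrix(rnorm(n * n), nrow = n)
timing <- system.time(b <- a %*% a)
print(timing["elapsed"])
```

Running the same script under the standard R build and the MKL-linked build gives a simple like-for-like comparison.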

To be completed when I've done the work

Under construction