HPSC -- High Performance Statistical Computing for Data Intensive Research

Section: Cookbook

The parallel code on this page is written in the single program multiple data (SPMD) style and uses pbdMPI by default. It is not necessarily optimized and may run efficiently only in some circumstances, so no performance results are reported. The main purpose is to show how to take existing serial code and rewrite or rethink it in parallel. Although the examples are deliberately simplified for illustration, the ideas extend to more complex cases with real data. We aim to explain parallel ideas from a statistical point of view, since a good sense of statistics helps when thinking about applications and redesigning algorithms. The recipes listed below are intended for analyzing ultra-large/unlimited datasets. For how to run the code, see the Rscript page.

  1. Binning -- Table Cutting and Binning, simple nonparametric method.
  2. Basic -- Sample Mean and Sample Variance.
  3. Quantile -- Quantile or Percentile.
  4. OLS -- Ordinary Least Squares for Linear Models.
  5. MVN -- Log Likelihood of Multivariate Normal Distribution.
  6. PCA -- Principal Component Analysis.
  7. Model-Based Clustering -- Finite Mixture Model and EGM Algorithm, and its older brother
    K-means -- Distance-Based Clustering.


Binning -- Table Cutting, Binning, and Nonparametric Statistics.

Counting is a fundamental operation in statistics: it underlies frequencies, proportions, probabilities, and so on, and it is an essential tool for categorical data analysis. For small data, R already bins values into given categories/breaks efficiently. Based on the same idea demonstrated here, many statistical computations can be parallelized in the same way for large datasets.

Serial code: (ex_binning_serial.r)
    
# File name: ex_binning_serial.r
# Run: Rscript --vanilla ex_binning_serial.r

### A famous example from help("cut") in R.
set.seed(1234)
N <- 100
y <- rnorm(N)

### Based on breaks to count data.
table(cut(y, breaks = pi / 3 * (-3:3)))

Parallel (SPMD) code: (ex_binning_spmd.r, for ultra-large/unlimited N)
    
# File name: ex_binning_spmd.r
# Run: mpiexec -np 2 Rscript --vanilla ex_binning_spmd.r

### Load pbdMPI and initialize the communicator.
library(pbdMPI, quietly = TRUE)
init()

### Main codes start from here.
set.seed(1234)
N <- 100
y <- rnorm(N)

### Each processor takes only its own share of the data (needed if N is ultra-large).
id.get <- get.jid(N)
y.spmd <- y[id.get]

### Based on breaks to count data.
bin.spmd <- table(cut(y.spmd, breaks = pi / 3 * (-3:3)))
bin <- as.array(allreduce(bin.spmd, op = "sum"))
dimnames(bin) <- dimnames(bin.spmd)
class(bin) <- class(bin.spmd)

### allreduce(...) leaves the total on every rank; comm.print(...) prints from rank 0 by default.
comm.print(bin)
finalize()
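
The reduction step works because bin counts are additive: counting each block of y against the same breaks and summing the per-block tables gives the same result as counting all of y at once. The following is a minimal sketch of that idea in plain serial R, with no MPI involved. The four "ranks", the split() based partition (a hypothetical stand-in for get.jid(), whose actual partition may differ), and Reduce() in place of allreduce() are assumptions made for illustration only.

```r
### Sketch only: emulate 4 SPMD ranks inside one serial R session.
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

### Reference answer, as in the serial code.
bin.serial <- table(cut(y, breaks = breaks))

### Hypothetical stand-in for get.jid(): one contiguous index block per "rank".
n.rank <- 4
id.list <- split(seq_len(N), cut(seq_len(N), n.rank, labels = FALSE))

### Each "rank" bins only its own block, using the same breaks.
bin.list <- lapply(id.list, function(id) table(cut(y[id], breaks = breaks)))

### The element-wise sum plays the role of allreduce(..., op = "sum").
bin.sum <- Reduce(`+`, bin.list)

### The combined counts agree with the serial counts.
stopifnot(all(bin.sum == bin.serial))
print(bin.sum)
```

Because every block is cut with the same breaks, all per-block tables share the same levels, including empty bins, so the element-wise sum is well defined.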

Exercise:

  1. Try the R function tabulate() to replace table().
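
As a possible starting point for this exercise, the sketch below counts the factor returned by cut() with tabulate(), which returns a plain integer vector rather than a table object. The names breaks, f, and bin.tab are chosen here for illustration. Values outside the breaks become NA and are ignored by tabulate(), just as table() drops them by default.

```r
### Sketch only: serial binning with tabulate() instead of table().
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

### cut() returns a factor; tabulate() counts its integer codes.
f <- cut(y, breaks = breaks)
bin.tab <- tabulate(f, nbins = nlevels(f))
names(bin.tab) <- levels(f)

### Same counts as table(), but stored as a plain named integer vector.
stopifnot(all(bin.tab == as.vector(table(f))))
print(bin.tab)
```

A plain integer vector may be simpler to pass to allreduce() in the SPMD version, since there are no dimnames or class attributes to restore afterwards.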


Created: Oct 19 2011
Last Revised: Feb 13 2013, 12:20 (CDT Ames, IA, USA)
Maintained: Wei-Chen Chen
E-Mail: wccsnow @ gmail.com