Section: Cookbook
Note that the parallel codes provided on this page are designed for
single program multiple data (SPMD) and use
pbdMPI
by default.
The parallel codes are not necessarily optimized and may perform efficiently
only in some circumstances, so performance reports are omitted.
The essential purpose is to show how to take existing
serial codes and rewrite/rethink them in parallel.
Although the examples given here are all extremely simplified
for illustration, the ideas extend to
more complex cases with real data in real situations.
We aim to explain parallel ideas from a statistical point of view;
a good sense of statistics helps in thinking about applications
and redesigning algorithms.
Some recipes for analyzing ultra-large/unlimited datasets are
listed below.
For details on running the codes, see
Rscript.
- Binning -- Table Cutting and Binning, simple nonparametric method.
- Basic -- Sample Mean and Sample Variance.
- Quantile -- Quantile or Percentile.
- OLS -- Ordinary Least Squares for Linear Models.
- MVN -- Log Likelihood of Multivariate Normal Distribution.
- PCA -- Principal Component Analysis.
- Model-Based Clustering -- Finite Mixture Model and EM Algorithm, and its older brother K-means -- Distance-Based Clustering.
Binning -- Table Cutting, Binning, and Nonparametric Statistics.
Counting is a fundamental method in statistics; it underlies the computation
of frequencies, proportions, and probabilities. It is also an essential tool
for categorical data analysis. For small data, R already bins data
efficiently given categories/breaks.
Based on the same idea demonstrated here, many statistical concepts can
be parallelized in the same way for large datasets.
Serial code:
(ex_binning_serial.r)
|
|
# File name: ex_binning_serial.r
# Run: Rscript --vanilla ex_binning_serial.r
### A famous example from help("cut") in R.
set.seed(1234)
N <- 100
y <- rnorm(N)
### Based on breaks to count data.
table(cut(y, breaks = pi / 3 * (-3:3)))
|
Parallel (SPMD) code:
(ex_binning_spmd.r
for ultra-large/unlimited datasets)
|
|
# File name: ex_binning_spmd.r
# Run: mpiexec -np 2 Rscript --vanilla ex_binning_spmd.r
### Load pbdMPI and initialize the communicator.
library(pbdMPI, quietly = TRUE)
init()
### Main codes start from here.
set.seed(1234)
N <- 100
y <- rnorm(N)
### Load data partially by processors if N is ultra-large.
id.get <- get.jid(N)
y.spmd <- y[id.get]
### Based on breaks to count data.
bin.spmd <- table(cut(y.spmd, breaks = pi / 3 * (-3:3)))
bin <- as.array(allreduce(bin.spmd, op = "sum"))
dimnames(bin) <- dimnames(bin.spmd)
class(bin) <- class(bin.spmd)
### allreduce(...) leaves the total on every rank; comm.print(...) prints
### from rank 0 only by default.
comm.print(bin)
finalize()
|
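The SPMD code works because bin counts are additive: counting each chunk against the same breaks and summing the local counts gives the same result as counting the whole vector. The following is a minimal serial sketch of that idea, runnable without MPI. The names n.chunks, id.list, bin.list, bin.sum, and bin.all are hypothetical, and the contiguous index split is only a stand-in for what get.jid(N) hands to each processor, not a reproduction of it.

```r
# Serial sketch (no MPI): bin counts are additive across chunks.
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

# Assumption: pretend there are 4 processors and split the indices 1..N
# into 4 contiguous blocks, as a stand-in for get.jid(N) on each rank.
n.chunks <- 4
id.list <- split(seq_len(N), ceiling(seq_len(N) * n.chunks / N))

# Each pretend processor bins only its own chunk with the shared breaks.
bin.list <- lapply(id.list, function(id) table(cut(y[id], breaks = breaks)))

# Element-wise sum of the local counts, the serial analogue of
# allreduce(bin.spmd, op = "sum").
bin.sum <- Reduce(`+`, bin.list)

# Binning the whole vector at once should give the same counts.
bin.all <- table(cut(y, breaks = breaks))
print(bin.sum)
stopifnot(all(as.vector(bin.sum) == as.vector(bin.all)))
```

The same argument applies to any statistic that can be written as a sum over observations, which is why the breaks must be identical on every processor.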
Exercise:
- Try the R function
tabulate() to replace table().
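One possible starting point for this exercise, as a sketch rather than the intended solution: tabulate() counts positive integer codes, so the factor returned by cut() can be converted to integers first. The names breaks, bin.id, and count are hypothetical.

```r
# Serial sketch: count bins with tabulate() instead of table().
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

# cut() returns a factor; its integer codes index the bins, and values
# outside the breaks become NA.
bin.id <- as.integer(cut(y, breaks = breaks))

# tabulate() ignores NA. Passing nbins explicitly fixes the length of the
# result, so empty trailing bins are still reported as 0.
count <- tabulate(bin.id, nbins = length(breaks) - 1)
print(count)

# The counts agree with the table() version from the serial code.
stopifnot(all(count == as.vector(table(cut(y, breaks = breaks)))))
```

The result is a plain integer vector without bin labels. In an SPMD version this should make the conversion steps after allreduce(...) unnecessary, at the cost of having to attach the labels separately if they are wanted in the output.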
|