Running open-source R on EMR with RHadoop packages and RStudio

Further documentation is available in the AWS Big Data Blog post.
Copy all files to an S3 bucket.
The following command starts an EMR cluster with R, RHadoop, and RStudio installed.
Replace <YOUR-X> with your own values, and set $region and $keypair to your AWS region and EC2 key pair name:

aws emr create-cluster --name emR-example \
--ami-version 3.2.1 \
--region $region \
--ec2-attributes KeyName=$keypair \
--no-auto-terminate \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
--bootstrap-actions \
Args=[--rstudio,--rhdfs,--plyrmr,--rexamples] \
--steps \

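Before running the command above, the files have to be in S3 and the shell variables $region and $keypair need values. The following is a minimal sketch of that preparation, not part of the original instructions: the region, key pair name, and bucket name are hypothetical placeholders, and the upload command is only printed so it can be reviewed first.

```shell
# Hypothetical placeholder values (assumptions); replace them with your own.
region="us-east-1"
keypair="my-keypair"
bucket="my-emr-bucket"

# Command that would copy every file in the current directory to the bucket.
upload_cmd="aws s3 cp . s3://${bucket}/ --recursive --region ${region}"

# Print the command for review instead of running it.
echo "$upload_cmd"
```

Once the printed command looks right, it can be run directly in the same shell, after which $region and $keypair are already set for the create-cluster call.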
File documentation

  • - installs RStudio and RHadoop packages depending on the provided arguments on all EMR instances
    • --rstudio - installs rstudio-server, default false
    • --rexamples - adds R examples to the user home directory, default false
    • --rhdfs - installs the rhdfs package, default false
    • --plyrmr - installs the plyrmr package, default false
    • --updater - installs the latest R version, default false
    • --user - sets the RStudio user name, default "rstudio"
    • --user-pw - sets the password for the user given by --user, default "rstudio"
    • --rstudio-port - sets the RStudio port, default 80
  • - fixes /tmp permission in hdfs to provide temporary storage for R streaming jobs
  • rmr2_example.R - simple example of MapReduce jobs with R
  • biganalyses_example.R - larger script with more extensive analyses using the plyrmr package
  • change_pw.R - simple script to change the Unix password of the rstudio user from within an R session
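
With the --rstudio option, RStudio is served on the master node on the port set by --rstudio-port (default 80), with the default user "rstudio". As a rough sketch, the login URL can be assembled as shown below. The helper rstudio_url and the host name are hypothetical, and whether the port is reachable also depends on the cluster's security group rules or an SSH tunnel.

```shell
# Default port taken from the option list above.
rstudio_port=80

# Hypothetical helper: print the RStudio URL for a master host name
# and an optional port (falls back to the default port).
rstudio_url() {
  host="$1"
  port="${2:-$rstudio_port}"
  echo "http://${host}:${port}"
}

# Hypothetical master public DNS name. On a real cluster it could be looked up with:
#   aws emr describe-cluster --cluster-id <cluster-id> \
#     --query 'Cluster.MasterPublicDnsName' --output text
master_dns="ec2-203-0-113-10.compute-1.amazonaws.com"

rstudio_url "$master_dns"
```

If the cluster was started with a different --rstudio-port value, pass it as the second argument, for example rstudio_url "$master_dns" 8787.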

