Run in a cluster

The following steps can be performed on any Linux machine, inside or outside Amazon EC2.

1. Set your AWS Credentials-

Assuming that you already have an AWS account set up, one option is to export the following environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
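Before moving on, it can be worth confirming that both variables are actually exported in the current shell. The following is a minimal sketch, not part of the AWS or Flintrock tooling; `check_aws_credentials` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: succeed only if both AWS credential variables are
# set and non-empty in the current shell.
check_aws_credentials() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Read the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Chaining the helper in front of later commands with `&&` would stop early if either variable is missing.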

2. Install Flintrock-

Flintrock is a command-line tool for launching Apache Spark clusters. It requires Python 3.4 or newer, unless you use one of the standalone packages. See the Flintrock documentation for more information.

Recommended Way

Install pip for Python 3 if it is not already present (the command below is for Debian/Ubuntu):

sudo apt update && sudo apt install python3-pip

To get the latest release of Flintrock, simply run pip:

sudo pip3 install flintrock
flintrock --version

3. Configure Flintrock-

Flintrock lets you persist your desired configuration to a YAML file so that you don't have to keep typing out the same options over and over at the command line.

To set up and edit the default config file, run the following command:

flintrock configure

Sample config.yaml

services:
  hdfs:
    version: 2.8.5
  spark:
    version: 2.3.3

provider: ec2

providers:
  ec2:
    key-name: key_name # change accordingly
    identity-file: /path/to/key.pem # change accordingly
    instance-type: f1.2xlarge
    region: us-east-1
    ami: ami-0d1172c23618d21d5 # InAccel's AMI
    user: ec2-user
    min-root-ebs-size-gb: 35 # feel free to change
    ebs-optimized: yes
    instance-initiated-shutdown-behavior: stop

launch:
  num-slaves: 4 # feel free to change
  spark-executor-instances: 8 # 2xlarge
  install-hdfs: True
  install-spark: True

debug: false
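SSH generally refuses private keys whose permissions allow access by group or others, so it can help to check the file named in `identity-file` before launching. The following is a minimal sketch; `check_identity_file` is a hypothetical helper introduced here, and the path passed to it is whatever you set in the config.

```shell
# Hypothetical helper: check that the private key exists and is not
# accessible by group or others.
check_identity_file() {
  key_path="$1"
  if [ ! -f "$key_path" ]; then
    echo "identity file not found: $key_path" >&2
    return 1
  fi
  # Characters 5-10 of the `ls -l` mode string are the group and other bits.
  perms=$(ls -ld "$key_path" | cut -c5-10)
  if [ "$perms" != "------" ]; then
    echo "permissions too open; try: chmod 400 $key_path" >&2
    return 1
  fi
  return 0
}
```

For the sample config above, this would be called as `check_identity_file /path/to/key.pem`.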

4. Create a new cluster-

With a config file like the one above, you can launch a cluster with a single command:

flintrock launch inaccel-demo-cluster

Since AWS performance is highly variable, the exact launch time cannot be predicted. A typical launch of a medium-sized cluster takes around 10 minutes. After it finishes, log in to the master node:

flintrock login inaccel-demo-cluster
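To confirm that the cluster came up as expected, `flintrock describe` prints its state and node addresses. The following is a minimal sketch that chains it after the launch; `launch_and_describe` and the `FLINTROCK` override variable are hypothetical names introduced here so the sequence can be dry-run without touching AWS.

```shell
# Hypothetical wrapper: launch the cluster, then print its state.
# Set FLINTROCK=echo to print the commands instead of running them.
launch_and_describe() {
  cluster="$1"
  flintrock_bin="${FLINTROCK:-flintrock}"
  # Stop here if the launch itself fails.
  "$flintrock_bin" launch "$cluster" || return 1
  "$flintrock_bin" describe "$cluster"
}
```

With the cluster name used in this guide, the call would be `launch_and_describe inaccel-demo-cluster`.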

5. Install InAccel-

Clone the InAccel repository on the master node and source its setup script:

git clone https://bitbucket.org/inaccel/release.git inaccel && source inaccel/setup.sh

6. Submit applications-

  • InAccel OFF:

spark-submit [ arguments ]

  • InAccel ON:

spark-submit --inaccel [ arguments ]
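To compare the two modes on the same application, both submissions can be wrapped in one helper. The following is a minimal sketch, assuming the setup script from step 5 has made the `--inaccel` flag available; `compare_runs` and the `SPARK_SUBMIT` override variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: submit the same application twice, first without
# and then with InAccel acceleration.
# Set SPARK_SUBMIT=echo to print the argument lists instead of submitting.
compare_runs() {
  spark_submit_bin="${SPARK_SUBMIT:-spark-submit}"
  echo "== InAccel OFF =="
  "$spark_submit_bin" "$@"
  echo "== InAccel ON =="
  "$spark_submit_bin" --inaccel "$@"
}
```

The helper takes the same arguments you would otherwise pass to `spark-submit` and forwards them unchanged to both runs.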

7. Destroy the cluster-

Once you're done using the cluster, don't forget to destroy it with:

flintrock destroy inaccel-demo-cluster