Run in a cluster

The following steps can be performed on any machine with a Linux distribution (inside or outside Amazon EC2).

1. Set your AWS credentials

Assuming that you already have an AWS account set up, one option is to export the following environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
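Before moving on, it may be worth confirming that both variables are actually visible in the current shell. The following is a minimal sketch, not part of Flintrock or the AWS tooling; the function name check_aws_env is hypothetical.

```shell
# Return 0 only if both AWS credential variables are set and non-empty.
# check_aws_env is a hypothetical helper, not an AWS or Flintrock command.
check_aws_env() {
    for _name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Look up the variable whose name is stored in $_name (POSIX-compatible).
        eval "_value=\${$_name:-}"
        if [ -z "$_value" ]; then
            echo "missing: $_name" >&2
            return 1
        fi
    done
    return 0
}
```

Calling check_aws_env after the two export lines should return success; if either variable is empty or unset, it names the missing one on standard error.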

2. Install Flintrock

Flintrock is a command-line tool for launching Apache Spark clusters. It requires Python 3.4 or newer, unless you use one of the standalone packages. See the Flintrock project page for more information.

Recommended Way

To get the latest release of Flintrock, simply run pip:

pip3 install flintrock
flintrock --version
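If the pip install fails, one thing to rule out is an interpreter older than the stated minimum. The sketch below compares dotted version strings with sort -V, which is assumed to be available (GNU coreutils and recent BSD sort provide it); version_ge and python_ok are hypothetical helper names.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
# Relies on "sort -V" (version sort), assumed to be available.
version_ge() {
    # After a version sort, the smaller of the two strings comes first.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Succeed if the python3 on PATH meets the 3.4 minimum stated above.
python_ok() {
    command -v python3 >/dev/null 2>&1 || return 1
    # "python3 --version" prints a line such as "Python 3.8.10"; keep field 2.
    version_ge "$(python3 --version 2>&1 | awk '{print $2}')" 3.4
}
```

Running python_ok before pip3 install flintrock gives a quick yes/no answer without installing anything.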

3. Configure Flintrock

Flintrock lets you persist your desired configuration to a YAML file so that you don't have to keep typing out the same options over and over at the command line.

To set up and edit the default config file, run the following command:

flintrock configure

Sample config.yaml:

services:
  hdfs:
    version: 2.8.5
  spark:
    version: 2.4.3

provider: ec2

providers:
  ec2:
    key-name: key_name # change accordingly
    identity-file: /path/to/key.pem # change accordingly
    instance-type: f1.2xlarge
    region: us-east-1
    ami: ami-03d52aaea4f50b9fa # InAccel's AMI
    user: centos
    min-root-ebs-size-gb: 32 # feel free to change
    ebs-optimized: yes
    instance-initiated-shutdown-behavior: stop
    user-data: /path/to/user-data/script.sh # change accordingly

launch:
  num-slaves: 4 # feel free to change
  spark-executor-instances: 8 # 2xlarge
  install-hdfs: True
  install-spark: True

debug: false
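The sample above contains placeholder values marked "change accordingly". A small check like the following can catch a config that was saved without replacing them. This is only a sketch: config_has_placeholders is a hypothetical helper, and the path to the config file is passed in as an argument rather than assumed.

```shell
# Succeed (and print the offending lines) if the given config file still
# contains placeholder values copied from the sample config.
# config_has_placeholders is a hypothetical helper, not a Flintrock command.
config_has_placeholders() {
    grep -n -E '/path/to/|key-name: key_name' "$1"
}
```

For example, config_has_placeholders /some/dir/config.yaml prints each line that still holds a sample value, and returns a non-zero status when none are found.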

Use the user-data field to point to a script that configures the InAccel Coral license. Click here to automatically generate a new free license key!

Sample script.sh:

#!/bin/bash

inaccel config license <key>
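If you prefer to generate the user-data script rather than edit it by hand, something along these lines could work. It is a sketch under assumptions: write_user_data is a hypothetical helper, and the key and output path are supplied by the caller; the generated file contains the same lines as the sample script.sh.

```shell
# Write a user-data script that registers the given license key.
# $1 = license key, $2 = output path. Hypothetical helper for illustration.
write_user_data() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: write_user_data KEY OUTPUT_PATH" >&2
        return 1
    fi
    # Same content as the sample script.sh, with the key filled in.
    cat > "$2" <<EOF
#!/bin/bash

inaccel config license $1
EOF
    chmod +x "$2"
}
```

The output path is the one you would then reference in the user-data field of config.yaml.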

4. Create a new cluster

With a config file like that, you can now launch a cluster with just this:

flintrock launch inaccel-demo-cluster

Because AWS performance is highly variable, the exact launch time cannot be predicted. A typical launch of a medium-sized cluster takes around 10 minutes. After it finishes, log in to the master node:

flintrock login inaccel-demo-cluster

5. Install the InAccel release

Clone the InAccel repository on the master node and source its setup script:

git clone https://bitbucket.org/inaccel/release.git inaccel && source inaccel/setup.sh

6. Submit applications

  • InAccel OFF:

spark-submit [ arguments ]

  • InAccel ON:

spark-submit --inaccel [ arguments ]
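To switch between the two modes without editing the command each time, the flag can be driven by an environment variable. The sketch below only prints the command line so it can be inspected first; submit_cmd and the INACCEL variable are hypothetical names, and the arguments in the usage note are placeholders.

```shell
# Print (rather than run) the spark-submit command line.
# INACCEL=on adds the --inaccel flag; any other value omits it.
# submit_cmd and INACCEL are hypothetical names used for illustration.
submit_cmd() {
    if [ "${INACCEL:-off}" = "on" ]; then
        echo "spark-submit --inaccel $*"
    else
        echo "spark-submit $*"
    fi
}
```

For instance, with INACCEL=on set, submit_cmd --class Demo app.jar prints spark-submit --inaccel --class Demo app.jar, where Demo and app.jar stand in for your own application.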

7. Destroy the cluster

Once you're done using the cluster, don't forget to destroy it with:

flintrock destroy inaccel-demo-cluster