Machine Learning Examples
The following example applications demonstrate how to seamlessly accelerate your Spark ML pipelines.
Resources:
Datasets
The datasets are stored in the popular LibSVM format. Each record is a vector of pixel values representing an image of a handwritten letter or digit.
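For readers new to the format: each LibSVM line holds a label followed by sparse index:value pairs, with 1-based indices in ascending order and zero-valued features omitted. Every run command on this page passes --numFeatures 784, which would correspond to 28x28-pixel images. The sketch below reports the largest feature index in a file so that it can be compared against that value. The name libsvm_max_index is hypothetical and not part of the InAccel tooling.

```shell
# A LibSVM line looks like:  <label> <index>:<value> <index>:<value> ...
# for example:               7 153:0.5 154:0.99 182:0.25

# Print the largest feature index found in the LibSVM file "$1".
# Hypothetical helper, not part of the InAccel distribution.
libsvm_max_index() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, ":")                 # kv[1] is the index, kv[2] the value
      if (kv[1] + 0 > max) max = kv[1] + 0
    }
  } END { print max + 0 }' "$1"
}
```

Running libsvm_max_index on one of the loaded files should print a value no larger than the --numFeatures you intend to pass. Expect a full scan of the 24GB mnist8m file to take a while.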
Load datasets into target file system
The datasets will be read from the inaccel-demo S3 bucket.
${INACCEL_HOME}/bin/load-demo-data-file nist/letters_libsvm_train.dat
${INACCEL_HOME}/bin/load-demo-data-file nist/letters_libsvm_test.dat
${INACCEL_HOME}/bin/load-demo-data-hdfs nist/letters_libsvm_train.dat
${INACCEL_HOME}/bin/load-demo-data-hdfs nist/letters_libsvm_test.dat
${INACCEL_HOME}/bin/load-demo-data-file nist/mnist8m_libsvm.dat
${INACCEL_HOME}/bin/load-demo-data-hdfs nist/mnist8m_libsvm.dat
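The run commands further down read the local copies from ${INACCEL_HOME}/data/nist/ and the HDFS copies from hdfs:///inaccel/data/nist/. A minimal sketch for confirming that the local copies landed where those commands expect them follows. check_demo_data is a hypothetical helper, and the directory is inferred from the file:// paths used on this page.

```shell
# Report which demo datasets exist under ${INACCEL_HOME}/data/nist.
# Returns non-zero if any file is missing.
# Hypothetical helper, not part of the InAccel distribution.
check_demo_data() {
  dir="${INACCEL_HOME:?INACCEL_HOME is not set}/data/nist"
  missing=0
  for f in letters_libsvm_train.dat letters_libsvm_test.dat mnist8m_libsvm.dat; do
    if [ -f "${dir}/${f}" ]; then
      echo "found   ${dir}/${f}"
    else
      echo "missing ${dir}/${f}"
      missing=1
    fi
  done
  return "${missing}"
}
```

For the HDFS copies, listing the directory with hdfs dfs -ls /inaccel/data/nist serves the same purpose, assuming the hdfs client is on your PATH.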
Logistic Regression
Usage: LogisticRegressionExample [options] trainSet testSet
trainSet Input path to train dataset (in libsvm format).
testSet Input path to test dataset (in libsvm format).
--numFeatures <value> Number of features.
--elasticNetParam <value>
ElasticNet parameter. (Default: 0.5)
--maxIter <value> Maximum number of iterations. (Default: 100)
--tol <value> The convergence tolerance of iterations. Smaller value will lead to better accuracy with the cost of more iterations. (Default: 1.0E-6)
You can read more about Logistic Regression in the classification and regression section of the MLlib Programming Guide.
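For context, in Spark ML's own logistic regression the elasticNetParam (written α below) mixes the L1 and L2 penalties on the weight vector w, scaled by a regularization strength λ. Assuming the accelerated example follows the same convention, which the usage text does not state, the regularizer is:

```latex
R(\mathbf{w}) = \alpha \, \lambda \, \lVert \mathbf{w} \rVert_1
              + (1 - \alpha) \, \frac{\lambda}{2} \, \lVert \mathbf{w} \rVert_2^2,
\qquad \alpha \in [0, 1], \quad \lambda \ge 0
```

With α = 1 this reduces to an L1 (lasso) penalty and with α = 0 to an L2 (ridge) penalty; the default of 0.5 sits midway between the two. Note that the usage text lists no option for λ.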
Handwritten Letters classification Example (letters 380MB)
Run LogisticRegression Application [src]
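For a non-accelerated execution, omit --inaccel. The first command runs in local mode against the local file system; the second submits to a standalone Spark master and reads from HDFS.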
${SPARK_HOME}/bin/run-example \
--inaccel \
--master local[*] \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.LogisticRegressionExample \
file://${INACCEL_HOME}/data/nist/letters_libsvm_train.dat \
file://${INACCEL_HOME}/data/nist/letters_libsvm_test.dat \
--numFeatures 784
${SPARK_HOME}/bin/run-example \
--inaccel \
--master spark://$(hostname):7077 \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.LogisticRegressionExample \
hdfs:///inaccel/data/nist/letters_libsvm_train.dat \
hdfs:///inaccel/data/nist/letters_libsvm_test.dat \
--numFeatures 784
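Since the only difference between an accelerated and a non-accelerated run is the --inaccel flag, the two can be compared back to back. The sketch below assembles the local-mode command shown above and prints it rather than running it, so it can be inspected first. lr_letters_cmd and its accel/cpu argument are hypothetical names introduced here for illustration.

```shell
# Print the local-mode LogisticRegression command for the letters dataset.
# $1 = "accel" (default) adds --inaccel; any other value omits it.
# Hypothetical helper: it only echoes the command, it does not run Spark.
lr_letters_cmd() {
  flag=""
  if [ "${1:-accel}" = "accel" ]; then
    flag="--inaccel "
  fi
  echo "${SPARK_HOME}/bin/run-example ${flag}--master local[*]" \
    "--driver-memory 80g --executor-memory 10g" \
    "--jars ${INACCEL_SPARK_EXAMPLES}" \
    "inaccel.ml.LogisticRegressionExample" \
    "file://${INACCEL_HOME}/data/nist/letters_libsvm_train.dat" \
    "file://${INACCEL_HOME}/data/nist/letters_libsvm_test.dat" \
    "--numFeatures 784"
}
```

Calling lr_letters_cmd cpu prints the non-accelerated variant; piping the output to sh (for example, lr_letters_cmd cpu | sh) would execute it, assuming SPARK_HOME, INACCEL_HOME and INACCEL_SPARK_EXAMPLES are set as in the commands above.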
Handwritten Digits classification Example (mnist8m 24GB)
Run LogisticRegression Application [src]
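For a non-accelerated execution, omit --inaccel. The first command runs in local mode against the local file system; the second submits to a standalone Spark master and reads from HDFS.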
${SPARK_HOME}/bin/run-example \
--inaccel \
--master local[*] \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.LogisticRegressionExample \
file://${INACCEL_HOME}/data/nist/mnist8m_libsvm.dat \
file://${INACCEL_HOME}/data/nist/mnist8m_libsvm.dat \
--numFeatures 784
${SPARK_HOME}/bin/run-example \
--inaccel \
--master spark://$(hostname):7077 \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.LogisticRegressionExample \
hdfs:///inaccel/data/nist/mnist8m_libsvm.dat \
hdfs:///inaccel/data/nist/mnist8m_libsvm.dat \
--numFeatures 784
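Both commands above pass the same mnist8m file as trainSet and testSet, so the model is evaluated on the data it was trained on. If a held-out evaluation is wanted, one option is to split the file beforehand. The sketch below does this with awk; the function name, the output file names and the 1-in-10 ratio are illustrative assumptions, and the split is deterministic rather than random.

```shell
# Split the LibSVM file "$1" into "$2" (train) and "$3" (test),
# sending every Nth line to the test file ($4, default 10).
# Hypothetical helper, not part of the InAccel distribution.
split_libsvm() {
  awk -v train_file="$2" -v test_file="$3" -v n="${4:-10}" '{
    if (NR % n == 0) print > (test_file)
    else             print > (train_file)
  }' "$1"
}
```

The two resulting files could then be passed as trainSet and testSet in place of the single mnist8m path.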
K-Means
Usage: KMeansExample [options] trainSet testSet
trainSet Input path to train dataset (in libsvm format).
testSet Input path to test dataset (in libsvm format).
--numFeatures <value> Number of features.
--K <value> K parameter. (Default: 64)
--maxIter <value> Maximum number of iterations. (Default: 100)
--tol <value> The convergence tolerance of iterations. Smaller value will lead to better distance with the cost of more iterations. (Default: 1.0E-6)
You can read more about K-Means in the clustering section of the MLlib Programming Guide.
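For reference, the standard K-Means objective is the within-cluster sum of squared distances between each point x_i and its nearest of the K centers μ_k. This is the textbook formulation; whether the accelerated example reports exactly this quantity is an assumption.

```latex
J(\mu_1, \dots, \mu_K) = \sum_{i=1}^{n} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert_2^2
```

Each iteration assigns points to their nearest center and then recomputes the centers, which cannot increase J. In Spark ML's KMeans the tolerance bounds how far the centers may still move before the run is considered converged, which is why a smaller --tol can cost more iterations.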
Handwritten Letters clustering Example (letters 380MB)
Run KMeans Application [src]
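For a non-accelerated execution, omit --inaccel. The first command runs in local mode against the local file system; the second submits to a standalone Spark master and reads from HDFS.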
${SPARK_HOME}/bin/run-example \
--inaccel \
--master local[*] \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.KMeansExample \
file://${INACCEL_HOME}/data/nist/letters_libsvm_train.dat \
file://${INACCEL_HOME}/data/nist/letters_libsvm_test.dat \
--numFeatures 784
${SPARK_HOME}/bin/run-example \
--inaccel \
--master spark://$(hostname):7077 \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.KMeansExample \
hdfs:///inaccel/data/nist/letters_libsvm_train.dat \
hdfs:///inaccel/data/nist/letters_libsvm_test.dat \
--numFeatures 784
Handwritten Digits clustering Example (mnist8m 24GB)
Run KMeans Application [src]
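For a non-accelerated execution, omit --inaccel. The first command runs in local mode against the local file system; the second submits to a standalone Spark master and reads from HDFS.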
${SPARK_HOME}/bin/run-example \
--inaccel \
--master local[*] \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.KMeansExample \
file://${INACCEL_HOME}/data/nist/mnist8m_libsvm.dat \
file://${INACCEL_HOME}/data/nist/mnist8m_libsvm.dat \
--numFeatures 784
${SPARK_HOME}/bin/run-example \
--inaccel \
--master spark://$(hostname):7077 \
--driver-memory 80g \
--executor-memory 10g \
--jars ${INACCEL_SPARK_EXAMPLES} \
inaccel.ml.KMeansExample \
hdfs:///inaccel/data/nist/mnist8m_libsvm.dat \
hdfs:///inaccel/data/nist/mnist8m_libsvm.dat \
--numFeatures 784