Deep Learning on Qubole Using BigDL for Apache Spark – Part 1

In this Part 1 of a two-part series, you will learn how to get started with the distributed Deep Learning library BigDL on Qubole. By the end, you will have BigDL installed on a Spark cluster and ready to use in your Deep Learning applications running on Qubole.

In Part 2, you will learn how to write a Deep Learning application on Qubole that uses BigDL to identify handwritten digits (0 to 9) with a LeNet-5 (convolutional neural network) model that you will train and validate using the MNIST database.

Before we get started, here’s some introduction and background on the technologies involved.

What is Deep Learning?

Deep learning is a form of Machine Learning that uses a model of computing inspired by the structure of the brain. It allows computers to improve with data and achieve great flexibility by learning to represent the world as a nested hierarchy of concepts.

In early talks on Deep Learning, Andrew Ng described it in the context of traditional artificial neural networks. In his talk titled “Deep Learning, Self-Taught Learning and Unsupervised Feature Learning”, he summarized the idea of Deep Learning as:

Using brain simulations, hope to:
  • Make learning algorithms much better and easier to use.
  • Make revolutionary advances in machine learning and AI.
I believe this is our best shot at progress towards real AI.

So, What is BigDL?

BigDL is a distributed deep learning library created and open sourced by Intel. It was designed from the ground up to run natively on Apache Spark, and therefore enables data engineers and scientists to write deep learning applications as standard Spark programs, without having to explicitly manage distributed computations.

  • Rich Deep Learning Support

    • Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing via Tensor and high-level neural networks (see the sketch after this list).
    • In addition, users can load pre-trained Caffe or Torch models into Spark applications using BigDL.
  • Extremely High Performance

    • BigDL uses Intel MKL (Math Kernel Library) and multi-threaded programming within each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open source Caffe, Torch, or TensorFlow on a single-node Xeon, making its performance comparable to a mainstream GPU instance.
  • Efficient Scaling

    • BigDL can efficiently scale out to perform data analytics at “Big Data scale” by leveraging Apache Spark, along with efficient implementations of synchronous SGD and all-reduce communication on Spark.

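To make the Torch-style API concrete, here is a minimal sketch of a model definition in Scala. Layer names follow the BigDL 0.x Scala API; this small multi-layer perceptron is only an illustration, and the actual LeNet-5 model for this series is built in Part 2.

    import com.intel.analytics.bigdl.nn._
    import com.intel.analytics.bigdl.numeric.NumericFloat

    // A small multi-layer perceptron for 28x28 MNIST images,
    // assembled from BigDL's Torch-style layer modules.
    val model = Sequential()
    model.add(Reshape(Array(28 * 28)))  // flatten the 28x28 image into a vector
    model.add(Linear(28 * 28, 100))     // fully connected layer
    model.add(Tanh())                   // non-linearity
    model.add(Linear(100, 10))          // one output per digit class
    model.add(LogSoftMax())             // log-probabilities over the 10 digits
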
For more details on BigDL, click here.

And, Why BigDL on Qubole?

BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, it makes for a perfect deployment platform.

Highlights of Apache Spark as a service offered on Qubole

  • Auto-scaling Spark Clusters

    • In the open source version of auto-scaling in Apache Spark, the required number of executors for completing a task is added in multiples of two. In Qubole, we’ve enhanced the auto-scaling feature to add the required number of executors based on a configurable SLA.
    • With Qubole’s auto-scaling, cluster utilization precisely matches the workload, so no compute resources are wasted and TCO is lowered. Based on our benchmarks of performance and cost savings, we estimate that auto-scaling saves a Qubole customer over $300K per year on just one cluster. (A sketch of bounding auto-scaling per application appears after this list.)
  • Heterogeneous Spark Clusters on AWS

    • Qubole supports heterogeneous Spark clusters for both On-Demand and Spot instances on AWS. This means that the slave nodes in a Spark cluster may be of any instance type.
    • For On-Demand nodes, this is beneficial when the requested number of nodes of the primary instance type is not granted by AWS at the time of the request. For Spot nodes, it’s advantageous when either the Spot price of the primary slave type is higher than the Spot price specified in the cluster configuration, or the requested number of Spot nodes is not granted by AWS at the time of the request.
  • Optimized Split Computation for Spark SQL

    • We’ve implemented an optimization for AWS S3 listings that enables split computation to run significantly faster for Spark SQL queries. As a result, we’ve recorded up to 6X faster query execution and up to 81X faster AWS S3 listings.

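As a small illustration of the auto-scaling bounds mentioned above, an application can pin its executor range through Spark settings. The two keys below are the same ones used in the cluster configuration later in this post; the values here are hypothetical.

    import org.apache.spark.SparkConf

    // Bound Qubole's executor auto-scaling for one application:
    // start with 4 executors and never scale beyond 8.
    val conf = new SparkConf()
      .set("spark.executor.instances", "4")   // initial executor count
      .set("spark.qubole.max.executors", "8") // upper bound for auto-scaling
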
To learn more about Qubole, click here.

Getting started with BigDL on Qubole

Prerequisites

  • Qubole account — for a free trial, click here
  • Build the BigDL jar — https://bigdl-project.github.io/master/#UserGuide/install-build-src/ — once built successfully, the jar will be named in the format bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar
  • Download MNIST database of handwritten digits — http://yann.lecun.com/exdb/mnist/
  • Download test images — Zero, Two, Three, Four, Seven, Nine
  • Important: Copy/upload the BigDL jar, the MNIST data files (train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte), and the test images to an S3 bucket in your account. (These files will be downloaded onto the cluster via the node bootstrap script.)

Steps

#1 If you don’t have a Spark cluster configured for this application, click here for instructions on how to configure one.
#2 Then, on the Clusters page, select/scroll down to the Spark cluster of your choice and click on Edit.
#3 On the Edit Cluster Settings page, click on the 4. Advanced Configuration tab.
#4 Scroll down to the SPARK SETTINGS section and copy and paste the following into **Override Spark Configuration**.

    spark-defaults.conf:
    spark.executorEnv.DL_ENGINE_TYPE mklblas
    spark.executorEnv.MKL_DISABLE_FAST_MM 1
    spark.executorEnv.KMP_BLOCKTIME 0
    spark.executorEnv.OMP_WAIT_POLICY passive
    spark.executorEnv.OMP_NUM_THREADS 1
    spark.yarn.appMasterEnv.DL_ENGINE_TYPE mklblas
    spark.yarn.appMasterEnv.MKL_DISABLE_FAST_MM 1
    spark.yarn.appMasterEnv.KMP_BLOCKTIME 0
    spark.yarn.appMasterEnv.OMP_WAIT_POLICY passive
    spark.yarn.appMasterEnv.OMP_NUM_THREADS 1
    spark.shuffle.reduceLocality.enabled false
    spark.shuffle.blockTransferService nio
    spark.scheduler.minRegisteredResourcesRatio 1.0
    spark.executor.instances 4
    spark.qubole.max.executors 4
    
Note: These parameters are required by BigDL. Setting them here makes them available to the Spark driver and executors on existing nodes, as well as on any new nodes added during auto-scaling in Qubole.

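If you later want to confirm that these settings actually reached the executors, a quick check from a spark-shell or notebook on the running cluster might look like the sketch below (sc is the SparkContext the shell provides; this is illustrative only, not part of the official setup):

    // Collect the value of one BigDL environment variable from the executors.
    // Any of the variables above can be checked the same way.
    val engineType = sc.parallelize(1 to 4, 4)
      .map(_ => sys.env.getOrElse("DL_ENGINE_TYPE", "unset"))
      .collect()
      .distinct
    println(engineType.mkString(","))  // expect: mklblas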

#5 Save the cluster settings and configuration by clicking on Update. At this point, you should be back on the main Clusters page.
#6 Click on the dotted (…) menu all the way to the right and select **Edit Node Bootstrap**.
#7 Copy and paste the following script:

    #!/bin/bash
    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    make-python2.7-system-default
    
    mkdir -p /media/ephemeral0/bigdl
    mkdir -p /media/ephemeral0/bigdl/mnist
    mkdir -p /media/ephemeral0/bigdl/mnist/data
    mkdir -p /media/ephemeral0/bigdl/mnist/model
    
    is_master=`nodeinfo is_master`
    if [[ "$is_master" == "1" ]]; then
       echo "Setting BigDL env variables in usr/lib/zeppelin/conf/zeppelin-env.sh"
       echo "export DL_ENGINE_TYPE=mklblas" >> usr/lib/zeppelin/conf/zeppelin-env.sh
       echo "export KMP_BLOCKTIME=0" >> usr/lib/zeppelin/conf/zeppelin-env.sh
       echo "export MKL_DISABLE_FAST_MM=1" >> usr/lib/zeppelin/conf/zeppelin-env.sh
       echo "export OMP_NUM_THREADS=1" >> usr/lib/zeppelin/conf/zeppelin-env.sh
       echo "export OMP_WAIT_POLICY=passive" >> usr/lib/zeppelin/conf/zeppelin-env.sh
    
       echo "Restarting Zeppelin daemon"
       /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart
    
       echo "Downloading test digit image files from s3://YOUR_S3_BUCKET"
       hadoop dfs -get s3://YOUR_S3_BUCKET/two_28x28.png /media/ephemeral0/bigdl
       hadoop dfs -get s3://YOUR_S3_BUCKET/three.png /media/ephemeral0/bigdl
       hadoop dfs -get s3://YOUR_S3_BUCKET/four_28x28.png /media/ephemeral0/bigdl
       hadoop dfs -get s3://YOUR_S3_BUCKET/seven.png /media/ephemeral0/bigdl
       hadoop dfs -get s3://YOUR_S3_BUCKET/nine_28x28.png /media/ephemeral0/bigdl
       hadoop dfs -get s3://YOUR_S3_BUCKET/zero_28x28.png /media/ephemeral0/bigdl
    fi
    
    echo "Downloading BigDL jar from s3://YOUR_S3_BUCKET"
    sudo hadoop dfs -get s3://YOUR_S3_BUCKET/bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar /usr/lib/spark/lib
    
    echo "Downloading mnist data files from s3://YOUR_S3_BUCKET"
    hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/train-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/train-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data
    
Important: Replace YOUR_S3_BUCKET with the S3 bucket in your AWS account where you uploaded the BigDL jar and MNIST data files, and replace bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar with the name of your BigDL jar. Here is what’s happening in the above bootstrap script:
  • Set Python 2.7 as the system default.
  • Create the temp directories accessed by our application.
  • Recall setting the BigDL environment variables for Spark in a previous step; similarly, we make them available to the Zeppelin driver running on the master node.
  • Download the test images we will use in our Spark application.
  • Download the BigDL jar so it’s available for us to import in our Spark application.
  • Download the MNIST dataset that we will use to train the model in our application.

#8 Click **Save** to save the bootstrap script.
#9 Click on **Start** to bring up the cluster.

That’s it!

Once the Spark cluster comes up, the BigDL deep learning library will be readily available for you to use in your Spark applications running on Qubole.
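As a quick sanity check (again, an illustrative sketch rather than part of the official setup), you can open a Zeppelin notebook or spark-shell on the cluster and confirm the jar is on the classpath:

    // If the BigDL jar was picked up from /usr/lib/spark/lib,
    // this import succeeds and a random 2x2 tensor prints.
    import com.intel.analytics.bigdl.tensor.Tensor

    val t = Tensor[Float](2, 2).rand()
    println(t)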

See you in Part 2!