Problem:
Install and Set Up Apache Spark Locally (Windows Machine)
Solution Summary:
We will install and set up Spark locally (on a Windows machine) and run a few commands from Spark’s own Quick Start guide to get started.
Prerequisites:
- Install Java and Scala, and add the Java install location to JAVA_HOME. Make sure there are no spaces in the path.
- Download winutils (https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe) and add the folder path to HADOOP_HOME. winutils.exe should be inside a bin folder within the HADOOP_HOME location.
- Run: C:\Dev\Tools\WINUTILS\bin\winutils.exe chmod 777 \tmp\hive
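- For reference, the environment variables can be set from a command prompt; a minimal sketch, where the Java path is only an example and the HADOOP_HOME path follows the winutils location used above (open a new console afterwards for the variables to take effect):
    setx JAVA_HOME "C:\Dev\Tools\Java\jdk1.8.0_131"
    setx HADOOP_HOME "C:\Dev\Tools\WINUTILS"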
Solution Steps:
- Download Spark from http://spark.apache.org/downloads.html
- Select latest version, pre-built for latest version of Hadoop.
- Follow the quick start guide to get familiar with Spark: http://spark.apache.org/docs/2.1.0/quick-start.html
- Start spark shell: bin\spark-shell --conf spark.sql.warehouse.dir=file:///C:/tmp/spark-warehouse
- Once started, you should see info similar to:
    Spark context Web UI available at http://10.0.75.1:4040
    Spark context available as 'sc' (master = local[*], app id = local-1496816307399).
    Spark session available as 'spark'.
Run within spark shell:
-
Check type of the context object: sc
-
val textFile = sc.textFile("README.md")
-
textFile.count()
-
textFile.first()
-
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
-
textFile.filter(line => line.contains("Spark")).count()
-
Exit: sys.exit
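  - (Optional, before exiting) The quick start guide also demonstrates caching an RDD in memory; a minimal sketch reusing the linesWithSpark value defined above:
      linesWithSpark.cache()   // mark the RDD to be kept in memory once computed
      linesWithSpark.count()   // the first action computes the RDD and caches it
      linesWithSpark.count()   // later actions reuse the cached data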
- Run with Python:
  - bin\pyspark
  - Check type of the context object: sc
  - textFile = sc.textFile("README.md")
  - textFile.count()
  - textFile.first()
  - linesWithSpark = textFile.filter(lambda line: "Spark" in line)
  - textFile.filter(lambda line: "Spark" in line).count()
  - Use exit() or Ctrl-Z plus Return to exit
Note:
- Every Spark application has a driver program. The driver program contains the main function, defines distributed datasets on the cluster, applies operations to them, and launches various parallel operations on the cluster.
  - In the above example, the driver program was the spark shell itself.
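  - As an illustration, a minimal sketch of a standalone driver program (a hypothetical SimpleApp, not part of the steps above) that creates its own SparkContext instead of relying on the shell's sc:
      import org.apache.spark.{SparkConf, SparkContext}

      object SimpleApp {
        def main(args: Array[String]): Unit = {
          // the driver program: defines the distributed dataset and launches parallel operations
          val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
          val sc   = new SparkContext(conf)
          val textFile = sc.textFile("README.md")
          val count    = textFile.filter(line => line.contains("Spark")).count()
          println(s"Lines with Spark: $count")
          sc.stop()
        }
      }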
- The driver program accesses Spark through a SparkContext object, which represents a connection to a computing cluster.
  - In the spark shell, a SparkContext is automatically created with the name sc (see the example above and the quick check below).
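  - For example, inside the shell you can inspect the pre-created context with standard SparkContext members:
      sc.master    // e.g. local[*] when running locally
      sc.appName   // the application name, e.g. "Spark shell"
      sc.version   // the Spark version in use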
- We express our computations through operations on RDDs. RDDs are distributed collections that are automatically parallelized across the cluster; they are Spark’s abstraction for distributed data and computation. We can build RDDs from a SparkContext.
  - In the above example, the variable called textFile is an RDD, created from a text file on our local machine.
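  - RDDs can also be built from an in-memory collection; a minimal sketch using parallelize in the Scala shell:
      val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))   // distribute a local collection as an RDD
      nums.count()                                     // 5
      nums.map(n => n * 2).collect()                   // Array(2, 4, 6, 8, 10)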
- Various parallel operations can be run on an RDD, such as counting the number of elements in the dataset or printing the first one. To run operations, driver programs usually manage a set of nodes called executors.
  - Here, since we ran the spark shell locally, all work was done on a single machine; we can connect the shell to a cluster to analyze data in parallel (see the sketch below).
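  - For example, the shell can be pointed at a cluster with the --master option (host:7077 below is a placeholder for a real standalone cluster master):
      bin\spark-shell --master local[4]
      bin\spark-shell --master spark://host:7077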
- We pass functions to Spark’s operators (to run them on the cluster) using the lambda or => syntax in Python/Scala. Java 8 also supports a similar style using its lambda syntax.
  - We can also define functions separately and pass them to these operators, without using the lambda-style syntax (see the sketch below).
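  - For example, in Scala the same filter can be written with a separately defined function instead of a lambda:
      def containsSpark(line: String): Boolean = line.contains("Spark")
      val linesWithSpark = textFile.filter(containsSpark)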