Sunday, May 3, 2020

How to Setup PySpark on Windows

Before getting into any Spark implementation or testing, you need a working Spark environment. In this post, I am going to show you how to set up Spark in your Windows environment.

The steps are very simple. As the title says, our objective is to set up PySpark on Windows, and there are no special prerequisites. Just follow the steps below to get the setup ready. I have assumed you already have Java installed.
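To confirm that Java is available, you can run the check below in a command prompt; Spark 2.4.x runs on Java 8.

          java -version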

Step 1-

  • Download a stable Spark release from the official downloads page, https://spark.apache.org/downloads.html. This post uses "spark-2.4.5-bin-hadoop2.7.tgz", which is pre-built for Hadoop 2.7 and ships with Scala 2.11.
  • Extract this .tgz file into your C:\ directory. In my case I have WinRAR installed, which can easily extract .tgz files; if you don't have it, see the alternative below.
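If you don't have WinRAR, recent Windows 10 builds ship a tar.exe that can handle .tgz archives. A minimal sketch, assuming the archive is in your Downloads folder (use an elevated prompt if writing to C:\ is blocked):

          cd %USERPROFILE%\Downloads
          tar -xzf spark-2.4.5-bin-hadoop2.7.tgz -C C:\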

Step 2-
  • Download winutils.exe (the Hadoop binaries for Windows; it is used again in Step 4) and place it in the bin folder of your Spark directory:
          C:\spark-2.4.5-bin-hadoop2.7\bin
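A minimal sketch for fetching it from the command line, assuming the commonly used steveloughran/winutils repository (pick the folder that matches your Hadoop build):

          curl -L -o C:\spark-2.4.5-bin-hadoop2.7\bin\winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe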


Step 3-
  • Now set up the environment variables for Spark.
  • Go to "Advanced System Settings" and set the variables below
  • JAVA_HOME="C:\Program Files\Java\jdk1.8.0_181"
  • HADOOP_HOME="C:\spark-2.4.5-bin-hadoop2.7"
  • SPARK_HOME="C:\spark-2.4.5-bin-hadoop2.7"
  • Also add their bin folders (%JAVA_HOME%\bin and %SPARK_HOME%\bin) to the PATH system variable; a command-line alternative is sketched below
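If you prefer the command line, a minimal sketch using the built-in setx command (the paths are the ones assumed above; adjust them to your actual install locations, and note that setx writes user-level variables that only take effect in new console windows):

          setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_181"
          setx HADOOP_HOME "C:\spark-2.4.5-bin-hadoop2.7"
          setx SPARK_HOME "C:\spark-2.4.5-bin-hadoop2.7"

For PATH itself, the Environment Variables dialog is safer, since setx can truncate a long PATH value.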


Step 4-

At this point we are done with the setup, but there is one more tweak, optional yet important, that helps you avoid errors when you work with Spark and Hive.

Optional: Some tweaks to avoid future errors

  • Create the folder C:\tmp\hive
  • Open your Command Prompt (CMD) as an administrator
  • Grant full rights on this temp hive directory using the command below (run it from your Spark bin folder, or make sure winutils.exe is on your PATH)
           winutils.exe chmod -R 777 C:\tmp\hive
  • Check the given permissions
           winutils.exe ls -F C:\tmp\hive
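If the permissions were applied, the listing should show a fully open mode string for the directory, something like drwxrwxrwx (the exact output format depends on the winutils build).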



Step 5- Check the installation

  • Open a new cmd window and run the command "spark-shell"; you should see the Spark shell start up. Since our objective is PySpark, verify the Python shell too, as sketched below.
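A quick PySpark smoke test, as a minimal sketch (the shell predefines the spark session for you; the range count is just an illustrative check):

          C:\> pyspark
          >>> spark.range(100).count()
          100
          >>> exit()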
Congratulations! All set; you can now start coding.