Sunday, May 3, 2020

How to Setup PySpark on Windows

Before getting into any Spark implementation or testing, you need a working Spark environment. In this post, I am going to show you how to set up Spark in your Windows environment.

The steps are very simple. As the title says, our objective is to set up PySpark on Windows, and there are no special prerequisites. Just follow the steps below to get the setup ready. I have assumed you already have Java installed.
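To confirm that Java is available, you can run the check below in a command prompt; Spark 2.4.x runs on Java 8.

          java -version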

Step 1-

  • Download a stable Spark release from the official downloads page, https://spark.apache.org/downloads.html. This post uses "spark-2.4.5-bin-hadoop2.7.tgz", which is pre-built for Hadoop 2.7 and ships with Scala 2.11.
  • Extract this .tgz file into your C:\ directory. In my case I have WinRAR installed, which can easily extract .tgz files; if you don't have it, see the alternative below.
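If you don't have WinRAR, recent Windows 10 builds ship a tar.exe that can handle .tgz archives. A minimal sketch, assuming the archive is in your Downloads folder (use an elevated prompt if writing to C:\ is blocked):

          cd %USERPROFILE%\Downloads
          tar -xzf spark-2.4.5-bin-hadoop2.7.tgz -C C:\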

Step 2-
  • Download winutils.exe (the Hadoop binaries for Windows; it is used again in Step 4) and place it in the bin folder of your Spark directory:
          C:\spark-2.4.5-bin-hadoop2.7\bin
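A minimal sketch for fetching it from the command line, assuming the commonly used steveloughran/winutils repository (pick the folder that matches your Hadoop build):

          curl -L -o C:\spark-2.4.5-bin-hadoop2.7\bin\winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe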


Step 3-
  • Now set up the environment variables for Spark.
  • Go to "Advanced System Settings" and set the variables below
  • JAVA_HOME="C:\Program Files\Java\jdk1.8.0_181"
  • HADOOP_HOME="C:\spark-2.4.5-bin-hadoop2.7"
  • SPARK_HOME="C:\spark-2.4.5-bin-hadoop2.7"
  • Also add their bin folders (%JAVA_HOME%\bin and %SPARK_HOME%\bin) to the PATH system variable; a command-line alternative is sketched below
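If you prefer the command line, a minimal sketch using the built-in setx command (the paths are the ones assumed above; adjust them to your actual install locations, and note that setx writes user-level variables that only take effect in new console windows):

          setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_181"
          setx HADOOP_HOME "C:\spark-2.4.5-bin-hadoop2.7"
          setx SPARK_HOME "C:\spark-2.4.5-bin-hadoop2.7"

For PATH itself, the Environment Variables dialog is safer, since setx can truncate a long PATH value.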


Step 4-

At this point we are done with the setup, but there is one more tweak, optional yet important, that helps you avoid errors when you work with Spark and Hive.

Optional: Some tweaks to avoid future errors

  • Create the folder C:\tmp\hive
  • Open your Command Prompt (CMD) as an administrator
  • Grant full rights on this temp hive directory using the command below (run it from your Spark bin folder, or make sure winutils.exe is on your PATH)
           winutils.exe chmod -R 777 C:\tmp\hive
  • Check the given permissions
           winutils.exe ls -F C:\tmp\hive
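If the permissions were applied, the listing should show a fully open mode string for the directory, something like drwxrwxrwx (the exact output format depends on the winutils build).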



Step 5- Check the installation

  • Open a new cmd window and run the command "spark-shell"; you should see the Spark shell start up. Since our objective is PySpark, verify the Python shell too, as sketched below.
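A quick PySpark smoke test, as a minimal sketch (the shell predefines the spark session for you; the range count is just an illustrative check):

          C:\> pyspark
          >>> spark.range(100).count()
          100
          >>> exit()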
Congratulations! All set; you can now start coding.