How to configure Eclipse in order to develop with Spark and Python

Philippe ROSSIGNOL : 2015/06/12



This article focuses on an older version of Spark (V1.3), so it's recommended to visit the link below if you want to work with a more recent version of Spark:
https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/

Introduction

This document shows how to configure the Eclipse IDE to develop with Spark 1.3.1 and Python via the PyDev plugin.

PyDev is a plugin that enables Eclipse to be used as a Python IDE.

First we will install Eclipse, then Spark 1.3.1 and PyDev, then we will configure PyDev.

Finally, we will develop and test a well-known example named “WordCounts”, written in Python and running on Spark.

Under the hood of PySpark


The Spark Python API (PySpark) exposes the Spark programming model to Python.
By default, PySpark requires Python (2.6 or higher) to be available on the system PATH and uses it to run programs.

Note that PySpark applications are executed by a standard CPython interpreter (in order to support Python modules that use C extensions).
An alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable.

All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.

In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java.
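As an illustration (a minimal sketch, not part of the original article), the snippet below shows where that machinery kicks in: creating the SparkContext is the step that launches the JVM through Py4J, and PYSPARK_PYTHON can be set beforehand to point at a specific interpreter. The interpreter path is only an assumed example, and the snippet needs the environment configured as described later in this article.

# Minimal PySpark driver sketch (Spark 1.3.x, Python 2.7).
# The interpreter path below is purely illustrative; adapt it to your system.
import os
os.environ.setdefault("PYSPARK_PYTHON", "/usr/bin/python2.7")

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Py4jDemo").setMaster("local")
sc = SparkContext(conf=conf)  # Py4J launches the JVM and a JavaSparkContext here

# RDD transformations defined in Python are mapped to PythonRDD objects in the JVM.
rdd = sc.parallelize([1, 2, 3, 4])
print rdd.map(lambda x: x * x).collect()  # [1, 4, 9, 16]

sc.stop()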

For more details, please refer to the official Spark documentation.

Requirements


Note that Spark V1.3.1 runs on Java 6+ and Python 2.6+, so you will need on your computer :
  • A JVM 6 or higher (JVM 7 is a good compromise)
  • Python 2.6 or higher
The following installation has been carried out with a JVM 7 and a Python 2.7 interpreter.
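As a quick sanity check (an optional sketch, not part of the original walkthrough), you can verify both requirements from a Python 2.7 prompt:

# Quick requirements check (illustrative sketch, Python 2.7).
import sys
import subprocess

print "Python version:", sys.version.split()[0]   # should be 2.6 or higher

# "java -version" prints to stderr, so redirect stderr to stdout to capture it.
print subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)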

A brief note about Scala

Keep in mind that a good idea is to use the same Eclipse IDE for Spark so that you can later develop in both Python and Scala.

To allow this, it’s important to know that Spark 1.3.1 needs to use a Scala API that is compatible with Scala version 2.10.x.

That’s why the following installation uses Eclipse 4.3 (Kepler), which is compatible with Scala 2.10.

1°) Install Eclipse


Go to the Eclipse website then download and uncompress Eclipse 4.3 (Kepler) on your computer : http://www.eclipse.org/downloads/packages/release/Kepler/SR2

Finally, launch Eclipse and create your workspace as usual.

2°) Install Spark


Go to the Spark website then download and uncompress Spark 1.3.1 (e.g: “Pre-built for Hadoop 2.6 and later”) on your computer : https://spark.apache.org/downloads.html

3°) Install PyDev


From Eclipse IDE :
Go to the menu Help > Install New Software...

From the “Install“ window :
Click on the button [Add…]

From the “Add Repository” dialog box :
Fill the field Name: PyDev
Fill the field Location: http://pydev.sf.net/updates
Validate with the button [OK]

From the “Install“ window :
Check the name PyDev and click twice on the button [Next >]
Accept the terms of the license agreement and click on the button [Finish]

If a “Security Warning” window appears with the message “Warning: you are installing software that contains unsigned content…” :
Click on the button [OK]

From the “Software Updates” window :
Click the button [Yes] to restart Eclipse and for the changes to take effect.

Now PyDev (e.g: 4.1.0) is installed in your Eclipse.
But you can’t develop in Python, because PyDev isn’t configured yet.

4°) Configure PyDev with a Python interpreter

Like PySpark, PyDev requires a Python interpreter installed on your computer.

Remember that with PySpark, Py4J is not a Python interpreter.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.

The following installation has been carried out with a Python interpreter 2.7.

From Eclipse IDE :
Open the PyDev perspective (on top right of the Eclipse IDE)
Go to the menu Eclipse > Preferences… (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter

Click on the button [Advanced Auto-Config]
Eclipse will introspect all the Python installations on your computer.

Choose a Python version 2.7 (e.g: /usr/bin/python2.7 on Mac) then validate with the button [OK]

From the “Selection needed” window :
Click on the button [OK] to accept the folders to be added to the system PYTHONPATH

From the “Preferences” window :
Validate with the button [OK]

Now PyDev is configured in your Eclipse.
You are able to develop in Python but not with Spark yet.
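As a quick check (an optional sketch, not in the original article), you can create a scratch PyDev module and run it to confirm that the interpreter you selected is really the one in use:

# Interpreter sanity check (illustrative).
import sys
print sys.executable   # should point to the Python 2.7 you selected
print sys.version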

5°) Configure PyDev with the Spark Python sources

Now we are going to configure PyDev with the Spark Python sources.

From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter

Click on the button [New Folder]
Choose the python folder just under your Spark install directory and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python
Note : This path must be absolute (don’t use the Spark home environment variable)

Click on the button [New Egg/Zip(s)]
From the File Explorer, select [*.zip] rather than [*.egg]
Choose the file py4j-0.8.2.1-src.zip just under your Spark python folder and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip
Note : This path must be absolute (don’t use the Spark home environment variable)

Validate with the button [OK]

Now PyDev is configured with the Spark Python sources.
But we can’t execute Spark code until the environment variables are configured.
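To check that the Spark sources and the Py4J zip are now visible to PyDev (an optional sketch, not part of the original steps), you can try importing them from a PyDev module; the imports already work even before the environment variables are set:

# Import check for the Spark Python sources (illustrative).
import pyspark
import py4j
print pyspark.__file__   # should point inside your Spark python folder
print py4j.__file__      # should point inside py4j-0.8.2.1-src.zip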

6°) Configure PyDev with the Spark Environment variables

It’s necessary to configure PyDev with the Spark Environment variables in order to execute code on Spark.

From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter

Click on the central button [Environment]
Click on the button [New...] (close to the button [Select...]) to add a new Environment variable.
Add the environment variable SPARK_HOME and validate :
e.g 1 : Name: SPARK_HOME, Value: /home/foo/Spark_1.3.1-Hadoop_2.6
e.g 2 : Name: SPARK_HOME, Value: ${eclipse_home}../Spark_1.3.1-Hadoop_2.6
Note : Don’t use the system environment variables such as Spark home

It’s recommended to manage your own "log4j.properties" file in each of your projects.
To do so, add the environment variable SPARK_CONF_DIR as previously and validate :
e.g : Name: SPARK_CONF_DIR, Value: ${project_loc}/conf
If you experience problems with the variable ${project_loc} (e.g: on Linux), specify an absolute path instead.
Or, if you want to keep ${project_loc}, right-click on each Python source and choose Run As > Run Configurations…,
then create your SPARK_CONF_DIR variable in the Environment tab as described previously

Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP and so on :
e.g 1 : Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm
e.g 2 : Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)

Validate with the button [OK]

Now PyDev is fully ready to develop with Spark in Python.
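To confirm that the variables are actually passed to your programs (an optional sketch, not part of the original article), you can print them from a PyDev run:

# Environment sanity check (illustrative).
import os
print "SPARK_HOME     =", os.environ.get("SPARK_HOME")
print "SPARK_CONF_DIR =", os.environ.get("SPARK_CONF_DIR")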

7°) Create the Spark-Python project “WordCounts”

Now that we can develop any kind of Spark project written in Python, let’s create the code example named “WordCounts”.

This example counts the frequency of each word present in the “README.md” file belonging to the Spark installation.
To perform this counting, the well-known MapReduce paradigm is applied in memory using the two Spark transformations “flatMap” and “reduceByKey”, as sketched below.
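Here is a tiny sketch of that data flow (illustrative only, not part of the original article), applied to an in-memory list instead of README.md:

# Conceptual sketch of the word-count data flow (illustrative only).
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("MiniWordCount").setMaster("local"))

lines = sc.parallelize(["to be or", "not to be"])
# flatMap:     ["to", "be", "or", "not", "to", "be"]
# map:         [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
# reduceByKey: [("to", 2), ("be", 2), ("or", 1), ("not", 1)]
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print counts.collect()

sc.stop()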

Create the new project :
Check that you are on the PyDev perspective
Go to the Eclipse menu File > New > PyDev project
Name your new project “PythonSpark”, then click on the button [Finish]

Create a source folder :
To add a source folder (in order to create your Python source in it), right-click on the project icon and New > Folder
Name the new folder “src”, then click on the button [Finish]

To add the new Python source, right-click on the source folder icon and New > PyDev Module
Name the new Python source “WordCounts”, then click on the button [Finish], then click on the button [OK]

Copy-paste the following Python code into your PyDev module WordCounts.py :

# Imports
# Take care of unused imports (and also unused variables):
# comment them all out, otherwise you will get errors at execution time.
# Note that neither the directive "@PydevCodeAnalysisIgnore" nor "@UnusedImport"
# will solve that issue.
#from pyspark.mllib.clustering import KMeans
from pyspark import SparkConf, SparkContext
import os

# Configure the Spark environment
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf=sparkConf)

# The WordCounts Spark program
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
for wc in wordCounts.collect(): print wc

In PyDev, take care of unused imports and also unused variables.
Comment them all out, otherwise you will get errors at execution time.
Note that neither the directive @PydevCodeAnalysisIgnore nor @UnusedImport will solve that issue.

Create a config folder :
To add a config folder (useful for log4j), right-click on the project icon and New > Folder
Name the new folder “conf”, then click on the button [Finish]

To add your new config file (the “log4j.properties” file) right-click on the config folder icon and New > File
Name the new config file “log4j.properties”, then click on the button [Finish], then click on the button [OK]

Copy-paste the content of the file “log4j.properties.template” (under $SPARK_HOME/conf) into your new config file ”log4j.properties”

Edit your own config file ”log4j.properties” to adjust the log levels as you like (e.g : INFO to WARN, or INFO to ERROR...)

8°) Run the Spark-Python project “WordCounts”


To execute your code, right-click on the Python module “WordCounts.py”, then choose Run As > 1 Python Run

Have fun :-)

9 comments:

  1. I am getting the error as mentioned in the link :
    http://stackoverflow.com/questions/35959638/spark-sample-python-in-eclipse

    Can you please help ?

  2. Hi, this roadmap is deprecated, please refer to the following link:
    https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/
    Kind Regards

  3. This roadmap has been made especially for Spark V1.3 only, that's probably the reason why you've got that issue. Also, if you want to use Spark V1.6 please refer to the following link:
    https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/

  4. Hi,
    I followed all the steps you mentioned.

But when I try to run the wordcount.py program, it shows:

    Error from python worker:
    File "/Applications/anaconda/lib/python3.5/site.py", line 176
    file=sys.stderr)

  5. Same error for me
    16/11/21 11:16:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    File "C:\Program Files\Anaconda3\lib\site.py", line 176
    file=sys.stderr)
    ^

6. Hi Philippe, I need your help. When I configure PyDev with the Spark Python sources, Eclipse doesn't let me select the /usr/local/spark/python directory since it expects a .exe file. What can I do to finish the configuration?

  7. Awesome man, it works like a charm,

    but it's a bit slow... I am figuring out how to make it run faster.

  8. Just like Maven is used for packaging with Scala, what can we use to manage dependencies while using PySpark?

  9. Hello,
    I was wondering why you are using Eclipse? I would like to understand the difference with other IDEs, because I don't see many people using PySpark with Eclipse.
    Thanks
