Philippe ROSSIGNOL : 2015/06/12
How to configure Eclipse in order
to develop with Spark and Python
This article focuses on an older version of Spark (V1.3), so it's recommended to visit the link below if you want to work with a more recent version of Spark:
https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/
Introduction
This document shows a way to configure Eclipse IDE in order to develop with Spark 1.3.1 and Python via the plugin PyDev.
PyDev is a plugin that enables Eclipse to be used as a Python IDE.
First we will install Eclipse, then Spark 1.3.1 and PyDev, then we will configure PyDev.
Finally, we will develop and test a well-known example program named “Word Counts”, written in Python and running on Spark.
Under the cover of PySpark
The Spark Python API (PySpark) exposes the Spark programming model to Python.
By default, PySpark requires Python (2.6 or higher) to be available on the system PATH and uses it to run programs.
Note that PySpark applications are executed by a standard CPython interpreter (in order to support Python modules that use C extensions).
An alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable.
All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java.
For more details, please visit the page below :
Installing and Configuring PySpark : https://spark.apache.org/docs/0.9.2/python-programming-guide.html
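As a small illustration of the PYSPARK_PYTHON mechanism described above, the variable just has to be set before any Spark code runs; you can even set it from the driver script itself. This is a minimal sketch, and the interpreter path below is an assumption to adapt to your own machine:

```python
import os

# Point PySpark at a specific CPython interpreter before the SparkContext
# is created. The path below is a hypothetical example, not a requirement.
os.environ.setdefault("PYSPARK_PYTHON", "/usr/bin/python2.7")

# PySpark's launcher will read this variable when starting worker processes.
print(os.environ["PYSPARK_PYTHON"])
```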
Requirements
Note that Spark V1.3.1 runs on Java 6+ and Python 2.6+, so you will need on your computer :
- A JVM 6 or higher (JVM 7 is a good compromise)
- A Python 2.6 or higher
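A quick way to check these requirements from a script is sketched below. The Python check is a plain version comparison; the Java check simply shells out to `java -version`, assuming a `java` executable is on your PATH:

```python
import subprocess
import sys

def python_ok(version_info=sys.version_info):
    """Return True if the interpreter meets Spark 1.3.1's requirement
    (Python 2.6 or higher in the 2.x line)."""
    return (2, 6) <= tuple(version_info[:2]) < (3, 0)

print("Python version OK:", python_ok())

# Check the JVM by shelling out to "java -version" (it prints to stderr).
try:
    out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
    print(out.decode())
except OSError:
    print("No 'java' executable found on the PATH")
```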
A brief note about Scala
Keep in mind that it is a good idea to use this same Eclipse IDE for Spark in order to later develop in both Python and Scala.
To allow this, it’s important to know that Spark 1.3.1 needs to use a Scala API that is compatible with Scala version 2.10.x.
That’s the reason why the following installation uses Eclipse 4.3 (Kepler) because of its compatibility with Scala 2.10.
1°) Install Eclipse
Go to the Eclipse website then download and uncompress Eclipse 4.3 (Kepler) on your computer : http://www.eclipse.org/downloads/packages/release/Kepler/SR2
Finally, launch Eclipse and create your workspace as usual.
2°) Install Spark
Go to the Spark website then download and uncompress Spark 1.3.1 (e.g: “Pre-built for Hadoop 2.6 and later”) on your computer : https://spark.apache.org/downloads.html
3°) Install PyDev
From Eclipse IDE :
Go to the menu Help > Install New Software...
From the “Install“ window :
Click on the button [Add…]
From the “Add Repository” dialog box :
Fill the field Name: PyDev, and the field Location with the PyDev update site (e.g: http://pydev.org/updates)
Validate with the button [OK]
From the “Install“ window :
Check the name PyDev and click twice on the button [Next >]
Accept the terms of the license agreement and click on the button [Finish]
If a “Security Warning” window appears with the message “Warning: you are installing software that contains unsigned content…” :
Click on the button [OK]
From the “Software Updates“ window :
Click the button [Yes] to restart Eclipse and for the changes to take effect.
Now PyDev (e.g: 4.1.0) is installed in your Eclipse.
But you can’t develop in Python, because PyDev isn’t configured yet.
4°) Configure PyDev with a Python interpreter
Like PySpark, PyDev requires a Python interpreter installed on your computer.
Remember that with PySpark, Py4J is not a Python interpreter.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
The following installation has been carried out with a Python interpreter 2.7.
From Eclipse IDE :
Open the PyDev perspective (on top right of the Eclipse IDE)
Go to the menu Eclipse > Preferences… (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [Advanced Auto-Config]
Eclipse will introspect all the Python installations on your computer.
Choose a Python version 2.7 (e.g: /usr/bin/python2.7 on Mac) then validate with the button [OK]
From the “Selection needed” window :
Click on the button [OK] to accept the folders to be added to the system PYTHONPATH
From the “Preferences” window :
Validate with the button [OK]
Now PyDev is configured in your Eclipse.
You are able to develop in Python but not with Spark yet.
5°) Configure PyDev with the Spark Python sources
Now we are going to configure PyDev with the Spark Python sources.
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [New Folder]
Choose the python folder just under your Spark install directory and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python
Note : This path must be absolute (don’t use the Spark home environment variable)
Click on the button [New Egg/Zip(s)]
From the File Explorer, select [*.zip] rather than [*.egg]
Choose the file py4j-0.8.2.1-src.zip just under your Spark python folder and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip
Note : This path must be absolute (don’t use the Spark home environment variable)
Validate with the button [OK]
Now PyDev is configured with Spark Python sources.
But we can’t execute code on Spark until the environment variables are configured.
6°) Configure PyDev with the Spark Environment variables
It’s necessary to configure PyDev with the Spark Environment variables in order to execute codes on Spark.
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the central button [Environment]
Click on the button [New...] (close to the button [Select...]) to add a new Environment variable.
Add the environment variable SPARK_HOME and validate :
e.g 1 : Name: SPARK_HOME, Value: /home/foo/Spark_1.3.1-Hadoop_2.6
e.g 2 : Name: SPARK_HOME, Value: ${eclipse_home}../Spark_1.3.1-Hadoop_2.6
Note : Don’t rely on system environment variables (such as a Spark home defined in your shell); define them here in PyDev
It’s recommended to manage your own "log4j.properties" file in each of your projects.
To do so, add the environment variable SPARK_CONF_DIR as previously and validate :
e.g : Name: SPARK_CONF_DIR, Value: ${project_loc}/conf
If you experience problems with the variable ${project_loc} (e.g: on Linux), specify an absolute path instead.
Or if you want to keep ${project_loc}, right-click on every Python source and choose Run As > Run Configurations…,
then create your SPARK_CONF_DIR variable in the Environment tab as described previously
Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP and so on :
e.g 1 : Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm
e.g 2 : Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)
Validate with the button [OK]
Now PyDev is fully ready to develop with Spark in Python.
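Once these variables are configured, your driver code can rely on them. Here is a minimal sketch of how a script might locate the Spark installation and its config directory; the helper name and the fallback to a `conf` subfolder are assumptions for illustration:

```python
import os

def spark_paths(environ=os.environ):
    """Return (SPARK_HOME, SPARK_CONF_DIR), deriving the conf directory
    from SPARK_HOME when SPARK_CONF_DIR is not set (hypothetical helper)."""
    spark_home = environ.get("SPARK_HOME")
    if spark_home is None:
        raise RuntimeError("SPARK_HOME is not set; configure it in PyDev first")
    conf_dir = environ.get("SPARK_CONF_DIR", os.path.join(spark_home, "conf"))
    return spark_home, conf_dir
```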
7°) Create the Spark-Python project “WordCounts”
Now that we can develop any kind of Spark project written in Python, let’s create the example program named “WordCounts”.
This example will count the frequency of each word present in the “README.md” file belonging to the Spark installation.
To perform such a counting, the well-known MapReduce paradigm will be applied in memory using the two Spark transformations named “flatMap” and “reduceByKey”.
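To see what these two transformations compute before running them on Spark, here is a plain-Python emulation of the same pipeline (no pyspark needed; the helper name and sample input are made up for illustration):

```python
from collections import defaultdict

def word_counts(lines):
    """Emulate flatMap(split) + map(word -> (word, 1)) + reduceByKey(add)."""
    # flatMap: split every line into words, flattening into a single stream
    words = [word for line in lines for word in line.split()]
    # map + reduceByKey: sum the 1 associated with each occurrence of a word
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return dict(counts)

print(word_counts(["to be or", "not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```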
Create the new project :
Check that you are on the PyDev perspective
Go to the Eclipse menu File > New > PyDev project
Name your new project “PythonSpark”, then click on the button [Finish]
Create a source folder :
To add a source folder (which will soon contain your Python source), right-click on the project icon and New > Folder
Name the new folder “src”, then click on the button [Finish]
To add the new Python source, right-click on the source folder icon and New > PyDev Module
Name the new Python source “WordCounts”, then click on the button [Finish], then click on the button [OK]
Copy-paste the following Python code into your PyDev module WordCounts.py :
# Imports
# Take care of unused imports (and also unused variables):
# comment them all out, otherwise you will get errors at execution time.
# Note that neither the directive "@PydevCodeAnalysisIgnore" nor "@UnusedImport"
# will solve that issue.
#from pyspark.mllib.clustering import KMeans
from pyspark import SparkConf, SparkContext
import os
# Configure the Spark environment
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf = sparkConf)
# The WordCounts Spark program
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
for wc in wordCounts.collect(): print wc
In PyDev, take care of unused imports and also unused variables.
Comment them all out, otherwise you will get errors at execution.
Note that neither the directive @PydevCodeAnalysisIgnore nor @UnusedImport will solve that issue.
Create a config folder :
To add a config folder (useful for log4j), right-click on the project icon and New > Folder
Name the new folder “conf”, then click on the button [Finish]
To add your new config file (the “log4j.properties” file) right-click on the config folder icon and New > File
Name the new config file “log4j.properties”, then click on the button [Finish], then click on the button [OK]
Copy-paste the content of the file “log4j.properties.template” (under $SPARK_HOME/conf) to your new config file ”log4j.properties”
Edit your own config file ”log4j.properties” to adjust the log levels as you wish (e.g : INFO to WARN, or INFO to ERROR...)
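For example, lowering the console logging from INFO to WARN is typically a one-line change in that file; the property below comes from Spark’s own “log4j.properties.template”:

```properties
# Log everything to the console at WARN instead of the default INFO
log4j.rootCategory=WARN, console
```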
8°) Run the Spark-Python project “WordCounts”
To execute your code, right-click on the Python module “WordCounts.py”, then choose Run As > 1 Python Run