A tiny foray into Apache Spark & Python

I’ve spent an afternoon playing with Apache Spark (1.0.1) to start to form an opinion on where it might be useful. Here are a couple of notes. We’re discussing this at PyDataLondon tonight.

UPDATE: I cover PySpark 1.2, ElasticSearch and PyPy in 2015.

You can run Spark out of the box on Linux (I’m using 13.10) without having Hadoop or HDFS installed, which makes quick experimentation easy. Having downloaded spark-1.0.1-bin-hadoop2.tgz I followed the README’s advice of running

./bin/pyspark
>>> sc.parallelize(range(1000)).count()

and indeed the PySpark command line interface popped up and the parallel job ran on my local machine. I changed 1000 to 10,000,000 and it looked as though it was using multiple CPUs (though I won’t bet money on it).
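
To poke at the multi-core question a bit more directly, a small variant of the README snippet (my own sketch, not from the docs; 8 is just my guess at a sensible value) is to pass an explicit numSlices to parallelize and watch the cores in top while it runs:

>>> # split the work into 8 partitions; each partition can be processed on its own core
>>> sc.parallelize(range(10000000), numSlices=8).count()
10000000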

The online docs are out of date for running the example programs (the Scala/Java demos are well documented); for the example Pi estimator I used:

./bin/spark-submit examples/src/main/python/pi.py

and it produced a result similar to that of the equivalent Scala program.
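
For reference, the bundled pi.py is a Monte-Carlo estimator; a minimal sketch of the same idea (my own condensed version – the variable names, sample size and partition count are my choices, not the example’s) looks roughly like this:

import random
from pyspark import SparkContext

sc = SparkContext(appName="PiSketch")
n = 1000000       # number of random points thrown at the unit square
partitions = 4    # my guess at a sensible partition count for a laptop

def inside(_):
    # 1 if a uniformly random point in the unit square lands inside the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / n)
sc.stop()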

I did try to run PySpark with Python 3.4 but it is written for Python 2.7; the first stumbling block was the SocketServer module (renamed to socketserver in Python 3) and I gave up there. I also tried PyPy, and PySpark does at least start with PyPy:

$ PYSPARK_PYTHON=~/Downloads/pypy-c-jit-69206-84efb3ba05f1-linux64/bin/pypy bin/pyspark

...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.1
      /_/

Using Python version 2.7.3 (84efb3ba05f1, Feb 18 2014)
SparkContext available as sc.
And now for something completely different: ``samuele says that we lost a razor. so we can't shave yaks''

If I try to run the Pi estimator with PyPy then it fails with a deeper error (“PicklingError: Can’t pickle builtin <type ‘method’>”), so I gave up there.

From what I can see, Python 2.7 and numpy (>=1.4) are supported; the dev guide notes that Python 2.6 and 2.7 are the only versions supported. I’ve not tried using HDFS (it looks like it would take less than half a day to set up on a single machine, but I didn’t need it for this first foray). It looks like IPython was (and maybe still should be) supported in 0.8.1, but I couldn’t get it to run with pyspark 1.0.1 (there’s a possible solution that I’ve not tested yet).

MLlib seems to support scipy (e.g. sparse arrays) and numpy; under the hood it goes down to netlib and jblas (Java wrappers around BLAS/LAPACK etc.).
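
As a concrete illustration, here’s a minimal sketch (untested by me against 1.0.1; the data and choice of k are made up) of feeding plain numpy arrays into MLlib’s k-means from inside the pyspark shell, where sc already exists:

from numpy import array
from pyspark.mllib.clustering import KMeans

# two obvious clusters of 2-D points, supplied as ordinary numpy arrays
points = sc.parallelize([array([0.0, 0.0]), array([0.1, 0.1]),
                         array([9.0, 9.0]), array([9.1, 9.1])])
model = KMeans.train(points, k=2, maxIterations=10)
print model.clusterCenters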

Update: at the PyDataLondon event it was noted that Python is currently a second-class citizen (with Scala as first-class) and that Python incurs roughly a 2x memory overhead (I believe the numpy data gets copied into Spark’s system). If anyone has better knowledge and could leave a comment, that’d be ace.

I also see that the k-means approach (k-means||) is parallelised, but for the other ML algorithms it isn’t clear whether they are parallelised. If they’re not, what’s the point of building a distributed set of classifiers? I fear I’m missing something here; if you have an opinion, I’d love to see it.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.
