Apache Spark for Python developers #

Introduction #

Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, which maintains it today. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

In many cases the Python pandas library is not enough, because it runs only on a single machine, and the pandas developers do not yet plan to make it distributed. Python developers can, however, easily use Apache Spark for big data computational tasks.

Software Prerequisites #

IPython #

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers introspection, rich media, shell syntax, tab completion, and history.

PySpark #

The Spark Python API (PySpark) exposes the Spark programming model to Python.

Spark’s Python API #

Key Differences in the Python API #

There are a few key differences between the Python and Scala APIs:

  • Python is dynamically typed, so RDDs can hold objects of multiple types.
  • PySpark does not yet support a few API calls, such as lookup and non-text input files, though these will be added in future releases.
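
As a minimal sketch of the first point, the function below builds an RDD whose elements have different Python types and reports the type name of each. It assumes an already running SparkContext is passed in as `sc` (the PySpark shell typically provides one under that name); `describe` and `mixed_type_names` are hypothetical helper names used only for illustration.

```python
def describe(value):
    """Return the Python type name of a single element."""
    return type(value).__name__


def mixed_type_names(sc):
    """Build an RDD holding several Python types and collect their type names.

    `sc` is assumed to be an already running SparkContext.
    """
    # A single RDD can mix ints, strings, floats and tuples.
    mixed = sc.parallelize([1, "two", 3.0, (4, "four")])
    return mixed.map(describe).collect()
```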

IPython Configuration #

ipython profile create pyspark

Starting IPython Notebook with PySpark #

export SPARK_HOME='/opt/spark'

export PYTHONPATH=/opt/spark/python:$PYTHONPATH

export PYTHONPATH=/opt/spark/python/lib:$PYTHONPATH
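
The same paths can also be set from inside Python, for instance from a startup script in the `pyspark` profile. The sketch below is one possible approach, not the only one: it assumes `SPARK_HOME` falls back to `/opt/spark` as in the exports above, and that the Py4J sources ship as a `py4j-*-src.zip` archive under `python/lib`, which may differ between Spark releases. `spark_python_paths` is a hypothetical helper name.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and archives PySpark needs on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: Py4J is bundled as a zip archive under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


# Fall back to the install location used in the exports above.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
for path in spark_python_paths(spark_home):
    if path not in sys.path:
        sys.path.insert(0, path)
```

If no matching archive is found, only the `python` directory is added, so importing `pyspark` may still fail until the Py4J location is corrected.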

Example #
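
A minimal word-count sketch, offered as an illustration and not as the page's original example. It assumes an already running SparkContext is passed in as `sc`; `tokenize` and `word_count` are hypothetical names, and the input is a small in-memory list of strings, not a file.

```python
from operator import add


def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()


def word_count(sc, lines):
    """Count word occurrences across `lines` using an existing SparkContext `sc`."""
    return (
        sc.parallelize(lines)          # distribute the lines as an RDD
        .flatMap(tokenize)             # one element per word
        .map(lambda word: (word, 1))   # pair each word with a count of 1
        .reduceByKey(add)              # sum the counts per word
        .collect()                     # bring (word, count) pairs to the driver
    )
```

In a notebook where `sc` is available, `word_count(sc, ["spark is fast", "Spark is distributed"])` returns a list of `(word, count)` pairs; the order of the pairs is not guaranteed.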

