Apache Spark for Python developers #
Big Data

Introduction #

Software Prerequisites #

IPython #

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers introspection, rich media, shell syntax, tab completion, and history.

PySpark #

The Spark Python API (PySpark) exposes the Spark programming model to Python.

Spark’s Python API #

Key Differences in the Python API #

There are a few key differences between the Python and Scala APIs:

  • Python is dynamically typed, so RDDs can hold objects of multiple types.
  • PySpark does not yet support a few features of the Scala API, such as the lookup call and non-text input files, though these are planned for future releases.
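
The first difference can be illustrated with a minimal sketch, assuming PySpark is importable and a local Spark installation is available. The function name mixed_type_rdd_demo and the application name are hypothetical, chosen only for this example.

```python
def mixed_type_rdd_demo():
    """Build an RDD whose elements have different Python types."""
    # Imported inside the function so this file can be loaded
    # even when Spark is not yet on the Python path.
    from pyspark import SparkContext

    # "local[1]" runs Spark in-process with a single worker thread.
    sc = SparkContext("local[1]", "mixed-types-demo")
    try:
        # One RDD holding an int, a str, a float and a tuple.
        rdd = sc.parallelize([1, "two", 3.0, (4, "four")])
        # Map each element to the name of its Python type.
        return rdd.map(lambda x: type(x).__name__).collect()
    finally:
        # Always release the context, even if the job fails.
        sc.stop()
```

Calling mixed_type_rdd_demo() returns one type name per element. In Scala the same RDD would have to be declared with a common supertype such as RDD[Any].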

IPython Configuration #

ipython profile create pyspark

Starting IPython Notebook with PySpark #

export SPARK_HOME='/opt/spark'

export PYTHONPATH=/opt/spark/python:$PYTHONPATH

export PYTHONPATH=/opt/spark/python/lib:$PYTHONPATH
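
The same path setup can also be done from inside Python, for example in a script placed in the startup directory of the pyspark IPython profile. The following is a sketch under stated assumptions, not the page's own method: the helper names spark_python_paths and add_spark_to_sys_path are hypothetical, /opt/spark is only the fallback taken from the exports above, and the Py4J archive is located with a glob because its version differs between Spark releases.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and archives PySpark needs on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    # Assumption: Spark ships Py4J as a zip archive under python/lib.
    # The glob avoids hard-coding a Py4J version number.
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home=None):
    """Prepend the PySpark paths to sys.path and return the ones added."""
    # Fall back to the SPARK_HOME variable, then to /opt/spark.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/opt/spark")
    added = []
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

After add_spark_to_sys_path() has run, import pyspark should resolve in the same session, provided SPARK_HOME points at a real Spark installation.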

Example #