Spark Cluster on CentOS (bare metal) #

Apache Spark is a fast, in-memory computing system that executes jobs in a distributed environment: the more nodes we add to the cluster, the faster our computational jobs can run. To get a Spark cluster installed on our servers, we follow the steps below.

Update base repository #

First, we bring the package repository up to date and install the wget package so we can download the files we need.

$ sudo yum update -y
$ sudo yum install wget -y

Install Java #

We also need to install Java; here we install Java 1.7 (OpenJDK) and check the installed version.

$ sudo yum install java-1.7.0-openjdk -y
$ java -version
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.2.el7_1-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)

Install Spark #

Then, we download Apache Spark with Hadoop 2.6 bundled.

$ wget http://mirror.apache-kr.org/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
$ tar xzvf spark-1.5.1-bin-hadoop2.6.tgz

To make everything work with the installed Spark, we add a few environment variables to /etc/profile.

$ echo "export SPARK_HOME=`pwd`/spark/spark-1.5.1-bin-hadoop2.6" >>/etc/profile
$ echo "export SPARK_LOCAL_IP="127.0.0.1" >>/etc/profile
$ echo "export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH"
$ echo "export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

To confirm the setup, we can run one of the bundled examples. For Scala and Java examples, use run-example:

$ ./bin/run-example SparkPi

For Python examples, use spark-submit directly:

$ ./bin/spark-submit examples/src/main/python/pi.py
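
The PYTHONPATH entries we exported are mainly useful for importing pyspark from a plain Python interpreter (spark-submit sets up its own Python path), so a quick way to sanity-check them is a one-liner like the following; this is just a sketch, and the tiny sum is only an illustration. It should print 4950 (the sum 0 + 1 + ... + 99) along with Spark's log output.

$ python -c "from pyspark import SparkContext; sc = SparkContext('local', 'check'); print(sc.parallelize(range(100)).sum()); sc.stop()"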

Setup cluster #
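
What follows is a minimal sketch of a standalone cluster launch, assuming the same Spark directory and Java installation exist on every node and that the master host is reachable under the placeholder name master-node; adjust hostnames and paths for your environment. Note that on a real multi-node cluster, SPARK_LOCAL_IP should be each node's own reachable address rather than 127.0.0.1 as in the single-machine setup above.

# On the master node: start the standalone master (web UI on port 8080).
$ $SPARK_HOME/sbin/start-master.sh

# On each worker node: register the worker with the master.
$ $SPARK_HOME/sbin/start-slave.sh spark://master-node:7077

# From any node: submit the Pi example against the cluster to verify it.
$ $SPARK_HOME/bin/spark-submit --master spark://master-node:7077 \
    $SPARK_HOME/examples/src/main/python/pi.py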
