Spark Cluster on CentOS (bare metal)
Apache Spark is a fast, general-purpose computing system that executes jobs in a distributed environment: the more nodes we add to the cluster, the faster our computational jobs can finish. To get a Spark cluster installed on our servers, we follow the steps below.
Update base repository
First, we bring the package repository and installed packages up to date. We also install the wget package, which we will use to download Spark.
$ sudo yum update -y
$ sudo yum install wget -y
Install Java
Next, we need to install Java. Here we use OpenJDK 1.7.
$ sudo yum install java-1.7.0-openjdk -y
$ java -version
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.2.el7_1-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)
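Some tools locate Java via JAVA_HOME rather than the PATH. Setting it is optional here; a minimal sketch that derives it from the installed java binary for the current shell (persist it the same way as the Spark variables below if you need it to survive logout):
$ export JAVA_HOME=$(readlink -f $(which java) | sed 's:/bin/java::')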
Install Spark
Then we download Apache Spark with Hadoop 2.6 bundled.
$ wget http://mirror.apache-kr.org/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
$ tar xzvf spark-1.5.1-bin-hadoop2.6.tgz
To make everything work with our Spark installation, we need to add a few environment variables. The PYTHONPATH lines use single quotes so that $SPARK_HOME and $PYTHONPATH are expanded at login time, not when we run echo. Appending to /etc/profile requires root, so either run these as root or pipe each line through sudo tee -a /etc/profile.
$ echo "export SPARK_HOME=`pwd`/spark-1.5.1-bin-hadoop2.6" >> /etc/profile
$ echo "export SPARK_LOCAL_IP=127.0.0.1" >> /etc/profile
$ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /etc/profile
$ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH' >> /etc/profile
To confirm the setup, we can run one of the bundled examples; they must be launched from the Spark directory. For Scala and Java, use run-example:
$ cd $SPARK_HOME
$ ./bin/run-example SparkPi
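The example prints an estimate of pi among its log output, so filtering for that line is a quick sanity check (the exact digits vary from run to run):
$ ./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"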
For Python examples, use spark-submit directly:
$ ./bin/spark-submit examples/src/main/python/pi.py
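Everything so far runs on a single machine. To form an actual standalone cluster, Spark ships control scripts under sbin; a minimal sketch, where <master-host> is a placeholder for the master's address and SPARK_LOCAL_IP must be set to each node's real address rather than 127.0.0.1 so the nodes can reach each other:
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077
The master's web UI on port 8080 should then list every connected worker, and jobs can target the cluster, for example with ./bin/spark-submit --master spark://<master-host>:7077 examples/src/main/python/pi.py.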