Apache Spark: Setting up a Cluster on CentOS 8

Matías Salinas
Mar 26, 2023 · 4 min read


Apache Spark is an open-source big data processing framework that allows developers to write parallelized data processing applications. Spark provides a distributed computing environment, enabling data processing on a large scale. In this article, we’ll cover what Apache Spark is, how it works, how to set up a cluster, and how to run SQL and Python examples.

What is Apache Spark?

Apache Spark is an open-source framework for writing parallelized big data processing applications. It was originally developed at UC Berkeley's AMPLab, is now maintained by the Apache Software Foundation, and is written in Scala. Spark is designed to be faster and more flexible than Hadoop MapReduce, another big data processing framework, and it provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Architecture of Apache Spark

The architecture of Apache Spark is based on a distributed computing model, where data is processed in parallel on a cluster of nodes. Spark has three main components (a minimal code sketch of how they connect follows the list):

  1. Driver Program: The driver program is the entry point of a Spark application. It creates the SparkContext, through which all Spark functionality is accessed.
  2. Cluster Manager: The cluster manager is responsible for managing the resources of the cluster. It allocates resources to the different Spark applications running on the cluster.
  3. Executors: Executors are processes that run on the worker nodes and perform the actual computation. They receive tasks from the driver program and process the data in parallel.
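
To make these roles concrete, here is a minimal sketch of how a driver attaches to a standalone cluster manager; spark://master-ip:7077 is the same placeholder master address used in the setup steps below.

from pyspark.sql import SparkSession

# The driver runs this code; pointing the builder at the standalone
# cluster manager lets it schedule tasks on that cluster's executors.
spark = (
    SparkSession.builder
    .appName("architecture-sketch")
    .master("spark://master-ip:7077")  # placeholder master address
    .getOrCreate()
)

# This small job is split into tasks that the executors run in parallel.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()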

Setting up a Spark Cluster on CentOS 8

Here are the steps to set up a Spark cluster on CentOS 8:

  1. Install Java: Spark requires Java 8 or later to run. Install Java on each node in the cluster using the following command:
sudo yum install java-1.8.0-openjdk-devel
  2. Download and extract Spark: Download Spark (version 3.3.2 at the time of writing) from the official website using the following command:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz

Extract it using the following command:

tar -xzf spark-3.3.2-bin-hadoop3.tgz
  3. Configure the environment: Set the following environment variables on each node in the cluster (adding them to ~/.bashrc makes them persistent); a spark-env.sh sketch covering the same settings follows these steps:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin
  4. Start the master node: From the Spark installation directory, start the master by running:
./sbin/start-master.sh
  5. Start the worker nodes: On each worker node, run the following command from the Spark installation directory:
./sbin/start-worker.sh spark://master-ip:7077
  6. Verify the cluster: Verify that the cluster is up and running by visiting the Spark web UI at http://master-ip:8080. You should see the list of worker nodes connected to the master node.
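
If you prefer to keep these settings with Spark itself rather than in each shell, the same values can go in conf/spark-env.sh (created by copying conf/spark-env.sh.template). This is a minimal sketch; master-ip is a placeholder for your master's hostname or IP.

# conf/spark-env.sh -- sourced by Spark's start-up scripts on each node
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
# On the master node: bind the master to a specific hostname or IP (placeholder)
export SPARK_MASTER_HOST=master-ip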

Organizing a Spark Cluster

A Spark cluster is organized into a master node and multiple worker nodes. The master node manages the distribution of tasks to the worker nodes, and the worker nodes process the data in parallel. Listing the workers in a single file lets you start and stop the whole cluster from the master, as sketched below.
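
Assuming passwordless SSH from the master to the workers, you can list the workers in Spark's conf/workers file and manage the whole cluster with the bundled scripts; worker1 and worker2 below are placeholder hostnames.

# conf/workers -- one worker hostname or IP per line (placeholders)
worker1
worker2

# Run from the Spark directory on the master: starts the master and all listed workers
./sbin/start-all.sh

# Stops all listed workers and then the master
./sbin/stop-all.sh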

Running SQL and Python Examples on Spark

Now that you have set up a Spark cluster, you can run SQL and Python examples on it. Here are some examples:

1. Running SQL queries: You can run SQL queries on Spark using the Spark SQL module. Here’s an example:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session, the entry point to Spark SQL
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame and expose it to SQL as a temporary view
data = [("John", 25), ("Jane", 30), ("Bob", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.createOrReplaceTempView("people")

# Run a SQL query against the view and print the result
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
result.show()

In this example, we first create a Spark session and a DataFrame with some data. We then create a temporary view of the DataFrame, which enables us to run SQL queries on it using the spark.sql function. We use this function to select the name and age of people whose age is greater than 25. Finally, we use the show function to display the results of the query on the console.

+----+---+
|Name|Age|
+----+---+
|Jane| 30|
| Bob| 40|
+----+---+
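
For comparison, the same result can be obtained without SQL by calling the DataFrame API directly on the df DataFrame from the example above:

# Equivalent query expressed with the DataFrame API instead of SQL
df.filter(df.Age > 25).select("Name", "Age").show()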

2. Running Python scripts: You can also run Python scripts on Spark using the PySpark module. Here’s an example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("John", 25), ("Jane", 30), ("Bob", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Plain Python function applied to each Row of the DataFrame's underlying RDD
def process_row(row):
    return row.Name + " is " + str(row.Age) + " years old."

processed_data = df.rdd.map(process_row)

# collect() brings the results back to the driver, where we print them
for line in processed_data.collect():
    print(line)

In this example, we again create a Spark session and a DataFrame with some data. We then define a Python function process_row that takes a row of the DataFrame as input and returns a string describing the person's name and age. We use the map function on the DataFrame's underlying RDD to apply process_row to each row. Finally, we use the collect function to bring the results back to the driver and print them to the console.

Sending Code to Spark

To send code to Spark, you can use the spark-submit command. Here's an example of how to submit a Python script to the cluster:

./bin/spark-submit --master spark://master-ip:7077 my_script.py
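
spark-submit also accepts options that control how the job runs on the cluster. The values below are illustrative placeholders, not tuned recommendations:

./bin/spark-submit \
  --master spark://master-ip:7077 \
  --name my_example_job \
  --executor-memory 2g \
  --total-executor-cores 4 \
  my_script.py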

Conclusion

We covered what Apache Spark is, how it works, how to set up a Spark cluster on CentOS 8, and how to run SQL and Python examples. With the ability to handle large data sets and process them quickly, Spark is an essential tool for big data processing tasks. By understanding how to set up and use Spark, developers can take advantage of its capabilities to process and analyze large amounts of data.
