Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.
As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain kinds of programming tasks much faster.
However, with greater computing power comes greater complexity. Before turning to Spark, it is worth asking questions like:

- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?

The first step in using Spark is connecting to a cluster.
In practice, the cluster is hosted on a remote machine that's connected to all the other nodes. One computer, called the master, manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.
Creating the connection is as simple as creating an instance of the SparkContext class. The SparkContext is your connection to the cluster, and the SparkSession is your interface to that connection.
An object holding all of the connection's attributes (such as the master URL and application name) can be created with the SparkConf() constructor.
# Verify SparkContext
print(sc)  # <SparkContext master=local[*] appName=pyspark-shell>
# Print the Spark version of SparkContext
print(sc.version)  # e.g. 3.2.0
# Print the Python version used by SparkContext
print(sc.pythonVer)  # e.g. 3.9
# Print the master URL of SparkContext
print(sc.master)  # local[*]
Variables assumed to be predefined in these examples:
sc : SparkContext
spark : SparkSession