It took me a while to really understand how Spark (specifically PySpark) works under the hood, by which I mean what happens when I, as the user, submit a PySpark data processing job to a cluster. Even after all my digging around I’m still uncertain about how a few things work. I’m kind of surprised at how convoluted the documentation is for a library/framework that’s so popular in industry. Nevertheless, this is my attempt at putting together a very high-level overview of how Spark works, from the perspective of the PySpark interface. I’ve broken things down into a key concept, some Spark terminology, and finally how Spark executes an application.
Spark processes data in memory wherever it can. This makes it efficient, since you’re not wasting time reading and writing intermediate results to disk. (That’s an inefficiency you face with MapReduce, where the output of the map phase is written to disk before it’s used in the reduce phase.)
Now for some terminology.
- Transformation Functions: Spark functions that transform data but do not return a result to the user (.filter(), .groupBy(), etc.). Transformations are lazily executed; this means that they will only be executed when an action function is called.
- Action Functions: Spark functions that require something to be returned back to the user (.count(), .show(), etc.). They require the Spark application to be deployed and the data processing to actually take place so that a result can be returned.
- Spark Application: A set of instructions that describe a connection to the Spark cluster along with a sequence of transformation and action functions to process the data (i.e. your PySpark code).
- Spark Context: The connection you establish between your Spark application and the Spark cluster. (In PySpark you create this connection with something like spark = SparkSession.builder.appName('foo').getOrCreate().)
- Driver Process: This is the process that is responsible for sending your Spark Application to the cluster and receiving the results after the completion of the application.
- Cluster Manager: This is responsible for allocating resources to the executor processes and monitoring them in order to complete the tasks described in the Spark Application.
- Executor Process: These processes are responsible for actually reading/writing data and computing the transformation/action functions.
- Main Node: Where the cluster manager runs.
- Secondary Node: Where the executor processes run.
- Driver Node: Where the driver process runs.
Spark Application Execution
The process of running a simple query like df.filter(...).show() in your application goes something like this:
- The user writes their PySpark application which consists of resource requirements and transformations/actions on the data.
- This application is deployed on the driver, which creates a Spark Session, i.e. the connection to the Spark cluster. The driver could be on the user’s side (client/local mode) or on the Spark cluster (cluster mode).
- The driver takes your Spark application’s transformation and action functions and creates a DAG (directed acyclic graph) to determine the optimal computation procedure.
- The driver submits the work described by this DAG to the cluster manager.
- The cluster manager on the main node allocates the resources (to the executors on the secondary nodes) required to complete your job.
- The executors run the tasks required to complete your job by reading/writing data, shuffling information, and performing the transformations/actions requested by the user. When they complete, the result is sent back to the driver.
I’m big on visuals. So here’s my attempt at illustrating the workflow.