How to use Broadcast variable in UDF in pyspark

Why UDF and Broadcast Variable

Diptiman Chakrabarti
Apr 28, 2020

Apache Spark is now widely used for ETL on Hadoop-based big data platforms and, in various flavors, in the cloud. Complex joins across multiple files or tables, along with transformations, are part and parcel of almost any Spark script. With such complex implementations, performance tuning in Spark has become the need of the hour.

Complex joins across multiple tables in Spark cause a lot of shuffling. Heavy shuffling hurts performance and sometimes causes memory issues too. Memory problems related to hashing and serialization are very common.

A user defined function (UDF) combined with a broadcast variable can help reduce the number of joins and thus improve performance.

What is Broadcast Variable and UDF

A broadcast variable is a read-only shared variable that is shipped once to every node in the cluster and cached there, so transformations and actions can refer to it whenever required. A broadcast variable is created on the driver from a local value; in this post that value is built from the contents of an RDD.

A user defined function (UDF) in Apache Spark provides custom functionality that is not available in Spark's built-in functions and is specific to the job's requirements.

If a UDF needs access to shared data in Apache Spark, a broadcast variable is the natural way to provide it. An accumulator, the other kind of shared variable, is meant for aggregating values from tasks back to the driver and cannot be read as a key-value lookup inside a task. A broadcast variable can hold key-value pairs, so its keys can be used for lookup inside the UDF, and the matching value can be returned by the UDF. The lookup runs row by row, like a filter or map transformation, and does not involve shuffling.

In this post I describe how a broadcast variable with composite (multi-column) keys can be used in a user defined function (UDF) in PySpark.

Data Set

I have used a dataset with the following fields: organization name, city, state, amount, and amount date.

The file is uploaded to GitHub. The link is shared at the end of the post.

The following is a screenshot of the main data set used (Pic 1).

Ref Pic 1 Main Data Set

The following is a screenshot of the lookup file that is used as the broadcast variable (Pic 2).

Ref Pic 2 Lookup file

Use Case

In real-life implementations, lookup files or reference data files can be used as broadcast variables. Using a lookup file as a broadcast variable together with a UDF can avoid a join. A broadcast variable can hold key-value pairs; the keys are what the lookup is done on.

In this use case, the main file contains organization name, state, city, salary and salary date (Ref: Pic 1).

The lookup file contains state, city, percentage of salary adjustment, and the adjustment date. The adjustment date replaces the salary date in the final output (Ref: Pic 2).

The following code first reads the lookup file into an RDD.

Ref: Code#1

“lookup” in the above code refers to the lookup file path. The file has a header, which the above code removes.

The code takes the (city, state) columns as a composite key and the (% of salary adjustment, adjustment date) columns as the value, creating key-value pairs in the RDD.

The following code converts the RDD to a broadcast variable.
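As a rough sketch of the steps described above (not the author's exact code), the header removal and key-value mapping could look like the following. It assumes the lookup file is a comma-separated file with the columns in the order state, city, percentage, date; the function names `parse_lookup_line` and `load_lookup_pairs` are hypothetical, and `sc` is an existing `SparkContext`.

```python
def parse_lookup_line(line):
    """Turn one CSV line into ((city, state), (pct, adj_date)).

    Assumed column order: state, city, percentage, adjustment date.
    """
    state, city, pct, adj_date = [field.strip() for field in line.split(",")]
    return ((city, state), (float(pct), adj_date))


def load_lookup_pairs(sc, lookup):
    """Read the file at path `lookup` into a pair RDD, dropping the header."""
    lines = sc.textFile(lookup)
    header = lines.first()  # the first line is assumed to be the header
    return lines.filter(lambda line: line != header).map(parse_lookup_line)
```

Keeping the parsing in a plain Python function makes it easy to check on a single line before running it on the cluster.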

Ref: Code#2
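A minimal sketch of this step, assuming `pair_rdd` is a pair RDD of `((city, state), (pct, adj_date))` entries as described above and `sc` is the active `SparkContext` (the function name `broadcast_lookup` is hypothetical). The RDD is first collected to the driver as a dictionary, because a broadcast variable is created from a local value, not from the RDD itself.

```python
def broadcast_lookup(sc, pair_rdd):
    """Collect a pair RDD into a dict on the driver and broadcast it."""
    # collectAsMap() pulls all pairs to the driver, so the lookup data
    # is assumed to be small enough to fit in driver memory.
    lookup_dict = pair_rdd.collectAsMap()
    return sc.broadcast(lookup_dict)
```

On the executors the dictionary is then available through the `.value` attribute of the returned broadcast object.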

Once the broadcast variable is created, it can be referred to within a UDF or directly by transformations. Broadcast variables can even be used directly as part of a join. The broadcast variable does not need to be passed as a parameter to the UDF; it can be referenced directly from inside the function.

Code#3 is a user defined function that updates the salary amount. It takes city, state and salary amount as input parameters. If the city and state are found in the broadcast variable, it returns the updated salary; otherwise it returns the original salary from the main data file.

Ref: Code#3
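One way such a salary UDF might be written is sketched below; it is an illustration, not the author's code. It assumes the broadcast value is a dictionary keyed by `(city, state)` with `(pct, adj_date)` values, and that the percentage is applied as an increase on top of the salary. The names `adjust_salary` and `make_salary_udf` are hypothetical.

```python
def adjust_salary(city, state, salary, lookup):
    """Return the adjusted salary if (city, state) is in `lookup`, else the original."""
    if salary is None:
        return None
    salary = float(salary)  # always return a float to match the "double" return type
    entry = lookup.get((city, state))
    if entry is None:
        return salary
    pct, _adj_date = entry
    # Assumption: the percentage is an increase applied to the salary.
    return salary + salary * pct / 100.0


def make_salary_udf(bc_lookup):
    """Wrap adjust_salary in a Spark UDF that reads the broadcast dictionary."""
    from pyspark.sql.functions import udf  # imported here so the logic above needs no Spark

    # The lambda refers to bc_lookup through its closure; the broadcast
    # variable is not passed as a column argument.
    return udf(
        lambda city, state, salary: adjust_salary(city, state, salary, bc_lookup.value),
        "double",
    )
```

The returned UDF can then be applied with something like `df.withColumn("salary", salary_udf("city", "state", "salary"))`, where the column names are again assumptions.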

The following user defined function updates the salary date. It looks up the city and state in the broadcast variable. If the city and state are found, it returns the date from the broadcast variable; otherwise it returns the original date from the data file.

Ref: Code#4
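A comparable sketch for the date lookup, under the same assumptions as before: the broadcast value is a dictionary keyed by `(city, state)` with `(pct, adj_date)` values, and dates are handled as plain strings. The names `adjust_date` and `make_date_udf` are hypothetical.

```python
def adjust_date(city, state, salary_date, lookup):
    """Return the adjustment date if (city, state) is in `lookup`, else the original date."""
    entry = lookup.get((city, state))
    if entry is None:
        return salary_date
    _pct, adj_date = entry
    return adj_date


def make_date_udf(bc_lookup):
    """Wrap adjust_date in a Spark UDF that reads the broadcast dictionary."""
    from pyspark.sql.functions import udf  # imported here so the logic above needs no Spark

    return udf(
        lambda city, state, salary_date: adjust_date(
            city, state, salary_date, bc_lookup.value
        ),
        "string",
    )
```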

The main code, which uses the user defined functions, is as follows. It first creates a data frame from the data file, then registers the data frame as a temporary view, and finally runs Spark SQL queries on that view.

Ref: Code#5
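The flow just described (data frame, temporary view, Spark SQL with UDFs) might be sketched as follows. Everything named here is an assumption for illustration: the view name `salary_data`, the column names `organization`, `state`, `city`, `salary`, `salary_date`, the helper names, and the treatment of the percentage as an increase. `spark` is an existing `SparkSession` and `bc_lookup` a broadcast dictionary keyed by `(city, state)`.

```python
def adjust_salary(city, state, salary, lookup):
    """Adjusted salary if (city, state) is in `lookup`, else the original."""
    if salary is None:
        return None
    salary = float(salary)
    entry = lookup.get((city, state))
    return salary if entry is None else salary + salary * entry[0] / 100.0


def adjust_date(city, state, salary_date, lookup):
    """Adjustment date if (city, state) is in `lookup`, else the original date."""
    entry = lookup.get((city, state))
    return salary_date if entry is None else entry[1]


# Hypothetical view and column names; the real file's header may differ.
QUERY = """
SELECT organization, state, city,
       adjust_salary(city, state, CAST(salary AS DOUBLE)) AS salary,
       adjust_date(city, state, salary_date) AS salary_date
FROM salary_data
"""


def build_final_df(spark, datafile, bc_lookup):
    """Read the main file, register the UDFs, and run the lookup query."""
    # Without inferSchema every column is read as a string; the query casts salary.
    df = spark.read.csv(datafile, header=True)
    df.createOrReplaceTempView("salary_data")

    # Register the Python functions so they can be called from SQL.
    spark.udf.register(
        "adjust_salary",
        lambda c, s, amt: adjust_salary(c, s, amt, bc_lookup.value),
        "double",
    )
    spark.udf.register(
        "adjust_date",
        lambda c, s, d: adjust_date(c, s, d, bc_lookup.value),
        "string",
    )
    return spark.sql(QUERY)
```

Note that no join appears in the query: the lookup happens row by row inside the two UDFs.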

Finally, the result is saved to the specified location. “finalfile” in the following code snippet is the target path.

Ref: Code#6
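A minimal sketch of the save step, assuming the result is a DataFrame and that CSV with a header row is an acceptable output format (the post does not state the format, and the function name `save_final` is hypothetical):

```python
def save_final(final_df, finalfile):
    """Write the result as CSV files with a header row under the path `finalfile`."""
    # Spark writes a directory of part files at `finalfile`, not a single file.
    # With the default save mode the write fails if the path already exists.
    final_df.write.option("header", "true").csv(finalfile)
```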

The above code shows how composite keys in a broadcast variable can be used and how the broadcast variable can be referred to within a UDF.

Git link for code:

https://github.com/diptimanchakrabarti/broadcastvariable.git
