Friday, July 07, 2017

How Mutable DataFrames Improve Join Performance in Spark SQL

Recently, a user wrote in to the Spark mailing list asking how to refresh data in a Spark DataFrame without restarting the application. The user stated:

“We have a Structured Streaming application that gets [credit card] accounts from Kafka into a streaming data frame. We have a blacklist of accounts stored in S3 and we want to filter out all the accounts that are blacklisted. So, we are loading the blacklisted accounts into a batch data frame and joining it with the streaming data frame to filter out the bad accounts. ... We wanted to cache the blacklist data frame to prevent going out to S3 every time. Since the blacklist might change, we want to be able to refresh the cache at a cadence, without restarting the whole app.”

This application makes perfect sense. A credit card issuer is liable for charges made on cards that are stolen, misplaced, or otherwise misused. In 2012, unauthorized/fraudulent credit card transactions cost banks $6.1 billion. It is in the credit card issuer’s interest to ensure that transactions on a blacklisted card are caught as soon as the card has been flagged.
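
To make the setup concrete, here is a minimal sketch of the approach the asker describes: a streaming DataFrame of accounts read from Kafka, joined against a cached batch DataFrame of blacklisted accounts loaded from S3. It is illustrative only; the broker address, Kafka topic, S3 path, and account_id column are assumptions made for the example, not details from the original question.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object BlacklistFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BlacklistFilter")
      .getOrCreate()

    // Streaming DataFrame of incoming account activity from Kafka.
    // The broker address, topic name, and payload layout are assumptions.
    val accounts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "transactions")
      .load()
      .selectExpr("CAST(value AS STRING) AS account_id")

    // Batch DataFrame of blacklisted accounts, loaded from S3 and cached
    // so the stream-static join does not go back to S3 on every micro-batch.
    val blacklist: DataFrame = spark.read
      .option("header", "true")
      .csv("s3a://example-bucket/blacklist/")   // hypothetical path
      .select("account_id")
      .withColumn("is_blacklisted", lit(true))
      .cache()

    // Left outer join the stream against the static blacklist and keep only
    // the rows with no match, i.e. accounts that are not blacklisted.
    val cleaned = accounts
      .join(blacklist, Seq("account_id"), "left_outer")
      .filter(col("is_blacklisted").isNull)
      .drop("is_blacklisted")

    val query = cleaned.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}

Note that because the blacklist is loaded and cached once, simply re-reading the S3 data after the streaming query has started does not change what the running query sees, which is precisely why the asker wants a way to refresh the cached blacklist on a cadence without restarting the whole application.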
