Creating Spark Dataframes without a SparkSession for tests

Tuesday 26 March 2019

Back to the scheduled .NET content after this brief diversion into… Java.

I’m currently helping a team put some tests around a Spark application, and one of the big bugbears is testing raw data transformations and functions that’ll run inside the Spark cluster without actually being inside one. It turns out that most of the core Spark types hang off a SparkSession and can’t really be constructed manually - something a quick StackOverflow query appears to confirm. People just can’t seem to create Spark DataFrames outside of a Spark session.

Except you can.

All a Spark DataFrame really is, is a schema and a collection of Rows - so with a little bit of digging, you realise that if you can only create a row and a schema, everything’ll be alright. So you dig, and you discover no public constructors and no obvious ways to create the test data you need.

Unless you apply a little bit of reflection magic, and then you can trivially create a schema with some data rows.
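Here is a minimal sketch of the reflection trick itself, using only the JDK. `LockedRow` is a hypothetical stand-in for a type with no public constructor, not one of Spark’s own classes, and `createRow` is an illustrative helper name. The same `getDeclaredConstructor` / `setAccessible` / `newInstance` sequence is what you’d point at the real type, after checking its constructor signature against your Spark version.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a library type that hides its constructor.
// It is NOT a Spark class; it only mimics "a schema plus some values".
final class LockedRow {
    private final List<String> fieldNames;
    private final Object[] values;

    // Private: ordinary code outside this class cannot call `new LockedRow(...)`.
    private LockedRow(List<String> fieldNames, Object[] values) {
        this.fieldNames = fieldNames;
        this.values = values;
    }

    // Look a value up by field name, the way a schema-aware row would.
    Object get(String fieldName) {
        return values[fieldNames.indexOf(fieldName)];
    }
}

class ReflectionRowSketch {

    // Illustrative helper: builds a LockedRow through its private constructor.
    static LockedRow createRow(List<String> fieldNames, Object... values) {
        try {
            // Find the non-public constructor by its parameter types.
            Constructor<LockedRow> ctor =
                    LockedRow.class.getDeclaredConstructor(List.class, Object[].class);
            // Lift the access check so the private constructor can be invoked.
            ctor.setAccessible(true);
            // Cast the array to Object so it is passed as ONE argument,
            // not spread across the varargs of newInstance.
            return ctor.newInstance(fieldNames, (Object) values);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not construct row reflectively", e);
        }
    }

    public static void main(String[] args) {
        LockedRow row = createRow(Arrays.asList("id", "name"), 1, "alice");
        System.out.println(row.get("name")); // prints "alice"
    }
}
```

Wrapping the reflective call in one small factory method keeps the ugliness in a single place, so the tests themselves just ask for rows.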

Copy pasta to your heart’s content. Test that Spark code; it’s not going to test itself.