Building data quality checks in your PySpark data pipelines
Data quality is a critical part of any production data pipeline. To provide accurate SLA metrics
and to ensure that the data is correct, you need a way to validate the data and report the resulting metrics
for further analysis. In this post, we will look at how to build data quality checks in your PySpark data pipelines.
Exploring Delta Live Tables
Delta Live Tables is a new feature in Databricks that lets users build reliable data pipelines with built-in
data quality metrics and monitoring. It is a new abstraction on top of Delta Lake that lets users query the
data through streaming live tables, which are updated in real time as the underlying data changes. What caught my eye
was the data quality capabilities that users can specify at the dataset level. Using Python decorators we can declare
expectations: @expect, @expect_or_drop, and @expect_or_fail take an expectation name and a constraint, while the
@expect_all, @expect_all_or_drop, and @expect_all_or_fail variants accept a Python dictionary as an argument,
where the key is the expectation name and the value is the expectation constraint. Example:
@dlt.expect("valid timestamp", "col(“timestamp”) > '2012-01-01'")
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "count > 0")
Metrics for clean and failed records are collected automatically and stored in the Delta Live Tables metadata, so
users can set up alerts and monitor the data quality of their pipelines.
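For context, here is a minimal sketch of how these decorators attach to a dataset definition; the source table clickstream_raw and the column names are purely illustrative:

import dlt

@dlt.table(comment="Clickstream events with basic quality gates")
@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
def clickstream_clean():
    # read the raw streaming live table; the expectations above do the filtering
    return dlt.read_stream("clickstream_raw")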
Delta Live Tables, however, still has quite a few limitations and is not yet ready for production use. Some of them:
1. The data quality checks are only available for streaming live tables, not for batch tables. We can still create
streaming tables from batch tables, but if the version of your data changes, the pipeline will fail.
2. Lack of testing capabilities. There is no way to test the data quality checks in a local environment, because
the dlt package is only available in the Databricks runtime.
3. Lack of documentation. The documentation is very limited and it is not clear how to use the data quality checks.
Currently only the Python and SQL APIs are supported.
4. Setting up a DLT job doesn't support all the parameters that are available in a regular Databricks job.
Building your own data quality checks as Python decorators
To overcome the limitations of Delta Live Tables, we can build our own data quality checks as Python decorators.
The idea is to create a decorator that accepts a list of conditions, each one a constraint applied to a given column.
We will collect all the necessary metrics into a separate DataFrame so they can be stored alongside the data in
Delta Lake.
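To make the goal concrete, this is roughly the usage we are aiming for; read_users and user_id are just illustrative names:

expectation = Expectations(spark)

@expectation.expect_or_drop([is_not_null("user_id"), is_unique("user_id")])
def read_users(df):
    # the decorator applies the conditions to whatever DataFrame the function returns
    return df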
We will start by building two simple conditions for our data: uniqueness and filtering based on a predicate.
from abc import ABC, abstractmethod

class ColumnCondition(ABC):
    @abstractmethod
    def get_cols(self):
        pass

class UniqueCondition(ColumnCondition):
    def __init__(self, col):
        self.col = col

    def get_cols(self):
        return self.col

class FilterCondition(ColumnCondition):
    def __init__(self, left_col, right_col):
        self.left_col = left_col
        self.right_col = right_col

    def get_cols(self):
        # the left expression keeps the clean rows, the right one matches the rows to drop
        return self.left_col, self.right_col

def is_not_null(col):
    return FilterCondition(col + " is not null", col + " is null")

def is_unique(col):
    return UniqueCondition(col)
The main idea is that each helper function takes a column name and returns a condition object, which we pass as a
decorator argument. We can then pattern match on the condition type and apply the corresponding check.
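For example, is_not_null produces a FilterCondition holding two SQL expressions: the first one keeps the clean rows, the second one matches the rows that will be dropped (user_id is just an illustrative column name):

cond = is_not_null("user_id")
cond.get_cols()  # ('user_id is not null', 'user_id is null')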
We start by creating a simple Python decorator using functools.wraps:
def expect_or_drop(self, conditions: List[ColumnCondition]):
    def decorator(function):
        @wraps(function)
        def wrapper(*args, **kwargs):
            retval = function(*args, **kwargs)
            # apply the conditions to the DataFrame returned by the wrapped function
            return retval
        return wrapper
    return decorator
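A short aside on why @wraps is used, with a toy decorator (identity_decorator is a made-up name): it copies the wrapped function's metadata onto the wrapper.

from functools import wraps

def identity_decorator(function):
    @wraps(function)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)
    return wrapper

@identity_decorator
def read_dataframe(df):
    return df

print(read_dataframe.__name__)  # prints 'read_dataframe' instead of 'wrapper'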
We will create an Expectations class that contains all the data quality checks. The rsd argument represents the
maximum relative standard deviation allowed for the approx_count_distinct function (see the Spark documentation
for details).
class Expectations:
def __init__(self, spark: SparkSession, rsd=0.05):
self.spark = spark
self.schema = StructType([StructField("condition", StringType(), True),
StructField("dropped_records", IntegerType(), True),
StructField("clean_records", IntegerType(), True)])
emptyRDD = spark.sparkContext.emptyRDD()
self.metrics = spark.createDataFrame(emptyRDD, schema=self.schema)
self.rsd = rsd
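As a quick illustration of the rsd trade-off (a lower value gives a more precise estimate at a higher cost), assuming a DataFrame df with a user_id column:

import pyspark.sql.functions as F

# approximate distinct count with a maximum relative standard deviation of 5%
df.select(F.approx_count_distinct("user_id", rsd=0.05).alias("distinct_users")).show()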
The metrics dataframe will contain the metrics for each data quality check. We can proceed to create our filtering
and uniqueness checks:
def apply_condition(self, dataframe, condition):
    if isinstance(condition, FilterCondition):
        return self.filter_condition(dataframe, condition.get_cols())
    elif isinstance(condition, UniqueCondition):
        return self.is_unique_extend(dataframe, condition.get_cols())
    return dataframe

def filter_condition(self, dataframe: DataFrame, left_right) -> DataFrame:
    left, right = left_right
    total_records = dataframe.count()
    dropped_records = dataframe.filter(right).count()
    # record the metrics for this condition before filtering
    df = self.spark.createDataFrame([(left, dropped_records, (total_records - dropped_records))], schema=self.schema)
    self.metrics = self.metrics.unionAll(df)
    return dataframe.filter(left)

def is_unique_extend(self, dataframe: DataFrame, col) -> DataFrame:
    total_records = dataframe.select(F.col(col)).count()
    # approximate distinct count, precision controlled by rsd
    distinct_records = dataframe.select(F.approx_count_distinct(col, self.rsd)).collect()[0][0]
    dropped_records = total_records - distinct_records
    df = self.spark.createDataFrame([(col + " is unique", dropped_records, distinct_records)], schema=self.schema)
    self.metrics = self.metrics.unionAll(df)
    return dataframe.dropDuplicates([col])
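These methods can also be used on their own, without the decorator; for instance, assuming a raw_df DataFrame with a user_id column:

expectation = Expectations(spark)
clean_df = expectation.apply_condition(raw_df, is_not_null("user_id"))
expectation.metrics.show()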
To apply the conditions, we call the apply_condition function for every condition in the list. For that we use
functools.reduce as a foldLeft:
foldl = lambda func, acc, xs: reduce(func, xs, acc)
@wraps(function)
def wrapper(*args, **kwargs):
retval = function(*args, **kwargs)
return foldl(self.apply_condition, retval, conditions)
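The only subtlety is the argument order: functools.reduce takes the iterable second and the initial accumulator third, so wrapping it as foldl makes the intent explicit. A tiny standalone example of the fold:

from functools import reduce

foldl = lambda func, acc, xs: reduce(func, xs, acc)

# starts from 0 and folds the list from the left: ((0 + 1) + 2) + 3
print(foldl(lambda acc, x: acc + x, 0, [1, 2, 3]))  # 6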
Wrapping it all together:
from abc import ABC, abstractmethod
from functools import reduce, wraps
from typing import List

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# foldLeft: fold xs from the left, starting from the accumulator acc
foldl = lambda func, acc, xs: reduce(func, xs, acc)

class ColumnCondition(ABC):
    @abstractmethod
    def get_cols(self):
        pass

class UniqueCondition(ColumnCondition):
    def __init__(self, col):
        self.col = col

    def get_cols(self):
        return self.col

class FilterCondition(ColumnCondition):
    def __init__(self, left_col, right_col):
        self.left_col = left_col
        self.right_col = right_col

    def get_cols(self):
        # the left expression keeps the clean rows, the right one matches the rows to drop
        return self.left_col, self.right_col

def is_not_null(col):
    return FilterCondition(col + " is not null", col + " is null")

def is_unique(col):
    return UniqueCondition(col)

class Expectations:
    def __init__(self, spark: SparkSession, rsd=0.05):
        self.spark = spark
        self.schema = StructType([StructField("condition", StringType(), True),
                                  StructField("dropped_records", IntegerType(), True),
                                  StructField("clean_records", IntegerType(), True)])
        emptyRDD = spark.sparkContext.emptyRDD()
        self.metrics = spark.createDataFrame(emptyRDD, schema=self.schema)
        self.rsd = rsd

    def expect_or_drop(self, conditions: List[ColumnCondition]):
        def decorator(function):
            @wraps(function)
            def wrapper(*args, **kwargs):
                retval = function(*args, **kwargs)
                # fold every condition over the returned DataFrame
                return foldl(self.apply_condition, retval, conditions)
            return wrapper
        return decorator

    def apply_condition(self, dataframe, condition):
        if isinstance(condition, FilterCondition):
            return self.filter_condition(dataframe, condition.get_cols())
        elif isinstance(condition, UniqueCondition):
            return self.is_unique_extend(dataframe, condition.get_cols())
        return dataframe

    def filter_condition(self, dataframe: DataFrame, left_right) -> DataFrame:
        left, right = left_right
        total_records = dataframe.count()
        dropped_records = dataframe.filter(right).count()
        df = self.spark.createDataFrame([(left, dropped_records, (total_records - dropped_records))], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.filter(left)

    def is_unique_extend(self, dataframe: DataFrame, col) -> DataFrame:
        total_records = dataframe.select(F.col(col)).count()
        # approximate distinct count, precision controlled by rsd
        distinct_records = dataframe.select(F.approx_count_distinct(col, self.rsd)).collect()[0][0]
        dropped_records = total_records - distinct_records
        df = self.spark.createDataFrame([(col + " is unique", dropped_records, distinct_records)], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.dropDuplicates([col])
We can now write a simple test to check both the resulting DataFrame and the collected metrics, using a small input DataFrame:
def test_multiple_conditions(spark):
df_1 = spark.createDataFrame([("row1", 1), ("row1", 2), (None, 3)], ["row", "row_number"])
expectation = Expectations(spark)
@expectation.expect_or_drop([is_not_null("row"), is_unique("row")])
def read_dataframe(df):
return df
result = read_dataframe(df_1)
print(result.collect())
print(expectation.metrics.collect())
The console will print:
[Row(row='row1', row_number=1)]
[Row(condition='row is not null', dropped_records=1, clean_records=2), Row(condition='row is unique',
dropped_records=1, clean_records=1)]
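Since the idea was to keep the metrics next to the data, the metrics DataFrame can also be persisted. A minimal sketch, assuming a Delta-enabled Spark session and a hypothetical data_quality_metrics table:

import pyspark.sql.functions as F

(expectation.metrics
    .withColumn("run_timestamp", F.current_timestamp())  # tag each pipeline run
    .write.format("delta")
    .mode("append")
    .saveAsTable("data_quality_metrics"))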
We can also extend our Expectations class with some plotting functions. The example below uses matplotlib:
import matplotlib.pyplot as plt

def plot_pie_with_total(self, figsize=(10, 10)):
    labels = ["clean_records", "dropped_records"]
    df = self.metrics.toPandas()
    size = len(df.index)
    if size == 1:
        # a single condition gets a single pie chart
        fig, axs = plt.subplots(1)
        axs.pie(df.iloc[0][labels], labels=labels, autopct='%1.1f%%')
        axs.set_title(df.iloc[0]["condition"])
    else:
        # one pie chart per condition, side by side
        fig, axs = plt.subplots(1, size, figsize=figsize)
        for i in range(size):
            axs[i].pie(df.iloc[i][labels], labels=labels, autopct='%1.1f%%')
            axs[i].set_title(df.iloc[i]["condition"])
    plt.show()
That will generate a pie chart with the metrics for every condition.
Happy coding and stay safe!