This is an alert component

schema1

Content about schema definition, validation and management for data processing.

July 31, 2022⋅5 min read

Improve your PySpark ETL's performance by providing explicit schema

Have you ever stumbled upon a Spark ETL and you were left wondering how a simple loading of a dataset can take hours, even though the filtered dataset you are specifying is relatively small?

Dan

schema1

Improve your PySpark ETL's performance by providing explicit schema

Recommended

Building data quality checks in your pySpark data pipelines

Integration Testing for your Databricks CI/CD Data Pipelines with Microsoft Nutter

Automate all your PySpark Unit Test with Hypothesis!

Subscribe to new posts