Apache Spark is one of the most in-demand Big Data processing frameworks today.

This course will take you through the core concepts of PySpark. We will work to enable you to do most of the things you'd do in SQL or with the Python Pandas library, namely:

  • Getting hold of data
  • Handling missing data and cleaning data up
  • Aggregating your data
  • Filtering it
  • Pivoting it
  • Writing it back

All of these things will enable you to leverage Spark on large datasets and start getting value from your data.

You can download the course dataset here.

THE COLAB DEVELOPMENT ENVIRONMENT
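
Google Colab runs a plain Python kernel, so PySpark usually has to be installed into the notebook first. A minimal sketch of a first Colab cell (some Colab images already ship PySpark, in which case the install is a no-op):

    # The leading "!" runs a shell command inside the notebook:
    # !pip install pyspark

    import pyspark
    print(pyspark.__version__)   # confirm the installation worked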

DATAFRAME & DATASET INTRODUCTION FOR SCENARIO
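
Before the scenario, it helps to see what a DataFrame actually is: a distributed table with named, typed columns (the strongly typed Dataset API belongs to Scala and Java; in Python we work with DataFrames). A toy sketch, assuming a SparkSession named spark is already available (see the configuration sketch below), with rows invented purely for illustration:

    # Build a tiny DataFrame from in-memory rows (not the course dataset)
    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 41)],
        ["name", "age"],
    )
    people.show()
    # +-----+---+
    # | name|age|
    # +-----+---+
    # |Alice| 34|
    # |  Bob| 41|
    # +-----+---+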

SPARK CONFIGURATION
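
Most configuration is supplied while the session is being built. A minimal sketch, assuming a local session; the app name and the shuffle-partition setting are placeholders, not course requirements:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # use every core on the local machine
        .appName("pyspark-course")   # placeholder app name
        .config("spark.sql.shuffle.partitions", "8")  # example tuning knob
        .getOrCreate()
    )

    # Inspect the settings the session actually ended up with
    print(spark.sparkContext.getConf().getAll())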

INGESTING & CLEANING OUR SCENARIO DATA

ANSWERING OUR SCENARIO QUESTIONS

CORE CONCEPTS: BRINGING DATA INTO DATAFRAMES
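
A typical ingest is a short call on spark.read. A sketch assuming a headered CSV file; the path is hypothetical:

    # Read a CSV with a header row, letting Spark infer column types
    df = spark.read.csv(
        "/content/sample_data.csv",   # hypothetical path
        header=True,
        inferSchema=True,
    )
    # spark.read.json(...) and spark.read.parquet(...) follow the same shape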

CORE CONCEPTS: INSPECTING DATAFRAMES
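
Once data is loaded, a few calls give a quick feel for it. A sketch against the df read above:

    df.printSchema()              # column names and inferred types
    df.show(5, truncate=False)    # first five rows, without truncating values
    print(df.count())             # total number of rows
    df.describe().show()          # summary statistics for numeric columns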

HANDLING NULL & DUPLICATE VALUES
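
The usual cleanup tools are dropna, fillna, and dropDuplicates. A sketch in which the column name and fill value are hypothetical:

    # Drop rows where every column is null, then fill remaining gaps
    cleaned = df.dropna(how="all")
    cleaned = cleaned.fillna({"city": "unknown"})   # hypothetical column

    # Remove exact duplicate rows (pass subset=[...] to compare fewer columns)
    deduped = cleaned.dropDuplicates()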

CORE CONCEPTS: SELECTING & FILTERING DATA
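
select projects columns and filter keeps matching rows, much like SELECT ... WHERE in SQL. A sketch with hypothetical column names:

    from pyspark.sql import functions as F

    result = (
        df.select("name", "age")        # choose columns
          .filter(F.col("age") > 30)    # keep matching rows
    )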

CORE CONCEPTS: APPLYING MULTIPLE FILTERS
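
Conditions combine with & (and), | (or), and ~ (not); because these are Python bitwise operators, each condition must sit in its own parentheses. A sketch with hypothetical columns and values:

    from pyspark.sql import functions as F

    over_30_in_london = df.filter(
        (F.col("age") > 30) & (F.col("city") == "London")
    )

    # Chained filter calls are AND-ed together, so this is equivalent:
    over_30_in_london = df.filter(F.col("age") > 30) \
                          .filter(F.col("city") == "London")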

CORE CONCEPTS: RUNNING SQL ON DATAFRAMES
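
Any DataFrame can be registered as a temporary view and queried with plain SQL. A sketch; the view name and columns are hypothetical:

    # Register the DataFrame as a view visible to spark.sql
    df.createOrReplaceTempView("people")

    over_30 = spark.sql("""
        SELECT name, age
        FROM people
        WHERE age > 30
    """)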

CORE CONCEPTS: ADDING CALCULATED COLUMNS
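
withColumn adds a new column (or replaces one of the same name) computed from existing columns. A sketch with hypothetical column names:

    from pyspark.sql import functions as F

    df2 = df.withColumn("age_in_months", F.col("age") * 12)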

CORE CONCEPTS: GROUP BY & AGGREGATION
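
groupBy followed by agg mirrors SQL's GROUP BY. A sketch with a hypothetical grouping column:

    from pyspark.sql import functions as F

    summary = (
        df.groupBy("city")
          .agg(
              F.count("*").alias("people"),
              F.avg("age").alias("avg_age"),
          )
          .orderBy(F.desc("people"))
    )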

CORE CONCEPTS: WRITING DATAFRAMES TO FILES
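
Writing follows the same shape as reading, via df.write. A sketch with hypothetical output paths; mode("overwrite") replaces any existing output:

    # Parquet is the usual choice: columnar, compressed, schema-preserving
    df.write.mode("overwrite").parquet("/content/output_parquet/")

    # CSV also works; coalesce(1) yields a single file for small results
    df.coalesce(1).write.mode("overwrite").csv(
        "/content/output_csv/",
        header=True,
    )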