PySpark on Google Colab

Data correlation and clustering in PySpark

Much of today’s computation runs on cloud infrastructure, and a large share of it relies on Hadoop and Apache Spark. In this tutorial, I will show how to set up a Spark cluster in Google Colab, including a web UI exposed through ngrok, by installing PySpark from PyPI and linking it to your Google Drive data. I will then show how to apply simple transformations and actions to get basic statistics such as data correlation. Finally, I will show how to use PySpark’s ML libraries to perform unsupervised clustering on time series data and to pick a suitable number of clusters by combining the Silhouette and Elbow methods. The Google Colab code is heavily documented, so I urge you to read it carefully, as it explains each step in detail.
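
To make the correlation statistic concrete, here is a minimal plain-Python sketch of the Pearson coefficient. It is not taken from the Colab notebook: the readings and the names `dorm_a`, `dorm_b`, and `lecture_hall` are invented for illustration. In PySpark the same statistic is available on a DataFrame through `df.stat.corr(col1, col2)`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two series around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly readings in kWh (invented for illustration)
dorm_a = [1.0, 2.0, 3.0, 4.0, 5.0]
dorm_b = [2.1, 3.9, 6.2, 8.0, 9.8]        # rises together with dorm_a
lecture_hall = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as dorm_a rises

r_ab = pearson(dorm_a, dorm_b)
r_al = pearson(dorm_a, lecture_hall)
print(f"dorm_a vs dorm_b:       {r_ab:+.3f}")
print(f"dorm_a vs lecture_hall: {r_al:+.3f}")
```

A value close to +1 means the two series move together, a value close to -1 means they move in opposite directions, and a value near 0 means there is no linear dependency.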

You may wonder why I have chosen to show both simple data correlation and time series ML clustering in one post. Well, I wanted you to see the different aspects of Spark from simple to more complex, and to explain some particularities along the way.

In fact, data correlation and clustering are linked, as both try to determine relations in the data. When correlating data (features), we look for dependencies between features, whereas when clustering we look for features with similar properties. Both matter when dealing with consumption time series. Here are some use cases based on work I have done in the past:


  • Determining causal patterns: when dealing with microgrids, for instance, we often want to see how one consumer impacts another. One example I have encountered was a campus microgrid where consumption in the dorms was affected by student timetables and class attendance.
  • Forecast: linked to causality is the problem of forecasting consumption. Sometimes you can use causal patterns to forecast a customer’s consumption. The research group I was part of at USC published papers on the topic of an Influence-Driven Model for Time Series Prediction from Partial Observations.


  • Forecast: as the number of customers grows very large and pushes the scalability of the ML to its limits, it is sometimes possible to forecast the consumption of a group of similar customers instead of each individual customer. The group forecast will closely match the individual forecasts because the customers in the group are similar. In a previous research paper, part of an R&D tech transfer project for Elster (now Honeywell), we investigated using cluster information to predict individual customer consumption.
  • Data analysis: grouping customers together allows the utility to target them better, for example through Demand Response policies or personalized offers. One of our past research papers looked into grouping customers based on novel features such as shapelets.
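
Grouping customers raises the question of how many groups to use. The sketch below shows one possible way to combine the Silhouette and Elbow methods; it is not necessarily the procedure used in the Colab notebook. It is plain Python on invented data: the `customers` values, the farthest-point seeding, and the second-difference elbow rule are all assumptions made for illustration. In PySpark, the corresponding quantities come from `pyspark.ml.clustering.KMeans` (the fitted model’s `summary.trainingCost`) and `pyspark.ml.evaluation.ClusteringEvaluator`, whose default silhouette uses squared Euclidean distance, whereas this sketch uses plain Euclidean distance.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=50):
    """Lloyd's k-means; returns (labels, within-cluster sum of squared distances)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, cost


def mean_silhouette(points, labels):
    """Average silhouette score using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from p to the other points in it
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.sqrt(sq_dist(p, q)))
        own = dists.pop(labels[i], [])
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or single cluster: 0 by convention
            continue
        a = sum(own) / len(own)                            # cohesion
        b = min(sum(d) / len(d) for d in dists.values())   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical customers described by (morning kWh, evening kWh); values invented
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 8), (1, 9)]

ks = [2, 3, 4, 5]
results = {k: kmeans(customers, k) for k in ks}
cost = {k: results[k][1] for k in ks}
silhouette = {k: mean_silhouette(customers, results[k][0]) for k in ks}

# Silhouette method: pick the k with the highest average score
best_silhouette_k = max(ks, key=lambda k: silhouette[k])
# Elbow heuristic: the k where the cost curve bends most sharply
elbow_k = max(ks[1:-1], key=lambda k: cost[k - 1] - 2 * cost[k] + cost[k + 1])

for k in ks:
    print(f"k={k}  cost={cost[k]:.2f}  silhouette={silhouette[k]:.3f}")
print("silhouette pick:", best_silhouette_k, "| elbow pick:", elbow_k)
```

On this toy data both criteria select k = 3. When the two disagree on real data, that is a sign the cluster structure is weak, and the candidate values of k deserve a closer look.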
