MIZZOU-CIS-WAYLON

Citation Author(s): Wenlong Wu (University of Missouri, Columbia)
Submitted by: Wenlong Wu
Last updated: Sat, 11/14/2020 - 23:45

Abstract

This is my final submission for the second IEEE-CIS technical challenge on energy prediction from smart meter data. My name is Wenlong Wu, and I am a PhD student in ECE at the University of Missouri, supervised by Dr. James Keller. I completed this competition on my own, without teammates. I would like to thank the IEEE-CIS committee and Dr. Isaac Triguero for their efforts in organizing this competition.

This competition is a time-series regression problem. There are usually two ways to approach such a problem: statistical modeling and machine learning modeling. We are provided with historical half-hourly energy readings for 3248 smart meters. However, data availability differs across meters, ranging from only the last month (Dec.) to the entire year (Jan. to Dec.). It is difficult to predict the whole of the next year for a meter when only Dec. data is available, so statistical modeling of each smart meter individually has limited ability to solve this problem. I therefore took the machine learning approach in this competition.

My solution has five steps: 1) data preprocessing, 2) feature engineering, 3) modeling with the Light Gradient Boosting Machine (LightGBM) algorithm, 4) post processing, and 5) ensembling. I briefly introduce these steps in this abstract and will provide more details if I am selected for the final shortlist.

1) data preprocessing: We are provided with half-hourly meter readings in 2017 for the 3248 smart meters but need to make month-level predictions, so in my understanding the time-of-day information is not that important. I decided to sum each day's meter readings and predict at the day level. The training data has many missing values, so I discard most zero and NaN readings by removing them from training when the readings across a 3-day window are all zero. Such readings arise because the meter was not yet deployed or was broken for some reason, which is common in the real world. I also did some manual data cleaning for a few corner cases, since no single approach is perfect.
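
The aggregation and filtering described in this step can be sketched as follows. This is an illustrative sketch, not the code used for the submission: the function names `daily_totals` and `drop_dead_windows` are hypothetical, and it assumes that "3-day window" means a run of three or more consecutive days whose daily totals are all zero, with NaN readings counted as zero.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def daily_totals(readings):
    """Sum (timestamp, kWh) half-hourly readings into one total per day.

    NaN readings count as 0.0, so a day with only NaN readings totals 0.0.
    """
    totals = defaultdict(float)
    for ts, value in readings:
        totals[ts.date()] += 0.0 if math.isnan(value) else value
    return dict(sorted(totals.items()))


def drop_dead_windows(totals, window=3):
    """Drop every day lying in a run of `window` or more consecutive zero days."""
    days = sorted(totals)
    dropped = set()
    run = []  # current run of consecutive zero-total days
    for day in days:
        is_zero = totals[day] == 0.0
        if is_zero and run and day - run[-1] == timedelta(days=1):
            run.append(day)
        else:
            if len(run) >= window:
                dropped.update(run)
            run = [day] if is_zero else []
    if len(run) >= window:
        dropped.update(run)
    return {d: totals[d] for d in days if d not in dropped}


# Toy data for one meter: three dead days, then normal operation.
readings = [(datetime(2017, 1, d), 0.0) for d in (1, 2, 3)]
readings.append((datetime(2017, 1, 1, 0, 30), float("nan")))
readings += [
    (datetime(2017, 1, 4, 0, 0), 1.5),
    (datetime(2017, 1, 4, 0, 30), 2.5),
    (datetime(2017, 1, 5), 0.0),  # isolated zero day: kept
    (datetime(2017, 1, 6), 3.0),
]
cleaned = drop_dead_windows(daily_totals(readings))
```

In this toy example the first three days are removed as a dead window, while the single zero day on Jan. 5 is kept because it is not part of a 3-day run.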

2) feature engineering: Besides the historical meter readings in 2017, we are also provided with weather data for 2017 (training data). However, weather data for 2018 (testing data) is not provided, so a machine learning model cannot use this information at prediction time. In this competition, I only used time-related features, including "day of week", "day of month", and "month". I apply cyclical feature encoding by taking the sine and cosine of the month value so that the encodings of Jan. and Dec. are similar.
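
A minimal sketch of these time features is shown below. The function name `time_features` and the feature keys are hypothetical, and the sketch assumes a period of 12 with Jan. mapped to angle zero; the exact scaling used in the submission is not stated.

```python
import math
from datetime import date


def time_features(day):
    """Return calendar features for one day, with the month encoded cyclically."""
    # Map months 1..12 onto a circle so that Dec. and Jan. are neighbours.
    angle = 2.0 * math.pi * (day.month - 1) / 12.0
    return {
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
        "day_of_month": day.day,
        "month": day.month,
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
    }


def month_distance(a, b):
    """Euclidean distance between the (sin, cos) month encodings of two days."""
    fa, fb = time_features(a), time_features(b)
    return math.hypot(fa["month_sin"] - fb["month_sin"],
                      fa["month_cos"] - fb["month_cos"])


jan, feb, jul, dec = (date(2017, m, 15) for m in (1, 2, 7, 12))
```

With this encoding, `month_distance(jan, dec)` equals `month_distance(jan, feb)` and is much smaller than `month_distance(jan, jul)`, which is the property the raw month number (1 versus 12) lacks.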

3) modeling: I ran a t-SNE visualization on the month-level training data and found 12 clusters using the fuzzy c-means (FCM) algorithm. I therefore built models both on all meters together (whole level) and on each cluster of meters (cluster level). The regression model I used in this competition is the Light Gradient Boosting Machine (LightGBM), which was developed by Microsoft in 2016.
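
For readers unfamiliar with fuzzy c-means, the standard update rules can be sketched in plain Python as below. This is a generic textbook version, not the implementation used for the submission: the fuzzifier `m=2.0`, the fixed iteration count, and the deterministic initialization from evenly spaced input points are all assumptions of the sketch, and the function names are hypothetical.

```python
def _memberships(points, centers, m):
    """Membership of each point in each cluster; every row sums to 1."""
    power = -2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Euclidean distance to each center, floored to avoid division by zero.
        dists = [max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                 for c in centers]
        inv = [d ** power for d in dists]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    n, dim = len(points), len(points[0])
    # Deterministic start: evenly spaced input points (an assumption of this sketch).
    centers = [list(points[i * n // n_clusters]) for i in range(n_clusters)]
    for _ in range(n_iter):
        u = _memberships(points, centers, m)
        for i in range(n_clusters):
            # Each center is the mean of all points weighted by membership ** m.
            w = [row[i] ** m for row in u]
            w_sum = sum(w)
            centers[i] = [sum(wk * p[j] for wk, p in zip(w, points)) / w_sum
                          for j in range(dim)]
    return centers, _memberships(points, centers, m)


# Toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, u = fuzzy_c_means(points, n_clusters=2)
labels = [row.index(max(row)) for row in u]  # hard label = highest membership
```

Taking the cluster with the highest membership, as in `labels`, is one simple way to assign each meter to a single cluster for cluster-level modeling; the soft memberships themselves could also be used.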

4) post processing: Some meters may also have missing data in the testing period, so I predict zero for a meter if its last three days of readings in 2017 are all zero, since such a meter has likely failed for some reason, such as a depleted battery.
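
This rule can be sketched as a small override applied per meter. The function name `apply_zero_rule` and its arguments are hypothetical, and the sketch assumes the rule zeroes out every prediction for the affected meter.

```python
def apply_zero_rule(predictions, daily_readings_2017, window=3):
    """Return all-zero predictions if the last `window` daily readings are zero."""
    tail = daily_readings_2017[-window:]
    if len(tail) == window and all(value == 0 for value in tail):
        return [0.0] * len(predictions)
    return list(predictions)


dead_meter = apply_zero_rule([120.0, 95.5], [4.2, 0.0, 0.0, 0.0])
live_meter = apply_zero_rule([120.0, 95.5], [0.0, 0.0, 3.1])
```

Here `dead_meter` is replaced by zeros, while `live_meter` keeps its model predictions because its final reading is non-zero.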

5) ensemble: I ensembled the whole-level prediction, the cluster-level prediction, and a Nov-Dec mean prediction.
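
One simple way to combine three such predictions is a weighted average, sketched below. The blending weights actually used are not stated here, so the equal weights in this sketch are an assumption, and the function name `blend` is hypothetical.

```python
def blend(whole_level, cluster_level, nov_dec_mean, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three aligned prediction lists (equal weights assumed)."""
    w_whole, w_cluster, w_mean = weights
    total = w_whole + w_cluster + w_mean
    return [
        (w_whole * a + w_cluster * b + w_mean * c) / total
        for a, b, c in zip(whole_level, cluster_level, nov_dec_mean)
    ]


monthly = blend([3.0, 30.0], [6.0, 60.0], [9.0, 90.0])
```

Passing different `weights` would let one source dominate, for example giving more weight to the Nov-Dec mean for meters with very little history.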

Because I modeled at the day level, I can predict the daily meter readings in 2018, which can be more precise. The Computational Intelligence technique I used in this competition is fuzzy c-means (FCM), which helped me find 12 clusters among the 3248 buildings.

The above is a brief introduction to my approach; I will add more discussion and figures to illustrate the ideas in the short paper if I am selected for the shortlist. I would like to thank IEEE-CIS and Dr. Isaac Triguero again for providing such a meaningful competition in which to practice my machine learning skills.

DATA FILES

final_ensemble.csv
