Wednesday, September 21, 2022

Clustering-based data preprocessing for operational wind turbines | by Abiodun Olaoye | Sep, 2022


An efficient data science approach to creating asset groups

Left: Turbine clusters (image by author). Right: Photo by Philip May, CC BY-SA 3.0, via Wikimedia

Introduction

Operating wind turbines generate streams of data while producing clean and renewable electricity for our daily use. The data is a time series of environmental, mechanical, and production variables and is obtained using the Supervisory Control and Data Acquisition (SCADA) system.

Wind energy analyses often require preprocessing of the SCADA data, including identifying which turbines may be considered “neighbors”. The concept of being neighbors depends on the variable of interest, such as turbine location, wind speed, wind direction, and power output.

For instance, geographically, two or more turbines may be considered neighbors if their latitude and longitude are closer to each other than to those of the remaining turbines. More technically, two turbines may also be grouped as neighbors based on wind speed if they experience similar wind speeds during the period under investigation.

Applications of Wind Turbine Clustering

Grouping of turbines in a wind farm is a useful data preprocessing step that needs to be performed relatively frequently because, for non-geographic variables, the relationship between the turbines may change over time. Some useful applications of wind turbine grouping include:

  • Handling missing and spurious data: Identifying a group of turbines that is representative of a given turbine provides an efficient way to backfill missing or spurious data with the average of the neighbors’ sensor data. This is especially useful for variables like wind speed because anemometers tend to have relatively low reliability.
  • Side-by-side analysis: In this analysis, control turbines are selected as representative of a test turbine, usually based on produced power. At the end of a turbine upgrade, it is necessary to measure the performance improvement of the test turbine by comparing its production with that of the neighbor(s). Using the clustering-based approach in this case needs to be explored further and compared with existing methods.
  • Group power forecasting: To reduce the computational cost, power forecast models may be built for groups of turbines in the wind farm rather than for individual turbines. In addition, this approach is expected to give more accurate results than building a single model for the entire farm.
  • Group yaw control: An improvement in wind farm energy production may be achieved by implementing an optimal yaw control strategy for a group of turbines that are neighbors in the sense of their position relative to the wind direction, rather than for each turbine individually.

Although clustering techniques have been used in different areas of wind energy analysis such as wind farm power forecasting and yaw control, this article proposes an extension of this approach to other applications such as side-by-side analysis and handling of missing or spurious SCADA data.

Clustering-based SCADA Data Analysis

One method of turbine grouping involves calculating the sum of squared differences (SSD) between the measurements from a turbine and those from the other turbines on the farm. The turbines with the least SSD are selected as neighbors. This method can be computationally expensive, especially if scripted by non-programming experts.
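
As a rough illustration of the SSD idea, the sketch below ranks the neighbors of one turbine on a small synthetic wind speed table. The table layout (one column per turbine, one row per timestamp), the synthetic values, and the helper name `ssd_neighbors` are assumptions made for this example and are not taken from the original analysis.

```python
import numpy as np
import pandas as pd

def ssd_neighbors(wide_df, target, n_neighbors=2):
    """Return the n_neighbors turbines with the smallest SSD to `target`."""
    # Difference between every turbine column and the target column
    diffs = wide_df.sub(wide_df[target], axis=0)
    # Sum of squared differences over time, excluding the target itself
    ssd = (diffs ** 2).sum(axis=0).drop(target)
    return ssd.nsmallest(n_neighbors)

# Synthetic wind speeds (hypothetical): rows are timestamps, columns are turbine IDs
rng = np.random.default_rng(0)
base = rng.uniform(4, 12, size=200)
wide = pd.DataFrame({
    1: base,
    2: base + rng.normal(0, 0.1, 200),  # very similar to turbine 1
    3: base + rng.normal(0, 0.5, 200),  # somewhat similar
    4: rng.uniform(4, 12, size=200),    # unrelated
})

neighbors = ssd_neighbors(wide, target=1, n_neighbors=2)
print(neighbors)
```

Repeating this for every turbine requires one pass over all other turbines per target, which is where the computational cost mentioned above comes from.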

Another method of grouping turbines in a wind farm employs the correlation coefficient between different turbines based on the variable of interest. This method is simple and not computationally expensive but may not be useful for applications such as handling missing values.
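
The correlation approach can be sketched with pandas on synthetic data. The turbine names and values below are hypothetical. The example also hints at one plausible reason (an interpretation, not something stated in the article) why correlation alone may be a poor guide for backfilling: a turbine can be perfectly correlated with the target while reading systematically different values.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds (hypothetical): rows are timestamps, columns are turbines
rng = np.random.default_rng(1)
base = rng.uniform(4, 12, size=300)
wide = pd.DataFrame({
    "T1": base,
    "T2": base + rng.normal(0, 0.2, 300),  # close in value to T1
    "T3": 0.5 * base + 3.0,                # same trend, different magnitude
    "T4": rng.uniform(4, 12, size=300),    # unrelated
})

# Pearson correlation of every turbine with T1, highest first
ranked = wide.corr()["T1"].drop("T1").sort_values(ascending=False)
print(ranked)

# T3 ranks first, yet its average reading is offset from T1
offset = wide["T1"].mean() - wide["T3"].mean()
print(f"Mean offset between T1 and T3: {offset:.2f}")
```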

The clustering-based approach employs existing state-of-the-art data science techniques and tools that are publicly available. The most popular of such methods is K-Means clustering. This method seeks turbine groups that not only have a minimal Within-Cluster Sum of Squares but also a maximal Between-Cluster Sum of Squares with turbines in other groups.

While K-Means clustering is similar to the SSD approach, it also optimizes for between-cluster dissimilarity, which is expected to improve its robustness. In addition, the method applies to broader wind turbine analyses than the correlation coefficient approach and can elegantly handle multiple variables. Hence, this approach is more efficient for SCADA data analysis.
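
The link between the two quantities can be checked numerically: for any assignment of points to clusters, the total sum of squares equals the within-cluster plus the between-cluster sum of squares, so lowering one raises the other. The sketch below verifies this on synthetic points; the data and variable names are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated groups of 2-D points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Cluster means and sizes computed directly from the labels
centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
sizes = np.bincount(labels, minlength=3)
overall_mean = X.mean(axis=0)

# Within-cluster, between-cluster, and total sums of squares
wcss = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(3))
bcss = (sizes[:, None] * (centers - overall_mean) ** 2).sum()
tss = ((X - overall_mean) ** 2).sum()

print(f"WCSS={wcss:.2f}  BCSS={bcss:.2f}  WCSS+BCSS={wcss + bcss:.2f}  TSS={tss:.2f}")
```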

Moreover, clustering techniques are well-researched in data science and can be easily implemented in a few lines of code. Other clustering methods that may be explored include hierarchical, spectral, and density-based clustering methods.

Case Study:

Identifying turbine groups and handling missing wind speed data using cluster statistics.

In this example, we show an efficient yet simple data science approach to creating turbine groups in a wind farm using the K-Means clustering technique from the Sklearn library. We also compare two methods of predicting missing wind speed data using the obtained clusters.

First, let’s import the relevant libraries.

# Import relevant libraries
import os
os.environ['OMP_NUM_THREADS'] = "1"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from scada_data_analysis.modules.power_curve_preprocessing import PowerCurveFiltering

The Data

The dataset is publicly available on Kaggle and is used with the required citation. It consists of 134 operational turbines over 245 days at 10-minute resolution. Location data was also provided for each unit. The data was loaded as follows:

# Load data
df_scada = pd.read_csv('wtbdata_245days.csv.zip')
df_loc = pd.read_csv('sdwpf_baidukddcup2022_turb_location.csv')

Data Exploration

Let’s ensure the data was properly read by inspecting the head of the datasets.

Image by author
Image by author

The raw SCADA data consists of 4.7 million rows and 13 columns.

Data Cleaning

Let’s extract the relevant columns, namely turbine unique identification (TurbID), day (Day), timestamp (Tmstamp), wind speed (Wspd), wind direction (Wdir), nacelle direction (Ndir), and power output (Patv).

# Extract desired features (copy so that later column assignments do not
# trigger pandas' SettingWithCopyWarning)
dff_scada = df_scada[['TurbID', 'Day', 'Tmstamp', 'Wspd', 'Wdir', 'Ndir', 'Patv']].copy()

Now, let’s inspect the missing values and remove the affected rows. Before removing the missing values, the data quality is visualized below:

Image by author

Only 1.05% of the total rows were removed due to missing values, and the new data quality is displayed below:

Image by author
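
The original code for this step is not shown, so the following is only a minimal sketch of how missing values might be inspected and dropped with pandas. The toy table stands in for the SCADA data and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SCADA table (hypothetical values)
toy = pd.DataFrame({
    "TurbID": [1, 1, 2, 2, 3, 3],
    "Wspd": [6.1, np.nan, 5.8, 6.0, 7.2, 7.0],
    "Patv": [410.0, 395.0, np.nan, 380.0, 600.0, 590.0],
})

# Percentage of missing values per column
missing_share = toy.isna().mean() * 100
print(missing_share)

# Drop every row that contains at least one missing value
rows_before = len(toy)
toy_clean = toy.dropna().reset_index(drop=True)
removed_pct = 100 * (rows_before - len(toy_clean)) / rows_before
print(f"Removed {removed_pct:.2f}% of rows")
```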

Next, we create a unique Day-Time string for creating a time series of the variables as needed.

dff_scada['date_time'] = dff_scada['Day'].astype(str) + 'T' + dff_scada['Tmstamp'].astype(str)

Data Filtering

Raw SCADA data can be quite messy and requires filtering to extract the typical operational behavior of each turbine. We will make use of the open-source scada-data-analysis library, which has a power curve filtering tool for this step. The library’s source code is available on GitHub.

Image by author

The code used for data filtering and the resulting cleaned SCADA data are shown below. First, we remove spurious data based on domain knowledge.

# Set parameters
cut_in_speed = 2.85
# Initial filtering of obviously spurious data:
# power output above 100 while the wind speed is below cut-in
ab_ind = dff_scada[(dff_scada['Wspd'] < cut_in_speed) & (dff_scada['Patv'] > 100)].index
norm_ind = list(set(dff_scada.index).difference(set(ab_ind)))
# Sanity check: every row is either normal or abnormal
assert len(dff_scada) == len(norm_ind) + len(ab_ind)
scada_data = dff_scada.loc[norm_ind, :]

Then, we use the power curve filter to remove abnormal operating data.

# Instantiate the Power Curve Filter
pc_filter = PowerCurveFiltering(turbine_label='TurbID', windspeed_label='Wspd',
                                power_label='Patv', data=scada_data,
                                cut_in_speed=cut_in_speed, bin_interval=0.5,
                                z_coeff=1.5, filter_cycle=15, return_fig=False)
# Run the data filtering module
cleaned_scada_df, _ = pc_filter.process()

Image by author

The cleaned data has 1.9 million rows and is more representative of the expected relationship between the wind speed and power output for operational wind turbines.

Now, we create test data for evaluating the performance of the clustering approach when predicting missing values. The test data is randomly sampled from the cleaned data and has a similar prevalence (1.05%) to that of the missing values in the original dataset.

Image by author
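
The sampling code is not included in the article, so the sketch below shows one way the hold-out could be created with pandas. Only the 1.05% fraction comes from the text; the toy frame, the random seeds, and the names `test_df` and `train_df` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned SCADA data (hypothetical values)
rng = np.random.default_rng(0)
cleaned = pd.DataFrame({
    "TurbID": rng.integers(1, 6, size=2000),
    "date_time": rng.integers(0, 50, size=2000),
    "Wspd": rng.uniform(3, 12, size=2000),
})

# Hold out 1.05% of the rows as "missing" values with known ground truth
test_df = cleaned.sample(frac=0.0105, random_state=0)

# The remaining rows are used to compute the cluster statistics
train_df = cleaned.drop(test_df.index)

print(len(test_df), len(train_df))
```

Removing the held-out rows from the training data keeps a turbine's own reading from leaking into the cluster mean that is later used to predict it.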

Data Transformation

In this step, we transform the filtered data for each of the desired variables, which makes it ready for clustering analysis. The transformed data for wind speed is shown below:

Image by author
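
The exact shape of the transformed table is only visible in the image, so the sketch below assumes a common layout for clustering turbines: one row per turbine and one column per timestamp. The toy values and the name `wide_wspd` are illustrative.

```python
import pandas as pd

# Long-format toy data: one row per turbine and timestamp (hypothetical values)
long_df = pd.DataFrame({
    "TurbID": [1, 1, 2, 2, 3, 3],
    "date_time": ["1T00:00", "1T00:10"] * 3,
    "Wspd": [6.1, 6.4, 5.9, 6.2, 9.8, 10.1],
})

# Wide format: each turbine becomes one sample, each timestamp one feature
wide_wspd = long_df.pivot_table(index="TurbID", columns="date_time",
                                values="Wspd", aggfunc="mean")
print(wide_wspd)
```

Because scikit-learn's KMeans does not accept missing values, timestamps that are not available for every turbine would need to be dropped or filled before clustering.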

Cluster Modeling

We want to cluster the turbines based on wind speed, wind direction, nacelle direction, and power output. The idea is to identify which group of turbines can be considered neighbors and used for the statistical representation of a given turbine.

We use the KMeans algorithm in this example. Selecting the optimal number of clusters is essential for setting up the model, and the popular Elbow method is employed for this purpose. The elbow plots for all cases are shown below:

Image by author
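
A minimal sketch of the Elbow method is given below, using synthetic data with three well-separated groups rather than the turbine data. The inertia (within-cluster sum of squares) is recorded for a range of cluster counts; the "elbow" is the point after which adding clusters yields little further reduction.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic samples: 60 "turbines" with 4 features each, in three groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 4)) for loc in (0.0, 3.0, 6.0)])

# Fit KMeans for k = 1..6 and store the inertia of each model
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting `inertias` against `k` (for example with matplotlib) gives the elbow plot; on this synthetic data the sharp drop ends at three clusters.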

The optimal number of clusters selected based on individual and combined features is 3, although using 4 or 5 clusters also gave reasonable results.

We used the standard scaler tool in the Sklearn preprocessing module to scale the input data in the case of all features, since they have different orders of magnitude.
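
The scaling and clustering step for the combined case might look like the sketch below. How each turbine was summarized across features is not described in the article; here every turbine is represented by one value per variable, which is an assumption, and the values are random.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-turbine summary of the four variables
rng = np.random.default_rng(0)
n_turbines = 30
features = pd.DataFrame({
    "Wspd": rng.uniform(4, 10, n_turbines),
    "Wdir": rng.uniform(-10, 10, n_turbines),
    "Ndir": rng.uniform(0, 360, n_turbines),
    "Patv": rng.uniform(100, 1500, n_turbines),
}, index=pd.RangeIndex(1, n_turbines + 1, name="TurbID"))

# Scale each feature to zero mean and unit variance so that no variable
# dominates the Euclidean distances used by KMeans
scaled = StandardScaler().fit_transform(features)

# Fit the model with the number of clusters chosen from the elbow plot
model = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = pd.Series(model.fit_predict(scaled), index=features.index, name="cluster")
print(clusters.value_counts())
```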

Results

We created a clustering model based on each variable and on all variables combined, and identified the turbine groups for these cases. The results are shown below:

Image by author

In the results above, the wind speed and wind direction turbine clusters exhibit a similar pattern where turbines in group 1 are on the edge of the park, which makes sense due to the same reduced level of obstructions at both locations, especially if the predominant wind direction is along the X axis. In addition, groups 2 and 3 are found in the middle of the park across the columns (along the X axis).

In the power output cluster result, group 1 turbines are found on the right side of the X axis with a clearly defined boundary. Group 2 turbines are in the middle of the park and along the X axis, while group 3 turbines are the largest group and occupy mostly the edges of the park.

The nacelle direction is different from the rest of the variables because it depends strongly on the yaw logic applied to the turbine and may vary across the site. Hence, the clusters do not have a clear boundary. However, this analysis may be useful to troubleshoot underperformance related to yaw misalignment when combined with production data.

The clustering result using all features combined is similar to the wind speed cluster and is shown in the article header image.

Cluster-based approach for missing value imputation

Here, we will explore two cluster-based approaches for handling missing values based on the wind speed data, namely naive clustering (NC) and column-sensitive clustering (CSC). Both methods were intuitively named by the author.

Naive clustering

In this approach, the missing value for a given turbine is replaced by the mean (or median) value for the cluster to which the turbine belongs at the desired timestamp. We will use the mean cluster value for this analysis.
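
A minimal sketch of naive clustering on toy data is shown below. The cluster labels, wind speeds, and names such as `cluster_of` and `Wspd_pred` are hypothetical; the original implementation is not shown in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical training data in long format (rows to be imputed are absent)
train = pd.DataFrame({
    "TurbID":    [1, 2, 3, 4, 1, 2, 4],
    "date_time": ["1T00:00"] * 4 + ["1T00:10"] * 3,
    "Wspd":      [6.0, 6.4, 9.0, 9.4, 5.0, 5.4, 8.6],
})

# Hypothetical cluster label of every turbine (from a fitted KMeans model)
cluster_of = pd.Series({1: 0, 2: 0, 3: 1, 4: 1, 5: 0})
train["cluster"] = train["TurbID"].map(cluster_of)

# Mean wind speed per cluster and timestamp
cluster_mean = (train.groupby(["cluster", "date_time"])["Wspd"]
                     .mean().rename("Wspd_pred").reset_index())

# Missing readings to fill: look up the mean of the turbine's own cluster
missing = pd.DataFrame({"TurbID": [5, 3], "date_time": ["1T00:00", "1T00:10"]})
missing["cluster"] = missing["TurbID"].map(cluster_of)
filled = missing.merge(cluster_mean, on=["cluster", "date_time"], how="left")
print(filled)
```

Swapping `.mean()` for `.median()` gives the median variant mentioned above.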

Column-sensitive clustering

This method extends the naive clustering approach by taking the mean value for only the turbines in the same cluster and column. This considers the effect of the geographical location of the turbines and may be more accurate, especially for variables like wind speed.
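
The sketch below contrasts the two approaches on toy data. The article does not say how layout columns were derived, so binning the x coordinate with `pd.cut` is an assumption, as are the coordinates, cluster labels, and column names.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: x coordinate and cluster label per turbine
layout = pd.DataFrame({
    "TurbID":  [1, 2, 3, 4, 5, 6, 7],
    "x":       [0.0, 10.0, 5.0, 500.0, 510.0, 505.0, 8.0],
    "cluster": [0, 0, 0, 0, 0, 1, 1],
})
# Assumed definition of a layout column: a bin of the x coordinate
layout["col"] = pd.cut(layout["x"], bins=[-1, 250, 750], labels=False)
keys = layout[["TurbID", "cluster", "col"]]

# Training readings at one timestamp (turbines 3 and 7 are missing)
train = pd.DataFrame({
    "TurbID": [1, 2, 4, 5, 6],
    "date_time": ["1T00:00"] * 5,
    "Wspd": [6.0, 6.4, 8.0, 8.4, 9.9],
}).merge(keys, on="TurbID")

# Naive clustering: mean per cluster; column-sensitive: mean per cluster and column
nc_mean = (train.groupby(["cluster", "date_time"])["Wspd"]
                .mean().rename("nc_pred").reset_index())
csc_mean = (train.groupby(["cluster", "col", "date_time"])["Wspd"]
                 .mean().rename("csc_pred").reset_index())

missing = pd.DataFrame({"TurbID": [3, 7], "date_time": ["1T00:00"] * 2}).merge(keys, on="TurbID")
filled = (missing.merge(nc_mean, on=["cluster", "date_time"], how="left")
                 .merge(csc_mean, on=["cluster", "col", "date_time"], how="left"))
print(filled)
```

In this toy case turbine 7 has no training neighbor in its own cluster and column, so the column-sensitive prediction stays empty while the naive one is filled.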

Cluster Model Evaluation

The test data consists of 19,940 rows and contains the ground truth of the missing wind speed data to be predicted using the cluster approaches.

The naive clustering approach can fill 99.7% of the missing values based on the data available in the training cluster data, while the column-sensitive method can only fill 93.7% because less training data is available when the clusters are further binned into columns.

Both methods are evaluated using the mean absolute error (MAE) metric based on the missing values they can predict. In addition, the mean absolute percentage error (MAPE) metric is used to evaluate the non-zero predictions. For both metrics, the smaller the better in terms of model performance.
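
The evaluation can be sketched with the scikit-learn metrics imported earlier. The numbers below are made up, and restricting MAPE to rows with non-zero ground truth is one reading of "non-zero predictions" (MAPE divides by the true value), so treat that filter as an assumption.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Toy ground truth and predictions; NaN marks a value the method could not fill
y_true = np.array([6.0, 8.0, 0.0, 10.0])
y_pred = np.array([6.3, 7.6, 0.2, np.nan])

# Share of missing values the method was able to fill
filled = ~np.isnan(y_pred)
coverage = filled.mean()

# MAE over all filled values
mae = mean_absolute_error(y_true[filled], y_pred[filled])

# MAPE over filled values with non-zero ground truth (returned as a fraction)
nonzero = filled & (y_true != 0)
mape = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])

print(f"coverage={coverage:.2%}  MAE={mae:.3f}  MAPE={mape:.2%}")
```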

The CSC approach gave a 2% and 8% improvement over the NC approach based on the MAE and MAPE respectively. However, it fills fewer missing values. Hence, complementary use of both approaches will offer greater benefits.

To have a good visual comparison of the results, we plot 100 randomly sampled points from the test data as shown below:

Image by author

Next, we visualize the missing value imputation errors for both approaches. The imputation error is the difference between the ground truth wind speed value and the predicted value.

Image by author

Both approaches have well-behaved imputation errors, which are symmetric about the zero error position.

Conclusions

In this article, we performed clustering-based SCADA data preprocessing using different individual and combined variables. In addition, we predicted missing wind speed data using naive and column-sensitive clustering approaches. Finally, we inferred from the analysis that both methods can be complementary for greater benefits.

I hope you enjoyed reading this article. Until next time, cheers!

Don’t forget to check out other stories on applying state-of-the-art data science principles in the renewable energy space.

References

Zhou, J., Lu, X., Xiao, Y., Su, J., Lyu, J., Ma, Y., & Dou, D. (2022). SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022. arXiv. https://doi.org/10.48550/arXiv.2208.04360

Wind energy analytics toolbox: Iterative power curve filter
