PV system performance evaluation by clustering production data to normal and non-normal operation.

The purpose of this study is the development of an algorithmic tool to cluster the points of a scatterplot of "System Yield vs. Reference Yield" of photovoltaic systems, which tends to follow a linear regression. From such clustering one can distinguish data points that comply with the linearity (inliers) and the data points which do not (outliers). The tool then is applied to the scatterplot for two main purposes: a) to detect and exclude any data input anomalies or b) to detect and separate measurements where the PV system is functioning properly from the measurements that show that the photovoltaic (PV) system is malfunctioning.


Introduction
With the continued increase in photovoltaic (PV) installations throughout the world, their proper functioning is becoming more and more important. Clearly, at high solar irradiation the generated amount of energy is high, while this depends on the actual condition of the PV system including proper system design and operational issues leading to energy loss. Any operational problems must be detected fast to limit the associated energy loss. Consequently, monitoring of PV systems is an important topic, both for scientists and owners/investors of residential and medium to large-scale size systems since it can give insights in the operation of systems and their performance, while it also allows detecting any malfunctions that may occur. The most common performance assessment of a studied PV system is the comparison of its energy production with a reference source. In this paper two reference sources are studied: Global Tilted Irradiance (GTI) and energy production of a neighboring PV system.

1.
The inliers, the measurements that fit the linear regression model and which will be used for the real performance evaluation of a PV system. 2.
The outliers, the measurements that do not fit the linear regression model and after further study could be used for the detection of any occurred malfunctions.

Performance Ratio
The Performance Ratio (PR) [19] is a broadly used indicator for the performance characterization of PV system. It has been used in studies regarding the performance analysis of PV systems [12][13][14]20], comparison of different type of PV systems [12] but also in studies regarding malfunction detection [6,21]. It is a dimensionless indicator, the ratio of the system's yield (Y f ) to the system's reference Yield (Y R ). where Y R = GTI 1000 in which E AC is the generated amount of energy and P peak the installed capacity. A system with PR higher than 70% is considered to performing well and above 80% as excellent [2,12].
In the scatterplot Y f vs. Y R , the indicators tend to follow a linear relationship (LR) and the slope of this linear regression is the Performance Ratio. In previous studies, changes in the slope of the LR is an indicator for the existence of malfunction [7]. Then again, the relationship between Y f and Y R is not strictly linear due to the fact that the efficiency of solar panels decreases as its temperature increases which affects the linear relationship. With this in mind, in this paper the typical linear regression function (Y = a + bX) is not calculated and in fact never used. Having said that, the term linear relationship is used for better understanding, in order to describe the near-linear relationship of Y f and Y R .

Comparison of Neighboring PV Systems
In the case of data from a neighboring PV system, the process is more straightforward. If the systems have the same capacity (for instance, same parts of solar parks, commercial PV systems on same rooftops or PV system with micro-inverters/power optimizer) the power outputs of the systems can be directly compared. If the compared systems have different capacities, their system yields (Y f ) are compared, since Y f is actually the normalized production. In both cases (same or different capacity) their relationship is expected to be linear with a linear regression line with a constant slope which defines the characteristic relationship of the compared PV systems. If the slope is almost equal to 1 then they have the same performance. If it is different than 1 then one of the systems is performing better. Any unexpected change in the slope denotes that the performance of the one of the PV systems is reducing.

Research Purpose and Paper Organization
The purpose of this paper is to introduce an algorithm, which will study a malfunctioning PV system (referred as studied PV from now on) by comparing its production with reference data (referred as reference data), from the sources mentioned above and calculate the expected energy loss due to these measurements. This paper introduces a cluster analysis algorithm, applicable to any scatter plot where the data to be analyzed show a near-linear relation. The aim of the algorithm is to automatically detect and distinguish measurements that are following the linear relationship from the ones which are not. As an outcome, the proposed algorithm will cluster the measurements in two groups: inliers and outliers, as mentioned above. The introduced algorithm is applied on PV system power data, in particular to compare "System Power with Reference Power" and it is used for two different purposes:

1.
To detect and exclude any data input anomalies during the monitoring process, especially in case of residential PV systems where GTI is obtained by satellite observations and solar models.

2.
To detect and separate measurements where the PV system is functioning properly from the measurements that show that the PV system is malfunctioning or shaded. Measurements showing proper functioning can then be used for the performance analysis while the rest can be further studied for malfunction characterization.
The aim of this algorithm is to be used in larger researches regarding monitoring of large numbers of residential PV systems. Hence only power output is used, which is the most common data provided by residential PV systems. Other PV system measurements (voltage or current) can be used for malfunction characterization, especially on outliers of purpose 2.
This paper is organized as follows. Section 2 discusses the reasons why measurements could not follow the linear relationship, for every case of reference data, and distinct them to data input anomalies and PV system failures. Section 3 provides the description of the proposed algorithm. In Section 4, the algorithm is applied in four different cases and the results are discussed. Finally, Section 5 summarizes the main findings of the paper.

Data Outliers in Performance Evaluation
For a variety of reasons measurements of system and reference yield do not follow a linear relationship and as a consequence the calculation of the performance evaluation may be erroneous. In this section the reasons of the existence of outliers for each type of reference data are described; they are presented in Table 1. Table 1. The reasons of outliers per reference data (ranked by the most possible). The second and the third reference data are used for the monitoring of residential PV systems on rooftops, in case of the Netherlands, where different objects might create shadow. To that end, for these reference data, faults due to surroundings are assumed to be the most common reason of outliers rather than malfunctions of the system. In contrast, pyranometers are used at large scale installations where surrounding objects are quite rare.

Irradiance Data from Pyranometers
In this case, the GTI irradiance measurements are accurate within 2.5% [2], as the pyranometers are installed at the same tilt angle as the module. The majority of erroneous performance evaluations then is due to an existence of a malfunction in the system. Provided that the malfunction causes a constant energy loss in the system (e.g., detached string, broken panel/inverter) the production is reduced, usually significantly, which is clear from the determined PR value.
Then again, in the event that a malfunction causes a changing energy loss, the PV system could operate either normally or not, depending usually on the level of the irradiance. Such malfunctions are partial shading [7,21,22] (which could lead to the creation of hot-spots [23]), losses due to maximum power point tracking (MPPT) errors in the inverter [21], grounding fault [24] and overheated modules [6]. In these cases, the majority of the measurements will follow a linear regression, while some measurements will not. However, the reduction of the PR value will not be significant and malfunctions can even remain undetected if the "Y f vs. Y R " plot of the systems is not studied at high time resolution (minutes).

Irradiance Data from Other Sources
As mentioned above, other sources of GHI data could be from local meteorological stations or satellites and subsequent processing to GTI data using solar models. In this case, the data would have more "noise", mainly due to the distance between the studied PV system and the measuring device as well as due to the uncertainty of the used solar models.
An increased uncertainty in GHI can be due to the fact that most of the meteorological stations are located outside of the cities, and could be easily some km away from the studied PV system. Thus the larger the distance between the station and the PV system, the higher the possibility that a cloud that effects the PV system can remain undetected by the station and vice versa. In case of satellite data, satellites such as MeteoSat 10 [25], are providing solar irradiance data of spatial grids (3 × 3 km in case of MeteoSat) that are to coarse to capture the effects of small clouds that reduce irradiance locally. Moreover, the irradiance of each grid is not measured constantly but once every 15 min. Thus any major changes within these 15 min, for instance one moving cloud on a sunny day, can remain undetected thus yielding incorrect reference data.
An increased uncertainty in GTI can be caused through its calculation procedures in the used solar model(s). For GTI calculations, the measured GHI firstly is separated to its components, DHI and to DNI (direct normal irradiance). Secondly, the impact of each component on the tilted surface is calculated and the sum of the impacts is the GTI. Comparisons of solar models at different locations prove the accuracy of these models, however they note the possibility of faults, while it is also clear that the calculation accuracy of GTI is strongly based on the accuracy of the inputs, DHI and DNI [26][27][28]. However, as DHI measurements are not common other models have to be used for the separation of GHI, the GHI separation models. The most used separation models are DIRINT [29], DISC [30] and Erbs [31]. Recently a more modern approach was introduced [32], where it was also stated that such models are empirical and local and yield an extra possibility of error in the calculation of GTI, and thus in erroneous assessment of performances. In Figure 1 the system yield of a commercial rooftop PV system is compared with the reference yield, obtained by satellite observations and the use of the HDKR Model [9,10]. It is obvious that the strong majority of the measurements are following a linear relationship. However, a large number of measurements are clearly outside the linear trend giving the impression to an observer that the system produces high values of electricity under very low radiation and vice versa. more "noise", mainly due to the distance between the studied PV system and the measuring device as well as due to the uncertainty of the used solar models. An increased uncertainty in GHI can be due to the fact that most of the meteorological stations are located outside of the cities, and could be easily some km away from the studied PV system. Thus the larger the distance between the station and the PV system, the higher the possibility that a cloud that effects the PV system can remain undetected by the station and vice versa. In case of satellite data, satellites such as MeteoSat 10 [25], are providing solar irradiance data of spatial grids (3 × 3 km in case of MeteoSat) that are to coarse to capture the effects of small clouds that reduce irradiance locally. Moreover, the irradiance of each grid is not measured constantly but once every 15 min. Thus any major changes within these 15 min, for instance one moving cloud on a sunny day, can remain undetected thus yielding incorrect reference data.
An increased uncertainty in GTI can be caused through its calculation procedures in the used solar model(s). For GTI calculations, the measured GHI firstly is separated to its components, DHI and to DNI (direct normal irradiance). Secondly, the impact of each component on the tilted surface is calculated and the sum of the impacts is the GTI. Comparisons of solar models at different locations prove the accuracy of these models, however they note the possibility of faults, while it is also clear that the calculation accuracy of GTI is strongly based on the accuracy of the inputs, DHI and DNI [26][27][28]. However, as DHI measurements are not common other models have to be used for the separation of GHI, the GHI separation models. The most used separation models are DIRINT [29], DISC [30] and Erbs [31]. Recently a more modern approach was introduced [32], where it was also stated that such models are empirical and local and yield an extra possibility of error in the calculation of GTI, and thus in erroneous assessment of performances. In Figure 1 the system yield of a commercial rooftop PV system is compared with the reference yield, obtained by satellite observations and the use of the HDKR Model [9,10]. It is obvious that the strong majority of the measurements are following a linear relationship. However, a large number of measurements are clearly outside the linear trend giving the impression to an observer that the system produces high values of electricity under very low radiation and vice versa. The reference yield is calculated from GHI, DHI and the HDKR solar model [9,10]. The oval red shape emphasizes the fact that a strong majority of the measurements follow a linear trend, hence the higher density of measurement points inside that area. The reference yield is calculated from GHI, DHI and the HDKR solar model [9,10]. The oval red shape emphasizes the fact that a strong majority of the measurements follow a linear trend, hence the higher density of measurement points inside that area. In case of neighboring systems, both systems, the reference one and the studied one, can be affected by partial shading, which will create more noise in the scatterplot. Moreover, if the distance of the systems is large, any local, small clouds can affect the linearity of the measurement. However, this effect will be smaller in case of nearby systems and it is absent in case of systems on the same rooftop.

Difference in Energy Production
Furthermore, due to different reasons (wiring, panel and inverter brand, age) one of the compared systems may produce less energy, in a non-linear way. Such an example is presented in Figure 2, where two different panels (same micro inverter, same capacity, different ages and wired panels) are compared. Except for the shadow, between 14:00 and 15:30, where panel one (blue line) is generating much less energy, the production curve of panel one is lower than the curve of panel two (green curve) and this difference is getting larger at higher energies. In this case, clustering of the inliers and the outliers can give two different values of energy loss (with respect to the panel which is used as reference) to the owner: (a) due to shade and (b) due to the system. In that case the owner can decide if it is economically profitable to replace the panel or remove the shade, or both. Furthermore, in case of no replacement, it will be very useful to know the production relationship of the monitored PV systems, in order to detect future malfunctions. In case of neighboring systems, both systems, the reference one and the studied one, can be affected by partial shading, which will create more noise in the scatterplot. Moreover, if the distance of the systems is large, any local, small clouds can affect the linearity of the measurement. However, this effect will be smaller in case of nearby systems and it is absent in case of systems on the same rooftop.

Difference in Energy Production
Furthermore, due to different reasons (wiring, panel and inverter brand, age) one of the compared systems may produce less energy, in a non-linear way. Such an example is presented in Figure 2, where two different panels (same micro inverter, same capacity, different ages and wired panels) are compared. Except for the shadow, between 14:00 and 15:30, where panel one (blue line) is generating much less energy, the production curve of panel one is lower than the curve of panel two (green curve) and this difference is getting larger at higher energies. In this case, clustering of the inliers and the outliers can give two different values of energy loss (with respect to the panel which is used as reference) to the owner: (a) due to shade and (b) due to the system. In that case the owner can decide if it is economically profitable to replace the panel or remove the shade, or both. Furthermore, in case of no replacement, it will be very useful to know the production relationship of the monitored PV systems, in order to detect future malfunctions. Comparison of Pac output of two panels with the same micro inverter, same capacity, but different brand. Except for the shadow, the production curve of panel one is lower than the curve of panel two (green curve) and this difference is getting bigger for higher power outputs.

Data Preparation
The proposed method is operating with two different sets of data. One is the power output (either DC or AC) of the studied PV system (referred to as studied PV from now on) and the other the reference data. The reference data could be the tilted irradiance measurements or the power output (either DC or AC) of a neighboring PV system, with same tilt and orientation (referred to as reference data).

Figure 2.
Comparison of Pac output of two panels with the same micro inverter, same capacity, but different brand. Except for the shadow, the production curve of panel one is lower than the curve of panel two (green curve) and this difference is getting bigger for higher power outputs.

Data Preparation
The proposed method is operating with two different sets of data. One is the power output (either DC or AC) of the studied PV system (referred to as studied PV from now on) and the other the reference data. The reference data could be the tilted irradiance measurements or the power output (either DC or AC) of a neighboring PV system, with same tilt and orientation (referred to as reference data).
Both studied PV and reference data should be preprocessed into a specific form before the algorithm can be applied, in order to create a linear relationship in the scatterplot "studied PV vs. reference data". The process depends on the type of reference data and it is presented in Table 2.  (2) and (3). Depending on the available reference data, the data are processed and used as shown in the table.

Same Capacity Different Capacity
Studied PV For all the cases of data, the difference (error ε) between the studied PV and the reference is calculated, since its role is pivotal for the later steps of the process.

Scope of the Algorithm
The scope of the algorithm is to define the maximum and lower thresholds of the error, where the measurements within these thresholds will be characterized as inliers, thus the measurements that fit to the linear regression curve. These are measurements where the studied PV system is operating normally. The measurements outside these thresholds will be characterized as outliers, thus the moments where the PV system is malfunctioning.
As we assume that a high density of measurement points in the "Studied PV vs. Reference data" plot must be around the linear regression line, the thresholds must be set such that the density of measurements points is decreasing.

First
Step-Inliers Determination Using Ran.Sa.C.
In the first step, the iterative method Ran.Sa.C. (Random Sample Consensus) [33] is applied on the scatterplot "Studied PV vs. Reference data", which clusters the measurements in two groups, the inliers (the ones that following a linearity, i.e., normal operation) and outliers.
If the data are limited, such as data for only a few days with hourly resolution, no further processing is needed. The inliers from Ran.Sa.C. are used directly for the calculation of the PR and the outliers are further studied in order to determine the cause of energy loss and its total impact on the energy production of the system.
However, if the sample is large, such as data for long periods at minutely time resolution, extra processing may be needed, since Ran.Sa.C. could be misloaded and measurements could have been denoted as inliers while they do not exactly fit in the linear relationship. In this case, Ran.Sa.C. is used as a first cleaning of measurements to identify clear outliers.

Second Step-Data Clustering and Polynomial Regression
From this step the focus of the analysis is on the calculated error between the studied PV and the reference (calculated with Equation (4)). Furthermore, the analysis involves only the inliers as calculated in step 1. The inliers from step 1 are clustered into groups, based on the actual value of the reference data. Typically fifteen groups are used to cover the range of the reference data. For each group, a histogram of the errors is calculated. Subsequently, a polynomial linear regression is performed: where f is the frequency and ε the value of the error.
In the plot of Figure 3 the histogram of one group is presented, as an example, together with the estimated polynomial linear regression between the errors and their frequency (red points).
Energies 2018, 11, x FOR PEER REVIEW 8 of 18 The inliers from step 1 are clustered into groups, based on the actual value of the reference data. Typically fifteen groups are used to cover the range of the reference data. For each group, a histogram of the errors is calculated. Subsequently, a polynomial linear regression is performed: where is the frequency and ε the value of the error. In the plot of Figure 3 the histogram of one group is presented, as an example, together with the estimated polynomial linear regression between the errors and their frequency (red points).

Third Step-Determine the Thresholds of Each Group
Based on the assumption described in Section 3.2, the algorithm is focusing on the maximum (depicted by a red arrow in Figure 3) of poly(ε), the εmax. As thresholds the two local minimums on the left and right (εleft and εright) of the global maximum (blue arrows on Figure 3) are used. If poly(εl/r) < 0 then as threshold is set the ε where poly(ε) = 0, with ε in (εleft, εmax) or ε in (εmax, εright).
The logic behind this is that as the polynomial curve is moving away (left or right) from the maximum, the frequency of the errors will rapidly reduce, thus poly(ε) will also rapidly reduce and will form a local minimum. If the reduction is high, the poly(ε) will take values lower than zero. That means that the frequency of the errors dramatically changes, as well the density of the measurements in the initial "Studied PV vs. Reference data" scatterplot.

Fourth Step-Normalization and Connection of All Limits
During this step the relationship of each defined threshold (εleft, εright) and the global maximum (εmax) of all power groups versus the reference power is studied, see Figure 4.
The clustering of the sample in groups during step 2, based on reference power, is random and it depends on the sample size. Furthermore, since one value of ε will represent a group, with reference power from Pn to Pn+1 then the plots of εmax, εleft and εright versus reference will be stair plots, as it is clear from Figure 4a. In order to use these data in practice, polynomial fits are made, to obtain continuous functions for each error. An example is presented in Figure 4b.
The polynomial fits of the εleft and εright versus reference data ( ( ) and ℎ ( )) is the most important output of the algorithm since they define the relationship between the studied and the reference data, for any value of the reference data. That is to say, for any measurement of a studied sample, the application of its reference data using ( ) and ℎ ( ) defines the maximum and the lower allowed value of its error, in order to be characterized as inlier or outlier. Furthermore, polynomials from historical data, can be applied to new measurements and characterize them as inliers and outliers.

Third
Step-Determine the Thresholds of Each Group Based on the assumption described in Section 3.2, the algorithm is focusing on the maximum (depicted by a red arrow in Figure 3) of poly(ε), the ε max . As thresholds the two local minimums on the left and right (ε left and ε right ) of the global maximum (blue arrows on Figure 3) are used. If poly(εl/r) < 0 then as threshold is set the ε where poly(ε) = 0, with ε in (ε left , ε max ) or ε in (ε max , ε right ).
The logic behind this is that as the polynomial curve is moving away (left or right) from the maximum, the frequency of the errors will rapidly reduce, thus poly(ε) will also rapidly reduce and will form a local minimum. If the reduction is high, the poly(ε) will take values lower than zero. That means that the frequency of the errors dramatically changes, as well the density of the measurements in the initial "Studied PV vs. Reference data" scatterplot.

Fourth Step-Normalization and Connection of All Limits
During this step the relationship of each defined threshold (ε left , ε right ) and the global maximum (ε max ) of all power groups versus the reference power is studied, see Figure 4.
The clustering of the sample in groups during step 2, based on reference power, is random and it depends on the sample size. Furthermore, since one value of ε will represent a group, with reference power from P n to P n+1 then the plots of ε max , ε left and ε right versus reference will be stair plots, as it is clear from Figure 4a. In order to use these data in practice, polynomial fits are made, to obtain continuous functions for each error. An example is presented in Figure 4b.
The polynomial fits of the ε left and ε right versus reference data (ε Poly le f t (re f ) and ε Poly right (re f )) is the most important output of the algorithm since they define the relationship between the studied and the reference data, for any value of the reference data. That is to say, for any measurement of a studied sample, the application of its reference data using ε Poly le f t (re f ) and ε Poly right (re f ) defines the maximum and the lower allowed value of its error, in order to be characterized as inlier or outlier. Furthermore, Energies 2018, 11, 977 9 of 19 polynomials from historical data, can be applied to new measurements and characterize them as inliers and outliers.
The polynomial fit of ε max versus the reference (ε Poly max (re f )) provides the information about the most frequent value of the error, for the respective value of the reference data.
Provided that the relationship of the error ε and Y f , Y R is for any single measurement defined as (4)), for any single measurement the ε max , ε left and ε right are calculated by the application of the reference data on the respective polynomial fits, ε Poly max (re f ), ε Poly le f t (re f ) and ε Poly right (re f ). Thus: Rewriting Equation (6) with respect to Y f : are the polynomial functions that return the thresholds for which Y f is inlier, for any respective value of Y R . In case that the Y f is constantly lower than the Y R (case of PV system vs. solar radiation data) then the right threshold represents the maximum value of Y f and the left the minimum in order to be characterized as inlier. On the other hand, Y f , max (Y R ) is the polynomial which returns the most frequent value of Y f for any respective value of Y R . In Figure 4c, the Equation (7) is plotted. Any measurement of the scatterplot between green and blue lines is characterized as inlier, while outside as outlier, with the majority of the measurements to be on the red line. The polynomial fit of εmax versus the reference ( ( )) provides the information about the most frequent value of the error, for the respective value of the reference data.
Provided that the relationship of the error ε and Yf, YR is for any single measurement defined as = − (Equation (4)), for any single measurement the εmax, εleft and εright are calculated by the application of the reference data on the respective polynomial fits, ( ) , ( ) and ℎ ( ). Thus: Rewriting Equation (6) with respect to Yf: in which , ( ), , ℎ ( ) are the polynomial functions that return the thresholds for which Yf is inlier, for any respective value of YR. In case that the Yf is constantly lower than the YR (case of PV system vs. solar radiation data) then the right threshold represents the maximum value of Yf and the left the minimum in order to be characterized as inlier. On the other hand, , ( ) is the polynomial which returns the most frequent value of Yf for any respective value of YR. In Figure 4c, the Equation (7) is plotted. Any measurement of the scatterplot between green and blue lines is characterized as inlier, while outside as outlier, with the majority of the measurements to be on the red line.

Fifth
Step-Pplication of the Limits to the Data Finally, the polynomial regressions of the thresholds, calculated during step four are applied to the data. Thus, for every measurement, the thresholds are calculated, based on its reference value and if the error (Y f can be used as well) is within these thresholds the measurement is defined as inlier, otherwise as outlier.
For better understanding an example is presented in Figure 4b, where the blue and red lines are the thresholds of Y f , based on the Equations (6) and (7). Every measurement inside these lines is characterized as inlier (gray), while outside as outlier (black).

Calculation of Energy Loss
After the clustering in inliers and outliers, the amount of lost energy due to the outliers can be calculated. In this paper, energy loss is defined as the energy that each outlier would have produced if it would follow the normal relationship between the studied data and the reference (which is defined by the polynomial fits calculated in Section 3.3.4). The calculation is applied only if it is known to the users that the outliers are due to a malfunction on the system, thus if the proposed algorithm is used for the second purpose, as described in Section 1.3.
If E hyp Studied is the hypothetical energy production of the studied PV, for a single outlier, in case that it was inlier and E studied PV the produced energy, then for a single outlier: And for all outliers: where i the number of outliers. In this paper, for the hypothetical energy production three scenarios are chosen:

1.
Error is equal to the higher frequent error (global maximums), thus the most probable value 2.
Error of outliers is slightly higher (105%) than the smaller threshold (ε left )

3.
Error is slightly lower (95%) than the higher threshold (ε right ) From the first scenario the most probable energy loss is calculated, since it is assumed that all the outliers would have the same error ε with the most frequent one. From the second and third scenarios the lower and the higher possible energy losses are calculated respectively, under the assumption that all outliers would have the same lower or higher Y f values in order to be characterized as inliers.

Application of the Method-Examples
The proposed method is applied in different examples and their scatterplots are presented below. The first three examples are from data from the experimental facility of SEAC (Solar Energy Application Center, Eindhoven, The Netherlands). The facility contains three PV systems, with identical panel structure (six panels in two rows, one front, one back, same tilt and orientation) and different inverter technology [22,34]. The system can be seen in Figure 5. The system on the right-hand side consist of 6 micro inverters (265 W each), the system in the middle consist of a series connection of the panels and a standard string inverter of 1.5 kW. Finally, the system on the left consist of 6 power optimizers connected in parallel (boost DC/DC) and a central inverter of 1.5 kW especially made for the power optimizer system. In front of each system a pole is placed (same dimension for every system) in order to create an artificial shadow on the front rows of each system during the day which is equal for all systems. Furthermore, two pyranometers are available for the measurement of the tilted irradiance. The initial purpose of the system was to study the performance of different inverter technologies (string inverter, power optimizer and micro inverter) under shading conditions [22,34]. For these cases it is known that the strong majority of the outliers in the data is caused by shadow, since the reference data are measured through the most accurate methods (pyranometer and neighboring PV system) as defined in Sections 2.1 and 2.3. Thus the algorithm will applied for purpose 2 as defined in Section 1.3. Moreover, the energy loss can be calculated since it is known that the outliers are due to a malfunction (shadow).
The percentage of lost energy compared to the actual production is presented, according to: where E loss is the energy loss as calculated from Equation (9). For each of the first two examples, the studied PV system is compared to the same reference data, two cases are used and two different plots are presented: the shaded system/panel versus (a) GTI obtained by an in-plane pyranometer, and (b) the average production of the 3 panels with power optimizers, in the back row of the system, which are shaded partially during late afternoon, different hours than the shaded panels. In the third example, two different panels with same micro-inverters are compared and monitored each other.
In the fourth example, a PV system from a house in The Netherlands is compared with the tilted irradiance, obtained by the application of the HDKR solar model [9,10,29] on satellite data from MeteoSat. The system consists of a 2.5 kWp inverter and has panel capacity of 2.45 kWp. This example is different than the other three, since the reference data will contain a large amount of input anomalies and any presence of shadow is unknown. Thus the algorithm will be used for purpose 1 of Section 1.3 and energy loss cannot be calculated, since it is not known if the outliers are due to data input anomalies (thus any energy loss calculation is pointless since there are false measurements) or for any other reason, for instance shadow. neighboring PV system) as defined in Sections 2.1 and 2.3. Thus the algorithm will applied for purpose 2 as defined in Section 1.3. Moreover, the energy loss can be calculated since it is known that the outliers are due to a malfunction (shadow). The percentage of lost energy compared to the actual production is presented, according to: where Eloss is the energy loss as calculated from Equation (9). For each of the first two examples, the studied PV system is compared to the same reference data, two cases are used and two different plots are presented: the shaded system/panel versus (a) GTI obtained by an in-plane pyranometer, and (b) the average production of the 3 panels with power optimizers, in the back row of the system, which are shaded partially during late afternoon, different hours than the shaded panels. In the third example, two different panels with same micro-inverters are compared and monitored each other.
In the fourth example, a PV system from a house in the Netherlands is compared with the tilted irradiance, obtained by the application of the HDKR solar model [9,10,29] on satellite data from MeteoSat. The system consists of a 2.5 kWp inverter and has panel capacity of 2.45 kWp. This example is different than the other three, since the reference data will contain a large amount of input anomalies and any presence of shadow is unknown. Thus the algorithm will be used for purpose 1 of Section 1.3 and energy loss cannot be calculated, since it is not known if the outliers are due to data input anomalies (thus any energy loss calculation is pointless since there are false measurements) or for any other reason, for instance shadow.

Shaded Panel with Power Optimizer
In the first example the algorithm is applied to the DC power output of a panel with power optimizer. Figure 5a,c shows the scatterplots "Yf vs. YR" and "PDC_shaded vs. PDC_reference", respectively, while at the right plots, Figure 5b,d show a zoom (red box in Figure 5a,c) for more details. The time resolution is 5 min and the studied period is 116 days, from beginning of July 2015 until end of October. The panel is shaded for a part of the day (approximately from 10:30 am to 12:20 pm).
In the scatterplots b and d of the Figure 6, it is obvious how the algorithm is detecting as inliers the areas with the higher density of measurements. Due to the large number of the measurements, this is not obvious in the scatterplots of the whole sample, where the inliers and outliers seem to have the same density. However, in the zoomed areas, in plots b,d differences in density are clearer.

Shaded Panel with Power Optimizer
In the first example the algorithm is applied to the DC power output of a panel with power optimizer. Figure 5a,c shows the scatterplots "Y f vs. Y R " and "P DC_shaded vs. P DC_reference ", respectively, while at the right plots, Figure 5b,d show a zoom (red box in Figure 5a,c) for more details. The time resolution is 5 min and the studied period is 116 days, from beginning of July 2015 until end of October. The panel is shaded for a part of the day (approximately from 10:30 am to 12:20 pm).
In the scatterplots b and d of the Figure 6, it is obvious how the algorithm is detecting as inliers the areas with the higher density of measurements. Due to the large number of the measurements, this is not obvious in the scatterplots of the whole sample, where the inliers and outliers seem to have the same density. However, in the zoomed areas, in plots b,d differences in density are clearer. Differences in the classification of inliers and outliers between the two cases are observed since the reference data are different. In case of the pyranometer as reference data (Figure 6a,b), the threshold of the inliers is broader for higher values of irradiance, due to the accuracy of the pyranometer and the reduction of the efficiency of the panels in higher temperatures. However, these differences are very small, since for the studied period, 95% of the measurements have the same classification (both inliers or outliers) and only 3% different in the two cases of the reference data.
As mentioned at the end of Section 1.2.1, the plot of "Yf vs. YR" is not strongly linear, especially at higher values of energy due to the reduction of the efficiency of the panels at higher temperatures. Thus the use of the fitted linear regression equation (Yf = a +bYR) is not so accurate, as the use of the polynomial fits (Section 3.3.4), where the two polynomials (left and right in Figure 4) will define the upper and lower threshold for the inliers and the third (global maximum in Figure 4) the most possible position of the inliers. By contrast, in the "PDC_shaded vs. PDC_reference" the relationship is very strongly linear, because both panels have the same efficiency reduction at higher temperatures. In this case the polynomial fits are almost a straight line and the fitted linear regression equation could be used as well, in order to describe the performance relationship between the compared panels.
The performance ratio (PRDC in this case) is calculated, according to Section 1.1 and Equation (1). As reference data the pyranometer is used (Figure 6c). The PRDC of the panel, with the use of all data is 81.7% while using only the inliers (green markers in Figure 6c) the PRDC is 86.8%. That means that the real performance of the panel is 86.8%, however due to presence of the shadow (external reason, not malfunction inside the system) it drops to 81.7%.
The energy loss due to the shadow is calculated using Equations (9) and (10) presented in Table  3, for each of the scenarios and for both cases of the pyranometer and neighboring panels as reference data. Clearly, the panel produces less energy due to the shadow (outliers) and depending the Figure 6. Example of the application of the algorithm for monitoring a panel with power optimizer (Shaded P DC ). Two different types of data are used: GTI measured by an in-plane pyranometer (a,b) and the average production of the 3 panels with power optimizers (c,d). In plots (b,d) the marked red square of plots (a,c) is presented.
Differences in the classification of inliers and outliers between the two cases are observed since the reference data are different. In case of the pyranometer as reference data (Figure 6a,b), the threshold of the inliers is broader for higher values of irradiance, due to the accuracy of the pyranometer and the reduction of the efficiency of the panels in higher temperatures. However, these differences are very small, since for the studied period, 95% of the measurements have the same classification (both inliers or outliers) and only 3% different in the two cases of the reference data.
As mentioned at the end of Section 1.2.1, the plot of "Y f vs. Y R " is not strongly linear, especially at higher values of energy due to the reduction of the efficiency of the panels at higher temperatures. Thus the use of the fitted linear regression equation (Y f = a +bY R ) is not so accurate, as the use of the polynomial fits (Section 3.3.4), where the two polynomials (left and right in Figure 4) will define the upper and lower threshold for the inliers and the third (global maximum in Figure 4) the most possible position of the inliers. By contrast, in the "P DC_shaded vs. P DC_reference " the relationship is very strongly linear, because both panels have the same efficiency reduction at higher temperatures. In this case the polynomial fits are almost a straight line and the fitted linear regression equation could be used as well, in order to describe the performance relationship between the compared panels.
The performance ratio (PR DC in this case) is calculated, according to Section 1.1 and Equation (1). As reference data the pyranometer is used (Figure 6c). The PR DC of the panel, with the use of all data is 81.7% while using only the inliers (green markers in Figure 6c) the PR DC is 86.8%. That means that the real performance of the panel is 86.8%, however due to presence of the shadow (external reason, not malfunction inside the system) it drops to 81.7%. The energy loss due to the shadow is calculated using Equations (9) and (10) presented in Table 3, for each of the scenarios and for both cases of the pyranometer and neighboring panels as reference data. Clearly, the panel produces less energy due to the shadow (outliers) and depending the reference data, the calculated energy loss is slightly different. As mentioned above, these differences are due to the accuracy of the pyranometer, and the fact that in case of the pyranometer as reference data the threshold of the inliers is broader for higher values of irradiance, explains the larger difference between the scenarios. Table 3. The energy loss due to the shadow, for each reference data and for each scenario. According to the scenario, that the error of each outlier should be equal to the most frequent error, the panel is producing 6.7% or 6.1% (depending on the reference data taken for comparison) less energy.

PV System with String Inverter
In this case the algorithm is applied to the PV system with string inverter, where three panels in the front row are heavily shaded during most of the day. The data resolution is 5 min and the studied period is 133 days, from beginning of July until end of October. In Figure 7, the results of applying the algorithm are presented. Figure 7a,c show the complete data sample, while Figure 6b,d show a smaller portion for better understanding.
The results from the study of this system show that the duration of the shadow is clearly different. While the panel in Section 4.1 was shaded only a part of the day, this system is shaded during most of the day. Furthermore, the shaded panels are shaded one at a time and only a few moments two panels are shaded at the same time. Due to this fact, the values of the errors are much smaller and the measurements with high error are rare.
In Figure 7a the clustering threshold of the algorithm is almost impossible to be discerned. In the more detailed observation (Figure 7b) the threshold is clearer, since the density of the green marks is reduced (where the yellow arrows are pointing), and the marks after the reduction are clustered as outliers (black colored). A thin white line between the arrow heads can be seen, reflecting absent data points, which in fact shows where the algorithm separates the inliers from the outliers.
However, in Figure 7c,d, where the average power of the neighboring panels is used as reference data, the results are more clear and detailed. Furthermore, three distinct different lines of outliers can be observed (highlighted by the yellow circle). This demonstrates that the use of neighboring panels for performance evaluations can show much more detail and can be more accurate than the use of a pyranometer as reference data.
These differences can be explained by the accuracy of the pyranometer and the reduction of the panels efficiency at higher temperatures. Similar to example 1, for higher irradiances, where the temperature is higher as well, the efficiency of the panels is reducing. Thus the scatterplot of the energy production versus the solar irradiance shows more scatter for higher irradiances and the relationship is not strongly linear. Thus a polynomial fit is more accurate to describe the relationship of the Y f versus the Y R . By contrast, the relationship with the neighboring panel is considerably stronger linear, since its efficiency is reducing at higher temperatures as well.

PV System with String Inverter
In this case the algorithm is applied to the PV system with string inverter, where three panels in the front row are heavily shaded during most of the day. The data resolution is 5 min and the studied period is 133 days, from beginning of July until end of October. In Figure 7, the results of applying the algorithm are presented. Figure 7a,c show the complete data sample, while Figure 6b,d show a smaller portion for better understanding.

Systems with Same Capacity, Different Production and Different Shadows
In this example, two different panels with same micro inverters are monitored. Panel with micro inverter 3 (Micro 3 from now on) is shaded by the placed pole, while the other panel (Micro 6 from now on) is shaded by the corner of the rooftop in the late afternoon. Each panel is used as reference for the other, thus the algorithm runs two times, one for each panel as studied and one as reference. The results are presented in Figure 8. The results from the study of this system show that the duration of the shadow is clearly different. While the panel in Section 4.1 was shaded only a part of the day, this system is shaded during most of the day. Furthermore, the shaded panels are shaded one at a time and only a few moments two panels are shaded at the same time. Due to this fact, the values of the errors are much smaller and the measurements with high error are rare.
In Figure 7a the clustering threshold of the algorithm is almost impossible to be discerned. In the more detailed observation ( Figure 7b) the threshold is clearer, since the density of the green marks is reduced (where the yellow arrows are pointing), and the marks after the reduction are clustered as outliers (black colored). A thin white line between the arrow heads can be seen, reflecting absent data points, which in fact shows where the algorithm separates the inliers from the outliers.
However, in Figure 7c,d, where the average power of the neighboring panels is used as reference data, the results are more clear and detailed. Furthermore, three distinct different lines of outliers can be observed (highlighted by the yellow circle). This demonstrates that the use of neighboring panels for performance evaluations can show much more detail and can be more accurate than the use of a pyranometer as reference data.
These differences can be explained by the accuracy of the pyranometer and the reduction of the panels efficiency at higher temperatures. Similar to example 1, for higher irradiances, where the temperature is higher as well, the efficiency of the panels is reducing. Thus the scatterplot of the energy production versus the solar irradiance shows more scatter for higher irradiances and the relationship is not strongly linear. Thus a polynomial fit is more accurate to describe the relationship of the Yf versus the YR. By contrast, the relationship with the neighboring panel is considerably stronger linear, since its efficiency is reducing at higher temperatures as well.

Systems with Same Capacity, Different Production and Different Shadows
In this example, two different panels with same micro inverters are monitored. Panel with micro inverter 3 (Micro 3 from now on) is shaded by the placed pole, while the other panel (Micro 6 from now on) is shaded by the corner of the rooftop in the late afternoon. Each panel is used as reference for the other, thus the algorithm runs two times, one for each panel as studied and one as reference. The results are presented in Figure 8. The differences between the figures-cases are negligible since 96% of the measurements have the same classification in both cases. The major difference is in the angle of the linear regression lines of the two plots, where in plot (a) is lower than (b), thus system 6 is functioning better than system 3. It Figure 8. Two different panels with micro inverters are compared to each other. The one is heavily shaded and less productive (micro 3) than the other, which is less shaded and newer, thus more efficient (micro 6). The red line corresponds to identical performance of the two systems (45 • ) and, is shown for better comparison of the performances. (a) P AC of system micro 3 versus micro 6; (b) P AC of system micro 6 versus micro 3.
The differences between the figures-cases are negligible since 96% of the measurements have the same classification in both cases. The major difference is in the angle of the linear regression lines of the two plots, where in plot (a) is lower than (b), thus system 6 is functioning better than system 3. It is clear in the plots that the system 3 is shaded much more than system 6. Thus in this example 3 values for energy loss can be calculated:

1.
Energy loss of system 3 due to shadow 2. Energy loss of system 6 due to shadow 3. Energy loss of system 3 due to the older panel Values 1 and 2 are calculated through Equation (5), similarly to examples 1 and 2. System 6 is producing due to the shadow (percentwise according to Equation (6)) from 0.2 to 0.4% less energy and system 3 from 2.5 to 3.5%. As mentioned above, system 3 is affected much more by shade than system 6. Energy loss of system 6 is negligible since it is affected by shadow by the very end of the day.
Furthermore, system 3 has an older and less efficient panel. In order to calculate this energy loss, only the inliers are used as the panel is not disturbed by other reasons (shade in this case). Thus, system 3 is producing 95.6% less energy than system 6, due to age difference between the panels, since both micro inverters and the wirings are the same and no other differences have been observed in DC to AC conversion during the operation of the system [22,34].

System of a Regular House in The Netherlands, Monitored with MeteoSat Data
In the last example a PV system with system capacity 2.45 kW and inverter capacity 2.5 kW is compared with solar radiation, determined from images taken with the satellite MeteoSat 10 [25]. The measurements of MeteoSat 10 consist of 15 min time resolution data of GHI and DHI and these are converted to GTI with the use of the HDKR solar model [9,10] see Figure 9a. The results of applying the proposed method are presented in Figure 9b. is clear in the plots that the system 3 is shaded much more than system 6. Thus in this example 3 values for energy loss can be calculated: 1. Energy loss of system 3 due to shadow 2. Energy loss of system 6 due to shadow 3. Energy loss of system 3 due to the older panel Values 1 and 2 are calculated through Equation (5), similarly to examples 1 and 2. System 6 is producing due to the shadow (percentwise according to Equation (6)) from 0.2 to 0.4% less energy and system 3 from 2.5 to 3.5%. As mentioned above, system 3 is affected much more by shade than system 6. Energy loss of system 6 is negligible since it is affected by shadow by the very end of the day.
Furthermore, system 3 has an older and less efficient panel. In order to calculate this energy loss, only the inliers are used as the panel is not disturbed by other reasons (shade in this case). Thus, system 3 is producing 95.6% less energy than system 6, due to age difference between the panels, since both micro inverters and the wirings are the same and no other differences have been observed in DC to AC conversion during the operation of the system [22,34].

System of a Regular House in The Netherlands, Monitored with MeteoSat Data
In the last example a PV system with system capacity 2.45 kW and inverter capacity 2.5 kW is compared with solar radiation, determined from images taken with the satellite MeteoSat 10 [25]. The measurements of MeteoSat 10 consist of 15 min time resolution data of GHI and DHI and these are converted to GTI with the use of the HDKR solar model [9,10] see Figure 9a. The results of applying the proposed method are presented in Figure 9b. In this example, 75% of all measurements are classified as inliers (green markers in Figure 9b) and 25% as outliers. About 1/3 of the outliers are above the linear regression line. These are mostly due to spatial measuring faults. About 2/3 of the outliers are below the line, and these are due to the presence of a local shadow in addition to spatial measuring faults.
The Performance Ratio of this system from the total sample of measurement is determined to be 71.3%. By taking into account only the inliers the real PR is 79.4%.
Clearly in this example the proposed algorithm is used for purpose 1, as explained in Section 1.3, thus in order to filter irradiance data with large input anomalies and make them suitable for accurate performance analysis for residential systems where the only available data is the power output. Figure 9. (a) System Yield of a residential PV system versus the reference yield obtained by satellite data and solar models; (b) The plot after the application of the algorithm, where it is clustered to inliers (green) and outliers (black).
In this example, 75% of all measurements are classified as inliers (green markers in Figure 9b) and 25% as outliers. About 1/3 of the outliers are above the linear regression line. These are mostly due to spatial measuring faults. About 2/3 of the outliers are below the line, and these are due to the presence of a local shadow in addition to spatial measuring faults.
The Performance Ratio of this system from the total sample of measurement is determined to be 71.3%. By taking into account only the inliers the real PR is 79.4%. Clearly in this example the proposed algorithm is used for purpose 1, as explained in Section 1.3, thus in order to filter irradiance data with large input anomalies and make them suitable for accurate performance analysis for residential systems where the only available data is the power output.
The two-thirds of the outliers which are below the line could be better studied and malfunction detection techniques could be applied in order to determine if there are malfunctions or irradiance data input anomalies. For instance, if DC power is available as well, a DC to AC conversion study on the outliers could detect a malfunction on the inverter. A study on voltage (if it is available) can detect the presence of shadow [22]. If further data on the DC side are available, fault detection and classification methods that have been proposed by Akram and Lotfifard [35] or in [24], a study which also describes detection of ground faults.

Conclusions
In conclusion, this paper describes the development of a new method of cluster analysis and its application for the analysis of PV performance. The method itself can be applied to any pair of data sets which tend to fit to a linear relationship, and allows to distinguish the data between those which are following the linear relationship and those which are not.
The application of the developed algorithm for monitoring and performance evaluation of PV systems is based on the fact that the produced power of a PV system, with the application of the correct formulas) tends to be linear with irradiance in the plane of array (GTI). As the fit is not strongly linear the proposed method aims to replace this linear fit with three polynomial fits. Two are providing the lower and higher value of production (as system yield, Y f ) in order for the measurement to be characterized as inlier or outlier, for any value of the irradiance (as reference yield, Y R ). The third polynomial is providing the position of the majority of the inliers in the scatterplot of the compared data.
The proposed algorithm is used for two different purposes in the field of PV monitoring. For each monitored PV system, the purpose depends strongly on the available reference data and any additional information about the system (known shadow or malfunction). Presuming that the monitored PV system is a residential one, on a rooftop, and the only available reference data is tilted irradiance obtained by satellite or local weather station data after application of solar models, then the method is applied in order to detect and exclude data input anomalies (purpose 1). If the reference data is obtained by a tilted pyranometer or from a neighboring PV system then the method is applied in order to detect any malfunction that causes changing energy loss (shadow in the given examples) and will calculate the energy loss due to the malfunction as well (purpose 2). Furthermore, in that case the performance of the PV system, the "real PR", for the moments that the system is not malfunctioning is calculated. Thus, if the PR is between 0.8 and 0.7 the system is performing well but it can be improved and if the PR is lower than 70% then it is suffering from a constant energy loss malfunction. On the other hand, if the number of the outliers is increasing then a new shadow or another changing energy loss malfunction may be affecting the system.
Unfortunately the case of the neighboring PV systems is rare, except for systems with micro inverters and power optimizers. Its linear relationship with the studied PV is much stronger and detailed compared to pyranometer measurements as reference data, as the pyranometer accuracy is limited to~2.5% and calibration is needed, ideally every two years according to surveys [36] and pyranometer manufacturers [37,38].
As a suggestion, the outliers could be further studied for the detection and classification of any malfunctions. For instance, the hourly occurrence of outliers should be studied and used for detection of any shadow created by any obstacle around the studied PV system. Other possible studies are depending on the case and the available data.
As a final suggestion, the proposed method can be used for validation of any model that simulates PV performance. The simulated and measured data should follow a strong linear relationship (similar to the comparison of two PV systems, see Figures 6c, 7c and 8) and the algorithm can detect from the measurements if the system is malfunctioning or if the model may have any weaknesses at specific moments. Especially in case of building integrated PV (BIPV), where the calculation of the tilted irradiance is more complicated due to the existence of different surfaces around the building, the combination of the proposed algorithm with energy performance models, such as proposed in [39] could lead to monitoring and identification of outliers.