Missing Data? You Can Still Get Great Analytics!

It is a truth universally acknowledged that an analyst with a data set must be in want of a protocol to deal with missing data (my apologies to Jane Austen).  In our work analyzing data related to buildings and facilities, NIKA’s analysts often find that the facility managers are missing information that we would need to help them better predict equipment failure or develop maintenance schedules or manage inventory.  If this sounds familiar, don’t fret – there are several ways to deal with missing data, each with its own advantages and disadvantages.

Option 1: Only Analyze Complete Data Sets
While this would be analytically pure, it is extremely limiting.  Perfect data sets are only slightly less rare than unicorn sightings.  Insisting on complete data sets can also backfire: a record can be complete and still contain questionable data that should be excluded, such as an expenditure or cost that’s far out of range with the rest of the data.

Option 2: Ignore Incomplete Records
This is a valid option, provided the data set is large enough and the number of incomplete records is proportionally small; statistical literature normally limits this to around 5% of the records. There may be cases, however, where this is a poor fit, such as a building’s energy consumption reported on a monthly basis. For example, if only the hottest months are missing, then your analysis is not going to take into account all the air conditioning that’s used in those months.
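
As a rough illustration of that check, the sketch below measures the share of incomplete records before dropping any. The monthly figures, the field names, and the helper functions are all hypothetical, and the 5% cutoff is simply the rule of thumb mentioned above.

```python
# Hypothetical monthly energy records; None marks a missing reading.
records = [
    {"month": m, "kwh": k}
    for m, k in [
        (1, 410), (2, 395), (3, 380), (4, 360), (5, 400), (6, 520),
        (7, None), (8, None), (9, 470), (10, 390), (11, 385), (12, 405),
    ]
]

def incomplete_share(rows):
    """Fraction of rows with at least one missing (None) field."""
    missing = [r for r in rows if any(v is None for v in r.values())]
    return len(missing) / len(rows)

def drop_incomplete(rows, max_share=0.05):
    """Drop incomplete rows only if they are a small share of the data."""
    share = incomplete_share(rows)
    if share > max_share:
        raise ValueError(
            f"{share:.0%} of records are incomplete; dropping them may bias the analysis"
        )
    return [r for r in rows if all(v is not None for v in r.values())]

share = incomplete_share(records)
# Listing which months are missing helps reveal a pattern (here, the hottest ones).
missing_months = [r["month"] for r in records if r["kwh"] is None]
print(f"{share:.0%} incomplete; missing months: {missing_months}")
```

In this made-up example the two missing months are both summer months, which is exactly the situation where ignoring the records would understate air conditioning use.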

Option 3: Use Estimates
This option has the advantage of making the data set complete.  However, there are many different ways to make an estimate and a particular philosophy or reasoning must be chosen. In general, you want to treat each instance of missing data the same way.  When you choose a method for using an estimate, you will have to defend it.  An experienced data analytics expert can help you use proven estimation methodologies and calculations, like regression analysis, to fill in the blanks and get a more accurate picture of your overall facility health.
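
One way to picture a regression-based estimate is the minimal sketch below. It assumes a single predictor (outside temperature) and made-up readings; the `fit_line` helper and the data are illustrative, not a description of any particular NIKA methodology.

```python
# Hypothetical daily readings: (outside temperature in °F, kWh used).
# None marks a missing consumption value.
readings = [(60, 300.0), (70, 350.0), (80, 400.0), (90, None), (100, 500.0)]

def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Fit the line using only the complete records.
complete = [(x, y) for x, y in readings if y is not None]
intercept, slope = fit_line(complete)

# Fill each gap the same way, and flag it so the estimate stays visible.
filled = [
    (x, y, False) if y is not None else (x, intercept + slope * x, True)
    for x, y in readings
]
```

Keeping a flag on every estimated value makes the choice easier to defend later, because anyone reviewing the analysis can see which numbers were observed and which were filled in.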

If you’ve thought about turning to data analytics to help improve facility operations, but were worried that your data isn’t accurate or complete enough, there are options available to you. Data analytics is a complex field requiring diverse expertise, including business, mathematics, and technology. These disciplines all come together to create a holistic view of your organization’s facilities. The Enterprise Technologies team at NIKA has extensive experience turning all kinds of data into actionable analytics. Contact us to learn more about how we can help you increase the useful life of buildings and equipment in your facility portfolio.



Math that Makes Sense: Regression Analysis for Facility Management

Predictive analytics are becoming increasingly important in the realm of facility management. Being able to effectively forecast outcomes – whether related to equipment performance or construction scheduling – is a key element of the way NIKA’s Enterprise Technologies experts help organizations better manage facility projects across the globe.

For example, given data on completed construction projects, you can estimate the length of time a new project will actually take, using the cost and original duration that the construction company gives you.  If you have data on energy consumption in a building and the outside temperature, you can estimate how much electricity or natural gas the building will consume on a given day. A great tool for making these kinds of predictions is regression analysis.

Understanding Regression Analysis

With any data set, there are three main groups of analysis that can be performed: reporting on what the data is, correlation analysis (or how some of the data relates to other parts of the data), and predictive analysis (or using the data to predict what will happen in the future).

Regression analysis falls into the last category. It takes data from observations and finds the function that best fits them.  A regression model predicts one variable (the dependent variable) by combining independent variables.

Statistical software packages calculate the coefficients of the variables and the statistical values of R-squared and p-value.  The R-squared¹ can be described as the fit of the model, and a higher value is better (a perfect model has an R-squared of 1).  The p-value² is the statistical significance of the model or variable, and a lower value is better.
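
To make the coefficients and the R-squared concrete, here is a small sketch of what a statistical package does behind the scenes for a model with one independent variable. The temperature and consumption figures are invented for illustration.

```python
# Hypothetical daily observations: outside temperature (°F) and electricity use (kWh).
temps = [60, 65, 70, 75, 80, 85, 90]
kwh = [310, 330, 345, 380, 395, 430, 445]

n = len(temps)
mean_t = sum(temps) / n
mean_k = sum(kwh) / n

# Least-squares coefficients for kwh = intercept + slope * temp.
slope = sum((t - mean_t) * (k - mean_k) for t, k in zip(temps, kwh)) / sum(
    (t - mean_t) ** 2 for t in temps
)
intercept = mean_k - slope * mean_t

# R-squared = 1 - (unexplained variation / total variation).
predicted = [intercept + slope * t for t in temps]
ss_res = sum((k - p) ** 2 for k, p in zip(kwh, predicted))
ss_tot = sum((k - mean_k) ** 2 for k in kwh)
r_squared = 1 - ss_res / ss_tot

print(f"kwh = {intercept:.1f} + {slope:.2f} * temp, R-squared = {r_squared:.3f}")
```

The p-value is left out of this sketch because it requires a t-distribution, which Python's standard library does not provide. A package such as SciPy reports it alongside the slope and intercept, for example through `scipy.stats.linregress`.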

What is Needed for Regression Analysis?

First and foremost, you need a data set with plenty of observations.  You can use a smaller data set provided the data reflect a group with more commonalities.  For example, if you are looking at a data set containing information on chukar partridges, it could be smaller than one which contains multiple kinds of birds.

Next, you should decide how to deal with outliers³.  If you have a large data set, the effect of outliers is minimal.  If you don’t have a large data set, then you might want to consider removing the outliers so they don’t exert too much influence on the model.
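
Before removing anything, it helps to see which observations would count as outliers. The sketch below uses the common 1.5 × interquartile range rule, which is an assumption of this example rather than a rule from this article, and the maintenance costs are invented.

```python
import statistics

# Hypothetical maintenance costs per work order, in dollars.
costs = [100, 102, 98, 101, 99, 103, 97, 250]

# Quartiles of the data; statistics.quantiles returns the three cut points for n=4.
q1, _median, q3 = statistics.quantiles(costs, n=4)
iqr = q3 - q1

# Flag anything more than 1.5 interquartile ranges outside the middle half.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [c for c in costs if c < low_fence or c > high_fence]
kept = [c for c in costs if low_fence <= c <= high_fence]
```

Whether a flagged value is dropped or kept remains a judgment call: a $250 work order may be a data entry error, or it may be a genuine failure worth understanding.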

It is always best to have a complete data set with many observations, all of which have no missing data.  This is the ideal, but we can also work with data sets that are not ideal.  If there is missing data, it is necessary to consider whether the missing data is randomly distributed or if there is a pattern.  Depending on the number of records, you can either remove the incomplete records or use estimated (imputed) values to fill the gaps where data is missing at random.

If you are familiar with the data or have reason to believe that the variables are connected (whether from experience or literature), then you can jump right into the regression analysis.  If you are not, then you will want to see if there are relationships between the variables.  You can do this by running correlation analysis or taking a look at scatterplots.  This may show that you should transform the data (for example, by taking a logarithm) so that the relationship between the variables becomes linear.
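
The sketch below shows one way such a check might look, using made-up numbers in which one variable doubles at every step. The `pearson` helper and the choice of a logarithm as the transformation are assumptions made for the sake of the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data where y doubles with every step of x (a curved relationship).
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 8, 16, 32, 64]

r_raw = pearson(x, y)                         # correlation on the raw values
r_log = pearson(x, [math.log(v) for v in y])  # correlation after a log transform

print(f"raw: {r_raw:.3f}, after log transform: {r_log:.3f}")
```

The correlation is stronger after the transformation because the logarithm turns the doubling pattern into a straight line, which is the shape a linear regression model expects.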

What Does Regression Analysis Tell Me?

Once you run your regression analysis, you have a model for your dependent variable (what you are trying to predict).  It can be for energy usage, construction costs, or political outcomes.  Take a look at the R-squared and p-values for the coefficients and model to make sure that they are sufficient.  If they are, then it’s great!  You can plug in values to see what your dependent variable would be in that case.  If they are not, then you might want to check your premise or the data you have.

Even if you have a model that isn’t great at predicting your dependent variable, it can still be useful in discussing the effect of the independent variables on it.  If a variable has a positive coefficient in the model, then an increase in that variable will increase the predicted value of the dependent variable, holding the other variables constant; a negative coefficient means the opposite.

Regression Analysis in Action

Here at NIKA, we have used regression analysis in many different ways. Case in point: We were hired by an organization whose construction contractor was assuring them that their project was running on schedule. Our client had a gut feeling that that wasn’t the case. NIKA’s Enterprise Technologies team performed regression analysis on similar completed projects, and produced a model that projected the construction project would be completed two years after the anticipated completion date, confirming the client’s suspicion. Based upon the objective analysis, our client was able to prepare a more realistic scenario for the project’s completion and wait to assign the initial outfitting money that would have been spent on an unfinished project.

For more information on how regression analysis and predictive analytics can be put to work for your facilities, contact NIKA’s Enterprise Technologies team.

¹ More formally, the R-squared is the proportion of the variance in the dependent variable that is explained by the model.  What counts as an acceptable R-squared varies by discipline: some hold that anything over 0.5 is good, while there have been published economics articles with R-squared values in the neighborhood of 0.4.

² Likewise more formally, the p-value is the probability of seeing an estimate at least as far from zero as the one observed if the null hypothesis (that is, the coefficient of the variable is 0) were true.  The generally accepted threshold for statistical significance is a p-value of 0.05 or less.

³ In the statistical sense, not the Malcolm Gladwell book.