If you ask any experienced analytics or data science professional what differentiates a good model from a bad one, chances are that you will hear a uniform answer. Whether you call it “characteristics generation”, “variable generation” (as it was traditionally known) or “feature engineering”, the importance of this step is unanimously agreed upon in the data science and analytics world.
This step involves creating a large and diverse set of derived variables from the base data. The richer the set of generated variables, the better your models tend to be. Most of our time and coding effort is usually spent on feature engineering. Therefore, understanding feature engineering for specific data sources is a key success factor for us. Unfortunately, most analytics courses and textbooks do not cover this aspect in great detail. This article is a humble effort in that direction.
Table of Contents
- The IoT Revolution
- Nature of IoT or sensor data
- Aggregation of data for feature engineering
- Usage at atomic level
- Usage at aggregated level
- Selecting Optimal time window for aggregating sensor data
- Types of Aggregation
- Missing value treatment
- Feature generation
- Basic Features
- Features based on relationships
- Features based on higher order statistics
- Features based on Outlier detection
- Features based on Series transformation
- Further Readings
1. The IoT (Internet of Things) revolution
I am sure you know this already – none of us are untouched by the impact of IoT. Look at the forecast for mobile data traffic from Cisco (as done in 2015) below:
In the last few years, continuous streaming data from sensors has emerged as one of the most promising data sources, fuelling a number of interesting developments in the “Internet of Things”. Some of the common areas being revolutionised by the emergence of IoT include vehicle telematics, predictive maintenance of equipment, manufacturing quality management, connected consumer devices, health care and many more.
Given the fast pace of change towards connected devices, we think that data science professionals need to understand and explore feature engineering of IoT or sensor data. That is what we will attempt in this article. Just a caveat, though: like any other area, this is a vast and emerging field, and hence this is not a comprehensive guide. But it is good enough for you to get started.
2. Nature of IoT or sensor data
IoT or sensor data consists of a continuous stream of data in which the time interval between successive updates is very small: usually minutes, seconds or even milliseconds.
The data produced usually pertains to the physical state of a system or a human being. Examples include temperature, pressure, voltage, current, flow rate, velocity, acceleration, heart rate and blood pressure. In the case of many sensors, the data stream also provides continuous information about the state of the system. For instance, in vehicle telematics data, along with the information from a particular sensor (e.g. temperature, flow of lubricant etc.), the status of the car (e.g. running, idling, cranking) is also captured as a continuous stream.
Many IoT applications also consider other, less dynamic data along with the continuous streaming data mentioned above. Such less dynamic data usually pertains to the overall state or working condition of the system. Examples include the type of input material being processed by a machine or the treatment regime being administered to a patient.
3. Aggregation of data for feature engineering
Before creating features from IoT or sensor data, it is important to consider the level of aggregation (across time) of the continuous streaming data. In many cases, the continuous streaming data is not aggregated but is used at the most granular or atomic level. If one assumes that data from a particular sensor is available every second, then the signal at each one-second interval is used for generating features. Where the prediction time windows are very short, this is usually the most useful level. On the other hand, where the prediction time window is longer, it may be better to aggregate the signal data over specific time windows.
To sum up, while working at the atomic level, the data is not aggregated over time and hence, is used at the most granular level. But for certain problems the data is aggregated over specific time windows. Let me discuss this with an example of each type.
Usage at an atomic level (without aggregation) – an example
As an example of using sensor data at the atomic level, one may consider a motor that is used as the prime mover for a grinding machine. Based on the input particle size and operating parameters of the grinder (e.g. temperature, flow rate etc.), the rotations per minute (RPM) of the motor are controlled dynamically. In this case, the data on input particle size, temperature within various parts of the grinder, and input flow rate are available every second. Based on these inputs, the RPM of the motor needs to be changed dynamically. If the motor rotates faster than the optimal speed, it will cause overheating of the grinder, leading to possible damage to certain areas of the grinder. On the other hand, if the RPM is too low, it will cause a decrease in production rate.
Data on temperature, particle size and flow rate are taken as inputs, and a model is used to determine the right RPM for the motor in such a way that there is a very low probability of damage to the grinder. It is apparent that the time window for predicting the right RPM is within a few seconds or, at most, a minute. In this case, one will most likely use the input data at the one-second level to derive features that will be used as candidate variables in the model. Applications involving process control usually use a very short prediction window, which necessitates feature generation at the atomic level. Figure-1 below illustrates the grinder assembly described above.
Usually, one assumes that the process underlying the generation of the sensor data is memoryless, which implies that the value of the sensor data stream within a particular window does not depend on the values in the previous window.
Usage of aggregated sensor data – an example
Let us look at the other example, where we aggregate the data. Consider a car which has been driven for a few years. If one is interested in predicting the lifetime of a specific component within the car, then one is faced with a situation where the prediction horizons are longer-term in nature (days, weeks or months). In this case, the sensors inside the car will produce data at frequent time intervals, similar to the earlier case. But the data needs to be aggregated over time to reveal meaningful trends and changes that signify an impending failure over a longer time horizon. In these cases, both atomic-level and aggregated-level data are used for generating the features, but in most cases the aggregated-level features prove more productive. This also holds true for a lot of other manufacturing equipment.
Selecting the optimal time window for aggregating sensor data
The next obvious question is what should be the appropriate time window for aggregating the sensor data before feature derivation. Selecting the time window is often an important consideration that drives the success of the feature engineering exercise in IoT. Usually the following types of windows are used for aggregation.
- Fixed time window
- This is the simplest window calculation. The aggregation is performed over a specific, uniform time interval (e.g. 15 minutes across the entire span of data availability).
- Variable time window
- Various measures can be used to generate variable time windows. In most cases, a specific number of occurrences of an event is used to determine the size of the window. For example, in vehicle telematics data, one may use a time window that contains at least 4 instances of switching-on and switching-off of the engine. Within the span of the available data, the 4 occurrences may happen within any period of time. All aggregations are performed within this time window that contains 4 instances of switching-on and switching-off.
- Exponentially expanding or exponentially contracting time windows
- In many IoT applications, it may be relevant to aggregate the near-term data across more granular time windows, while data that is further off may be aggregated across a wider window. This approach is referred to as exponentially expanding windows. The opposite is referred to as exponentially contracting windows.
- Overlapping window
- A pertinent question that one may raise at this point is “how to determine the right window?” Similar to many other answers in data mining, the answer to this is “let’s try different types and lengths of windows, and then see what the data picks up as most predictive”. This gives rise to the usage of overlapping windows.
- One may use various definitions for creating the time windows, and these windows will then naturally overlap. It may so happen that features from several of these windows are finally used for building the model.
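The fixed and event-based windows described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available and using the mean as the aggregate; the function names `fixed_windows` and `event_count_windows` and their parameters are hypothetical and not part of any library.

```python
import numpy as np

def fixed_windows(timestamps, values, width):
    """Mean of `values` in consecutive fixed-width windows.

    `width` is expressed in the same units as `timestamps` (assumption).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Window index of each reading, counted from the first timestamp.
    idx = ((timestamps - timestamps[0]) // width).astype(int)
    return {int(k): float(values[idx == k].mean()) for k in np.unique(idx)}

def event_count_windows(events, values, events_per_window):
    """Mean of `values` in variable windows that each close once a
    fixed number of events (e.g. engine switch-ons) has been observed."""
    events = np.asarray(events, dtype=bool)
    values = np.asarray(values, dtype=float)
    means, start, seen = [], 0, 0
    for i, is_event in enumerate(events):
        seen += int(is_event)
        if seen == events_per_window:
            means.append(float(values[start:i + 1].mean()))
            start, seen = i + 1, 0
    # Readings after the last complete window are left out.
    return means

# Six one-second readings aggregated in two-second windows.
fixed = fixed_windows([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11], width=2)
# The same readings, with a window closing after every second event.
variable = event_count_windows([0, 1, 0, 1, 0, 1], [1, 3, 5, 7, 9, 11], 2)
```

Running both functions with several widths or event counts on the same stream is one simple way to obtain the overlapping windows mentioned above.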
Types of aggregation
Once the window for aggregation has been arrived at, the next step involves aggregating the sensor data over these time windows to create a set of new variables / features from the atomic ones. The aggregations performed are typically driven by the context and the data types. Some of the features are generated by using data across multiple windows; while others are generated by creating features within a single window. The section on feature generation provides a list of the features that are usually generated.
Treatment for missing values
Aggregating sensor data across various time windows also helps in treating missing values. In the case of sensor data, missing values may arise due to failure of the sensing, transmitting or recording systems. For instance, due to issues in satellite connectivity, a telematics sensor inside a car engine may not transmit any signal for a specific period of time. The use of a variable time window allows the data to be aggregated over a sufficient length of time to guarantee a certain threshold amount of data. For instance, one may define a variable time window as a time window that captures at least 1000 pings from the sensor.
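As a rough sketch of this idea, the snippet below grows each window until it holds a minimum number of valid pings, treating `NaN` as a missing reading. The function name `min_ping_windows`, the `NaN` convention and the choice of the mean as the aggregate are assumptions made for illustration.

```python
import numpy as np

def min_ping_windows(values, min_pings):
    """Split a stream into consecutive windows that each contain at
    least `min_pings` valid (non-NaN) readings."""
    values = np.asarray(values, dtype=float)
    windows, start, valid = [], 0, 0
    for i, v in enumerate(values):
        valid += int(not np.isnan(v))
        if valid == min_pings:
            chunk = values[start:i + 1]
            windows.append({
                "start": start,
                "end": i,
                # Aggregate over the valid readings only.
                "mean": float(np.nanmean(chunk)),
                # The share of missing pings can itself be used as a feature.
                "missing_share": float(np.isnan(chunk).mean()),
            })
            start, valid = i + 1, 0
    return windows

stream = [1.0, np.nan, 3.0, np.nan, np.nan, 5.0, 7.0]
windows = min_ping_windows(stream, min_pings=2)
```

With a realistic threshold such as the 1000 pings mentioned above, windows covering a connectivity outage simply become longer instead of ending up empty.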
4. Feature generation
After the aggregation process, one has a series of values either across time points (if the data is not aggregated and hence used at the atomic level) or across the aggregation windows. The following diagram illustrates a visualization of the data from a temperature sensor inside a machine. As expected, the variability of the data is lower after aggregation; this is one reason for using atomic-level data for predictions over shorter windows.
Figure-2: Impact of aggregation window
The following features can be generated post the creation of the series (either aggregated or atomic).
- Basic features: Change, rate of change and standardization
- Features based on relationships: Relationship between different types of sensor data
- Features based on higher order statistics: Moments and cumulants
- Features based on outlier detection: Kalman filters and other outlier detection methods
- Features based on series transformation: Fourier transformation and other transformation
Basic features
The simplest set of features involves change and rate of change (essentially the first and second differences) and percentage changes (growth or decay). If the series of values is represented by $X_1, \ldots, X_n$ and time is represented as $t_1, \ldots, t_n$, then some of the basic features are calculated as follows. It should be noted that these are only a sample of the possible features and are by no means comprehensive. Feature engineering is a combination of science, creativity and domain knowledge.
- Simple features involving one type of signal
- Change over time: $C_{m+1} = (X_{m+1} - X_m)/(t_{m+1} - t_m)$
- Rate of change over time: $RT_m = (C_{m+1} - C_m)/(t_{m+1} - t_m)$
- Growth or decay: $G_{m+1} = (X_{m+1} - X_m)/X_m$
- Rate of growth or decay: $RG_m = (G_{m+1} - G_m)/(t_{m+1} - t_m)$
- Count of values above or below a threshold value
- Moving average = Average of ($X_{m-p}$ to $X_m$)
- Moving standard deviation = Standard deviation of ($X_{m-p}$ to $X_m$)
- Relative average = Moving average / Global average
- Relative standard deviation = Moving standard deviation / Global standard deviation
- Ratio of changes, growth rate etc. with standard deviation
- Features involving trend of values across various aggregation windows: change and rate of change in average, standard deviation etc. across windows
- In most cases, one will have multiple series of values ($X^A_1, \ldots, X^A_n$, $X^B_1, \ldots, X^B_n$ and so on). Using these multiple series of values, one can create a large number of combined features. These include the following.
- Simple ratio between the two series
- Ratio of changes, rate of change and growth between the two series
- Ratio of moving average and moving standard deviation
- Ratio of relative averages and relative standard deviation
- Relative first difference: $RV1_{m+1} = (X^A_{m+1} - X^A_m)/(X^B_{m+1} - X^B_m)$
- Relative second difference: $RV2_{m+1} = (G^A_{m+1} - G^A_m)/(G^B_{m+1} - G^B_m)$
- Count of cases where Growth of both series is positive or negative
- Count of cases where Growth of both series is in opposite direction
- Count of cases where the first series is above a threshold and the second is below a threshold, and vice versa
- Similar to the case involving a single type of signal, one can also generate features across various aggregation windows
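A minimal NumPy sketch of several features from the list above is given below. The function names, the dictionary keys, the default window length `p` and the use of the population standard deviation are illustrative assumptions; the formulas follow the definitions given in the list.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def single_signal_features(x, t, p=3, threshold=0.0):
    """Basic features for one series x observed at times t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    dt = np.diff(t)                        # t_{m+1} - t_m
    change = np.diff(x) / dt               # C_{m+1}
    growth = np.diff(x) / x[:-1]           # G_{m+1}
    windows = sliding_window_view(x, p + 1)  # X_{m-p} .. X_m
    moving_avg = windows.mean(axis=1)
    moving_std = windows.std(axis=1)       # population std (assumption)
    return {
        "change": change,
        "rate_of_change": np.diff(change) / dt[1:],
        "growth": growth,
        "rate_of_growth": np.diff(growth) / dt[1:],
        "count_above": int(np.sum(x > threshold)),
        "moving_avg": moving_avg,
        "moving_std": moving_std,
        "relative_avg": moving_avg / x.mean(),
        "relative_std": moving_std / x.std(),
    }

def two_signal_features(xa, xb):
    """Combined features for two series of equal length."""
    xa = np.asarray(xa, dtype=float)
    xb = np.asarray(xb, dtype=float)
    da, db = np.diff(xa), np.diff(xb)
    # Zero denominators yield inf or nan rather than raising an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        ga, gb = da / xa[:-1], db / xb[:-1]
        return {
            "ratio": xa / xb,
            "relative_first_difference": da / db,
            "relative_second_difference": np.diff(ga) / np.diff(gb),
            "growth_same_direction": int(np.sum(ga * gb > 0)),
            "growth_opposite_direction": int(np.sum(ga * gb < 0)),
        }

single = single_signal_features([1, 2, 4, 8], [0, 1, 2, 3], p=1, threshold=3)
pair = two_signal_features([1, 2, 6, 12], [2, 3, 6, 9])
```

The same functions can be applied to a series of window-level aggregates (for example, the averages per window) to obtain the trend features across aggregation windows.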
Features based on relationships
In many cases, one will leverage data coming out of multiple sensors, each pertaining to similar or dissimilar quantities that are measured. In most cases, there may be a natural relationship between the signals emanating from the various types of sensors. However, in certain cases, the signal from a particular sensor may show a value outside this natural relationship. This may be driven by changes in operating conditions, breakdowns etc. As a data scientist, it may be a critical objective to identify such events and convert them into features.
To identify such “un-natural” events, one may first build a simple model, usually a regression model that takes one of the Xs as the dependent variable and the others as independent variables. In most cases these are simple ordinary least squares regression models (but they might use squares and cubes of some of the Xs as input variables). The error of that regression provides a measure of the “un-natural” behaviour. If one observes an outlier in the error term, it may be used as a feature. Further, the series of errors can be used as a new X variable, and subsequent features may be generated using this new series.
It should be noted that while building the regression model to create the error series, care should be taken to ensure that the outliers are excluded and that the error term obeys the usual assumptions. This will help ensure that the “un-natural” behaviours are captured appropriately once the outlier data points are included.
The following example illustrates this proposition. Referring to the grinder example described earlier, one may have the following continuous streaming sensor data available:
- RPM of the motor
- Flow rate of the incoming material
- Flow rate of the outgoing material
- Particle size of the incoming material
- Particle size of the outgoing material
- Temperature coming from a number of thermocouples attached at various parts of the grinder; for instance TC1 to TC16 (which represents 16 continuous streaming sensor data on temperature from 16 different areas of the grinder)
As per the usual pattern of the data, it may be expected that the temperature recorded by each thermocouple is a function of RPM, flow rates (both input and output) and particle sizes (both input and output). When the RPM increases, there is higher friction and hence temperatures should naturally increase. Similarly, a higher ratio between the input and output particle sizes represents a higher grinding ratio, which may cause heating. Therefore, one may be interested in knowing if the temperature recorded by any of the thermocouples is above or below this “normal” behaviour. For instance, if there is some harder impurity mixed with the input material, it may cause above-normal heating.
To address this requirement, one may create specific features to capture such “un-natural” behaviour. For instance, one may build a simple regression model with temperature as the Y and RPM, particle size and flow rate as the Xs. Once this regression is applied to the data series, one should be able to calculate an error series. Under normal operating circumstances, the value of the errors should be within a limit; however, during “un-natural” working conditions, the error or the standard deviation of the error may shoot up. Taking this error series $e_1, \ldots, e_n$, one may create a set of features similar to the ones described in the last section.
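The residual-based approach can be sketched with an ordinary least squares fit in NumPy. The data below is synthetic and purely illustrative, and the function name `residual_series` and the `fit_mask` argument (marking the points used to fit the model, so that known outliers can be excluded) are assumptions, not a prescribed implementation.

```python
import numpy as np

def residual_series(y, X, fit_mask=None):
    """Fit y ~ intercept + X by ordinary least squares on the rows
    selected by `fit_mask`, then return residuals for ALL rows."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if fit_mask is None:
        fit_mask = np.ones(len(y), dtype=bool)
    fit_mask = np.asarray(fit_mask, dtype=bool)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design[fit_mask], y[fit_mask], rcond=None)
    return y - design @ coef

# Synthetic grinder data: temperature driven by RPM and flow rate.
rpm = np.arange(10.0)
flow = np.array([0, 1, 4, 2, 2, 4, 1, 0, 1, 4], dtype=float)
temperature = 20 + 2 * rpm + 0.5 * flow
temperature[7] += 10            # simulated "un-natural" heating event

normal = np.ones(10, dtype=bool)
normal[7] = False               # exclude the known outlier from the fit
errors = residual_series(temperature, np.column_stack([rpm, flow]), normal)
```

The resulting `errors` array plays the role of the error series: it stays near zero under normal operation and jumps at the simulated event, and it can be fed into the same change, moving-average and threshold-count features as any other signal.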
Features based on higher order statistics
Usually moments (mean, variance, skewness, kurtosis etc.) are calculated within the aggregation window. However, in cases where the series is non-Gaussian, one uses cumulants rather than moments to obtain information about the nature of the distribution of the series within the window. The cumulants of a series X are defined using the cumulant-generating function, which is the natural logarithm of the moment-generating function. Up to the third order, the cumulants coincide with familiar moments: the first cumulant is the mean, and the second and third cumulants equal the second and third central moments. From the fourth order onwards, however, the cumulants are no longer equal to the corresponding moments. Deriving the relationship between moments and cumulants is beyond the scope of this article, but interested readers should explore the derivation.
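As an illustration, the first four cumulants of the values in one aggregation window can be estimated from the sample central moments. The sketch below uses simple population (biased) estimates and the standard identity that the fourth cumulant equals the fourth central moment minus three times the squared variance; the function name and dictionary keys are hypothetical.

```python
import numpy as np

def moments_and_cumulants(x):
    """Population estimates of the first four cumulants of x, plus
    skewness and kurtosis, for the values in one aggregation window."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    d = x - mean
    m2 = (d ** 2).mean()   # second central moment (variance)
    m3 = (d ** 3).mean()   # third central moment
    m4 = (d ** 4).mean()   # fourth central moment
    return {
        "k1": float(mean),
        "k2": float(m2),
        "k3": float(m3),
        "k4": float(m4 - 3 * m2 ** 2),   # differs from m4 itself
        "skewness": float(m3 / m2 ** 1.5),
        "kurtosis": float(m4 / m2 ** 2),
    }

stats = moments_and_cumulants([1, 2, 3, 4, 5])
```

For a Gaussian series the third and fourth cumulants are zero, so non-zero values of `k3` and `k4` within a window give a compact indication of departure from normality.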
Features based on outlier detection
In many use cases, it may be required to detect outliers and use the presence of outliers within an aggregation window as a feature. A popular method of outlier detection involves the use of techniques like the Kalman filter. Other methods involve running a principal component analysis on a set of already created features and then measuring the distance of each point within the aggregation window from the origin of this multi-dimensional space. Any point which is 3 or more standard deviations away from the origin may be considered an outlier. In the regression-type approach described in the last section, Cook's distance or similar measures can also be used for outlier detection.
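The principal component approach can be sketched as follows. This is one possible reading of the rule above, assuming that "distance from the origin" is measured per component in units of that component's standard deviation; the function name `pca_outlier_flags` and its parameters are hypothetical, and every feature column is assumed to have non-zero variance.

```python
import numpy as np

def pca_outlier_flags(features, n_components=1, k=3.0):
    """Flag rows whose score on any retained principal component lies
    k or more standard deviations away from the origin."""
    F = np.asarray(features, dtype=float)
    # Standardise each feature so that the origin is the centre of the data.
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    # PCA via singular value decomposition of the standardised matrix.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    scaled = scores / scores.std(axis=0)
    return np.any(np.abs(scaled) >= k, axis=1)

# Fifty ordinary feature rows followed by one extreme row.
ordinary = np.column_stack([np.arange(50) % 5, np.arange(50) % 7])
rows = np.vstack([ordinary, [100, 100]])
flags = pca_outlier_flags(rows, n_components=1)
outlier_count = int(flags.sum())   # a candidate window-level feature
```

The count (or share) of flagged points per aggregation window can then be used as a feature in its own right.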
Features based on series transformation
Usually a series of continuous streaming data can be considered as a signal in the time domain. Established mathematical transformations can be used to decompose this signal into a set of simpler functions. The forms of those simpler functions can then be analysed across multiple aggregation windows to identify any significant change in the pattern of the signal. Usually the following transformations are attempted. In this article, a brief description of the Fast Fourier Transform is provided, as it is possibly the most commonly used transformation.
- Fast Fourier Transform
- Hilbert Huang Transform
- Wigner Ville Distribution
- Wavelet Transformation
Fast Fourier Transform
The Fourier transform is used to transform a signal from the time domain to the frequency domain. The transform decomposes the original signal into a series of sinusoidal functions of the form $A\cos(\omega t) + jB\sin(\omega t)$, which is essentially a combination of cosine and sine functions in the real and imaginary space. The most amazing aspect of the Fourier transform is that it can represent practically any signal (irrespective of its shape) as a combination of a large number of sinusoidal functions. The Fast Fourier Transform (FFT) is an algorithm designed by Cooley and Tukey in the 1960s. The FFT is used to calculate the amplitudes, phases and frequencies of the sinusoids that should be combined to recreate the original signal. This can be intuitively thought of as a process of breaking a piece of music into its component notes, which can be combined across various scales to recreate the original piece of music.
To use this fascinating mathematical transformation for feature engineering, one needs to compute the FFT of the signal for each aggregation window. This will produce the frequencies and amplitudes of each component sinusoid across the aggregation windows. The series of amplitudes for a given frequency can then be used for further feature generation (similar to the features described earlier). Alternatively, if a new frequency is detected within any aggregation window, it may be an indicator of anomalous behaviour.
Taking the music metaphor further, one may consider this process as “listening” to the music produced by the signal, wherein one tries to “identify” any abnormal or discordant note in the music being produced in each aggregation window.
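A minimal sketch of per-window FFT features using NumPy's real-input FFT is shown below. The synthetic signal, the non-overlapping fixed windows, the choice of the dominant frequency and its amplitude as features, and the function name `fft_window_features` are all assumptions made for illustration.

```python
import numpy as np

def fft_window_features(signal, window_size, sample_rate):
    """Dominant frequency (Hz) and its amplitude for each
    non-overlapping window of `window_size` samples."""
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    features = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        window = signal[start:start + window_size]
        # Remove the mean so that the zero-frequency term does not dominate.
        spectrum = np.fft.rfft(window - window.mean())
        # Scale to sinusoid amplitude (valid away from the DC and Nyquist bins).
        amplitudes = 2.0 * np.abs(spectrum) / window_size
        peak = int(np.argmax(amplitudes))
        features.append({
            "dominant_frequency": float(freqs[peak]),
            "dominant_amplitude": float(amplitudes[peak]),
        })
    return features

# Two one-second windows sampled at 100 Hz: a 5 Hz tone, then a 12 Hz tone.
t = np.arange(200) / 100.0
signal = np.where(t < 1.0,
                  np.sin(2 * np.pi * 5 * t),
                  2.0 * np.sin(2 * np.pi * 12 * t))
spectral = fft_window_features(signal, window_size=100, sample_rate=100)
```

In this toy example the dominant frequency shifts from 5 Hz in the first window to 12 Hz in the second, which is the kind of "new note" that may indicate anomalous behaviour.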
5. Further readings
The paper titled “A Cumulant-Based Method for Gait Identification Using Accelerometer Data with Principal Component Analysis and Support Vector Machine” by Sebastijan Sprager and Damjan Zazula provides a suitable application of cumulants for feature engineering.
For understanding the use of Kalman filters for feature engineering, readers may refer to the paper titled “A Kalman Filter for Robust Outlier Detection” by Jo-Anne Ting, Evangelos Theodorou, and Stefan Schaal.
For understanding the application of Fourier transformations, readers may refer to the paper titled “Classification of epileptiform EEG using a hybrid system based on decision tree classifier and fast Fourier transform” by Kemal Polat and Salih Güneş as a good example. The paper titled “EEG alpha spindle measures as indicators of driver fatigue under real traffic conditions” by M. Simon, E. A. Schmidt, W. E. Kincses et al. is also a good example.
About the Authors
Sandhya Kuruganti and Hindol Basu are the authors of a book on business analytics titled “Business Analytics: Applications to Consumer Marketing”, recently published by McGraw Hill and available on Flipkart and Amazon India. They are seasoned analytics professionals with a collective industry experience of more than 30 years.