- Consumer Goods
- Business insights
- Supply chain
- Sales forecasting
- TIBCO Spotfire
- Predictive Analytics
- Mathematical Modelling
Why do we use data cleansing?
Data cleansing, also known as data scrubbing, is used to detect, correct, and remove inaccurate records from a database.
What is data enrichment for?
Data enrichment increases the value of the information pool: extra values are added to the data set to enhance its quality and richness.
The result of these processes:
Increased reliability and better-quality data for decision making and planning.
Data scientists often face the challenge of producing sound analytics from datasets that contain noisy, incomplete, or irrelevant data. An unclear dataset cannot serve as the basis of any analysis, clustering, or data visualization, as these would yield misleading insights for business decision makers.
What are the root causes of incomplete or missing data?
- The most common problem is missing records for given dates: a particular week or a specific day (in manufacturing it can be as small as a given millisecond) for which, by accident, no record of an event was taken.
- It is typical that for some products data is gathered weekly, while for others it is available only bi-weekly or monthly.
- The third case is poor data quality itself: the data contains outliers, and the challenge is to fit it into a range that ensures quality for decision making.
In the above cases, we can use linear interpolation to make the dataset ready for analysis.
However, in some cases the data is missing on purpose, because it must be missing. In real life, production may simply have ceased at a given plant because the plant is closed. Note that interpolation must not be used in these cases.
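One way to respect such intentional gaps is to interpolate only where a gap is accidental. The sketch below uses pandas (not part of the original KNIME workflow); the `plant_open` flag is a hypothetical column marking periods when the plant was actually running.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "volume": [100.0, np.nan, 120.0, np.nan, np.nan, 150.0],
    # Hypothetical flag: False marks days the plant was closed on purpose.
    "plant_open": [True, True, True, False, False, True],
})

# Interpolate everything first, then blank out the intentional gaps
# so closed-plant days stay missing instead of being filled in.
filled = df["volume"].interpolate(method="linear")
df["volume_clean"] = filled.where(df["plant_open"], np.nan)
```

The accidental gap (day 2) is filled by linear interpolation, while the closed-plant days remain `NaN` and are excluded from downstream analysis.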
The capabilities of KNIME are used to process data in a Big Data environment. For mathematical modelling and statistical computing we are using ‘R’ as it has the ability to reach state-of-the-art technologies rapidly and easily. Finally, TIBCO Spotfire visualizes the results of the data cleansing and enrichment process.
The following mathematical models and algorithms are used in the KNIME workflow:
- Data Cleansing: Linear interpolation
- Data Enrichment: Iterated Function Systems
- Noise Reduction: Kalman Filter
TIBCO Spotfire is used to visualize the result of the above algorithms calculated via the KNIME R-Snippet nodes.
1. Data Cleansing: Linear interpolation
As discussed earlier, linear interpolation is used for
- automatically detecting and imputing missing dates, even if they are not in the data source,
- automatically interpolating missing values where the date is present.
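Both steps can be sketched in a few lines of pandas (an illustrative stand-in for the KNIME workflow, not the original implementation): reindexing to a complete date grid imputes the missing dates, and time-based linear interpolation fills the values.

```python
import pandas as pd

# Weekly sales where one week's record (2023-01-16) is entirely absent.
s = pd.Series(
    [10.0, 14.0, 22.0],
    index=pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-23"]),
)

# Step 1: impute missing dates by reindexing to a complete weekly grid;
# the new date appears with a NaN value.
full_index = pd.date_range(s.index.min(), s.index.max(), freq="7D")
s = s.reindex(full_index)

# Step 2: linearly interpolate the value for the imputed date,
# weighting by the actual time distance between observations.
s = s.interpolate(method="time")
```

After these two steps the series has one row per week, with the previously missing week filled by the value halfway between its neighbours.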
2. Data Enrichment: Iterated Function Systems
To enhance, refine and improve our raw data we use Iterated Function Systems. With the IFS interpolation in KNIME we can
- automatically detect and impute missing dates, even if they are not in the data source,
- automatically interpolate missing values where the date is present,
- automatically augment data by fractals.
In the example below, IFS generated weekly data from a monthly source, including date imputation.
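A minimal sketch of the idea follows, assuming Barnsley-style fractal interpolation functions: each affine map of the IFS contracts the whole series into one subinterval, and iterating the maps (the "chaos game") densifies sparse anchor points into a fine-grained series. This is an illustrative NumPy implementation, not the R code used in the original workflow; the vertical scaling factors `d` are free parameters.

```python
import numpy as np

def fractal_interpolate(x, y, d, n_points=2000, seed=0):
    """Fractal interpolation (IFS) through the anchor points (x[i], y[i]).

    d[i] are vertical scaling factors with |d[i]| < 1; with d == 0 the
    attractor collapses to ordinary piecewise-linear interpolation.
    """
    x0, xN, y0, yN = x[0], x[-1], y[0], y[-1]
    N = len(x) - 1
    # Coefficients of the affine maps w_i(t, v) = (a_i t + e_i, c_i t + d_i v + f_i),
    # chosen so that w_i sends the endpoints (x0, y0), (xN, yN)
    # to the consecutive anchors (x[i-1], y[i-1]), (x[i], y[i]).
    a = (x[1:] - x[:-1]) / (xN - x0)
    e = (xN * x[:-1] - x0 * x[1:]) / (xN - x0)
    c = (y[1:] - y[:-1]) / (xN - x0) - d * (yN - y0) / (xN - x0)
    f = (xN * y[:-1] - x0 * y[1:]) / (xN - x0) - d * (xN * y0 - x0 * yN) / (xN - x0)

    # Chaos game: repeatedly apply a randomly chosen map; the orbit
    # settles onto the attractor, which passes through every anchor.
    rng = np.random.default_rng(seed)
    pt = np.array([x0, y0])  # an anchor point, already on the attractor
    pts = []
    for k in range(n_points + 100):
        i = rng.integers(N)
        pt = np.array([a[i] * pt[0] + e[i],
                       c[i] * pt[0] + d[i] * pt[1] + f[i]])
        if k >= 100:  # discard a short burn-in
            pts.append(pt)
    return np.array(pts)

# Four monthly anchors (hypothetical values) densified into many points.
x_m = np.array([0.0, 1.0, 2.0, 3.0])
y_m = np.array([100.0, 120.0, 90.0, 130.0])
dense = fractal_interpolate(x_m, y_m, d=np.full(3, 0.3))
```

Setting all `d[i]` to zero reproduces plain linear interpolation, while nonzero factors add self-similar detail between the anchors, which is what lets IFS "augment data by fractals".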
3. Reducing Noise: The Kalman filter
The Kalman filter, also known as linear quadratic estimation (LQE), is an algorithm that reduces noise and other inaccuracies in data. Using all data captured over time, the Kalman filter produces a statistically optimal estimate of the underlying signal.
The algorithm works in a two-step process: prediction of the current state variables, followed by an update of the estimates using a weighted average. Because of its recursive nature, the Kalman filter can run in real time, using only the present input measurement and the previously calculated state.
If the noise has a Gaussian distribution, no specific settings are needed in KNIME.
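The predict/update cycle can be illustrated with a one-dimensional filter. This is a minimal sketch assuming a random-walk state model; the variances `q` and `r` are illustrative tuning values, not settings from the original workflow.

```python
import numpy as np

def kalman_1d(measurements, q=1e-3, r=0.5):
    """One-dimensional Kalman filter under a random-walk state model.

    q: process-noise variance, r: measurement-noise variance
    (both hypothetical tuning parameters for this sketch).
    """
    x_est = measurements[0]  # initial state estimate
    p = 1.0                  # initial estimate variance
    out = []
    for z in measurements:
        # Predict: the random-walk model carries the state over;
        # uncertainty grows by the process noise.
        p = p + q
        # Update: blend prediction and measurement with a weighted
        # average, the weight being the Kalman gain.
        k = p / (p + r)
        x_est = x_est + k * (z - x_est)
        p = (1 - k) * p
        out.append(x_est)
    return np.array(out)

# Noisy observations of a constant signal (true value 10, Gaussian noise).
rng = np.random.default_rng(42)
noisy = 10 + rng.normal(0, 1, 200)
smoothed = kalman_1d(noisy)
```

Note that each step uses only the current measurement and the previous estimate, which is what makes the filter suitable for real-time use.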
See the effect of Kalman filter algorithm on data below:
After data cleansing and enrichment the original data is transformed into sets that are consistent with other data sets in the BI ecosystem.
- Eliminated cost of false conclusions and misdirected investments due to incorrect and inconsistent data
- Minimized manual database maintenance work
- High quality data ready for
- validation, harmonization and standardization
- data visualization to get actionable insights
- enterprise decision-making and planning