Industry:

Cross-industry

Relevant for:

  • Consumer Goods
  • Manufacturing
  • Pharmaceuticals
  • Energy
  • Retail

InfomatiX Platform:

IX Automation & Optimization


Business topics:
  • Business insights
  • Supply chain
  • Sales forecasting

Analytical Tools
  • KNIME
  • R
  • TIBCO Spotfire

Technology:
  • Integration
  • Predictive Analytics
  • Mathematical Modelling


Why do we use data cleansing?
Data cleansing, also called data scrubbing, is used to detect inaccurate records in our database and then correct or remove them.

What is data enrichment for?
Data enrichment increases the value of the information pool by adding extra values to our data set, which enhances the quality and richness of the data.

The result of these processes:
Increased reliability and better-quality data for decision making and planning.



THE CHALLENGE

Data scientists often face the challenge of building sound analytics on datasets that include noisy, incomplete or irrelevant data. If a dataset is not clean, it cannot serve as the basis for analysis, clustering or data visualization, because the results would give misleading insights to business decision makers.

What are the root causes of incomplete or missing data?

  • The most common problem is missing records for given dates. For a particular week or day (in manufacturing the gap can be as small as a given millisecond) there is no data because, by accident, no record of an event was taken.
  • Data is often gathered at different frequencies: weekly for some products, but only bi-weekly or monthly for others.
  • The third case is poor data quality itself: the data contains outliers, and our challenge is to fit it into a range to ensure quality for decision making.

In the above cases, we can use linear interpolation to make the dataset ready for analysis.

However, in some cases the data is legitimately missing. For example, production may have ceased at a given plant because the plant is closed. Please note that we cannot use interpolation in these cases, because it would invent values for periods in which nothing was produced.

THE TECHNOLOGY

KNIME is used to process data in a Big Data environment. For mathematical modelling and statistical computing we use R, as it gives quick and easy access to state-of-the-art techniques. Finally, TIBCO Spotfire visualizes the results of the data cleansing and enrichment process.

THE SOLUTION

The following mathematical models and algorithms are used in the KNIME workflow:

  1. Data Cleansing: Linear interpolation
  2. Data Enrichment: Iterated Function Systems
  3. Noise Reduction: Kalman Filter

Figure: KNIME workflow for data enrichment

TIBCO Spotfire is used to visualize the results of the above algorithms, which are calculated in the KNIME R-Snippet nodes.

 

1. Data Cleansing: Linear interpolation

As discussed earlier, linear interpolation is used for

  • automatically detecting and imputing missing dates, even if they are not in the data source,
  • automatically interpolating missing values if the date is present.

Figure: Interpolation
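
The two uses of linear interpolation listed in this section (imputing missing dates, then interpolating missing values) can be sketched in a few lines. This is a minimal Python illustration, not the R code that runs in the KNIME workflow. The function name fill_and_interpolate, the step_days parameter and the assumption that all dates lie on a regular grid are hypothetical choices made for the example.

```python
from datetime import date, timedelta


def fill_and_interpolate(records, step_days=7):
    """Complete a regular time series and linearly interpolate its gaps.

    records: list of (date, value) pairs; value is None when the date is
    present but the measurement is missing. Dates are assumed to lie on a
    regular grid of `step_days` days (an assumption of this sketch).
    """
    known = dict(records)
    start, end = min(known), max(known)

    # Step 1: impute missing dates by building the full regular grid.
    grid = []
    current = start
    while current <= end:
        grid.append(current)
        current += timedelta(days=step_days)
    values = [known.get(day) for day in grid]

    # Step 2: interpolate linearly between neighbouring known values.
    anchors = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(anchors, anchors[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            values[i] = values[left] + weight * (values[right] - values[left])

    # Gaps before the first or after the last known value stay None:
    # the sketch interpolates, it does not extrapolate.
    return list(zip(grid, values))
```

Periods where data is missing on purpose, such as a plant closure, would have to be split off before calling a function like this, since it fills every gap between two known values.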

 

2. Data Enrichment: Iterated Function Systems

To enhance, refine and improve our raw data, we use Iterated Function Systems (IFS). With IFS interpolation in KNIME we can

  • automatically detect and impute missing dates, even if they are not in the data source,
  • automatically interpolate missing values if the date is present,
  • automatically augment data by fractals.

In the example below, IFS generated weekly data from a monthly source, with date imputation.

Figure: IFS
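
As an illustration of how an IFS can densify a coarse series, here is a minimal Python sketch of fractal interpolation. It assumes that "IFS interpolation" refers to Barnsley-style fractal interpolation functions, which the source does not confirm, and it is not the R code of the workflow. The function name fractal_interpolate, the single vertical scaling factor d and the iterations parameter are hypothetical.

```python
def fractal_interpolate(xs, ys, d=0.3, iterations=1):
    """Return points on a fractal interpolation curve through (xs, ys).

    xs must be strictly increasing. d is the vertical scaling factor shared
    by all maps (|d| < 1); with d = 0 the result is piecewise linear.
    """
    if not -1.0 < d < 1.0:
        raise ValueError("d must satisfy |d| < 1 for the IFS to be contractive")
    x0, xn, y0, yn = xs[0], xs[-1], ys[0], ys[-1]
    width = xn - x0

    # One affine map per interval: w_i(x, y) = (a*x + e, c*x + d*y + f),
    # chosen so that w_i sends the first data point to knot i-1 and the
    # last data point to knot i.
    maps = []
    for i in range(1, len(xs)):
        a = (xs[i] - xs[i - 1]) / width
        e = xs[i - 1] - a * x0
        c = (ys[i] - ys[i - 1] - d * (yn - y0)) / width
        f = ys[i - 1] - c * x0 - d * y0
        maps.append((a, e, c, f))

    # Deterministic iteration: start from the knots and apply every map.
    points = list(zip(xs, ys))
    for _ in range(iterations):
        refined = []
        for a, e, c, f in maps:
            # Skip the last point; its image is the first image of the next map.
            refined.extend((a * x + e, c * x + d * y + f) for x, y in points[:-1])
        refined.append((xn, yn))
        points = refined
    return points
```

With N intervals in the source, one iteration splits every interval into N sub-steps, so five evenly spaced monthly values yield four sub-steps per month. The original values are reproduced at the knots, and d controls how rough the generated in-between values are.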

 

3. Noise Reduction: Kalman Filter

The Kalman filter, also known as linear quadratic estimation, is an algorithm that reduces noise and other inaccuracies in data. Using the measurements captured over time, the Kalman filter gives us a statistically optimal estimate of the underlying values.

The algorithm works in two steps: it first predicts the current state variables, then updates that prediction with a weighted average of the prediction and the new measurement. Because of its recursive nature, the Kalman filter can work in real time, using only the present input measurement and the previously calculated state.
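
The predict and update steps can be sketched for the simplest case: a single value that drifts slowly and is measured with noise. This is a minimal Python illustration under assumed settings, not the configuration used in the KNIME workflow. The function name kalman_1d, the random-walk state model and the two variance parameters are hypothetical.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=1.0):
    """Smooth a noisy series with a scalar Kalman filter.

    Assumes a random-walk state model: the true value changes only slightly
    between steps (process_var) and each measurement carries Gaussian noise
    with variance measurement_var.
    """
    estimate = measurements[0]   # initial state: the first measurement
    error_var = measurement_var  # initial uncertainty of that estimate
    filtered = [estimate]

    for z in measurements[1:]:
        # Prediction: the state is expected to stay put, but its
        # uncertainty grows by the process variance.
        error_var += process_var

        # Update: the Kalman gain weights the new measurement against
        # the prediction (a weighted average of the two).
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain

        filtered.append(estimate)
    return filtered
```

Only the previous estimate and its variance are carried from one step to the next, which is why the filter can process a stream of measurements as they arrive. A larger measurement_var relative to process_var gives a smoother but slower-reacting result.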

If the noise has a Gaussian distribution, no specific settings are needed in KNIME.

Figure: Kalman filter with Gaussian noise

See the effect of the Kalman filter algorithm on the data below:

Figure: Result of Kalman Filter

THE RESULT

After data cleansing and enrichment, the original data is transformed into sets that are consistent with the other data sets in the BI ecosystem.

  • Eliminated cost of false conclusions and misdirected investments due to incorrect and inconsistent data
  • Minimized manual database maintenance work
  • High-quality data ready for
    • validation, harmonization and standardization
    • data visualization to get actionable insights
    • enterprise decision-making and planning

NEXT STEPS

  • Check more KNIME news, usage and development cases on knime.org/blog.
  • IX KNIME-Spotfire Integration simplifies data research by allowing users to iterate quickly and safely, and it eliminates the manual errors of loading data from KNIME into TIBCO Spotfire. It is a great tool for accelerating Spotfire and KNIME adoption in analytics-driven organizations.
  • For more information contact IX Automation and Optimization Consultants at spotfire@infomatix.net