Why Your Data Scientist Can’t Spin Your IoT Data Into Gold

Data Science, as described at the first international workshop on Data Science for Internet of Things, part of the IEEE International Conference on Mobile Ad hoc and Sensor Systems (MASS) in October 2016, is "an interdisciplinary field that involves techniques to acquire, store, analyze, manage, and publish data. Data can be analyzed using machine learning, data analysis, and statistics – optimizing processes and maximizing their power in larger scenarios." A key finding of the workshop's organizers was that well-planned IoT data management shapes the success of any IoT project: it determines whether researchers can reproduce scenarios and optimize the acquisition, analysis, and visualization of the data collected by IoT devices.

Good Data Management Enables Better Data Science

Moving forward, more data cleaning, pre-processing, and exploratory data analysis will be automated. With this in mind, the future value of insights produced by data scientists and their tools will increasingly depend on the data management capabilities and architecture of the IoT system itself.

“The monetary value of an IoT system increases at a rate directly proportional to the system’s ability to enable data scientists to learn from incoming data and then rapidly operationalize those learnings.”

In many organizations, there is a one-way street from the IoT system to the data science team: here's a mountain of raw data for you to clean and learn something from. See you again tomorrow. Same place, same time, same messy data flow.

In a proper IoT data management architecture, cleaning methods are iteratively automated and incorporated into the body of the system, so data scientists can dive straight into deriving actionable insights from each new batch rather than repeating manual pre-processing. Critically, the insights themselves are operationalized as well, automatically bringing the next round of challenges to the forefront.
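To make that concrete, here is a minimal Python sketch of the pattern. The `CleaningRule` and `IngestPipeline` names are hypothetical illustrations, not any particular product's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

Reading = dict  # a single sensor event, e.g. {"sensor_id": ..., "value": ...}

@dataclass
class CleaningRule:
    """A reusable cleaning step, e.g. one a data scientist validated in a notebook."""
    name: str
    apply: Callable[[Reading], Optional[Reading]]  # return None to drop the reading

class IngestPipeline:
    """Runs every registered rule on each incoming reading, in order."""
    def __init__(self):
        self.rules: list[CleaningRule] = []

    def register(self, rule: CleaningRule) -> None:
        # Once registered, the rule runs automatically on all future batches,
        # so nobody has to re-run it by hand tomorrow.
        self.rules.append(rule)

    def process(self, reading: Reading) -> Optional[Reading]:
        for rule in self.rules:
            reading = rule.apply(reading)
            if reading is None:
                return None  # dropped by a rule
        return reading

# Example: operationalize "discard physically impossible temperatures".
pipeline = IngestPipeline()
pipeline.register(CleaningRule(
    name="drop-out-of-range-temps",
    apply=lambda r: r if -40 <= r["value"] <= 125 else None,
))

print(pipeline.process({"sensor_id": "t-17", "value": 22.5}))   # kept
print(pipeline.process({"sensor_id": "t-17", "value": 999.0}))  # None (dropped)
```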

It’s Not Going to Get Easier

Adding to the challenge, the data scientist, AI designer, and author Ajit Jaokar calls out the importance of iterative design not just in the cloud, but at the edge as well. The combination of the volume, variety, and velocity of IoT data pouring in from sensors, along with the challenges of latency and connectivity, encourages a design where trained models (rules, recommendations, scores, etc.) are created in one location and deployed at multiple points.
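A bare-bones sketch of that "train centrally, deploy to many edge points" pattern might look like the following; the toy threshold model and the `publish_to_edge` helper are stand-ins for whatever training and device-management machinery a real system would use:

```python
import json

def train_threshold_model(history: list[float]) -> dict:
    """Stand-in for cloud-side training: derive a simple anomaly threshold."""
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    return {"mean": mean, "threshold": mean + 3 * variance ** 0.5}

def publish_to_edge(device_url: str, model: dict) -> None:
    """Hypothetical transport; in practice this might be MQTT, HTTP,
    or a device-management API."""
    payload = json.dumps(model)
    print(f"deploying {payload} -> {device_url}")

# Train once in the cloud...
model = train_threshold_model([21.0, 21.4, 20.9, 21.2, 21.1])

# ...then deploy the same artifact to every edge point, where it can score
# readings locally and avoid round-trip latency and connectivity gaps.
for url in ["edge://plant-a/gateway", "edge://plant-b/gateway"]:
    publish_to_edge(url, model)
```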

In their paper Data Science and Machine Learning in the Internet of Things and Predictive Maintenance, the Data Science Group at SAP makes it clear that data science faces "many new challenges in the domain of IoT while at the same time the traditional challenges have not gone away." They also note that the major activity in the data science process is "identifying, accessing, and preparing data for analysis." So what separates an IoT system that serves as a flywheel for business innovation and growing revenue from a Rumpelstiltskin-esque nightmare of failed promises to spin data into gold? It is the ability to feed each day's learnings back into the system so the team can pursue the next higher order of value rather than repeat the same process ad infinitum.

How can this be accomplished?

The 3 Keys

First, as mentioned above, the ability to operationalize cleaning algorithms and learned insights is paramount to building a system that becomes more valuable over time as a function of the volume of data collected. This applies both to inputs (the raw sensor data and information from third-party sources such as tide levels and fuel prices) and to the outputs of the current generation of analytics and machine learning tools. A continuous loop means iterative improvement.

Second, storing the incoming events as immutable data points (including metadata) in a single source of truth not only ensures a complete audit and debugging trail, but also allows the data to be modeled for analysis from many angles. Normalization and transformation become dramatically easier: clean, "synthetic" values are provided to downstream systems while the original raw inputs are preserved for deeper analysis and troubleshooting down the road.
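One way to picture this, sketched below with a hypothetical `EventStore`: raw events are frozen at ingest, and synthetic values live alongside them rather than overwriting them:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the raw event can never be mutated after ingest
class SensorEvent:
    sensor_id: str
    timestamp: datetime
    raw_value: float
    metadata: dict = field(default_factory=dict)

class EventStore:
    """Append-only single source of truth; derived values live beside,
    never over, the raw inputs."""
    def __init__(self):
        self._events: list[SensorEvent] = []
        self._derived: dict[int, dict] = {}  # event id -> synthetic values

    def append(self, event: SensorEvent) -> int:
        self._events.append(event)
        return len(self._events) - 1

    def add_derived(self, event_id: int, name: str, value: float) -> None:
        # Normalized/cleaned ("synthetic") values for downstream systems.
        self._derived.setdefault(event_id, {})[name] = value

    def get(self, event_id: int) -> tuple[SensorEvent, dict]:
        return self._events[event_id], self._derived.get(event_id, {})

store = EventStore()
eid = store.append(SensorEvent("t-17", datetime.now(timezone.utc), 71.6,
                               metadata={"unit": "F", "firmware": "2.3"}))
store.add_derived(eid, "temp_c", (71.6 - 32) * 5 / 9)  # normalized for downstream use
print(store.get(eid))  # raw event plus its synthetic companion values
```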

Lastly (for the purposes of this article), the notion of trust must be foundational to the IoT system architecture, covering both security and provenance (is the data from an authorized source?) and reliability (is the data accurate?). The ability to flag data, at a system level, as trusted or untrusted protects downstream analytics and other enterprise systems, and it is critical to driving actions and deriving insights that benefit the business. Furthermore, the trust flag for any piece of data must be editable by the system for any point in time: if you learn that an air temperature sensor was detached and immersed in water for seven weeks before being repaired, you need to be able to exclude that sensor's data from historical reports for only that seven-week period, for every user and system that queries the data, while leaving all other periods intact. Without the ability to clean data retroactively, much of the hard work of data scientists never makes it back into the main body of the system.
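Returning to the detached-sensor example, a simplified sketch of such a retroactive trust flag might look like this; the `TrustRegistry` abstraction and the dates are hypothetical:

```python
from datetime import date

class TrustRegistry:
    """Tracks per-sensor time ranges whose data should be excluded from queries."""
    def __init__(self):
        self._untrusted: list[tuple[str, date, date]] = []

    def mark_untrusted(self, sensor_id: str, start: date, end: date) -> None:
        # Retroactive: flips the trust flag for already-stored history
        # without deleting or altering the raw data itself.
        self._untrusted.append((sensor_id, start, end))

    def is_trusted(self, sensor_id: str, when: date) -> bool:
        return not any(s == sensor_id and start <= when <= end
                       for s, start, end in self._untrusted)

registry = TrustRegistry()
# The detached air temperature sensor from the example above:
registry.mark_untrusted("air-temp-04", date(2023, 3, 1), date(2023, 4, 19))

readings = [("air-temp-04", date(2023, 2, 15), 4.1),
            ("air-temp-04", date(2023, 3, 20), 11.8),  # during the bad period
            ("air-temp-04", date(2023, 5, 2), 16.3)]

# Every user and system sees the same filtered view; the raw history stays intact.
report = [r for r in readings if registry.is_trusted(r[0], r[1])]
print(report)  # the March reading is excluded; all other periods remain
```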

Putting it all together

At a high level, such a system will look like this:

[Figure: IoT data flow]

To learn more about designing industrial IoT systems that promote effective data science and become more valuable over time, get in touch with our team and let us know how we can help.
