We began this series of posts with insights into the many stakeholders involved in IoT initiatives, and considerations for designing the data processing infrastructure required to support those outcomes. Then we introduced the concept of an IoT data pipeline platform, followed by an iterative approach for discovering value from IoT data. Next, we provided an experience report on the latest ML tools from Microsoft Azure and Amazon Web Services (AWS). In this article, we bring it all together with a deeper technical dive into the data pipeline itself.
As shared earlier in the series, you’ll need to manage the complexity that arises from balancing tactical and strategic requirements during design and implementation. Technology choices are an important part of the decision-making process. Making smart choices now will make it possible to satisfy future requirements while also reducing risk in the current implementation. You’ve started to look at the technology offerings from cloud providers like AWS and Azure. The number of choices is dizzying, and some services seem to overlap. How do you know what will work best, be most efficient, and ensure a successful initiative?
Where to start
You need a good starting point for choosing the technology that can help your project the most in the short and long term. Which applications will be used to analyze, report, or incorporate the IoT data? For example, reporting applications might want views of operational data requiring access to databases. Or, data scientists working in Python might need access to historical information. Is a specific data format most convenient for these tools? Also, if data is to be extracted from ERP systems, then additional tools or components might be needed. Will software automation help pre- or post-process IoT data? Identifying the tools and applications consuming the data as well as the most accommodating data formats are first steps to finding the best mix of technologies to support your users.
Gathering Data
To begin, IoT data must be ingested into the cloud. The capabilities of devices and sensors to stream IoT data over the internet vary greatly, and a variety of patterns exist for enabling modern and legacy IoT devices to transmit data to the cloud. How IoT data is transmitted to the cloud is out of scope for this blog post. Let’s assume that messages, formatted as, say, XML or JSON, can be transmitted to the cloud providers over an internet connection. Once IoT data can be exposed to them, Azure IoT Hub and AWS IoT Core offer secure services such as routing streams of ingested data, sending messages to devices, and providing secure, durable endpoints.
For legacy devices incapable of supporting the requirements of Azure IoT Hub or AWS IoT Core, Azure Functions, AWS Lambdas, or other web servers can be exposed as public HTTP(S) endpoints for IoT data capture. Although this is a less automated process than IoT Hub or IoT Core, Functions and Lambdas offer flexibility for truly custom data processing exposed through configuration and code.
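As a rough illustration of the custom endpoint approach, here is a minimal sketch of a Python handler in the style of an AWS Lambda function. It assumes an API Gateway proxy-style event (a `body` string plus an optional `isBase64Encoded` flag). The required fields `device_id` and `timestamp`, the helper `_response`, and the 202 status are assumptions made for this example, not requirements of either cloud provider.

```python
import base64
import json

# Hypothetical fields every message must carry in this sketch.
REQUIRED_FIELDS = ("device_id", "timestamp")


def _response(status, payload):
    """Build a proxy-style HTTP response with a JSON body."""
    return {"statusCode": status, "body": json.dumps(payload)}


def handler(event, context=None):
    """Accept one JSON message posted by a device over HTTP(S)."""
    body = event.get("body") or ""
    try:
        if event.get("isBase64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        message = json.loads(body)
    except ValueError:
        # Covers malformed base64, bad UTF-8, and invalid JSON.
        return _response(400, {"error": "body is not valid JSON"})

    if not isinstance(message, dict):
        return _response(400, {"error": "expected a JSON object"})

    missing = [name for name in REQUIRED_FIELDS if name not in message]
    if missing:
        return _response(400, {"error": "missing fields", "fields": missing})

    # Placeholder: a real function would forward the message to a queue,
    # stream, or datastore here before acknowledging it.
    return _response(202, {"accepted": message["device_id"]})
```

Rejecting malformed input at the edge keeps later pipeline stages simpler, at the cost of putting some validation logic into the endpoint itself.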
IoT data may need to be transformed during ingestion and before being persisted. Certain types of transformation make perfect sense to automate while ingesting data. Unit conversions, such as converting values to metric units, and standardizing on a common data format, such as JSON, are good examples. Other types of processing, like augmenting IoT data with enterprise data, should be kept out of the ingestion process. Generally, transformations should be quick and should benefit the data regardless of which application or tool consumes it. If ingestion was implemented with Azure Functions or AWS Lambdas, then this code is a great place to add such logic. Alternatively, Amazon Kinesis or Azure Stream Analytics enable data transformation on IoT data streams and integrate with standard ingestion components.
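The two example transformations, a unit conversion and standardizing on JSON, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes flat XML or JSON payloads, and the field names `temp_f` and `temp_c` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def fahrenheit_to_celsius(value_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (value_f - 32.0) * 5.0 / 9.0


def parse_message(raw):
    """Parse a flat XML or JSON payload into a dict."""
    text = raw.strip()
    if text.startswith("<"):
        root = ET.fromstring(text)
        # Assumes one level of child elements, e.g. <reading><temp_f>..</temp_f></reading>.
        return {child.tag: child.text for child in root}
    return json.loads(text)


def normalize(raw):
    """Return the message as JSON text with temperatures in metric units."""
    message = parse_message(raw)
    if "temp_f" in message:
        temp_f = float(message.pop("temp_f"))
        message["temp_c"] = round(fahrenheit_to_celsius(temp_f), 2)
    return json.dumps(message, sort_keys=True)
```

For example, `normalize('<reading><device_id>pump-1</device_id><temp_f>212</temp_f></reading>')` yields a JSON object with `device_id` and a `temp_c` of 100.0. Both steps are cheap and useful to every downstream consumer, which is the test the paragraph above suggests for ingestion-time logic.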
Persisting Data
A variety of options exist for persisting IoT data. Factors that influence the best option include the structure (or lack of structure) of the data, volume of data, throughput requirements, organizational support, and query performance. Highly structured data is a good fit for an RDBMS. NoSQL databases, like Azure Cosmos DB or Amazon DynamoDB, are a good fit for semi-structured data or for cases where fields within ingested messages will change over time or message by message. Blob storage, like Amazon S3 or Azure Data Lake Storage (ADLS), is a third option when data is totally unstructured, when a less expensive storage option is required, or when you are looking for a cold storage solution.
A time series database should be seriously considered for telemetry data, but it is more involved to implement and support. A number of good time series databases exist, including but not limited to InfluxDB, Timescale, and OpenTSDB.
The level of support varies across the plethora of persistence options offered by cloud providers. It’s best to compare options by maintainability, cost, performance, flexibility, and ease of integration. Straightforward integrations exist for cloud-native solutions such as Amazon S3 and Azure ADLS Gen2, Amazon DynamoDB and Azure Cosmos DB, or an Azure-managed SQL database.
In contrast, a self-managed time series database or custom RDBMS must be run on AWS EC2 or Azure VM instances. For inserting data into the datastores, Lambdas or Functions provide a serverless option, while a microservice or app server approach may be a better fit for some architectures. Each solution offers different storage formats, and consuming applications may handle some data formats more efficiently than others. A hybrid solution involving more than one persistence option could be the best fit. Find the right tool for the job.
Applications consuming persisted IoT data will require API access. As we discussed in Persisting Data, many options exist, and the same is true for providing API access. The choice depends on factors such as maintainability, application support, and the types of data stores implemented. When connecting to an RDBMS and executing SQL, the obvious choices are JDBC, ODBC, and other database connectors. This works well for reporting applications, like Microsoft Power BI.
Reports and models performing analysis on the data may require additional data from ERP systems in order to give context to IoT data. Typically, this information is best kept in the ERP system or data warehouse until required by the application using it. Exposing endpoints – e.g. REST – will allow applications to consume it on demand. Federating data sources can be taken one step further with Presto. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. For example, Presto can even join Athena views of S3 bucket data with views in RDBMS.
Events and notifications
Data can be dirty when it’s ingested. Missing or nonsensical values from sensors or device components are possible. Also, as is the case with any messaging architecture, the quality of service (QoS) of the message delivery mechanism can vary. Suppose IoT devices are communicating over HTTP via a WiFi connection directly to AWS or Azure. Intermittent drops in coverage or in the internet connection may cause duplicate messages to be delivered. As mentioned in Gathering Data, if ingested messages are processed by Azure Functions or AWS Lambdas, the code can filter out dirty data or duplicate messages. Similarly, this filtering and duplicate-detection logic can be written with Amazon Kinesis or Azure Stream Analytics.
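A minimal sketch of this kind of filtering in Python follows. It assumes each message carries a `device_id`, a per-device `message_id`, and a numeric `temp_c` reading. Those field names, the plausible range, and the in-memory cache are all assumptions for illustration. In a serverless function, memory is not reliably shared between invocations, so a real deployment would likely need an external store for the seen-message keys.

```python
from collections import OrderedDict


def is_clean(message, low=-50.0, high=150.0):
    """Reject missing, non-numeric, or out-of-range readings."""
    value = message.get("temp_c")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    # NaN fails this comparison as well, so it is rejected too.
    return low <= value <= high


class Deduplicator:
    """Remember recently seen message keys, evicting the oldest first."""

    def __init__(self, max_size=10000):
        self._seen = OrderedDict()
        self._max_size = max_size

    def is_duplicate(self, message):
        key = (message["device_id"], message["message_id"])
        if key in self._seen:
            return True
        self._seen[key] = None
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # drop the oldest key
        return False


def filter_messages(messages, dedup):
    """Yield only clean messages that have not been seen before."""
    for message in messages:
        if is_clean(message) and not dedup.is_duplicate(message):
            yield message
```

Bounding the cache keeps memory use predictable, with the trade-off that a duplicate arriving after its key has been evicted will pass through.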
Notification of certain events occurring in IoT data streams is a common requirement. Integration with email, SMS, Slack, and other popular messaging systems is well supported with programmatic APIs and cloud services. Whether notifications are customer-focused or internal, straightforward methods are available to templatize, construct, and send them.
Consideration should be given to where and how to capture the events. Triggers can be set up against databases or blob storage (Amazon S3 or Azure ADLS), which allow code to determine whether a notification is appropriate. Likewise, if using Azure Stream Analytics or Amazon Kinesis, events can be gleaned from their data streams and code executed in response. Custom applications, using our SpringBoard Cloud platform accelerator for example, have built-in event capture and integration for notifications. Lastly, topics or queues, like Azure Service Bus or Amazon SNS/SQS, can be very useful when creating microservices dealing with event capture and notifications.
Some requirements call for orchestrating data flowing among different components with conditional logic. Custom calculations or complex algorithms may operate against data streams. Machine learning algorithms may require access to recently ingested IoT data in order to make predictions. Alternatively, microservices may pre-calculate values for reporting applications to improve performance. A more involved chain of components may execute to route, transform, and operate on data. In AWS, Data Pipeline and Step Functions allow workflows to be created. Similarly, in Azure, Data Factory and Logic Apps are applicable workflow tools. Loosely coupling services with Azure Service Bus or Amazon SNS/SQS supports good architectural practice.
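To make event capture and templated notifications concrete, here is a small Python sketch. It is only one possible shape. The threshold rule, the `temp_c` field, and the `notify` callable are assumptions for this example. In practice `notify` would wrap whichever email, SMS, Slack, or topic/queue client you use.

```python
from string import Template

# Hypothetical message template for a high-temperature alert.
ALERT_TEMPLATE = Template(
    "Device $device_id reported $temp_c C, above the $limit C limit"
)


def detect_events(message, limit_c=80.0):
    """Return the notification texts triggered by one message."""
    temp_c = message.get("temp_c")
    if temp_c is not None and temp_c > limit_c:
        text = ALERT_TEMPLATE.substitute(
            device_id=message["device_id"], temp_c=temp_c, limit=limit_c
        )
        return [text]
    return []


def process(messages, notify, limit_c=80.0):
    """Run event detection over messages and send each notification.

    `notify` is any callable that accepts the notification text.
    """
    sent = 0
    for message in messages:
        for text in detect_events(message, limit_c):
            notify(text)
            sent += 1
    return sent
```

Passing `notify` in as a callable keeps detection separate from delivery, so the same rule can feed a list in tests or publish to a topic or queue in production.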
Building a data pipeline platform is complicated. Picking tools to match needs and capabilities is critical for both short- and long-term success. Spend time researching a variety of tools and vendors before making big technology decisions. Many choices exist, which only compounds the confusion about which software is the best fit. Your goal is to connect the dots between an IoT device and the applications consuming the IoT data with a coherent, long-lived, and adaptable architecture.
For more best practices or hands-on guidance for building an IoT data pipeline platform, contact us today and we’ll be happy to talk through options as well as ways to accelerate digital strategy, AWS and Azure IoT architecture designs, and implementation of your overall edge to cloud solution.