In our previous post, we showed how applying the principles of Zero Waste Engineering™ increases the impact of IoT data science initiatives. This article provides a survey of some of the IoT data science tools from AWS and Microsoft Azure that are most commonly used in the context of industrial enterprise IoT solutions.
Both AWS and Azure offer managed machine learning services that promise easy ML development and deployment. Having developed and deployed machine learning models with previous generations of cloud-based ML services, we decided to take a few of the newer offerings for a test drive using a recent IoT prediction problem.
We started by streaming device data into both Azure IoT Hub and AWS IoT Core, which ultimately landed in a data lake – Azure Data Lake Storage Gen2 and AWS S3, respectively. This is also where we housed data sourced from other systems required to contextualize the device data. You can read about constructing an IoT data pipeline in a previous post.
Be aware that for this exercise we focused on integrating data science into IoT workflows rather than on cutting-edge data science itself. As experienced industrial IoT software engineers, we’ve spent a lot of time on this problem, and know how critical it is for creating value from connected systems. Our goals were to:
- Explore the IoT data
- Train a model
- Deploy the model as an individual prediction service endpoint
- Deploy the model as part of a batch scoring workflow
We’re sharing our findings and lessons learned from a “first timer” perspective below. Spoiler alert – both Azure Machine Learning and AWS SageMaker delivered a solid experience and we were quite pleased with the results.
Using local tools
Many data explorations start with local tools and a subset of data, so it’s worth mentioning that both AWS and Azure make it very easy to download data from their respective data lakes for local processing. You can also access the cloud data lakes directly with local tools via their SDKs.
First, I downloaded the data to my local machine directly from the S3 console and Azure Data Lake Gen2 portal, then read it into Anaconda Jupyter notebooks (the individual version is free) for local data explorations and visualizations. However, any tool you’re comfortable with that can handle your data is perfectly fine provided you can get it to tell you what you want to know; sometimes a spreadsheet is all you need to get started. Simplify, simplify.
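To illustrate, the first pass over a freshly downloaded extract often amounts to a few lines of pandas. This is a minimal sketch; the column names and values here are hypothetical stand-ins for the real device data:

```python
import pandas as pd

# Hypothetical extract; in practice this would be something like
# pd.read_csv("device_telemetry.csv") on the file pulled from the data lake.
df = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d2"],
    "temperature": [71.2, 98.6, 69.9, 70.4],
    "failed": [0, 1, 0, 0],
})

# Quick sanity checks: shape, per-device stats, and class balance.
print(df.shape)
print(df.groupby("device_id")["temperature"].mean())
print(df["failed"].value_counts())
```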
Microsoft Azure data science offerings
Azure Databricks is a fully managed Apache Spark–based analytics, data engineering, and data science platform. It allows you to query, explore, and analyze very large files and data sets, and to join disparate data sources in data lakes. The integrated notebook interface supports Python, R, SQL, .NET, Java, and Scala, and lets you develop and train models against huge data sets. It also provides the ability to process live data streams, manage compute clusters and models, and schedule jobs.
Our activity in Databricks focused on data exploration, ad hoc analysis, and non-ML data processing. We accessed our data lake using dbutils, then used a combination of PySpark and SQL (spark.read.json, create temp view, spark.sql) to query and join the IoT and other data sources, creating data frames with the result sets for further manipulation and visualization.
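The querying and joining described above can be sketched roughly as follows. The paths, column names, and view names are hypothetical, and `spark` is the active SparkSession that Databricks provides in every notebook:

```python
def join_iot_with_context(spark, telemetry_path, assets_path):
    """Read raw JSON from the data lake, register temp views, and join
    device telemetry with contextual asset data via Spark SQL."""
    telemetry = spark.read.json(telemetry_path)
    telemetry.createOrReplaceTempView("telemetry")

    assets = spark.read.json(assets_path)
    assets.createOrReplaceTempView("assets")

    # Returns a Spark DataFrame for further manipulation and visualization.
    return spark.sql("""
        SELECT t.device_id, t.event_time, t.temperature, a.site, a.asset_model
        FROM telemetry t
        JOIN assets a ON t.device_id = a.device_id
    """)
```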
We did the same kind of data querying and processing via notebooks on an Azure HDInsight cluster as well, with no issues.
Azure Machine Learning
Azure Machine Learning (AzureML), like Databricks, is a fully managed service supporting the full machine learning development lifecycle, providing the ability to train, deploy, automate, manage, and track ML models, experiments, endpoints, and pipelines.
AzureML offers deep learning, hyperparameter tuning, and automated machine learning where you provide a dataset, identify the task and the metric you want to optimize, and the service then experiments with several algorithms to determine the best model. You can set boundaries around the process; the service tracks all of the experiments, and the models are explainable. This provides a head start, especially if you’re not a data scientist, saving hours of initial exploration and manual experimentation. However, the resulting models still need to be validated, tuned, and optimized – it’s no silver bullet.
You can interact with the service through the AzureML Studio web portal, the CLI, or the SDK. AzureML supports Jupyter notebooks/JupyterLab with Python or R. The enterprise version also offers a no-code/low-code interface called the Designer (in preview as of this writing), geared toward those just getting their feet wet in data science: instead of coding in notebooks, you drag and drop modules that perform different tasks onto a canvas to author your ML activity (screenshot below). The enterprise version provides a graphical interface for automated ML as well.
The Studio is an IDE for managing everything in your AzureML world. Workspace access is managed via typical Azure IAM access control, making it easy to collaborate. It provides the means to manage data sources and datasets, notebooks, compute resources, models, endpoints, and pipelines, and to track experiments (training jobs, parameter optimization, etc.). The interface is fairly intuitive and easy to use once you’ve oriented yourself with the docs.
I worked with the Studio and Jupyter notebooks (Python) in my interactions with the service, and found it easy to access the data lake for data exploration and feature creation. Data access is done through datastores: I set one up for our Gen2 data lake, then created a dataset from it using Dataset.Tabular.from_delimited_files. I then loaded the dataset into a pandas dataframe and was on my way.
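A minimal sketch of that data-access path, assuming the v1 AzureML Python SDK; the datastore name and lake path are hypothetical:

```python
def load_lake_data(workspace, datastore_name, lake_path):
    """Create a tabular dataset from an ADLS Gen2 datastore and load it
    into pandas (AzureML Python SDK v1)."""
    # Imported inside the function so the sketch stays importable without
    # the azureml-core package installed.
    from azureml.core import Datastore, Dataset

    datastore = Datastore.get(workspace, datastore_name)
    dataset = Dataset.Tabular.from_delimited_files(path=(datastore, lake_path))
    return dataset.to_pandas_dataframe()
```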
Ours was a supervised classification problem. I performed manual feature creation with pandas and used scikit-learn’s logistic regression to train a model. Parameter tuning experiments can be set up and results tracked in the console. Once I had a model I was happy with, I serialized it with pickle and registered the resulting .pkl file as a model in the workspace.
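The training step itself was nothing exotic. Here is a minimal sketch with synthetic data standing in for the engineered IoT features; the real model was then registered in the workspace (e.g. via the SDK’s Model.register):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the engineered IoT feature set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Serialize for registration in the AzureML workspace, e.g.
#   Model.register(workspace, model_path="model.pkl", model_name="device-failure")
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```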
Be forewarned, deployment can be a little trickier. A deployment consists of a registered model, a runtime environment, an execution environment, and an execution (scoring) script. Deployment itself is typically easy, but troubleshooting is difficult when things don’t work right out of the gate. Error messages returned from invoking the endpoint are often confusing: for instance, I received a 502 (Bad Gateway), which I initially took for some sort of networking issue, but it was caused by a runtime error in the scoring script. Although this is documented in a troubleshooting guide, there can still be a lot of trial and error; I found myself wishing for better debugging tools at this stage.
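For reference, the v1 Python SDK wires those pieces together roughly like this. The service name, environment file, and script name are hypothetical; the scoring script (with its init() and run() entry points) is where my runtime error was hiding:

```python
def deploy_model(workspace, model):
    """Deploy a registered model to an ACI endpoint (AzureML SDK v1)."""
    # Imported inside the function so the sketch stays importable without
    # the azureml-core package installed.
    from azureml.core import Environment
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AciWebservice

    # score.py must define init() and run(raw_data); an exception inside
    # run() can surface as an opaque 502 from the endpoint.
    env = Environment.from_conda_specification("scoring-env", "env.yml")
    inference_config = InferenceConfig(entry_script="score.py", environment=env)
    deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

    service = Model.deploy(
        workspace, "device-failure-svc", [model], inference_config, deployment_config
    )
    service.wait_for_deployment(show_output=True)
    return service
```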
I also set up an ML pipeline to utilize the model for a batch scoring process. The pipeline takes a source input data file name, runs steps for data prep/feature creation and batch scoring, and writes an output file for further processing. Constructing and debugging the ML pipeline was difficult working from the docs alone, but I found a set of tutorials on the MSLearn site with good explanations. I then incorporated the ML pipeline into an Azure Data Factory pipeline triggered when a file is added to the data lake.
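A rough sketch of assembling such a pipeline with the v1 SDK; the step names, script names, and parameter are hypothetical placeholders for our prep and scoring scripts:

```python
def build_batch_pipeline(workspace, compute_target):
    """Assemble a two-step batch scoring pipeline (AzureML SDK v1)."""
    # Imported inside the function so the sketch stays importable without
    # the azureml packages installed.
    from azureml.pipeline.core import Pipeline, PipelineParameter
    from azureml.pipeline.steps import PythonScriptStep

    # The input file name is passed in when the pipeline is triggered.
    input_file = PipelineParameter(name="input_file", default_value="latest.csv")

    prep = PythonScriptStep(
        name="prep-features",
        script_name="prep.py",
        arguments=["--input", input_file],
        compute_target=compute_target,
    )
    score = PythonScriptStep(
        name="batch-score",
        script_name="score_batch.py",
        compute_target=compute_target,
    )
    score.run_after(prep)

    return Pipeline(workspace=workspace, steps=[score])
```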
AWS machine learning offerings
Amazon SageMaker is a fully managed machine learning service that provides the ability to build, train, and deploy machine learning models. SageMaker offers traditional ML, deep learning, supervised and unsupervised learning, automated ML, and hyperparameter tuning. It also offers a data labeling service utilizing Mechanical Turk, multi-model deployments, a model compiler (Neo) for creating very portable deployments, and human-augmented AI.
You can interact with SageMaker through the website console, the CLI, or the SDK. From the console you can optionally set up SageMaker Studio, which AWS calls an IDE for managing your ML activity. The AWS Studio interface reminds me of Azure’s Studio; it acts almost, but not quite, like a wrapper around the rest of the service. It’s also not easy to find (see image below).
In general, I found SageMaker more complicated and less intuitive to work with than the AzureML service, although such differences are the kinds of things both Microsoft and AWS continuously iterate on. The current docs and training materials also seemed a bit sparse, though you can count on this improving over time as well.
We targeted the same functionality in SageMaker that we did in AzureML – developing a model from IoT device data in a data lake and deploying it to an endpoint to serve single predictions, and also using the model for batch predictions.
It was very easy to get a notebook instance (EC2) up and running and to access our data lake via JupyterLab notebooks. I used a Python notebook and the boto3 library to iterate through the targeted S3 bucket subdirectories and files to create one large dataset, which I then manipulated with pandas.
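That iteration pattern looks roughly like this; the bucket and prefix are hypothetical, and the real data layout may call for different parsing:

```python
import pandas as pd


def load_bucket_frames(bucket, prefix):
    """Concatenate every CSV object under an S3 prefix into one DataFrame."""
    # Imported inside the function so the sketch stays importable without
    # boto3 installed or AWS credentials configured.
    import boto3

    s3 = boto3.client("s3")
    frames = []
    # Paginate so buckets with more than 1000 objects are handled correctly.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            frames.append(pd.read_csv(body))
    return pd.concat(frames, ignore_index=True)
```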
Our device failure prediction is a binary classification problem, and I used SageMaker’s built-in XGBoost algorithm to develop my model. I did nothing beyond minimal data cleanup and quickly had a pretty accurate model trained that was easy to deploy and test, probably due in no small part to my choice of XGBoost (no scoring script to write and debug). Training job logs were accessible via CloudWatch.
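Training against the built-in container can be sketched as follows with the SageMaker Python SDK; the instance type, algorithm version, and S3 URIs are hypothetical choices, not necessarily what we ran:

```python
def train_xgboost(role, region, train_s3_uri, output_s3_uri):
    """Train SageMaker's built-in XGBoost container on CSV data in S3."""
    # Imported inside the function so the sketch stays importable without
    # the sagemaker package installed.
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    # Resolve the built-in algorithm's container image for this region.
    image = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

    est = Estimator(
        image_uri=image,
        role=role,
        instance_count=1,
        instance_type="ml.m5.large",
        output_path=output_s3_uri,
    )
    # Binary classification; the built-in algorithm needs no scoring script.
    est.set_hyperparameters(objective="binary:logistic", num_round=100)
    est.fit({"train": TrainingInput(train_s3_uri, content_type="text/csv")})
    return est
```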
I deployed my model as an endpoint to serve single predictions, which I then tested via a notebook; it worked fine. You can also use Postman to construct the signed header (these are not intended to be public endpoints), although we’ve had authorization problems here. If you’d like to go down this route, here are some tips to help you along.
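From a notebook, invoking the endpoint amounts to something like the sketch below; the endpoint name is hypothetical, and SigV4 signing is handled by boto3, which is why Postman needs the signed header constructed manually:

```python
def predict_one(endpoint_name, csv_row):
    """Invoke a deployed SageMaker endpoint with one CSV-formatted row."""
    # Imported inside the function so the sketch stays importable without
    # boto3 installed or AWS credentials configured.
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_row,  # e.g. "71.2,0.4,13.0"
    )
    return resp["Body"].read().decode("utf-8")
```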
The next task was to run batch predictions using the same model. Setting up the batch transform job was pretty simple: I created input and output directories in an S3 bucket and uploaded the input file. It took only a few lines of code in a notebook to create my batch transform run, with the expected results written back out to S3. I had a couple of initial failures due to input file issues and was able to access all run logs via CloudWatch.
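Those few lines look roughly like this, assuming a trained SageMaker estimator like the one above; the instance type and URIs are hypothetical:

```python
def run_batch_transform(estimator, input_s3_uri, output_s3_uri):
    """Score a CSV file of rows in bulk with a SageMaker batch transform job."""
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.large",
        output_path=output_s3_uri,
    )
    # split_type="Line" lets the container score the CSV row by row;
    # results are written back to the output path in S3.
    transformer.transform(input_s3_uri, content_type="text/csv", split_type="Line")
    transformer.wait()
```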
Finally, I incorporated the batch predictions into a workflow triggered by the addition of a file to the data lake, using a Lambda function with an S3 trigger. I wrote the function in Python, using a boto3 SageMaker client to create the transform job. It was very straightforward.
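A sketch of that Lambda handler is below; the environment variable names and instance type are hypothetical choices for illustration:

```python
import os


def handler(event, context):
    """S3-triggered Lambda: launch a batch transform job for the new object."""
    # Imported inside the function; boto3 is preinstalled in the Lambda runtime.
    import boto3

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    sm = boto3.client("sagemaker")
    sm.create_transform_job(
        # Job names must be unique; derive one from the request id.
        TransformJobName=f"score-{context.aws_request_id[:8]}",
        ModelName=os.environ["MODEL_NAME"],
        TransformInput={
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{key}",
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        TransformOutput={"S3OutputPath": os.environ["OUTPUT_S3_URI"]},
        TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
    )
```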
Just for fun, I also kicked off an automated ML job using the Studio. Autopilot cranked through my data, ran for a couple of hours, invoking an astounding number of jobs, then offered up its best suggestion for my classifier model. Autopilot shows its work: it created two notebooks in the process so you can see exactly what was done. Cool.
Both Azure Machine Learning and AWS SageMaker live up to their promises of easy development and deployment of ML models; we were able to develop and deploy our IoT data models on both services without much difficulty. Therefore, our recommendation regarding Azure vs. AWS is to use the cloud platform your enterprise already depends on. Alternatively, if that’s still an open decision, start with whichever platform your team is most comfortable with to kick off your IoT data analysis efforts.
For more best practices or hands-on guidance for gaining insights from your IoT data, contact us today and we’ll be happy to talk through options as well as ways to accelerate digital strategy, AWS and Azure IoT architecture designs, and implementation of your overall edge to cloud solution.