Build and Scale Data-Driven Operational ML Pipelines with Pure Storage and Iguazio

jboothomas
6 min read · Oct 13, 2021


In this blog I will cover some recent work done with the great folks at Iguazio

Pure Storage and Iguazio work together to automate MLOps and cut the time to market for data science initiatives by delivering consistent, reliable performance and simplicity at scale. For more background on machine learning at scale, from experimentation to production, and on why data science projects become a challenge to implement as they are deployed into real business environments and scaled across the organization, please refer to “How to Bring Breakthrough Performance and Productivity to AI/ML Projects.” Learn how Pure Storage and Iguazio can help you overcome these challenges by adopting a “production-first” mindset and establishing the foundation to succeed with operational AI, with a solution that scales simply alongside your requirements.

By combining MLOps (machine learning operations) automation with the benefits of disaggregated high-speed all-flash storage that scales and evolves with your data science requirements, you can free yourself from the management burden of your full ML stack and focus on the outcomes: bringing AI-driven insights to your users.

Benefits of deploying Pure Storage and Iguazio together:

  • Robust managed MLOps environment at enterprise scale
  • Data and ML pipeline automation for streamlined operations
  • Disaggregated architecture that optimizes compute and storage efficiency
  • Consolidated infrastructure that reduces data center (or cloud) footprint and environmental impact
  • Storage that scales with your datasets and ML training
  • Improved data science and analytics workflows with shared scalable storage
  • Online and offline feature store to simplify ML feature engineering

From an ML workflow perspective, users can seamlessly transition from exploration on datasets of any size, to ML feature engineering, training, deployment in live environments, and monitoring at scale (see figure 1).

Figure 1

Leading ML with a “Production First” Approach

Often, enterprises begin with model development, only to discover that their toolset is insufficient to get their models to production when working in live environments with production data at scale.

Savvy enterprises today are starting to think about a production-first approach. They then build the foundation — both in terms of workflows and tooling — to support their needs as they evolve and scale. The Iguazio MLOps platform combined with Pure Storage provides the foundation needed to succeed with AI in production, abstracting away many of the complexities of MLOps and making it easy to fully operationalize your ML operational pipeline, freeing your team to focus on the business logic and not the underlying infrastructure.

Introduction to the Iguazio MLOps Platform

Before we get into how we enable all these benefits, let’s look at the Iguazio platform itself. The Iguazio MLOps platform is an enterprise-grade, integrated, and secure data science and MLOps platform as a service (PaaS) that simplifies development, accelerates performance, facilitates collaboration, and automates the deployment and management of ML applications in production at scale (see figure 2).

Figure 2

The platform incorporates a large number of components, which include:

  • A data science workbench that includes Jupyter, analytics engines, and Python ML packages such as scikit-learn, Matplotlib, NumPy, PyTorch, and TensorFlow
  • An MLOps orchestration framework built on MLRun (open source) for ML model management with experiment tracking and pipeline automation
  • An integrated feature store with the ability to track and monitor ML models
  • An integrated multi-model data layer: an extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming; the underlying NVMe storage is Pure Storage FlashArray//X
  • Integrated data analytics and processing tools such as Apache Spark, Presto, Pandas, and Dask, all integrated with the Iguazio data layer for small datasets and with Pure Storage FlashBlade for large datasets
  • A real-time serverless functions framework for model serving (Nuclio)
  • Managed data and machine learning (ML) services over scalable Kubernetes with Portworx software-defined storage capabilities

A high-level view of the Pure Storage and Iguazio integration points we will now cover is shown in Figure 3. We implement a disaggregated architecture: FlashArray for the Iguazio data layer, Portworx on the Kubernetes ML nodes, and FlashBlade for the datasets. A disaggregated architecture simplifies operations, reduces infrastructure footprint (cooling, power, rack space), and improves agility by allowing compute or storage to be scaled independently in response to changing conditions (see figure 3).

Figure 3

Pure Storage and Iguazio Data Layer

With Pure Storage FlashArray//X, storage can be provisioned directly to the Iguazio data nodes, which then benefit from all-flash performance, simplicity, availability, durability, and non-disruptive upgrades, in addition to data reduction capabilities. This reduces the overall architecture footprint compared to scaling out across multiple servers packed with local drives; the savings extend to rack space, heat, and power, but also to network ports, cabling, and more.

The data nodes used for the Iguazio database can also be used for features and datasets. As your machine learning journey progresses, datasets will grow, and the distributed nature of ML compute will require a scale-out approach to accommodate that growth. Pure Storage recommends a fast, scalable NFS or S3 storage solution built on the FlashBlade platform for these large datasets (more on this later).

The FlashArray-backed shared storage layer on the Iguazio data nodes should be used for exploration across smaller subsets of your data. Our testing showed a data reduction ratio of 13:1 for the New York City taxi dataset (2009 to 2021; initial size 235 GB across 144 CSV files), which is great for reducing the storage footprint of your data scientists’ playground.
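The storage savings from a reduction ratio like that are easy to quantify; a quick sketch, using the measured ratio and dataset size from our test:

```python
# Effective physical footprint of the NYC taxi dataset after FlashArray
# data reduction (figures from the test described above).
dataset_gb = 235        # initial logical size: 144 CSV files, 2009-2021
reduction_ratio = 13    # measured data reduction ratio (13:1)

effective_gb = dataset_gb / reduction_ratio
print(f"Physical footprint: ~{effective_gb:.1f} GB")  # ~18.1 GB
```

In other words, the 235 GB playground consumes roughly 18 GB of physical flash, leaving the rest of the array free for other exploration work.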

This disaggregated architecture allows the compute nodes to scale according to the actual machine learning workloads while maintaining a smaller footprint across the data nodes. Storage-array-level snapshots and replication can also be leveraged for added protection on the platform.

Pure Storage and Iguazio Application Layer

As with many modern workloads, models may eventually be containerized, with persistent storage used for their various outputs. The Iguazio application nodes are in essence a Kubernetes cluster, so storage can be provisioned into containers using Persistent Volume Claims (PVCs). Persistent volumes are a simple way to add data persistence to stateful applications.
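A PVC is just a small declarative request for storage. A minimal sketch of what such a request looks like, built here as a plain Python dict so the shape is easy to see (the claim name, size, and storage class are placeholder assumptions, not values from our environment):

```python
# Minimal sketch of a Kubernetes PersistentVolumeClaim manifest, built as a
# plain dict. "px-replicated" is a placeholder for whatever Portworx
# StorageClass is defined on your cluster.
def make_pvc(name: str, size_gi: int, storage_class: str = "px-replicated") -> dict:
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }

pvc = make_pvc("model-artifacts", 50)
print(pvc["spec"]["resources"]["requests"]["storage"])  # 50Gi
```

Serialized to YAML and applied with `kubectl`, a claim like this is all a containerized model needs to get durable storage for its outputs.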

Portworx by Pure Storage is the leading storage platform for any Kubernetes environment (cloud, on-premises, or hybrid) and can be installed onto the Iguazio application nodes to enable simple, performant, persistent storage at scale for all your applications. For our testing, FlashArray LUNs were used as the backing block devices over which Portworx created its software-defined storage layer.

With Portworx in place, enterprise features such as volume replication, encryption, and resizing can provide production-quality benefits to the overall solution.
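Those enterprise features are typically switched on per StorageClass. As an illustrative sketch only (the class name is a placeholder, and the `repl`/`secure` parameters and `pxd.portworx.com` provisioner reflect Portworx conventions but should be checked against your installed version):

```python
# Illustrative Portworx StorageClass manifest as a plain dict.
# "px-replicated-secure" is a placeholder name; verify parameter names
# against your Portworx version's documentation before use.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "px-replicated-secure"},
    "provisioner": "pxd.portworx.com",       # Portworx CSI provisioner
    "parameters": {
        "repl": "2",       # keep two replicas of each volume
        "secure": "true",  # encrypt volumes at rest
    },
    "allowVolumeExpansion": True,            # enables online resizing
}

print(storage_class["metadata"]["name"])  # px-replicated-secure
```

Any PVC that references this class would then inherit replication, encryption, and resizability without per-application configuration.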

We recognize that an AI/ML workflow has I/O requirements that call for a storage layer delivering multidimensional performance, and FlashBlade was created with this in mind (see figure 3). FlashBlade provides shared access to fast object (S3) or file (NFS/SMB) storage so you can work with your datasets in Iguazio or from within other applications.

Pure Storage and ML Workloads

ML workloads can leverage the local data layer, Portworx storage, or scalable S3/NFS storage from FlashBlade. As previously indicated, the local data layer or even a persistent volume can be used when operating at small scale, but when you need to distribute your training across multiple compute nodes, parallelization at the storage layer becomes a necessity.
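The decision logic above can be sketched as a small helper. The size and worker thresholds here are arbitrary examples chosen for illustration, not recommendations, and the backend names are placeholders:

```python
# Illustrative helper: pick a storage target for an ML job based on dataset
# size and degree of parallelism. Thresholds and backend names are example
# assumptions, not tuning guidance.
def pick_storage(dataset_gb: float, workers: int) -> str:
    if workers > 1 or dataset_gb > 100:
        # Distributed training or large datasets need shared scale-out storage.
        return "flashblade-s3"
    if dataset_gb > 10:
        # Mid-sized single-node jobs can use a Portworx persistent volume.
        return "portworx-pvc"
    # Small exploratory work fits comfortably on the Iguazio data layer.
    return "local-data-layer"

print(pick_storage(235, 4))  # flashblade-s3
print(pick_storage(5, 1))    # local-data-layer
```

The key point is the first branch: as soon as multiple workers read the same dataset in parallel, a shared scale-out layer stops being optional.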

Pure Storage FlashBlade was created with this in mind: a high-performance, distributed, scale-out storage platform for all modern workloads, where we want to feed our GPUs, TPUs, or CPUs with data as fast as they can process it and not be bottlenecked by the dual-controller storage architectures of old.

Leveraging FlashBlade from within an MLRun job is simple, using either a shared RWX volume mapped into the container(s) running the job or the S3 parameters required to access our S3 bucket(s). The multidimensional performance of FlashBlade is well suited for ingestion, cleaning and transformation, and training, and its shared nature simplifies the exploration stage of an AI/ML workflow.
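Concretely, the two access paths come down to either a set of S3 connection parameters or a volume/mount pair. A minimal sketch of what an MLRun job would be handed in each case (the endpoint, bucket, and PVC names are placeholder assumptions; in MLRun these would typically be applied via the function's environment variables or a PVC mount helper):

```python
# Two hypothetical ways to expose FlashBlade storage to an MLRun job.
# All names (endpoint, credentials, bucket, PVC) are placeholders for
# your environment.

# Option 1: S3 parameters, typically injected as environment variables
# or project secrets so the job can read s3:// URIs directly.
s3_params = {
    "S3_ENDPOINT_URL": "http://flashblade.example.local",  # FlashBlade data VIP
    "AWS_ACCESS_KEY_ID": "<access-key>",
    "AWS_SECRET_ACCESS_KEY": "<secret-key>",
}
dataset_uri = "s3://datasets/nyc-taxi/"

# Option 2: a shared RWX (ReadWriteMany) NFS-backed claim, expressed as the
# PVC/mount pair a Kubernetes pod spec would reference.
nfs_mount = {
    "pvc_name": "flashblade-datasets",
    "mount_path": "/mnt/datasets",
}

print(dataset_uri)  # s3://datasets/nyc-taxi/
```

With option 1 every worker pod talks to the bucket over S3; with option 2 every worker sees the same files at the same path, which is what makes shared exploration so straightforward.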

To configure an MLRun workload with FlashBlade, see the accompanying blog post.

Conclusion

Together, Pure Storage and Iguazio deliver a performant, scalable, full MLOps stack that’s ready for everything from exploration to production workloads.


jboothomas

Infrastructure engineering for modern data applications