Data ingestion options for Azure Machine Learning workflows

In this article, you will learn about the pros and cons of data ingestion options available with Azure Machine Learning.

Data ingestion is the process of extracting unstructured data from one or more sources and then preparing it for training machine learning models. It is also time consuming, especially when done manually and with large amounts of data from multiple sources. Automating this effort frees up resources and ensures that your models train on the most current and relevant data.

The following options are available:

  • Azure Data Factory pipelines

  • The Python SDK for Azure Machine Learning

Azure Data Factory

Azure Data Factory provides native support for data source monitoring and triggers for data ingestion pipelines.

The following table summarizes the pros and cons of using Azure Data Factory for your data ingestion workflows.

Advantages:

  • Specifically built to extract, load, and transform data
  • Lets you create data-driven workflows to orchestrate data movement and transformations at scale
  • Integrated with various Azure tools, such as Azure Databricks and Azure Functions
  • Natively supports data ingestion triggered by the data source (a hedged trigger sketch follows this list)
  • Keeps the data preparation and model training processes separate from each other
  • Embedded data lineage capability for Azure Data Factory dataflows
  • Provides a low-code user interface for non-scripting approaches

Disadvantages:

  • Currently offers a limited set of Azure Data Factory pipeline tasks
  • Expensive to build and maintain; for details, see the Azure Data Factory pricing page
  • Doesn't run scripts natively; instead, it relies on separate compute to run them
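To illustrate the native triggering capability, here is a minimal sketch that creates a storage-event trigger with the azure-mgmt-datafactory Python package, so that a new blob in a container starts an ingestion pipeline. The resource names (my-rg, my-factory, mystorageaccount, IngestPipeline) are placeholder assumptions, not values from this article.

```python
# A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory
# packages and an existing factory and pipeline; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

subscription_id = "<subscription-id>"   # placeholder
resource_group = "my-rg"                # assumed resource group
factory_name = "my-factory"             # assumed data factory

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Fire the ingestion pipeline whenever a new blob lands in the raw-data container.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/raw-data/blobs/",
    scope=(
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        "/providers/Microsoft.Storage/storageAccounts/mystorageaccount"
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="IngestPipeline")
        )
    ],
)

client.triggers.create_or_update(
    resource_group, factory_name, "on-new-blob", TriggerResource(properties=trigger)
)

# Activate the trigger (begin_start in recent SDK versions; older versions use start).
client.triggers.begin_start(resource_group, factory_name, "on-new-blob").result()
```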

The following steps outline Azure Data Factory's data ingestion workflow.

  1. Pull the data from its sources.

  2. Transform the data and save it to an output blob container, which serves as the data store for Azure Machine Learning.

  3. With the prepared data stored, the Azure Data Factory pipeline invokes a machine learning pipeline that receives the prepared data for model training (a sketch of publishing such a pipeline follows these steps).
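To make step 3 concrete, here is a minimal sketch of publishing an Azure Machine Learning pipeline that Azure Data Factory can then call through its Machine Learning Execute Pipeline activity. It assumes the azureml-core package (SDK v1), a local workspace config.json, an existing compute target named cpu-cluster, and a hypothetical train.py script.

```python
# A minimal sketch, assuming the azureml-core package (SDK v1); the compute
# target and script names are assumptions, not values from this article.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Training step that consumes the data Azure Data Factory already prepared
# and stored in the output blob container.
train_step = PythonScriptStep(
    name="model-training",
    script_name="train.py",        # hypothetical training script
    source_directory="scripts",
    compute_target="cpu-cluster",  # assumed existing compute target
)

published = Pipeline(workspace=ws, steps=[train_step]).publish(
    name="training-pipeline",
    description="Invoked by Azure Data Factory once data preparation completes",
)

# Azure Data Factory's Machine Learning Execute Pipeline activity can call
# this published pipeline through its REST endpoint.
print(published.id, published.endpoint)
```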

Learn how to build a data ingestion pipeline for machine learning with Azure Data Factory.

Python SDK for Azure Machine Learning

You can use the Python SDK to integrate data ingestion tasks into an Azure Machine Learning pipeline.

The following table summarizes the pros and cons of using the SDK and an ML pipeline step for data ingestion tasks.

Advantages:

  • Configure your own Python scripts
  • Include data preparation as part of every model training run
  • Supports data preparation scripts on various compute targets, including Azure Machine Learning compute

Disadvantages:

  • Doesn't natively support triggering on data source changes; requires a Logic App or Azure Functions implementation
  • Requires development skills to script the data ingestion
  • Doesn't provide a user interface for creating the ingestion mechanism

In this approach, the Azure Machine Learning pipeline consists of two steps: data ingestion and model training. The data ingestion step covers tasks that can be accomplished with Python libraries and the Python SDK, such as extracting data from local or web-based sources and applying data transformations, like imputing missing values. The training step then uses the prepared data as input to your training script to train your machine learning model. A minimal sketch of such a pipeline follows.
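Here is a minimal sketch of that two-step pipeline using the azureml-core package (SDK v1). The compute target name and the prepare.py and train.py scripts are assumptions for illustration.

```python
# A minimal sketch of the two-step pipeline described above, assuming the
# azureml-core package (SDK v1), a local workspace config.json, and an
# existing compute target named "cpu-cluster"; script names are hypothetical.
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Intermediate storage that hands the ingested data to the training step.
prepared = PipelineData("prepared_data", datastore=ws.get_default_datastore())

ingest_step = PythonScriptStep(
    name="data-ingestion",
    script_name="prepare.py",      # hypothetical: extracts data, imputes missing values
    arguments=["--output", prepared],
    outputs=[prepared],
    source_directory="scripts",
    compute_target="cpu-cluster",
)

train_step = PythonScriptStep(
    name="model-training",
    script_name="train.py",        # hypothetical: trains on the prepared data
    arguments=["--input", prepared],
    inputs=[prepared],
    source_directory="scripts",
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[ingest_step, train_step])
run = Experiment(ws, "ingest-and-train").submit(pipeline)
run.wait_for_completion()
```

Because the ingestion step runs as part of every submitted pipeline, each training run starts from freshly prepared data, which matches the "data preparation as part of every model training run" advantage above.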

Next steps

  • Learn how to build a data ingestion pipeline for machine learning with Azure Data Factory