Azure Data Factory | Overview

Hiya there!

This is my first article here at the AzureTar Blog and I am really excited to contribute with data content on Microsoft Azure! But let’s stop stalling and get straight to the point.

If you are familiar with SSIS (SQL Server Integration Services), or have never even heard of it but can at least run a SELECT query in SQL Server Management Studio (SSMS, for short), this article hopefully will not be hard to understand.

To complete its relational data engine, Microsoft offers not only the well-known SQL Server, but also several services that cover the different ways we manage, integrate, report on and analyse data.

SSMS (SQL Server Management Studio):
This tool is dedicated to managing and configuring any SQL infrastructure by running SQL queries, stored procedures and views against the databases.

SSIS (SQL Server Integration Services):
This service is used to execute a wide range of migration tasks like data extraction, transformation (cleaning, aggregating, joining…) and loading. It basically moves data from different sources and formats, like Excel files, .csv files, Oracle and DB2, into a SQL Server database.

SSRS (SQL Server Reporting Services):
This service allows us to create various interactive reports using graphs, maps, tabular reports, drill-down reports, click-through reports, and many more.

SSAS (SQL Server Analysis Services):
It provides us with a great view of what is happening in our databases using online analytical processing and data mining. In simple terms, we can use SSAS to create cubes from data warehouse data for deeper and faster data analysis.

To be fair, I have only used one of the four services, and that was SSMS. I would load .txt files into a SQL Server table using BULK INSERT, or an Excel file using OPENROWSET; then, with the database clean and ready to be used, I would build my reports and dashboards in either Excel or Tableau.
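Just to make that routine concrete, here is a minimal sketch of what it could look like scripted from Python with pyodbc instead of run by hand in SSMS. The server, database, table and file paths are all hypothetical placeholders, not my actual setup:

```python
# A hedged sketch of the pre-cloud routine above: loading a .txt file with
# BULK INSERT and an Excel sheet with OPENROWSET. All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=StagingDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Load a flat .txt file into a staging table.
cursor.execute(r"""
    BULK INSERT dbo.SalesStaging
    FROM 'C:\data\sales.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);
""")

# Read an Excel sheet through OPENROWSET (requires the ACE OLE DB provider
# and 'Ad Hoc Distributed Queries' enabled on the server).
cursor.execute(r"""
    INSERT INTO dbo.SalesStaging
    SELECT * FROM OPENROWSET(
        'Microsoft.ACE.OLEDB.12.0',
        'Excel 12.0;Database=C:\data\sales.xlsx;HDR=YES',
        'SELECT * FROM [Sheet1$]');
""")

conn.commit()
conn.close()
```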

As we can see, Microsoft offers us software and solutions for each step of our data process. Not too different from on-premises, Microsoft also helps us with a bunch of cloud-based alternatives, and this platform is known as Azure.

“The Azure cloud platform is more than 200 products and cloud services designed to help you bring new solutions to life – to solve today’s challenges and create the future. Build, run and manage applications across multiple clouds, on-premises and at the edge, with the tools and frameworks of your choice.”

(Microsoft Azure)

Azure Data Factory

In this article, we are going to talk about Azure Data Factory (ADF), which is the cloud-based alternative to SSIS.

A cloud-based data integration service that orchestrates and automates data movement and data transformation.

If you did not understand what that means, no worries; I am still getting my head around it as well. If it helps, I prefer this definition:
Data Integration Service: Gathers data from multiple sources and provides a consolidated outcome of them.

Basically, for those who are coming from SSIS or have played with SQL queries before, Azure Data Factory is all about extracting, manipulating and saving data. It is the first step to be taken when we begin our journey in data workloads, unless we already have everything set up in our database, which is unlikely to happen in real life. The process ADF performs is better known as Extract-Transform-Load (the famous ETL).


What is ETL?

In data engineering, the most common data integration pattern is ETL (there is also a pattern known as ELT, which is basically the same process in a different order).

EXTRACTION

Where the datasets are coming from

The resource that links to the servers where our data comes from is called a Data Source. Here we use something similar to connection strings; in other words, we connect to the sources so that we can collect data from them.
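To give a feeling for how that looks in practice, here is a minimal, hedged sketch of registering such a connection (a linked service) with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory and connection string values are all placeholders:

```python
# Hedged sketch: registering a data source connection ("linked service") in
# ADF through the azure-mgmt-datafactory SDK. Every name here is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, SecureString,
)

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

# The connection string plays the same role as on-premises: it tells ADF
# where the data lives and how to authenticate against it.
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;"
              "AccountKey=<key>;EndpointSuffix=core.windows.net")))

adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>",
    "BlobStorageLinkedService", storage_ls)
```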

TRANSFORMATION

What we want to change

This step performs transformations on the data, enriching it. Here we manipulate and change the data using processes such as aggregation, cleansing, filtering, joining, summarization and so on.

LOAD

Where we want to store the data

Once the data is modified, we can store it in a destination called a sink, such as Azure Blob Storage or Azure SQL Database. We need at least one sink transformation in every data flow because ADF does not store any data itself.
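Carrying the same hedged sketch forward, a source dataset and a sink dataset could be declared like this with the Python SDK; the container paths and dataset names are placeholders, and the linked service is the one from the previous sketch:

```python
# Hedged sketch: one dataset for the source side and one for the sink side,
# both reached through the linked service registered earlier. Placeholders only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference,
)

rg, factory = "<resource-group>", "<factory-name>"
adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

ls_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="BlobStorageLinkedService")

# Where the data comes from (the extraction side)...
source_ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref,
    folder_path="raw/movies", file_name="watch_logs.csv"))

# ...and where it lands (the sink). ADF stores nothing itself, so every
# data flow ends in at least one of these.
sink_ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="curated/movies"))

adf_client.datasets.create_or_update(rg, factory, "RawLogs", source_ds)
adf_client.datasets.create_or_update(rg, factory, "CuratedLogs", sink_ds)
```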

Putting it into an example, let’s imagine we work for a video streaming company and are asked to run some marketing analysis: customer behaviour, customer demographics and their preferences in terms of movies. We have petabytes of video log data in the cloud, customer reviews on Hadoop and their latest watched movies on MongoDB.

We could just export the files from their sources, upload them into a SQL Server database and do the job there. However, it would take us forever to export, load and transform the data. Big data needs a scalable service that can ORCHESTRATE the data movement and AUTOMATE the data transformation. In our case, we have raw and unorganized data, relational and non-relational, coming from different storage systems. That is where ADF comes in to show us its power!

At a high level (collecting from sources, processing to transform, publishing into a warehouse for analysts, and monitoring and managing daily schedules), Azure Data Factory is a set of interconnected systems that gives us an end-to-end platform for our data engineering.


What Else About ADF?

Okay! We already know that ADF helps us with data integration at cloud scale, but what we may not know yet is that with ADF we can also:

  • DRAG AND DROP: ADF offers code-free transformations;
  • WRITE CODE (if you prefer): ADF runs code on any Azure compute service;
  • USE SSIS PACKAGES: many SSIS packages run on Azure;
  • DO DATAOPS ON GITHUB or AZURE DEVOPS: source control and automated deployment for DataOps;
  • SECURE our DATA: secure data integration by connecting to private endpoints supported by various Azure data stores.

Key Concepts

ADF has six main components that make its hybrid data integration at enterprise scale easy, and they are:

  • Pipelines: A group of all the activities needed to perform one unit of work (see the sketch after this list).
  • Activities: Each step performed on the data (copy, move, transform, enrich, etc.).
  • Integration Runtimes: The compute infrastructure of ADF (Azure Integration Runtime and Self-Hosted Integration Runtime).
  • Triggers: Determine when everything starts; they work like a power switch.
  • Datasets: Named views that point to the data we use, such as files or SQL tables.
  • Linked Services: The connection information to data sources; ADF offers more than 90 connectors.
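To tie the concepts together, here is one last hedged sketch: an activity grouped into a pipeline, started on demand (a trigger could do the same on a schedule) and then monitored, reusing the placeholder names from the earlier sketches:

```python
# Hedged sketch: a copy activity inside a pipeline, a manual run standing in
# for a trigger, and a status check. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

rg, factory = "<resource-group>", "<factory-name>"
adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

# One activity (a copy step) grouped into one pipeline.
copy_activity = CopyActivity(
    name="CopyLogs",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawLogs")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CuratedLogs")],
    source=BlobSource(),
    sink=BlobSink())
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg, factory, "DemoPipeline", pipeline)

# Kick the pipeline off on demand; in production a trigger would do this.
run = adf_client.pipelines.create_run(rg, factory, "DemoPipeline", parameters={})

# Monitor: ask the service how the run is doing.
status = adf_client.pipeline_runs.get(rg, factory, run.run_id).status
print(f"Pipeline run status: {status}")
```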

If you forget any of the concepts above, that’s okay. We will cover each one throughout this series, either in further posts on the AzureTar Blog and/or through recorded hands-on videos. This was just a short introduction to ADF: what it offers and how it works.

Do not miss out, follow us and stay tuned!