Incremental Data Load with SSIS and Change Data Capture (CDC)

We can easily perform incremental data loads in SSIS using Change Data Capture (CDC).
CDC is a SQL Server tracking feature introduced in SQL Server 2008 that provides historical change activity (insert, update and delete) made to user tables.

In the following example, we will see how to develop a SSIS incremental data load package using Change Data Caputure (CDC).

* I would recommend that you have some basic knowledge in SSIS to better understand this article.

Creating OLTP, Staging and OLAP databases and tables

USE master;
GO
 
CREATE DATABASE OLTP;
GO
 
USE OLTP;
 
CREATE TABLE dbo.Customer
(
CustomerId int identity(1,1) primary key,
CustomerName varchar(50)
);
 
INSERT INTO dbo.Customer (CustomerName)
VALUES
('Adam'),
('Bob'),
('Joseph'),
('Roger'),
('Samuel');
 
CREATE DATABASE Staging;
GO
 
USE Staging;
 
CREATE TABLE dbo.Customer_insert
(
CustomerId int,
CustomerName varchar(50)
);
CREATE TABLE dbo.Customer_update
(
CustomerId int,
CustomerName varchar(50)
);
CREATE TABLE dbo.Customer_delete
(
CustomerId int,
CustomerName varchar(50)
);
 
CREATE DATABASE OLAP;
GO
 
USE OLAP;
 
CREATE TABLE dbo.Customer_DW
(
CustomerId int primary key,
CustomerName varchar(50)
);

 

Enabling Change Data Capture (CDC) for the OLTP database and Customer table

  • First we need to enable CDC for the database and then enable it for each table we want to track.
  • SQL Server Agent must be running.
  • The user must have sysadmin privileges to enable CDC for databases.
  • The user must be db_owner to enable CDC for tables.
USE OLTP;
GO
 
EXEC sp_cdc_enable_db;
GO
 
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Customer',
@role_name = NULL,
@supports_net_changes = 1;
GO



SSIS – Initial Data Load package

Considering that there are no rows in the OLAP database, we need to perform a one-time load to populate it.

The initial load package has three tasks:

  1. A CDC Control Task to mark the initial load start.
  2. A Data Flow Task to transfer the data from the OLTP database to the OLAP database.
  3. A CDC Control Task to mark the initial load end.

The following figures show how to configure the Control Flow and each task

Control Flow:

Control Flow

Control Flow with three tasks

1. CDC Control Task (mark initial load start) properties:

CDC Control Task (mark initial load) properties

CDC Control Task (mark initial load start) properties

2. Data Flow Task (load customer):

Data Flow

The Data Flow tab is composed of two objects: ADO NET Source and ADO NET Destination.

3. CDC Control Task (mark initial load end) properties:

ssis_cdc_control_task_mark_initial_load_end.jpg

CDC Control Task (mark initial load end) properties

Executing the initial load package 

Now the initial load package is ready to be executed.

Let´s check the source and destination tables before executing it:

USE OLTP;
GO
 
SELECT * FROM Customer;
GO
 
USE OLAP;
GO
 
SELECT * FROM Customer_DW;
GO
result_set

Result sets berofe execution

  • Now we need to execute the SSIS initial load package to load data from the Customer table (OLTP) to the Customer_DW (OLAP).
  • Press F5 to execute the package.
Check the results:
result_il_control_flow

Control Flow tab

 Let´s check the tables after the execution:
USE OLTP;
GO
 
SELECT * FROM Customer;
GO
 
USE OLAP;
GO
 
SELECT * FROM Customer_DW;
GO
result_set2

Result sets after execution

The initial data load worked fine! The next step is to perform the incremental data load.

SSIS – Incremetal Data Load package

This package will perform incremental loads of the change data to the OLAP database.

The incremental data load package has five tasks:

  1. An Execute SQL Task to truncate the staging tables.
  2. A CDC Control Task to get the processing range of LSNs (Log Sequence Numbers) containing the changes to be loaded.
  3. A Data Flow Task to load the staging tables.
  4. An Execute SQL Task to load change data from the Staging to the OLAP database.
  5. A CDC Control Task to mark the processed range and define where the next processing range will begin.

The following figures show how to configure the Control Flow and each task:

Control Flow:

incremental_load_control_flow

Control Flow with five tasks

1. Execute SQL Task (clean staging tables) properties:

Task1

Execute SQL Task (clean staging tables) properties

SQL Statement:

TRUNCATE TABLE Customer_delete;
TRUNCATE TABLE Customer_insert;
TRUNCATE TABLE Customer_update;

2. CDC Control Task (get processing range) properties:

Task2.jpg

CDC Control Task (get processing range) properties

3. Data Flow Task (load staging tables):

Task3

Data Flow Task (load staging tables)

4. Execute SQL Task (load customer_dw) properties:

Task4.jpg

Execute SQL Task (load customer_dw) properties

SQL Statement:

INSERT INTO Customer_DW SELECT * FROM Staging..Customer_insert;
DELETE FROM Customer_DW WHERE CustomerId IN (SELECT CustomerId FROM Staging..Customer_delete);
UPDATE c1 SET c1.CustomerName = c2.CustomerName FROM Customer_DW c1 INNER JOIN Staging..Customer_update c2 ON c1.CustomerId = c2.CustomerId;

5. CDC Control Task (mark processed range) properties:

Task5

CDC Control Task (mark processed range) properties

Executing the incremental data load package

Before executing this package, let´s make some data changes in the OLTP database:

USE OLTP;
GO
 
DELETE FROM Customer WHERE CustomerId = 4;
 
INSERT INTO Customer (CustomerName) VALUES ('Martin');
 
UPDATE Customer SET CustomerName = 'Jason' WHERE CustomerId = 1;

 

Now we can execute the following scripts to check that the OLTP and OLAP databases are different:

USE OLTP;
GO
 
SELECT * FROM Customer;
GO
 
USE OLAP;
GO
 
SELECT * FROM Customer_DW;
GO
resultset2.jpg

Result sets before executing the incremental data load. There are some differences between the databases.

  • Now we can execute the SSIS incremental data load package to load data from the Customer table (OLTP) to the Customer_DW (OLAP).
  • Right click the package in the solution explorer and then click “Set as StartUp Object”.
  • Press F5 to execute the package.
Check the results:
result_control_flow

Control Flow tab

result_data_flow

Data Flow tab

Let´s check the tables after the execution:
USE OLTP;
GO
 
SELECT * FROM Customer;
GO
 
USE OLAP;
GO
 
SELECT * FROM Customer_DW;
GO
result_set_after_execution

Result sets after the incremental data load execution

We can see that the tables are synchronized now!

I hope you enjoyed this article.

Please feel free to leave any comments, share and like it!

Thank you very much.

Rafael Juca.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s