3 Methods for Implementing Change Data Capture

Janaka Last Updated : 30 Jul, 2021

6 min read

This article was published as a part of the Data Science Blogathon

MsSQL | Oracle | Triggers and Short procedures

Photo by Edward Howell on Unsplash

If you are familiar with Change Data Capture AKA Change Data Tracking, you can just skip the introduction below and get straight to the implementation section. But, in case you’re not, let me introduce the concept first.

American Writer Mark Twain once said: “Data is like garbage”. Today some people say: data is the new oil. But I would say: data is a blooming flower.

Like a flower, data goes through different stages, and takes time to develop, or burst into bloom. Just as the time taken may differ from flower to flower, so different data sets develop at different rates. A single flower is less striking, but when it is part of a bouquet, it becomes really eye-catching. And so it is with data — that’s how companies like Google or Facebook are able to make a profit, even without charging for their services -they simply collect data ‘flowers’ and turn them into data ‘bouquets’.

However, this article is not about flowers 😊 so, let’s move right on to our topic. As I explained previously, the state of data is continuously changing over time. Change Data Capture (CDC) is a set of technologies that enable you to identify and capture the previous states of the data so that later, you have a snapshot of past data that you can refer to when taking necessary action. See the example below:

Illustration

With the increasing demand for big-data technologies, Microsoft introduced CDC with MsSQL Server 2008. Today, CDC is available in almost all popular database servers, including MsSQL, Oracle, CockroachDB, MongoDB, etc. But, manual implementation is just one way to get the job done. We can also automate the CDC using triggers and short procedures. In this article, I will discuss how to implement CDC both ways.

Why Change Data Capture is important

Yes, you guessed right, this technology is mostly used in the big-data domain to keep timely snapshots of streaming data in data warehouses. Here, we go through a process called ELT/ETL to insert our data into the data warehouse. This process is efficient with historic data, but when it comes to real-time data it causes too much latency to run complex queries. The easiest way to deal with real-time data is CDC, which enables us to keep our data warehouse up to date and make business decisions faster.

Let’s Get our Hands Dirty with Some Practical Examples

1. Implementing CDC with MsSQL

First, we need to create a new database.

CREATE DATABASE cdcDB

Next, I select the created database from the available databases dropdown, and create a few simple tables – just to demonstrate CDC implementation.

CREATE TABLE Student

StudentID int NOT NULL PRIMARY KEY,

FirstName varchar(255) NOT NULL,

LastName varchar(255) NOT NULL,

Age int,

ContactNo char(10) NOT NULL

After creating tables, I enable CDC.

EXEC sys.sp_cdc_enable_db

After CDC is enabled for the database, I then enable it on each table.

EXEC sys.sp_cdc_enable_table

@source_schema = 'dbo',

@source_name = 'Student',

@role_name = NULL,

@supports_net_changes = 1

Finally, I insert some values to my table by the following query.

INSERT INTO Student(StudentID,FirstName,LastName,Age,ContactNo)

VALUES

(10638389, 'Indrajith', 'Ekanayake', 21, 0713101658),

(10637382, 'Kamal', 'Suriyaarachchi', 22, 0765432210),

(10622388, 'Kasun', 'Chamara', 28, 0708998123),

(10638812, 'Chamara', 'Hettiarachchi', 20, 0772134446);

To confirm whether CDC has been properly implemented, we can just check either at our hierarchy panel or Jobs tab, under the SQL Server Agent.

To understand how and what data we are storing in the CDC tables, we can just update a few rows.

UPDATE Student

SET FirstName = 'Janaka', Age= '23'

WHERE StudentID = 10638389;

So, we are now looking at the CDC CT(Change Table) and we can identify that there are additional records, as seen below.

CDC CT(Change Table)

2. Implementing CDC with Oracle

First, I need to mention that I’m running Oracle 11g Enterprise Edition using docker image and I’m writing queries using Oracle SQL Developer’s latest version on the host machine. The server and Oracle SQL developer are connected via port: 1521.

To implement CDC, I begin by creating a tablespace called “ts_cdcindrajith” in my cdcAssignment folder, inside my F drive.

create tablespace ts_cdcindrajith datafile 'F:cdcAssignment' size 300m;

Then, I create a new user called “cdcindrajith”, and grant all permissions to this user.

CREATE USER cdcindrajith IDENTIFIED by cdcindrajith DEFAULT TABLESPACE ts_cdcindrajith QUOTA UNLIMITED ON SYSTEM QUOTA UNLIMITED ON SYSAUX;

GRANT ALL PRIVILEGES TO cdcindrajith;

Next, I create a table called employees.

CREATE TABLE cdcindrajith.employees

EmpID int NOT NULL PRIMARY KEY,

EmpName varchar(255) NOT NULL,

Age int NOT NULL,

ContactNo char(10) NOT NULL

Then, I implement the CDC feature.

BEGIN

DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => 'cdcindrajith.employees');

END;

BEGIN

DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(

change_set_name    => 'employees_set',

description        => 'Change set for employees change info',

change_source_name => 'SYNC_SOURCE');

END;

BEGIN

DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(

owner             => 'cdcindrajith',

change_table_name => 'employees_ct',

change_set_name   => 'employees_set',

source_schema     => 'cdcindrajith',

source_table      => 'employees',

column_type_list  => 'EmpID int,

EmpName varchar(255) ,

Age int,

ContactNo char(10)',

capture_values    => 'both',

rs_id             => 'y',

row_id            => 'n',

user_id           => 'n',

timestamp         => 'n',

object_id         => 'n',

source_colmap     => 'y',

DDL_MARKERS =>       'n',

target_colmap     => 'y',

options_string    => 'TABLESPACE ts_cdcindrajith');

END;

I create a subscription called “employees_sub”.

BEGIN

DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(

change_set_name   => 'employees_set',

description       => 'Change data for employees',

subscription_name => 'employees_sub');

END;

After that, I create a View called “employees_view” and then activated the above-created subscription.

BEGIN

DBMS_CDC_SUBSCRIBE.SUBSCRIBE(

subscription_name => 'employees_sub',

source_schema     => 'cdcindrajith',

source_table      => 'employees',

column_list       => 'EmpID, EmpName, Age, ContactNo',

subscriber_view   => 'employees_view');

END;

BEGIN

DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(

subscription_name => 'employees_sub');

END;

BEGIN

DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(

subscription_name => 'employees_sub');

END;

Finally, I insert data and also update some data, to check that the CDC has been correctly implemented.

INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)

VALUES

(10638389, 'Indrajith', 21, 0713101658);

INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)

VALUES

(10638390, 'Kumara', 27, 0711226661);

UPDATE cdcindrajith.employees

SET EmpName = 'Janaka'

WHERE EmpID = 10638389;

Towards the end, I check the results of my CDC table “employees_ct” to make sure that the CDC is correctly implemented.

SELECT EmpID, EmpName FROM cdcindrajith.employees_ct;

employees_ct table

3. Implementing CDC using Trigger (MsSQL)

First, we need to create a new database.

CREATE DATABASE empDB

Then, I create two tables: one is to maintain employee salary records, the other is to update the salary log, which does the change data capturing.

CREATE TABLE Salary

ID int identity(1,1) primary key NOT NULL,

SalDate datetime default GETDATE() NOT NULL,

Task BIGINT NULL,

PaidPerTask BIGINT NULL,

EmpName NCHAR(100) NOT NULL

CREATE TABLE SalaryLogs

ID INT PRIMARY KEY IDENTITY(1,1) NOT NULL,

SalDate DATETIME DEFAULT GETDATE() NOT NULL,

Query NCHAR(6) NOT NULL,

OldTask BIGINT NULL,

NewTask BIGINT NULL,

OldPaidPerTask BIGINT NULL,

NewPaidPerTask BIGINT NULL,

EmpName NCHAR(100) NOT NULL

I implement a Trigger called “salary_change” to save the salary logs.

GO

CREATE TRIGGER salary_change

ON Salary

AFTER INSERT, UPDATE, DELETE

AS

BEGIN

DECLARE @operation CHAR(6)

SET @operation = CASE

WHEN EXISTS(SELECT * FROM inserted) AND EXISTS(SELECT * FROM deleted)

THEN 'Update'

WHEN EXISTS(SELECT * FROM inserted)

THEN 'Insert'

WHEN EXISTS(SELECT * FROM deleted)

THEN 'Delete'

ELSE NULL

END

IF @operation = 'Delete'

INSERT INTO SalaryLogs (Query, SalDate, OldTask, OldPaidPerTask, EmpName)

SELECT @operation, GETDATE(), d.Task, d.PaidPerTask, USER_Name()

FROM deleted d

IF @operation = 'Insert'

INSERT INTO SalaryLogs (Query, SalDate, NewTask, NewPaidPerTask, EmpName)

SELECT @operation, GETDATE(), i.Task, i.PaidPerTask, USER_Name()

FROM inserted i
IF @operation = 'Update'

INSERT INTO SalaryLogs (Query, SalDate, NewTask, OldTask, NewPaidPerTask,OldPaidPerTask, EmpName)

SELECT @operation, GETDATE(), d.Task, i.Task, d.PaidPerTask, i.PaidPerTask, USER_Name()

FROM deleted d, inserted i

END

GO

Finally, I insert some data into the salary table and then try to update and delete queries as well.

INSERT INTO Salary(SalDate,Task,PaidPerTask,EmpName)

VALUES

(2020-11-20, 4, 4000 , 'Indrajith'),

(2020-10-20, 11, 10000 , 'Kusum');

UPDATE Salary

SET EmpName = 'Janaka'

WHERE EmpName = 'Indrajith';

DELETE FROM Salary WHERE EmpName = 'Kusum';

Towards the end, let’s check the salary logs table to understand whether the log files are updated or not.

SalaryLogs table

We see here that the log files are up to date.

Summary

In short, CDC identifies and captures data that has changed in tables of a source database as a result of CRUD operations. This is useful to people who need to export their data into a data warehouse or a business intelligence application. Changed data is maintained in CDC tables in the source database.

In this article, we briefly discussed what is Change Data Capture, why CDC is useful, and most importantly illustrated three methods for implementing CDC. I believe that the rapid evolution of big-data will have many more CDC implications in the future.

Thanks for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Janaka

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.6

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

3 Methods for Implementing Change Data Capture

MsSQL | Oracle | Triggers and Short procedures

Why Change Data Capture is important

Let’s Get our Hands Dirty with Some Practical Examples

1. Implementing CDC with MsSQL

2. Implementing CDC with Oracle

3. Implementing CDC using Trigger (MsSQL)

Summary

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

3 Methods for Implementing Change Data Capture

MsSQL | Oracle | Triggers and Short procedures

Why Change Data Capture is important

Let’s Get our Hands Dirty with Some Practical Examples

1. Implementing CDC with MsSQL

2. Implementing CDC with Oracle

3. Implementing CDC using Trigger (MsSQL)

Summary

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques