Change Data Capture – Informatica Mapping Logic for CDC Implementation
October 12, 2014
So, Finally here I go with an article on CDC (Change Data Capture) implementation through an informatica which had been a long waiting from my side to be posted.
Requirement – To track on the data change in the source system with the earlier data pulled into the warehouse and to track on Updates, new Inserts and Delete on the data that is no more in the source. I would continue on this CDC implementation with an another article on loading of data forward into a table to save the change history for the data user to keep a track on the change history.
Mapping Design:
The mapping I have come up with for the CDC logic would be as below,
(Click on the image for a better view)
The mapping would involve the Staging/source table which would be a daily truncate load from the source system as is with the current status on the transactions from a source. And the data are then to be sorted with respect to the primary key columns, sample table used would be as below. The sample implies on the target CDC table to have the same composite primary key columns.
Source Table:
Target Table:
The mapping would involve a simple logic to have a joiner (Full Outer Join) with the source table and the Target CDC table and to check for the matching and non matching records on the Primary key columns as is illustrated below,
This being a Full Outer Join, enables the loading of all matching and non matching transactions to flow through and then the following three expressions should be defined to check on the Change type,
Flag_DELETE:
ISNULL(TransactionID) AND ISNULL(ProductID) AND ISNULL(ReferenceOrderID) and ISNULL(ReferenceOrderLineID)
Flag_INSERT:
ISNULL(TransactionID_old) AND ISNULL(ProductID_old) AND ISNULL(ReferenceOrderID_old) AND ISNULL(ReferenceOrderLineID_old)
Flag_UPDATE:
TransactionID=TransactionID_old AND
ProductID=ProductID_old AND
ReferenceOrderID=ReferenceOrderID_old AND
ReferenceOrderLineID=ReferenceOrderLineID_old AND
(TransactionDate != TransactionDate_old OR
TransactionType != TransactionType_old OR
Quantity != Quantity_old OR
ABS(ActualCost – ActualCost_old)>1)
Flag => IIF(v_Flag_Update=1,-1,IIF(v_Flag_Insert=1,-2,IIF(v_Flag_Delete=1,-3,0)))
where ‘0’ is for no change records, and then to use a filter transformation to filter out all the no change transactions with Flag=0.
Then a Router to define the grouping as per the result of the above expression on checking if the Transaction is an INSERT, UPDATE or DELETE.
(Click on the image for a better view)
And the Update strategy transformation for each to perform and Update or Insert for each, and to note even for DELETE flags I will be going for a DD_UPDATE, which would update the CDC flag for the transactions not available in the source system to ‘D’ in the target table.
Also to note, that the Delete flow, would involve pulling the ports of CDC source table from the Router to the target and Update on the key column matching records with CDC_Flag=’D’.
And finally to make the session level changes for each target to define ‘Update as Update’ or ‘Update as Insert’ for the targets respectively. Upon running the mapping for the first time all transactions would be tagged as INSERT and then the proceeding runs will then depend on the source system changes to be captured through the CDC logic of our mapping.
Note – The DELETE transactions are the one which are no more in the source table, and hence these once after being tagged as ‘D’ for the CDC_Flag can be deleted from the target CDC table as well after making use of the change on to the History tracking table which will be covered more in detail with the next article shortly.
I will shortly be posting another phase of this topic on CDC and on tracking of history on a transaction change in your warehouse with the CDC. Hope you enjoyed reading this and helps you. Please post your comments and feedback for a better work and clear ideas. Thanks! !
October 17, 2014 at 5:42 pm
Reblogged this on Integration On My Mind and commented:
Can I use Informatica for CDC Implementation? Sure. Here’s how to do it!
April 2, 2015 at 12:27 pm
Nice Explanation, & a very informative article it is !!!
Looking forward for further articles 🙂 !
Keep Rocking !!!
Thanks,
Venkat.
April 3, 2015 at 4:42 pm
Thanks Venkat.. This gives a lot of confidence and motivation to come with many more.
April 13, 2015 at 10:01 am
Hi, if the source is not having any column like undated record, version or flag then how to implement the SCD type 2 . how we can capture the CDC .
April 21, 2015 at 3:47 pm
Hi Susant,
As far i understand your question, you wish to track a change (CDC/SCD) even if there are no Date column or flag column in the source??
If this is what you are asking, i would say we just should be clear on what will be our Primary Key columns (one or more set of columns) and keeping them we may have to just compare the other column if they have any change against the snap shot of our target loaded the previous day. As i have explained the Joiner and the Expression used after that would help you understand this better.
Let me know if this does not answers your query..!!
April 22, 2015 at 7:30 am
Hi mahaveer yes just need more clarification ,
say in Emp table data is
100 ABC MUM
101 XYZ BLR
no in the 2nd data is
100 ABC MUM
101 XYZ HYD — Change
102 DEF CHE — New
My question is lets say in source is having 10 milion records
i dont want to pull all the records i just want to pull the change
and new records from the source table
can it be achive with ur method which you described ..
i know if we have date field then we can pull d data from source using last date modified column.
April 23, 2015 at 4:46 pm
Hi Susant,
Yes this is what the purpose of having a CDC, not to depend on any Date Column to track on changes.
As i have explained here in the article, for any new insert in the source which would have a key column as in your example 100,101… defining the transaction number which can be the column to track on the new inserts.
Something as i have mentioned – ISNULL(EmployeeID_target), where you can also have more than one column to be a key column and you see up on joining a Employee_ID is available in Source and is not there in the target table and you would get that as NULL when you join and hence this record from source shall be tagged as INSERT.
Whereas to track an update you can see, the key column like Employee_ID will be present both in Source and the Target snap shot but there will be a change in any one of the other non key columns.
So we may have expression for your example like – EmployeeID_source=EmployeeID_target AND (City_Source City_target OR Department_source Department_target)
Hence you get these under a flag for UPDATE.
Hope this clarifies and answers your question..!!
April 25, 2015 at 4:31 pm
Hi Can you send me the mapping xml file for this CDC logic ?
Also i have doubt on that mapping why there is filter and router used in the pipeline where both act almost same .
:::::::::thanks ::::::::::::: regards…….susant
April 14, 2015 at 7:19 am
This the very nice explanation about CDC…
April 21, 2015 at 10:45 am
nice article. So CDC is just keeping track of last udpated time and also to maintain the latest record unlike having all the history record, correct?
April 21, 2015 at 3:44 pm
Hi Beer Mohamed, We indeed can track the history of our data which i had promised to cover in another article (a continuation of the CDC), Where i will be using the CDC flag that we have come up in this article and keep this table as base to load the next Fact table which would have the whole history of the changes with an Effective Date column saying when a change has occurred.
Hope this helps..!!
March 25, 2016 at 10:24 pm
really nice article, could you please let me know the next article on this continuation part when it will available , if available please share it.
June 27, 2016 at 4:14 pm
which is old data here .. source one or target one .. m confused 😦
August 5, 2016 at 2:00 pm
Hi Yash, When i say old data i mean the snap shot of your target table, which also is one of the source (in your warehouse or stage) in a CDC mapping, which you would join with new snap shot from your source data(outside your warehouse or stage). Thanks!
May 30, 2017 at 2:13 pm
Hi mahaveer,
did you get a chance to explain the history load with CDC, i don’t see any article on it.
June 6, 2017 at 6:28 am
Very good explanation. Thank you
July 28, 2017 at 2:55 pm
Please clarify my understanding –
Traction_History is target table which you are calling OLD data in mapping.
Traction_history_CDC is kind of staging source table which holds all the old data plus new data?
July 28, 2017 at 2:57 pm
Also if there is no key PK on table then should we compare all the columns of source and target table?