Change Data Capture – Informatica Mapping Logic for CDC Implementation

October 12, 2014

So, Finally here I go with an article on CDC (Change Data Capture) implementation through an informatica which had been a long waiting from my side to be posted.

Requirement – To track on the data change in the source system with the earlier data pulled into the warehouse and to track on Updates, new Inserts and Delete on the data that is no more in the source. I would continue on this CDC implementation with an another article on loading of data forward into a table to save the change history for the data user to keep a track on the change history.

Mapping Design:

The mapping I have come up with for the CDC logic would be as below,

CDC_Data_Load_img2

 

(Click on the image for a better view)

The mapping would involve the Staging/source table which would be a daily truncate load from the source system as is with the current status on the transactions from a source. And the data are then to be sorted with respect to the primary key columns, sample table used would be as below. The sample implies on the target CDC table to have the same composite primary key columns.

Source Table:

CDC_Data_Load_img3

Target Table:

CDC_Data_Load_img4

The mapping would involve a simple logic to have a joiner (Full Outer Join) with the source table and the Target CDC table and to check for the matching and non matching records on the Primary key columns as is illustrated below,

CDC_Data_Load_img5

This being a Full Outer Join, enables the loading of all matching and non matching transactions to flow through and then the following three expressions should be defined to check on the Change type,

Flag_DELETE:

ISNULL(TransactionID) AND ISNULL(ProductID) AND ISNULL(ReferenceOrderID) and ISNULL(ReferenceOrderLineID)

 

Flag_INSERT:

ISNULL(TransactionID_old) AND ISNULL(ProductID_old) AND ISNULL(ReferenceOrderID_old) AND ISNULL(ReferenceOrderLineID_old)

 

Flag_UPDATE:

TransactionID=TransactionID_old AND

ProductID=ProductID_old AND

ReferenceOrderID=ReferenceOrderID_old AND

ReferenceOrderLineID=ReferenceOrderLineID_old AND

(TransactionDate != TransactionDate_old OR

TransactionType != TransactionType_old OR

Quantity != Quantity_old OR

ABS(ActualCost – ActualCost_old)>1)

 

Flag => IIF(v_Flag_Update=1,-1,IIF(v_Flag_Insert=1,-2,IIF(v_Flag_Delete=1,-3,0)))

where ‘0’ is for no change records, and then to use a filter transformation to filter out all the no change transactions with Flag=0.

Then a Router to define the grouping as per the result of the above expression on checking if the Transaction is an INSERT, UPDATE or DELETE.

 

CDC_Data_Load_img1

(Click on the image for a better view)

And the Update strategy transformation for each to perform and Update or Insert for each, and to note even for DELETE flags I will be going for a DD_UPDATE, which would update the CDC flag for the transactions not available in the source system to ‘D’ in the target table.

Also to note, that the Delete flow, would involve pulling the ports of CDC source table from the Router to the target and Update on the key column matching records with CDC_Flag=’D’.

And finally to make the session level changes for each target to define ‘Update as Update’ or ‘Update as Insert’ for the targets respectively. Upon running the mapping for the first time all transactions would be tagged as INSERT and then the proceeding runs will then depend on the source system changes to be captured through the CDC logic of our mapping.

Note – The DELETE transactions are the one which are no more in the source table, and hence these once after being tagged as ‘D’ for the CDC_Flag can be deleted from the target CDC table as well after making use of the change on to the History tracking table which will be covered more in detail with the next article shortly.

I will shortly be posting another phase of this topic on CDC and on tracking of history on a transaction change in your warehouse with the CDC. Hope you enjoyed reading this and helps you. Please post your comments and feedback for a better work and clear ideas. Thanks! !

 

18 Responses to “Change Data Capture – Informatica Mapping Logic for CDC Implementation”

  1. Adi Raina Says:

    Reblogged this on Integration On My Mind and commented:
    Can I use Informatica for CDC Implementation? Sure. Here’s how to do it!

  2. Venkat Says:

    Nice Explanation, & a very informative article it is !!!
    Looking forward for further articles 🙂 !

    Keep Rocking !!!

    Thanks,
    Venkat.

  3. Susant Says:

    Hi, if the source is not having any column like undated record, version or flag then how to implement the SCD type 2 . how we can capture the CDC .


    • Hi Susant,

      As far i understand your question, you wish to track a change (CDC/SCD) even if there are no Date column or flag column in the source??

      If this is what you are asking, i would say we just should be clear on what will be our Primary Key columns (one or more set of columns) and keeping them we may have to just compare the other column if they have any change against the snap shot of our target loaded the previous day. As i have explained the Joiner and the Expression used after that would help you understand this better.

      Let me know if this does not answers your query..!!

      • Susant Says:

        Hi mahaveer yes just need more clarification ,

        say in Emp table data is
        100 ABC MUM
        101 XYZ BLR

        no in the 2nd data is

        100 ABC MUM
        101 XYZ HYD — Change
        102 DEF CHE — New

        My question is lets say in source is having 10 milion records
        i dont want to pull all the records i just want to pull the change
        and new records from the source table

        can it be achive with ur method which you described ..

        i know if we have date field then we can pull d data from source using last date modified column.


      • Hi Susant,

        Yes this is what the purpose of having a CDC, not to depend on any Date Column to track on changes.
        As i have explained here in the article, for any new insert in the source which would have a key column as in your example 100,101… defining the transaction number which can be the column to track on the new inserts.
        Something as i have mentioned – ISNULL(EmployeeID_target), where you can also have more than one column to be a key column and you see up on joining a Employee_ID is available in Source and is not there in the target table and you would get that as NULL when you join and hence this record from source shall be tagged as INSERT.

        Whereas to track an update you can see, the key column like Employee_ID will be present both in Source and the Target snap shot but there will be a change in any one of the other non key columns.
        So we may have expression for your example like – EmployeeID_source=EmployeeID_target AND (City_Source City_target OR Department_source Department_target)
        Hence you get these under a flag for UPDATE.

        Hope this clarifies and answers your question..!!

      • susant kumar Says:

        Hi Can you send me the mapping xml file for this CDC logic ?

        Also i have doubt on that mapping why there is filter and router used in the pipeline where both act almost same .

        :::::::::thanks ::::::::::::: regards…….susant

  4. pragnya Says:

    This the very nice explanation about CDC…


  5. nice article. So CDC is just keeping track of last udpated time and also to maintain the latest record unlike having all the history record, correct?


    • Hi Beer Mohamed, We indeed can track the history of our data which i had promised to cover in another article (a continuation of the CDC), Where i will be using the CDC flag that we have come up in this article and keep this table as base to load the next Fact table which would have the whole history of the changes with an Effective Date column saying when a change has occurred.

      Hope this helps..!!

  6. Madhu Chittareddy Says:

    really nice article, could you please let me know the next article on this continuation part when it will available , if available please share it.

  7. yash Says:

    which is old data here .. source one or target one .. m confused 😦


    • Hi Yash, When i say old data i mean the snap shot of your target table, which also is one of the source (in your warehouse or stage) in a CDC mapping, which you would join with new snap shot from your source data(outside your warehouse or stage). Thanks!

  8. kiran angara Says:

    Hi mahaveer,
    did you get a chance to explain the history load with CDC, i don’t see any article on it.

  9. harish Says:

    Very good explanation. Thank you

  10. Zip Zap Says:

    Please clarify my understanding –
    Traction_History is target table which you are calling OLD data in mapping.

    Traction_history_CDC is kind of staging source table which holds all the old data plus new data?

  11. Zip Zap Says:

    Also if there is no key PK on table then should we compare all the columns of source and target table?


Leave a comment