So, Finally here I go with an article on CDC (Change Data Capture) implementation through an informatica which had been a long waiting from my side to be posted.
Requirement – To track on the data change in the source system with the earlier data pulled into the warehouse and to track on Updates, new Inserts and Delete on the data that is no more in the source. I would continue on this CDC implementation with an another article on loading of data forward into a table to save the change history for the data user to keep a track on the change history.
The mapping I have come up with for the CDC logic would be as below,
(Click on the image for a better view)
The mapping would involve the Staging/source table which would be a daily truncate load from the source system as is with the current status on the transactions from a source. And the data are then to be sorted with respect to the primary key columns, sample table used would be as below. The sample implies on the target CDC table to have the same composite primary key columns.
The mapping would involve a simple logic to have a joiner (Full Outer Join) with the source table and the Target CDC table and to check for the matching and non matching records on the Primary key columns as is illustrated below,
This being a Full Outer Join, enables the loading of all matching and non matching transactions to flow through and then the following three expressions should be defined to check on the Change type,
ISNULL(TransactionID) AND ISNULL(ProductID) AND ISNULL(ReferenceOrderID) and ISNULL(ReferenceOrderLineID)
ISNULL(TransactionID_old) AND ISNULL(ProductID_old) AND ISNULL(ReferenceOrderID_old) AND ISNULL(ReferenceOrderLineID_old)
(TransactionDate != TransactionDate_old AND
TransactionType != TransactionType_old AND
Quantity != Quantity_old AND
ABS(ActualCost – ActualCost_old)>1)
Flag => IIF(v_Flag_Update=1,-1,IIF(v_Flag_Insert=1,-2,IIF(v_Flag_Delete=1,-3,0)))
where ‘0’ is for no change records, and then to use a filter transformation to filter out all the no change transactions with Flag=0.
Then a Router to define the grouping as per the result of the above expression on checking if the Transaction is an INSERT, UPDATE or DELETE.
(Click on the image for a better view)
And the Update strategy transformation for each to perform and Update or Insert for each, and to note even for DELETE flags I will be going for a DD_UPDATE, which would update the CDC flag for the transactions not available in the source system to ‘D’ in the target table.
Also to note, that the Delete flow, would involve pulling the ports of CDC source table from the Router to the target and Update on the key column matching records with CDC_Flag=’D’.
And finally to make the session level changes for each target to define ‘Update as Update’ or ‘Update as Insert’ for the targets respectively. Upon running the mapping for the first time all transactions would be tagged as INSERT and then the proceeding runs will then depend on the source system changes to be captured through the CDC logic of our mapping.
Note – The DELETE transactions are the one which are no more in the source table, and hence these once after being tagged as ‘D’ for the CDC_Flag can be deleted from the target CDC table as well after making use of the change on to the History tracking table which will be covered more in detail with the next article shortly.
I will shortly be posting another phase of this topic on CDC and on tracking of history on a transaction change in your warehouse with the CDC. Hope you enjoyed reading this and helps you. Please post your comments and feedback for a better work and clear ideas. Thanks! !
July 13, 2014
This was something i had faced couple of times in my development adding a logic to my mapping and i have always done the same mistake. Now this is why i have decided to share this with other developers, avoid using the ‘!='(Not Equal) operator in your expression as much as you can.
Let me quickly put down the real issue i had faced with my experience getting to use ‘!=’ operator and this was some real interesting learning experience getting to understand what and how really your data is being processed by the Informatica’s Integration Service.
I had a requirement, where in i had to populate 2 targets satisfying 2 different logic based on a column. So as an illustration i am taking the below records,
Now my need for the day is to populate to 2 targets as,
Target_A -> where Group_Nm=’A’
Target_B -> where Group_Nm !=’A’ and Group_Nm !=’B’ (expecting all other records other than those in Target_A and are not
So as the above logic says, i had straight away gone to put down my expressions as below, which would take all Group-A as Flag=1 and remaining records with Group name != ‘B’ would be expected to go to Flag=2.
IIF(Group_Nm = ‘A’,1,
IIF(Group_Nm != ‘B’,2,0))
I am expecting
1,3 to be tagged as Flag=1
4,5,6,7 to be tagged as Flag=2
Now what have i got in my target is,
1,3 tagged as Flag=1 flows to Target_A
6,7 tagged as Flag=2 flows to Target_B.
So it was quite clear that the issue was with the records having Group_Nm as NULL. That clearly gave a clue that the expression was not handling the NULL values as expected. So the root caus ebeing the second IF condition checking for Group_Nm != ‘B’ and hence scanning the left out records 2,4,5,6,7 it had got 6,7 satisfying the condition so what really happened to 4,5 was ‘as it happened to have NULL values these were ignored for a check and so were dropped out’ and so i had them flagged as ‘0’.
So quickly understanding all i had a change in the expression was to keep it as below,
IIF(Group_Nm = ‘A’,1,
IIF(Group_Nm = ‘B’,0,2))
So finally i had my results as expected with the right records flowing to the respective targets. Hope this made so good understanding to you. Please drop in your questions or suggestions if any.
April 26, 2014
Hello all, I am today going to discuss and brief on some key best practices that i have followed or have come across in my experience with Informatica. Most of the best practices are specific to keep low the ambiguity and make it easy for every other developer around you to understand or share the work and few with intention to have a better performance without making things complex. Let’s quickly go through them,
Shortcuts (Shared Folder) – The uses of shared folder to use the sources/targets around multiple folders keeps your repository and objects less and clean. A shared folder allows you make the best use of re-usability in Informatica and maintains consistence and standard and keeps the code in tact with less number of duplicate objects being created.
Re-usability – It’s always good to take advantage of re-usability options on Informatica with the Mapplets, Worklets, reusable transformations, tasks etc. which highly reduces effort on design and keeps things simple.
Simplicity – This is the practice to keep your mapping/code as much as simple as you can, it can be good to have multiple mapping to do things rather than one to do something more complex. Mapping or your code should be easier to understand and debug in the future.
Source-Qualifier – It’s the place where you can tap your required data and filter out unnecessary column /rows that are no were relevant to our requirement. It’s always good to tap out the data at SQ rather than allowing them to be processed and being filtered at a later part of your mapping.
Filter Transformation – It’s the best you use your Filters as close as possible to the Source, to allow lesser data to travel across your mapping and thus improving the performance of your mapping with minimal data.
Most important of all, avoid complex expressions and logic on your filter condition, it’s always good you make them all in your Expression transformation and decide with a Flag port that travels on to the Filter and makes the Filter transformation far too simple and better.
Expression Transformation – It’s always good using more operators (||,+,/) rather than Functions where ever required, and also the processing and loading of Integer type data is always the faster compared to String type. Good to make use of local variables and keep the calculations as much as simple as possible with multiple variables being used to breakdown the complex calculations to less redundant and easy to understand one.
Joiners – If possible, join the data at the SQ, and in case of must to go for a JOINER transformation make sure you use the Source with less records used as the Master and the other source as the Detail which has more records to process.
Parameterize your code – Wherever possible it’s always the best thing to do is to parameterize your code, this helps in making the code easy for any future changes to be made with no change on the Informatica code. Avoid hard coding values, and parameterize them which can ease for any changes going forward.
SQL Query – Always the most important thing to check for is how good your SQ or any SQL related process will work on the DB side. Your source/target relational object on the DB side should be good enough to handle any UPDATE/INSERT/SELECT process more effectively. Go for the SQL execution plan, and accordingly add Indexes (Clustered/Non-Clustered) as and when required on THE DB table.
Router transformation – Always use a less complex conditional expression on the Groups of your Router, avoiding any calculation/function being performed row wise at your router eating away more time and memory in running your session. Instead do your conditional checks for grouping on the expression prior to loading to the Router and just keep it to a Flag which decides the Grouping on the Router making it perform at its best. Never use a Filter after a Router, which makes it logic less use, as the Router is to perform a Group, and hence the filters can be added alongside the logic of your Grouping (which again can be handles at an expression in getting the Flags to group by).
Lookups – Also try using unconnected lookups wherever we can as it gives space for re using the same Lookup multiple times and can use the same value to test/calculate in expression. Use Filter on your Lookup source to avoid Caching in unwanted data thus optimizing the Cache use improving the performance. Try using a Joiner instead of a connected lookup wherever possible.
In case you use a Unconnected lookup and wish to return multiple values to expression, then concatenate the columns and after returning the value go for splitting the values J
Union Transformation – The lesser the Union transformation, better is your mapping performance. Try avoid using Unions and if used keep the number to minimal by trying to Union the source on SQ itself as the Union consumes a lot of Memory and time in processing being an active transformation.
I will keep updating this Post further more with some more points on best practices in Informatica i might come across. Please do share if you have any and that can help people around here. Learning and Sharing is a joy :)
This was something of a unexpected bug found in our code dealing with a mapping variable and i was unable to figure out why was the variable value not taken up as expected.
The issue was when i was trying to pass a value from a port to the Variable and use the same in the Workflow. So while i was checking at the Session level, the value was different whereas the same Port value flowing to the target was looking correct.
With some repeated run and test, i was able to catch what had really went wrong. Here i used a mapping variable defined and i had my only option to set the Aggregation option as either ‘MAX’ or ‘MIN’. So went on to select ‘MAX’.
Then comes assigning the port value to the variable with SETVARIABLE(). In doing so i was assuming the value on every run shall just be assigned to the Variable here. I was completely wrong with this assumption.
As referred in the Informatica content-Help,
“At the end of a successful session, the PowerCenter Integration Service compares the final current value of the variable to the start value of the variable. Based on the aggregate type of the variable, it saves a final current value to the repository. Unless overridden, it uses the saved value as the initial value of the variable for the next session run.”
Breaking this down for an easy understanding – The Variable $$Var will first be assigned the default value, then on the successful run when we try setting a new value say here default value was $$Var=10 and i am having a new value as ’15’ and my aggregation was type ‘MAX’ and so compares the new value with old and finds the new is greater value so $$Var=15 is assigned.
And now this value $$Var shall not change or be over written until i get a value >15. So was my issue where the new values were not assigned as expected with SETVARIABLE().
Thus my solution was to set a minimum value before the session starts, as simple as it is to just make it a override value through Parameter file (.prm file) and so i set it as ‘$$Var=0′. Now my new value gets always the greater value and is assigned to the variable with the latest :)
Hope this makes a good understanding, please put down your questions and comments if any. Thanks!!
February 27, 2014
Lets have some lights on to the basic Architecture of PowerCenter which is an ABCD learning for any Informatica Developer. Its good we know how and where our Infa objects gets created and how the way it runs in the back end.
So as we might be well aware of the multi-tier architecture of our Informatica, having a Client Server model with different services running together for what we just see on the PowerCenter client as objects created, deleted and run. As a developer, we must be well aware of the Repository database and its purpose, saves all the meta data on the objects in Informatica.
As in here we have the two major services, Repository Services and the Integration Services. The Repository services helps in communication between the PowerCenter components responding to the PC Client and getting in the Repository Database with a native driver connection for the meta data.
PowerCenter Client uses native protocol to communicate with the PowerCenter Repository Service and PowerCenter Integration Service.The PowerCenter Client uses ODBC to connect to source and target databases.
How it works:
When we use the PC Client, and we create a Domain connection with a gateway from the client tool we have the service manager in the server which then connects to the PowerCenter Repository Services with a TCP/IP connection established. Also we define the Repository Database which is defined while installation of Informatica and is connected with a native driver connection from the Repository services to the Database.
Every time a job/task runs it is run through the Client connecting with the Integration service which in turn interacts with the Repository services to get the Repository objects from the Repository Database. It uses TCP/IP to connect to the associated PowerCenter Repository Service and retrieve metadata.
The ODBC Connection in the Client/Integration services:
When ever we try creating source/target on our client machine we have an independent ODBC connections created through the client machine and the Integration Services runs with the similar way by connecting to the source through a ODBC/Native connection established and defined already.
The flow runs as, when we connect to a client (eg. Workflow Manager) a Repository connection is made with the Repository services which retrieves meta data connecting to the Repository database as per the request from the client. Then as we run a workflow, the Client connects to the Integration services and the Command is fetched by the Integration service to connect with the Source/Target through an ODBC connection to fetch the data or write the output. The IS then interacts with the Repository services for the Infa object metadata and the process runs as defined in the mapping in the Integration services.
Hope this gives a bit of clarity on the backstage working of our Informatica. Do let me know your suggestions or corrections and queries on this article.
February 25, 2014
Hello Friends, Time for some learning. This time i have taken up the Informatica Variables and Parameters which i am very much fond of :)
This time i have taken up something that i came across in one of the development my team was involved in, and there was a need to pass a value from one task/session to another as we had a check with respect to the value we get from one session to be used going forward in the process.
For instance, my requirement is to check on the count of records from session_1 and then to count on records in session_2 but with the same counter (increment from where the Session_1 count ended) ending up with the total count of records. Something like, i had 100 records in session_1 and then i have to count from there as 100+1 and so on for the session_2. Looks simple, but quite a nail biting and challenging experience while we try getting it in Informatica!!
And then we came across this approach/feature available in the Informatica ‘Pre-Session/Post-Session Variable Assignment’. Which was all about the assigning of Variable values between the Sessions and it’s Parent Worklet/Workflow and vice-versa. Interesting right? If not clear lets go with a step by step illustration for a better understanding on this.
1. All we need to do is define a Counter variable/Parameter $$v1_Count for the session_1, and then our session runs to populate $$v1_Count as a usual Mapping Variable/Parameter does.
2. Next is, to assign this value to the common Variable (that acts as an intermediate between two of our sessions) which will be the Parent Worklet/Workflow Variable defined as any other Variable in the Edit Workflow/Worklet Options. In my case i have Workflow variable $$wf_v_Count defined.
3. Now in the session_1, Components tab -> lets edit the “Post session Variable assignment” option (it can be on failure or on success),
(i) Post-session on success variable assignment
(ii) Post-session on failure variable assignment
and i define it as - “Parent_Workflow/Worklet_Variable = Mapping_Variable/Parameter”
i.e., $$wf_v_Count = $$v1_Count
So here we have the $$v1_Count value open for any task,session or Worklet under the same parent can take up and use it for processing.
4. In our case, for the next session_2, i have to go for Components tab -> “Pre Session Variable Assignment” to be defined to assign the value from the $$wf_v_Count to the respective session’s Mapping Variable/Parameter $$v2_Count.
So we define it as “Mapping_Variable/Parameter = Parent_Workflow/Worklet_Variable”
i.e., $$v2_Count = $$wf_v_Count
The same can be used all over the Parent Workflow/Worklet, in any decision tasks or on link conditions. Thus we have now passed on the value from one session to the other and can be processed thereafter. Hope this was an interesting and helpful piece to you.
Do post your comments on any corrections or queries with respect to this. Share this if it can help someone.
December 12, 2013
I have been trying to get on with this article for a long time, but have been moving around with other commitments. Finally time to share a interesting topic with all Informatica nerds.
Not so functional or a logical thing i am here with to share with you guys, it was just about how i made use of Sorter and never meant to sort things!
Confused?? OK without much ado lets get on with this..
I was in a need to compare 2 results from same source (SQL Table) and then do an update on the same, kind of a data clean up activity i was involved into. And so i was just done with my mapping design and then just went on to run the session as well. But to my surprise, all i see was the queries, 2 source tables was competing with each other and also there was an insert/update on the target end leading to lot of pressure onto the DB and hence resulting in the process to hang with data load literally halting.
Then a bulb glows (a bit late though).. and the thought was, why cant i tap the data at one point and once it reaches, then to open the tap. But how?
As we all know we have ‘The Integration Service passes all incoming data into the Sorter transformation before it performs the sort operation.’ I just made use of this feature, and as you all might have got it by now, i used 2 sorters at the 2 source flows coming in and joining and then once the source is read with different SQ SQL overrides, all the transformations are performed and finally the data is written on to the same SQL table.
Hope this gets as a Savior to you too at a similar point when you are helpless!!
Share and help others learn with a joy :)