I got this running, but it doesn't appear to be handling deleted records appropriately. I want a missing/deleted record from the source data to be marked as Active = 0 and ActiveEndTime set. However, when I run the dataflow with a source record that was previously in the data and now isn't, nothing happens. What do I need to do to handle "deleted" records?
This is really good. What I observed is that it updates the start time and end time of all old entries for that ID; ideally it should update only the most recent active entry. To avoid this, both the ID and Key columns should be used as sink key columns.
I really like this generic approach.
How would I handle the case where the source column name is different from the sink column name?
Thank you Daniel for the nice video. As you mentioned at timeline 5:05, is it possible to compute the row hash on BOTH the source and the target ONLY for the column names that are common to both? I tried something like "src_hash = md5(of only the columns that are also in the target)" and "trg_hash = md5(of only the columns that are also in the source)", but I wasn't able to get it working. Can you please do a video about this if such a thing is possible, or throw out some ideas?
In the "AddHashExisting", why is the primary key split (comma separated) but not on the CSV source?
Hi Daniel, I just have a basic question. In the source of the data flow, you don't explicitly mention the file name. Still, it works perfectly. How is that possible?
The delimited text source is reading from a parameterized folder name. All files in that folder will be read by the source.
Thanks for this video. A question related to performance: when creating the MD5 hash and comparing at run time, do you see any issue? Say my dimension table has 1 million rows and 70 columns; don't you see an issue comparing the hash columns at run time? Or do you suggest I should store the hash values in a column of the target dimension table and compare that with the source?
It's tough to say without looking at the monitoring logs! If you are doing multiple column comparisons, hash values should be quicker, though.
@@MSDataFactory Thanks a lot. I also wonder about this question. Doing a hash comparison is definitely better when we compare multiple columns, but I think the question is something different: the target dim table does not have a hash column stored, so every time we run the pipeline we recalculate the hash column for both the incoming source and the destination table, instead of storing the hash value on the destination and only generating the hash for the incoming source columns. If the dim table is huge, as the question states, wouldn't generating the hash at every run time and comparing it cause performance issues? On the other hand, if we create a hash column in the dim table, then storage will be an issue. Is that why you don't store the hash value in the dim table?
@@easyyuz2707 Calculating the hash in Spark data frames is a very fast operation. It's the I/O and database connector read/write times that will affect your flows the most. Either method is valid, although storing the hash in the Dim table is also very common.
@@MSDataFactory Thanks a lot! This is very helpful!
@@easyyuz2707 I've implemented this in both Informatica and ADF with the same MD5 approach. ADF is faster than Informatica.
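For anyone wondering what the hash expressions from this thread look like in practice, here is a minimal sketch of a Derived Column in the data flow expression language. The column names (Name, Salary, Department) are illustrative; the template in the video builds the hash generically rather than listing columns by hand:
    columns_hash = md5(Name, Salary, Department)
or, to hash every column in the stream without naming them:
    columns_hash = md5(columns())
You would add the same kind of derived column on both the incoming source stream and the existing-dimension stream, then compare the two values to detect changed rows.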
Do we need to create a separate data flow for each dimension? Because this approach can work only on a single file; what if another file has a different schema?
You need to create a metadata table in Azure SQL Database where you can define all the particulars of your dimension tables. The metadata can be configurable and can be read in your data flow's pipeline via a Lookup activity.
Unless I’ve missed something, this doesn't work: it updates all previous records with the end time. So if I have 5 previous records, it updates them all based on the PK.
Quick question: the data flow debug cluster is taking a lot of time to start.
Great demo - thanks! However, can this be made to work if the source and target column names are different? E.g. customer_id in the source and custId in the target.
Hi,
Thanks for this nice demo!
In the case of multiple CSVs with the same primary key, all new records are marked for insert, which leads to two active records. My expectation is that the current record is ended (which is the case) and that the two new records are processed in sequence so that I have only one active record.
Example:
current state in dimension table: record a is active
csv 1 has an update for record a
csv 2 has an update for record a
Both files are read and both records are inserted, which leads to the following result:
current record is set inactive
record from csv 1 is set active
record from csv 2 is set active
How can I solve this issue? I think the oldest record should be processed first.
thx
Will this work on Fabric?
Thanks for the lovely demo.
Can you please advise if there is any option where we can parameterise the data type conversion from string to int/date etc.?
E.g. passing a list of attributes which should be converted to int/date before the table load? Implicit conversion does not seem to work.
Thanks in advance
Hey Manish, you can always use column patterns to have matching conditions or infer drifted columns if the data is coming from CSV. Drifted data takes the data type of the source system.
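To make that concrete, a minimal sketch of a Derived Column with a column pattern, assuming a string-array data flow parameter named $intColumns that lists the attributes to cast (the parameter name is illustrative):
    Matching condition:  in($intColumns, name)
    Column name:         $$
    Expression:          toInteger($$)
A second pattern using toDate($$, 'yyyy-MM-dd') would cover date attributes; $$ stands for each matched column's value, and the list of names can be passed in from the pipeline.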
Unable to see this "Generic Type 2 Slowly Changing Dimension" in the template section.
Should be available from the template gallery next week
@@MSDataFactory Still not available, but it was a very good video.
This is good, but what happens when you don't have an ID column in your source? I can't find a way to insert the ID column name (from the parameters) as a new field in the data flow. Tried with a Select transformation and a Derived Column transformation, but can't set the name of the field to be dynamic based on the parameter. Any tips?
You need a key to be able to know which dimension rows to update or insert. You can use Exists to check for the existence of a dimension member by looking for it by other attributes in your existing dimension table. If the dimension member is not found, then it is new and you will generate a new ID using Surrogate Key. If the dimension member exists and properties have changed, you can then grab the ID from the target dimension table and use it to update the existing row.
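Sketched out with illustrative names (the stream names, attribute columns, and keyGenerated are assumptions; MaxSurrogateKey is the column the template already carries for the current maximum key):
    Exists transformation, "Doesn't exist", custom expression:
        source@Name == existingDim@Name && source@BirthDate == existingDim@BirthDate
    Surrogate Key transformation on the new rows: key column keyGenerated, start value 1
    Derived Column to produce the final key:
        Key = keyGenerated + MaxSurrogateKey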
Heya, how do the parameters for the Generic DataSets get set? We don't seem to be passing them in the way I would in a pipeline?
The parameters for the generic dataset get passed in as part of the Execute Data Flow activity. These parameters are created in the dataset.
Is the surrogate key column 'Key' an identity column in the SQL database? Do you need to make the surrogate key in the table an identity for this scenario? I was thinking that if you use an MPP database, then you can't have a sequential number.
In ADF, "Key column" is just being used as a way to uniquely identify which rows to update in the target database. So the target database, like an MPP, is not required to support primary key as a column property.
@@MSDataFactory Thanks. Actually, I was talking about the 'Key' column (surrogate key), not the unique business key (ID) which you treated as the primary key. I believe you need the surrogate key. So the question was: are you using IDENTITY for the surrogate key?
@@rahuldhip2002 This example is generating the surrogate key inside the data flow logic itself using the Surrogate Key transformation, not the database Identity column property. The primary key parameter is just the name of the database key column for matching.
@@MSDataFactory Thanks for your explanation. It means that whenever I create a surrogate key in the database, I should not make that column an identity. That is my takeaway.
@@rahuldhip2002 You can still do that, but when you map that identity column in a data flow sink, you will need to check "Skip writing key columns" or use a pre-processing script to enable identity write.
How do we pass composite primary key columns in a parameterized way? In the above example, we have ONLY one column parameterized as 'ID'. What if I have composite primary key columns like 'ID, Code'? I tried this approach and below is the error:
"message":"Job failed due to reason: at Sink 'sink1': org.apache.spark.sql.AnalysisException: cannot resolve `source.AIDNLIFDMD,NSEQ` in search condition given columns...
Here the composite primary key columns are AIDNLIFDMD & NSEQ
Any suggestions would be much appreciated, please.
In Sink Settings under "Key columns", select "Custom expression". Use a string array parameter like: ['id','name']
@@MSDataFactory Thanks for your response, team. But I could not find the "Custom expression" option in Sink Settings under "Key columns". It says "Import schema to get the column list". When I give ['key1','key2'], it is not validated and it says "Invalid type, expected string and provided string[]". Sorry, I could not attach my screenshot in the comment here.
Appreciate your support on this.
@@balanm8570 This only works with database sinks and when you have delete, update, or upsert turned on. See if you are able to open my screenshot here: github.com/kromerm/adfdataflowdocs/blob/master/images/keycols.png
@@MSDataFactory Thanks, team, for your timely response. Really appreciate it.
Do we have any plan or ETA by when we can expect this feature for Delta Sink?
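For reference, a minimal sketch of the array-parameter approach suggested above; pKeyCols is an illustrative parameter name:
    Data flow parameter:  pKeyCols as string[], default value ['ID']
    From the pipeline, pass the key list for the dimension being loaded, e.g. ['AIDNLIFDMD','NSEQ']
    Sink > Settings > Key columns > Custom expression:  $pKeyCols
As noted, this only applies to database sinks with update, upsert, or delete enabled.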
I did the exact same thing and it is failing at NewAndUpdatedRows, during the columns_hash comparison on Salary. Any thoughts?
I believe the data type should be declared as VARCHAR for the 'Salary' column in the Dim table. I also had this problem, as I had declared the Salary data type as DECIMAL. Although the values are the same, MD5 returned different values. I then tried changing the data type to VARCHAR, and it worked.
Bottom line is...
Can someone provide the link to the template mentioned at the end of the video? Where do I find the Data Factory SCD Type 2 templates?
In the ADF or Synapse UI, you will use the template gallery. In the gallery, you can find the template or use the search bar.
I have used the exact same flow for my transformation, but I am getting duplicates on the 2nd run... Should AddHashExist and AddHashExisting be different? In my case they are different for the same data row. When I run the pipeline a 2nd time I get duplicates. What would be the reason?
Any suggestions?
To avoid the dups, you have to add one more condition in that branch: the existing hash != the incoming hash.
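In expression terms, something like the following in the Filter or Conditional Split that feeds the insert branch (columns_hash comes from the template; the name of the existing-row hash column is an assumption, so check it against your own flow):
    columns_hash != columns_hash_existing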
In SSIS this is very easy to accomplish; why is it still so painstakingly cumbersome in ADF?
This is a pattern for a generic SCD Type 2 for any dimension table, using schema drift and column pattern transformations, which is not possible in SSIS.
It's really helpful.
Hello Daniel, it's a wonderful example of Type 2. You're the only one so far who has made a clear video about it. I have done it with my own data. It adds new records perfectly; only updated records do not get an end date, and the Active field is not updated from 1 to 0. Weirdly enough, in the data preview it does exactly what it should, but in the real run unfortunately not. Any idea how I can fix that?
I am having the exact same problem, did you manage to find a solution?
@@leidagoncalves6403 Unfortunately not. At this point we work with a SQL database, so my workaround is to do the SCD Type 2 with a SQL script, which works perfectly. When we start working with a data lake instead of a SQL database, we will have to work with data flows, and then this will be a problem. I created an issue with Microsoft, but they did not answer it. On the internet I was not able to find other websites or other YouTube tutorials about this specific Slowly Changing Dimension Type 2 case, only Type 1. Some were there, but totally unclear.
@@RonaldPostelmans Thank you very much. I am actually sinking the data to CosmosDb but having the exact same issue. It seems that some other step or configuration is needed for updates and deletes but I need to investigate further. Thanks.
@@RonaldPostelmans It's the last step (the sink): you need to enable Update and set the key column using the PrimaryKey parameter - see 11 mins 41 secs of the video. I hope this helps someone else, as I'll not get my 2 hours back!
@@leidagoncalves6403 see my post below
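To spell out that fix, a rough sketch of the pieces that matter; the transformation names here are illustrative rather than the exact ones from the template:
    Alter Row on the branch that closes out old records:  updateIf(true())
    Sink > Settings:  Allow update checked, Key columns set from the PrimaryKey parameter
    Allow insert stays on for the rows that land as the new active version
Without the update policy and the key columns set, the changed rows never get applied as updates, which matches the symptom described above.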
For dropping columns, can you please send me the code or logic?
To drop columns, use the Select transformation.
!in(['id_hash','columns_hash','MaxSurrogateKey'],name) I hope that helps
@@MSDataFactory They know that; you didn't show the code on the screen to use in the Select.
@@shauntkhalatian429 Thank you, it works.
@@crispy9489 There is a link to the actual pipeline template in the video description that has all of the code
nice video