Generic Type 2 Slowly Changing Dimension using Mapping Data Flows

  • Published: 28 Dec 2024

Comments •

  • @jodyclaggett786
    @jodyclaggett786 3 years ago +2

    I got this running, but it doesn't appear to be handling deleted records appropriately. I want a missing/deleted record from the source data to be marked as Active = 0 and ActiveEndTime set. However, when I run the dataflow with a source record that was previously in the data and now isn't, nothing happens. What do I need to do to handle "deleted" records?

  • @yogeshnikam8064
    @yogeshnikam8064 2 years ago +1

    This is really good. What I observed is that it updates the start time and end time of all old entries for that ID; ideally it should update only the most recent active entry. To avoid this, both the ID and Key columns should be used as the sink key columns.

  • @jorglang7883
    @jorglang7883 6 months ago

    I really like this generic approach.
    How would I handle the case where the source column name is different from the sink column name?

  • @ubaddala
    @ubaddala 4 years ago

    Thank you Daniel for the nice video. As you mentioned at timeline 5:05, is it possible to compute the row hash on BOTH the source and the target ONLY for the column names that are common to both? I tried something like "src_hash = md5(of only columns that are also in target)" and "trg_hash = md5(of only columns that are also in the source)", but I wasn't able to. Can you please do a video about this, if such a thing is possible, or throw out some ideas?
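
One possible shape for this, sketched in the data flow expression language: assume a hypothetical string-array data flow parameter $CommonColumns that lists the column names present on both sides (byNames() accepts parameter substitution), and add the same derived column on the source stream and on the existing-dimension stream:

    src_hash = md5(toString(byNames($CommonColumns)))
    trg_hash = md5(toString(byNames($CommonColumns)))

The parameter itself could be built upstream in the pipeline (for example via a Lookup or Get Metadata activity) and passed into the Execute Data Flow activity.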

  • @vickyvinay23
    @vickyvinay23 3 years ago

    In the "AddHashExisting", why the primary key is split (Comma Seperated) but not on the CSV source?

  • @AllAboutBI
    @AllAboutBI 4 years ago

    Hi Daniel, I just have a basic question. In the source of the data flow, you don't explicitly mention the file name, yet it still works perfectly. How is that possible?

    • @MSDataFactory
      @MSDataFactory  4 years ago +2

      The delimited text source is reading from a parameterized folder name. All files in that folder will be read by the source.

  • @rahuldhip2002
    @rahuldhip2002 4 years ago +1

    Thanks for this video. A question related to performance: when creating the MD5 hash and comparing at run time, do you see any issue? Say my dimension table has 1 million rows and 70 columns; don't you see an issue with comparing the hash columns at run time? Or do you suggest I store the hash values in a column of the target dimension table and compare that with the source?

    • @MSDataFactory
      @MSDataFactory  4 years ago

      It's tough to say without looking at the monitoring logs! If you are doing multiple comparisons, hash values should be quicker, though.

    • @easyyuz2707
      @easyyuz2707 3 years ago

      @@MSDataFactory Thanks a lot. I also wonder about this question. Doing a hash comparison is definitely better when we compare multiple columns, but I think the question is something different: the target dim table does not have a hash column stored, so every time we run the pipeline we recalculate the hash column for both the incoming source and the destination table, instead of storing the hash value on the destination and only generating the hash for the incoming source columns. If the dim table is huge, as the question states, wouldn't generating the hash at every run and comparing cause performance issues? On the other hand, if we create a hash column in the dim table, then storage becomes an issue. Is that why you don't store the hash value in the dim table?

    • @MSDataFactory
      @MSDataFactory  3 years ago

      @@easyyuz2707 Calculating the hash in Spark data frames is a very fast operation. It's the I/O and database connector read/write times that will affect your flows the most. Either method is valid, although storing the hash in the dim table is also very common.
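
For context, the hash being recomputed each run is just a derived column over the stream's columns; a minimal sketch (columns_hash follows the template's naming, and the expression is one common way to write a full-row hash):

    columns_hash = md5(toString(columns()))

Persisting this value in the dimension table versus recomputing it per run is exactly the storage-versus-compute trade-off discussed above; both approaches are workable.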

    • @easyyuz2707
      @easyyuz2707 3 years ago

      @@MSDataFactory Thanks a lot! This is very helpful!

    • @epictamil3755
      @epictamil3755 3 years ago

      @@easyyuz2707 I've implemented this with the same MD5 approach in both Informatica and ADF. ADF is faster than Informatica.

  • @jannujasmineshalini
    @jannujasmineshalini 1 year ago +1

    Do we need to create a separate data flow for each dimension? Because this approach can only work on a single file; what if another file has a different schema?

    • @bytedonor
      @bytedonor 8 months ago

      You need to create a metadata table in Azure SQL Database where you define all the particulars of your dimension tables. The metadata can be made configurable and can be read in your data flow's pipeline via a Lookup activity.

  • @mountainsandmagic645
    @mountainsandmagic645 3 years ago +1

    Unless I've missed something, this doesn't work: it updates all previous records with the end time. So if I have 5 previous records it updates them all based on the PK.

  • @puthenful
    @puthenful 3 years ago

    Quick question: the data flow debug cluster is taking a lot of time to start.

  • @philiplee6011
    @philiplee6011 2 years ago

    Great demo, thanks!! However, can this be made to work if the source and target column names are different? E.g. customer_id in the source and custId in the target.

  • @martijnstraatman6296
    @martijnstraatman6296 3 years ago +1

    Hi,
    Thanks for this nice demo!
    In the case of multiple CSVs with the same primary key, all new records are marked for insert, which leads to two active records. My expectation is that the current record is ended (which is the case) and that the two new records are processed in sequence so that I have only one active record.
    Example:
    current state in the dimension table: record a is active
    csv 1 has an update for record a
    csv 2 has an update for record a
    Both files are read and both records are inserted, which leads to the following result:
    the current record is set inactive
    the record from csv 1 is set active
    the record from csv 2 is set active
    How can I solve this issue? I think the oldest record should be processed first.
    Thanks

  • @sandroquinta1125
    @sandroquinta1125 4 months ago

    Will this work on Fabric?

  • @manianan897
    @manianan897 4 years ago

    Thanks for the lovely demo.
    Can you please advise if there is any option to parameterise the data type conversion from string to int/date etc.,
    e.g. passing a list of attributes which should be converted to int/date before the table load? Implicit conversion does not seem to work.
    Thanks in advance

    • @MSDataFactory
      @MSDataFactory  4 years ago

      Hey Manish, you can always use column patterns to have matching conditions or infer drifted columns if the data is coming from CSV. Drifted data takes the data type of the source system.
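
A sketch of that column-pattern idea, assuming a hypothetical array parameter $IntColumns naming the attributes to convert; in a Derived Column pattern, name refers to each matched column's name and $$ to its value:

    Each column that matches:  in($IntColumns, name)
    Column name expression:    $$
    Value expression:          toInteger($$)

A second pattern using toDate($$) with a separate date-column parameter would cover the date conversions the same way.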

  • @devprakash7876
    @devprakash7876 4 years ago +1

    Unable to see this "Generic Type 2 Slowly Changing Dimension" in the template section.

    • @MSDataFactory
      @MSDataFactory  4 years ago +2

      Should be available from the template gallery next week

    • @mani653
      @mani653 1 year ago

      @@MSDataFactory Still not available, but it was a very good video.

  • @BrentLeslie
    @BrentLeslie 3 years ago

    This is good, but what happens when you don't have an ID column in your source? I can't find a way to insert the ID column name (from the parameters) as a new field in the data flow. Tried with a Select transformation and a Derived Column transformation, but can't set the name of the field to be dynamic based on the parameter. Any tips?

    • @MSDataFactory
      @MSDataFactory  3 years ago

      You need a key to be able to know which dimension rows to update or insert. You can use Exists to check for the existence of a dimensions member by looking for it by other attributes in your existing dimension table. If the dimension member is not found, then it is new and you will generate a new ID using Surrogate Key. If the dimension member exists, and properties have changed, you can then grab the ID from the target dimension table and use it to update the existing row.
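
A sketch of how the new-member branch typically produces its keys, reusing the Key and MaxSurrogateKey names that appear elsewhere in this thread (this is the general pattern, not necessarily the exact transformations in the video): an Aggregate over the existing dimension captures the current maximum key, the Surrogate Key transformation numbers the incoming rows from 1, and a Derived Column offsets them:

    MaxSurrogateKey = max(Key)           (Aggregate over the existing dimension)
    Key = Key + MaxSurrogateKey          (Derived Column after the Surrogate Key transformation)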

  • @Brondahl
    @Brondahl 4 years ago

    Heya, how do the parameters for the Generic DataSets get set? We don't seem to be passing them in the way I would in a pipeline?

    • @MSDataFactory
      @MSDataFactory  4 years ago

      The parameters for the generic dataset get passed in as part of the Execute Data Flow activity. These parameters are created in the dataset.

  • @rahuldhip2002
    @rahuldhip2002 4 years ago

    Is the surrogate key column 'Key' an identity column in the SQL database? Do you need to make the surrogate key in the table an identity for this scenario? I was thinking that if you use an MPP database then you can't have a sequential number.

    • @MSDataFactory
      @MSDataFactory  4 years ago

      In ADF, "Key column" is just being used as a way to uniquely identify which rows to update in the target database. So the target database, like an MPP, is not required to support primary key as a column property.

    • @rahuldhip2002
      @rahuldhip2002 4 years ago

      @@MSDataFactory Thanks. Actually, I was talking about the 'Key' column (the surrogate key), not the unique business key (ID) which you treated as the primary key. I believe you need the surrogate key. So the question was: are you using IDENTITY for the surrogate key?

    • @MSDataFactory
      @MSDataFactory  4 years ago

      @@rahuldhip2002 This example is generating the surrogate key inside the data flow logic itself using the Surrogate Key transformation, not the database Identity column property. The primary key parameter is just the name of the database key column for matching.

    • @rahuldhip2002
      @rahuldhip2002 4 years ago

      @@MSDataFactory Thanks for your explanation. It means that whenever I create a surrogate key in the database, I should not make that column an identity. That is my takeaway.

    • @MSDataFactory
      @MSDataFactory  4 years ago +3

      @@rahuldhip2002 You can still do that, but when you map that identity column in a data flow sink, you will need to check "Skip writing key columns" or use a pre-processing script to enable identity write.

  • @balanm8570
    @balanm8570 4 years ago

    How do we pass composite primary key columns in a parameterized way? In the above example, we have ONLY one column parameterized as 'ID'. What if I have composite primary key columns like 'ID, Code'? I tried this approach and below is the error:
    "message":"Job failed due to reason: at Sink 'sink1': org.apache.spark.sql.AnalysisException: cannot resolve `source.AIDNLIFDMD,NSEQ` in search condition given columns...
    Here the composite primary key columns are AIDNLIFDMD & NSEQ.
    Any suggestions would be much appreciated, please.

    • @MSDataFactory
      @MSDataFactory  4 years ago

      In Sink Settings under "Key columns", select "Custom expression". Use a string array parameter like: ['id','name']

    • @balanm8570
      @balanm8570 4 years ago

      @@MSDataFactory Thanks for your response, team. But I could not find the "Custom expression" option in Sink Settings under "Key Columns"; it says "Import schema to get the column list". When we give ['key1','key2'], it is not validated and it says "Invalid type, expected string and provided string[]". Sorry, I could not attach my screenshot in the comment here.
      Appreciate your support on this.

    • @MSDataFactory
      @MSDataFactory  4 years ago +1

      @@balanm8570 This only works with database sinks and when deletes, updates, or upserts are turned on. See if you are able to open my screenshot here: github.com/kromerm/adfdataflowdocs/blob/master/images/keycols.png

    • @balanm8570
      @balanm8570 4 years ago

      @@MSDataFactory Thanks, team, for your timely response. Really appreciate it.
      Do we have any plan or ETA for when we can expect this feature for the Delta sink?
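
For reference, two shapes that the sink's "Key columns" custom expression can take, sketched here: a literal string array with the key names from the comment above, or a split of a comma-separated parameter (a PrimaryKey parameter of that shape is mentioned elsewhere in this thread):

    ['AIDNLIFDMD','NSEQ']
    split($PrimaryKey, ',')

Either expression evaluates to a string array, which is what the custom key-columns expression expects.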

  • @sivaramkodali8282
    @sivaramkodali8282 3 years ago

    I did the exact same thing and it is failing at NewAndUpdatedRows, on the columns_hash comparison involving Salary. Any thoughts?

    • @mugunthanc8660
      @mugunthanc8660 3 years ago +1

      I believe the data type should be declared as VARCHAR for the 'Salary' column in the dim table. I also had this problem, as I had declared the Salary data type as DECIMAL: although the values were the same, MD5 returned different values. I then tried changing the data type to VARCHAR, and it worked.
      Bottom line is...
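
The likely takeaway, sketched here rather than taken from the video: if one stream hashes a DECIMAL and the other a VARCHAR, MD5 sees different textual input even when the values match, so the hashes differ. Casting to a common representation on both streams before hashing avoids this; shown with only the Salary column from the comment (other attributes would be added the same way, and the cast must be identical on both sides):

    columns_hash = md5(toString(Salary))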

  • @anilmuppavarapu8618
    @anilmuppavarapu8618 3 years ago

    Can someone provide the link to the template mentioned at the end of the video? Where do I find the Data Factory SCD Type 2 templates?

    • @MSDataFactory
      @MSDataFactory  3 years ago

      In the ADF or Synapse UI, you will use the template gallery. In the gallery, you can find the template or use the search bar.

  • @Sarabjitsethi
    @Sarabjitsethi 3 years ago

    I have used the exact same flow for my transformation, but I am getting duplicates on the second run. Should AddHashExist and AddHashExisting be different? In my case they differ for the same data row. When I run the pipeline a second time I get duplicates; what could be the reason?
    Any suggestions?

    • @epictamil3755
      @epictamil3755 3 years ago

      To avoid the duplicates you have to add one more condition in that branch: the existing hash != the incoming hash.
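
A sketch of that extra condition as a data flow expression, assuming the template's id_hash and columns_hash columns plus a hypothetical _existing suffix for the hashes looked up from the target dimension; a row is only treated as a changed member when the key hash matches but the attribute hash differs:

    id_hash == id_hash_existing && columns_hash != columns_hash_existing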

  • @RobertLenior
    @RobertLenior 1 year ago

    In SSIS this is very, very easy to accomplish; why is it still so painstakingly cumbersome in ADF?

    • @MSDataFactory
      @MSDataFactory  1 year ago

      This is a pattern for a generic SCD Type 2 for any dimension table using schema drift and column pattern transformations, which is not possible in SSIS.

  • @AshishPatel-kn3kc
    @AshishPatel-kn3kc 3 years ago +1

    It's really helpful.

  • @RonaldPostelmans
    @RonaldPostelmans 2 years ago

    Hello Daniel, it's a wonderful example of Type 2. You're the only one so far who has made a clear video about it. I have done it with my own data. It adds new records perfectly, only updated records do not get an end date and the Active field is not updated from 1 to 0. Weirdly enough, in the data preview it does exactly what it should, but in the real run unfortunately it does not. Any idea how I can fix that?

    • @leidagoncalves6403
      @leidagoncalves6403 2 years ago

      I am having the exact same problem, did you manage to find a solution?

    • @RonaldPostelmans
      @RonaldPostelmans 2 years ago

      @@leidagoncalves6403 Unfortunately not. At this point we work with a SQL database, so my workaround is to do the SCD Type 2 with a SQL script, which works perfectly. When we start working with a data lake instead of a SQL database we will have to work with data flows, and then this will be a problem. I created an issue with Microsoft but they did not answer it. On the internet I was not able to find other websites or YouTube tutorials about this specific slowly changing dimension Type 2 case, only Type 1. Some were there but totally unclear.

    • @leidagoncalves6403
      @leidagoncalves6403 2 years ago

      @@RonaldPostelmans Thank you very much. I am actually sinking the data to CosmosDb but having the exact same issue. It seems that some other step or configuration is needed for updates and deletes but I need to investigate further. Thanks.

    • @nigelcrask9408
      @nigelcrask9408 1 year ago +1

      @@RonaldPostelmans It's the last step (the sink): you need to enable Update and set the key column using the PrimaryKey parameter; see 11:41 in the video. I hope this helps someone else, and I'll not get my 2 hours back!

    • @nigelcrask9408
      @nigelcrask9408 1 year ago

      @@leidagoncalves6403 see my post below

  • @srikarvandanapu2192
    @srikarvandanapu2192 3 years ago

    For dropping columns, can you please send me the code or logic?

    • @MSDataFactory
      @MSDataFactory  3 years ago

      To drop columns, use the Select transformation.

    • @shauntkhalatian429
      @shauntkhalatian429 3 years ago +1

      !in(['id_hash','columns_hash','MaxSurrogateKey'],name) I hope that helps

    • @crispy9489
      @crispy9489 3 years ago

      @@MSDataFactory They know that; you didn't show the code to use in the Select on screen.

    • @crispy9489
      @crispy9489 3 years ago

      @@shauntkhalatian429 Thank you, it works.

    • @MSDataFactory
      @MSDataFactory  3 years ago

      ​@@crispy9489 There is a link to the actual pipeline template in the video description that has all of the code

  • @learnnrelearn7553
    @learnnrelearn7553 4 years ago +1

    nice video