SQL Tutorial - How to compare multiple columns in SQL

  • Published: 21 Aug 2022
  • Learn how to compare multiple columns in SQL. In this video I cover comparing multiple columns in the WHERE clause using CHECKSUM, BINARY_CHECKSUM and HASHBYTES, discuss the advantages and disadvantages of each function, and explain my preferred approach for comparing multiple columns.
    An important point when using CHECKSUM, BINARY_CHECKSUM or HASHBYTES to compare multiple columns is that the column list should always be in the same order on both sides to get an accurate comparison. When using the HASHBYTES function, always add a separator that does not exist in the data, such as a pipe, between the columns.
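    To make both points concrete, here is a small sketch (my illustration, with arbitrary literal values):
    -- Argument order affects the result, so both sides must list the columns identically.
    SELECT
    CHECKSUM('Karl', 'Smith') AS FirstThenLast,
    CHECKSUM('Smith', 'Karl') AS LastThenFirst; -- different values
    -- Without a separator, different column values can concatenate to the same string.
    SELECT
    HASHBYTES('SHA2_512', CONCAT('ab', 'c')) AS NoSeparator1, -- hashes 'abc'
    HASHBYTES('SHA2_512', CONCAT('a', 'bc')) AS NoSeparator2, -- also hashes 'abc', a false match
    HASHBYTES('SHA2_512', CONCAT('ab', '|', 'c')) AS WithSeparator; -- hashes 'ab|c', distinct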
    If you would like more detail on the functions used in this video to compare multiple columns in SQL then take a look at the links below:
    docs.microsoft.com/en-us/sql/...
    docs.microsoft.com/en-us/sql/...
    docs.microsoft.com/en-us/sql/...
    The SQL queries used in this video are available below:
    DROP TABLE IF EXISTS [dbo].[CustomerStaging];
    CREATE TABLE [dbo].[CustomerStaging]
    (
    [SourceKey] [int] NULL,
    [Title] [varchar](10) NULL,
    [FirstName] [varchar](50) NULL,
    [MiddleName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [DOB] [date] NULL,
    [AddressLine1] [varchar](50) NULL,
    [AddressLine2] [varchar](50) NULL,
    [City] [varchar](50) NULL,
    [PostCode] [varchar](10) NULL
    );
    INSERT INTO dbo.CustomerStaging (SourceKey, Title, FirstName, MiddleName, LastName, DOB, AddressLine1, AddressLine2, City, PostCode)
    VALUES
    (1000, 'Mrs', 'Deborah', NULL, 'Wilkinson', '19960513', '1 The Barns', NULL, 'London', 'NE1 1RS'),
    (2000, 'Dr', 'Siobhan', 'Mary', 'Jones', '19840822', 'Portobello House', 'Yarn Street', 'Birmingham', 'B16 1PZ'),
    (3000, 'Mr', 'Karl', NULL, 'Smith', '19790906', '5 Main Street', NULL, 'Manchester', 'M1 1PQ');
    DROP TABLE IF EXISTS [dbo].[Customer];
    CREATE TABLE [dbo].[Customer]
    (
    [CustomerKey] [int] NOT NULL IDENTITY(1, 1),
    [SourceKey] [int] NULL,
    [Title] [varchar](10) NULL,
    [FirstName] [varchar](50) NULL,
    [MiddleName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [DOB] [date] NULL,
    [AddressLine1] [varchar](50) NULL,
    [AddressLine2] [varchar](50) NULL,
    [City] [varchar](50) NULL,
    [PostCode] [varchar](10) NULL
    );
    INSERT INTO dbo.Customer (SourceKey, Title, FirstName, MiddleName, LastName, DOB, AddressLine1, AddressLine2, City, PostCode)
    VALUES
    (1000, 'Miss', 'Deborah', NULL, 'Austin', '19960513', '1 The Barns', NULL, 'London', 'NE1 1RS'),
    (2000, 'Dr', 'Siobhan', 'Mary', 'Jones', '19840822', 'Portobello House', 'Yarn Street', 'Birmingham', 'B16 1PZ'),
    (3000, 'Mr', 'Karl', NULL, 'Smith', '19790906', '250 Wood Road', NULL, 'Manchester', 'M25 1HH');
    -- CHECKSUM
    SELECT
    *
    FROM dbo.CustomerStaging AS A
    LEFT OUTER JOIN dbo.Customer AS B
    ON A.SourceKey = B.SourceKey
    WHERE
    CHECKSUM(
    A.Title,
    A.FirstName,
    A.MiddleName,
    A.LastName,
    A.DOB,
    A.AddressLine1,
    A.AddressLine2,
    A.City,
    A.PostCode
    )
    <>
    CHECKSUM(
    B.Title,
    B.FirstName,
    B.MiddleName,
    B.LastName,
    B.DOB,
    B.AddressLine1,
    B.AddressLine2,
    B.City,
    B.PostCode
    );
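    A caveat with CHECKSUM (a sketch, assuming a case-insensitive collation, which is the SQL Server default): it is collation-aware, so a change in case only is not detected.
    SELECT
    CHECKSUM('Smith') AS Original,
    CHECKSUM('SMITH') AS UpperCase; -- same value under a case-insensitive collation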
    -- BINARY_CHECKSUM
    SELECT
    *
    FROM dbo.CustomerStaging AS A
    LEFT OUTER JOIN dbo.Customer AS B
    ON A.SourceKey = B.SourceKey
    WHERE
    BINARY_CHECKSUM(
    A.Title,
    A.FirstName,
    A.MiddleName,
    A.LastName,
    A.DOB,
    A.AddressLine1,
    A.AddressLine2,
    A.City,
    A.PostCode
    )
    <>
    BINARY_CHECKSUM(
    B.Title,
    B.FirstName,
    B.MiddleName,
    B.LastName,
    B.DOB,
    B.AddressLine1,
    B.AddressLine2,
    B.City,
    B.PostCode
    );
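    By contrast, BINARY_CHECKSUM works over the binary representation of the values, so the same case-only change is detected (a sketch under the same assumptions):
    SELECT
    BINARY_CHECKSUM('Smith') AS Original,
    BINARY_CHECKSUM('SMITH') AS UpperCase; -- different values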
    -- HASHBYTES (with a pipe separator between the columns, per the note above,
    -- so that adjacent values cannot run together)
    SELECT
    *
    FROM dbo.CustomerStaging AS A
    LEFT OUTER JOIN dbo.Customer AS B
    ON A.SourceKey = B.SourceKey
    WHERE
    HASHBYTES('SHA2_512',
    CONCAT(
    A.Title, '|',
    A.FirstName, '|',
    A.MiddleName, '|',
    A.LastName, '|',
    A.DOB, '|',
    A.AddressLine1, '|',
    A.AddressLine2, '|',
    A.City, '|',
    A.PostCode
    )
    )
    <>
    HASHBYTES('SHA2_512',
    CONCAT(
    B.Title, '|',
    B.FirstName, '|',
    B.MiddleName, '|',
    B.LastName, '|',
    B.DOB, '|',
    B.AddressLine1, '|',
    B.AddressLine2, '|',
    B.City, '|',
    B.PostCode
    )
    );
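    One edge case to be aware of (a sketch, not from the video): CONCAT implicitly converts NULL to an empty string, so a NULL column and an empty-string column hash identically.
    SELECT
    HASHBYTES('SHA2_512', CONCAT('Karl', '|', NULL, '|', 'Smith')) AS WithNull,
    HASHBYTES('SHA2_512', CONCAT('Karl', '|', '', '|', 'Smith')) AS WithEmptyString; -- same hash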

Comments • 17

  • @johnbreslin4521 • 10 months ago +1

    Very useful and a different way of comparing columns. I will have to try out this method. Thanks very much for an easy-to-follow video, as always.

  • @briandennehy6380 • 1 year ago +4

    Really interesting, never seen these functions before, thanks as always 👍

  • @Sttuey • 1 year ago +2

    Good discussion of the various methods! I have done something very similar for tasks such as loading regular product files with frequently changing data such as stock levels, pricing, fees etc. For some cases with only a handful of columns to compare I've actually used *not exists* in combination with *intersect*, which is both performant and, crucially, dependable for nullable columns; for some use cases with many columns to compare I've gone the HASHBYTES route but with SHA2, which was the most performant in testing and requires relatively little storage (security is not a concern in this situation). I use the hashed checksum via an indexed computed column in the target tables and calculate it on the fly in dynamically created views against the staging tables that receive data via bulk load or PolyBase. Performance is excellent even with a product feed such as Amazon's, with many millions of rows to be compared for changes.

    • @BeardedDevData • 1 year ago +1

      Excellent points, I also use NOT EXISTS, usually for snapshot fact tables where we want to insert any balance changes. There are many options to choose from, and when you look at the number of columns and the data types involved it's always best to test out different solutions 👍.
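
      For reference, a minimal sketch of the NOT EXISTS / INTERSECT pattern discussed in this thread, using the video's tables (an illustration, not the commenters' exact code). INTERSECT treats NULLs as equal, which is what makes the comparison dependable for nullable columns:
      SELECT
      A.*
      FROM dbo.CustomerStaging AS A
      INNER JOIN dbo.Customer AS B
      ON A.SourceKey = B.SourceKey
      WHERE NOT EXISTS
      (
      SELECT A.Title, A.FirstName, A.MiddleName, A.LastName, A.DOB, A.AddressLine1, A.AddressLine2, A.City, A.PostCode
      INTERSECT
      SELECT B.Title, B.FirstName, B.MiddleName, B.LastName, B.DOB, B.AddressLine1, B.AddressLine2, B.City, B.PostCode
      );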

  • @trinhk • 4 months ago +1

    Great video, thank you. Would you consider implementing the various checksum methods for a data migration of millions or hundreds of millions of records?

    • @BeardedDevData • 4 months ago

      It would depend on your acceptable range of collisions. CHECKSUM is certainly out of the window; you'd have to use HASHBYTES with an algorithm that generates at least a 64-bit value. I've just had a look online for some numbers and found details saying that at 609 million rows you have a 1 in 100 chance of a collision, that is, different inputs returning the same value. This may be acceptable, as the alternative is a column-by-column comparison to get 100% accurate results. If this is a one-off activity you could also use hashing and then handle any collisions separately.
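
      As a sanity check on those numbers: the standard birthday approximation for a b-bit hash over n rows is p ≈ n^2 / 2^(b+1). With b = 64 and n = 6.09 × 10^8, that gives p ≈ 3.7 × 10^17 / 3.7 × 10^19 ≈ 0.01, which matches the quoted 1-in-100 chance.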

  • @KarinS-tk3qt • 1 year ago +2

    Excellent video. I have a question: your video shows comparing the data and then finding the differences if they exist.
    If I want to update the data in table 2 according to the changes in table 1, or alternatively add rows that were not in table 2, how can I do that?

    • @BeardedDevData • 1 year ago

      Personally I break this down into separate INSERT and UPDATE statements within a transaction; I do this mainly for auditing purposes and to control the size of transactions. Another option would be to use MERGE, but I would advise testing it properly, because it can cause performance issues and some adverse effects.
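
      A minimal sketch of that insert-then-update pattern against the video's tables (an illustration only; the UPDATE column list and hash are abbreviated to three columns, with the remaining columns following the same pattern):
      BEGIN TRANSACTION;
      -- Update rows whose hash no longer matches the staging row
      UPDATE B
      SET B.Title = A.Title,
      B.FirstName = A.FirstName,
      B.LastName = A.LastName
      FROM dbo.Customer AS B
      INNER JOIN dbo.CustomerStaging AS A
      ON A.SourceKey = B.SourceKey
      WHERE HASHBYTES('SHA2_512', CONCAT(A.Title, '|', A.FirstName, '|', A.LastName))
      <> HASHBYTES('SHA2_512', CONCAT(B.Title, '|', B.FirstName, '|', B.LastName));
      -- Insert rows that do not yet exist in the target
      INSERT INTO dbo.Customer (SourceKey, Title, FirstName, MiddleName, LastName, DOB, AddressLine1, AddressLine2, City, PostCode)
      SELECT A.SourceKey, A.Title, A.FirstName, A.MiddleName, A.LastName, A.DOB, A.AddressLine1, A.AddressLine2, A.City, A.PostCode
      FROM dbo.CustomerStaging AS A
      WHERE NOT EXISTS (SELECT 1 FROM dbo.Customer AS B WHERE B.SourceKey = A.SourceKey);
      COMMIT TRANSACTION;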

  • @Beesay1978 • 1 year ago +1

    In my SSMS, SELECT CHECKSUM('1')
    and SELECT CHECKSUM('-1 ') return 133 and 597. It is SSMS 2019. Thank you for your educational video.

  • @shima1960ify • 1 year ago +1

    Excellent

  • @allmazd • 1 year ago +1

    In normal use, why not just use the traditional method instead of showing off with these tricky functions?

    • @BeardedDevData • 1 year ago +2

      Hi, can you explain what you mean by the normal method? If you are referring to comparing column by column, then that would be considered more traditional; however, as I explained in the video, if you have a lot of columns that becomes very tedious and even more tricky. I'm not trying to show off but to help people understand how things can be done more simply, so that they are easier to manage 😀.

    • @allmazd • 1 year ago +1

      @@BeardedDevData I didn't mean that you were showing off; I meant a particular user who would do such a comparison. All in all, it is always good to have many tools to accomplish the same operations. By normal use I meant comparing each column with the other; in terms of code, it doesn't seem to make things longer or more tedious.

    • @BeardedDevData • 1 year ago

      I think you are referring to the times when we only need to compare a small number of columns. When you need to compare, for example, 10+ columns, it can become very tedious to write out a massive WHERE clause. Remember, we probably also want to do this comparison multiple times: we will want to detect what doesn't exist in the target table, what has changed, and what exists in the target but not in the source table. Instead, I prefer to keep things simple. There are also other advantages: we can optimise the process by comparing just one column, and it's easier to visually see change when looking through the data; we can store the history of the hash value and see changes over time, which is a massive advantage when it comes to debugging our SCD processes. I just don't see why you would want to write out a lot more code and have a process that performs slower and can cause a lot of headaches when there is a much simpler approach.
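
      As a pointer for the computed-column approach mentioned earlier in this thread, something like the following persisted column would store the hash alongside each row (a sketch only; RowHash is a hypothetical name, and the explicit date style keeps the expression deterministic so the column can be persisted and indexed):
      ALTER TABLE dbo.Customer ADD RowHash AS
      HASHBYTES('SHA2_512', CONCAT(Title, '|', FirstName, '|', MiddleName, '|', LastName, '|', CONVERT(varchar(8), DOB, 112), '|', AddressLine1, '|', AddressLine2, '|', City, '|', PostCode)) PERSISTED;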