Find Duplicates using Azure Data Factory (two source files)

4 min read

Summary

There are some scenarios where you need to combine different datasets with some similarities to find duplicates. In this post, I demonstrate how to use Azure Data Factory and Dataflows to find duplicates by combining two columns, and using that concatenated column to search in my secondary data source.

Assuming that you have two data sets:

DataSet A

DataSet B

DataSet A, has a persons name split into two parts: first_name and last_name. DataSet B has the same persons name in full (full name).

Assuming that your goal was to take both datasets and then run a duplicate rule to find all matching records by full name, this is how you would do it:

Summary of steps

  1. Add both datasets.
  2. Create a derived column called full name that combines both (first_name and last_name).
  3. Select the columns to output (remove first_name and last_name).
  4. Add an Exists task to check if the record exists in any of the datasets.
  5. Drop output to a Sink storage (optional).

Add both datasets

This is a simple task. I’m including the script reference here just for convenience.

Script:

source(output(
		first_name as string,
		last_name as string,
		company_name as string,
		address as string,
		city as string,
		county as string,
		state as string,
		zip as string,
		phone1 as string,
		phone2 as string,
		email as string,
		web as string
	),
	allowSchemaDrift: true,
	validateSchema: false) ~> source1
source(output(
		{full name} as string,
		company_name as string,
		address as string,
		city as string,
		county as string,
		state as string,
		zip as string,
		phone1 as string,
		phone2 as string,
		email as string,
		web as string
	),
	allowSchemaDrift: true,
	validateSchema: false) ~> source2

Create the derived column

Use the Concat expression to combine first_name and last_name to derive a new column called full name.

Script:

source1 derive({full name} = concat(first_name, ' ', last_name)) ~> DerivedColumn1
DerivedColumn1 select(mapColumn(
		{full name},
		company_name,
		address,
		city,
		county,
		state,
		zip,
		phone1,
		phone2,
		email,
		web
	),
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> Select1

Select the columns to output

Here I am dropping the first_name and last_name from the table.

Script:

elect1, source2 exists(Select1@{full name} == source2@{full name},
	negate:false,
	broadcast: 'none')~> Exists1

Add the Exists task

Finally add the Exist task. Here I am checking to see if the column full name, exists as a record in either of the sources.

Script:

Exists1 sink(allowSchemaDrift: true,
	validateSchema: false) ~> sink1

Output to Sink

This step is unnecessary, but I often add a Sink to store the curated files.

You must be logged in to post a comment.