Find duplicate rows based on 2 column criteria
I have a sheet that I'm trying to removed duplicates from based on 2 criteria.
In the example above, you can see that the APP_ID is the same for all 4 entries. But only the last one is a duplicate based on looking at the unique combination of PID and APP_ID.
It's ok for an APP_ID to repeat for different PID #s. But if the same PID and same APP_ID, then that row must be marked for removal.
Appreciate your help!
Tony
Answers
-
Hi @twarner
Generally speaking, replace this problem with the problem of counting how many times the values are the same in multiple columns until a certain row.
If the result of the count is two or more, then it is a Duplicate.
As we often face similar problems when creating Smartsheet solutions, I will summarize the methods here.
There are several ways to do this.
Use COUNTIF
Using COUNTIF is a common method in Excel and other applications, where you specify a COUNTIF range of a column up to a row and use the value of the column of the row, and count how many times the value is in the range.
In Smartsheet, the formula is as follows
- =COUNTIF([PID-APP_ID]$1:[PID-APP_ID]@row, [PID-APP_ID]@row)
- COUNTIF range of a column up to a row: [PID-APP_ID]$1:[PID-APP_ID]@row
- The value of the column of the row: [PID-APP_ID]@row
The problem with this method is that [PID-APP_ID]$1 uses absolute reference, and you can not use the column function so that the formula can break.
Use RANKEQ
When using RANKEQ, you need to prepare a number column. And also, you need to ensure the uniqueness of the numbers by automatic number, etc.,
Then, use the RANKEQ function to obtain the rank of that number in a specific row within the range of the numbers that meet some criteria.
Use automatic number
The easiest way is to use an automatic number column.
- =RANKEQ([Row ID]@row, COLLECT([Row ID]:[Row ID], PID:PID, PID@row, [APP_ID]:[APP_ID], [APP_ID]@row), 1)
Note that here, by COLLECT([Row ID]:[Row ID], PID:PID, PID@row, [APP_ID]:[APP_ID], [APP_ID]@row), we get the range of [Row ID] with the same PID and APP_ID as the PID and APP_ID of a particular row, and use it as the range for RANKEQ.
Use Created system column.
If, for some reason, the sheet cannot use auto-numbering and multiple PID and APP_ID values are not entered at the same time, i.e., users add them sequentially, there is a way to convert system column Created to a numerical value and use RANQKEQ based on this value. (If users do not simultaneously enter multiple PID and APP_ID values, the system column will guarantee the uniqueness of the converted Created values.)
One possible case where you can not use auto-numbering is when rows containing auto-numbers are copied from other rows by Automation.
Converting "Created" to a numerical value is a complex formula, but you can use it elsewhere once created.
Example formula:
- =(Created@row - BaseD#) * 24 * 60 + (VALUE(MID(Created@row, 10, FIND(":", Created@row) - 10)) + IF(AND(RIGHT(Created@row, 2) = "PM", MID(Created@row, 10, FIND(":", Created@row) - 10) > 12), 12, 0)) * 60 + VALUE(MID(Created@row, FIND(":", Created@row) + 1, 2))
This expression takes the date, hour, and minute data from Created and calculates the number of minutes from a base date.
Then, as in the case of automatic numbering, RANKEQ is used based on this Created Minute number.
Please review the published sheet in the following URL and test how each method works in un-locked rows.
- =COUNTIF([PID-APP_ID]$1:[PID-APP_ID]@row, [PID-APP_ID]@row)
-
Here is another method leveraging the Created Date type column:
=IF(COUNTIFS(PID:PID, @cell = PID@row, [APP_ID]:[APP_ID], @cell = [APP_ID]@row)> 1, IF([Created Date]@row> MIN(COLLECT([Created Date]:[Created Date], PID:PID, @cell = PID@row, [APP_ID]:[APP_ID], @cell = [APP_ID]@row)), 1))
-
IF([Created Date]@row> MIN(COLLECT([Created Date]:[Created Date], PID:PID, @cell = PID@row, [APP_ID]:[APP_ID], @cell = [APP_ID]@row)),❤️
-
The first part of my formula was first checking if it was a duplicate, but I realize now that part isn't needed. [Created Date]@row can't be greater than [Created Date]@row if it is the only entry.
=IF([Created Date]@row> MIN(COLLECT([Created Date]:[Created Date], PID:PID, @cell = PID@row, [APP_ID]:[APP_ID], @cell = [APP_ID]@row)), 1)
-
Correction:
Converting "Created" to a numerical value
Example formula:
- =(Created@row - BaseD#) * 24 * 60 + (VALUE(MID(Created@row, 10, FIND(":", Created@row) - 10)) + IF(AND(RIGHT(Created@row, 2) = "PM", MID(Created@row, 10, FIND(":", Created@row) - 10) <> 12), 12, 0)) * 60 + VALUE(MID(Created@row, FIND(":", Created@row) + 1, 2))
-
I use a variation of this that also excludes blanks:
=IF(AND([PID]@row <> "", [APP ID]@row <> "", COUNTIFS([PID]:[PID], [PID]@row, [APP ID]:[APP ID], [APP ID]@row) > 1), 1, 0)
Help Article Resources
Categories
- All Categories
- 14 Welcome to the Community
- Smartsheet Customer Resources
- 63.1K Get Help
- 380 Global Discussions
- 212 Industry Talk
- 444 Announcements
- 4.6K Ideas & Feature Requests
- 140 Brandfolder
- 129 Just for fun
- 130 Community Job Board
- 450 Show & Tell
- 30 Member Spotlight
- 1 SmartStories
- 291 Events
- 35 Webinars
- 7.3K Forum Archives
Check out the Formula Handbook template!