Excel tip: How to clean duplicate rows

By | November 14, 2015

Data is often imperfect. As consultants we know that perfect data is a myth. No matter the organization or website, data is gunky, irregular, and often broken.

Was at Starbucks this morning doing my travel and expenses. Went to the American Express website and downloaded the last 6 months of corporate charges, or about $34K in planes, hotels, rental cars, Fedex prints, and restaurant bills. So far so good.

Oddly, the sums did not match my records. Hmmmm.  After a little bit of sorting and filtering, I saw that there were duplicate expenses. Come on Amex, what’s up?

How to eliminate duplicate data in excel? There were about 370+ expenses, of which 1 in 5 were duplicates (same date, same vendor, same amount). Sorted by $ amount and you can see the sporadic duplicates. Oddly, only some were duplicates. Sheesh.

Consultantsmind - Sort by amount

The novice approach would be to eye-ball the duplicates and selectively start deleting. Obviously that is a weak and fallible approach. Trust me. . .this kind of ad-hoc data clean up happens all the time in corporate America.  It’s okay when you have 15 rows of data, but what happens with you have 24,000 rows?  Need scale, need auditing.

1. CONCATENATE. Not an excel formula you use often, but super useful; it simply combines two cells into 1 cell. In this example, we are just combining “apple” + “watch” to get “applewatch”. Ta-da. Simple, a bit boring, but trust me, useful.

Consultantsmind - Concatenate 1

For my file, I combined column A (date) + column B (amount) to get an unique id; The two YELLOW cells became the ORANGE cell. Copy down, all rows get a unique id.

Consultantsmind - Concatenate

2. SORT. Then I sorted the file by my new new id to put things in order (column D).

  • Data –> Sort (this might look different depending on your MS Office version)
  • Or you can do what excel wizards do. .  .[ALT] D, F, F.

Consultantsmind - Sort

3. IF statement. This is something analysts use all the time.

  • The formula works like this: = if ( A relationship B, then C, if not then D)
  • One example: if the test score is greater than 75, I am happy, if not I am sad

Consultantsmind - If 2

For my file, I told the excel to see if the cell was the same as the one directly below it. (Remember I SORTED them in order). If ORANGE = YELLOW, then 1, if not 0. Consultantsmind - If 3

Copy down, this is what your get. . . a series of 1 = yes duplicate, 0 = not duplicate.

Consultantsmind - If 4

4. Copy / Paste as Values. This is a critical step lots of newbies miss (trust me I have made this mistake 100 times). . . copy the newly created formula columns and PASTE AS VALUES. This is how you keep the 1s and the 0s matched up the correct rows.

Consultantsmind - Copy Paste as Values

5. Sort by Duplicates.  Since the 1s are duplicates, you can sort again and see there are 79 duplicates. If you eliminate those, then you have clean data.

Consultantsmind - Sort 1s

6. Caveat #1.  This method is not infallible because you conceivably could have legitimate expenses which were the same date & amount.  Perhaps a “safer” way to go would be to also concatenate the vendor.

7. Caveat #2. When you are cleaning data, I recommend that you:

  • Keep 1 tab of your original data, and just copy data over to a new worksheet
  • What ever you remove from the old data set, “cut” and paste on a different tab

Consultantsmind - tabs

To newbies.  Let me know if you want a copy of the source data to play with.

To excel wizards: I know this was rudimentary.  Let me know what other quick tips you have for deleting duplicates in data?

P.S. In the comments, Kevin noted that there is a REMOVE DUPLICATE function in Excel. . under Data –> Data Validation. In the exercise above, you will want to keep the original file to make sure you can audit what the “blackbox” functions are doing.

Related Posts:

15 thoughts on “Excel tip: How to clean duplicate rows

  1. Kevin

    On the “Data” tab of Excel, under the “Data Tools” subsection, there is a “Remove Duplicates” function. You specify which column(s) to check, it removes duplicates.

    Great blog. Thanks.

    Reply
  2. Kevin W.

    Concatenate shouldn’t be necessary when using the “Remove Duplicates” function as all the selected columns would then have to match exactly, per row.

    Reply
      1. ilan

        yes it looks at the combination of every cell in the row of the selected array. The downside about using the remove dupes function is that you can’t see what was taken out – just the end result so i usually copy the data to a new tab and remove from there. Your concatenate method is a lot more work, but it would let you see what is being duplicated which might be of use in some analyses.

        Reply
  3. Jack

    I would have used a pivot table with your concatenate column + an added counter column (all records in the data set = 1). Then in the pivot, any values listed multiple times would show up as >1 in the counter column.

    Reply
  4. Tomm

    With ‘remove duplicates’ you cannot track what was removed or am I wrong?

    Reply
    1. consultantsmind Post author

      Tomm – that is correct. So, might make sense to keep a copy of the original data on a separate tab. .so you can “audit” what was taken out. Good call.

      Reply
  5. Phuc

    Newbie here. Can i have the copy to try out recommended methods?. Everyday reading your posts really work wonders to my mind.

    Thanks

    Reply

Leave a Reply