How to display "closest matching" rows based on an "input" row?

Hello all!

I've been building out what is essentially a database to track information related to customer orders of a specific product. The product is essentially custom tailored firmware - specifically, automotive engine tuning files. Most of the information is provided by a customer through a form, some is manually input directly into the sheet on my end.

What I need is a way to quickly search the database to find rows that are similar, and some visual indicator of the "percentage match". So if Customer A selected options A, B, and C, and Customer B selected options A, B, and D, they would have a 66% "match".

Ultimately my goal is to speed up the firmware creation process by using matching (or at least very similar) files as a starting point, rather than starting from scratch for each new customer. This ability should significantly reduce the amount of time we have to spend creating new files for customers.

I should clarify - I do realize that this can be done by using the filters to just narrow down columns until I find the closest combination - but that process will likely take longer than it would to just create a new file for each customer.

In the image below, lines 118 and 236 would be exact matches. The columns for ROM ID, Year, Model, Transmission, Intake, Headers, Catalytic Converters, and Injectors all match. Being able to type in the order number from line 118 and have a filtered list of results with line 236 being at the top of that list is what I'm looking for.

Where do I start? Is this even possible? I've thought about trying to use a Form as the "input" and then setting up a Workflow that creates a Report each time an input is added. But that seems like a very roundabout way to approach the problem, and would require a lot of work just to determine if it is even feasible. I've also looked briefly at Sheet Summaries to see if that could help, but I couldn't really determine if that was a good option.

Any advice on a direction to take would be greatly appreciated. I am open to any ideas. Thanks!

-Tyler

Best Answer

  • TylerC
    TylerC
    Answer ✓

    I figured it out, so thought I'd close the loop here in case anyone is looking for a similar solution.

    Solving this required a combination of the approaches that I mentioned above.

    The first piece of the puzzle is to set up a helper column for EACH of the columns that you're interested in matching. This is what mine looks like:

    The "match" columns are set up as check boxes, and each one has a column formula formatted like this:

    =IF([Column Name]@row = INDEX([Column Name]:[Column Name], 1), 1, 0)

    The next step is to add another helper column, called "% Match". The column is set up to display percentages, and my column formula is formatted like this:

    =IF([ROM Match]@row = 1, 0.02, 0) + IF([Year Match]@row = 1, 0.14, 0) + IF([Model Match]@row = 1, 0.14, 0) + IF([Transmission Match]@row = 1, 0.14, 0) + IF([Intake Match]@row = 1, 0.14, 0) + IF([Headers Match]@row = 1, 0.14, 0) + IF([Cats Match]@row = 1, 0.14, 0) + IF([Injectors Match]@row = 1, 0.14, 0)

    This allows you to assign a "weight" to each column. In my case, all of the columns are equally weighted, except for ROM Match. That column is less important for me.

    The %Match column will look like this:

    The first row will always be a 100% match because it's being compared to itself. The use of INDEX in the "match" column formulas circumvents the problem I was having previously where I could not use the $Column$Row formatting to fix the cell being compared.

    What's great about this setup is that if you want to check a different row, all you have to do is move it to the top row. Whatever is in the top row is the "input" row. Once you're done checking a row, you can simply resort to put everything back where it belongs. In my case, I sort by Order Number to keep everything organized.

    Now, I could have stopped here. All I would have to do to find matching (or closely matching) rows is simply filter the %Match column to show anything with a high enough percentage. However, I decided to go back to the report I created earlier, which had the 7 filters set up. I realized all I needed now was a single filter set up to filter by the %Match column. I set it to show anything greater than 80%, and this is the result:

    Now all I have to do is look up the name of the customer (hidden) to find their original file, and use that as a base to create my new file. And again, you're always going to have a 100% match in the first row - that's the input row, comparing itself to itself, so it'll always be 100%. But it actually works out nicely that it shows up in the report, because then you can look at the results and see exactly where the differences are.

    I'm very pleased with this solution. I think there's room for improvement though, mainly in terms expanding the definition of a "match". In the last image, of the report, you can see the first row and the last row both have "HFCs" from different brands under the "Catalytic Converters". The reality is that these will behave the same as far as the firmware is concerned, so these could technically be considered a match, which would increase the %Match for the last row from 84% to 98%. The same sort of thing applies to the "Year" column, where I could consider anything between 2009 and 2013 to be a match, and anything between 2014 and 2020 to be a match. These changes will come down to updating the individual match column formulas to include some more complex logic, and ultimately will be specific to my application, so there's not much need to share them here (unless someone is interested!).

    Hopefully this will help someone else!

Answers

  • I've spent some more time on this, I figured I would start down a path and see where I ended up.

    I created a Report based on my original sheet which brings in the columns I'm interested in. I've got seven filters set up, six of which require adjustment each time I have a new file that I want to find a match for. It's clunky and slow, but it does work.

    As far as I know, there is no way to add formulas to the filter criteria, however I believe that is what I need to make this work. I could bring in a second sheet and set up an automation that copies new rows in the original sheet to the new sheet. Then have the filter set up to search based on whatever the latest row is in the second sheet. I'd have to delete rows from the second sheet as I complete them, but that isn't a problem. I can probably set up an automation for that as well.

    I've considered adding helper rows to the original sheet, but I haven't been able to wrap my head around exactly how that would help or if it would work.

    I did try to create a column formula that looked for exact matches - not exactly what I want, but still a useful step. However, column formulas can't reference a static cell, which means that each row is simply referencing itself, rather than the "target" row. The static cell issue with column formulas also prevents me from making nested IF statements that can determine how many of my criteria are met by a single row. My formula as it sits now simply determines if ANY row has a matching value in each column, and only returns "Red" if there is an option selected in a row that does not exist in any other row. Sort of the opposite of what I'm trying to accomplish.

    Here's that formula for reference:

    =IF(COUNTIF(Year:Year, Year@row) > 1, IF(COUNTIF(Model:Model, Model@row) > 1, IF(COUNTIF(Transmission:Transmission, Transmission@row) > 1, IF(COUNTIF(Intake:Intake, Intake@row) > 1, IF(COUNTIF(Headers:Headers, Headers@row) > 1, IF(COUNTIF([Catalytic Converters]:[Catalytic Converters], [Catalytic Converters]@row) > 1, IF(COUNTIF(Injectors:Injectors, Injectors@row) > 1, "Green", "Red"), "Red"), "Red"), "Red"), "Red"), "Red"), "Red")

  • TylerC
    TylerC
    Answer ✓

    I figured it out, so thought I'd close the loop here in case anyone is looking for a similar solution.

    Solving this required a combination of the approaches that I mentioned above.

    The first piece of the puzzle is to set up a helper column for EACH of the columns that you're interested in matching. This is what mine looks like:

    The "match" columns are set up as check boxes, and each one has a column formula formatted like this:

    =IF([Column Name]@row = INDEX([Column Name]:[Column Name], 1), 1, 0)

    The next step is to add another helper column, called "% Match". The column is set up to display percentages, and my column formula is formatted like this:

    =IF([ROM Match]@row = 1, 0.02, 0) + IF([Year Match]@row = 1, 0.14, 0) + IF([Model Match]@row = 1, 0.14, 0) + IF([Transmission Match]@row = 1, 0.14, 0) + IF([Intake Match]@row = 1, 0.14, 0) + IF([Headers Match]@row = 1, 0.14, 0) + IF([Cats Match]@row = 1, 0.14, 0) + IF([Injectors Match]@row = 1, 0.14, 0)

    This allows you to assign a "weight" to each column. In my case, all of the columns are equally weighted, except for ROM Match. That column is less important for me.

    The %Match column will look like this:

    The first row will always be a 100% match because it's being compared to itself. The use of INDEX in the "match" column formulas circumvents the problem I was having previously where I could not use the $Column$Row formatting to fix the cell being compared.

    What's great about this setup is that if you want to check a different row, all you have to do is move it to the top row. Whatever is in the top row is the "input" row. Once you're done checking a row, you can simply resort to put everything back where it belongs. In my case, I sort by Order Number to keep everything organized.

    Now, I could have stopped here. All I would have to do to find matching (or closely matching) rows is simply filter the %Match column to show anything with a high enough percentage. However, I decided to go back to the report I created earlier, which had the 7 filters set up. I realized all I needed now was a single filter set up to filter by the %Match column. I set it to show anything greater than 80%, and this is the result:

    Now all I have to do is look up the name of the customer (hidden) to find their original file, and use that as a base to create my new file. And again, you're always going to have a 100% match in the first row - that's the input row, comparing itself to itself, so it'll always be 100%. But it actually works out nicely that it shows up in the report, because then you can look at the results and see exactly where the differences are.

    I'm very pleased with this solution. I think there's room for improvement though, mainly in terms expanding the definition of a "match". In the last image, of the report, you can see the first row and the last row both have "HFCs" from different brands under the "Catalytic Converters". The reality is that these will behave the same as far as the firmware is concerned, so these could technically be considered a match, which would increase the %Match for the last row from 84% to 98%. The same sort of thing applies to the "Year" column, where I could consider anything between 2009 and 2013 to be a match, and anything between 2014 and 2020 to be a match. These changes will come down to updating the individual match column formulas to include some more complex logic, and ultimately will be specific to my application, so there's not much need to share them here (unless someone is interested!).

    Hopefully this will help someone else!