All posts by Victor Fehlberg

Excel Address Verification is Here!

I’m happy to announce that Excel Address Verification is now available in the US! You can use it to easily ensure address deliverability, improve data quality, improve matching, and certify your mail with CASS.

What is CASS?

USPS ZIP+4® CASS™ certification helps avoid undeliverable mail resulting from bad or poor-quality addresses. Each year the US alone handles over 1.7 billion addresses that can’t be delivered because of missing or incorrect data. The cost to the USPS is exorbitant: a jaw-dropping $159 million. Consequently, the US Postal Service implemented a program to incentivize companies to send only “good” mail.

USPS requires all CASS Certified™ vendors, such as Aim-Smart, to perform DPV and LACSLink processing when correcting an address.  We do this work behind the scenes without requiring anything additional from the user.

What is DPV?

DPV (Delivery Point Validation) ensures your mailing list has deliverable addresses. Whereas typical address correction processes only match an address to a range of valid addresses, DPV verifies that a mailing address is a known USPS delivery point.

To illustrate, suppose we had an address such as 456 Springfield Blvd. Standard address correction procedures might indicate that the address is valid, thereby assigning a ZIP+4 value, when all the processor really did was make sure that 400-500 was a valid range for Springfield Blvd. In reality, though, there might not be a 456 Springfield Blvd. This type of situation is avoided with DPV: Aim-Smart’s address verification for Excel will report back an error code and a corresponding message indicating that the address is not deliverable.
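To make the distinction concrete, here is a minimal Python sketch contrasting the two checks. The street data, ranges, and delivery points are invented for illustration; real DPV data comes from USPS.

```python
# Hypothetical illustration of range validation vs. delivery point
# validation (DPV). All street data below is invented for this example.

# Range-based data: Springfield Blvd is valid for house numbers 400-500.
VALID_RANGES = {"SPRINGFIELD BLVD": (400, 500)}

# Delivery-point data: the house numbers mail is actually delivered to.
KNOWN_DELIVERY_POINTS = {"SPRINGFIELD BLVD": {402, 410, 444, 498}}

def range_check(number, street):
    """Classic address correction: is the number inside a valid range?"""
    low, high = VALID_RANGES.get(street, (1, 0))
    return low <= number <= high

def dpv_check(number, street):
    """DPV-style check: is this an actual known delivery point?"""
    return number in KNOWN_DELIVERY_POINTS.get(street, set())

print(range_check(456, "SPRINGFIELD BLVD"))  # True  - falls in the 400-500 range
print(dpv_check(456, "SPRINGFIELD BLVD"))    # False - no such delivery point
```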

What is LACSLink?

Many local governments are assigning new addresses to rural routes to improve the ability of emergency services (firefighters, police, etc.) to find addresses more quickly. You can imagine that it could be confusing to find an address like this one: 109506 County Road 1. As these addresses get renumbered, street names added, and so on, it’s helpful for the USPS to have the updated address to deliver mail more efficiently. This is what LACSLink does: it provides the lookup between a rural route address and its updated counterpart. As with DPV, this happens behind the scenes automatically for the user.
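Conceptually, LACSLink behaves like a lookup table. A toy sketch (the table entry is invented; the real conversion data is licensed from USPS):

```python
# Hypothetical sketch of a LACSLink-style lookup: rural-route addresses
# mapped to their renumbered, city-style replacements.
LACS_TABLE = {
    "109506 COUNTY ROAD 1": "1402 OAKDALE AVE",  # invented example entry
}

def lacs_lookup(address):
    """Return the updated address if a conversion exists, else the original."""
    return LACS_TABLE.get(address.upper(), address)

print(lacs_lookup("109506 County Road 1"))  # 1402 OAKDALE AVE
```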


More information on DPV and LACSLink can be found at:

https://www.usps.com/business/manage-address-quality.htm
http://zip4.usps.com/ncsc/addressmgmt/dpv.htm
http://zip4.usps.com/ncsc/addressservices/addressqualityservices/lacsystem.htm


Optimal Hardware for Aim-Smart’s Data Quality Excel Add-in

Customers often ask me what our minimum and recommended hardware specifications are.  Well, I’ll give a short answer, followed by some additional explanation.

            Minimum Requirement     Recommended
CPU         Pentium 4 or higher     Intel i7
Memory      1 GB free               6 GB or more free
Hard Disk   5400 rpm                SSD

Aim-Smart has been designed with performance in mind from the beginning. For this reason, users are able to parse hundreds of thousands of records in just a few seconds. When building the software, if at any point we thought performance was suboptimal, we spent significant effort pinpointing the cause. Each time, we found that a few more optimizations let us improve the code and, in most cases, reduce processing time by orders of magnitude.

We implemented our software in this fashion because we never knew what kind of hardware our customers might have. In addition, because Excel doesn’t run on large, powerful servers but rather on individual laptops and desktops, we knew we didn’t have the luxury of sloppy code.

The hardest challenge to date has been optimizing our fuzzy matching and deduplication algorithms. Having said all of this, our fuzzy matching performance is excellent, with nice linear behavior as the number of input records increases. In addition, if you have stricter demands on speed, we can work with you to come up with a solution (we have some tricks up our sleeves, but they require some extra configuration).

Despite everything we’ve done to reduce the need for powerful computers, better hardware of course doesn’t hurt and can only help. The single most important factor when running Aim-Smart is the speed of the hard drive. Because working with data is highly I/O dependent, a small investment in a fast SSD is well worth it.

South African Support

We recently added South African support to Aim-Smart. Now users can parse ZA addresses and phone numbers, match names using fuzzy logic, and perform other data quality functions within Excel.

One of the most complicated features to implement was the ability to parse full South African addresses, given that the format is not always consistent. We find that addresses often don’t specify the city, but rather only the suburb; however, this is not always the case: sometimes both are present, and sometimes only the city is specified. Most often the province is not specified, but even that is not always true. Here are some example addresses:

Waterfront Drive Knysna, South Africa 6571
Old Rustenburg Road Magaliesburg, South Africa 1791
Summit Place Precinct (corner of N1 North and Garsfontein off ramp), 213 Thys St Menlyn, Pretoria, Gauteng 0181
277 Main Rd, Sandton, South Africa
Johannes Road Randburg, South Africa
Porterfield Rd Cape Town, South Africa

As you can see, the format varies significantly; however, the new engine can handle most of the formats we found and does a great job matching addresses.
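To give a feel for the problem, here is a deliberately simplified Python sketch that peels off the easy components (postal code, country, a comma-separated locality). It is not Aim-Smart’s engine; real parsing needs reference data to tell suburbs from cities and to handle the comma-free formats above.

```python
import re

# A simplified sketch of parsing loosely formatted South African addresses.
def parse_za_address(address):
    result = {"street": None, "locality": None, "postal_code": None}
    text = address.strip()

    # ZA postal codes are four digits, usually at the end.
    m = re.search(r"\b(\d{4})\s*$", text)
    if m:
        result["postal_code"] = m.group(1)
        text = text[: m.start()].rstrip(", ")

    # Drop the country name if present.
    text = re.sub(r",?\s*South Africa\s*$", "", text, flags=re.I)

    # Whatever follows the last comma is treated as the locality
    # (suburb or city -- without reference data we cannot tell which).
    if "," in text:
        text, locality = text.rsplit(",", 1)
        result["locality"] = locality.strip()
    result["street"] = text.strip()
    return result

print(parse_za_address("277 Main Rd, Sandton, South Africa"))
# {'street': '277 Main Rd', 'locality': 'Sandton', 'postal_code': None}
```

Note that an address like “Waterfront Drive Knysna, South Africa 6571” defeats this sketch: with no comma between street and suburb, “Knysna” stays glued to the street, which is exactly why reference data is required.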

We can also guess gender for ZA names, but we don’t provide the statistical probabilities that exist in the US because we weren’t able to find a data source from which to derive them.

As with other countries, adding ZA support is very reasonably priced.  Good luck!

Smart Parse and Smart Standardize Performance

Because Aim-Smart runs within Microsoft Excel, many people ask me how well Aim-Smart performs.  I’ve written a blog about Smart Match performance, but today I’m sharing performance statistics on Smart Parse and Smart Standardize.

For the test, I used several sets of real US addresses, making test cases of 1, 10, 100… and up to 1 million records (close to the limits of Excel).  You can see the results in the table below. I also graphed the results to show the linear behavior of the two functions.

[Table and chart: Parse & Standardize performance]

As you can see, performance is very good: Aim-Smart parses 1 million records in just 47 seconds, and Smart Standardize runs on 1 million records in a similar time, 55 seconds.
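For context, those timings work out to roughly the following throughput:

```python
# Back-of-the-envelope throughput implied by the timings above.
parse_rate = 1_000_000 / 47        # Smart Parse: ~21,000 records/second
standardize_rate = 1_000_000 / 55  # Smart Standardize: ~18,000 records/second
print(round(parse_rate), round(standardize_rate))  # 21277 18182
```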

I chose to parse and standardize address line 1 for this test because it is more complex than most data types (such as name or phone). Despite the additional complexity, performance is still very good.

The tests were performed on a Windows 8 laptop with 8 GB of memory, an Intel i7 CPU (2.4 GHz), and a Samsung 850 EVO solid-state drive.

Although the laptop has several CPU cores, only one of them is used by the system at any given time, allowing the laptop to remain responsive, and therefore useful, for additional tasks such as checking email.

Improving Match Performance

When using Aim-Smart to match data, there are several steps you can take to improve performance.

First, when matching two sources, always make the second source (called the target) the larger of the two databases.  The reason for this is that the sound-alike engine is applied only to Source A.  There is processing time involved in creating each sound-alike, and additional time in storing them for matching later.
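To see why the order matters, here is a sketch using classic Soundex as a stand-in for the (proprietary) sound-alike engine: one phonetic key is computed and stored per word of each Source A value, so a smaller Source A means fewer keys to build and store.

```python
# Classic Soundex as a stand-in for Aim-Smart's sound-alike engine.
def soundex(name):
    """Return the 4-character Soundex code for a single word."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = "".join(c for c in name.upper() if c.isalpha())
    if not word:
        return ""
    out, prev = [word[0]], codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return ("".join(out) + "000")[:4]

# Source A (the smaller side): compute and store one key per word.
source_a = ["Robert Smith", "Catherine Jones"]
keys = {name: [soundex(w) for w in name.split()] for name in source_a}
print(keys["Robert Smith"])  # ['R163', 'S530']
```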

Second, if the target (source B) is greater than a few hundred thousand records, you may want to save the results to a Custom Database for use later.  Thus, for any subsequent matches, performance is greatly improved.  I have done extensive testing where source B had 3.5 million records and matching ran within minutes when source A had up to 10,000 records (to be precise 8 min 52 sec).  To save the records to a custom database, use this feature when matching:

[Screenshot: Custom Database matching]

You can then reuse the “Master_Data” database by pressing the Custom DB Matching icon within the toolbar.

Third, each additional match condition makes matching slightly slower. In my testing, I found that going from one match condition to five increased processing time by about 50%. You may not want, or be able, to reduce the number of match conditions used, but it’s something to keep in mind.

Lastly, use a solid state drive.  🙂  It’s amazing how much they improve I/O performance.

 

How to Smart Match Despite Data Quality Differences

One of the challenges in matching involves inconsistent data quality between the sources. Oftentimes, users are trying to match one input against a “master” source, such as a CRM system, a Master Data Management (MDM) system, etc. In these cases the master’s data quality is typically high. On the other hand, the input data source is often sporadically populated and generally of poorer quality.

Aim-Smart can obviously help identify the problems in the source data and also standardize the data in either source so that the formatting is consistent. The next issue is how to match data between the two sources given the differences in quality.

The input data source might look something like this:

[Screenshot: input source]

The MDM/CRM source might appear something more like:

[Screenshot: master source]

We can use Aim-Smart to parse the full name and full address into more granular components so that it aligns more directly with our master source. After doing that it might look like this:

[Screenshot: input source after parsing]

Note that while the master source has two different columns for phone number, the input source only has one. Furthermore, we don’t know whether the input source records the office phone, the mobile phone, or perhaps a mixture of the two. Therefore, let’s map the fields as shown below; doing so allows any phone number in the input source to match either the office or the mobile phone of the master.

[Screenshot: field mapping]
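The effect of that mapping can be sketched as a one-to-many comparison. The normalization and field names here are illustrative, not Aim-Smart’s internals:

```python
# One-to-many phone mapping: the input's single phone field may match
# either master phone column.
def normalize(phone):
    """Keep digits only, so formatting differences don't block a match."""
    return "".join(ch for ch in phone if ch.isdigit())

def phone_matches(input_phone, office_phone, mobile_phone):
    candidates = {normalize(p) for p in (office_phone, mobile_phone) if p}
    return normalize(input_phone) in candidates

print(phone_matches("(801) 555-0142", "801-555-0100", "801.555.0142"))  # True
```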

I’ve found that the best way to overcome the differences in data quality between an input/master is to create a series of match rules that overlap each other. For example, given the data in these two sources, I might create a tiering of match rules that looks something like this:

[Screenshot: match rules]

Note that the first and second “High” matches are similar. The first searches for matches using the address + ZIP, whereas the second uses address + city + state. Although in theory the second rule shouldn’t provide any new matches, I’ve found that it often does, especially in situations where a ZIP code is not known. The reverse is often true as well: someone may choose to specify their city as “Los Angeles” instead of “Monterey Park” because the former is more easily recognized; thus the city won’t match, but the ZIP code will.
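The tiering can be sketched as an ordered list of rules, where the first rule whose fields are all populated and in agreement wins. The rule set and field names below are illustrative, not Aim-Smart’s actual configuration:

```python
# Overlapping match rules, evaluated in order from highest to lowest tier.
RULES = [
    ("High",   ["last", "first", "address", "zip"]),
    ("High",   ["last", "first", "address", "city", "state"]),
    ("Medium", ["last", "first", "zip"]),
    ("Medium", ["last", "first", "city", "state"]),
]

def match_tier(a, b):
    """Return the tier of the first rule whose fields are populated and equal."""
    for tier, fields in RULES:
        if all(a.get(f) and a.get(f) == b.get(f) for f in fields):
            return tier
    return None

rec_a = {"first": "VICTOR", "last": "FEHLBERG", "address": "1 MAIN ST",
         "city": "LOS ANGELES", "state": "CA", "zip": None}       # ZIP unknown
rec_b = {"first": "VICTOR", "last": "FEHLBERG", "address": "1 MAIN ST",
         "city": "LOS ANGELES", "state": "CA", "zip": "91754"}
print(match_tier(rec_a, rec_b))  # High -- the second rule fires despite the missing ZIP
```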

Next, I want to point out that I have disabled the “Must Exist to Match” checkbox on middle name. What this means is that records missing a middle name in either the input or the master will still match each other; however, if a middle name is provided in both the input and the master, and the two conflict, the match is broken. Thus, Victor A. Fehlberg matches Victor Fehlberg, but Victor A. Fehlberg does not match Victor B. Fehlberg. The same logic can be used with suffixes: suppose you want John Smith to match John Smith, Jr., but you don’t want John Smith Sr. to match John Smith Jr. In case you’re curious, here’s what the smart options looked like for middle name. In addition to enabling the component, I set the comparison to one character so that John Albert Smith would match John A Smith.

[Screenshot: middle name smart options]
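The middle-name behavior described above can be sketched as follows: a missing value is compatible with anything, while conflicting values, compared on one character, break the match.

```python
# "Must Exist to Match" unchecked, with a one-character comparison.
def middle_names_compatible(a, b):
    if not a or not b:  # missing on either side: still a match
        return True
    return a[0].upper() == b[0].upper()  # compare the first character only

print(middle_names_compatible("A", None))      # True:  Victor A. vs. Victor
print(middle_names_compatible("Albert", "A"))  # True:  John Albert vs. John A
print(middle_names_compatible("A", "B"))       # False: Victor A. vs. Victor B.
```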

Going back to my discussion of match rules, note that the “Medium” matches are defined as a superset of the “High” matches. This is really helpful because the lower-quality matches can account for further data quality degradation or differences. When might this happen? A common scenario is when a person’s address in one system is the home address, yet the second system has an office address. Another scenario is when a person works at two addresses, e.g., a doctor who has an office but also works at the local hospital.

With match rules that cover a wide variety of data quality differences, and the ability to conditionally match a middle name (or suffix), you’ll find increased confidence in your match results. Happy matching!