Matching Explained
Understanding the need for data matching technology requires a basic familiarity
with how databases access data and what their limitations are. Databases store,
identify, and find data through the use of keys. A key is simply a unique value
that is assigned to a record in a database, much like an account number. Different
organizations such as mortgage companies, title companies, and county tax offices
often gather and store information about the very same physical entities (homes,
homeowners, loans, etc.), but do not use the same key values to identify that information
in their respective databases. This presents a challenge when data from one database
must be used to supplement data from another (parcelization, title searches, deed
updates, etc.), since no database has a built-in way of performing such a merge.
Certainty versus uncertainty
Databases are perfectly accurate and extremely efficient at matching data when records
share the same key. Because a key, by definition, identifies one and only one record
in a database, querying the database for a matching key value returns a "yes"
or "no" answer: either a record exists with this key value or it doesn't.
In other words, the outcome is absolutely certain.
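The certainty of a key lookup can be sketched in a few lines of Python. This is only an illustration: the dictionary stands in for a database table, and the function name and record layout are assumptions rather than any real system's schema.

```python
# Hypothetical in-memory stand-in for a county table keyed by parcel number.
parcels = {
    "1000002355897": {"owner": "Frances Jones", "taxes_due": 2780.25},
}

def find_by_key(table, key):
    """Return the single record stored under key, or None if there is none."""
    return table.get(key)

# A key that exists yields exactly one record ("yes").
found = find_by_key(parcels, "1000002355897")

# A value that is not a key in this table yields nothing ("no").
missing = find_by_key(parcels, "182-48974-12")
```

There is never a "maybe" here: the lookup either returns the one record or returns nothing.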
When databases do not share the same key, however, non-unique identifiers
such as addresses, names, and legal descriptions are the only way to merge the datasets
(match records). With data elements like these, any
given record from database A could have hundreds if not thousands of potentially
matching counterparts in database B, or no matching record at all, so the selection
process becomes much more difficult. Instead of "yes" or "no",
the answer to the question of whether a matching record exists becomes "maybe",
and the outcome is uncertain.
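By contrast, a search on a non-unique identifier can return any number of candidates. The sketch below uses made-up county records and a hypothetical helper, `candidates_by_surname`, purely to show the shape of the problem.

```python
# Hypothetical county records; parcel numbers and owners are invented.
county_records = [
    {"parcel": "P-001", "owner": "Frances Jones", "address": "123 Main Street"},
    {"parcel": "P-002", "owner": "Frank Jones", "address": "125 Main Street"},
    {"parcel": "P-003", "owner": "F. Jones", "address": "123 Maine Street"},
]

def candidates_by_surname(records, surname):
    """Return every record whose owner's last name equals surname (case-insensitive)."""
    wanted = surname.lower()
    return [r for r in records if r["owner"].split()[-1].lower() == wanted]

# Several plausible counterparts: the answer is "maybe", three times over.
jones = candidates_by_surname(county_records, "Jones")

# No counterpart at all.
smith = candidates_by_surname(county_records, "Smith")
```

The search narrows the field but does not say which candidate, if any, is the right one.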
The difference between "yes/no" and "maybe" represents the large
gap between the two main camps of matching technology: deterministic vs. probabilistic.
Historically, the real estate industry has relied on deterministic technology
for its matching needs. These systems (sometimes referred to as triangulated systems
in the real estate industry) use traditional nested logic to initially
separate potentially good matches from bad ones, then back it up with a safety net
of rules and exceptions to further weed out likely mistakes. To many companies
that demand accuracy in matching, a deterministic, or rules-based, approach seems
easy and straightforward, since individual rules and exceptions are easy to define
and enforce. Good matches are simply defined as those that do not violate any of
the rules or exceptions. While some deterministic systems are reliable, the
approach has drawbacks. The sets of rules and exceptions typically grow over
the years into a tangle of overlapping conditions that tend to cannibalize one another.
As a result, these systems hit a ceiling of sorts that keeps match rates down.
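A minimal sketch of the rules-and-exceptions style might look like the following. The specific rules, field names, and the exception are assumptions chosen for illustration; they are not taken from any particular deterministic product.

```python
def split_name(name):
    """Return (first, last) name parts in lowercase."""
    parts = name.lower().split()
    return parts[0], parts[-1]

def deterministic_match(loan, parcel):
    """Accept a pair only if it violates none of the rules or exceptions."""
    # Rule 1: ZIP codes must be identical.
    if loan["zip"] != parcel["zip"]:
        return False
    # Rule 2: last names must be identical.
    loan_first, loan_last = split_name(loan["name"])
    parcel_first, parcel_last = split_name(parcel["name"])
    if loan_last != parcel_last:
        return False
    # Rule 3: first names must be identical ...
    if loan_first != parcel_first:
        # ... Exception 3a: unless the street numbers agree.
        if loan["street_number"] != parcel["street_number"]:
            return False
    # Rule 4: street names must be spelled identically.
    if loan["street"].lower() != parcel["street"].lower():
        return False
    return True

loan = {"name": "Francis Jones", "zip": "75202",
        "street_number": "123", "street": "Main St."}
parcel = {"name": "Frances Jones", "zip": "75202",
          "street_number": "123", "street": "Main Street"}

# Rule 4 rejects this pair because "Main St." is not "Main Street".
rejected = deterministic_match(loan, parcel)
```

Rescuing this pair would take yet another exception for street abbreviations, which is how such rule sets tend to grow.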
Mimicking the Human Thought Process
To increase match rates and better manage the uncertainty of matching on non-unique
identifiers, we at Accumatch have taken a different approach: a statistically based
probabilistic one. As we developed our Searchlight software, we painstakingly considered the
way a human being looks up and verifies matching real estate records. We found
that experienced 'Searchers' don't just adhere to a set of rules and exceptions
when deciding which matches are good ones; they trust their instincts. As we say,
they take a 'whole record approach'. Based on their expertise with real estate data,
they weigh the positive and negative aspects of all the data elements that contribute
to or detract from a potential match, until they settle on a sort of subconscious
score. With a skilled appreciation of what is acceptable to their company, they
are able to compare this 'score' to their company's business rules and separate
correct matches from mismatches.
The advanced algorithms embedded in Searchlight were designed to mimic those same
human thought processes. Unlike most deterministic systems, Searchlight is able
to overcome misspellings, formatting issues, data mutilation, and other data discrepancies
the way a skilled human would, without the mistakes caused by fatigue or typos.
The result is higher match rates with virtually no errors. Through proper weighting,
scoring, and threshold determination, Searchlight is able to match records that
would have fallen through the cracks of traditional deterministic systems, without
introducing errors.
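The general idea of weighting, scoring, and thresholding can be sketched as follows. This is not Searchlight's algorithm, which is not described here. The abbreviation table, the field weights, the threshold, and the choice of similarity measures (Python's `difflib.SequenceMatcher` for names, token overlap for addresses and legal descriptions) are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table, field weights, and threshold (assumptions).
ABBREVIATIONS = {"st": "street", "pl": "place", "blk": "block",
                 "bk": "block", "lt": "lot"}
WEIGHTS = {"name": 0.3, "address": 0.4, "legal": 0.3}
THRESHOLD = 0.85

def tokens(text):
    """Lowercase, strip punctuation, expand abbreviations, return a set of words."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return {ABBREVIATIONS.get(w, w) for w in words}

def token_overlap(a, b):
    """Jaccard similarity of the two token sets; ignores word order."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def spelling_similarity(a, b):
    """Character-level similarity in [0, 1]; tolerant of small misspellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(loan, parcel):
    """Weighted sum of per-field similarities across the whole record."""
    return (WEIGHTS["name"] * spelling_similarity(loan["name"], parcel["name"])
            + WEIGHTS["address"] * token_overlap(loan["address"], parcel["address"])
            + WEIGHTS["legal"] * token_overlap(loan["legal"], parcel["legal"]))

loan = {"name": "Francis Jones",
        "address": "123 Main St. Dallas, TX 75202",
        "legal": "Main Pl. Blk 1 Lot 3"}
parcel = {"name": "Frances Jones",
          "address": "123 Main Street Dallas, TX 75202",
          "legal": "Bk 1 Lt 3 Main Place Subdivision"}
# An invented neighboring record that should not match.
other = {"name": "Frank Jonas",
         "address": "125 Main Street Dallas, TX 75202",
         "legal": "Bk 2 Lt 7 Main Place Subdivision"}

is_match = match_score(loan, parcel) >= THRESHOLD
is_mismatch = match_score(loan, other) < THRESHOLD
```

Under these assumed weights, a pair that differs only in spelling and formatting clears the threshold, while the neighboring record falls below it. In practice the weights and threshold would have to be tuned against known matches and mismatches.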
A mortgage company's loan record:
| Loan account number: | 182-48974-12 |
| Borrower name: | Francis Jones |
| Property address: | 123 Main St. Dallas, TX 75202 |
| Property legal description: | Main Pl. Blk 1 Lot 3 |
A county tax office's real property record:
| Parcel number: | 1000002355897 |
| Homeowner name: | Frances Jones |
| Property address: | 123 Main Street Dallas, TX 75202 |
| Property legal description: | Bk 1 Lt 3 Main Place Subdivision |
| Annual taxes due: | $2,780.25 |
*The loan record 'matched' the real property record on name, address,
and legal description, despite misspellings and formatting differences.
But what if the mortgage company needs to enhance some of the records in its database
with data from the county's database?
In the figure above you see a record from the database of Big Nation, a mortgage
company, followed by the best matching record from the Dallas County Tax Office's
database. Since Francis Jones is a Big Nation customer, Big Nation is responsible
for paying $2,780.25 in property taxes to Dallas County at year's end, on Francis'
behalf. To make sure the taxes are paid on the correct piece of property, the
mortgage company must include the parcel number when making a payment to the tax
office. The tax office requires a parcel number with that payment because it
identifies one and only one piece of property in the tax office's own database;
in other words, it is a unique identifier.
But the mortgage company uses a different number to identify Francis Jones' home
in its database and doesn't know what the parcel number is without looking it up
in the tax office's database. Since the mortgage company's own unique identifier,
the loan account number, is completely unrelated to the county's parcel number,
it is useless as a one-to-one matching identifier.
Using imprecise identifiers
The only way for Big Nation to find the parcel number for Francis Jones' home is
to search for it on the county's own database using her name, address, and maybe
even the legal description of her property.
This is where things get tricky. The root 'matching problem' in this case is that
Big Nation must find the parcel number that uniquely identifies a piece of property
in the tax office's database using data, such as names and addresses, that does
not uniquely identify that piece of property there. Furthermore, the match must
be very accurate; otherwise Big Nation runs the risk of paying someone else's
taxes by mistake, which can be very costly.
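The lookup described above can be sketched as a search over candidate county records that returns a parcel number only when the best candidate scores high enough. The function name, the equal weighting of name and address, the threshold value, and the second county record are all hypothetical. The similarity measure is Python's `difflib.SequenceMatcher`, used here as a stand-in.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # hypothetical acceptance threshold

def similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_parcel_number(loan, county_records, threshold=THRESHOLD):
    """Return the parcel number of the best-scoring record, or None if no
    candidate reaches the threshold."""
    best_parcel, best_score = None, 0.0
    for record in county_records:
        # Equal weighting of name and address is an assumption.
        score = (similarity(loan["name"], record["name"])
                 + similarity(loan["address"], record["address"])) / 2
        if score > best_score:
            best_parcel, best_score = record["parcel"], score
    return best_parcel if best_score >= threshold else None

county_records = [
    {"parcel": "1000002355897", "name": "Frances Jones",
     "address": "123 Main Street Dallas, TX 75202"},
    # Invented record for contrast.
    {"parcel": "1000002355898", "name": "Robert Smith",
     "address": "980 Elm Avenue Dallas, TX 75202"},
]

loan = {"name": "Francis Jones", "address": "123 Main St. Dallas, TX 75202"}
unknown = {"name": "Alice Nguyen", "address": "77 Oak Ct. Austin, TX 78701"}

parcel_number = find_parcel_number(loan, county_records)
no_parcel = find_parcel_number(unknown, county_records)
```

Returning nothing when no candidate clears the threshold is the safer failure mode here, since a wrong parcel number means paying taxes on someone else's property.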
For a mortgage company the size of Big Nation, which processes 500 or more loans
per day, doing all this matching by hand is slow and expensive, in both
salaries and the cost of human error. A human searcher could take a couple of days
to look up parcel numbers for 500 loans, and would probably make four or five mistakes
in the process. Searchlight automated matching software can match thousands of records
per hour, with virtually no errors.