Understanding the need for data matching technology requires a basic
familiarity with how databases access data and what their limitations are. Databases store,
identify, and find data through the use of keys. A key is simply a
unique value that is assigned to a record in a database, much like an
account number. Different organizations,
such as mortgage companies, title companies, and county tax offices, often gather and store information about the
very same physical entities (homes, homeowners, loans, etc.), but do not
use the same key values to identify that information in their respective
databases. This presents a challenge when data from one database must be
used to supplement data from another (parcelization, title searches,
deed updates, etc.), since no database has a built-in way of
performing such a merge.
Certainty versus uncertainty
Databases are
perfectly accurate and extremely efficient at matching data when records
share the same key. Because a key, by definition, identifies one and only
one record in a database, querying the database for a matching key value
returns a "yes" or "no" answer; either a record exists with this key
value or it doesn't. In other words, the outcome is absolutely certain.
But when databases
do not share the same key, non-unique identifiers such as names,
addresses, and legal descriptions are the only way to match records and merge the datasets.
Any given record from database A could then have hundreds
if not thousands of potentially matching counterparts in database B, or
no matching record at all, so the
selection process becomes much more difficult. Instead of "yes" or
"no", the answer to the question "does a matching record exist?" becomes
"maybe", and the outcome is uncertain.
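The contrast between the two kinds of lookup can be sketched in a few lines of Python. This is only an illustration: apart from the first parcel number, which is taken from the example later on this page, the records, field names, and function names are hypothetical.

```python
# Hypothetical tax-office records, stored under their unique key (parcel number).
tax_records = {
    "1000002355897": {"name": "Frances Jones", "address": "123 Main Street"},
    "1000002355898": {"name": "Frank Jones", "address": "125 Main Street"},
    "1000002360001": {"name": "Frances Jones", "address": "48 Elm Avenue"},
}


def lookup_by_key(parcel_number):
    """Key lookup: the answer is certain, one record or none."""
    return tax_records.get(parcel_number)


def candidates_by_surname(surname):
    """Non-unique lookup: may return zero, one, or many candidate keys."""
    surname = surname.lower()
    return [
        parcel
        for parcel, record in tax_records.items()
        if record["name"].lower().split()[-1] == surname
    ]


key_hit = lookup_by_key("1000002355897")   # exactly one record
key_miss = lookup_by_key("9999999999999")  # no record at all
many = candidates_by_surname("Jones")      # several "maybe" candidates
none = candidates_by_surname("Smith")      # no candidates
```

The key lookup never needs a judgment call, while the surname lookup leaves the caller to decide which candidate, if any, is the right one.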
The difference
between "yes/no" and "maybe" represents the large gap between
the two main camps of matching technology: deterministic vs. probabilistic.
Historically, the real estate industry has relied on deterministic-based
technology for its matching needs. These systems use traditional nested
logic to initially separate potentially good matches from bad ones, then
back it up with a safety net of rules and exceptions to further weed out
likely mistakes. To many companies that demand accuracy in
matching, a deterministic, or rules-based, approach seems easy and
straightforward, since individual rules and exceptions are easy to define
and enforce. Good matches are simply defined as those that do not violate any of the rules or
exceptions. And while some deterministic systems (sometimes referred to
as triangulated systems in the real estate industry) are reliable, the
approach has drawbacks. The sets of rules and exceptions
typically grow over the years into a tangle of overlapping
conditions that tend to cannibalize one another. As a result, these systems
hit a ceiling of sorts that keeps match rates down.
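A minimal sketch of a rules-based matcher shows how this happens. The rules, the exception table, and the function names are assumptions made for illustration, not a description of any particular vendor's system; the field values are borrowed from the example records later on this page.

```python
# Hand-maintained exception list; every new data quirk needs another entry.
EXCEPTIONS = {"st": "street"}


def normalize(text):
    """Lowercase, drop punctuation, and apply the known exceptions."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return [EXCEPTIONS.get(token, token) for token in cleaned.split()]


def name_rule(loan, parcel):
    return normalize(loan["name"]) == normalize(parcel["name"])


def address_rule(loan, parcel):
    return normalize(loan["address"]) == normalize(parcel["address"])


def rules_match(loan, parcel):
    """A pair is a match only if it violates none of the rules."""
    return all(rule(loan, parcel) for rule in (name_rule, address_rule))


loan = {"name": "Francis Jones", "address": "123 Main St. Dallas, TX 75202"}
parcel = {"name": "Frances Jones", "address": "123 Main Street Dallas, TX 75202"}

address_ok = address_rule(loan, parcel)  # the "st" exception rescues the address
overall = rules_match(loan, parcel)      # still rejected: Francis != Frances
```

The address passes only because someone already added an exception for "St."; the one-letter difference in the name still sinks the pair until yet another exception is written.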
Mimicking the Human Thought Process
To increase match
rates and better manage the uncertainty of matching on non-unique
identifiers, we at Accumatch have
taken a different approach: a statistically based, probabilistic one. As we developed our matching software, Searchlight, we painstakingly considered the way a human being
looks up and verifies matching real estate records. We found that
experienced 'Searchers' don't just adhere to a set of rules and exceptions
when deciding which matches are good ones - they trust their instincts. As
we say, they take a 'whole record approach'. Based on their expertise with
real estate data, they weigh the positive and negative aspects of all the
data elements that
contribute to or detract from a potential match, until they settle on a sort
of subconscious score. With a skilled appreciation of what is acceptable
to their company, they are able to compare this
'score' to their company's business rules and separate correct matches from
mismatches.
The advanced
algorithms embedded in Searchlight were designed to mimic those same
human thought processes. Unlike most deterministic systems, Searchlight
is able to overcome misspellings, formatting issues, data mutilation,
and other data discrepancies the way a skilled human would, without the
mistakes caused by fatigue or typos. The result is higher match rates
with virtually no errors. Through proper weighting, scoring, and
threshold determination, Searchlight is able to match records that would
have fallen through the cracks of traditional deterministic systems,
without introducing errors.
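The general idea of weighting, scoring, and thresholding can be sketched as follows. This is not Searchlight's algorithm: the weights, the threshold, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen to keep the illustration short.

```python
from difflib import SequenceMatcher

# Assumed field weights (summing to 1.0) and an assumed acceptance threshold.
WEIGHTS = {"name": 0.3, "address": 0.5, "legal": 0.2}
THRESHOLD = 0.75


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Whole-record score: every field adds to or subtracts from the total."""
    return sum(
        weight * similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a, rec_b, threshold=THRESHOLD):
    return match_score(rec_a, rec_b) >= threshold


loan = {
    "name": "Francis Jones",
    "address": "123 Main St. Dallas, TX 75202",
    "legal": "Main Pl. Blk 1 Lot 3",
}
parcel = {
    "name": "Frances Jones",
    "address": "123 Main Street Dallas, TX 75202",
    "legal": "Bk 1 Lt 3 Main Place Subdivision",
}
# Hypothetical unrelated record, included only for contrast.
other = {
    "name": "Robert Smith",
    "address": "9 Oak Ct. Irving, TX 75061",
    "legal": "Oak Hills Blk 7 Lot 12",
}

same_home = is_match(loan, parcel)
different_home = is_match(loan, other)
```

Because no single field can veto the pair, the misspelled name and the reordered legal description lower the score without rejecting the match outright. A production system would presumably also standardize abbreviations and word order before scoring, and would tune the weights and threshold against verified data.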
An example
Big
Nation Mortgage Company has a database containing information about its
homeowners and the homes they purchased in Dallas County. The Dallas
County Tax Office has its own database containing data it gathers on the
very same homes and homeowners in Dallas County.
A mortgage company's loan record:
| Loan account number: | 182-48974-12 |
| Borrower name: | Francis Jones |
| Property address: | 123 Main St. Dallas, TX 75202 |
| Property legal description: | Main Pl. Blk 1 Lot 3 |
A county tax office's real property record:
| Parcel number: | 1000002355897 |
| Homeowner name: | Frances Jones |
| Property address: | 123 Main Street Dallas, TX 75202 |
| Property legal description: | Bk 1 Lt 3 Main Place Subdivision |
| Annual taxes due: | $2,780.25 |
*The loan record 'matched' the
real property record on name, address, and legal description, despite
misspellings and formatting differences.
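One way the formatting differences in these two records could be reconciled is by standardizing abbreviations before comparing. The sketch below is an assumption for illustration; the abbreviation table is hypothetical and far from complete (for instance, "St" can also mean "Saint").

```python
# Hypothetical abbreviation table for addresses and legal descriptions.
ABBREVIATIONS = {
    "st": "street",
    "pl": "place",
    "blk": "block",
    "bk": "block",
    "lt": "lot",
}


def standardize(text):
    """Lowercase, drop commas and trailing periods, expand abbreviations."""
    tokens = []
    for raw in text.lower().replace(",", " ").split():
        token = raw.strip(".")
        tokens.append(ABBREVIATIONS.get(token, token))
    return tokens


loan_address = standardize("123 Main St. Dallas, TX 75202")
tax_address = standardize("123 Main Street Dallas, TX 75202")

# Word order differs in the legal descriptions, so compare them as sets.
loan_legal = set(standardize("Main Pl. Blk 1 Lot 3"))
tax_legal = set(standardize("Bk 1 Lt 3 Main Place Subdivision"))

same_address = loan_address == tax_address
legal_contained = loan_legal <= tax_legal
```

After standardization the two addresses are identical, and every term in the loan's legal description appears in the county's version. The misspelled first name is a different kind of discrepancy and still needs a similarity measure rather than a lookup table.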
But what if the mortgage
company needs to enhance some of the records in its database with data
from the county's database?
In the figure above you see a record from
Big Nation's database, followed by the best matching record from the
Dallas County Tax Office's database. Since Francis Jones is a Big Nation
customer, Big Nation is responsible for paying $2,780.25 in
property taxes to Dallas County at year's end, on Francis' behalf. To
make sure the taxes are paid on the correct piece of property, the
mortgage company must include the parcel number when making a payment to
the tax office.
The tax office requires a parcel number
with that payment because it identifies one and only one piece of property in the tax office's
own database; that is, it is a unique identifier.
But the mortgage company uses a different
number to identify Francis Jones' home in its database and doesn't know what the parcel
number is without looking it up in the tax office's database. Since its own
unique identifier, the loan account number,
is completely unrelated to the county's parcel number, it is useless as a
one-to-one matching identifier.
Using imprecise identifiers
The only way for Big
Nation to find the parcel number for Francis Jones' home is to search for it on the
county's own database using her name, address, and maybe even the
legal description of her
property. This is where things get tricky. The root 'matching problem'
in this case is that Big Nation must find the parcel number that
uniquely identifies a piece of property in the tax office's database,
using data, such as names and addresses, that does not uniquely identify
that piece of property there. Furthermore, Big Nation must be very accurate about it;
otherwise it runs the risk of paying someone else's taxes by mistake,
which can be very costly.
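The lookup described here can be sketched as a search that returns a parcel number only when one candidate clearly stands out. The equal field weights, the threshold, the ambiguity margin, and the use of `difflib.SequenceMatcher` are assumptions for illustration, and the second parcel record is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(loan, parcel):
    """Assumed scoring: name and address similarity, weighted equally."""
    return 0.5 * similarity(loan["name"], parcel["name"]) + 0.5 * similarity(
        loan["address"], parcel["address"]
    )


def find_parcel(loan, parcels, threshold=0.85, margin=0.05):
    """Return the best parcel number, or None when the answer is unsafe.

    The search refuses to answer if the best candidate scores below the
    threshold, or if the runner-up is within `margin` of the best score.
    """
    ranked = sorted(((score(loan, p), p["parcel"]) for p in parcels), reverse=True)
    if not ranked or ranked[0][0] < threshold:
        return None
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < margin:
        return None
    return ranked[0][1]


loan = {"name": "Francis Jones", "address": "123 Main St. Dallas, TX 75202"}
parcels = [
    {
        "parcel": "1000002355897",
        "name": "Frances Jones",
        "address": "123 Main Street Dallas, TX 75202",
    },
    {
        # Hypothetical unrelated parcel.
        "parcel": "1000002355898",
        "name": "Robert Smith",
        "address": "9 Oak Ct. Irving, TX 75061",
    },
]

parcel_number = find_parcel(loan, parcels)
```

Returning nothing for a weak or ambiguous result, and routing that loan to a human searcher instead, is one way to guard against paying taxes on the wrong property.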
For a mortgage
company the size of Big Nation, which processes 500 or more loans per
day, doing all this matching by hand is slow and expensive, both in
terms of salaries and in human error costs. A human searcher could take
a couple of days to look up parcel numbers for 500 loans, and would probably make
four or five mistakes in the process. Searchlight
automated matching software can match thousands of records per hour, with
virtually no errors.
For
more technical discussions on matching technology, see our section of white papers.
Accumatch • 2727 LBJ Frwy. Suite 120 • Dallas, TX 75234 • info@accumatch.com • 214.823.5579