For law enforcement agencies looking to extract crucial insights from their data to accelerate investigations and threat detection, the first steps are to access, cleanse, and transform the data. Once those steps are complete, entity resolution (ER) is the final stage before conducting analytics using a data fusion or decision intelligence platform. ER consolidates the differing values and depictions of each entity type, such as individuals, addresses, and phone numbers. Think of it as a digital Rolodex, like the contacts app on your phone, which merges everything known about one person into a single card. It streamlines analytics by unifying information from various sources into a coherent, singular representation of each entity. The efficacy of entity resolution directly influences the quality and confidence of analytical outcomes, because it determines whether all content associated with a given entity is presented completely and accurately.
Personal names are not unique and often vary significantly, particularly when data is integrated across systems that accommodate nicknames, abbreviations, and other adaptations. ER establishes distinct profiles and eliminates redundant entries from the database and decision intelligence platform. Consider this: Is John Smith identical to Johnnie Smith, Jon Smith, or perhaps Johnny Smith?
It can be challenging to tell without a unique identifier. Entity resolution commonly relies on supplementary data like physical traits (race, sex, eye color, scars, tattoos, weight, and height), birthdates, or shared identification numbers, addresses, phone numbers, or email addresses. Some ER systems use deterministic rules built on these descriptive attributes and on values derived from them through transformation, for example:
- Alias of a first name (John, Johnny, Juan -> Jon)
- Soundex of a last name (Smith -> S530)
- Plus/minus 1 or 2 years for the year of birth (YOB)
- Same birth month/day or day/month combination
- Lives in same city/state (or use of MGRS (Military Grid Reference System) precision grids)
- Located within 5 miles (zip code/postal code centroid distance calculation)
- Same ethnicity/race
- Has matching gender/sex
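Several of the transformations listed above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's implementation: the alias table, the function names, and the tolerance defaults are assumptions made for the example, the phonetic code follows the standard American Soundex rules, and the distance check assumes that latitude/longitude centroids for postal codes come from a lookup table that is not shown.

```python
import math
from datetime import date

# Hypothetical alias table; a real system would load a much larger curated list.
ALIASES = {"john": "jon", "johnny": "jon", "johnnie": "jon", "juan": "jon"}

# Consonant groups used by American Soundex.
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def canonical_first_name(name: str) -> str:
    """Map known aliases to one canonical spelling (lowercased)."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def soundex(name: str) -> str:
    """American Soundex: first letter followed by three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def yob_within(a: int, b: int, tolerance: int = 2) -> bool:
    """True when two years of birth differ by at most `tolerance` years."""
    return abs(a - b) <= tolerance


def same_birth_day_or_swapped(a: date, b: date) -> bool:
    """True for the same month/day, or when month and day appear swapped."""
    return (a.month, a.day) in {(b.month, b.day), (b.day, b.month)}


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two centroids."""
    radius = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))
```

With helpers like these, "Johnny Smith" and "Jon Smyth" reduce to the same canonical first name and the same last-name code, which is what allows a deterministic rule to compare them directly.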
Certain ER systems generate matches by applying weightings that are usually defined by the user or admin. For instance, a rule may require matches on the last and first names, the same year of birth, and the same gender/sex. It may then require at least one additional match across the remaining conditions. An ER system's assurance level also considers interconnectedness with other entities, such as shared email, phone, or address information. Frequently, these connections are visualized through network or graph representations.
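The rule described above, mandatory fields plus additional weighted conditions, might look like the following sketch. The `Person` record, the field names, the weights, and the threshold are all hypothetical values chosen for illustration; in practice they would be configured by the user or admin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    first_name: str       # assumed to be alias-normalized already
    last_name_code: str   # assumed to be a phonetic code of the last name
    yob: int
    sex: str
    city: str = ""
    phone: str = ""
    email: str = ""


# Fields that must all agree before a pair is considered at all.
MANDATORY = ("first_name", "last_name_code", "yob", "sex")

# Hypothetical weights for the optional conditions.
OPTIONAL_WEIGHTS = {"city": 1.0, "phone": 2.0, "email": 2.0}


def optional_score(a: Person, b: Person) -> float:
    """Sum the weights of optional fields that are present and equal."""
    return sum(
        weight
        for field, weight in OPTIONAL_WEIGHTS.items()
        if getattr(a, field) and getattr(a, field) == getattr(b, field)
    )


def is_match(a: Person, b: Person, min_score: float = 1.0) -> bool:
    """Require every mandatory field to agree, then enough optional evidence."""
    if any(getattr(a, f) != getattr(b, f) for f in MANDATORY):
        return False
    return optional_score(a, b) >= min_score
```

Empty optional values are deliberately ignored, so two records that both lack a phone number do not count as sharing one.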
Other ER systems employ machine learning and probabilistic matching using a combination of string distances, similarity matching, and common tokens (shared entities). Many systems use graph-based matching methods and work well in homogeneous domains with the same feature space (attributes). Additionally, the ER system must avoid overmatching on common placeholder values that result from data quality problems or inconsistencies, such as telephone = (555)-555-5555, id-number = 12345, or address = not given / unknown, because such values would otherwise bias the results.
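As a rough sketch of similarity-based scoring with placeholder filtering, the example below uses Python's standard `difflib.SequenceMatcher` as a stand-in for whichever string distance a real system employs. The placeholder lists, field names, and weights are assumptions made for the example, and a production system would typically learn its weights rather than hard-code them.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical placeholder values that should never count as evidence.
PLACEHOLDERS = {
    "phone": {"5555555555"},
    "id_number": {"12345"},
    "address": {"not given", "unknown"},
}


def clean(field: str, value: str) -> Optional[str]:
    """Normalize a value; return None when it is empty or a known placeholder."""
    v = value.strip().lower()
    if field == "phone":
        v = "".join(ch for ch in v if ch.isdigit())
    if not v or v in PLACEHOLDERS.get(field, set()):
        return None
    return v


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Share of whitespace-separated tokens that two strings have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_score(a: dict, b: dict, weights: dict) -> float:
    """Weighted average similarity over fields usable in both records."""
    total = used = 0.0
    for field, weight in weights.items():
        va, vb = clean(field, a.get(field, "")), clean(field, b.get(field, ""))
        if va is None or vb is None:
            continue  # missing or placeholder values carry no evidence
        sim = string_similarity(va, vb) if field == "name" else float(va == vb)
        total += weight * sim
        used += weight
    return total / used if used else 0.0
```

Because placeholder values are dropped before comparison, two unrelated records that both carry (555)-555-5555 are scored on their remaining fields only.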
If an investigator needs to run a background check or search for the name of a subject of interest, running the query in a source or environment controlled by a third party could easily expose their intentions. Therefore, for environments with sensitive content where data sharing is highly restrictive, or where there are concerns about unintentional disclosure or misuse of the data, the ER system should support anonymous matching. This is achieved by applying a hashing function (such as SHA-256) to the names and other sensitive values (id-numbers, addresses, etc.). Anonymous ER works when the different systems use the same hashing function (and/or algorithms), so that identical values produce identical hashes and can be matched without revealing the underlying data.
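A minimal sketch of hash-based anonymous matching follows, using Python's standard `hashlib` and `hmac` modules. The normalization steps, the function names, and the optional shared key are assumptions added for the example: both parties must normalize values identically before hashing, and because names and phone numbers are easy to guess, a keyed hash with a secret agreed between the parties is one common way to make dictionary attacks harder.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Canonicalize a value so equal inputs hash equally on every system."""
    text = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(text.split())  # trim and collapse whitespace


def token(value: str, key: bytes = b"") -> str:
    """SHA-256 hex digest of the normalized value, keyed when a secret is given."""
    data = normalize(value).encode("utf-8")
    if key:
        return hmac.new(key, data, hashlib.sha256).hexdigest()
    return hashlib.sha256(data).hexdigest()


def shared_tokens(ours: list, theirs: set, key: bytes = b"") -> set:
    """Return the tokens of our values that also appear in the other party's set."""
    return {token(v, key) for v in ours} & theirs
```

In this arrangement each party exchanges only the tokens, and a match is detected when the same token appears on both sides.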
Ensuring complete confidence in presenting comprehensive information about a particular entity (whether a person, address, or ID number) from various origins within a unified context is vital for analysis. Increased uniqueness in describing an entity leads to superior ER system outcomes. At the same time, the seamless integration of ER functionalities into data fusion or decision intelligence platforms enhances the overall user experience.
Visit NEXYTE.AI to learn more about NEXYTE, the data fusion and machine learning platform revolutionizing decision intelligence.