Accessing abundant data, including metadata, plays a crucial role in accelerating law enforcement investigations and allowing authorities to quickly adapt to emerging situations and threats. However, the value of the insights gained depends on the data being properly cleansed, enriched, fused, and analyzed. Applying different transformations through ETL (Extract-Transform-Load), ELT (Extract-Load-Transform), and Federated Search is critical to extracting high-quality analytical insights. But what do these transformations look like?
For example, July 4, 1976 (07/04/1976) is like any other date and can be correlated, referenced, or aggregated with other events or transactions happening that same day. Readers based in the US might recognize that it represents the 200th anniversary of the adoption of the US Declaration of Independence. However, on an analytical level there is much more to this date (or any date). Performing basic transformations exposes new dimensions that analytical platforms use to identify specific meanings, patterns, and behaviors.
Known information about July 4, 1976:

| Dimension | Value |
|---|---|
| Month of Year | July |
| Day of Month | 4 |
| Day of Week | Sunday |
| Week of Month | 2 |
| Week of Year | 28 |
| Day of Year | 186 |
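These calendar dimensions are easy to derive programmatically. A minimal sketch using only Python's standard library; note that week-of-year numbering is convention-dependent (ISO weeks start on Monday, while US-style numbering starts weeks on Sunday):

```python
from datetime import date

d = date(1976, 7, 4)
print(d.strftime("%A"))           # Sunday
print(d.timetuple().tm_yday)      # 186
# Week-of-year depends on convention: ISO weeks (Monday-start)
# give 27, while Sunday-start weeks counted from 1 give 28.
print(d.isocalendar()[1])         # 27
print(int(d.strftime("%U")) + 1)  # 28
```

An analytical platform typically standardizes on one convention so that week-level aggregations stay comparable across sources.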
Not only is July 4th a significant US national holiday, but other notable historical events also occurred on this day: West Point was officially opened in 1802, the Philippines gained independence from the US and became a sovereign state in 1946, and the American journalist Geraldo Rivera turned 33 years old (80 today). In other countries, dates and anniversaries are equally important and generally defined in simple look-up tables.
Furthermore, secondary observations can be derived from the data. Being a US national holiday and a Sunday, we know specific types of business establishments, including banks and financial institutions, will be closed. Therefore, a spike in certain activity (e.g., money transfers) on this day may be cause for further investigation. However, in the Middle East, Sunday is the first workday of the week, so different patterns would be expected.
Generally, transforms are categorized into three groups depending on their complexity:
Basic – Simple calculations, substitutions, and formatting (REGEX)
- Uppercasing all text, trimming spaces, converting types
- Formatting dates into YYYY/MM/DD or extracting day-of-week
- Replacing strings such as _AVENUE_ with _AVE_
- Calculating ratios between values: col_x / col_y = col_z
- Trimming a US ZIP+4 code to its five-digit base value
- Rounding lat/longs to five decimal places
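Transforms at this tier need nothing beyond string and numeric primitives. A minimal sketch in Python; the function names are illustrative, not tied to any particular platform:

```python
import re

def basic_clean(text):
    """Uppercase, trim, strip special characters, collapse spaces."""
    text = re.sub(r"[.,'\"\-]", "", text.upper().strip())
    return re.sub(r"\s+", " ", text)

def trim_zip(zip_code):
    """Trim a ZIP+4 code to its five-digit base."""
    return zip_code.strip()[:5]

def round_coord(value, places=5):
    """Round a lat/long to a fixed number of decimal places."""
    return round(value, places)

print(basic_clean("  1600 Pennsylvania Avenue, N.W. "))  # 1600 PENNSYLVANIA AVENUE NW
print(trim_zip("20500-0004"))                            # 20500
print(round_coord(38.8976763))                           # 38.89768
```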
Intermediate – Custom algorithms, heuristics, and reference data
- Performing social security or national ID validation checks (age/issuance/ITIN/death)
- Simple checksum verification algorithms (Luhn / Verhoeff)
- Running phonetic reduction/hashing algorithms (Eudex, Soundex, Ngrams)
- Generating one-way cryptographic hash digests (e.g., SHA-256; MD5 is no longer considered secure)
- Converting Zip or postal codes to lat/longs or military grid reference system (MGRS)
- Calculating the distance between two lat/long points (Haversine Formula)
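As one example from this tier, the Haversine formula fits in a few lines. A sketch in Python; the second coordinate pair (roughly the US Capitol) is chosen purely for illustration:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# White House to the US Capitol: roughly 1.6 miles
d = haversine_miles(38.89767, -77.03655, 38.88975, -77.00902)
```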
Advanced – Often utilize pre-trained machine learning models to define outputs
- Creating relationships between people identified in a photograph
- Defining the make/model, color, and year of a car from a picture
- Identifying weapons, drugs, license plates, and other text/logos in pictures
- Flagging stressful situations and content from body-worn camera videos
- Parsing and normalizing street addresses into a standardized format
- Natural Language Processing (NLP) to identify proper entities and relationships
When loading data, basic cleanups are often run on all the data to trim leading/trailing spaces, remove double spaces, delete special characters (periods, commas, quotes, dashes, etc.), and convert text to uppercase. These transformations are not mutually exclusive, so one transform’s output can be another’s input, and chaining the transforms into a workflow helps ensure consistency. This example shows the transforms applied to an address:
Input: 1600 Pennsylvania Avenue, N.W. Washington, D.C., 20500-0004
First Pass: 1600 PENNSYLVANIA AVENUE NW WASHINGTON DC 205000004
The process continues with standard replacements that map values to a common denominator. In the example, the address contains the word “AVENUE,” and the United States Postal Service lists variations including AV, AVE, AVEN, AVENU, AVN, and AVNUE. These variants should always map to a single defined value, such as AVE. The same goes for other entries such as STREET, PARKWAY, BOULEVARD, ROUTE, etc. Additionally, the +4 extension of the US ZIP code is not needed, since it provides no additional geo-position value (in this case), so the code is trimmed to five digits.
Output: 1600 PENNSYLVANIA AVE NW WASHINGTON DC 20500
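The passes above can be chained into a single workflow. A simplified Python sketch; the suffix map here covers only a handful of USPS variants, where a production table would be far larger:

```python
import re

# Illustrative suffix map; USPS Publication 28 defines the full list.
SUFFIXES = {"AV": "AVE", "AVE": "AVE", "AVEN": "AVE", "AVENU": "AVE",
            "AVENUE": "AVE", "AVN": "AVE", "AVNUE": "AVE",
            "STREET": "ST", "BOULEVARD": "BLVD", "PARKWAY": "PKWY"}

def standardize_address(address):
    # First pass: uppercase, drop punctuation, collapse whitespace
    text = re.sub(r"\s+", " ", re.sub(r"[.,]", "", address.upper())).strip()
    tokens = text.split(" ")
    # Map street-suffix variants to a single canonical value
    tokens = [SUFFIXES.get(t, t) for t in tokens]
    # Trim ZIP+4 to its five-digit base
    tokens = [re.sub(r"^(\d{5})-?\d{4}$", r"\1", t) for t in tokens]
    return " ".join(tokens)

addr = standardize_address(
    "1600 Pennsylvania Avenue, N.W. Washington, D.C., 20500-0004")
# addr == "1600 PENNSYLVANIA AVE NW WASHINGTON DC 20500"
```

Because each step's output feeds the next, ordering matters: punctuation must be stripped before suffix lookup, or “AVENUE,” would miss the table.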
The more standardized the data values, the more consistent the results. Advanced transformation libraries such as libpostal generate standardized address outputs from dirty or erroneous input data. This example is relatively simple, but an address in New York City (20 W 34th St. #123, New York, NY 10001) may cause some parsing inconsistencies. Addresses in other countries, like Japan (150-2345 Tokyo-to, Shibuya-ku, Hommachi 2 choume, 4-7, Sunny Mansion 203), might prove too complicated for basic processing scripts not designed for that format.
The results returned from a geo-position transform don’t change the address but are additive to the overall entity details; new attributes are created as placeholders to hold this derived data, as shown below. Often, these transformations are designed (containerized) for a particular purpose and are embeddable into the data fusion or decision intelligence system.
Lat = 38.89767
Long = -77.03655
MGRS = 18SUJ23400740
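Merging such derived attributes into an entity is then a simple additive update. A hypothetical sketch in Python, where the `geocode` callable stands in for whatever containerized geo-position service is actually deployed:

```python
def enrich_with_geo(entity, geocode):
    """Attach derived geo attributes without altering the source address."""
    lat, lon, mgrs = geocode(entity["address"])
    entity.update({"lat": lat, "long": lon, "mgrs": mgrs})
    return entity

# Stubbed geocoder returning the values shown above
entity = enrich_with_geo(
    {"address": "1600 PENNSYLVANIA AVE NW WASHINGTON DC 20500"},
    lambda _: (38.89767, -77.03655, "18SUJ23400740"),
)
```

Keeping the derived fields separate from the source address preserves the original value for auditing while giving downstream analytics the new dimensions.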
Transformations improve data quality, provide additional dimensions to evaluate, and help standardize the content into consistent and reliable values. All these changes are designed to improve the results generated by an analytical or decision intelligence system. However, a critical piece is still required to tie it all together by merging the different entities created. Our upcoming blog in this series, Entity Resolution: Assembling the Entity Data Puzzle, will focus on best practices for performing entity resolution.
Visit NEXYTE.AI to learn more about NEXYTE, the data fusion and machine learning platform revolutionizing decision intelligence.