geodata.Normalize module
Provides functions to normalize text strings by converting them to lowercase and removing noise words.
These functions are used by the lookup functions, the database build functions, and match scoring.
noise_words is a list of replacements used only for match scoring.
phrase_cleanup is a list of replacements used for database build, lookup, and match scoring.
Module variables
var alias_list
var noise_words
var phrase_cleanup
Functions
def add_aliases_to_database(
geo_files)
def admin1_normalize(
admin1_name, iso)
Normalize historic or colloquial Admin1 names to current geoname standard
def admin2_normalize(
admin2_name, iso)
Normalize historic or colloquial Admin2 names to the standard name
Args:
admin2_name: Admin2 name to normalize
iso: ISO country code
Returns:
Tuple (result, modified): result is the new string, modified is True if the name was changed
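The (result, modified) return convention can be illustrated with a minimal sketch. This is not the module's implementation: the alias table `EXAMPLE_ADMIN2_ALIASES`, its single entry, and the function name `admin2_normalize_sketch` are hypothetical and exist only to show how a caller might unpack the tuple.

```python
from typing import Dict, Tuple

# Hypothetical alias table for illustration only; the module's real data is not shown here.
EXAMPLE_ADMIN2_ALIASES: Dict[str, Dict[str, str]] = {
    "gb": {"county durham": "durham"},
}


def admin2_normalize_sketch(admin2_name: str, iso: str) -> Tuple[str, bool]:
    """Return (result, modified), mirroring the documented return convention."""
    # Look up the per-country replacements; unknown countries have none.
    replacements = EXAMPLE_ADMIN2_ALIASES.get(iso.lower(), {})
    # Fall back to the unchanged input when no alias is known.
    result = replacements.get(admin2_name.lower(), admin2_name)
    return result, result != admin2_name


name, modified = admin2_normalize_sketch("county durham", "gb")
if modified:
    print(f"normalized to {name}")
```

Returning the flag alongside the string lets a caller tell whether a replacement was applied without comparing strings itself.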
def country_normalize(
country_name)
Normalize a local-language country name to the standardized English country name used for lookups
Args:
country_name: Country name to normalize
Returns:
Tuple (result, modified): result is the new string, modified is True if the name was changed
def normalize(
text, remove_commas)
Normalize text: convert from UTF-8 to lowercase ASCII.
Remove commas if remove_commas is set.
Remove all non-alphanumeric characters except $ and *.
Then call _phrase_normalize(), which normalizes common phrases that have multiple spellings, such as saint to st.
Args:
text: Text to normalize
remove_commas: True if commas should be removed
Returns:
Normalized text
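The steps listed for normalize() could be sketched roughly as follows. This is an approximation written from the description above, not the module's code: `normalize_sketch` and `EXAMPLE_PHRASES` are hypothetical names, the phrase table holds only the saint to st example, and the sketch assumes that whitespace and any retained commas survive the non-alphanumeric filter.

```python
import re
import unicodedata

# Hypothetical phrase table; the module's actual phrase_cleanup entries are not shown here.
EXAMPLE_PHRASES = [(r"\bsaint\b", "st")]


def normalize_sketch(text: str, remove_commas: bool) -> str:
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    if remove_commas:
        text = text.replace(",", " ")
    # Keep letters, digits, whitespace, any remaining commas, '$' and '*'.
    text = re.sub(r"[^a-z0-9\s,$*]", "", text)
    # Apply phrase replacements (stand-in for _phrase_normalize()).
    for pattern, replacement in EXAMPLE_PHRASES:
        text = re.sub(pattern, replacement, text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_sketch("Saint Étienne, France", True))   # st etienne france
print(normalize_sketch("Saint Étienne, France", False))  # st etienne, france
```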
def normalize_for_scoring(
text, iso)
Normalize the title used to determine how close a match is.
See normalize() for details.
Also removes noise words such as City Of.
Args:
text: Text to normalize
iso: ISO country code
Returns:
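As a rough illustration of the noise-word removal described for normalize_for_scoring(), the sketch below strips whole phrases from text that has already been lowercased. It is an assumption-laden example: `remove_noise_words_sketch` and `EXAMPLE_NOISE_WORDS` are hypothetical, the list contents are invented for the demonstration, and the iso parameter is left out because its role is not described here.

```python
import re

# Hypothetical noise-word list; the module's real noise_words entries are not listed here.
EXAMPLE_NOISE_WORDS = ["city of", "county of"]


def remove_noise_words_sketch(text: str) -> str:
    """Strip whole-phrase noise words from already-normalized (lowercase) text."""
    for phrase in EXAMPLE_NOISE_WORDS:
        # Word boundaries prevent matching inside longer words.
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    # Collapse the whitespace left where phrases were removed.
    return re.sub(r"\s+", " ", text).strip()


print(remove_noise_words_sketch("city of london"))  # london
```

Removing such words only for scoring means that "city of london" and "london" compare as equal when judging match quality, while the stored and looked-up names are left alone.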
def remove_aliase(
input_words, res_words)