geodata.MatchScore module

Calculate a heuristic score for how well a result place name matches a target place name.

Functions

def full_normalized_title(place)

def is_street(text)

def remove_if_input_empty(target_tokens, res_tokens)
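
These helpers are not documented on this page beyond their signatures. A minimal usage sketch, assuming is_street() returns a boolean and remove_if_input_empty() clears result tokens whose corresponding target token is empty (both assumptions inferred from the names, not confirmed here):

    # Hypothetical usage; the semantics below are inferred from the names only.
    from geodata.MatchScore import is_street, remove_if_input_empty

    print(is_street('123 main st'))        # assumed True/False street detection

    # Assumed behaviour: result tokens whose matching target token is empty are
    # cleared so terms the user never entered do not count as mismatches.
    target_tokens = ['', 'paris', '', 'ile de france', 'france']
    result_tokens = ['', 'paris', 'paris', 'ile de france', 'france']
    remove_if_input_empty(target_tokens, result_tokens)
    print(result_tokens)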

Classes

class MatchScore

Calculate a heuristic score for how well a result place name matches a target place name. The score is based on the percent of characters that didn't match, plus other items described in match_score().

Methods

def __init__(self)

Initialize self. See help(type(self)) for accurate signature.

def match_score(self, target_place, result_place)

Calculate a heuristic score for how well a result place name matches a target place name. The score is based on the percent of characters that didn't match in the input and output (plus the other items described below). The mismatch score is 0-100%, reflecting the percent mismatch between the user input and the result. This is then adjusted by Feature type (a large city gives the best score) plus other items to give a final heuristic where -10 is a perfect match of a large city and 100 is no match.

A) Heuristic:
1) Create a 5-part title (prefix, city, county, state/province, country)
2) Normalize the text - Normalize.normalize_for_scoring()
3) Remove sequences of 2 or more characters that match in both target and result
4) Calculate in_score - the percent of characters in the input that didn't match the result, weighted by term (city, county, state/province, country).
        An exact match on the city term gets a bonus
5) Calculate the result score (out_score) - the percent of characters in the DB result that didn't match the input (steps 3-5 are sketched below)
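
A simplified, stand-alone sketch of steps 3-5. It approximates the "remove matched runs of 2+ characters, score what is left" idea with difflib; it is not the library's actual implementation and ignores normalization and per-term weighting:

    # Approximation of steps 3-5 using difflib; illustration only.
    from difflib import SequenceMatcher

    def percent_unmatched(target: str, result: str) -> float:
        """Percent of characters in target not covered by matching runs of 2+ chars."""
        if not target:
            return 0.0
        blocks = SequenceMatcher(None, target, result).get_matching_blocks()
        matched = sum(b.size for b in blocks if b.size >= 2)
        return 100.0 * (len(target) - matched) / len(target)

    # The whole input matches, so the input-side score is 0; the extra result
    # text only raises the result-side score.
    print(round(percent_unmatched('alberta', 'alberta, canada')))   # 0
    print(round(percent_unmatched('alberta, canada', 'alberta')))   # 53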

B) Score components (all are weighted into the final score; one way they might combine is sketched after this list):
in_score - (0-100) - percent of characters in the input that didn't match the output
out_score - (0-100) - percent of characters in the output that didn't match the input
feature_score - (0-100) - more important features get a lower score.
A city with a population of 1M scores zero; a valley scores 100. See Geodata.feature_priority().
wildcard_penalty - the score is raised by a penalty if the query includes a wildcard
prefix_penalty - the score is raised in proportion to the length of the prefix
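
A hedged sketch of how these components might be combined; the weights, penalty values, and the exact formula here are assumptions, since the real combination lives inside match_score():

    # Illustration only: a plausible weighted combination, not the real formula.
    def combine(in_score, out_score, feature_score,
                prefix_penalty=0.0, wildcard_penalty=0.0, exact_city_bonus=0.0,
                result_weight=0.2, feature_weight=0.1, prefix_weight=0.1):
        # in_score dominates; the other components nudge the total up or down.
        return (in_score
                + out_score * result_weight
                + feature_score * feature_weight
                + prefix_penalty * prefix_weight
                + wildcard_penalty
                - exact_city_bonus)

    # A perfect large-city match (all components 0) plus an exact-city bonus
    # can dip below zero, consistent with the documented -10 "perfect" score.
    print(combine(0, 0, 0, exact_city_bonus=10))   # -10
    print(combine(100, 100, 100))                  # 130.0 (uncalibrated sketch)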

C) A standard text-difference measure, such as Levenshtein distance, was not used because such measures treat both strings as equally important, whereas this heuristic treats the user's text as more important than the DB result text and also weights each token. A user's text might commonly be something like "Paris, France", with a DB result of "Paris, Paris, Ile De France, France". The Levenshtein distance between those would be large, but with this heuristic the middle terms can be given lower weights, and having all of the input matched can be weighted more heavily than mismatches on the county and province. This heuristic gives a score of -9 for "Paris, France".

Args:

target_place: Loc with the user's entry.
result_place: Loc with the DB result.

Returns:

score - the heuristic match score (lower is better)
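
A usage sketch for the Paris example above. The Loc import path and how its fields are populated are assumptions; see geodata.Loc for the real interface:

    # Hypothetical usage; Loc construction details are assumed, not documented here.
    from geodata.Loc import Loc
    from geodata.MatchScore import MatchScore

    scorer = MatchScore()

    target_place = Loc()    # user's entry, e.g. "Paris, France"
    result_place = Loc()    # DB result,  e.g. "Paris, Paris, Ile De France, France"
    # ... fill in the Loc fields (prefix, city, county, state/province, country)
    #     as the geodata.Loc API requires ...

    score = scorer.match_score(target_place, result_place)
    print(score)            # lower is better; about -9 for the Paris example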

def set_weighting(self, token_weight, prefix_weight, feature_weight, result_weight)

Set the weighting of the scoring components. See match_score for details of the weighting. All weights are positive.

Args:

token_weight: list of weights for County, State/Province, and Country relative to City (City is 1.0)
prefix_weight: weighting for the prefix score
feature_weight: weighting for the Feature match score
result_weight: weighting for the percent of the DB result that didn't match the target

Returns:

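An illustrative call, assuming the weights are plain floats (the specific numbers below are made up, not library defaults):

    from geodata.MatchScore import MatchScore

    scorer = MatchScore()
    # token_weight lists County, State/Province, Country weights relative to
    # City (City is fixed at 1.0); all values here are for illustration only.
    scorer.set_weighting(token_weight=[0.2, 0.4, 0.8],
                         prefix_weight=1.5,
                         feature_weight=0.1,
                         result_weight=0.2)
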
Instance variables

var feature_weight

var input_weight

var logger

var prefix_weight

var result_weight

var score_diags

var token_weight

var wildcard_penalty

class Score

Class variables

var GOOD

var POOR

var VERY_GOOD

var VERY_POOR
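
These look like quality thresholds for interpreting a match score. A hedged sketch, assuming they are numeric cutoffs and that lower scores mean better matches (the comparison direction is an assumption, not documented here):

    # Assumes Score's class variables are numeric cutoffs; lower score = better.
    from geodata.MatchScore import Score

    def describe(score):
        if score <= Score.VERY_GOOD:
            return 'very good'
        if score <= Score.GOOD:
            return 'good'
        if score <= Score.POOR:
            return 'poor'
        return 'very poor'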