Usage¶
The Kiez class enables the usage of different nearest neighbor (NN) algorithms and different hubness reduction techniques. There are several ways to tell kiez what you want to use:
from kiez import Kiez
# via string and arguments as dict
k_inst = Kiez(
algorithm="SklearnNN",
n_candidates=10,
hubness="LocalScaling",
hubness_kwargs={"method": "NICDM"},
)
from kiez.hubness import LocalScaling
# via class and arguments as dict
from kiez.neighbors import HNSW
k_inst = Kiez(
algorithm=SklearnNN,
n_candidates=10,
hubness=LocalScaling,
hubness_kwargs={"method": "NICDM"},
)
# via initialized object
hr = LocalScaling(method="NICDM")
nn_algo = HNSW(n_candidates=10)
k_inst = Kiez(algorithm=nn_algo, hubness=hr)
# You can also initalize Kiez via a json file
# content of 'conf.json' file
# {
# "algorithm": "SklearnNN",
# "algorithm_kwargs": {
# "n_candidates": 10
# },
# "hubness": "LocalScaling",
# "hubness_kwargs": {
# "method": "NICDM"
# }
# }
>>> kiez = Kiez.from_path("conf.json")
With your initialized kiez instance you are ready to fit your data and retrieve the k nearest neighbors utilizing hubness reduction:
# create example data
import numpy as np
rng = np.random.RandomState(0)
source = rng.rand(100,50)
target = rng.rand(100,50)
k_inst.fit(source, target)
neigh_dist, neigh_ind = k_inst.kneighbors(5)
This will retrieve all nearest neighbors of source entities in the target entities.
Single source case¶
While the main focus of kiez is to be part of an embedding-based entity resolution process between two data sources, it can also be used to query a single data source:
# initialize your kiez instance as before
# ... and then fit a single source
k_inst.fit(source)
# get the nearest neighbors of all source entities amongst themselves
k_inst.kneighbors()
Evaluation¶
If you have gold standard matches for your entity resolution task you can calculate the hits@k:
from kiez.evaluate import hits
import numpy as np
# small example with toy nearest neighbor result
nn_ind = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]])
gold = {0: 2, 1: 4, 2: 3, 3: 4}
hits_result = hits(nn_ind, gold)
print(hits_result)
{1: 0.5, 5: 1.0, 10: 1.0}
The default result gives you the results for hits@{1,5,10}. But you can specify the ones you want:
hits_result = hits(nn_ind, gold,k=[5])
print(hits_result)
{5: 1.0}