kiez.neighbors.SklearnNN

class kiez.neighbors.SklearnNN(n_candidates=5, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)[source]

Wrapper for scikit-learn’s NearestNeighbors class.

Parameters:
  • n_candidates (int) – number of nearest neighbors used in search

  • algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') –

    Algorithm used to compute the nearest neighbors:

    • 'ball_tree' will use sklearn.neighbors.BallTree

    • 'kd_tree' will use sklearn.neighbors.KDTree

    • 'brute' will use a brute-force search.

    • 'auto' will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.

  • leaf_size (int, default=30) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • metric (str, default='minkowski') – Distance measure used in the search. The default is minkowski with p=2, which is equivalent to the Euclidean distance. Valid measures are listed in SklearnNN.valid_metrics.

  • p (int, default=2) – Parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

  • metric_params (dict, default=None) – Additional keyword arguments for the metric function.

  • n_jobs (int, default=None) – The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
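
To make the roles of p, n_candidates and the 'brute' algorithm concrete, here is a minimal pure-Python sketch of a Minkowski distance and a brute-force neighbor search. The helper names minkowski and brute_force_neighbors are hypothetical and are not part of kiez or scikit-learn; the wrapped scikit-learn implementation is vectorized and considerably faster.

```python
def minkowski(a, b, p=2):
    """Minkowski (l_p) distance between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def brute_force_neighbors(query, points, k=5, p=2):
    """Return (distance, index) pairs for the k points closest to query.

    Hypothetical helper: compares the query against every point,
    which is the idea behind a brute-force search.
    """
    scored = sorted((minkowski(query, pt, p), i) for i, pt in enumerate(points))
    return scored[:k]


points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]

# p=1 gives the Manhattan (l1) distance, p=2 the Euclidean (l2) distance.
manhattan = minkowski((0.0, 0.0), (3.0, 4.0), p=1)
euclidean = minkowski((0.0, 0.0), (3.0, 4.0), p=2)

# The two points closest to the origin, nearest first.
nearest = brute_force_neighbors((0.0, 0.0), points, k=2)
```

The tree-based options ('ball_tree', 'kd_tree') aim to return the same neighbors while avoiding the comparison against every point, which is where leaf_size comes into play.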

Notes

See also scikit-learn’s guide: https://scikit-learn.org/stable/modules/neighbors.html#unsupervised-neighbors


Methods

__init__([n_candidates, algorithm, ...])

fit(source[, target, only_fit_target])

Indexes the given data using the underlying algorithm.

kneighbors([k, query, s_to_t, return_distance])

Attributes

valid_metrics

fit(source, target=None, only_fit_target: bool = False)

Indexes the given data using the underlying algorithm.

Parameters:
  • source (matrix of shape (n_samples, n_features)) – embeddings of source entities

  • target (matrix of shape (m_samples, n_features), default=None) – embeddings of target entities, or None in a single-source use case

  • only_fit_target (bool, default=False) – If True, only the target is indexed. This causes problems with many hubness reduction methods, so it should mainly be used for search without hubness reduction.

Raises:

ValueError – If source and target have a different number of features
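
The condition behind this ValueError can be illustrated with a small sketch. The function check_feature_dims below is a hypothetical stand-in written for illustration only; it is an assumption about what such a check might look like, not the code kiez actually runs.

```python
def check_feature_dims(source, target=None):
    """Return the shared number of features, or raise ValueError.

    Hypothetical helper: source and target are sequences of rows,
    mirroring the (n_samples, n_features) and (m_samples, n_features)
    shapes described above. With target=None (single-source use case)
    the source is compared against itself.
    """
    if target is None:
        target = source
    n_source_features = len(source[0])
    n_target_features = len(target[0])
    if n_source_features != n_target_features:
        raise ValueError(
            f"source has {n_source_features} features, "
            f"target has {n_target_features}"
        )
    return n_source_features


# Matching feature counts: the number of samples may differ freely.
source = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target = [[0.9, 0.8], [0.7, 0.6]]
n_features = check_feature_dims(source, target)

# Mismatched feature counts trigger the error.
try:
    check_feature_dims(source, [[1.0, 2.0, 3.0]])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Only the second dimension has to agree; n_samples and m_samples are independent of each other.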