kiez.analysis.estimation

Estimate hubness in datasets.

Functions

hubness_score(nn_ind, target_samples, *[, ...])

Calculate hubness scores from given neighbor indices.

kiez.analysis.estimation.hubness_score(nn_ind: ndarray, target_samples: int, *, k: Optional[int] = None, hub_size: float = 2.0, verbose: int = 0, return_value: str = 'all_but_gini', store_k_occurrence: bool = False) → Union[float, dict]

Calculate hubness scores from given neighbor indices.

Utilizes findings from [1] and [2].

Parameters:
  • nn_ind (np.ndarray) – Neighbor index matrix

  • target_samples (int) – Number of entities in the target space

  • k (int) – Number of nearest neighbors (the k in k-nearest neighbors)

  • hub_size (float) – Hubs are defined as objects with k-occurrence > hub_size * k.

  • verbose (int) – Level of output messages

  • return_value (str) – Hubness measure to return. By default, all measures except Gini are returned, because Gini is slow on large datasets. Use “all” to return a dict of all available measures, or check kiez.analysis.VALID_HUBNESS_MEASURE for the names of individual measures.

  • store_k_occurrence (bool) – Whether to save the k-occurrence. Requires O(n_test) memory.
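
The hub definition given for hub_size can be illustrated with plain NumPy. This is a minimal sketch and not kiez's implementation: the helper names k_occurrence and find_hubs are hypothetical, and inferring k from the width of nn_ind is an assumption made for the example.

```python
import numpy as np


def k_occurrence(nn_ind: np.ndarray, target_samples: int) -> np.ndarray:
    """Count how often each target index appears in the neighbor lists."""
    return np.bincount(nn_ind.astype(int).ravel(), minlength=target_samples)


def find_hubs(nn_ind: np.ndarray, target_samples: int, hub_size: float = 2.0) -> np.ndarray:
    """Return target indices whose k-occurrence exceeds hub_size * k."""
    k = nn_ind.shape[1]  # assumption: k equals the number of columns in nn_ind
    k_occ = k_occurrence(nn_ind, target_samples)
    return np.flatnonzero(k_occ > hub_size * k)


# Toy data: 6 source entities, 5 target entities, k = 2
nn_ind = np.array([[0, 1], [0, 2], [0, 3], [0, 1], [0, 2], [1, 2]])
k_occ = k_occurrence(nn_ind, target_samples=5)
hubs = find_hubs(nn_ind, target_samples=5, hub_size=2.0)
print(k_occ)  # target 4 is never a neighbor, so its count is 0
print(hubs)   # only targets with k-occurrence > 2.0 * 2 = 4
```

In this toy data only target 0 appears in more than hub_size * k = 4 neighbor lists, so it is the only hub.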

Returns:

hubness_measure – The hubness measure selected by return_value. With the default ‘all_but_gini’ or with ‘all’, a dict of hubness measures is returned.

Return type:

float or dict

Raises:

ValueError – If nn_ind has the wrong type

References

Examples

>>> from kiez import Kiez
>>> from kiez.analysis import hubness_score
>>> import numpy as np
>>> # create example data
>>> rng = np.random.RandomState(0)
>>> source = rng.rand(100,50)
>>> target = rng.rand(100,50)
>>> # fit and get neighbors
>>> k_inst = Kiez()
>>> k_inst.fit(source, target)
>>> nn_ind = k_inst.kneighbors(return_distance=False)
>>> # get hubness
>>> # target_samples is the number of target entities (rows of target)
>>> hub_score = hubness_score(nn_ind, target.shape[0])
>>> hub_score["robinhood"]
    0.31
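
The "robinhood" entry read above presumably refers to the Robin Hood (Hoover) index of the k-occurrence distribution. That reading is an assumption based on the measure's name, and kiez's own computation may differ in detail. Under that assumption, a minimal sketch of the standard index looks like this (robin_hood_index is a hypothetical helper, not part of kiez):

```python
import numpy as np


def robin_hood_index(k_occ) -> float:
    """Share of all k-occurrences that would have to be redistributed
    to make every target equally often a nearest neighbor."""
    k_occ = np.asarray(k_occ, dtype=float)
    return float(0.5 * np.abs(k_occ - k_occ.mean()).sum() / k_occ.sum())


uniform = robin_hood_index([2, 2, 2, 2])  # every target is a neighbor equally often
skewed = robin_hood_index([8, 0, 0, 0])   # one target absorbs all neighbor slots
print(uniform, skewed)
```

A value of 0 means no hubness under this measure, and values approaching 1 mean that a few targets dominate the neighbor lists.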