Approximating distances between prototypes. By which Nearest Neighbour (NN)

 

Approximating
and Eliminating Search Algorithm (AESA) makes use of metric
properties of given distance. The algorithm consists of (1) an efficient
elimination rules and (2) an appropriate search for available rules by which maximum
efficiency is maintained. For N prototypes, the it uses precompution of triangular
array of (N2-N)/2 distances between prototypes. By which  Nearest Neighbour (NN) Search is carried out
through a very small number of distance computations. Furthermore, for large N,
this number of computations, tends to be independent of N (asymptotic constant
time-complexity). However, the great storage complexity and preprocessing time
(O (N2)), severely limit the practical use of the AESA for large
sets of prototypes.

Reference: E. Vidal,
“An algorithm for finding nearest neighbors in (approximately) constant average
time,” Pattern Recog. Lett., vol. 4, no. 3, pp. 145–157, 1986.

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!


order now

 

LAESA (Linear AESA) a new version of the AESA, which uses a linear array of
distance, but is strictly based on metric arguments like the original AESA. The
procedure starts by selecting “Base Prototypes” (BP) from the given
set of prototypes which is relatively a small set there by computing the
distances between these BP’s and the completing set of prototypes.

Reference: L. Mico,
J.Oncina, E. Vidal An algorithm for finding nearest neighbours in constant
average time with a linear space complexity Proceedings, 11th IAPR International
Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition
Methodoly and Systems Year; 1992

 

Tree LAESA (TLAESA) is based on the LAESA which reduces its
average time complexity to sublinear. The TLAESA algorithm consists of two
parts: (1) preprocessing and (2) search. In the preprocessing algorithm, two
structures are built: a binary search tree storing all prototypes and a table of
distances. The search algorithm is essentially a traversal of the binary tree
where the table of distances is used in order to avoid the exploration of some branches
of the tree.

Reference: L.
Mico, J. Oncina, and R. C. Carrasco, “A fast branch & bound nearest
neighbour classifier in metric spaces,” Pattern Recog. Lett., vol. 17, no. 7,
pp. 731–739, 1996.

 

BAESA

 

Approximating
k-LAESA
(Ak-LAESA) is a fast classifier for general metric
spaces (no vector space required) based on the LAESA algorithm The aim is to
achieve classification rates similar to those of a k-NN classifier, using k
neighbours that may not be the k-NNs and preserving the main properties of
LAESA (distance computations and time and space complexities). It obtains error
rates very close to those of a k-NN classifier, while calculating a much lower
number of distances (the number of distances of the Ak-LAESA algorithm is
exactly the same that the computed by the LAESA algorithm).

Reference: A
modification of the LAESA algorithm for approximated k-NN classification,
Pattern Recognition Letters, Vol. 24, Issue: 1-3, January 2003,
Pages: 47-53

 

Fixed
Queries Trees
(FQ Trees) are an evolution where the same pivot
is used for all the nodes of the same level of the tree. In this case the pivot
does not need to belong to the subtree. The main goal of this structure is to
minimize the number of element comparisons, as opposed to other data structure
operations (such as tree traversal). This is especially important in
applications where computing distances between two elements is a much more
expensive operation than, say following pointers. This tree structure differs
from other tree structures where the keys on each level of the tree are all the
same, so we have just one key per level. In other words, which comparisons we
make does not depend on the results of previous comparisons up the tree. Thus
the name fixed-queries tree or FQ-tree.  A major strength of FQ-trees is that the data
structure and the search algorithms are independent of the distance function,
as long as it satisfies the triangle inequality.

Reference: R.
Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries
trees. In Proc. CPM’94, LNCS 807, pages 198–212, 1994.

 

Fixed
Height fq-tree
(fhq-tree) A variant called FQ Tree where all the
leaves are at the same depth h, regardless of the bucket size

Reference:
Ricardo Baeza-Yates Gonzalo Navarro “Fast
approximate string matching in a dictionary” Published in: String
Processing and Information Retrieval: A South American Symposium, 1998.
Proceedings IEEE

 

Fixed
Queries Array (FQA), has two interesting properties. First, it is the first data
structure which is able to achieve a sub linear (in the database size) number
of side computations without using any extra space. Second, it is able to trade
number of pivots k for their precision, so as to optimize the usage of the
available space. FQA
are based in a common idea: k pivots are selected and each object is mapped to
k coordinates which are its distances to the pivots. Later, the query q is also
mapped and if it differs from an object in more than r along some coordinate
then the element is filtered out by the triangle inequality. That is, if for
some pivot pi and some element v of the set it holds |d(q,
pi) ? d(v, pi)| > r, then we
know that d(q, v) > r without need to evaluate d(v, q). The elements that
cannot be filtered out using this rule are directly compared.

Reference: E. Ch´avez, J.L. Marroquin, and G.
Navarro. Fixed queries array: A fast and economical data structure for
proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135,
2001.

 

Sparse Spatial
Selection (SSS),
a new pivot-based technique. The main characteristic of this method is that it
guarantees a good pivot selection more efficiently. In addition, SSS adapts
itself to the dimensionality of the metric space we are working with, without
being necessary to specify in advance the number of pivots to use. Furthermore,
SSS is dynamic, that is, it is capable to support object insertions in the
database efficiently, it can work with both continuous and discrete distance
functions, and it is suitable for secondary memory storage.

The main
contribution of the SSS is the use of a new pivot selection strategy which
generates a comparatively small number of good-quality pivots.
The main contribution of SSS is the generating a number of pivots that depends
on the intrinsic dimensionality of the space (something interesting from both
the theoretical and practical points of view.

SSS
can be extended to deal with more complex metric spaces where the distances
among subsets of objects depend on specific dimensions that are not relevant
for other set of objects. That is, in some applications areas, objects can be
grouped in different clusters with different associated metric spaces, all of
them nested into the general metric space that explains the distances among
clusters. To deal with these complex spaces we propose the extension of SSS
becoming Sparse Spatial Selection for Nested Metric Spaces (SSSNMS)

Reference: Spatial Selection of Sparse Pivots for Similarity
Search in Metric Spaces, JCS&T Vol. 7 No. 1, April
2007

 

Vantage Point
Tree (VP Tree) Like the
KD-Tree, each VP-Tree node cuts/divides the space. A VP-Tree node employs distance
from a selected vantage point rather than using coordinate values. Near points
make up the left/inside subspace while the right/outside subspace consists of
far points. A binary tree is formed by recursion. Each of the tree nodes
identifies a vantage point, and for its children (left/right), the node
contains bounds associated to subspace by the vantage point. Other forms of the
VP-Tree include additional subspace bounds and may employ buckets near leaf
level.

 

VPS-Tree an element of S is compared with the vantage point
belonging to each of its ancestors in the tree. This information is also not
captured in the simple tree. It consists of a database element identifier ‘id’,
and a list ‘hist’ of distances from the item to each vantage point tracing back
to the root. A list of these structures is initialized with the history set to
null from the entire database. The algorithm then recursively splits the list
into two lists L and R, while adding to the history of each item. Each resulting
tree node contains a list ‘bnds’ which have a range interval for its
corresponding subspace which is seen by each ancestral vantage point.

 

VPSB-Tree Here we consider one way to form buckets of leaves
in order to save space while preserving the notion of ancestral history present
in the VPS-Tree. Buckets are formed by collapsing subtrees near leaf
level into a flat structure. Each bucket contains say no element records. Each
record must specify an id, and in addition holds distance entries for every
ancestor. We call the resulting structure a VPSB-Tree.

Reference: P. N.
Yianilos, “Data structures and algorithms for nearest neighbor search in
general metric spaces,” in Proc. Annu. ACM-SIAM Symp. Discrete Algorithms,
1993, pp. 311–321.

 

Multi-Vantage
Point Tree
(MVP Tree) A distance based index structure called
multi-vantage point (MVP) tree for similarity queries on high-dimensional metric
spaces. The MVP Tree uses more than one vantage point to partition the space
into spherical cuts at each level. It also utilizes the pre-computed (at
construction time) distances between the data points and the vantage points.

Reference: T. Bozkaya and M. Ozsoyoglu. Distance-based
indexing for high-dimensional metric spaces. In Proc. SIGMOD’97, pages 357–368, 1997.
Sigmod Record 26(2).

 

Grid
Files A file data structure designed to
manage a disk allocation storage in terms of fixed size units like disk blocks,
pages or buckets depending on level of description. It is a variant of grid
method where the requirement is cell division lines must
be equidistant. Hashing method for multidimensional points. It is an extension
of extensible hashing.

Multipaging is similar to the grid file. It
uses a directory called axial arrays which is in the form of linear
scales instead of using a grid director. There are two variants of multipaging.
In dynamic multipaging, performance is controlled by setting a bound on the
probe factor which is defined as the average number of pages accessed in a
probe. In static multipaging, performance is controlled by setting a bound in
the load factor, or the average page occupancy.

Reference: A Survey on
Multidimensional Access Methods Hee-Kap Ahn Nikos Mamoulis Ho Min Wong
UU-CS-2001-14 May, 2001

 

EXCELL
method (Extendible CELL) related to
the grid file. It is a bintree with a directory in the form of an array
providing access by address computation. It can also be viewed as an adaptation
of extendible hashing to multidimensional point data. In contrast to the grid
file, where the partitioning hyperplanes may be spaced arbitrarily, the EXCELL
method decomposes the space regularly
ad all grid cells are of equal size.

Two-Level Grid
File. The basic idea of it is to use a second
grid file to manage the grid directory. The first level is called the root
directory. Entries of the root directory contains pointers to the directory
pages of the lower level, which in turn contain pointers to the data pages. By having
a second level, splits are often confined to the subdirectory regions without
affecting too much of their surroundings.

Twin
Grid File It is other hashing method. It tries to
increase space utilization compared to the original grid file by introducing a
second grid file. The relationship between these two grid files is not hierarchical
but somewhat more balanced. Both grid files span the whole space (universe).
The distribution of the data among the two files is performed dynamically.

Quadtrees
are one of the first data structures for higher-dimensional data. A quadtree is a
rooted tree in which every internal node has four children. Every node in the quadtree
corresponds to a square. If a node v has children, then their
corresponding squares are the four quadrants of the square of v ¾ hence
the name of the tree. This implies that the squares of the leaves together form
a subdivision of the square of the root. We call this subdivision the quadtree
subdivision. The children of the root are labeled NE, NW, SW, and SE to
indicate to which quadrant they belong to; NE stands for the north-east
quadrant, NW for the north-west quadrant, and so on.