clusterdbscan

density-based algorithm for clustering data

since r2021a

description

clusterdbscan clusters data points belonging to a p-dimensional feature space using the density-based spatial clustering of applications with noise (dbscan) algorithm. the clustering algorithm assigns points that are close to each other in feature space to a single cluster. for example, a radar system can return multiple detections of an extended target that are closely spaced in range, angle, and doppler. clusterdbscan assigns these detections to a single detection.

the dbscan algorithm assumes that clusters are dense regions in data space separated by regions of lower density and that all dense regions have similar densities.
to measure density at a point, the algorithm counts the number of data points in a neighborhood of the point. a neighborhood is a p-dimensional ellipse (hyperellipse) in the feature space. the radii of the ellipse are defined by the p-vector ε. ε can be a scalar, in which case, the hyperellipse becomes a hypersphere. distances between points in feature space are calculated using the euclidean distance metric. the neighborhood is called an ε-neighborhood. the value of ε is defined by the epsilon property. epsilon can either be a scalar or p-vector:
- a vector is used when different dimensions in feature space have different units.
- a scalar applies the same value to all dimensions.
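The difference between a scalar and a vector ε can be illustrated with a short base-MATLAB sketch. It assumes that membership in the hyperellipse is tested with a Euclidean distance computed after dividing each feature by its radius. That is one natural reading of the description above, not necessarily how the object implements it. The data and variable names are made up for illustration.

```matlab
% Illustrative data: column 1 is range (m), column 2 is Doppler (Hz).
X = [10 15; 11 15; 10 19; 14 15];
p = X(1,:);                 % point whose neighborhood is tested

% Vector epsilon: one radius per feature dimension (hyperellipse).
epsVec = [2 5];
inEllipse = sum(((X - p)./epsVec).^2, 2) <= 1;

% Scalar epsilon: the same radius in every dimension (hypersphere).
epsScalar = 2;
inSphere = sum(((X - p)./epsScalar).^2, 2) <= 1;

% One row per point: [inEllipse inSphere]
disp([inEllipse inSphere])
```

The third point lies 4 Hz away in Doppler, so it falls inside the ellipse with a Doppler radius of 5 but outside the sphere of radius 2.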
clustering starts by finding all core points. if a point has a sufficient number of points in its ε-neighborhood, the point is called a core point. the minimum number of points required for a point to become a core point is set by the minnumpoints property.
the remaining points in the ε-neighborhood of a core point can be core points themselves. if not, they are border points. all points in the ε-neighborhood are called directly density reachable from the core point.
if the ε-neighborhood of a core point contains other core points, the points in the ε-neighborhoods of all the core points merge together to form a union of ε-neighborhoods. this process continues until no more core points can be added.
- all points in the union of ε-neighborhoods are density reachable from the first core point. in fact, all points in the union are density reachable from all core points in the union.
- all points in the union of ε-neighborhoods are also termed density connected even though border points are not necessarily reachable from each other. a cluster is a maximal set of density-connected points and can have an arbitrary shape.
points that are not core or border points are noise points. they do not belong to any cluster.
the clusterdbscan object can estimate ε using a k-nearest neighbor search, or you can specify values. to let the object estimate ε, set the epsilonsource property to 'auto'.
the clusterdbscan object can disambiguate data containing ambiguities. range and doppler are examples of possibly ambiguous data. set the enabledisambiguation property to true to disambiguate data.

to cluster detections:

create the clusterdbscan object and set its properties.
call the object with arguments, as if it were a function.

to learn more about how system objects work, see what are system objects?

creation

syntax

clusterer = clusterdbscan

clusterer = clusterdbscan(name,value)

description

clusterer = clusterdbscan creates a clusterdbscan object, clusterer, with default property values.

clusterer = clusterdbscan(name,value) creates a clusterdbscan object, clusterer, with each specified property name set to the specified value. you can specify additional name-value pair arguments in any order as (name1,value1,...,namen,valuen). any unspecified properties take default values. for example,

clusterer = clusterDBSCAN('MinNumPoints',3,'Epsilon',2, ...
    'EnableDisambiguation',true,'AmbiguousDimension',[1 2]);

creates a clusterer with the enabledisambiguation property set to true and the ambiguousdimension set to [1,2].

properties

unless otherwise indicated, properties are nontunable, which means you cannot change their values after calling the object. objects lock when you call them, and the release function unlocks them.

if a property is tunable, you can change its value at any time.

for more information on changing property values, see the system objects documentation.

`epsilonsource` — source of epsilon
`'property'` (default) | `'auto'`

source of epsilon values defining an ε-neighborhood, specified as 'property' or 'auto'.

when you set the epsilonsource property to 'property', ε is obtained from the epsilon property.
when you set the epsilonsource property to 'auto', ε is estimated automatically using a k-nearest neighbor (k-nn) search over a range of k values from k_min to k_max.

$\begin{array}{l} k_{\min} = \text{minnumpoints} - 1 \\ k_{\max} = \text{maxnumpoints} - 1 \end{array}$
the subtraction of one is needed because the number of neighbors of a point does not include the point itself, whereas minnumpoints and maxnumpoints refer to the total number of points in a neighborhood.

data types: char | string

`epsilon` — radius for neighborhood search
`10.0` (default) | positive scalar | positive, real-valued 1-by-p row vector

radius for a neighborhood search, specified as a positive scalar or positive, real-valued 1-by-p row vector. p is the number of features in the input data, x.

epsilon defines the radii of an ellipse around any point to create an ε-neighborhood. when epsilon is a scalar, the same radius applies to all feature dimensions. you can apply different epsilon values for different features by specifying a positive, real-valued 1-by-p row vector. a row vector creates a multidimensional ellipse (hyperellipse) search area, useful when the data features have different physical meanings, such as range and doppler. see estimate epsilon for more information about this property.

you can use the clusterdbscan.discoverclusters or clusterdbscan.estimateepsilon object functions to help estimate a scalar value for epsilon.

example: [11 21.0]

tunable: yes

dependencies

to enable this property, set the epsilonsource property to 'property'.

data types: double

`minnumpoints` — minimum number of points required for cluster
`3` (default) | positive integer

minimum number of points in an ε-neighborhood of a point for that point to become a core point, specified as a positive integer. see choosing the minimum number of points for more information. when the object automatically estimates epsilon using a k-nn search, the starting value of k (k_min) is minnumpoints - 1.

example: 5

data types: double

`maxnumpoints` — set end of k-nn search range
`10` (default) | positive integer

set end of k-nn search range, specified as a positive integer. when the object automatically estimates epsilon using a k-nn search, the ending value of k (k_max) is maxnumpoints - 1.

example: 13

dependencies

to enable this property, set the epsilonsource property to 'auto'.

data types: double

`epsilonhistorylength` — length of cluster threshold epsilon history
`10` (default) | positive integer

length of the stored epsilon history, specified as a positive integer. when set to one, the history is memory-less, meaning that each epsilon estimate is immediately used and no moving-average smoothing occurs. when greater than one, epsilon is averaged over the history length specified.

example: 5

dependencies

to enable this property, set the epsilonsource property to 'auto'.

data types: double

`enabledisambiguation` — enable disambiguation of dimensions
`false` (default) | `true`

switch to enable disambiguation of dimensions, specified as false or true. when true, clustering can occur across boundaries defined by the input amblims at execution. use the ambiguousdimension property to specify the column indices of x in which ambiguities can occur. you can disambiguate up to two dimensions. turning on disambiguation is not recommended for large data sets.

data types: logical

`ambiguousdimension` — indices of ambiguous dimensions
`1` (default) | positive integer | 1-by-2 vector of positive integers

indices of ambiguous dimensions, specified as a positive integer or 1-by-2 vector of positive integers. this property specifies the column of x in which to apply disambiguation. a positive integer indicates a single ambiguous dimension in the input data matrix x. a 1-by-2 row vector specifies two ambiguous dimensions. the size and order of ambiguousdimension must be consistent with the object input amblims.

example: [3 4]

dependencies

to enable this property, set the enabledisambiguation property to true.

data types: double

usage

syntax

idx = clusterer(x)

[idx,clusterids] = clusterer(x)

[___] = clusterer(x,amblims)

[___] = clusterer(x,update)

[___] = clusterer(x,amblims,update)

description

idx = clusterer(x) clusters the points in the input data, x. idx contains a list of ids identifying the cluster to which each row of x belongs. noise points are assigned as '–1'.

[idx,clusterids] = clusterer(x) also returns an alternate set of cluster ids, clusterids, for use in the phased.rangeestimator and phased.dopplerestimator objects. clusterids assigns a unique id to each noise point.

[___] = clusterer(x,amblims) also specifies the minimum and maximum ambiguity limits, amblims, to apply to the data.

to enable this syntax, set the enabledisambiguation property to true.

[___] = clusterer(x,update) automatically estimates epsilon from the input data matrix, x, when update is set to true. the estimation uses a k-nn search to create a set of search curves. for more information, see estimate epsilon. the estimate is an average of the l most recent epsilon values, where l is specified in the epsilonhistorylength property.

to enable this syntax, set the epsilonsource property to 'auto', optionally set the maxnumpoints property, and also optionally set the epsilonhistorylength property.

[___] = clusterer(x,amblims,update) sets ambiguity limits and estimates epsilon when update is set to true. to enable this syntax, set enabledisambiguation to true and set epsilonsource to 'auto'.

input arguments

`x` — input feature data
real-valued n-by-p matrix

input feature data, specified as a real-valued n-by-p matrix. the n rows correspond to feature points in a p-dimensional feature space. the p columns contain the values of the features over which clustering takes place. the dbscan algorithm can cluster any type of data with appropriate minnumpoints and epsilon settings. for example, a two-column input can contain the xy cartesian coordinates, or range and doppler.

data types: double

`amblims` — ambiguity limits
1-by-2 real-valued vector (default) | 2-by-2 real-valued matrix

ambiguity limits, specified as a real-valued 1-by-2 vector or real-valued 2-by-2 matrix. for a single ambiguity dimension, specify the limits as a 1-by-2 vector [minambiguitylimitdimension1,maxambiguitylimitdimension1]. for two ambiguity dimensions, specify the limits as a 2-by-2 matrix [minambiguitylimitdimension1, maxambiguitylimitdimension1; minambiguitylimitdimension2,maxambiguitylimitdimension2]. ambiguity limits allow clustering across boundaries to ensure that ambiguous detections are appropriately clustered.

the ambiguous columns of x are defined in the ambiguousdimension property. amblims defines the minimum and maximum ambiguity limits in the same units as the data in the ambiguousdimension columns of x.

example: [0 20; -40 40]

dependencies

to enable this argument, set enabledisambiguation to true and set the ambiguousdimension property.

data types: double

`update` — enable automatic update of epsilon
`false` (default) | `true`

enable automatic update of the epsilon estimate, specified as false or true.

when true, the epsilon threshold is first estimated as the average of the knees of k-nn search curves. the estimate is then added to a buffer whose length l is set in the epsilonhistorylength property. the final epsilon that is used is calculated as the average of the l-length epsilon history buffer. if epsilonhistorylength is set to 1, the estimate is memory-less. memory-less means that each epsilon estimate is immediately used and no moving-average smoothing occurs.
when false, a previous epsilon estimate is used. estimating epsilon is computationally intensive and not recommended for large data sets.
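The smoothing over the epsilon history can be sketched in base MATLAB. This is an illustration of a moving average over a fixed-length buffer, not the object's internal code. The per-call estimates are invented numbers, and averaging a partially filled buffer over only the entries present is an assumption.

```matlab
L = 3;                          % plays the role of EpsilonHistoryLength
rawEstimates = [2 4 6 8];       % hypothetical per-call knee estimates
history = [];                   % buffer of the most recent estimates
epsUsed = zeros(size(rawEstimates));

for n = 1:numel(rawEstimates)
    history(end+1) = rawEstimates(n);   % append the newest estimate
    if numel(history) > L
        history(1) = [];                % drop the oldest entry
    end
    epsUsed(n) = mean(history);         % smoothed epsilon for this call
end

disp(epsUsed)
```

With `L = 1` the buffer holds a single value, so `epsUsed` equals `rawEstimates`, which is the memory-less case described above.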

dependencies

to enable this argument, set the epsilonsource property to 'auto' and specify the maxnumpoints property.

data types: double

output arguments

`idx` — cluster indices
n-by-1 integer-valued column vector

cluster indices, returned as an integer-valued n-by-1 column vector. idx represents the clustering results of the dbscan algorithm. positive idx values correspond to clusters that satisfy the dbscan clustering criteria. a value of '-1' indicates a dbscan noise point.

data types: double

`clusterids` — alternative cluster ids
1-by-n integer-valued row vector

alternative cluster ids, returned as a 1-by-n row vector of positive integers. each value is a unique identifier indicating a hypothetical target cluster. this argument contains unique positive cluster ids for all points including noise. in contrast, the idx output argument labels noise points with '–1'. use clusterids as the input to phased array system toolbox™ objects such as phased.rangeestimator and phased.dopplerestimator.
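The relationship between the two outputs can be pictured with a small base-MATLAB sketch. The numbering scheme used here for noise points (continuing after the largest cluster index) is purely hypothetical. The text above only states that every noise point receives its own positive id.

```matlab
idx = [1; 1; -1; 2; -1; 2];          % example idx with two noise points
clusterids = idx.';                  % 1-by-N row vector
isNoise = clusterids == -1;

% Hypothetical numbering: give each noise point the next unused id.
nextId = max([clusterids 0]) + 1;
clusterids(isNoise) = nextId:(nextId + nnz(isNoise) - 1);

disp(clusterids)
```

Points that share a cluster keep a common id, while each noise point ends up as a cluster of one.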

data types: double

object functions

to use an object function, specify the system object™ as the first input argument. for example, to release system resources of a system object named obj, use this syntax:

release(obj)

specific to `clusterdbscan`

	clusterdbscan.discoverclusters	find cluster hierarchy in data
	clusterdbscan.estimateepsilon	estimate neighborhood clustering threshold
	plot	plot clusters

common to all system objects

	step	run system object algorithm
	release	release resources and allow changes to system object property values and input characteristics
	reset	reset internal states of system object

examples

cluster detections in range and doppler

create detections of extended objects with measurements in range and doppler. assume the maximum unambiguous range is 20 m and the unambiguous doppler span extends from $- 30$ hz to $30$ hz. data for this example is contained in the dataclusterdbscan.mat file. the first column of the data matrix represents range, and the second column represents doppler.

the input data contains the following extended targets and false alarms:

an unambiguous target located at $(10, 15)$
an ambiguous target in doppler located at $(10, - 30)$
an ambiguous target in range located at $(20, 15)$
an ambiguous target in range and doppler located at $(20, 30)$
5 false alarms

create a clusterdbscan object and specify that disambiguation is not performed by setting enabledisambiguation to false. solve for the cluster indices.

load('dataClusterDBSCAN.mat');
cluster1 = clusterDBSCAN('MinNumPoints',3,'Epsilon',2, ...
    'EnableDisambiguation',false);
idx = cluster1(x);

use the clusterdbscan plot object function to display the clusters.

plot(cluster1,x,idx)

[figure: clusters without disambiguation; x-axis dimension 1, y-axis dimension 2]

the plot indicates that there are eight apparent clusters and six noise points. the 'dimension 1' label corresponds to range and the 'dimension 2' label corresponds to doppler.

next, create another clusterdbscan object and set enabledisambiguation to true to specify that clustering is performed across the range and doppler ambiguity boundaries.

cluster2 = clusterDBSCAN('MinNumPoints',3,'Epsilon',2, ...
    'EnableDisambiguation',true,'AmbiguousDimension',[1 2]);

perform the clustering using ambiguity limits and then plot the clustering results. the dbscan clustering results correctly show four clusters and five noise points. for example, the points at ranges close to zero are clustered with points near 20 m because the maximum unambiguous range is 20 m.

amblims = [0 maxRange; minDoppler maxDoppler];
idx = cluster2(x,amblims);
plot(cluster2,x,idx)

[figure: clusters with disambiguation; x-axis dimension 1, y-axis dimension 2]

effect of epsilon on clustering

cluster two-dimensional cartesian position data using clusterdbscan. to illustrate how the choice of epsilon affects clustering, compare the results of clustering with epsilon set to 1 and epsilon set to 3.

create random target position data in xy cartesian coordinates.

x = [rand(20,2)+12; rand(20,2)+10; rand(20,2)+15];
plot(x(:,1),x(:,2),'.')

[figure: scatter plot of the random target positions]

create a clusterdbscan object with the epsilon property set to 1 and the minnumpoints property set to 3.

clusterer = clusterDBSCAN('Epsilon',1,'MinNumPoints',3);

cluster the data when epsilon equals 1.

idxepsilon1 = clusterer(x);

cluster the data again but with epsilon set to 3. you can change the value of epsilon because it is a tunable property.

clusterer.Epsilon = 3;
idxepsilon2 = clusterer(x);

plot the clustering results side-by-side. do this by passing in the axes handles and titles into the plot method. the plot shows that for epsilon set to 1, three clusters appear. when epsilon is 3, the two lower clusters are merged into one.

hax1 = subplot(1,2,1);
plot(clusterer,x,idxepsilon1, ...
    'Parent',hax1,'Title','epsilon = 1')
hax2 = subplot(1,2,2);
plot(clusterer,x,idxepsilon2, ...
    'Parent',hax2,'Title','epsilon = 3')

[figure: two panels titled epsilon = 1 and epsilon = 3; x-axis dimension 1, y-axis dimension 2]

algorithms

clustering algorithm

clustering overview

this section illustrates the basic principles of cluster formation. the figure shows points in a two-dimensional feature space. the clusters are compact and well-separated. a few noise points appear.

clusters formed from a single ε-neighborhood

clusters start from core points. the first step in the algorithm is identifying all core points.
the figure here shows the point p₁ and its ε-neighborhood n_ε(p₁). the ε-neighborhood has eight points (including itself) within a radius ε. using the minnumpoints property to set the threshold to 8 means that p₁ is a core point. the blue points that lie within n_ε are called border points. these border points are directly density reachable from the core point p₁.
no other points in the figure have enough neighboring points in their ε-neighborhood to become a core point. p₂ is not a core point because it has only five points within its neighborhood. p₂ is directly density reachable from p₁. the reverse is not true because p₂ is not a core point. the one-way arrow connecting the two points shows this asymmetry.
points that fall outside n_ε(p₁) are noise points (red) and do not belong to the cluster.
because no other points are core points, the core point and border points are a maximal set of density-connected points and therefore form a cluster.
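The classification into core, border, and noise points can be sketched in base MATLAB. This is a minimal illustration under stated assumptions, not the object's implementation: ε is a scalar, and `epsilon` and `minNumPoints` are plain variables standing in for the properties. The neighborhood count includes the point itself, and the data are invented.

```matlab
% Illustrative 2-D data: a tight group of five points and one outlier.
X = [0 0; 0.5 0; 0 0.5; -0.5 0; 0 -0.5; 5 5];
epsilon = 0.6;
minNumPoints = 5;

% Pairwise Euclidean distances using implicit expansion.
diffs = permute(X, [1 3 2]) - permute(X, [3 1 2]);   % N-by-N-by-P
D = sqrt(sum(diffs.^2, 3));                          % N-by-N

% Neighborhood counts include the point itself (distance 0).
counts = sum(D <= epsilon, 2);
isCore = counts >= minNumPoints;

% Border points: not core, but inside the neighborhood of a core point.
isBorder = ~isCore & any(D(:, isCore) <= epsilon, 2);
isNoise = ~isCore & ~isBorder;

disp([isCore isBorder isNoise])
```

Only the central point has five points in its neighborhood, so it is the single core point. Its four neighbors are border points and the outlier is noise.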

cluster of points from two ε-neighborhoods

the next figure shows a larger set of points containing two core points, p₁ and p₂. p₂ is a border point of p₁ but p₂ also has enough points in its own neighborhood to become a core point. because they are both core points, p₁ is directly density reachable from p₂, and p₂ is directly density reachable from p₁. the two-way arrow connecting them shows this symmetry.
p₃ is directly density reachable from p₂ but not from p₁ (as indicated by the one-way arrow). however, p₃ is called simply density reachable from p₁.
because no other points are core points, the two core points and their border points form a maximal set of density-connected points and form one cluster.

cluster points in adjacent ε-neighborhoods

this process of growing a cluster can be extended from core point to core point until there are no more core points to add. the core points and the border points belong to the same cluster. in general, a point $p_n$ is density reachable from point $p_1$ when there is a chain of core points, $p_1, p_2, p_3, \ldots, p_{n-1}$, such that each core point $p_{i+1}$ is directly density reachable from $p_i$, and $p_n$ is directly density reachable from $p_{n-1}$.

density connectivity

the next figure illustrates some properties of density connectivity.

a cluster can have multiple branching chains, for example (p₁, p₂, p₃, p₄) and (p₁, p₂, p₅, p₆).
two points, p₆ and p₄, are density connected when there is a third point p₂ such that p₆ and p₄ are density reachable from p₂.
two density connected points are not necessarily density reachable from one another.
a maximal set of density connected points defines a cluster. it does not matter which core point is the starting core point.
all points in a cluster are density reachable from all core points.
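Growing a cluster from core point to core point can be sketched as a simple region-growing loop in base MATLAB. This is a textbook-style DBSCAN sketch with a scalar ε and invented one-dimensional data, not the code used by the object. A border point that is reachable from two clusters is assigned to whichever cluster reaches it first.

```matlab
% Illustrative 1-D data: two dense groups and one isolated point.
X = [0; 1; 2; 3; 10; 11; 12; 30];
epsilon = 1.5;
minNumPoints = 3;

N = size(X, 1);
D = sqrt(sum((permute(X,[1 3 2]) - permute(X,[3 1 2])).^2, 3));
isNeighbor = D <= epsilon;                    % includes the point itself
isCore = sum(isNeighbor, 2) >= minNumPoints;

idx = -ones(N, 1);                            % -1 marks noise or unassigned
clusterCount = 0;
for seed = 1:N
    if ~isCore(seed) || idx(seed) ~= -1
        continue                              % start only from unused core points
    end
    clusterCount = clusterCount + 1;
    idx(seed) = clusterCount;
    stack = seed;                             % core points still to be expanded
    while ~isempty(stack)
        q = stack(end);
        stack(end) = [];
        members = find(isNeighbor(q, :) & idx.' == -1);
        idx(members) = clusterCount;          % directly density reachable from q
        newCore = members(isCore(members));   % only core points extend the chain
        stack = [stack, newCore(:).'];
    end
end

disp(idx.')
```

The points at 0 and 3 are border points that join the first cluster through the core points at 1 and 2. The point at 30 is never reached from a core point and keeps the noise label -1.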

estimate epsilon

dbscan clustering requires a value for the neighborhood size parameter ε. the clusterdbscan object and the clusterdbscan.estimateepsilon function use a k-nearest-neighbor search to estimate a scalar epsilon. let d_k(p) be the distance from a point p to its k-th nearest neighbor. define a d_k(p)-neighborhood as a neighborhood surrounding p that contains its k nearest neighbors. there are k + 1 points in the d_k(p)-neighborhood, including the point p itself. an outline of the estimation algorithm is:

for each point, find all the points in its d_k(p)-neighborhood.
accumulate the distances in all d_k(p)-neighborhoods for all points into a single vector.
sort the vector by increasing distance.
plot the sorted k-dist graph, which is the sorted distance against point number.
find the knee of the curve. the value of the distance at that point is an estimate of epsilon.
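The first four steps can be sketched in base MATLAB as follows. The data and the value of `k` are invented, and the brute-force distance matrix is an assumption made for brevity. The object may use a different search strategy.

```matlab
X = [0 0; 0 1; 1 0; 1 1; 0.5 0.5; 8 8];   % illustrative data
k = 2;

% Pairwise Euclidean distances (N-by-N).
D = sqrt(sum((permute(X,[1 3 2]) - permute(X,[3 1 2])).^2, 3));

% Sort each row. Column 1 is the point itself (distance 0), so the
% k nearest neighbors are in columns 2 through k+1.
Dsorted = sort(D, 2);
kDistances = Dsorted(:, 2:k+1);

% Accumulate all distances into a single vector and sort it.
kDist = sort(kDistances(:));

% Sorted k-dist graph: distance against point number.
plot(kDist, '.-')
xlabel('point number'), ylabel('distance')
```

The two largest entries of `kDist` come from the isolated point at (8, 8), which is what produces the sharp rise at the right end of the graph.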

the figure here shows distance plotted against point index for k = 20. the knee occurs at approximately 1.5. any points below this threshold belong to a cluster. any points above this value are noise.

there are several methods to find the knee of the curve. clusterdbscan and clusterdbscan.estimateepsilon first define the line connecting the first and last points of the curve. epsilon is then the ordinate of the point on the sorted k-dist graph that lies furthest from this line, measured perpendicular to the line.
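The knee search described above can be sketched in base MATLAB. The curve values are hypothetical, and the sketch measures perpendicular distance in the raw units of the graph (point number against distance) without any axis normalization, which is an assumption. The toolbox functions may scale the axes differently.

```matlab
kDist = [1 1.1 1.2 1.3 1.4 6].';      % hypothetical sorted k-dist values
n = numel(kDist);
x = (1:n).';                          % point number

% Line through the first and last points of the curve.
p1 = [x(1) kDist(1)];
p2 = [x(end) kDist(end)];
lineVec = p2 - p1;

% Perpendicular distance of every curve point from that line.
rel = [x kDist] - p1;
perpDist = abs(lineVec(1)*rel(:,2) - lineVec(2)*rel(:,1)) / norm(lineVec);

% The knee is the point furthest from the line; its ordinate is epsilon.
[~, kneeIndex] = max(perpDist);
epsilonEstimate = kDist(kneeIndex);

fprintf('knee at point %d, epsilon estimate %.2f\n', kneeIndex, epsilonEstimate)
```

For this curve the knee is the last point before the jump, so the estimate is the largest distance that still belongs to the flat part of the graph.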

when you specify a range of k values, the algorithm averages the estimated epsilon values for all curves. this figure shows that epsilon is fairly insensitive to k for k ranging from 14 through 19.

to create a single k-nn distance graph, set the minnumpoints property equal to the maxnumpoints property.

choosing the minimum number of points

the purpose of minnumpoints is to smooth the density estimates. because a cluster is a maximal set of density-connected points, choose smaller values when the expected number of detections in a cluster is unknown. however, smaller values make the dbscan algorithm more susceptible to noise. a general guideline for choosing minnumpoints is:

generally, set minnumpoints = 2p where p is the number of feature dimensions in x.
for data sets that have one or more of the following properties:
- many noise points
- large number of points, n
- large dimensionality, p
- many duplicates
increasing minnumpoints can often improve clustering results.

ambiguous data

the clustering algorithm is general enough to process ambiguities in any feature, but clustering across range and doppler ambiguities in radar is an important application.

range ambiguity

the time delay between pulse transmission and reception determines the range, r, of a target. r is proportional to time delay, t, by

$r = \frac{c t}{2}$

where c is the speed of light. time is measured from the transmission time of the pulse. if only one pulse is transmitted, the equation accurately determines the range.

often, the radar transmits multiple pulses spaced at intervals $T$, the pulse repetition interval (pri). range ambiguities occur when the echoes from one pulse are not received before the next pulse is transmitted. range is computed from the time difference between the arrival of the received pulse and the transmission of the most recently transmitted pulse. therefore the range can be incorrect by some integer multiple of the unambiguous range. the unambiguous range of a radar system is the maximum range at which a target can be located to guarantee that the reflected pulse from that target corresponds to the most recent transmitted pulse. the pri determines the unambiguous range.

$r_{\max} = \frac{c T}{2}$

the range of a detection less than r_max is an unambiguous range. range disambiguation clusters detections that cross ambiguous range boundaries.

turn on disambiguation by setting the enabledisambiguation property to true. then, use the ambiguousdimension property to select the column in the input data corresponding to range. set the actual ambiguity limits for range using the amblims argument at execution time.
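A short base-MATLAB sketch can make the range folding concrete. The pulse repetition interval (named `T` in the code) and the target range are invented values. The wrapped-distance formula at the end is one plausible way to measure closeness across an ambiguity boundary and is an assumption, not a description of the object's internals.

```matlab
c = 299792458;            % speed of light (m/s)
T = 1e-4;                 % hypothetical pulse repetition interval (s)
Rmax = c*T/2;             % unambiguous range (m)

trueRange = 20000;                     % target beyond Rmax
apparentRange = mod(trueRange, Rmax);  % range measured from the latest pulse

% Wrapped distance between two ranges for limits [0 20], matching the
% example in which the maximum unambiguous range is 20 m.
amblims = [0 20];
span = amblims(2) - amblims(1);
d = abs(19.5 - 0.5);
wrappedDistance = min(d, span - d);

fprintf('Rmax = %.1f m, apparent range = %.1f m\n', Rmax, apparentRange)
fprintf('wrapped distance = %.1f m\n', wrappedDistance)
```

Under this distance, detections at 0.5 m and 19.5 m are only 1 m apart, which is why points near zero range can cluster with points near the upper limit.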

doppler ambiguity

doppler aliasing occurs when echoes arrive from targets that move fast enough for the doppler frequency to exceed the pulse repetition frequency (prf). if the doppler shift is greater than ½ prf or less than –½ prf, the doppler shift is aliased into the range (–½ prf, ½ prf). this range is called the unambiguous doppler. turn on disambiguation by setting the enabledisambiguation to true. then, use the ambiguousdimension property to select the column in the input data corresponding to doppler. set the actual ambiguity limits for doppler using the amblims argument at execution time. doppler ambiguity implies radial speed ambiguity as well. make sure that amblims matches the interpretation of the feature.
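Doppler folding can be sketched in base MATLAB in the same spirit. The PRF and Doppler values are hypothetical, with the PRF chosen so that the limits match the -30 Hz to 30 Hz span used in the example. The convention here includes the lower limit and excludes the upper limit, which is an arbitrary choice for the sketch.

```matlab
prf = 60;                               % hypothetical PRF (Hz)
trueDoppler = [10 35 -40];              % true Doppler shifts (Hz)

% Fold each shift into the interval [-prf/2, prf/2).
aliased = mod(trueDoppler + prf/2, prf) - prf/2;

% Ambiguity limits in the same units as the Doppler column of the data.
amblims = [-prf/2 prf/2];

disp(aliased)
```

A shift of 35 Hz appears at -25 Hz and a shift of -40 Hz appears at 20 Hz, while 10 Hz is already inside the unambiguous interval and is unchanged.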

references

[1] ester m., kriegel h.-p., sander j., and xu x. "a density-based algorithm for discovering clusters in large spatial databases with noise". proc. 2nd int. conf. on knowledge discovery and data mining, portland, or, aaai press, 1996, pp. 226-231.

[2] erich schubert, jörg sander, martin ester, hans-peter kriegel, and xiaowei xu. 2017. "dbscan revisited, revisited: why and how you should (still) use dbscan". acm trans. database syst. 42, 3, article 19 (july 2017), 21 pages.

[3] dominik kellner, jens klappstein and klaus dietmayer, "grid-based dbscan for clustering extended objects in radar data", 2012 ieee intelligent vehicles symposium.

[4] thomas wagner, reinhard feger, and andreas stelzer, "a fast grid-based clustering algorithm for range/doppler/doa measurements", proceedings of the 13th european radar conference.

[5] mihael ankerst, markus m. breunig, hans-peter kriegel, jörg sander, "optics: ordering points to identify the clustering structure", proc. acm sigmod’99 int. conf. on management of data, philadelphia pa, 1999.

extended capabilities

c/c++ code generation
generate c and c++ code using matlab® coder™.

version history

introduced in r2021a
