
clusterdbscan

density-based algorithm for clustering data

since r2021a

description

clusterdbscan clusters data points belonging to a p-dimensional feature space using the density-based spatial clustering of applications with noise (dbscan) algorithm. the clustering algorithm assigns points that are close to each other in feature space to a single cluster. for example, a radar system can return multiple detections of an extended target that are closely spaced in range, angle, and doppler. clusterdbscan groups these detections into a single cluster.

  • the dbscan algorithm assumes that clusters are dense regions in data space separated by regions of lower density and that all dense regions have similar densities.

  • to measure density at a point, the algorithm counts the number of data points in a neighborhood of the point, called an ε-neighborhood. the neighborhood is a p-dimensional ellipse (hyperellipse) in the feature space whose radii are given by the p-vector ε. when ε is a scalar, the hyperellipse becomes a hypersphere. distances between points in feature space are calculated using the euclidean distance metric. the value of ε is set by the epsilon property, which can be either a scalar or a p-vector:

    • a vector is used when different dimensions in feature space have different units.

    • a scalar applies the same value to all dimensions.

  • clustering starts by finding all core points. if a point has a sufficient number of points in its ε-neighborhood, the point is called a core point. the minimum number of points required for a point to become a core point is set by the minnumpoints property.

  • the remaining points in the ε-neighborhood of a core point can be core points themselves. if not, they are border points. all points in the ε-neighborhood are called directly density reachable from the core point.

  • if the ε-neighborhood of a core point contains other core points, the points in the ε-neighborhoods of all the core points merge together to form a union of ε-neighborhoods. this process continues until no more core points can be added.

    • all points in the union of ε-neighborhoods are density reachable from the first core point. in fact, all points in the union are density reachable from all core points in the union.

    • all points in the union of ε-neighborhoods are also termed density connected even though border points are not necessarily reachable from each other. a cluster is a maximal set of density-connected points and can have an arbitrary shape.

  • points that are not core or border points are noise points. they do not belong to any cluster.

  • the clusterdbscan object can estimate ε using a k-nearest neighbor search, or you can specify values. to let the object estimate ε, set the epsilonsource property to 'auto'.

  • the clusterdbscan object can disambiguate data containing ambiguities. range and doppler are examples of possibly ambiguous data. to disambiguate data, set the enabledisambiguation property to true.
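the neighborhood test and the core, border, and noise definitions above can be sketched in base matlab. this is only an illustration under stated assumptions, not the object's internal implementation: the data values and the variable names inhood, iscore, isborder, and isnoise are made up, and a point is taken to lie in an ε-neighborhood when its distance, with each dimension scaled by the matching radius, is at most 1.

```matlab
% toy data: 7 points in a 2-d feature space (values are made up)
x = [0 0; 0.5 0; 0 0.5; 0.5 0.5; 1.2 0.5; 5 5; 9 0];
epsilon = [1 1];      % per-dimension radii; a scalar gives a circle
minnumpoints = 4;     % neighborhood size, point itself included, for a core point

n = size(x,1);
inhood = false(n);    % inhood(i,j): point j lies in the neighborhood of point i
for i = 1:n
    % scale each dimension by its radius, then compare with the unit sphere
    d = sqrt(sum(((x - x(i,:)) ./ epsilon).^2, 2));
    inhood(i,:) = (d <= 1).';
end

iscore   = sum(inhood,2) >= minnumpoints;        % count includes the point itself
isborder = ~iscore & any(inhood(:,iscore),2);    % non-core point near some core point
isnoise  = ~iscore & ~isborder;                  % neither core nor border
```

with these values, the first four points are core points, the fifth is a border point, and the last two are noise points.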

to cluster detections:

  1. create the clusterdbscan object and set its properties.

  2. call the object with arguments, as if it were a function.

to learn more about how system objects work, see what are system objects?

creation

description

clusterer = clusterdbscan creates a clusterdbscan object, clusterer, with default property values.


clusterer = clusterdbscan(name,value) creates a clusterdbscan object, clusterer, with each specified property name set to the specified value. you can specify additional name-value pair arguments in any order as (name1,value1,...,namen,valuen). any unspecified properties take default values. for example,

clusterer = clusterdbscan('minnumpoints',3,'epsilon',2, ...
'enabledisambiguation',true,'ambiguousdimension',[1 2]);
creates a clusterer with the enabledisambiguation property set to true and the ambiguousdimension property set to [1 2].

properties

unless otherwise indicated, properties are nontunable, which means you cannot change their values after calling the object. objects lock when you call them, and the release function unlocks them.

if a property is tunable, you can change its value at any time.

for more information on changing property values, see system design in matlab using system objects.

epsilonsource: source of epsilon values defining an ε-neighborhood, specified as 'property' or 'auto'.

  • when you set the epsilonsource property to 'property', ε is obtained from the epsilon property.

  • when you set the epsilonsource property to 'auto', ε is estimated automatically using a k-nearest neighbor (k-nn) search over a range of k values from kmin to kmax.

    kmin = minnumpoints - 1
    kmax = maxnumpoints - 1

    the subtraction of one is needed because the number of neighbors of a point does not include the point itself, whereas minnumpoints and maxnumpoints refer to the total number of points in a neighborhood.
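as a rough illustration of the k-nn search range, the following base matlab sketch computes, for each k from kmin to kmax, the sorted distances from every point to its k-th nearest neighbor. the data and the names d, dsorted, ks, and curves are assumptions made for this sketch. it stops at building the curves and does not reproduce how the object derives ε from them.

```matlab
rng(0)                                   % repeatable made-up data
x = [randn(30,2); randn(30,2) + 8];      % two loose groups of points
minnumpoints = 3;
maxnumpoints = 6;
kmin = minnumpoints - 1;                 % neighbor counts exclude the point itself
kmax = maxnumpoints - 1;

n = size(x,1);
d = zeros(n);                            % pairwise euclidean distances
for i = 1:n
    d(i,:) = sqrt(sum((x - x(i,:)).^2, 2)).';
end
dsorted = sort(d,2);                     % column 1 is the zero self-distance

% curves(:,j) holds every point's distance to its ks(j)-th nearest
% neighbor, sorted in ascending order
ks = kmin:kmax;
curves = sort(dsorted(:, ks + 1), 1);
```

each column of curves is one search curve. a larger k gives a curve that lies on or above the curve for a smaller k.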

data types: char | string

epsilon: radius for a neighborhood search, specified as a positive scalar or positive, real-valued 1-by-p row vector. p is the number of features in the input data, x.

epsilon defines the radii of an ellipse around any point to create an ε-neighborhood. when epsilon is a scalar, the same radius applies to all feature dimensions. you can apply different epsilon values for different features by specifying a positive, real-valued 1-by-p row vector. a row vector creates a multidimensional ellipse (hyperellipse) search area, useful when the data features have different physical meanings, such as range and doppler. see estimate epsilon for more information about this property.

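you can use the object functions that estimate the neighborhood clustering threshold or find the cluster hierarchy in data (see object functions) to help estimate a scalar value for epsilon.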

example: [11 21.0]

tunable: yes

dependencies

to enable this property, set the epsilonsource property to 'property'.

data types: double

minnumpoints: minimum number of points in an ε-neighborhood of a point for that point to become a core point, specified as a positive integer. see choosing the minimum number of points for more information. when the object automatically estimates epsilon using a k-nn search, the starting value of k (kmin) is minnumpoints - 1.

example: 5

data types: double

maxnumpoints: value that sets the end of the k-nn search range, specified as a positive integer. when the object automatically estimates epsilon using a k-nn search, the ending value of k (kmax) is maxnumpoints - 1.

example: 13

dependencies

to enable this property, set the epsilonsource property to 'auto'.

data types: double

epsilonhistorylength: length of the stored epsilon history, specified as a positive integer. when set to one, the history is memory-less, meaning that each epsilon estimate is immediately used and no moving-average smoothing occurs. when greater than one, epsilon is averaged over the history length specified.

example: 5

dependencies

to enable this property, set the epsilonsource property to 'auto'.

data types: double

enabledisambiguation: switch to enable disambiguation of dimensions, specified as false or true. when true, clustering can occur across boundaries defined by the input amblims at execution. use the ambiguousdimension property to specify the column indices of x in which ambiguities can occur. you can disambiguate up to two dimensions. turning on disambiguation is not recommended for large data sets.

data types: logical

ambiguousdimension: indices of ambiguous dimensions, specified as a positive integer or 1-by-2 vector of positive integers. this property specifies the columns of x in which to apply disambiguation. a positive integer indicates a single ambiguous dimension in the input data matrix x. a 1-by-2 row vector specifies two ambiguous dimensions. the size and order of ambiguousdimension must be consistent with the object input amblims.

example: [3 4]

dependencies

to enable this property, set the enabledisambiguation property to true.

data types: double

usage

description


idx = clusterer(x) clusters the points in the input data, x. idx contains a list of ids identifying the cluster to which each row of x belongs. noise points are assigned the value -1.


[idx,clusterids] = clusterer(x) also returns an alternate set of cluster ids, clusterids, for use in other phased array system toolbox objects. clusterids assigns a unique id to each noise point.

[___] = clusterer(x,amblims) also specifies the minimum and maximum ambiguity limits, amblims, to apply to the data.

to enable this syntax, set the enabledisambiguation property to true.

[___] = clusterer(x,update) automatically estimates epsilon from the input data matrix, x, when update is set to true. the estimation uses a k-nn search to create a set of search curves. for more information, see estimate epsilon. the estimate is an average of the l most recent epsilon values, where l is specified in the epsilonhistorylength property.

to enable this syntax, set the epsilonsource property to 'auto', optionally set the maxnumpoints property, and also optionally set the epsilonhistorylength property.

[___] = clusterer(x,amblims,update) sets ambiguity limits and estimates epsilon when update is set to true. to enable this syntax, set enabledisambiguation to true and set epsilonsource to 'auto'.

input arguments

x: input feature data, specified as a real-valued n-by-p matrix. the n rows correspond to feature points in a p-dimensional feature space. the p columns contain the values of the features over which clustering takes place. the dbscan algorithm can cluster any type of data with appropriate minnumpoints and epsilon settings. for example, a two-column input can contain the xy cartesian coordinates, or range and doppler.

data types: double

amblims: ambiguity limits, specified as a real-valued 1-by-2 vector or real-valued 2-by-2 matrix. for a single ambiguity dimension, specify the limits as a 1-by-2 vector [minambiguitylimitdimension1, maxambiguitylimitdimension1]. for two ambiguity dimensions, specify the limits as a 2-by-2 matrix [minambiguitylimitdimension1, maxambiguitylimitdimension1; minambiguitylimitdimension2, maxambiguitylimitdimension2]. ambiguity limits allow clustering across boundaries to ensure that ambiguous detections are appropriately clustered.

the ambiguous columns of x are defined in the ambiguousdimension property. amblims defines the minimum and maximum ambiguity limits in the same units as the data in the ambiguousdimension columns of x.

example: [0 20; -40 40]

dependencies

to enable this argument, set enabledisambiguation to true and set the ambiguousdimension property.

data types: double
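one way to picture clustering across an ambiguity boundary is a wrap-around distance along the ambiguous dimension. the sketch below is an assumption made for illustration and is not taken from the object's implementation: it measures the separation between two values both directly and across the boundary and keeps the smaller of the two. the names span, direct, and wrapped are hypothetical.

```matlab
amblims = [0 20];                   % [min max] limits of one ambiguous dimension
span = amblims(2) - amblims(1);     % width of the unambiguous interval

a = 0.5;                            % made-up value just above the lower limit
b = 19.5;                           % made-up value just below the upper limit

direct  = abs(a - b);                   % separation without wrap-around: 19
wrapped = min(direct, span - direct);   % separation across the boundary: 1
```

with a scalar epsilon of 2, the two values are far apart under the direct distance but neighbors under the wrapped distance, which matches the idea that points near opposite limits can belong to the same cluster.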

update: enable automatic update of the epsilon estimate, specified as false or true.

  • when true, the epsilon threshold is first estimated as the average of the knees of k-nn search curves. the estimate is then added to a buffer whose length l is set in the epsilonhistorylength property. the final epsilon that is used is calculated as the average of the l-length epsilon history buffer. if epsilonhistorylength is set to 1, the estimate is memory-less. memory-less means that each epsilon estimate is immediately used and no moving-average smoothing occurs.

  • when false, a previous epsilon estimate is used. estimating epsilon is computationally intensive and not recommended for large data sets.

dependencies

to enable this argument, set the epsilonsource property to 'auto' and specify the maxnumpoints property.

data types: double
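the history buffer described for the update argument behaves like a moving average. the following base matlab sketch shows the idea with made-up per-call estimates. the names estimates, history, and used are hypothetical, and averaging over a partially filled buffer during the first calls is an assumption of this sketch rather than documented behavior.

```matlab
epsilonhistorylength = 3;              % l, the buffer length
estimates = [2.0 2.6 1.8 2.2 3.0];     % hypothetical epsilon estimate from each call

history = [];                          % most recent estimates, oldest first
used = zeros(size(estimates));         % epsilon applied on each call
for c = 1:numel(estimates)
    history(end+1) = estimates(c);     % store the newest estimate
    if numel(history) > epsilonhistorylength
        history(1) = [];               % discard the oldest estimate
    end
    used(c) = mean(history);           % average over the stored history
end
```

setting epsilonhistorylength to 1 in this sketch makes used equal to estimates, which is the memory-less case.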

output arguments

idx: cluster indices, returned as an integer-valued n-by-1 column vector. idx represents the clustering results of the dbscan algorithm. positive idx values correspond to clusters that satisfy the dbscan clustering criteria. a value of -1 indicates a dbscan noise point.

data types: double

clusterids: alternative cluster ids, returned as a 1-by-n row vector of positive integers. each value is a unique identifier indicating a hypothetical target cluster. this argument contains unique positive cluster ids for all points including noise. in contrast, the idx output argument labels noise points with -1. use clusterids as the input to other phased array system toolbox™ objects.

data types: double
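the difference between the two outputs can be illustrated with a small base matlab sketch that turns an idx-style vector into ids where every noise point has its own positive id. the numbering used here (noise ids continue after the largest cluster id) is an assumption made for illustration. the numbering that the object actually uses may differ.

```matlab
idx = [1; 1; -1; 2; -1; 2];              % made-up cluster indices, -1 marks noise

ids = idx;
isnoise = (idx == -1);
nclusters = max([idx(~isnoise); 0]);     % largest cluster id, 0 if there are none

% give each noise point its own id after the last cluster id
ids(isnoise) = nclusters + (1:nnz(isnoise)).';
ids = ids.';                             % row vector, like clusterids
```

for this input, ids is [1 1 3 2 4 2].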

object functions

to use an object function, specify the system object™ as the first input argument. for example, to release system resources of a system object named obj, use this syntax:

release(obj)

discoverclusters: find cluster hierarchy in data
estimateepsilon: estimate neighborhood clustering threshold
plot: plot clusters
step: run system object algorithm
release: release resources and allow changes to system object property values and input characteristics
reset: reset internal states of system object

examples

create detections of extended objects with measurements in range and doppler. assume the maximum unambiguous range is 20 m and the unambiguous doppler span extends from -30 hz to 30 hz. data for this example is contained in the dataclusterdbscan.mat file. the first column of the data matrix represents range, and the second column represents doppler.

the input data contains the following extended targets and false alarms:

  • an unambiguous target located at (10,15)

  • an ambiguous target in doppler located at (10,-30)

  • an ambiguous target in range located at (20,15)

  • an ambiguous target in range and doppler located at (20,30)

  • 5 false alarms

create a clusterdbscan object and specify that disambiguation is not performed by setting enabledisambiguation to false. solve for the cluster indices.

load('dataclusterdbscan.mat');
cluster1 = clusterdbscan('minnumpoints',3,'epsilon',2, ...
    'enabledisambiguation',false);
idx = cluster1(x);

use the clusterdbscan plot object function to display the clusters.

plot(cluster1,x,idx)

figure: clusters plot with axes dimension 1 (x) and dimension 2 (y).

the plot indicates that there are eight apparent clusters and six noise points. the 'dimension 1' label corresponds to range and the 'dimension 2' label corresponds to doppler.
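counts like these can also be read directly from the cluster indices. the sketch below uses a short made-up idx vector rather than the example data, so the numbers it produces are not the ones quoted above.

```matlab
idx = [1; 1; 2; -1; 2; 3; -1; 3; 3];          % made-up cluster indices

numclusters = numel(unique(idx(idx > 0)));    % distinct positive ids: 3
numnoise = nnz(idx == -1);                    % points labeled as noise: 2
```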

next, create another clusterdbscan object and set enabledisambiguation to true to specify that clustering is performed across the range and doppler ambiguity boundaries.

cluster2 = clusterdbscan('minnumpoints',3,'epsilon',2, ...
    'enabledisambiguation',true,'ambiguousdimension',[1 2]);

perform the clustering using ambiguity limits and then plot the clustering results. the dbscan clustering results correctly show four clusters and five noise points. for example, the points at ranges close to zero are clustered with points near 20 m because the maximum unambiguous range is 20 m.

amblims = [0 maxrange; mindoppler maxdoppler];
idx = cluster2(x,amblims);
plot(cluster2,x,idx)

figure: clusters plot with axes dimension 1 (x) and dimension 2 (y).

cluster two-dimensional cartesian position data using clusterdbscan. to illustrate how the choice of epsilon affects clustering, compare the results of clustering with epsilon set to 1 and epsilon set to 3.

create random target position data in xy cartesian coordinates.

x = [rand(20,2)+12; rand(20,2)+10; rand(20,2)+15];
plot(x(:,1),x(:,2),'.')

figure: scatter plot of the target positions.

create a clusterdbscan object with the epsilon property set to 1 and the minnumpoints property set to 3.

clusterer = clusterdbscan('epsilon',1,'minnumpoints',3);

cluster the data when epsilon equals 1.

idxepsilon1 = clusterer(x);

cluster the data again but with epsilon set to 3. you can change the value of epsilon because it is a tunable property.

clusterer.epsilon = 3;
idxepsilon2 = clusterer(x);

plot the clustering results side by side by passing the axes handles and titles to the plot method. the plot shows that for epsilon set to 1, three clusters appear. when epsilon is 3, the two lower clusters are merged into one.

hax1 = subplot(1,2,1);
plot(clusterer,x,idxepsilon1, ...
    'parent',hax1,'title','epsilon = 1')
hax2 = subplot(1,2,2);
plot(clusterer,x,idxepsilon2, ...
    'parent',hax2,'title','epsilon = 3')

figure: two side-by-side cluster plots titled epsilon = 1 and epsilon = 3, each with axes dimension 1 (x) and dimension 2 (y).

algorithms


extended capabilities

c/c++ code generation
generate c and c++ code using matlab® coder™.

version history

introduced in r2021a
