Evaluating Cluster Dispersion in Oracle Data Mining

Posted on Updated on

When working with the Clustering algorithms, and particularly k-Means, in the Oracle Data Miner tool there is no way of seeing how compact or dispersed the data is within a cluster.

There are a number of measures typically used in various tools and algorithms, but with Oracle Data Miner we are not presented with any of this information.

But if we flip from using the Oracle Data Miner tool to using SQL we can get to see some more details of the clusters produced by the k-Means algorithm along with some additional and useful information.

As I said there are a number of different measures used to evaluate clusters. The one that Oracle uses is called Dispersion. Now there are a few different definitions of what this could be and I haven’t been able to locate what is Oracle’s own definition of it in any of the documentation.

We can use the Dispersion value as a measure of how compact or how spread out the data is within a cluster. The Dispersion value is a number greater than 0. The lower the value of the more compact the cluster is i.e. the data points are close the the centroid of the cluster. The larger the value the more disperse or spread out the data points are.

The DBMS_DATA_MINING PL/SQL package comes with a function called GET_MODEL_DETAILS_KM. This function returns a record of the form DM_CLUSTERS.

(id                   NUMBER,
 cluster_id           VARCHAR2(4000),
 record_count         NUMBER,
 parent               NUMBER,
 tree_level           NUMBER,
 dispersion           NUMBER,
 split_predicate      DM_PREDICATES,
 child                DM_CHILDREN,
 centroid             DM_CENTROIDS,
 histogram            DM_HISTOGRAMS,
 rule                 DM_RULE)

We can not use the following query to get the Dispersion value for each of the clusters from an ODM cluster model.

SELECT cluster_id,
FROM  table(dbms_data_mining.get_model_details_km('CLUS_KM_3_2'));