python - k-fold Cross Validation for determining k in k-means?

May 15, 2012

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain u, s and vt. Then, by choosing a suitable number of singular values, I truncated vt, which now gives me a good document-document correlation from what I have read. I am now performing clustering on the columns of the matrix vt to cluster similar documents. For this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper on choosing the k value itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.

Before implementing it, I wanted to figure out if there is a built-in way to accomplish this using numpy or scipy. Currently, the way I am performing k-means is to simply use the function from scipy.

    import numpy
    from scipy.linalg import svd
    from scipy.cluster.vq import kmeans2, whiten

    # preprocess the data and compute the svd
    # a is the tfidf representation of the original term-document matrix
    u, s, vt = svd(a)

    # obtain the document-document correlations from vt
    # 50 is the threshold obtained after examining the scree plot of s
    docvectors = numpy.transpose(vt[0:50, 0:])

    # prepare the data and run k-means
    whitened = whiten(docvectors)
    res, idx = kmeans2(whitened, 10, iter=20)

Assuming my methodology is right so far (please correct me if I am missing a step), at this stage, what is the standard way of using the output to perform cross-validation? Any references, implementations or suggestions on how this would be applied to k-means would be appreciated.

To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.

Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work. The difference from classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
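As a rough sketch of that idea (not code from the question), the snippet below clusters all points with scipy's `kmeans2` but scores each candidate k only on a labeled subset. The synthetic data, the 30% labeled fraction, the helper names `entropy` and `v_measure`, and the candidate range of k are all assumptions made for illustration. The V-measure is written out from its definition as the harmonic mean of homogeneity and completeness.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten


def entropy(counts):
    """Shannon entropy (natural log) of a vector of non-negative counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))


def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness (hypothetical helper)."""
    _, ci = np.unique(labels_true, return_inverse=True)
    _, ki = np.unique(labels_pred, return_inverse=True)
    # contingency table: rows are true classes, columns are clusters
    cont = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(cont, (ci, ki), 1)
    h_c = entropy(cont.sum(axis=1))      # H(classes)
    h_k = entropy(cont.sum(axis=0))      # H(clusters)
    h_joint = entropy(cont.ravel())      # H(classes, clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - (h_joint - h_k) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - (h_joint - h_c) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# assumed toy data: three Gaussian blobs standing in for document vectors
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
true_labels = np.repeat(np.arange(3), 50)
data = centers[true_labels] + rng.normal(size=(150, 2))

# pretend that only about 30% of the points carry a ground-truth label
labeled = rng.random(150) < 0.3

whitened = whiten(data)
scores = {}
for k in range(2, 7):
    # initial centroids are drawn from the data so the run is reproducible
    init = whitened[rng.choice(len(whitened), size=k, replace=False)]
    # k-means sees every point, labeled or not
    _, idx = kmeans2(whitened, init, iter=20, minit="matrix")
    # the score only looks at the labeled subset
    scores[k] = v_measure(true_labels[labeled], idx[labeled])

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because k-means depends on its initialization, the scores for a given k can vary between runs; averaging over several initializations would give a steadier picture.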

V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
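A minimal sketch of how those pieces might fit together, assuming a recent scikit-learn in which `GridSearchCV` lives in `sklearn.model_selection` and `"v_measure_score"` is accepted as a scoring name. The toy blobs, the candidate grid and the fold settings are made up for illustration, and the labels `y` stand in for whatever ground truth is available.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, KFold

# assumed toy data: three well-separated blobs with known labels
X, y = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]],
    cluster_std=0.5,
    random_state=0,
)

# candidate values of k to compare
param_grid = {"n_clusters": [2, 3, 4, 5, 6]}

# KMeans ignores y when fitting; y is only used by the scorer, which
# compares the predicted clusters on each held-out fold with the labels
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid,
    scoring="v_measure_score",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

The V-measure is invariant to how the cluster ids are numbered, so it can be used here even though k-means assigns arbitrary ids to its clusters.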

python statistics numpy nlp machine-learning
