python - k-fold cross-validation for determining k in k-means?
In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain u, s, and vt. Then, by choosing a suitable number of singular values, I truncated vt, which gives me a good document-document correlation according to what I read. I am now performing clustering on the columns of the matrix vt to cluster similar documents. For this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.
Before implementing it, I wanted to figure out whether there is a built-in way to accomplish this using numpy or scipy. Currently, I perform k-means simply by using the function from scipy:
    import numpy
    from scipy.linalg import svd
    from scipy.cluster.vq import whiten, kmeans2

    # Preprocess the data and compute the SVD.
    # a is the TF-IDF representation of the original term-document matrix.
    u, s, vt = svd(a)

    # Obtain the document-document correlations from vt.
    # The 50 is the threshold obtained after examining a scree plot of s.
    docvectors = numpy.transpose(vt[0:50, 0:])

    # Prepare the data and run k-means.
    whitened = whiten(docvectors)
    res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is right so far (please correct me if I am missing a step), what is the standard way, at this stage, of using the output to perform cross-validation? Any references, implementations, or suggestions on how this would be applied to k-means would be appreciated.
To run k-fold cross-validation, you need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work. The difference from classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
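To make the role of the ground truth concrete, here is a minimal sketch using scikit-learn's `v_measure_score`. The label lists are made up for illustration and are not taken from the question's data.

```python
from sklearn.metrics import v_measure_score

# Hypothetical ground-truth topics for six documents.
true_labels = [0, 0, 0, 1, 1, 1]

# Same grouping as the ground truth, but with the cluster ids swapped.
# V-measure ignores the actual id values, so this counts as a perfect match.
relabeled = [1, 1, 1, 0, 0, 0]

# Everything in one cluster: complete, but not homogeneous at all.
one_cluster = [0, 0, 0, 0, 0, 0]

perfect_score = v_measure_score(true_labels, relabeled)
degenerate_score = v_measure_score(true_labels, one_cluster)
print(perfect_score, degenerate_score)
```

The first score should come out at 1 and the second at 0, which shows that the measure compares groupings against known labels rather than inspecting the document vectors themselves.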
V-measure and several other scores are implemented in scikit-learn, as well as generic cross-validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
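A rough sketch of how these pieces might fit together is below. It assumes a recent scikit-learn release, where grid search lives in `sklearn.model_selection`, and it uses synthetic blobs from `make_blobs` as a stand-in for labeled document vectors; the data, the candidate values of k, and the variable names are illustrative assumptions, not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for labeled document vectors: three well-separated groups.
X, y = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

candidate_ks = [2, 3, 4, 5, 6]
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": candidate_ks},
    scoring="v_measure_score",  # scores predicted clusters on each held-out fold against y
    cv=5,
)
# y is used only for scoring; KMeans.fit itself ignores the labels.
search.fit(X, y)

best_k = search.best_params_["n_clusters"]
print(best_k, search.best_score_)
```

On real document vectors the labels would have to come from a hand-annotated subset, and the scores would be far less clear-cut than on this toy data.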
python statistics numpy nlp machine-learning