Thursday, August 02, 2007

Sample Size and Dimensionality

Sample size and dimensionality are critical to parameter optimization
in machine learning and prediction. Small datasets with high
dimensionality pose the problem of low ROC performance, which is well
known in the research community.

A naive Bayes classifier (Maron, 1961) is a simple probabilistic
classifier based on applying Bayes’ theorem with strong independence
assumptions. Depending on the precise nature of the probability model,
naive Bayes classifiers can be trained very efficiently in a supervised
learning setting. In many practical applications, parameter estimation
for naive Bayes models uses the method of maximum likelihood. Recent
research on the Bayesian classification problem has shown that there
are theoretical reasons for the apparently unreasonable efficacy of
naive Bayes classifiers (Zhang, 2004). Because the variables are assumed
to be independent, only the variances of the variables for each class
need to be determined, not the entire covariance matrix. Hence, a naive
Bayes classifier requires only a small amount of training data for
classification.
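
To make the parameter-count argument concrete, here is a minimal sketch
of a naive Bayes classifier in plain Python. It assumes a Gaussian model
for each variable, which the text above does not state explicitly. The
function names, the toy data, and the var_floor parameter are all
hypothetical and only for illustration. Each class stores one mean and
one variance per variable, estimated by maximum likelihood, and no
covariance matrix is ever formed.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate, per class, the prior and each variable's mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    n_total = len(X)
    for label, rows in by_class.items():
        n = len(rows)
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        # Maximum likelihood variance (divide by n). The floor is an
        # assumption of this sketch to avoid dividing by zero later.
        variances = [
            max(sum((r[j] - means[j]) ** 2 for r in rows) / n, var_floor)
            for j in range(d)
        ]
        model[label] = (n / n_total, means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: the log likelihood is a sum over variables.
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def n_variance_params(d):
    """Variances to estimate per class under the independence assumption."""
    return d


def n_full_covariance_params(d):
    """Distinct entries of a full symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


# Toy data, invented for this example: two classes, two variables.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
```

With d variables, the independence assumption leaves d variances per
class to estimate, against d(d+1)/2 distinct entries for a full
covariance matrix. The gap grows quickly with dimensionality, which is
one way to read the claim that naive Bayes gets by with little data.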

Support vector machines (SVMs) are another set of supervised learning
methods for classification (Cortes & Vapnik, 1995). An SVM maps input
vectors to a higher-dimensional space where a maximal separating
hyperplane is constructed. Two parallel hyperplanes are constructed,
one on each side of the hyperplane that separates the samples. The
separating hyperplane is the one that maximizes the distance between
the two parallel hyperplanes. The larger the margin, or distance,
between these parallel hyperplanes, the lower the generalization error
of the classifier is expected to be. In contrast, SVMs generally
require large training samples.
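
The margin between the two parallel hyperplanes can be sketched in a few
lines of Python. The sketch assumes the usual canonical scaling, where
the parallel hyperplanes are w.x + b = +1 and w.x + b = -1, so the
distance between them is 2 / ||w||. It works directly in the input space
and does not perform the mapping to a higher-dimensional space or the
optimization itself. The function names and the toy data are
hypothetical.

```python
import math


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm


def decision_value(w, b, x):
    """Signed value w.x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def respects_margin(w, b, X, y):
    """True if every sample (labels +1 / -1) lies on or outside its margin plane."""
    return all(label * decision_value(w, b, x) >= 1.0 for x, label in zip(X, y))


# Toy data, invented for this example: separable along the first axis.
X = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]]
y = [-1, -1, 1, 1]

# Two candidate hyperplanes that both separate the samples.
wide_w, wide_b = [1.0, 0.0], -1.0      # margin planes at x0 = 0 and x0 = 2
narrow_w, narrow_b = [2.0, 0.0], -2.0  # margin planes at x0 = 0.5 and x0 = 1.5

for w, b in ((wide_w, wide_b), (narrow_w, narrow_b)):
    print(w, b, respects_margin(w, b, X, y), margin_width(w))
```

Both candidates separate the samples, but the first has the wider
margin, so it is the one a maximal-margin classifier would prefer.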
