Data mining methods fall into one of two groups: supervised or unsupervised. The terms do not describe whether a person needs to be present for the method to work. They refer to the presence, or absence, of an outcome variable to predict or classify (Shmueli, Patel, & Bruce, 2011). Ultimately, then, the decision of whether to choose a supervised or an unsupervised method comes down to the data that are available.
Supervised Methods
Most data mining methods are of the supervised variety. “Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known” (Shmueli et al., 2011). With these methods, a target variable is specified, and labeled examples are provided to the algorithm so that it can learn how the predictor variables relate to the target. The two primary types of supervised algorithms are classification and regression.
Classification
Easily the most common data mining task, classification is the process of learning a pattern of relationships between variables (a model) from observations whose category is known, so that the pattern can then be used to assign new observations to a category. An example of this is the adage, “If it walks like a duck, and it quacks like a duck, it must be a duck.” Commonly used methods for classification include decision trees, neural networks, and k-nearest neighbors.
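As a rough illustration of the duck adage, here is a minimal k-nearest neighbors sketch in plain Python. The feature names ("waddle score" and "quack score"), the data values, and the function name `knn_predict` are all hypothetical and chosen only for this example; a real project would more likely use an established library.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the labeled examples by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Take the labels of the k closest examples and return the most common one.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled observations: (waddle score, quack score) -> label.
train_X = [(0.9, 0.8), (0.8, 0.9), (0.95, 0.7), (0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
train_y = ["duck", "duck", "duck", "not duck", "not duck", "not duck"]

# A new, unlabeled observation that walks and quacks a lot like a duck.
prediction = knn_predict(train_X, train_y, (0.85, 0.75), k=3)
print(prediction)
```

With this toy data the new observation sits among the three "duck" examples, so the vote comes out as "duck". The point is only that the labels of known observations drive the decision for the new one.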
Regression
In regression, the output variable is a real or continuous value, such as “salary” or “weight” (Shukla, 2019). Many different models can be used, the simplest of which is linear regression. It is most often used to find a relationship between variables so that the known values of the predictors can be used to predict the unknown value of the outcome.
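To make the idea concrete, the following sketch fits a simple linear regression with ordinary least squares in plain Python. The variables (years of experience predicting salary in thousands), the numbers, and the function name `fit_line` are assumptions made up for illustration, not data from the sources cited here.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: the outcome (salary) is known for each row.
years_experience = [1, 2, 3, 4, 5]
salary_thousands = [40, 45, 50, 55, 60]

slope, intercept = fit_line(years_experience, salary_thousands)

# Use the known predictor value (6 years) to predict the unknown salary.
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```

Because the toy data lie exactly on a straight line, the fit recovers a slope of 5 and an intercept of 35, and the prediction for six years of experience is 65. Real data would scatter around the line rather than sit on it.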
Unsupervised Methods
Unsupervised methods involve data that are neither classified nor labeled, and they allow the algorithm to act on those data without guidance. The purpose of the algorithm is to group unsorted data according to similarities, patterns, and differences without any prior training. Without the bias of previous knowledge, the algorithm is free to look for associations that might otherwise be overlooked. The most common categories of unsupervised methods are clustering and association, although association can be used with supervised data mining as well (Larose & Larose, 2014).
Clustering
Clustering is used when you want to determine the innate groupings in the data, such as the types of customers that tend to make specific purchases. Typical algorithms of this type include k-means, k-medians, expectation maximization, and hierarchical clustering (Pierson, 2017).
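The sketch below shows the basic k-means loop in plain Python, under some simplifying assumptions: the customer features (visits per month, average spend), the data values, and the function name `k_means` are hypothetical, and the starting centroids are fixed by hand so the run is repeatable. In practice k-means is usually started from random or specially chosen centroids.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1.5, 11), (8, 50), (9, 55), (8.5, 52)]

# Assumed starting centroids: one point taken from each end of the data.
centroids, clusters = k_means(customers, [customers[0], customers[3]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere. The algorithm separates the occasional low spenders from the frequent high spenders purely from the distances between the observations.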
Association
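Association is used when you want to discover rules that describe large portions of your data, such as “customers who buy item A also tend to buy item B.” The Apriori algorithm and the measures of support and confidence play a large part in association methodology. Support is the share of all transactions that contain the items in a rule, and confidence is the share of transactions containing A that also contain B.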
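As a small worked example, the sketch below computes support and confidence for a single rule over a handful of made-up shopping baskets. It does not implement the Apriori algorithm itself, which additionally searches for frequent itemsets; the basket contents and the function names `support` and `confidence` are assumptions for illustration only.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    # `items <= basket` is True when items is a subset of the basket.
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "eggs"},
]

# Evaluate the rule "customers who buy bread also tend to buy butter".
rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in three of the five baskets, and bread appears in four, so the rule has a support of 3/5 and a confidence of 3/4. A rule is usually kept only when both measures clear thresholds chosen by the analyst.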
References
Larose, D. T., & Larose, C. D. (2014). Discovering Knowledge in Data (2nd ed.). https://doi.org/10.1002/9781118874059
Pierson, L. (2017). Data Science for Dummies. John Wiley & Sons, Incorporated.
Shmueli, G., Patel, N., & Bruce, P. (2011). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (2nd ed.). John Wiley & Sons.
Shukla, S. (2019). Regression and Classification: Supervised Machine Learning. Retrieved June 12, 2019, from https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/