There is a key division between the various methods of data mining: they are either supervised or unsupervised. These terms do not indicate whether a person needs to be present for the methods to work. Instead, they refer to the presence, or lack thereof, of an outcome variable to predict or classify (Shmueli, Patel, & Bruce, 2011). Ultimately, then, the decision of whether you choose a supervised or an unsupervised method comes down to the data that are available.
Most data mining methods are of the supervised variety. “Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known” (Shmueli et al., 2011). With these methods, a target variable is specified, and examples with known outcomes are provided to the algorithm so it can learn how the predictor variables relate to that target. The two primary types of supervised tasks are classification and regression.
Easily the most common data mining task, classification is the process of determining a pattern of relationships between variables (a model) so that this pattern can then be used to assign new observations to a known category. An example of this is the adage, “If it walks like a duck, and it quacks like a duck, it must be a duck.” Commonly used methods for classification include decision trees, neural networks, and k-nearest neighbors.
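To make the duck adage concrete, here is a minimal sketch of k-nearest neighbors, one of the methods named above. The function name knn_predict, the two features, and the toy measurements are all hypothetical and chosen only for illustration; a real project would more likely use an established library and would scale the features first.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank the labeled examples by Euclidean distance to the query point.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    # Keep the labels of the k closest examples and take a majority vote.
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled data: (body length in cm, weight in kg).
train_X = [(50, 1.0), (55, 1.2), (58, 1.4), (80, 4.0), (90, 5.5), (85, 4.8)]
train_y = ["duck", "duck", "duck", "goose", "goose", "goose"]

# A new, unlabeled bird that "looks like a duck" in both features.
prediction = knn_predict(train_X, train_y, query=(56, 1.3), k=3)
print(prediction)
```

The known labels in train_y are what make this a supervised method: without them there would be nothing to vote on.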
In regression, the output variable is a real or continuous value, such as “salary” or “weight” (Shukla, 2019). There are many different models that can be used, the simplest of which is linear regression. Linear regression is most often used to find a relationship between variables so that known predictor values can be used to predict an unknown outcome value.
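As a rough illustration of the simplest case, the sketch below fits a straight line with ordinary least squares using one predictor. The function name fit_line and the salary figures are invented for this example and are not taken from any of the cited sources.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [40, 45, 50, 55, 60]

slope, intercept = fit_line(years, salary)
# Use the known value (6 years) to predict the unknown one (salary).
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

Unlike classification, the prediction here is a number on a continuous scale rather than a category.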
Unsupervised methods involve data that are neither classified nor labeled and allow the algorithm to act on those data without guidance. The purpose of the algorithm is to group unsorted data according to similarities, patterns, and differences without any prior training. Without the bias of previous knowledge, the algorithm is free to look for associations that might otherwise be overlooked. The most common categories of unsupervised methods are clustering and association, although association can be used with supervised data mining as well (Larose & Larose, 2014).
Clustering is used when you want to determine the innate groupings in the data, such as the types of customers that tend to make specific purchases. Typical algorithms of this type include k-means, k-medians, expectation maximization, and hierarchical clustering (Pierson, 2017).
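The following is a minimal sketch of k-means, the first algorithm in that list, written in plain Python. The function name k_means, the customer figures, and the choice to start from the first k points are assumptions made for brevity; practical implementations use better initialization and a convergence check.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, clusters)."""
    # Naive initialization (an assumption for brevity): the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend). No labels given.
customers = [(1, 10), (9, 95), (2, 12), (1, 14), (8, 90), (10, 100)]

centroids, clusters = k_means(customers, k=2)
print(centroids)
print(clusters)
```

Note that no outcome variable appears anywhere in the input; the two customer groups emerge from the distances alone.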
Association is used when you want to discover rules that describe large portions of your data, such as “customers who buy item A also tend to buy item B.” The Apriori algorithm, together with the measures of support and confidence, plays a large part in association methodology.
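Support and confidence can be computed directly from a list of transactions, as the sketch below tries to show. It does not implement Apriori itself, only the two measures that Apriori relies on; the function names and the shopping baskets are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing A, the fraction that also contain B."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here the rule "bread implies butter" appears in three of five baskets, and three of the four baskets with bread also contain butter. A rule is usually kept only when both measures exceed thresholds chosen by the analyst.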
Larose, D. T., & Larose, C. D. (2014). Discovering Knowledge in Data (2nd ed.). https://doi.org/10.1002/9781118874059
Pierson, L. (2017). Data Science for Dummies. John Wiley & Sons, Incorporated.
Shmueli, G., Patel, N., & Bruce, P. (2011). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (2nd ed.). John Wiley & Sons.
Shukla, S. (2019). Regression and Classification: Supervised Machine Learning. Retrieved June 12, 2019, from https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/