Supervised Methods

Most data mining methods are of the supervised variety. “Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known” (Shmueli et al., 2011). With these methods, a target variable is specified, and the algorithm is given examples in which the target’s value is known so it can learn how the predictor variables relate to it. The two primary types of supervised algorithms are classification and regression.

Classification

Easily the most common data mining task, classification is the process of determining a pattern of relationships between variables (a model) so that this pattern can then be used to assign new observations to a category. An example of this is the adage, “If it walks like a duck, and it quacks like a duck, it must be a duck.” Commonly used methods for classification include decision trees, neural networks, and k-nearest neighbors.
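
To make the duck adage concrete, here is a minimal sketch of k-nearest neighbors, one of the methods listed above. The two features (a “waddle” score and a “quack” score) and all of the training values are made up for illustration. A real project would more likely lean on a library than on hand-rolled code like this.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the labeled examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples: (waddle_score, quack_score) -> known class.
training = [
    ((0.9, 0.8), "duck"),
    ((0.8, 0.9), "duck"),
    ((0.7, 0.7), "duck"),
    ((0.1, 0.2), "not duck"),
    ((0.2, 0.1), "not duck"),
    ((0.3, 0.3), "not duck"),
]

print(knn_classify(training, (0.85, 0.75)))  # walks and quacks like a duck
print(knn_classify(training, (0.15, 0.25)))  # does neither
```

The labeled examples play the role of the “pattern” here: a new observation is assigned whichever class dominates among its closest known neighbors.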

Regression

In regression, the output variable is a real or continuous value, such as “salary” or “weight” (Shukla, 2019). There are many different models that can be used, the simplest of which is linear regression. This is most often used to find a relationship between variables so that the known value of one can be used to predict the unknown value of another.
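
As a rough illustration of linear regression, the sketch below fits a straight line by ordinary least squares and uses it to predict an unknown value from a known one. The years-of-experience and salary figures are invented for the example, and `fit_line` is just a name chosen here.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [40, 45, 50, 55, 60]

slope, intercept = fit_line(years, salary)

# Use the known value (6 years) to predict the unknown one (salary).
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

Because the made-up points sit exactly on a line, the fit recovers that line. Real data will scatter around it, which is where the more elaborate regression models come in.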

Unsupervised Methods

Unsupervised methods involve data that is neither classified nor labeled and allow the algorithm to act on that data without guidance. The purpose of the algorithm is to group unsorted data according to similarities, patterns, and differences without any prior training. Without the bias of previous knowledge, the algorithm is free to look for associations that might otherwise be overlooked. The most common categories of unsupervised methods are clustering and association, although association can be used with supervised data mining as well (Larose & Larose, 2014).
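
To give a feel for grouping unlabeled data by similarity, here is a small sketch of k-means, one well-known clustering technique (the text above does not single out any particular algorithm, so this choice is mine). It works on one-dimensional values, starts from hand-picked centroids to keep the run repeatable, and uses invented data.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by repeated assign-and-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled measurements; no categories are supplied.
values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[1, 12])
print(centroids)
print(clusters)
```

Nothing tells the algorithm what the groups mean. It only notices that some values sit close together, and interpreting the resulting groups is left to the analyst.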

Association

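Association is used when you want to discover rules that describe large portions of your data, such as “customers who buy item A also tend to buy item B.” The Apriori algorithm and the measures of support and confidence play a large part in association methodology. Support is the proportion of all transactions that contain both A and B, while confidence is the proportion of transactions containing A that also contain B.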
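
Support and confidence are easy to compute by hand for a rule such as “bread implies butter.” The sketch below does so for a handful of made-up shopping baskets. It only scores a single rule; it does not implement the Apriori search for candidate rules.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # the rule never applies, so there is nothing to measure
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]

rule_support = support(baskets, {"bread", "butter"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of 4 bread baskets
print(rule_support, rule_confidence)
```

A rule is usually considered interesting only when both numbers clear thresholds chosen by the analyst. Those thresholds are a judgment call and are not shown here.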

Larose, D. T., & Larose, C. D. (2014). Discovering Knowledge in Data (2nd ed.). https://doi.org/10.1002/9781118874059

Pierson, L. (2017). Data Science for Dummies. John Wiley & Sons, Incorporated.

Shmueli, G., Patel, N., & Bruce, P. (2011). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (2nd ed.). John Wiley & Sons.

The Cross-Industry Standard Process for Data Mining, or CRISP-DM, is an industry-proven process model that provides an overview of the data mining life cycle. It is also a methodology that includes descriptions of the typical phases of a project, the tasks involved with each phase, and an explanation of the relationships between these tasks.

The life cycle model consists of six phases, with arrows indicating dependencies between phases. The sequence of the phases is not set in stone, a fact illustrated by the various depictions that can be found online. It is possible to move back and forth between phases as often as needed or skip phases altogether if they are not relevant. CRISP-DM allows you to create a data mining model that fits your needs.

Business/Research Understanding Phase

The goal of the first phase is to come up with a proposed solution to a specific problem presented by the business. An investigation must be conducted to make sure there is a clear understanding of the objectives and requirements based on business goals, and not on any existing reports or processes. This should also consider existing resources and should involve any subject matter experts within the company. This specific business objective should then be converted into an equally clear data mining definition that communicates the goals that, if met, can be used by the business to address the original problem or question. This should include specifying the type of data problem being faced and the benchmarks to be used for measuring the technical goals and outcomes. With this information in hand, a project plan can be put in place which specifies the effort required, resources needed, and cost.

Data Understanding Phase

The data understanding phase of CRISP-DM begins with the initial data collection, followed by a close examination to familiarize yourself with the data collected. The goal here is to evaluate the quality of the data, identify potential problems, gain insights into the data, and potentially detect interesting subsets that may lead to actionable patterns. This step is critical in avoiding unexpected problems during the next phase, data preparation, which is typically the longest part of a project. It may also require going back to the previous phase if the problem posed isn’t clear enough.

Data Preparation Phase

As noted above, data preparation is the most time-consuming part of data mining, taking an estimated 50-70% of the required time and effort, and is also the most important. It includes all the activities necessary to construct the final dataset out of the initial raw data collected in the previous phase, including case and variable selection, variable transformation, and data cleansing.
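
The activities named above can be sketched in a few lines. This is only an illustration on invented records: the field names, the “US only” selection rule, and the choice of min-max scaling as the transformation are all assumptions made for the example, not steps prescribed by CRISP-DM.

```python
def prepare(records):
    """Turn raw records into a small modeling dataset."""
    # Case selection: keep only the cases relevant to the (hypothetical) question.
    selected = [r for r in records if r["country"] == "US"]
    # Data cleansing: drop records with missing values.
    cleaned = [
        r for r in selected
        if r["age"] is not None and r["income"] is not None
    ]
    if not cleaned:
        return []
    # Variable transformation: min-max scale income into the range [0, 1].
    incomes = [r["income"] for r in cleaned]
    low, high = min(incomes), max(incomes)
    span = (high - low) or 1  # avoid dividing by zero when all incomes match
    # Variable selection: keep age and the scaled income, drop everything else.
    return [
        {"age": r["age"], "income_scaled": (r["income"] - low) / span}
        for r in cleaned
    ]


# Hypothetical raw data as it might arrive from the data understanding phase.
raw = [
    {"age": 34, "income": 52000, "country": "US"},
    {"age": None, "income": 61000, "country": "US"},  # missing age
    {"age": 45, "income": 88000, "country": "CA"},    # outside the selection
    {"age": 29, "income": 40000, "country": "US"},
]

final_dataset = prepare(raw)
print(final_dataset)
```

Even in this toy case, half of the raw records never reach the model, which hints at why this phase eats up so much of a project.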

Modeling Phase

The modeling phase is where the hard work from the previous three phases begins to pay off. Modeling sometimes requires multiple passes with several different models, and the process usually begins with default parameters that are fine-tuned over time. It is also possible that, based on a model’s requirements, it will be necessary to loop back to the preparation phase to manipulate the data to fit a specific model better. In the end, the results should begin to shed some light on the business or research problem posed during Business Understanding.

Evaluation Phase

Now that most of the data mining has been completed, the models that have been built need to be tested against the business success criteria established at the beginning of the project. This ensures both their quality and effectiveness, and confirms that all critical business issues have been sufficiently considered. Ultimately, one model should be chosen as the best choice to proceed with.

Deployment Phase

The deployment phase is where the new insights gained in the previous phase will be used to make changes within the organization. In general, this includes two activities: planning and monitoring the deployment of a code representation of the model, and completing any wrap-up tasks, such as reports and project reviews. The deployed code representation is used to score or categorize new data as it arrives. The results are then read into a data warehouse, where they can be used to help solve the original business problem.

Contrasting Supervised and Unsupervised Methods of Data Modeling