The journey of a fledgling data scientist.

I have been quiet on my blog here for quite some time, the main reason being twofold: I didn’t have a lot to write about, nor did I have the time in which to write it. Then, while conducting research for my current studies, I had an epiphany. I could start blogging my experiences and lessons learned as both a way to help others and as a way to reinforce what I was learning.

So, that will be the purpose of this new series of blogs. As I learn things, I will endeavor to pass that knowledge on without being too dry. I hope. After all, the subjects data science, data mining, machine learning, business intelligence, and so forth, aren’t for everyone. However, for those that are interested, I hope my posts are informative.

More to come!

Contrasting Supervised and Unsupervised Methods of Data Modeling

There is a key division between the various methods of data mining: they are either supervised or unsupervised.  These terms are not meant to imply whether a person needs to be present for them to work.  What they refer to is the presence, or absence, of an outcome variable to predict or classify (Shmueli, Patel, & Bruce, 2011).  Ultimately, then, the decision of whether to choose a supervised or an unsupervised method comes down to the data that are available.

Supervised Methods

Most data mining methods are of the supervised variety.  “Supervised learning algorithms are those used in classification and prediction.  We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known” (Shmueli et al., 2011).  With these methods, a target variable is specified, and labeled examples are provided so the algorithm can learn the relationship between the predictor variables and that target.  The two primary types of supervised algorithms are classification and regression.

Classification

Easily the most common data mining task, classification is the process of determining a pattern of relationships between variables (a model) so that this pattern can then be used to identify new observations. An example of this is the adage, “If it walks like a duck, and it quacks like a duck, it must be a duck.”  Commonly used methods for classification include decision trees, neural networks, and k-nearest neighbors.
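To make the duck adage concrete, here is a minimal sketch of the k-nearest-neighbors idea in plain Python. The data and feature names ("walk score," "quack score") are my own invention for illustration: each labeled example is a point, and a new observation is classified by a majority vote among its k closest neighbors.

```python
import math

def knn_classify(point, examples, k=3):
    """Classify `point` by majority vote among its k nearest labeled examples."""
    # examples: list of (features, label) pairs whose outcomes are known
    neighbors = sorted(examples, key=lambda ex: math.dist(point, ex[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Hypothetical training data: (walk_score, quack_score) -> label
training = [
    ((0.9, 0.8), "duck"), ((0.8, 0.9), "duck"), ((0.7, 0.7), "duck"),
    ((0.1, 0.2), "goose"), ((0.2, 0.1), "goose"), ((0.3, 0.2), "goose"),
]

print(knn_classify((0.85, 0.75), training))  # -> duck
```

The "supervision" here is simply the label attached to each training example; the algorithm never sees a rule, only outcomes.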

Regression

In regression, the output variable is a real or continuous value, such as “salary” or “weight” (Shukla, 2019). There are many different models that can be used, the simplest of which is linear regression.  This is most often used to find a relationship between variables so that known values can be used to predict unknown ones.
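As a sketch of the simplest case, the ordinary-least-squares fit for a single predictor can be written in a few lines of plain Python. The experience/salary numbers below are made up purely for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [45, 50, 60, 65, 75]

a, b = fit_line(years, salary)
print(a + b * 6)  # use the known relationship to predict an unknown value
```

The fitted line is the "known relationship"; plugging in a new x value is the prediction step.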

Unsupervised Methods

Unsupervised methods involve data that is neither classified nor labeled and allow the algorithm to act on that data without guidance. The purpose of the algorithm is to group unsorted data according to similarities, patterns, and differences without any prior training.  Without the bias of previous knowledge, the algorithm is free to look for associations that might otherwise be overlooked.  The most common categories of unsupervised methods are clustering and association, although association can be used with supervised data mining as well (Larose & Larose, 2014).

Clustering

Clustering is used when you want to determine the innate groupings in the data, such as the types of customers that tend to make specific purchases. Typical algorithms of this type include k-means, k-medians, Expectation Maximization, and Hierarchical Clustering (Pierson, 2017).
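Of the algorithms listed, k-means is the easiest to sketch. The version below, with invented one-dimensional spending data, alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its assigned points; note that no labels are supplied anywhere:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic k-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Hypothetical data: per-visit customer spend in dollars
spend = [5, 6, 7, 8, 95, 100, 105]
print(kmeans(spend, k=2))  # two innate groupings emerge
```

The two centroids that emerge (one near the small purchases, one near the large) are the "innate groupings" the text describes, discovered without any prior training.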

Association

Association is used when you want to discover rules that describe large portions of your data, such as “customers who buy item A also tend to buy item B.”  The Apriori algorithm, together with the support and confidence measures, plays a large part in association methodology.
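The two measures mentioned are simple to compute directly. In this sketch (the baskets are hypothetical), support is the fraction of all transactions containing an itemset, and confidence is the fraction of antecedent-containing transactions that also contain the consequent:

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent: support(A union B) / support(A)."""
    return support(transactions, set(antecedent) | set(consequent)) \
        / support(transactions, antecedent)

# Hypothetical market baskets
baskets = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread", "milk"}, {"butter"}, {"bread", "butter", "jam"},
]

# Rule: customers who buy bread also tend to buy butter
print(support(baskets, {"bread", "butter"}))       # 3 of 5 baskets
print(confidence(baskets, {"bread"}, {"butter"}))  # 3 of the 4 bread baskets
```

Apriori's contribution, not shown here, is pruning: it only considers larger itemsets whose subsets already meet a minimum support threshold.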

Larose, D. T., & Larose, C. D. (2014). Discovering Knowledge in Data (2nd ed.). https://doi.org/10.1002/9781118874059

Pierson, L. (2017). Data Science for Dummies. John Wiley & Sons, Incorporated.

Shmueli, G., Patel, N., & Bruce, P. (2011). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (2nd ed.). John Wiley & Sons.

Shukla, S. (2019). Regression and Classification: Supervised Machine Learning. Retrieved June 12, 2019, from https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/

CRISP-DM

The Cross-Industry Standard Process for Data Mining, or CRISP-DM, is an industry-proven process model that provides an overview of the data mining life cycle.  It is also a methodology that includes descriptions of the typical phases of a project, the tasks involved in each phase, and an explanation of the relationships between these tasks.

The life cycle model consists of six phases, with arrows indicating dependencies between phases. The sequence of the phases is not set in stone, a fact illustrated by the various depictions that can be found online. It is possible to move back and forth between phases as often as needed or skip phases altogether if they are not relevant.  CRISP-DM allows you to create a data mining model that fits your needs.

The CRISP-DM model

Business/Research Understanding Phase

The goal of the first phase is to come up with a proposed solution to a specific problem presented by the business.  An investigation must be conducted to make sure there is a clear understanding of the objectives and requirements based on business goals, not on any existing reports or processes.  This should also take existing resources into account and should involve any subject matter experts within the company.  The specific business objective should then be converted into an equally clear data mining problem definition that communicates the goals that, if met, can be used by the business to address the original problem or question.  This should include specifying the type of data problem being faced and the benchmarks to be used for measuring the technical goals and outcomes.  With this information in hand, a project plan can be put in place that specifies the effort required, resources needed, and cost.

Data Understanding Phase

The data understanding phase of CRISP-DM begins with the initial data collection, followed by a close examination to familiarize yourself with the data collected.  The goal here is to evaluate the quality of the data, identify potential problems, gain insights into the data, and potentially detect interesting subsets that may lead to actionable patterns. This step is critical in avoiding unexpected problems during the next phase, data preparation, which is typically the longest part of a project.  It may also require going back to the previous phase if the problem posed isn’t clear enough.

Data Preparation Phase

As stated, data preparation is the most time-consuming part of data mining, taking an estimated 50-70% of the required time and effort, and is also the most important.  It includes all the activities necessary to construct the final dataset out of the initial raw data collected in the previous phase, including case and variable selection, variable transformation, and data cleansing.
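The three activities named above can be illustrated with a toy sketch. The field names and rules here are entirely hypothetical; real preparation pipelines are far larger, but the shape is the same:

```python
def prepare(raw_rows):
    """Toy data preparation: case selection (drop incomplete records),
    variable selection (keep only fields of interest), and variable
    transformation (rescale income to thousands)."""
    prepared = []
    for row in raw_rows:
        if row.get("age") is None or row.get("income") is None:
            continue  # cleansing / case selection: discard incomplete cases
        prepared.append({
            "age": row["age"],                   # variable selection
            "income_k": row["income"] / 1000.0,  # variable transformation
        })
    return prepared

# Hypothetical raw data straight from collection
raw = [
    {"age": 34, "income": 52000, "notes": "n/a"},
    {"age": None, "income": 48000, "notes": ""},
    {"age": 29, "income": 61000, "notes": "vip"},
]
print(prepare(raw))
```

Even in this tiny example, one of three records is lost to missing values, which hints at why this phase dominates a project's schedule.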

Modeling Phase

The modeling phase is where the hard work from the previous three phases begins to pay off.  Sometimes requiring multiple passes using several different models, the process usually begins with default parameters that are fine-tuned over time.  It is also possible that, based on a model’s requirements, it will be necessary to loop back to the preparation phase to manipulate the data to fit a specific model better. In the end, the results should begin to shed some light on the business or research problem posed during Business Understanding.

Evaluation Phase

Now that most of the data mining work is complete, the models that have been built need to be tested against the business success criteria established at the beginning of the project, both to ensure their quality and effectiveness and to confirm that all critical business issues have been sufficiently considered.  Ultimately, one model should be chosen as the best candidate to proceed with.
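Mechanically, this phase boils down to scoring every candidate against the agreed success criterion and keeping the winner. The models, holdout data, and criterion below are invented stand-ins:

```python
def pick_best(models, evaluate):
    """Score each candidate model with the agreed success criterion
    and return the best one along with all scores."""
    scores = {name: evaluate(fn) for name, fn in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical held-out examples: (input, true label)
holdout = [(1, "A"), (2, "A"), (3, "B"), (4, "B")]

def accuracy(model):
    """Example success criterion: fraction of holdout cases predicted correctly."""
    return sum(model(x) == y for x, y in holdout) / len(holdout)

candidates = {
    "always_A": lambda x: "A",
    "threshold": lambda x: "A" if x <= 2 else "B",
}

best, scores = pick_best(candidates, accuracy)
print(best, scores)
```

In practice the criterion is rarely plain accuracy; it is whatever business measure was fixed back in the Business Understanding phase.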

Deployment Phase

The deployment phase is where the new insights gained in the previous phase are used to make changes within the organization.  In general, this includes two activities: planning and monitoring the deployment of a code representation of the model, and completion tasks such as reports and project reviews.  The deployed code is used to score or categorize new data as it arises, data that is then read into a data warehouse, and to provide a means of applying that new data to the solution of the original business problem.