Data Scientist Capstone — Customer Segmentation Report and Marketing Campaign Prediction

Aravind Brahmadevara
12 min read · Jun 26, 2021

Project Definition

Project Overview

Market/customer segmentation is a technique widely used by businesses to gain competitive advantage and increase sales and profits. Population segmentation is the process of dividing a population into similar groups or segments, typically based on demographic attributes. Customer segmentation maps customers onto those segments to understand them better and look for new growth opportunities. It is in the company's best interest to target potential customers at the lowest possible cost, so the responsiveness to historical campaigns is analyzed in order to reach the customers most likely to respond.

In this project, I applied Data Science tools and Machine Learning algorithms to datasets provided by Bertelsmann/Arvato Financial Services. There are four datasets: demographics for the general population of Germany, demographics for customers of a mail-order company, and two demographics datasets for individuals who were targets of a marketing campaign.

Problem Introduction

The goal is to

  1. Cluster the general population of Germany into segments
  2. Map the customers of a mail order company to the appropriate segments
  3. Train classifier ML models on a previous campaign dataset
  4. Measure and choose the best model
  5. Use the trained model to predict responses on a final test dataset and submit the results to the Kaggle Data Science competition

Strategy to solve the problem

  1. Since the number of dimensions is high, the percentage change in entropy between the population and customer datasets is used to narrow down to the most informative features.
  2. Principal Component Analysis (PCA) reduces dimensionality and the effect of correlated features by transforming the data into independent, orthonormal components.
  3. An appropriate unsupervised clustering algorithm is chosen based on the intra-cluster distance metric.
  4. Customers are mapped to the clusters to identify the clusters with a high proportion of customers compared to the population.
  5. An appropriate classifier that predicts binary responses ('Yes' or 'No') is chosen based on the ROC-AUC metric for the final campaign prediction model.

Metrics

  1. For PCA, the cumulative sum of explained variance (components sorted from high to low variance) was plotted, and the number of components explaining 99% of the total variance was chosen.
  2. For the clustering algorithm, the average intra-cluster distance (inertia), which measures how tightly cohesive the clusters are, was used, and an elbow plot was used to determine the right number of clusters. The silhouette score, which balances intra- and inter-cluster distance, could also be used, but it is computationally expensive.
  3. For the classifier, since the dataset is imbalanced, Precision-Recall AUC was used to gauge performance on the positive class and ROC AUC to measure the general performance of a model. The weighted-average F1-score is also a good indicator of model performance on an imbalanced dataset.

Precision = TP / (TP + FP)

Recall (True Positive Rate, TPR) = TP / (TP + FN)

False Positive Rate (FPR) = FP / (TN + FP)

ROC-AUC is the area under the TPR vs. FPR curve across probability cutoffs; PR-AUC is the area under the Precision vs. Recall curve across probability cutoffs.
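
As a quick illustration with scikit-learn (a sketch only; the labels and probabilities below are made-up placeholders, not project data):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Hypothetical binary labels and predicted probabilities for illustration
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.3])

# ROC-AUC: area under the TPR vs. FPR curve across probability cutoffs
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# PR-AUC (average precision): area under the Precision vs. Recall curve
print("PR-AUC :", average_precision_score(y_true, y_score))

# Weighted-average F1 needs hard predictions, e.g. with a 0.5 cutoff
y_pred = (y_score >= 0.5).astype(int)
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))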

Exploratory Data Analysis (EDA):

Data Sets:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Data Exploration:

  • The given dataset contains 366 features, which is a large number, and the general population consists of 891,211 rows. The data is accompanied by two spreadsheets explaining the feature names and their descriptions. The main attributes can be broadly categorized as Person, Household, Building, Zip Code, etc. The features were analyzed both through textual interpretation and data-driven statistics.
  • There are 92 columns with no null values:
'LNR', 'AGER_TYP', 'D19_BANKEN_ANZ_12', 'D19_BANKEN_ANZ_24',
'D19_BANKEN_DATUM', 'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS',
'D19_BANKEN_LOKAL', 'D19_BANKEN_OFFLINE_DATUM',
'D19_BANKEN_ONLINE_DATUM', 'D19_BANKEN_REST', 'D19_BEKLEIDUNG_GEH',
'D19_BEKLEIDUNG_REST', 'D19_BILDUNG', 'D19_BIO_OEKO', 'D19_BUCH_CD',
'D19_DIGIT_SERV', 'D19_DROGERIEARTIKEL', 'D19_ENERGIE', 'D19_FREIZEIT',
'D19_GARTEN', 'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24',
'D19_GESAMT_DATUM', 'D19_GESAMT_OFFLINE_DATUM',
'D19_GESAMT_ONLINE_DATUM', 'D19_HANDWERK', 'D19_HAUS_DEKO',
'D19_KINDERARTIKEL', 'D19_KONSUMTYP_MAX', 'D19_KOSMETIK',
'D19_LEBENSMITTEL', 'D19_NAHRUNGSERGAENZUNG', 'D19_RATGEBER',
'D19_REISEN', 'D19_SAMMELARTIKEL', 'D19_SCHUHE', 'D19_SONSTIGE',
'D19_TECHNIK', 'D19_TELKO_ANZ_12', 'D19_TELKO_ANZ_24',
'D19_TELKO_DATUM', 'D19_TELKO_MOBILE', 'D19_TELKO_OFFLINE_DATUM',
'D19_TELKO_ONLINE_DATUM', 'D19_TELKO_REST', 'D19_TIERARTIKEL',
'D19_VERSAND_ANZ_12', 'D19_VERSAND_ANZ_24', 'D19_VERSAND_DATUM',
'D19_VERSAND_OFFLINE_DATUM', 'D19_VERSAND_ONLINE_DATUM',
'D19_VERSAND_REST', 'D19_VERSI_ANZ_12', 'D19_VERSI_ANZ_24',
'D19_VERSI_DATUM', 'D19_VERSI_OFFLINE_DATUM', 'D19_VERSI_ONLINE_DATUM',
'D19_VERSICHERUNGEN', 'D19_VOLLSORTIMENT', 'D19_WEIN_FEINKOST',
'FINANZ_ANLEGER', 'FINANZ_HAUSBAUER', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_VORSORGER',
'FINANZTYP', 'GEBURTSJAHR', 'GREEN_AVANTGARDE', 'HEALTH_TYP',
'KOMBIALTER', 'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'SEMIO_DOM',
'SEMIO_ERL', 'SEMIO_FAM', 'SEMIO_KAEM', 'SEMIO_KRIT', 'SEMIO_KULT',
'SEMIO_LUST', 'SEMIO_MAT', 'SEMIO_PFLICHT', 'SEMIO_RAT', 'SEMIO_REL',
'SEMIO_SOZ', 'SEMIO_TRADV', 'SEMIO_VERT', 'SHOPPER_TYP', 'VERS_TYP',
'ZABEOTYP', 'ANREDE_KZ', 'ALTERSKATEGORIE_GROB'
  • Most of the columns are categorical (ordinal), and a few are numerical. Some columns contain the string values ‘X’ or ‘XX’, which pandas interprets as the object data type; these are cleaned later.
  • Correlation of columns: for simplicity, Pearson correlation coefficients were used to identify similarity since the columns are ordinal (the Spearman rank correlation coefficient would also be a reasonable choice). Correlated columns were not removed for the clustering exercise, since PCA handles correlations, but they were removed for the supervised learning exercise.
Closely correlated columns were identified programmatically after imputation.
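
A sketch of how such pairs can be found, with azdias standing in for the imputed population DataFrame:

import numpy as np
import pandas as pd

# azdias: the imputed population DataFrame (numeric/ordinal columns only)
corr = azdias.select_dtypes(include="number").corr(method="pearson").abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs of columns whose absolute Pearson correlation exceeds 0.9
correlated_pairs = (
    upper.stack()
    .loc[lambda s: s > 0.9]
    .sort_values(ascending=False)
)
print(correlated_pairs)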

Data Visualization

Some of the mostly NaN-valued columns are ALTER_KIND1, ALTER_KIND2, ALTER_KIND3 and ALTER_KIND4. ALTER_KIND1 and ALTER_KIND2 were not dropped, however, since the change in entropy between the population and the customers is significant, as explained later.

Distribution of ALTER_KIND1 with and without NaN values

Methodology

Data Pre Processing

Categorical:

Most of the dataset consists of categorical variables, label encoded with ordinal values. One-hot encoding would bloat the number of dimensions and make PCA harder. Ordinal data has a meaningful order (e.g., movie ratings 1, 2, 3, 4, 5) and is fine for PCA.

‘OST_WEST_KZ’ is a column indicating the historical region (East or West Germany). One-hot encoding was used for this feature, as ordinal values are not meaningful here.
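
A minimal sketch with pandas, assuming the population DataFrame is called azdias:

import pandas as pd

# OST_WEST_KZ holds 'O'/'W' flags; one-hot encode it instead of treating it as ordinal
azdias = pd.get_dummies(azdias, columns=["OST_WEST_KZ"], prefix="OST_WEST_KZ")
# Creates binary columns such as OST_WEST_KZ_O and OST_WEST_KZ_W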

Numeric cols

There are very few numeric columns. NaN values were replaced with median values because these columns contain outliers:

'ANZ_HAUSHALTE_AKTIV','ANZ_HH_TITEL','ANZ_PERSONEN','ANZ_TITEL','KBA13_ANZAHL_PKW','ANZ_STATISTISCHE_HAUSHALTE'
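
A minimal sketch of that median imputation, again with azdias as a stand-in for the loaded DataFrame:

import pandas as pd

NUMERIC_COLS = ['ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'ANZ_PERSONEN',
                'ANZ_TITEL', 'KBA13_ANZAHL_PKW', 'ANZ_STATISTISCHE_HAUSHALTE']

# Median is preferred over the mean here because these counts contain outliers
azdias[NUMERIC_COLS] = azdias[NUMERIC_COLS].fillna(azdias[NUMERIC_COLS].median())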

Cleaning:

Most of the time, statistical methods were used to decide which less important columns to drop. A forward likelihood-ratio approach was not used for feature selection, since the dataset is an exhaustive representation of the population.

‘GEBURTSJAHR’ (‘year of birth’) was removed since many other columns already indicate a person's age. Other columns were also removed because they represent insertion dates, reporting dates or metadata:

'GEBURTSJAHR', 'EINGEFUEGT_AM', 'EINGEZOGENAM_HH_JAHR', 'LNR', 'MIN_GEBAEUDEJAHR', 'D19_LETZTER_KAUF_BRANCHE'

Data Imputation:

One of the major challenges was choosing a default value for imputation, which is discussed below.

Rows were not skipped, because every row has 90+ non-null columns. Different types of columns use 0, -1 or 9 as their 'unknown' code. The null/NaN values were carefully imputed by studying each column. For some ordinal columns, such as LP_LEBENSPHASE_FEIN, median imputation was used because no 'unknown' value is defined in the documentation (see the sketch after the figures below).

Examples of imputation with the 0/-1/9 unknown codes and with the column median
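
In the sketch below, the unknown-code mapping is illustrative only; the real mapping would be built from the attribute-description spreadsheets:

import pandas as pd

# Hypothetical mapping from column name to its documented "unknown" code
unknown_code = {
    "AGER_TYP": -1,
    "NATIONALITAET_KZ": 0,
    "KBA05_BAUMAX": 0,
    # ... remaining columns taken from the spreadsheets
}

for col, code in unknown_code.items():
    # NaN means "not known", so map it to the column's unknown code
    azdias[col] = azdias[col].fillna(code)

# Columns without a documented unknown code (e.g. LP_LEBENSPHASE_FEIN) fall back to the median
azdias["LP_LEBENSPHASE_FEIN"] = azdias["LP_LEBENSPHASE_FEIN"].fillna(
    azdias["LP_LEBENSPHASE_FEIN"].median()
)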

nunique:

For a categorical variable, nunique() returns the number of unique values in a column. If the ratio of nunique() to the number of rows is high, the column is likely numerical (for example, a date column or a count of cars).
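
A rough sketch of this heuristic; the 1% threshold is an arbitrary illustration, not the project's actual cutoff:

# Columns whose ratio of unique values to rows is high are likely numerical
# (IDs, dates, counts) rather than categorical
ratio = azdias.nunique() / len(azdias)
likely_numeric = ratio[ratio > 0.01].index.tolist()   # 1% threshold is an assumption
print(likely_numeric)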

Entropy:

The goal was to pick the columns that differ the most between the population and the customers, in order to reduce dimensions.

Most of the dataset consists of categorical variables. Entropy is a statistical measure of how disorderly a random variable is; the higher the entropy, the less likely the column is categorical. For example, LNR had an entropy above 13; after inspecting the spreadsheet, it turned out to be the unique id of the customer, which is not required for the analysis. For some columns, an online German-to-English dictionary was used to understand the meaning.

The percentage change in entropy between the population and customer datasets allowed me to focus on a smaller set of columns. Entropy computed with null values included was used as well, so that no relevant column was unintentionally excluded.

With a 10% cutoff for the change in entropy, 117 columns showed significant variation between the population and the customers.

With a 5% cutoff, 200+ columns showed significance.

Here are the columns with the highest percentage change in entropy:
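
A sketch of how this ranking can be computed, with azdias and customers as the two loaded DataFrames:

import numpy as np
import pandas as pd
from scipy.stats import entropy

def column_entropy(series: pd.Series) -> float:
    """Shannon entropy of a column's value distribution (NaN kept as its own category)."""
    probs = series.value_counts(dropna=False, normalize=True)
    return entropy(probs, base=2)

# Assumes the population and customer DataFrames share the candidate columns
common_cols = azdias.columns.intersection(customers.columns)
pop_ent = azdias[common_cols].apply(column_entropy)
cust_ent = customers[common_cols].apply(column_entropy)

# Percentage change in entropy between population and customers
pct_change = ((cust_ent - pop_ent).abs() / pop_ent.replace(0, np.nan) * 100)
pct_change = pct_change.sort_values(ascending=False)

# Columns above a 10% change were kept for PCA/clustering in this write-up
selected = pct_change[pct_change > 10].index.tolist()
print(len(selected), "columns selected")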

PCA

Principal Component Analysis is a linear dimensionality reduction technique. PCA works on the covariance matrix, which is square and symmetric, so its eigenvectors (the new dimensions) are orthonormal and the resulting components are uncorrelated.

PCA was fitted, and the analysis suggested that 100 components were enough to explain 99% of the variance. From the earlier analysis, SOHO (Small Office/Home Office) and TITLE had a correlation of 0.9; PCA should automatically capture such correlations and reduce the dimensions.

The data was standardized (scaled) before PCA. Centering without scaling hurts the quality of the components, since features on a larger scale would then explain most of the variance.
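
A minimal sketch of the scaling and component selection, with azdias_clean as a hypothetical name for the cleaned, encoded population data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: centering alone is not enough, because unscaled
# high-variance features would dominate the leading components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(azdias_clean)

# Fit a full PCA first to inspect the cumulative explained variance
pca_full = PCA(random_state=42).fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.99)) + 1
print("Components explaining 99% of variance:", n_components)   # ~100 in this project

# Refit with the chosen number of components for the downstream clustering
pca = PCA(n_components=n_components, random_state=42)
X_pca = pca.fit_transform(X_scaled)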

PCA explained variance
Principal Component 1 (Top 5 Positive and Negative features)
  • TITEL_KZ — flag for whether the person holds an academic title. This is one of the most important columns in PC1, which is also reflected in the entropy analysis.
  • KBA05_KRSKLEIN — share of small cars (relative to the county average). This column relates to the locality the person lives in.
  • A few columns have no descriptions
Principal Component 2 (Top 5 Positive and Negative features)
  • ONLINE_AFFINITAET — ONLINE AFFINITY
  • D19_VERSAND_ANZ_12 — transaction activity MAIL-ORDER in the last 12 months. This column also indicates if a person could become a mail order customer.
Principal Component 3 (Top 5 Positive and Negative features)

The financial attributes of the person have a definite influence on the clustering of the population.

  • FINANZ_SPARER — financial typology: money saver
  • FINANZ_VORSORGER — financial typology: be prepared
  • FINANZ_ANLEGER — financial typology: investor
  • FINANZ_UNAUFFAELLIGER — financial typology: unremarkable
  • SEMIO_PFLICHT — affinity indicating in what way the person is dutiful and traditionally minded
  • SEMIO_TRADV — affinity indicating in what way the person is traditionally minded
Principal Component 4 (Top 5 Positive and Negative features)
  • ANZ_HAUSHALTE_AKTIV — number of households in the building.
  • BAUMAX — most common building-type

These are the demographic factors influencing the clusters

Modelling

Implementation

KMeans++:

DBSCAN can find clusters of arbitrary shape and determines the number of clusters itself. However, a quick look at a 2-component PCA plot suggests there is no clear boundary within the general population or within the customers. MeanShift automatically finds the number of clusters based on density tapering, but it is computationally intensive.

Given the size of the population dataset (800K+ rows), the KMeans algorithm was chosen, and then the ideal number of clusters was identified.

The PCA-reduced population dataset was used to fit the KMeans algorithm. An elbow plot was used to find the best cluster count, e.g. 12 clusters.
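
A sketch of the elbow analysis; with 800K+ rows, MiniBatchKMeans or a sample of the rows could be substituted to speed it up:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X_pca: the PCA-reduced population data from the previous sketch
inertias = []
ks = range(2, 21)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X_pca)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow plot for KMeans")
plt.show()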

Customer Segmentation:

The customer dataset was transformed using the PCA object fitted above, and clusters were predicted using the fitted KMeans model (sketched below).

  • Most of the customers belong to clusters 11, 6 and 3, out of clusters numbered 0 to 11
Customer Segmentation
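
In the sketch below, scaler, pca and X_pca come from the earlier PCA sketch, kmeans is refit with the chosen 12 clusters, and customers_clean is a stand-in for the customer data cleaned with the same pipeline:

import pandas as pd
from sklearn.cluster import KMeans

# Refit KMeans with the cluster count chosen from the elbow plot (12 here)
kmeans = KMeans(n_clusters=12, init="k-means++", n_init=10, random_state=42).fit(X_pca)

# Transform customers with the PCA fitted on the population, then predict their clusters
customers_pca = pca.transform(scaler.transform(customers_clean))

# Compare each cluster's share of customers against its share of the population
comparison = pd.DataFrame({
    "population_%": pd.Series(kmeans.labels_).value_counts(normalize=True) * 100,
    "customers_%": pd.Series(kmeans.predict(customers_pca)).value_counts(normalize=True) * 100,
})
comparison["difference"] = comparison["customers_%"] - comparison["population_%"]
print(comparison.sort_values("difference", ascending=False))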

Comparing the cluster with the highest customer concentration (11) to the one with the lowest (10), attributes such as LP_LEBENSPHASE_FEIN (life stage, fine-grained) and LP_LEBENSPHASE_GROB (life stage and financial situation, rough) play a major role.

KBA13_ANZAHL_PKW — the number of cars in the PLZ8 region — indicates that a premium locality tends to have a larger number of customers.

Here are the top positive and top negative differences among the clusters.

Top 5 differences between Cluster 11 and Cluster 10

Supervised Learning

  • The training dataset was cleaned and transformed with the same procedures as above

PCA for identifying clusters

Train and test datasets: PCA was applied to the training and test datasets, which were then mapped to clusters. The distribution is largely similar to the customers' cluster distribution, so the mail campaign was indeed targeted at the right set of potential customers.

Train set distribution in clusters

No PCA for supervised learning

One key learning is that PCA focuses on variables with large variance. In some problems the labels align with the principal components, but that was not the case for this dataset. In classification problems, the separation of labelled data points can sometimes be explained well by variables with very small variance. With PCA, the model was not performing at its best, so PCA was dropped for the supervised step.

Scaling the columns did not significantly improve performance, as most of the dataset is ordinal.

Imbalanced Data Set

The dataset is imbalanced: the number of people who responded positively to the campaign is low, so the positive samples are a minority class. Accuracy is therefore biased towards the majority class. The per-class precision/F1-score is highly affected by extreme values, so the weighted-average F1-score/precision can be used instead. Instead of accuracy, metrics such as ROC-AUC and PR-AUC can be used to evaluate the model's general capability.

The training set was split with stratify=y to maintain the class-imbalance ratio.

Oversampling

SMOTE oversampling was tried to improve the class balance, but the resulting models were slightly worse than those trained without SMOTE, as SMOTE was causing the model to overfit in this case. More advanced oversampling and undersampling techniques might help.
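
A sketch of the stratified split and a SMOTE pipeline, with X and y standing in for the cleaned MAILOUT_TRAIN features and the RESPONSE labels:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stratified split keeps the rare positive class at the same ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Keeping SMOTE inside an imblearn Pipeline ensures synthetic samples are only
# created on the training data and never leak into evaluation data
model = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),
    ("clf", GradientBoostingClassifier(random_state=42)),
])
model.fit(X_train, y_train)

test_scores = model.predict_proba(X_test)[:, 1]
print("ROC-AUC with SMOTE:", roc_auc_score(y_test, test_scores))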

Classifier for training:

Columns chosen for the first trial:

Only one column from each group of correlated columns (correlation > 0.9) was retained.

['AGER_TYP', 'AKT_DAT_KL', 'ALTER_KIND1', 'ALTER_KIND2', 'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'ANZ_KINDER', 'ANZ_PERSONEN', 'ANZ_TITEL', 'CAMEO_DEUG_2015', 'CJT_KATALOGNUTZER', 'CJT_TYP_3', 'CJT_TYP_4', 'CJT_TYP_5', 'CJT_TYP_6', 'D19_BANKEN_ANZ_12', 'D19_BANKEN_ANZ_24', 'D19_BANKEN_DATUM', 'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL', 'D19_BANKEN_OFFLINE_DATUM', 'D19_BANKEN_ONLINE_DATUM', 'D19_BANKEN_ONLINE_QUOTE_12', 'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST', 'D19_BILDUNG', 'D19_BIO_OEKO', 'D19_BUCH_CD', 'D19_DIGIT_SERV', 'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN', 'D19_GESAMT_ANZ_12', 'D19_GESAMT_OFFLINE_DATUM', 'D19_GESAMT_ONLINE_QUOTE_12', 'D19_HANDWERK', 'D19_HAUS_DEKO', 'D19_KINDERARTIKEL', 'D19_KONSUMTYP_MAX', 'D19_KOSMETIK', 'D19_LEBENSMITTEL', 'D19_LOTTO', 'D19_NAHRUNGSERGAENZUNG', 'D19_RATGEBER', 'D19_REISEN', 'D19_SAMMELARTIKEL', 'D19_SONSTIGE', 'D19_SOZIALES', 'D19_TECHNIK', 'D19_TELKO_ONLINE_DATUM', 'D19_VERSAND_ANZ_24', 'D19_VERSAND_DATUM', 'D19_VERSAND_OFFLINE_DATUM', 'D19_VERSI_OFFLINE_DATUM', 'D19_VERSI_ONLINE_DATUM', 'D19_VERSI_ONLINE_QUOTE_12', 'D19_VOLLSORTIMENT', 'D19_WEIN_FEINKOST', 'DSL_FLAG', 'EXTSEL992', 'FINANZ_ANLEGER', 'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_VORSORGER', 'GREEN_AVANTGARDE', 'HH_DELTA_FLAG', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_AUTOQUOT', 'KBA05_BAUMAX', 'KBA05_MAXAH', 'KBA13_ANTG4', 'KBA13_ANZAHL_PKW', 'KBA13_BAUMAX', 'KBA13_KMH_110', 'KBA13_KMH_251', 'KBA13_KRSSEG_KLEIN', 'KK_KUNDENTYP', 'KOMBIALTER', 'KONSUMZELLE', 'LP_LEBENSPHASE_FEIN', 'MOBI_RASTER', 'NATIONALITAET_KZ', 'ONLINE_AFFINITAET', 'OST_WEST_KZ', 'RETOURTYP_BK_S', 'RT_KEIN_ANREIZ', 'RT_SCHNAEPPCHEN', 'SEMIO_DOM', 'SEMIO_ERL', 'SEMIO_FAM', 'SEMIO_KRIT', 'SEMIO_LUST', 'SEMIO_PFLICHT', 'SEMIO_SOZ', 'SEMIO_TRADV', 'SOHO_KZ', 'UMFELD_JUNG', 'UNGLEICHENN_FLAG', 'VERS_TYP', 'VHA', 'VK_DISTANZ', 'WOHNDAUER_2008', 'W_KEIT_KIND_HH', 'ZABEOTYP'] 107

Various classifiers such as LogisticRegression, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier and XGBClassifier were tried.

Model Selection

Cross-validation showed that gradient boosting classifiers perform better (in terms of ROC-AUC) than Logistic Regression and the Random Forest classifier; a sketch of such a comparison follows the figure.

Cross-validation results: AdaBoost, Gradient Boosting, Random Forest
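
An illustrative comparison loop, reusing X_train and y_train from the split above:

from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="roc_auc")
    print(f"{name:>18}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")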

Hyper Parameter Tuning

RandomizedSearchCV and GridSearchCV were used for hyperparameter tuning; a RandomizedSearchCV sketch follows the figures below.

RandomizedSearchCV
GridSearchCV
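
The search space below is illustrative rather than the exact one used in the project:

from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical distributions to sample hyperparameters from
param_distributions = {
    "n_estimators": randint(100, 500),
    "learning_rate": uniform(0.01, 0.2),   # samples from [0.01, 0.21]
    "max_depth": randint(2, 6),
    "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=25,
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best ROC-AUC:", search.best_score_)
print("Best params :", search.best_params_)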

Results

Model Evaluation

Evaluation with SMOTE

Precision-Recall AUC is preferable for imbalanced datasets; however, the PR AUC here was not great. The dataset is challenging from a separability point of view.

Evaluation without SMOTE
Best-performing model

A ROC-AUC score of 0.79 was achieved with only 109 columns and a GradientBoostingClassifier; this is the current leaderboard position in the Kaggle Data Science competition as of 26 Jun 2021.

Refinements:

With more columns, using a 5% change in entropy as the cutoff, there was only a slight improvement in performance.

Conclusion/Reflection:

The project can be summarized as

  1. Downloading the datasets, then cleaning, transforming and imputing columns
  2. Choosing a narrower set of columns using entropy for PCA and clustering
  3. Choosing a clustering model based on inertia
  4. Training multiple classifier models on the training dataset and tuning them with grid search
  5. Submitting results to the Kaggle competition
  6. Tuning models to improve the performance of the results submitted to Kaggle

I found steps 2 and 6 the most challenging. I used entropy to choose important columns and further removed correlated columns to train the prediction models with 107 columns. The ROC AUC peaked between 0.77 and 0.80.

I would still need a better criterion for picking the best among related columns, as discussed in Improvements, and to improve my XGBoost skills to lift performance further.

Improvements

  • Use a better selection criterion to pick the best column among a set of correlated columns, e.g. CAMEO_DEU_2015, CAMEO_DEUG_2015 and CAMEO_INTL_2015. Right now the model picks one of them arbitrarily, but CAMEO_DEUG_2015 has the largest % change in entropy of the three.
  • Use XGBClassifier to improve performance.
  • Use a blend of more advanced oversampling (e.g. ADASYN) and undersampling techniques.
  • Use other tuning techniques, such as Bayesian search over parameters with well-known distributions.

Acknowledgements

Special thanks to Udacity and its partner Bertelsmann/Arvato Financial Services for making this available to us as part of the Udacity Data Scientist nanodegree.

References:

My Github

I am thankful to Google, the Udacity forums and various articles that helped me understand everything better.
