Clustering and decision trees
On the final day of my statistics training I learnt clustering and decision trees. The funny thing about these two topics is that I learnt the algorithms in my computer science classes but was never taught how to apply them to real-world problems. That’s what I’m doing today.
Clustering
Clustering is good for uncovering unknown groupings in a data set. It is primarily a descriptive technique that helps raise further questions for future analysis. The following table lists a number of countries and their scores on four measures:
Country | Literacy (%) | Infant mortality (per 1,000 births) | Birth rate (per 1,000) | Death rate (per 1,000) |
---|---|---|---|---|
Argentina | 95 | 25.6 | 20 | 9 |
Australia | 100 | 7.3 | 15 | 8 |
Bolivia | 78 | 75 | 34 | 9 |
Cameroon | 54 | 77 | 41 | 12 |
Chile | 93 | 14.6 | 23 | 6 |
China | 78 | 52 | 21 | 7 |
Costa Rica | 93 | 11 | 26 | 4 |
Egypt | 48 | 76.4 | 29 | 9 |
Ethiopia | 24 | 110 | 45 | 14 |
Greece | 93 | 8.2 | 10 | 10 |
Haiti | 53 | 109 | 40 | 19 |
India | 52 | 79 | 29 | 10 |
Indonesia | 77 | 68 | 24 | 9 |
Italy | 97 | 7.6 | 11 | 10 |
Kenya | 69 | 74 | 42 | 11 |
Kuwait | 73 | 12.5 | 28 | 2 |
Mexico | 87 | 35 | 28 | 5 |
Nicaragua | 57 | 52.5 | 35 | 7 |
Nigeria | 51 | 75 | 44 | 12 |
Philippines | 90 | 51 | 27 | 7 |
Somalia | 24 | 126 | 46 | 13 |
Thailand | 93 | 37 | 19 | 6 |
USA | 97 | 8.1 | 15 | 9 |
Vietnam | 88 | 46 | 27 | 8 |
Zambia | 73 | 85 | 46 | 18 |
One way to group data is hierarchical clustering. The result is a dendrogram.
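For illustration, here is a minimal sketch of how that might look in R, assuming the table above has been read into a data frame called `countries` with the country names as row names (the original code isn’t shown, and it may also have standardised the columns first):

```r
# Hierarchical clustering sketch; `countries` is an assumed data frame
# holding the four numeric measures, with country names as row names.
d  <- dist(countries)   # pairwise Euclidean distances between countries
hc <- hclust(d)         # agglomerative clustering (complete linkage by default)
plot(hc)                # draw the dendrogram
```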
The hclust function supports several agglomeration (linkage) methods, and changing the method sometimes produces surprising insights. A dendrogram is great for people to visualise the clusters, but if we want the clusters in numerical form we can do this:
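Something along these lines, reusing the `hc` object from the sketch above:

```r
groups <- cutree(hc, k = 4)  # cut the dendrogram into 4 clusters
groups                       # named vector: one cluster number per country
table(groups)                # size of each cluster
```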
The k=4 argument to cutree says we want 4 clusters. It’s easily verified from the dendrogram. Alternatively, starting from the dendrogram, we can specify a height. For example, cutting the dendrogram at height 60 will yield 3 clusters (a horizontal line y=60 will cut through 3 lines).
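As a sketch, the height-based cut looks like this:

```r
cutree(hc, h = 60)  # cut at height 60; with this dendrogram that yields 3 clusters
```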
Next, we want to see how clusters 1 and 2 differ. Let’s trim the cluster table down a bit.
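One way to do that (the object and column names here are assumptions, not the original code) is to attach the cluster labels to the data, then keep only clusters 1 and 2 and the literacy column:

```r
# Attach the cluster labels from the 4-cluster cut to the data,
# then trim down to clusters 1 and 2 and the literacy column
clustered <- data.frame(countries, cluster = groups)
two <- subset(clustered, cluster %in% c(1, 2), select = c(Literacy, cluster))
two
```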
It becomes quite clear that clusters 1 and 2 differ in their literacy rates. A dot chart reveals further clustering:
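A sketch of that dot chart, reusing the trimmed `two` table from above:

```r
# Dot chart of literacy for the countries in clusters 1 and 2;
# sorting first makes the groupings stand out
two <- two[order(two$Literacy), ]
dotchart(two$Literacy, labels = rownames(two), xlab = "Literacy")
```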
Our 2-cluster data table actually contains 3 clusters! The dot chart shows there are high-literacy countries such as Australia and Italy; middling ones such as China and Indonesia; and low-literacy ones like Egypt and India. A picture tells a thousand words indeed. The point of clustering and the different charts is really to discover hidden patterns in the data. When we start to find homogeneity within a cluster (all the highly literate countries, say), it becomes useful to explore other dimensions such as infant mortality.
Another way of clustering data is K-means clustering.
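A minimal sketch, again assuming the `countries` data frame; scaling puts the four measures on comparable ranges, although the original exercise may have worked on the raw values:

```r
set.seed(1)                                    # k-means starts from random centroids
km <- kmeans(scale(countries), centers = 3, nstart = 25)
km$cluster                                     # cluster assignment per country
km$centers                                     # the three centroids (in scaled units)
```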
The magic here is the kmeans function, which produces the 3 clusters we asked for. It offers several algorithms that repeatedly assign each record to its nearest cluster centroid and then recompute the centroids. The things to watch out for are cohesion of points within a cluster and separation between clusters. In our country list, Vietnam, the Philippines and China are ambiguous. Further analysis can help guide the clustering effort.
Decision tree
In this exercise we use the bank loan data sets. The first, the training set, is a historical set of borrower data with each borrower’s known default status. In other words, a group of people who’ve borrowed money, together with whether they defaulted on the loan. We want to understand what causes borrowers to default.
Most columns are easy to understand but some take a bit of explaining. ed is an ordinal attribute, ranging from 1 (did not complete high school) to 5 (post-graduate degree). It follows that you can use Euclidean distance to measure how different two people are, even though strictly speaking the difference is not linear. employ and address represent the number of years spent at the current employer and home address, respectively. income, creddebt and othdebt are all dollar amounts in thousands, and debtinc is the debt-to-income ratio as a percentage.
Our aim is to use the training set of 700 records to predict how likely a future borrower is to default. First, we construct a decision tree.
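The tree in this write-up came from the training environment, but a comparable tree can be grown with the rpart package; the data frame name `loans`, the `default` column name and the plotting helper are assumptions:

```r
library(rpart)
library(rpart.plot)

# Grow a classification tree predicting default from all borrower attributes
fit <- rpart(default ~ ., data = loans, method = "class")

rpart.plot(fit)  # draw the tree with node counts and default proportions
```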
We can see that the people who have defaulted most are those with a high debt-to-income ratio (> 14.7) who have stayed at the same job for more than 5 years and carry high credit card debt (leaf node 19 on the right). Intuitively it makes sense: although staying in a job shows stability, in this case it’s more likely that those borrowers are stuck in their jobs because of their high debt. Age, education and other factors don’t count as long as you carry that much debt. n=23 in the box means there are 23 such borrowers in our training data set, and 82.6% of this sub-population defaulted. It’s easy to see that the borrowers least likely to default are those with a low debt-to-income ratio, long current employment (> 10 years) and little credit card debt (leaf node 13). There are 171 such borrowers in the training set and only 2.9% of them defaulted. The percentage can also be read as the likelihood that someone fitting this profile will default. In other words, someone fitting the profile of a safe borrower has a 2.9% chance of defaulting.
Once we have fitted a tree to our training data, we can use it to predict new data. In our test data set, we have 150 loan applicants:
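For completeness, a hypothetical sketch of loading and inspecting the test set (the file name is made up for illustration):

```r
test <- read.csv("loan_applicants.csv")  # hypothetical file holding the 150 applicants
nrow(test)                               # 150
head(test)                               # same columns as the training set, minus default
```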
We want to see the default likelihood of each applicant. To do so, we use the predict function:
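With an rpart classification tree, predict can return the class probabilities for each applicant; a sketch reusing the assumed `fit` and `test` objects:

```r
probs <- predict(fit, newdata = test, type = "prob")
head(probs)   # one column per class; the defaulting class column is each applicant's risk
probs[1, ]    # first applicant falls in the safe leaf, about a 2.9% default risk
```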
You can verify, by following the tree, that the first row in the test data set is a safe borrower. Hence, our prediction model scores this applicant with a 2.9% risk of defaulting.
The trick to using a decision tree is to realise that some factors are correlated. Oftentimes, a bit of domain knowledge will help spot and verify such correlations, and the redundant factors can then be removed from the model.