# Finding Donors for Charity

This project tests several supervised learning algorithms to model individuals' income using data collected from the 1994 U.S. Census. We then choose the best candidate algorithm from preliminary results and optimize it further to best model the data. The goal is to construct a model that accurately predicts whether an individual makes more than $50,000.

This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit decide how large a donation to request, or whether to reach out at all. While it can be difficult to determine an individual's general income bracket directly from public sources, we can infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository. It was donated by Ron Kohavi and Barry Becker after being published in the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". You are welcome to read the article by Ron Kohavi online. The data we investigate here contains small changes to the original dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.

## Exploring the Data

We begin by loading the data and exploring it. Note that the last column of this dataset, 'income', is our target label (whether an individual makes more than, or at most, $50,000 annually). All other columns are features describing each individual in the census database.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
import warnings
warnings.filterwarnings('ignore')  # Suppress warning messages in this notebook

# Import supplementary visualization code visuals.py
import visuals as vs
import seaborn as sns
import matplotlib.pyplot as plt

# Pretty display for notebooks
%matplotlib inline

# Success - Display the first record

| | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |

An introductory investigation of the dataset should determine how many individuals fit into either group and what percentage of them make more than \$50,000. This requires computing:

- The total number of records, 'n_records'.
- The number of individuals making more than \$50,000 annually, 'n_greater_50k'.
- The number of individuals making at most \$50,000 annually, 'n_at_most_50k'.
- The percentage of individuals making more than \$50,000 annually, 'greater_percent'.

In [2]:
# Data types of each feature
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
age                45222 non-null int64
workclass          45222 non-null object
education_level    45222 non-null object
education-num      45222 non-null float64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
sex                45222 non-null object
capital-gain       45222 non-null float64
capital-loss       45222 non-null float64
hours-per-week     45222 non-null float64
native-country     45222 non-null object
income             45222 non-null object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB

In [3]:
# TODO: Total number of records
n_records = len(data)

# TODO: Number of records where individual's income is more than $50,000
n_greater_50k = len(data[data['income'] == '>50K'])

# TODO: Number of records where individual's income is at most $50,000
n_at_most_50k = len(data[data['income'] == '<=50K'])

# TODO: Percentage of individuals whose income is more than $50,000
greater_percent = 100 * n_greater_50k / n_records

# Print the results
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {}%".format(greater_percent))

Total number of records: 45222
Individuals making more than $50,000: 11208
Individuals making at most $50,000: 34014

### Feature Relevance Observation

There are thirteen available features for each individual on record in the census data. Of these thirteen features, we can guess which five might be most important for prediction. Below is my own attempt.

In my opinion, these are most important for prediction:

1. occupation: Different jobs have different pay scales; some pay considerably more than others.
2. education: Those with higher levels of education may earn more due to higher levels of training and/or specialization.
3. age: With age, more wealth can be acquired.
4. sex: Unfortunately, historically men have earned more than women.
5. workclass: The work class an individual belongs to can also be correlated with how much money they make.

These are ranked according to the impact I personally feel they have on a person's income. Occupation is ranked first because different jobs pay differently, followed by education, since people with higher education are more likely to earn more.
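One way to sanity-check guesses like these before training any model is to compare the share of '>50K' records across the categories of a single feature. The sketch below is only an illustration: the `toy` frame and the helper `share_above_50k` are hypothetical and not part of the original notebook, and the values are made up rather than taken from the census data.

```python
import pandas as pd

# Hypothetical toy records; the column names mirror the census data,
# but the values are invented purely for illustration.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})


def share_above_50k(frame, feature):
    """Return the fraction of '>50K' records within each category of `feature`."""
    is_high = frame["income"].eq(">50K")  # boolean Series, True for '>50K'
    # The mean of a boolean Series is the fraction of True values per group.
    return is_high.groupby(frame[feature]).mean().sort_values(ascending=False)


rates = share_above_50k(toy, "sex")
print(rates)
```

A large spread between categories suggests the feature carries signal on its own, although it says nothing about interactions with other features.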

### Extracting Feature Importance

In [17]:
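# TODO: Import a supervised learning model that has 'feature_importances_'
from sklearn.ensemble import AdaBoostClassifier

# TODO: Train the supervised model on the training set using .fit(X_train, y_train)
# (AdaBoost is the model named in the discussion below; default hyperparameters are an assumption)
model = AdaBoostClassifier().fit(X_train, y_train)

# TODO: Extract the feature importances using .feature_importances_
importances = model.feature_importances_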

# Plot
vs.feature_plot(importances, X_train, y_train)


### Extracting Feature Importance

Of the five features predicted in the earlier section, only two (age and education-num) appear in the list of features that AdaBoost considers most important, and their rankings differ from the ones I chose.

I overlooked three other important features: capital-gain, capital-loss, and hours-per-week. I did not fully understand the capital variables, having little experience with them, and I forgot that not everyone works full-time. Once capital-gain and capital-loss are understood as the profit or loss from the sale of assets or property, it makes sense that they are important. Those who have earned profits from the sale of assets are likely to earn more (and potentially be in a higher income bracket, depending on the type of asset), while those who incurred losses are likely to have had lower income. Similarly, those who work full-time will likely earn more overall than those who work part-time.

### Feature Selection

How does a model perform if we only use a subset of all the available features in the data? With fewer features to train on, we expect training and prediction times to be much lower, at the cost of performance metrics. From the visualization above, we see that the top five most important features contribute more than half of the importance of all features present in the data. This hints that we can attempt to reduce the feature space and simplify the information required for the model to learn.
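The "more than half of the importance" claim can be checked numerically by sorting the importances and summing the top few. The following is a minimal sketch under stated assumptions: the helper `top_k_importance`, the toy importance values, and the feature names `f0`..`f6` are all hypothetical and stand in for the notebook's `importances` array and `X_train.columns`.

```python
import numpy as np


def top_k_importance(importances, feature_names, k=5):
    """Return the names of the k most important features and their share of total importance."""
    importances = np.asarray(importances, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(importances)[::-1][:k]
    names = [feature_names[i] for i in order]
    share = importances[order].sum() / importances.sum()
    return names, share


# Made-up importances for seven hypothetical features (not the notebook's values).
toy_importances = [0.05, 0.30, 0.10, 0.25, 0.02, 0.08, 0.20]
toy_names = ["f0", "f1", "f2", "f3", "f4", "f5", "f6"]

names, share = top_k_importance(toy_importances, toy_names, k=3)
print(names, round(share, 2))
```

In the notebook, the same ranking idiom (`np.argsort(importances)[::-1][:5]`) is what selects the columns of the reduced training set.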

In [18]:
# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train on the "best" model found from grid search earlier
clf = (clone(best_clf)).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)

# Report scores from the final model using both versions of data
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta = 0.5)))

Final Model trained on full data
------
Accuracy on testing data: 0.8690
F-score on testing data: 0.7489

Final Model trained on reduced data
------
Accuracy on testing data: 0.8428
F-score on testing data: 0.7008
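The F-scores above are computed with `beta = 0.5`, which weights precision more heavily than recall. For reference, the standard definition is F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below implements that formula in plain Python on made-up labels; the helpers `precision_recall` and `f_beta` are hypothetical illustrations, not the scikit-learn implementation the notebook calls.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the `positive` class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Made-up labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p, r = precision_recall(y_true, y_pred)
print(p, r, f_beta(p, r, beta=0.5), f_beta(p, r, beta=1.0))
```

Because precision (2/3) exceeds recall (1/2) in this toy case, the score with `beta = 0.5` comes out higher than the balanced score with `beta = 1.0`.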


### Effects of Feature Selection

On the reduced dataset, the final model's accuracy and F-score remain close to those obtained on the full dataset, though both are lower.

Accuracy is 2.62 percentage points lower (0.8428 versus 0.8690), and the F-score is 4.81 percentage points lower (0.7008 versus 0.7489). Even though AdaBoost is relatively faster than one of the other classifiers selected, it would still be worth training on the reduced data if training time were a factor and there were many more training points to process. The final decision also depends on how important accuracy and F-score are, and on whether the F-score matters more than accuracy, since its drop is the larger of the two.
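The size of each drop depends on whether it is measured in absolute terms (percentage points) or relative to the full-data score. The short sketch below does both, using the four scores reported in the output above; the helper `score_drop` is a hypothetical name introduced only for this illustration.

```python
def score_drop(full, reduced):
    """Return the absolute drop and the drop relative to the full-data score."""
    absolute = full - reduced
    relative = absolute / full
    return absolute, relative


# Scores copied from the reported results: (full data, reduced data).
scores = {
    "accuracy": (0.8690, 0.8428),
    "f-score": (0.7489, 0.7008),
}

for name, (full, reduced) in scores.items():
    abs_drop, rel_drop = score_drop(full, reduced)
    print("{}: {:.2f} points absolute, {:.2f}% relative".format(
        name, 100 * abs_drop, 100 * rel_drop))
```

Either way, the F-score falls further than accuracy, which matters if precision on the '>50K' class is the priority.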