Mini Project: Shallow Learning
Instructions:
You are a Data Scientist on a Public Policy team. Your team needs a prediction model to determine whether a person, based on their demographic data, will earn $50,000 or more. This prediction will help the team make policy decisions about providing financial assistance to the low-income group. You are given a sample of population data along with annual income, which you can use to train your machine learning model.
You can build your model on your own hardware / PC / laptop and upload the predictions in the format shown below.
You are free to use R, Python, or any other programming language of your preference to explore the data and build the model.
Instructions for the case study are provided below.
- Build a machine learning model capable of predicting whether an individual's income is greater than $50K.
- The prediction must be done based on various data attributes provided below.
- Use 'TrainData' file provided below for building the model.
- Use 'TestData' file provided below for testing your predictions.
Data Attributes description
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holland-Netherlands.
- income_>50K: binary (Target that needs to be predicted)
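Once the data is loaded, the documented category sets above can be checked against what actually appears in each column. A minimal sketch (using a small synthetic frame in place of the real file, and only a subset of one attribute's categories) that flags undocumented values, such as the "?" placeholder the raw data uses for missing entries:

```python
import pandas as pd

# Documented categories for one attribute (from the description above)
WORKCLASS = {"Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
             "Local-gov", "State-gov", "Without-pay", "Never-worked"}

# Synthetic stand-in for the real training data
sample = pd.DataFrame({"workclass": ["Private", "State-gov", "?", "Private"]})

# Values present in the data but absent from the documented set
unexpected = set(sample["workclass"].unique()) - WORKCLASS
print(unexpected)  # the "?" marker surfaces as an undocumented value
```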
Task 1: Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Task 2: Load the Dataset
train = "https://hrcdn.net/s3_pub/istreet-assets/LgLPfzg0V-7G1vBzJsBxdA/train.csv"
test = "https://hrcdn.net/s3_pub/istreet-assets/PV13ExA_QndhFEhxoaHG_A/test.csv"
df = pd.read_csv(train)
test_df = pd.read_csv(test)
Task 3: Explore the Dataset
Check the following:
- Number of rows & columns
- Data types (categorical vs numerical)
- Missing values
- Target Variable Count
print(df.shape)
df.info()
"""
RangeIndex: 43957 entries, 0 to 43956
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 43957 non-null int64
1 workclass 41459 non-null object
2 fnlwgt 43957 non-null int64
3 education 43957 non-null object
4 educational-num 43957 non-null int64
5 marital-status 43957 non-null object
6 occupation 41451 non-null object
7 relationship 43957 non-null object
8 race 43957 non-null object
9 gender 43957 non-null object
10 capital-gain 43957 non-null int64
11 capital-loss 43957 non-null int64
12 hours-per-week 43957 non-null int64
13 native-country 43194 non-null object
14 income_>50K 43957 non-null int64
dtypes: int64(7), object(8)
memory usage: 5.0+ MB
"""
df["income_>50K"].value_counts()
"""
income_>50K
0 33439
1 10518
Name: count, dtype: int64
"""
Task 4: Handle Missing Values
df.isnull().any().any()
"""
np.True_
"""
df.replace("?", np.nan, inplace=True) #Some datasets contain non-standard null characters that Pandas does not recognize by default.
df.replace("?", np.nan, inplace=True)
test_df.replace("?", np.nan, inplace=True)
df.dropna(inplace=True)
test_df.dropna(inplace=True)
df.isnull().any().any()
"""
np.False_
"""
Task 5: Separate Features and Target from Dataset
X = df.drop("income_>50K", axis=1 ) # Feature
y = df["income_>50K"] # Target
X.head(5)
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | Private | 366425 | Doctorate | 16 | Divorced | Exec-managerial | Not-in-family | White | Male | 99999 | 0 | 60 | United-States |
| 1 | 17 | Private | 244602 | 12th | 8 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 15 | United-States |
| 2 | 31 | Private | 174201 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 58 | State-gov | 110199 | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 4 | 25 | State-gov | 149248 | Some-college | 10 | Never-married | Other-service | Not-in-family | Black | Male | 0 | 0 | 40 | United-States |
Task 6: Encode Categorical Variables
cat_col_df = X.select_dtypes(include="object")
# Fit each encoder on the combined train + test values so both datasets share
# the same category-to-integer mapping; fitting separate encoders on train and
# test can assign different codes to the same category.
for col in cat_col_df:
    le = LabelEncoder()
    le.fit(pd.concat([X[col], test_df[col]]))
    X[col] = le.transform(X[col])
    test_df[col] = le.transform(test_df[col])
# le.inverse_transform(codes) fetches the original string values back
X.head(5)
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 2 | 366425 | 10 | 16 | 0 | 3 | 1 | 4 | 1 | 99999 | 0 | 60 | 38 |
| 1 | 17 | 2 | 244602 | 2 | 8 | 4 | 7 | 3 | 4 | 1 | 0 | 0 | 15 | 38 |
| 2 | 31 | 2 | 174201 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
| 3 | 58 | 5 | 110199 | 5 | 4 | 2 | 13 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
| 4 | 25 | 5 | 149248 | 15 | 10 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 40 | 38 |
# (Optional) You can check which numeric code was assigned to each string value:
# le.fit(cat_col_df["native-country"])
# print(list(enumerate(le.classes_)))  # [(0, 'Cambodia'), (1, 'Canada'), (2, 'China'), ...]
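A caveat on `LabelEncoder`: it imposes an arbitrary numeric ordering on nominal categories, which linear models can misread as magnitude. A common alternative is one-hot encoding with `pd.get_dummies`, aligning test columns to the training columns so both matrices share a schema. A sketch on small synthetic frames:

```python
import pandas as pd

train_cat = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test_cat = pd.DataFrame({"workclass": ["Private", "Local-gov"]})  # category unseen in train

train_enc = pd.get_dummies(train_cat)
test_enc = pd.get_dummies(test_cat)

# Align test columns to the training columns; categories unseen in training
# are dropped, and missing ones are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc.shape)
```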
Task 7: Feature Scaling (Important)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Task 8: Train-Validation Split
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Task 9: Build the Model (Logistic Regression)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Task 10: Model Evaluation
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))
"""
Accuracy: 0.8169653817824699
precision recall f1-score support
0 0.84 0.94 0.89 6135
1 0.71 0.44 0.54 2011
accuracy 0.82 8146
macro avg 0.77 0.69 0.71 8146
weighted avg 0.80 0.82 0.80 8146
"""
(Optional) Confusion matrix:
sns.heatmap(confusion_matrix(y_val, y_pred), annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Task 11: Train on Full Data
model.fit(X_scaled, y)
Task 12: Prepare Test Data
test_scaled = scaler.transform(test_df)
test_scaled
Task 13: Predict on Test Data
test_predictions = model.predict(test_scaled)
test_df["income_50K"] = test_predictions
test_df
Task 14: Save Predictions
test_df[["income_50K"]].to_csv("submission.csv", index=True)