Turing Machine Data Scientist Program: Use Case 5 Fresco Play Hands-on Solution

Mini Project Shallow Learning

Instructions:

You are a Data Scientist on a Public Policy team. The team needs a prediction model that determines, from a person's demographic data, whether he or she will earn $50,000 or more. This prediction will help the team make policy decisions about providing financial assistance to the low-income group. You are given a sample of the population along with their annual incomes, which you can use to train your machine learning model.

You can build the model on your own hardware (PC or laptop) and upload only the predictions in the format shown below.

You are free to use R, Python, or any other programming language of your preference to explore the data and build the model.

Instructions for the case study are provided below.

  • Build a Machine Learning Model, which is capable of predicting if an individual's income is greater than 50k or not.
  • The prediction must be done based on various data attributes provided below.
  • Use 'TrainData' file provided below for building the model.
  • Use 'TestData' file provided below for testing your predictions.

Data Attributes description

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holland-Netherlands.
  • income_>50K: binary (Target that needs to be predicted)

Task 1: Import Required Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Task 2: Load the Dataset

train = "https://hrcdn.net/s3_pub/istreet-assets/LgLPfzg0V-7G1vBzJsBxdA/train.csv"
test = "https://hrcdn.net/s3_pub/istreet-assets/PV13ExA_QndhFEhxoaHG_A/test.csv"
df = pd.read_csv(train)
test_df = pd.read_csv(test)

Task 3: Explore the Dataset

Check the following:

  • Number of rows & columns
  • Data types (categorical vs numerical)
  • Missing values
  • Target Variable Count
print(df.shape)
df.info()
"""

RangeIndex: 43957 entries, 0 to 43956
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              43957 non-null  int64
 1   workclass        41459 non-null  object
 2   fnlwgt           43957 non-null  int64
 3   education        43957 non-null  object
 4   educational-num  43957 non-null  int64
 5   marital-status   43957 non-null  object
 6   occupation       41451 non-null  object
 7   relationship     43957 non-null  object
 8   race             43957 non-null  object
 9   gender           43957 non-null  object
 10  capital-gain     43957 non-null  int64
 11  capital-loss     43957 non-null  int64
 12  hours-per-week   43957 non-null  int64
 13  native-country   43194 non-null  object
 14  income_>50K      43957 non-null  int64
dtypes: int64(7), object(8)
memory usage: 5.0+ MB
"""
df["income_>50K"].value_counts()
"""
income_>50K
0    33439
1    10518
Name: count, dtype: int64
"""

Task 4: Handle Missing Values

df.isnull().any().any()
"""
np.True_
"""
df.replace("?", np.nan, inplace=True)       # some datasets use "?" as a non-standard null marker that pandas does not recognize
test_df.replace("?", np.nan, inplace=True)
df.dropna(inplace=True)
test_df.dropna(inplace=True)
df.isnull().any().any()
"""
np.False_
"""

Task 5: Separate Features and Target from Dataset


X = df.drop("income_>50K", axis=1)  # Features
y = df["income_>50K"]               # Target
X.head(5)
"""
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
0 67 Private 366425 Doctorate 16 Divorced Exec-managerial Not-in-family White Male 99999 0 60 United-States
1 17 Private 244602 12th 8 Never-married Other-service Own-child White Male 0 0 15 United-States
2 31 Private 174201 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States
3 58 State-gov 110199 7th-8th 4 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States
4 25 State-gov 149248 Some-college 10 Never-married Other-service Not-in-family Black Male 0 0 40 United-States
"""

Task 6: Encode Categorical Variables


cat_col_df = X.select_dtypes(include="object")

# Fit one encoder per column on the combined train and test values, so the
# same category maps to the same code in both frames. (Fitting a fresh
# encoder on test_df separately would scramble the mapping between them.)
# le.inverse_transform(codes) recovers the original string values.
for col in cat_col_df:
    le = LabelEncoder()
    le.fit(pd.concat([X[col], test_df[col]], axis=0))
    X[col] = le.transform(X[col])
    test_df[col] = le.transform(test_df[col])
X.head(5)
"""
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
0 67 2 366425 10 16 0 3 1 4 1 99999 0 60 38
1 17 2 244602 2 8 4 7 3 4 1 0 0 15 38
2 31 2 174201 9 13 2 3 0 4 1 0 0 40 38
3 58 5 110199 5 4 2 13 0 4 1 0 0 40 38
4 25 5 149248 15 10 4 7 1 2 1 0 0 40 38
"""

# (Optional) Check which numeric code was assigned to each string value:
# le.fit(cat_col_df["native-country"])
# print(list(enumerate(le.classes_)))  # [(0, 'Cambodia'), (1, 'Canada'), (2, 'China'), ...]
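One caveat with LabelEncoder: it imposes an arbitrary numeric order on categories, which a linear model can misread as magnitude. A common alternative is one-hot encoding via `pd.get_dummies`, sketched here on a toy frame (not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "age": [25, 58, 31],
})

# Each category becomes its own 0/1 column, so no false ordering is implied.
encoded = pd.get_dummies(toy, columns=["workclass"])
print(sorted(encoded.columns))  # ['age', 'workclass_Private', 'workclass_State-gov']
```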

Task 7: Feature Scaling (Important)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
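Under the hood, StandardScaler computes z = (x − column mean) / column std for every feature. The same arithmetic can be verified with plain NumPy on a small array:

```python
import numpy as np

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# StandardScaler's transform: subtract each column's mean, divide by its std.
scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # [1. 1.]
```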

Task 8: Train-Test Split

X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
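With an imbalanced target like this one, passing `stratify=y` keeps the class ratio identical in the train and validation splits; this is an optional refinement, not part of the original solution. A self-contained demo with synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80/20 imbalance, similar in spirit to the census target.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

_, _, _, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(np.bincount(y_val_demo))  # [16  4] -- the 80/20 ratio is preserved
```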

Task 9: Build the Model (Logistic Regression)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Task 10: Model Evaluation

y_pred = model.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

"""
Accuracy: 0.8169653817824699

              precision    recall  f1-score   support

           0       0.84      0.94      0.89      6135
           1       0.71      0.44      0.54      2011

    accuracy                           0.82      8146
   macro avg       0.77      0.69      0.71      8146
weighted avg       0.80      0.82      0.80      8146
"""

(Optional) Confusion matrix:


sns.heatmap(confusion_matrix(y_val, y_pred), annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
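The four cells of the heatmap can also be unpacked numerically: for binary labels, sklearn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]`. A toy check:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```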

Task 11: Train on Full Data

model.fit(X_scaled, y)

Task 12: Prepare Test Data

test_scaled = scaler.transform(test_df)
test_scaled

Task 13: Predict on Test Data

test_predictions = model.predict(test_scaled)
test_df["income_50K"] = test_predictions
test_df

Save Predictions

test_df[["income_50K"]].to_csv("submission.csv", index=True)

ML Binary Classification Mini-Project

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main areas are data science and data analysis, and I manage projects across many computer-science-related sectors. My next project is on AI with deep learning.
