Mini Project: Shallow Learning
Instructions:
You are a Data Scientist on a Public Policy team. Your team needs a prediction model to determine whether a person, based on their demographic data, will earn $50,000 or more. This prediction will help the team make policy decisions about providing financial assistance to the low-income group. You are given a sample of population data along with annual income, which you can use to train your machine learning model.
You can build your model on your own hardware / PC / laptop and upload the predictions in the format shown below.
You are free to use R, Python, or any other programming language of your preference to explore the data and build the model.
Instructions for the case study are provided below.
- Build a machine learning model capable of predicting whether an individual's income is greater than $50K.
- The prediction must be done based on various data attributes provided below.
- Use 'TrainData' file provided below for building the model.
- Use 'TestData' file provided below for testing your predictions.
Data Attributes description
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holland-Netherlands.
- income_>50K: binary (Target that needs to be predicted)
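Once the data is loaded, the documented category sets above can be checked against what actually appears in each column. A minimal sketch (using a small synthetic frame in place of the real file, and only a subset of one attribute's categories) that flags undocumented values, such as the "?" placeholder the raw data uses for missing entries:

```python
import pandas as pd

# Documented categories for one attribute (from the description above)
WORKCLASS = {"Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
             "Local-gov", "State-gov", "Without-pay", "Never-worked"}

# Synthetic stand-in for the real training data
sample = pd.DataFrame({"workclass": ["Private", "State-gov", "?", "Private"]})

# Values present in the data but absent from the documented set
unexpected = set(sample["workclass"].unique()) - WORKCLASS
print(unexpected)  # the "?" marker surfaces as an undocumented value
```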
Task 1: Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Task 2: Load the Dataset
train = "https://hrcdn.net/s3_pub/istreet-assets/LgLPfzg0V-7G1vBzJsBxdA/train.csv"
test = "https://hrcdn.net/s3_pub/istreet-assets/PV13ExA_QndhFEhxoaHG_A/test.csv"
df = pd.read_csv(train)
test_df = pd.read_csv(test)
Task 3: Explore the Dataset
Check the following:
- Number of rows & columns
- Data types (categorical vs numerical)
- Missing values
- Target Variable Count
print(df.shape)
df.info()
"""
RangeIndex: 43957 entries, 0 to 43956
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 43957 non-null int64
1 workclass 41459 non-null object
2 fnlwgt 43957 non-null int64
3 education 43957 non-null object
4 educational-num 43957 non-null int64
5 marital-status 43957 non-null object
6 occupation 41451 non-null object
7 relationship 43957 non-null object
8 race 43957 non-null object
9 gender 43957 non-null object
10 capital-gain 43957 non-null int64
11 capital-loss 43957 non-null int64
12 hours-per-week 43957 non-null int64
13 native-country 43194 non-null object
14 income_>50K 43957 non-null int64
dtypes: int64(7), object(8)
memory usage: 5.0+ MB
"""
df["income_>50K"].value_counts()
"""
income_>50K
0 33439
1 10518
Name: count, dtype: int64
"""
Task 4: Handle Missing Values
df.isnull().any().any()
"""
np.True_
"""
df.replace("?", np.nan, inplace=True) #Some datasets contain non-standard null characters that Pandas does not recognize by default.
df.replace("?", np.nan, inplace=True)
test_df.replace("?", np.nan, inplace=True)
df.dropna(inplace=True)
test_df.dropna(inplace=True)
df.isnull().any().any()
"""
np.False_
"""
Task 5: Separate Features and Target from Dataset
X = df.drop("income_>50K", axis=1 ) # Feature
y = df["income_>50K"] # Target
X.head(5)
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | Private | 366425 | Doctorate | 16 | Divorced | Exec-managerial | Not-in-family | White | Male | 99999 | 0 | 60 | United-States |
| 1 | 17 | Private | 244602 | 12th | 8 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 15 | United-States |
| 2 | 31 | Private | 174201 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 58 | State-gov | 110199 | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 4 | 25 | State-gov | 149248 | Some-college | 10 | Never-married | Other-service | Not-in-family | Black | Male | 0 | 0 | 40 | United-States |
Task 6: Encode Categorical Variables
cat_col_df = X.select_dtypes(include="object")
# Fit each encoder on the combined train + test values so both datasets share
# the same category-to-integer mapping; fitting separate encoders on train and
# test can assign different codes to the same category.
for col in cat_col_df:
    le = LabelEncoder()
    le.fit(pd.concat([X[col], test_df[col]]))
    X[col] = le.transform(X[col])
    test_df[col] = le.transform(test_df[col])
# le.inverse_transform(codes) fetches the original string values back
X.head(5)
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 2 | 366425 | 10 | 16 | 0 | 3 | 1 | 4 | 1 | 99999 | 0 | 60 | 38 |
| 1 | 17 | 2 | 244602 | 2 | 8 | 4 | 7 | 3 | 4 | 1 | 0 | 0 | 15 | 38 |
| 2 | 31 | 2 | 174201 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
| 3 | 58 | 5 | 110199 | 5 | 4 | 2 | 13 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
| 4 | 25 | 5 | 149248 | 15 | 10 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 40 | 38 |
# (Optional) You can check which numeric code was assigned to each string value:
# le.fit(cat_col_df["native-country"])
# print(list(enumerate(le.classes_)))  # [(0, 'Cambodia'), (1, 'Canada'), (2, 'China'), ...]
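A caveat on `LabelEncoder`: it imposes an arbitrary numeric ordering on nominal categories, which linear models can misread as magnitude. A common alternative is one-hot encoding with `pd.get_dummies`, aligning test columns to the training columns so both matrices share a schema. A sketch on small synthetic frames:

```python
import pandas as pd

train_cat = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test_cat = pd.DataFrame({"workclass": ["Private", "Local-gov"]})  # category unseen in train

train_enc = pd.get_dummies(train_cat)
test_enc = pd.get_dummies(test_cat)

# Align test columns to the training columns; categories unseen in training
# are dropped, and missing ones are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc.shape)
```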
Task 7: Feature Scaling (Important)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Task 8: Train-Validation Split
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Task 9: Build the Model (Logistic Regression)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Task 10: Model Evaluation
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))
"""
Accuracy: 0.8169653817824699
precision recall f1-score support
0 0.84 0.94 0.89 6135
1 0.71 0.44 0.54 2011
accuracy 0.82 8146
macro avg 0.77 0.69 0.71 8146
weighted avg 0.80 0.82 0.80 8146
"""
(Optional) Confusion matrix:
sns.heatmap(confusion_matrix(y_val, y_pred), annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Task 11: Train on Full Data
model.fit(X_scaled, y)
Task 12: Prepare Test Data
test_scaled = scaler.transform(test_df)
test_scaled
Task 13: Predict on Test Data
test_predictions = model.predict(test_scaled)
test_df["income_50K"] = test_predictions
test_df
Task 14: Save Predictions
test_df[["income_50K"]].to_csv("submission.csv", index=True)