Structured Data Classification Fresco Play Handson Solution HackerRank

Build and evaluate a classification model on structured weather data using pandas preprocessing, one-hot encoding, missing-value imputation, standardization, a train-test split, and SVM and Random Forest classifiers in Python.
Run the cell below to import the packages
import pandas as pd
import numpy as np
**Data Loading** - Fill in the command to load your CSV dataset **weather.csv** with pandas
weather = pd.read_csv('weather.csv', sep=',')
**Data Analysis**
- Get the shape of the dataset and print it.
- Get the column names as a list and print it.
- Describe the dataset to understand its basic statistics.
- Print the first three rows of the dataset.
data_size= weather.shape

print(data_size)

weather_col_names = list(weather.columns)

print(weather_col_names)

print(weather.describe())

print(weather.head(3))
**Target Identification** - Assign the target column **RainTomorrow** to the variable **weather_target** and print it.
weather_target=weather['RainTomorrow'] 

print(weather_target)
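Before building the features, it can also help to see how the Yes/No labels are distributed; a minimal optional check using value_counts:

# Optional: inspect the class balance of the target variable
print(weather_target.value_counts())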
**Feature Identification**
By analyzing the dataset, we can see that a column like **Date** is irrelevant, as it carries no information about whether it will rain tomorrow. Since **RainTomorrow** is our target variable, we also remove it from the feature set.
- Perform the appropriate operation to drop the columns **Date** and **RainTomorrow**.
cols_to_drop = ['Date','RainTomorrow']

weather_feature =  weather.drop(columns=['Date','RainTomorrow'] ) 

print(weather_feature.head(5))
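As a quick sanity check, you can confirm that neither dropped column remains in the feature set:

# Optional: verify the dropped columns are gone from weather_feature
print([col for col in cols_to_drop if col in weather_feature.columns])  # expected: []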
**Categorical Data** - To identify the categorical variables in the data, use the following command in the cell below.
weather_categorical = weather.select_dtypes(include=[object])
print(weather_categorical.head(15))
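To gauge how many dummy columns one-hot encoding will later create, you can optionally count the distinct values in each categorical column:

# Optional: number of distinct values per categorical column
print(weather_categorical.nunique())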
**Convert to Boolean** - Assign the column **RainToday** to the variable **yes_no_cols** and run the cell below to print the first 5 rows of **weather_feature**.
yes_no_cols = ["RainToday"]

weather_feature[yes_no_cols] = weather_feature[yes_no_cols] == 'Yes'

print(weather_feature.head(5))
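An optional check to confirm the conversion worked: the column should now hold boolean values rather than 'Yes'/'No' strings.

# Optional: RainToday should now be of boolean dtype
print(weather_feature['RainToday'].dtype)
print(weather_feature['RainToday'].value_counts())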
**One Hot Encoding** - Execute the cells below to perform **One Hot Encoding**.
weather_dumm=pd.get_dummies(weather_feature, columns=["Location","WindGustDir","WindDir9am","WindDir3pm"], prefix=["Location","WindGustDir","WindDir9am","WindDir3pm"])

weather_matrix = weather_dumm.values.astype(float)  # np.float was removed in recent NumPy versions; use the builtin float
print(weather_matrix)
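One-hot encoding expands each categorical column into one indicator column per category; comparing shapes before and after makes the effect visible (an optional check):

# Optional: compare column counts before and after one-hot encoding
print(weather_feature.shape, '->', weather_dumm.shape)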
**Imputing Missing Values**
Impute the missing values using the following parameters:
- missing_values=np.nan
- strategy='mean'
- fill_value=None
- verbose=0
- copy=True
from sklearn.impute import SimpleImputer

imp=SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, verbose=0, copy=True)

weather_matrix=imp.fit_transform(weather_matrix)
print(weather_matrix)
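If you want to confirm the imputation, the matrix should contain no NaN values afterwards; a minimal optional check:

# Optional: count remaining missing values (should be 0 after imputation)
print(np.isnan(weather_matrix).sum())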
**Standardization** - Run the cell below to perform standardization.
from sklearn.preprocessing import StandardScaler

#Standardize the data by removing the mean and scaling to unit variance

scaler = StandardScaler()

#Fit to data, then transform it.

weather_matrix = scaler.fit_transform(weather_matrix)
print(weather_matrix)
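After standardization, every non-constant column should have a mean close to 0 and a standard deviation close to 1; an optional way to verify this:

# Optional: column means should be ~0 and standard deviations ~1
print(weather_matrix.mean(axis=0).round(3))
print(weather_matrix.std(axis=0).round(3))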
**Train and Test Data**
Split the data for training and testing (90% train, 10% test).
- Perform a train-test split on **weather_matrix** and **weather_target** with 90% as train data and 10% as test data, and set random_state to seed.
from sklearn.model_selection import train_test_split

seed=5000

train_data,test_data, train_label, test_label = train_test_split(weather_matrix, weather_target, test_size=.1,random_state=seed)
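An optional shape check confirms the 90/10 split:

# Optional: confirm the sizes of the train and test partitions
print(train_data.shape, test_data.shape)
print(train_label.shape, test_label.shape)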
**SVM Classification**
- Initialize an **SVM** classifier with the following parameters:
  - kernel='linear'
  - C=0.025
  - random_state=seed
- Train the model with train_data and train_label.
- Now predict the output with test_data.
- Evaluate the classifier with the score from test_data and test_label.
- Print the predicted score.
from sklearn.svm import SVC

# Initialize SVM classifier with given parameters
classifier = SVC(kernel='linear', C=0.025, random_state=seed)

# Train the model
classifier = classifier.fit(train_data, train_label)

# Predict output for test data
churn_predicted_target = classifier.predict(test_data)

# Evaluate the classifier
score = classifier.score(test_data, test_label)

# Print the predicted score
print('SVM Classifier : ', score)

# Write the score to output.txt
with open('output.txt', 'w') as file:
    file.write(str(score))
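If the RainTomorrow classes are imbalanced, a single accuracy score can hide weak performance on the minority class; an optional sketch using classification_report from sklearn.metrics gives a per-class breakdown:

from sklearn.metrics import classification_report

# Optional: per-class precision, recall and F1 for the SVM predictions
print(classification_report(test_label, churn_predicted_target))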

**Random Forest Classifier**
- Perform **Random Forest** classification of the dataset using the following parameters:
  - max_depth=5
  - n_estimators=10
  - max_features=10
  - random_state=seed
- Train the model with train_data and train_label.
- Now predict the output with test_data.
- Evaluate the classifier with the score from test_data and test_label.
from sklearn.ensemble import RandomForestClassifier
# Initialize Random Forest Classifier with given parameters
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10, random_state=seed)

# Train the model
classifier = classifier.fit(train_data, train_label)

# Predict output for test data
churn_predicted_target = classifier.predict(test_data)

# Evaluate the classifier
score = classifier.score(test_data, test_label)

# Print the predicted score
print('Random Forest Classifier : ', score)

# Write the score to output1.txt
with open('output1.txt', 'w') as file:
    file.write(str(score))

with open('out.txt', 'w') as file:
    file.write(str(int(score*100)))
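To see which encoded features the Random Forest relies on most, the fitted model exposes feature_importances_; an optional sketch pairing them with the column names from weather_dumm:

# Optional: top 10 most important features according to the Random Forest
importances = pd.Series(classifier.feature_importances_, index=weather_dumm.columns)
print(importances.sort_values(ascending=False).head(10))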

About the author

D Shwari
I'm a professor in the Department of Computer Science at National University. My main areas are data science and data analysis, along with project management for several computer science-related sectors. My next project is on AI with deep learning.
