Run the cell to import the packages.
import pandas as pd
import numpy as np
**Data Loading**
Fill in the command to load the CSV dataset "weather.csv" with pandas.
weather = pd.read_csv('weather.csv', sep=',')
**Data Analysis**
- Get the shape of the dataset and print it.
- Get the column names as a list and print them.
- Describe the dataset to see its basic statistics.
- Print the first three rows of the dataset.
data_size= weather.shape
print(data_size)
weather_col_names = list(weather.columns)
print(weather_col_names)
print(weather.describe())
print(weather.head(3))
**Target Identification**
weather_target=weather['RainTomorrow']
print(weather_target)
**Feature Identification**
In our case, analyzing the dataset suggests that the **Date** column can be treated as irrelevant: it records when an observation was taken rather than measuring the weather.
Since **RainTomorrow** is our target variable, we also remove it from the feature set.
- Perform the appropriate operation to drop the columns **Date** and **RainTomorrow**.
cols_to_drop = ['Date', 'RainTomorrow']
# drop() returns a new DataFrame; the original weather frame is left unchanged
weather_feature = weather.drop(columns=cols_to_drop)
print(weather_feature.head(5))
**Categorical Data**
To identify the categorical variables in the data, run the following command in the cell below.
weather_categorical = weather.select_dtypes(include=[object])
print(weather_categorical.head(15))
**Convert to boolean**
Assign the column **RainToday** to the variable **yes_no_cols** and run the cell below to print the first 5 rows of **weather_feature**.
yes_no_cols = ["RainToday"]
weather_feature[yes_no_cols] = weather_feature[yes_no_cols] == 'Yes'
print(weather_feature.head(5))
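The comparison used above treats every value other than the string 'Yes' as False, and that includes missing entries. The sketch below illustrates this on a small made-up frame; `toy`, `toy_bool`, and `n_missing` are hypothetical names and the values are not taken from weather.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the RainToday column.
toy = pd.DataFrame({"RainToday": ["Yes", "No", np.nan, "Yes"]})

# Element-wise comparison: 'Yes' becomes True, anything else (NaN included) becomes False.
toy_bool = toy[["RainToday"]] == "Yes"
print(toy_bool["RainToday"].tolist())

# Number of missing entries that were mapped to False by the comparison.
n_missing = int(toy["RainToday"].isna().sum())
print(n_missing)
```

If missing values in **RainToday** should stay missing for the later imputation step, they would need to be handled before this conversion.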
**One Hot Encoding**
Execute the cells below to perform **One Hot Encoding**.
weather_dumm=pd.get_dummies(weather_feature, columns=["Location","WindGustDir","WindDir9am","WindDir3pm"], prefix=["Location","WindGustDir","WindDir9am","WindDir3pm"])
# np.float was removed in NumPy 1.24; the built-in float is the equivalent dtype
weather_matrix = weather_dumm.values.astype(float)
print(weather_matrix)
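To make the effect of the encoding step concrete, here is a minimal sketch on a made-up two-column frame. The names `toy`, `toy_dumm`, and `toy_matrix` and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "WindGustDir": ["N", "S", "N"],
    "MinTemp": [12.0, 8.5, 10.1],
})

# Each category becomes its own indicator column named <prefix>_<category>.
toy_dumm = pd.get_dummies(toy, columns=["WindGustDir"], prefix=["WindGustDir"])
print(list(toy_dumm.columns))

# Cast the mixed indicator/numeric frame to a plain float matrix.
toy_matrix = toy_dumm.values.astype(float)
print(toy_matrix)
```

The untouched numeric column comes first, followed by one indicator column per category.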
**Imputing-Missing Values**
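Impute the missing values using the following parameters:
- missing_values=np.nan
- strategy='mean'
- fill_value=None
- verbose=0 (only on scikit-learn versions before 1.3, where this argument still exists)
- copy=True
from sklearn.impute import SimpleImputer
# verbose is omitted here because it was removed from SimpleImputer in scikit-learn 1.3
imp=SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, copy=True)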
weather_matrix=imp.fit_transform(weather_matrix)
print(weather_matrix)
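As a small illustration of mean imputation, the sketch below fills the gaps in a made-up 3x2 matrix. `toy_matrix` and `filled` are hypothetical names, and the numbers are chosen only so the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical matrix with one missing value in each column.
toy_matrix = np.array([[1.0, 10.0],
                       [np.nan, 20.0],
                       [3.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(toy_matrix)
print(filled)
```

The first column's gap is filled with the mean of 1 and 3, and the second column's gap with the mean of 10 and 20.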
**Standardization**
Run the cell below to perform standardization.
from sklearn.preprocessing import StandardScaler
#Standardize the data by removing the mean and scaling to unit variance
scaler = StandardScaler()
#Fit to data, then transform it.
weather_matrix = scaler.fit_transform(weather_matrix)
print(weather_matrix)
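The following sketch checks, on a made-up matrix, that standardization amounts to subtracting each column's mean and dividing by its standard deviation. The names `toy_matrix`, `scaled`, and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix whose columns live on very different scales.
toy_matrix = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_matrix)

# The same result computed by hand: (x - column mean) / column standard deviation.
manual = (toy_matrix - toy_matrix.mean(axis=0)) / toy_matrix.std(axis=0)

print(scaled.mean(axis=0))  # column means after scaling
print(scaled.std(axis=0))   # column standard deviations after scaling
```

After scaling, both columns have mean 0 and standard deviation 1 (up to floating-point error), regardless of their original units.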
**Train and Test Data**
Split the data for training and testing (90% train, 10% test).
- Perform a train-test split on **weather_matrix** and **weather_target** with 90% as train data and 10% as test data, and set random_state to seed.
from sklearn.model_selection import train_test_split
seed=5000
train_data,test_data, train_label, test_label = train_test_split(weather_matrix, weather_target, test_size=.1,random_state=seed)
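A minimal sketch of what the split produces, using a made-up dataset of 20 rows. `X`, `y`, and the other names below are hypothetical; only `test_size` and `random_state` mirror the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, string labels like the real target.
X = np.arange(40).reshape(20, 2)
y = np.array(["Yes", "No"] * 10)
seed = 5000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=seed
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.1, random_state=seed)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

With 20 rows, a 10% test size yields 2 test rows and 18 training rows, and fixing `random_state` makes the partition repeatable across runs.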
**SVM Classification**
- Initialize an **SVM** classifier with the following parameters:
  - kernel = linear
  - C = 0.025
  - random_state = seed
- Train the model with train_data and train_label.
- Now predict the output with test_data.
- Evaluate the classifier with the score from test_data and test_label.
- Print the predicted score.
from sklearn.svm import SVC
import numpy as np
# Initialize SVM classifier with given parameters
classifier = SVC(kernel='linear', C=0.025, random_state=seed)
# Train the model
classifier = classifier.fit(train_data, train_label)
# Predict output for test data
weather_predicted_target = classifier.predict(test_data)
# Evaluate the classifier
score = classifier.score(test_data, test_label)
# Print the predicted score
print('SVM Classifier : ', score)
# Write the score to output.txt
with open('output.txt', 'w') as file:
    file.write(str(score))
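For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the labels. The sketch below checks this on made-up, well-separated clusters. The data, the names `X`, `y`, `clf`, and `manual_accuracy`, and the choice to score on the training points are assumptions made purely to show the equivalence, not an evaluation method.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two well-separated 2-D clusters with string labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(20, 2)),
               rng.normal(3.0, 0.5, size=(20, 2))])
y = np.array(["No"] * 20 + ["Yes"] * 20)

clf = SVC(kernel="linear", C=0.025, random_state=5000)
clf.fit(X, y)

# Accuracy computed by hand from the predictions.
predicted = clf.predict(X)
manual_accuracy = float(np.mean(predicted == y))

# The built-in score computes the same quantity.
score = clf.score(X, y)
print(manual_accuracy, score)
```

This is why the prediction step and the scoring step above are independent: `score` runs its own prediction internally.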
**Random Forest Classifier**
- Initialize a **Random Forest** classifier with the following parameters:
  - max_depth = 5
  - n_estimators = 10
  - max_features = 10
  - random_state = seed
- Train the model with train_data and train_label.
- Now predict the output with test_data.
- Evaluate the classifier with the score from test_data and test_label.
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Initialize Random Forest Classifier with given parameters
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10, random_state=seed)
# Train the model
classifier = classifier.fit(train_data, train_label)
# Predict output for test data
weather_predicted_target = classifier.predict(test_data)
# Evaluate the classifier
score = classifier.score(test_data, test_label)
# Print the predicted score
print('Random Forest Classifier : ', score)
# Write the score to output1.txt
with open('output1.txt', 'w') as file:
    file.write(str(score))
# Write the score as a whole-number percentage (truncated, not rounded) to out.txt
with open('out.txt', 'w') as file:
    file.write(str(int(score*100)))
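Note that `int(score*100)` truncates toward zero instead of rounding. The sketch below shows the difference for a made-up score; the value 0.8567 and the variable names are hypothetical and are not a result from this notebook.

```python
# Hypothetical accuracy value, used only to illustrate the conversions.
score = 0.8567

as_text = str(score)                 # what is written to output1.txt
as_percent = str(int(score * 100))   # what is written to out.txt (truncated)
rounded_percent = round(score * 100) # what rounding would give instead

print(as_text, as_percent, rounded_percent)
```

If a rounded percentage were wanted, `round(score * 100)` would be the substitute, but the cell above keeps the truncating form.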