Unstructured data classification Fresco Play Handson Solution

Build and evaluate text classification models using TF-IDF, train-test split, SVM, and SGD classifiers in Python for NLP and machine learning projects

Welcome to Unstructured-Classification

#Run the Cell to import the packages
import pandas as pd
import numpy as np
import csv

Fill in the command to load the CSV dataset "imdb.csv" with pandas.


imdb=pd.read_csv("imdb.csv")
imdb.columns = ["index","text","label"]
print(imdb.head(5))

Data Analysis Process

  • Get the shape of the dataset and print it.
  • Get the column names in list and print it.
  • Group the dataset by label and describe the dataset to understand the basic statistics of the dataset.
  • Print the first three rows of the dataset.

data_size = imdb.shape
print(data_size)

imdb_col_names = list(imdb.columns)

print(imdb_col_names)
print(imdb.groupby('label').describe())
print(imdb.head(3))

Target Identification

# Execute the cell below to identify the target variable.
# A label of 0 means a bad review; a label of 1 means a good review.
imdb_target=imdb['label'] 
print(imdb_target)
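
As an optional sanity check (not required by the exercise), you can confirm how many reviews fall under each label before training:

# Optional: class balance of the target variable
print(imdb_target.value_counts())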

Tokenization Process

  • Convert the text to lowercase.
  • Tokenize the text using word_tokenize.
  • Apply the function split_tokens to the column text in the imdb dataset with axis=1.

from nltk.tokenize import word_tokenize
import nltk
nltk.download('all')
import re
def split_tokens(text):
    # Convert to lowercase
    message = text.lower()

    # Optionally strip punctuation/special characters
    # message = re.sub(r'[^\w\s]', '', message)

    # Tokenize into individual words
    word_tokens = word_tokenize(message)
    return word_tokens

imdb['tokenized_message'] = imdb.text.apply(split_tokens)
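
If you want to confirm the tokenizer worked before moving on, printing a single row is enough (index 0 here is arbitrary):

# Optional: inspect the tokens of the first review
print(imdb['tokenized_message'][0])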

Lemmatization Process

  • Apply the function split_into_lemmas for the column tokenized_message with axis=1
  • Print the 55th row from the column tokenized_message.
  • Print the 55th row from the column lemmatized_message

from nltk.stem.wordnet import WordNetLemmatizer

def split_into_lemmas(text):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    # Reduce each token to its base (dictionary) form
    for word in text:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma

imdb['lemmatized_message'] = imdb.tokenized_message.apply(split_into_lemmas)
print('Tokenized message:', imdb['tokenized_message'][54])
print('Lemmatized message:', imdb['lemmatized_message'][54])
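
For intuition, WordNetLemmatizer with its default settings treats each word as a noun and maps inflected forms back to their dictionary form; a minimal standalone illustration (the example words are mine, not from the dataset):

# Example only: plural nouns are reduced to their base form
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('movies'))  # movie
print(lemmatizer.lemmatize('actors'))  # actor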

Stop Word Removal Process

  • Set the stop words language as english in the variable stop_words
  • Apply the function stopword_removal to the column lemmatized_message with axis=1
  • Print the 55th row from the column preprocessed_message

from nltk.corpus import stopwords
 
def stopword_removal(text):
    # Remove English stop words and re-join the remaining tokens into a single string
    stop_words = set(stopwords.words('english'))
    filtered_sentence = ' '.join([word for word in text if word not in stop_words])
    return filtered_sentence
 
imdb['preprocessed_message'] = imdb.lemmatized_message.apply(stopword_removal)
print('Preprocessed message:',imdb['preprocessed_message'][54])
Training_data=pd.Series(list(imdb['preprocessed_message']))
Training_label=pd.Series(list(imdb['label']))
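
A quick optional check that the features and labels still line up one-to-one after preprocessing:

# Optional: both series should have the same length
print(len(Training_data), len(Training_label))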

Term Document Matrix Process

  • Apply CountVectorizer with the following parameters: ngram_range=(1, 2), min_df=1/len(Training_label), max_df=0.7
  • Fit the tf_vectorizer with the Training_data
  • Transform the Total_Dictionary_TDM with the Training_data

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=(1 / len(Training_label)), max_df=0.7)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM =  tf_vectorizer.transform(Training_data)
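
To see what the term-document matrix actually looks like, you can optionally print its shape and a few of the learned n-grams (get_feature_names_out assumes scikit-learn 1.0 or newer; older versions use get_feature_names):

# Optional: documents x n-gram features
print(message_data_TDM.shape)
# Peek at a few unigrams/bigrams from the fitted vocabulary
print(tf_vectorizer.get_feature_names_out()[:10])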

Term Frequency Inverse Document Frequency (TFIDF)

  • Apply TfidfVectorizer with the following parameters: ngram_range=(1, 2), min_df=1/len(Training_label), max_df=0.7
  • Fit the tfidf_vectorizer with the Training_data
  • Transform the Total_Dictionary_TFIDF with the Training_data

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
# Calculate min_df value
min_df_value = 1 / len(Training_label)
 
# Initialize TfidfVectorizer with given parameters
tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=min_df_value,
    max_df=0.7
)
# Fit the vectorizer on the training data
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
# Transform the training data
message_data_TFIDF = tfidf_vectorizer.transform(Training_data)
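
TF-IDF reweights the same n-gram counts so that terms appearing in many documents receive lower weights; an optional peek at the learned IDF values and the matrix shape:

# Optional: shape of the TF-IDF matrix and the first few IDF weights
# (higher IDF = the term is rarer across the corpus)
print(message_data_TFIDF.shape)
print(tfidf_vectorizer.idf_[:5])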

Train and Test Data Process

  • Split the data for training and testing (90% train, 10% test).
  • Perform a train-test split on message_data_TDM and Training_label with 90% as train data and 10% as test data.

from sklearn.model_selection import train_test_split

# Splitting the data for training and testing
train_data,test_data, train_label, test_label = train_test_split(
    message_data_TDM,
    Training_label,
    test_size=0.1
)
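
Note that the split above is random on every run, so the scores below can vary slightly between executions. If you want reproducible numbers, you could pass a fixed random_state to the same call (the value 9 is only an example, not part of the exercise):

# Optional reproducible variant of the same split
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.1, random_state=9
)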

Support Vector Machine Process

  • Get the shape of the train-data and print the same.
  • Get the shape of the test-data and print the same.
  • Initialize the SVM classifier with the following parameters: kernel = linear, C = 0.025, random_state = seed.
  • Train the model with train_data and train_label
  • Now predict the output with test_data
  • Evaluate the classifier with score from test_data and test_label
  • Print the predicted score

from sklearn.svm import SVC

# Step 1: Set seed
seed = 9

# Step 2: Get shape of train/test data
train_data_shape = train_data.shape
test_data_shape = test_data.shape

print("The shape of train data:", train_data_shape)
print("The shape of test data:", test_data_shape)

# Step 3: Initialize SVM Classifier
classifier = SVC(kernel='linear', C=0.025, random_state=seed)

# Step 4: Train the classifier
classifier = classifier.fit(train_data, train_label)

# Step 5: Predict using test data
target = classifier.predict(test_data)

# Step 6: Evaluate the model
score = classifier.score(test_data, test_label)

# Step 7: Print the score
print('SVM Classifier :', score)

# Step 8: Save tokenized and lemmatized message at index 55
with open('output.txt', 'w') as file:
    file.write(str((imdb['tokenized_message'][55], imdb['lemmatized_message'][55])))
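
Accuracy alone can hide per-class behaviour; an optional, more detailed look at the same SVM predictions using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Optional: precision, recall and F1 for each class (0 = bad review, 1 = good review)
print(classification_report(test_label, target))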

Stochastic Gradient Descent Classifier Process

  • Perform a train-test split on message_data_TDM and Training_label, this time with 80% as train data and 20% as test data.
  • Get the shape of the train-data and print the same.
  • Get the shape of the test-data and print the same.
  • Initialize the SGD classifier with the following parameters: loss = modified_huber, shuffle = True, random_state = seed.
  • Train the model with train_data and train_label
  • Now predict the output with test_data
  • Evaluate the classifier with score from test_data and test_label
  • Print the predicted score

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Step 1: Set seed
seed = 9

# Step 2: Split the data (80% train, 20% test)
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.2
)

# Step 3: Get shape of train and test data
train_data_shape = train_data.shape
test_data_shape = test_data.shape

print("The shape of train data:", train_data_shape)
print("The shape of test data:", test_data_shape)

# Step 4: Initialize SGD Classifier
classifier = SGDClassifier(loss='modified_huber', shuffle=True, random_state=seed)

# Step 5: Train the classifier
classifier = classifier.fit(train_data, train_label)

# Step 6: Predict using test data
target = classifier.predict(test_data)

# Step 7: Evaluate the model
score = classifier.score(test_data, test_label)

# Step 8: Print the score
print('SGD classifier :', score)
with open('output1.txt', 'w') as file:
    file.write(str((imdb['preprocessed_message'][55])))
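
Because the loss is modified_huber, this SGD classifier also supports probability estimates; an optional check on the first few test documents:

# Optional: class probabilities are available with the 'modified_huber' loss
probabilities = classifier.predict_proba(test_data)
print(probabilities[:5])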

Save the Final Score in the File


with open('out.txt', 'w') as file:
    file.write(str(int(score*100)))

About the author

D Shwari
I'm a professor in National University's Department of Computer Science. My main streams are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
