Python Pandas Tutorial: Essential DataFrame Tips from Beginner to Expert Solution

Unlock Python Pandas mastery with our guide: essential DataFrame skills, from beginner basics to advanced techniques, with practical solution examples.

#1. How to Create a DataFrame in Python Using Pandas

A DataFrame is a tabular data structure similar to an Excel spreadsheet. You can create it from dictionaries, lists, or other data structures. Three commonly used methods in data analysis are:

Option 1:


import pandas as pd
data = {
"price_aprox_usd": [115910.26, 48718.17, 28977],
"surface_covered_in_m2": [128.0, 210.0, 58.0],
"rooms": [4.0, 3.0, 2.0],
}

df_houses = pd.DataFrame(data)

Option 2:


height_series = pd.Series([161.5, 156.2, 172.8, 140.9])
weight_series = pd.Series([58.6, 65.7, 58.8, 86.1])

frame = {'height': height_series,
         'weight': weight_series}
df = pd.DataFrame(frame)


Option 3:


h = pd.Series([161.5, 156.2, 172.8, 140.9], name="height")
w = pd.Series([58.6, 65.7, 58.8, 86.1], name="weight")
pd.concat([h,w], axis=1)

#df = pd.concat([df, new_col], axis=1)   # using this, you can add a new column
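As a quick check, printing the DataFrame from Option 1 shows the expected table (output reproduced here for reference):


print(df_houses)
#    price_aprox_usd  surface_covered_in_m2  rooms
# 0        115910.26                  128.0    4.0
# 1         48718.17                  210.0    3.0
# 2         28977.00                   58.0    2.0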

#2. How to Drop or Delete NaN Values from the DataFrame.

Missing values can affect your analysis. You can remove them using dropna(). The dropna() function helps to remove rows from the DataFrame if they contain any NaN values.


# Drop a row if it contains any NaN value.
df_cleaned = df.dropna()

#If you want to remove rows where all elements are NaN, use this method:
df_cleaned = df.dropna(how='all')

#To remove columns with NaN values:
df_cleaned = df.dropna(axis=1)

#To remove rows with NaN values only in specific columns:
df_cleaned = df.dropna(subset=['colA', 'colB'])
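Here is a minimal sketch showing how the variants above behave, using a small assumed sample frame:


import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colA": [1.0, np.nan, 3.0, np.nan],
    "colB": [np.nan, np.nan, 30.0, 40.0],
})

print(df.dropna())                  # keeps only row 2 (the only row with no NaN)
print(df.dropna(how='all'))         # drops only row 1 (all of its values are NaN)
print(df.dropna(subset=['colB']))   # keeps rows 2 and 3 (colB is not NaN there)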

#3. How to Delete or Remove Duplicate Values from the DataFrame.

Duplicate values can skew analysis. Use drop_duplicates() to remove them. The drop_duplicates() function helps to remove duplicate values from the entire DataFrame.


#To remove duplicate rows based on all columns:
df_cleaned = df.drop_duplicates()

#To remove duplicates based on specific columns:
df_cleaned = df.drop_duplicates(subset=['colA', 'colB'])

#To keep the last occurrence of the duplicates instead of the first:
df_cleaned = df.drop_duplicates(keep='last')

# Keep changes in the original DataFrame itself.
df.drop_duplicates(subset=['colA', 'colB'], keep="first", inplace=True)
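A quick usage example with assumed sample data:


import pandas as pd

df = pd.DataFrame({"colA": [1, 1, 2], "colB": ["x", "x", "y"]})
print(df.drop_duplicates())  # the second row (1, "x") is dropped as a duplicate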

How to Convert a Column to a List

Extract column data for further processing.

list_of_values = df["column_name"].tolist()

#4. How to Remove or Delete Special Characters or Symbols from a Column in DataFrame?

The str.replace() function helps to remove or replace any word or character in a column. Remember to assign the result back, since str.replace() returns a new Series.


df1["price"].str.replace("$","Dollar")

#5. How to Change or Cast the Column Data Type in a Python DataFrame.

Ensuring correct data types is crucial. The astype() function provides the functionality to change the data type of an existing column in a DataFrame.


df1["price"].astype(float)

#6. How to Drop a Column from a DataFrame in Python?

Use drop() to remove unnecessary columns. The drop() function provides functionality for deleting rows or columns.


df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [15, 6, 17, 8]
}, index=['a', 'b', 'c', 'd'])

# Drop row 'd' in place
df.drop('d', inplace=True)

# Drop rows with index 'b' and 'd'
df_dropped_rows = df.drop(['b', 'd'])

# Drop (Multiple) columns 'B' and 'C'
df_dropped_column = df.drop(['B','C'], axis=1)

# Drop (Multiple) rows 'b' and 'd'
df_dropped_multiple = df.drop(['b', 'd'], axis=0)

How to Create a New Column Based on Conditions?

Apply conditional logic to create new features.

df["new_column"] = np.where(df["price"] > 50000, "High", "Low")

How to Convert Column Values to Lowercase?

Standardize text formatting.

df["column_name"] = df["column_name"].str.lower()

#7. How to save a DataFrame to CSV file.

Export your data for further use. We can save the DataFrame to a CSV file using various methods, with the most popular being conversion through the pandas library.


your_dataframe.to_csv('output.csv') # default conversion

# Another way is to pass the parameters manually, providing more control over the conversion.
import csv
import os

your_dataframe.to_csv(
    path_or_buf='new_titanic.csv',
    sep=',',
    na_rep='',
    float_format=None,
    columns=None,
    header=True,
    index=True,
    index_label=None,
    mode='w',
    encoding='utf-8',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator=os.linesep,
    storage_options=None,
    compression='infer',
    quotechar='"',
    chunksize=None,
    date_format=None,
    doublequote=True,
    escapechar=None,
    decimal='.',
    errors='strict'
)

#8. How to add a new column to a dataframe in python?

Creating new features is key in data science. The assignment operator '=' can be used to add a new column to a dataframe.


df["new_column_1"] = [6, 16, 26, 36, 46, 56]

df["new_column_2"] = df["price"]*10

#9. How to Rename Columns in a DataFrame

Make your column names more readable.


df.rename(columns={"old_name": "new_name"}, inplace=True)

How to split a column into two or multiple columns in a pandas dataframe?

The str.split() function is useful for splitting one column into several.


df3[["lat", "lon"]] = df3["lat-lon"].str.split(",", expand=True)

How to concat or append or merge two dataframes in python using Pandas?

The concat() function helps to combine two different dataframes. Note that the list must contain the DataFrame objects themselves, not their names as strings.


import pandas as pd
df_list = [df1, df2]
new_df = pd.concat(df_list)  # by default, concat stacks the dataframes vertically (axis=0)

How to merge or append two dataframes vertically in python using Pandas?

The concat() function stacks the dataframes vertically (one below the other) with the parameter axis=0, which is the default.


import pandas as pd
df_list = [df1, df2]
new_df = pd.concat(df_list, axis=0)

How to merge two dataframes horizontally in python using Pandas?

The concat() function places the dataframes side by side (column-wise) with the parameter axis=1.


import pandas as pd
df_list = [df1, df2]
new_df = pd.concat(df_list, axis=1)
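A minimal sketch contrasting the two directions, with assumed sample frames:


import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

print(pd.concat([df1, df2], axis=0))  # vertical: 4 rows, 1 column
print(pd.concat([df1, df2], axis=1))  # horizontal: 2 rows, 2 columns (both named "A")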

How to save or export dataframe as csv in python?

The to_csv() function can be used to export the dataframe to a CSV file.


df.to_csv("Dataset_output.csv")

How to save or export dataframe as csv in python without indexes?

The index parameter of the to_csv() function can be set to False to exclude the index from the output file.


df.to_csv("Dataset_output.csv" , index = False)

How to save or export dataframe as excel file in python?

The to_excel() function can be used to export the dataframe to an Excel file.


df.to_excel("Dataset_output.xlsx")

How to get or find all unique values from a dataframe column?

The unique() function can be applied to the respective column to get an array of its unique values.


df["col"].unique()

How to get total number of unique value from the dataframe column?

The nunique() function can be used to get the number of unique values in a column.


df["col"].nunique()

How to Count Occurrences of Unique Values and Get the Top N Most Frequent Values from a Dataframe Column?

The value_counts() function can be used to count the occurrences of each value in a column, sorted by frequency.


arrange_by_rank = df["salary"].value_counts()
top_3_values = arrange_by_rank.head(3) 
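A quick usage example with assumed sample salaries:


import pandas as pd

df = pd.DataFrame({"salary": [30000, 30000, 45000, 30000, 45000, 60000]})
print(df["salary"].value_counts())          # 30000: 3, 45000: 2, 60000: 1 (sorted by frequency)
print(df["salary"].value_counts().head(2))  # top 2 most frequent values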

How to Fill Missing Values with a Specific Value?

Replace NaNs with a chosen value.

anyValue = 0
df.fillna(anyValue, inplace=True) 

How to Convert a Column to Datetime Format?

Convert strings to datetime for time-series analysis.

df["date_column"] = pd.to_datetime(df["date_column"]) 

How to Extract the Year from a Date Column?

Retrieve specific time components.

df["year"] = df["date_column"].dt.year

How to Check for Missing Values in a Column?

Detecting NaNs is important for preprocessing.

df["column_name"].isna().sum()  

How to Check Data Types of Columns?

Use dtypes to inspect column types.

df.dtypes 

How to Reset the Index of a DataFrame?

Resetting index after row deletions keeps it organized.

df.reset_index(drop=True, inplace=True)

How to Get Summary Statistics of a DataFrame? How to calculate the min, max, mean, standard deviation, and quartiles of a dataframe using pandas in Python?

The describe() function helps to calculate all of these useful statistics.


# Calculate for entire dataframe.
df.describe() 

# Calculate for specific dataframe columns.
df[["area_m2", "price_usd"]].describe()

# Another way: calculate selected statistics using the agg() function.
df['price_usd'].agg(['mean', 'median', 'min'])

How to get the summary of a DataFrame to understand the dataset's structure?

The .info() method in Pandas provides a concise summary of a DataFrame, helping data scientists quickly understand the dataset's structure.

df.info()  # what df.info() displays is shown below

'''
Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   WineName   5000 non-null   object 
 1   Country    5000 non-null   object 
 2   Price      4500 non-null   float64
 3   Rating     4800 non-null   float64
 4   Year       5000 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 195.4+ KB
'''

How to Calculate Rolling Mean (Moving Average)?

Smoothing time-series data is useful for trend analysis.

df["rolling_avg"] = df["price"].rolling(window=3).mean()

How to group a dataframe by a column and calculate the mean of each group in a python Dataframe?

The mean() function along with the groupby() function can be used to achieve this.


df.groupby(["state"], sort=False).mean(numeric_only=True)  # numeric_only avoids errors on non-numeric columns

How to Apply a custom Function to a DataFrame Column?

Use apply() for complex transformations.

df["new_col"] = df["col"].apply(lambda x: x * 2)

How to group repeated column values based on another column in a python Dataframe?

The groupby() function is also useful here; see below:


df_2 = df[["state", "price_usd"]].groupby(["state"], sort=False).mean()

How to group by multiple columns in a python Dataframe?

The groupby() function can group by multiple columns when you pass a list of column names in the square brackets.


df_2 = df[["state","area_m2", "price_usd"]].groupby(["state","area_m2"], sort=False).mean()
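A runnable sketch of multi-column grouping with assumed sample data:


import pandas as pd

df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "area_m2": [100, 100, 80, 90],
    "price_usd": [200000, 240000, 300000, 320000],
})

print(df[["state", "area_m2", "price_usd"]].groupby(["state", "area_m2"], sort=False).mean())
# (TX, 100) -> 220000.0, (CA, 80) -> 300000.0, (CA, 90) -> 320000.0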

How to sort a dataframe based on a column in python?

The sort_values() function is helpful for sorting a dataframe by a column's values.


df.sort_values(by="price_usd", ascending=False)

How to group the dataframe and return only specific or particular columns?

The groupby() function can group the dataframe by a column, and providing a column name in square brackets returns only that column.


df.groupby(["state"], sort=False)["price_usd"].mean().sort_values(ascending=False)

How to plot a dataframe using the pandas library?

The pandas plot() function can also be used to display plots.


df.plot(kind="bar", xlabel="X-axis", ylabel="Y-axis", title="Heading");  # works best when the dataframe has two columns (x and y)

Method chaining:


(
    df
    .groupby("state")["price_per_m2"]
    .mean()
    .sort_values(ascending=False)
    .plot(
        kind="bar",
        xlabel="State",
        ylabel="Mean Price per M^2 [USD]",
        title="Mean House Price per M^2 by State",
    )
);

How to check if a value exists in a dataframe column?

The str.contains() function helps to check whether a value is present in a column.


# Returns a boolean Series: True where the value is present, otherwise False.
mask = df["place_with_parent_names"].str.contains("Capital")

print(df[mask].head())  # returns rows only for True values

How to calculate quantiles in python?

The quantile() function returns the value below which a given fraction of the data lies. For example, here we pass two parameters to the quantile() function: the first value, 0.1, returns the value that lies at 0.1 (10%) of the data; similarly, 0.9 returns the value that lies at 90% of the data.
Note: This function only works with numeric data; if you apply it to string data, it will raise an error.


df["price_usd"].quantile([0.1, 0.9])  # Note: 0.1 equals 10% and 0.9 equals 90%

How to Check for Outliers Using IQR in a Dataframe?

Detect outliers using the interquartile range.

Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
df_outliers_removed = df[~((df["price"] < (Q1 - 1.5 * IQR)) | (df["price"] > (Q3 + 1.5 * IQR)))]
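A usage example with assumed sample data, where 999 is an obvious outlier:


import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 999]})
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
print(df[~((df["price"] < (Q1 - 1.5 * IQR)) | (df["price"] > (Q3 + 1.5 * IQR)))])  # the 999 row is removed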

How to One-Hot Encode Categorical Columns in Pandas DataFrame?

Convert categorical values into numerical form. When working with machine learning models, categorical variables must be converted into a numerical format since most algorithms do not work with raw categorical data. One-hot encoding is a technique where each unique category in a column is transformed into a separate binary column (0 or 1). The pd.get_dummies() function in Pandas is used to achieve this.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Category": ["A", "B", "A"]
})

# Apply One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['Category'], dtype=int)  # dtype=int gives 0/1 instead of True/False

print(df_encoded)

'''
Output:

   ID  Category_A  Category_B
0   1          1          0
1   2          0          1
2   3          1          0
'''

About the author

D Shwari
I'm a professor at National University's Department of Computer Science. My main streams are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
