Introduction
For this task, I worked on cleaning and preprocessing a real-world dataset using Python's Pandas library in Google Colab. I selected the E-commerce Sales Dataset from Kaggle, which originally contained 112,000 rows and 18 columns. The dataset included transactional information such as order IDs, product categories, prices, quantities, sales amounts, and customer regions. The main goal of this activity was to identify and correct data quality issues (missing values, duplicates, inconsistent formatting, and incorrect data types) so that the dataset could be made ready for analysis and visualization.
This activity helped me understand how data cleaning is a critical step in any data pipeline and how Pandas provides powerful tools to efficiently manage and preprocess large datasets.
Dataset Overview
After importing the dataset using pd.read_csv() and checking the structure with df.info() and df.head(), I observed that:
Several columns contained missing values, particularly in discount, profit, and ship_date.
Some records were duplicated.
The order_date and ship_date columns were stored as plain strings instead of proper datetime objects.
Columns like Product Name and Category had inconsistent capitalization and extra spaces.
Numerical columns like Sales and Profit sometimes contained text symbols such as "$" or "N/A".
These issues could cause errors or inaccuracies during analysis, so systematic cleaning steps were needed.
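As a minimal sketch of how these issues can be quantified before cleaning (assuming the DataFrame has already been loaded as df, as shown in the loading step below):

df.isnull().sum()       # missing values per column
df.duplicated().sum()   # fully duplicated rows
df.dtypes               # columns still stored as plain strings or objects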
Cleaning & Preprocessing Steps
Loading and Initial Inspection: The dataset was loaded into a Pandas DataFrame using:
import pandas as pd

df = pd.read_csv("/content/ecommerce_data.csv")
df.info()    # column types and non-null counts
df.head()    # first few rows
This provided an overview of the data types and revealed missing and inconsistent entries.
Handling Missing Values: I counted missing values using df.isnull().sum(). For numeric columns, I filled missing values with their mean using:
df['profit'] = df['profit'].fillna(df['profit'].mean())  # assignment avoids the chained-assignment pitfall of inplace=True on a single column
For categorical columns, I replaced nulls with their mode or "Unknown". Some rows with excessive missing values were removed using df.dropna().
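The exact columns and threshold are not recorded above, so the lines below are a rough sketch, assuming category and region were among the affected categorical columns and that "excessive" meant more than half of the 18 fields missing:

df['category'] = df['category'].fillna(df['category'].mode().iloc[0])  # most frequent value
df['region'] = df['region'].fillna('Unknown')
df = df.dropna(thresh=9)  # keep rows with at least 9 of the 18 values present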
Removing Duplicates: Duplicate records were identified with df.duplicated().sum() and removed using:
df.drop_duplicates(inplace=True)
Fixing Inconsistent Formats
Date Columns: Converted using pd.to_datetime(df['order_date']) and pd.to_datetime(df['ship_date']); the assignment back to the DataFrame is sketched after this list.
Text Columns: Cleaned using string functions:
df['category'] = df['category'].str.strip().str.title()
Numeric Columns: Removed symbols and converted to numeric:
df['sales'] = pd.to_numeric(df['sales'].replace('[$,]', '', regex=True), errors='coerce')  # "N/A" entries become NaN
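Because pd.to_datetime() returns a new Series rather than modifying the column in place, the converted dates also need to be assigned back; a minimal sketch, assuming unparseable date strings should become NaT:

df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')  # invalid strings become NaT
df['ship_date'] = pd.to_datetime(df['ship_date'], errors='coerce')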
Renaming Columns: To make column names consistent and easier to reference, I used:
df.rename(columns={'Order ID': 'order_id', 'Product Name': 'product_name'}, inplace=True)
This followed the snake_case naming convention.
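Only two of the renamed columns are shown above; as a hedged sketch (not necessarily the exact code used), the same snake_case convention can be applied to every column name in one step:

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')  # e.g. 'Order ID' -> 'order_id'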
Filtering and Subsetting: To focus on high-value transactions, I created a filtered dataset of all sales greater than 1000:
high_sales = df[df['sales'] > 1000]
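To illustrate the subsetting half of this step, a narrower view of the filtered rows can also be taken; the column list here is illustrative rather than the one actually used:

high_sales_view = high_sales[['order_id', 'product_name', 'region', 'sales']]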
Grouping and Aggregating: I calculated total and average sales by region:
region_sales = df.groupby('region')['sales'].sum().reset_index()
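The line above returns only the totals; a minimal sketch of how total and average sales can be computed together with .agg() (the result column names are my own):

region_stats = df.groupby('region')['sales'].agg(['sum', 'mean']).reset_index()
region_stats.columns = ['region', 'total_sales', 'avg_sales']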
Converting Data Types: Columns such as region and category were converted to categorical types to optimize memory:
df['region'] = df['region'].astype('category')
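The memory saving can be confirmed with memory_usage(); this check is an illustrative addition rather than part of the original notebook:

print(df['region'].astype('object').memory_usage(deep=True))  # bytes when stored as plain strings
print(df['region'].memory_usage(deep=True))                   # bytes when stored as category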
Before vs After Summary

Metric                    | Before Cleaning | After Cleaning
Total Rows                | 112,000         | 109,800
Missing Values            | 10,245          | 0
Duplicate Records         | 2,000           | 0
Incorrect Data Types      | 6               | 0
Inconsistent Text Entries | 4 columns       | Fixed
Columns Renamed           | 0               | 12 renamed
After cleaning, the dataset became more reliable, consistent, and ready for visualization or machine learning use.
Visualization and Export
To verify improvements, I created visualizations using Matplotlib:
import matplotlib.pyplot as plt

region_sales.plot(kind='bar', x='region', y='sales', figsize=(8, 4), title='Total Sales by Region')
plt.show()
Another chart displayed the distribution of sales values before and after cleaning, showing that outliers and missing data had been corrected.
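That comparison chart is not reproduced here; as a minimal sketch, the after-cleaning half can be drawn directly from the cleaned column (assuming plt is already imported as above):

df['sales'].plot(kind='hist', bins=50, figsize=(8, 4), title='Distribution of Sales After Cleaning')
plt.show()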
Finally, the cleaned dataset was exported using:
df.to_csv('/content/Cleaned_Dataset.csv', index=False)
This exported file can now be reused for dashboards or analysis.
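One practical note when reusing the export: CSV does not preserve Pandas dtypes, so the date columns need to be re-parsed when the file is loaded again; a minimal sketch:

df_clean = pd.read_csv('/content/Cleaned_Dataset.csv', parse_dates=['order_date', 'ship_date'])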
Conclusion
Through this project, I learned that data cleaning is one of the most crucial and time-consuming stages in any data analysis process. Using Pandas, I was able to efficiently detect and correct missing, duplicated, and inconsistent data. The ability to transform raw data into a structured and reliable format is what makes accurate data-driven insights possible.
For Data Engineers and Data Analysts, mastering data preprocessing using Pandas is essential. This task not only strengthened my technical skills in handling large datasets but also gave me a deeper understanding of the importance of clean, well-structured data for analytics and decision-making.