All the code used in this project is available here:
https://github.com/manishpaneru/Customer_segmentation
As a data analyst at Arnova Store, I have been entrusted by the CEO with the task of uncovering actionable insights from our customer data. The CEO's key question is: What are the key segments in our customer base, and how do they differ in purchasing behavior?
To answer this, I analyzed the dataset using a structured approach.
Methodology
To address the CEO's question and effectively segment Arnova Store's customer base, I followed a systematic, hands-on approach using Python for all analysis tasks. The dataset, which contained customer demographics, spending habits, and other behavioral attributes, was provided to me by the CEO in Excel format. Here’s how I tackled the project step by step:
Data Acquisition
Data Cleaning
Exploratory Data Analysis (EDA)
Customer Clustering (Segmentation)
Cluster Analysis
Insights and Reporting
Data Overview
The dataset provided by the CEO contained 2,000 records and eight key attributes, including customer demographics (e.g., Gender, Age, Profession), spending habits (e.g., Annual Income and Spending Score), and additional details like Work Experience and Family Size. Upon importing the data, I noticed that some values in the "Profession" column were missing, which I addressed by imputing the most frequent value (mode).
To ensure data integrity, I also removed duplicate entries and standardized column names to make them easier to reference during analysis. Outliers in the numerical attributes were identified through box plots, particularly in "Annual Income" and "Spending Score," and were capped appropriately to prevent them from skewing the clustering results. This preprocessing ensured that the dataset was clean, consistent, and ready for analysis.
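The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the exact code from the repository: the function name `clean_customers`, the toy rows, the raw column headers, and the 1.5 × IQR capping rule are all assumptions, since the write-up says only that outliers were "capped appropriately".

```python
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps described above."""
    out = df.copy()

    # Standardize column names, e.g. "Annual Income ($)" -> "annual_income"
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"\(.*?\)", "", regex=True)     # drop units in parentheses
        .str.replace(r"[^a-z0-9]+", "_", regex=True)  # spaces/symbols -> "_"
        .str.strip("_")
    )

    # Remove exact duplicate rows
    out = out.drop_duplicates().reset_index(drop=True)

    # Impute missing professions with the most frequent value (mode)
    most_common = out["profession"].mode().iloc[0]
    out["profession"] = out["profession"].fillna(most_common)

    # Cap outliers at the 1.5 * IQR fences (assumed rule, see lead-in)
    for col in ["annual_income", "spending_score"]:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    return out


# Toy rows standing in for the real Excel data (made up for illustration)
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Profession": ["Engineer", None, "Engineer", "Doctor", "Doctor", "Engineer"],
    "Annual Income ($)": [40000, 52000, 48000, 45000, 45000, 900000],
    "Spending Score (1-100)": [35, 60, 55, 42, 42, 50],
})

clean = clean_customers(raw)
print(clean)
```

In this toy example the duplicated row is dropped, the missing profession is filled with the mode, and the extreme income is pulled down to the upper fence rather than removed, so no customer record is lost.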
Exploratory Data Analysis (EDA)
To gain an initial understanding of the data, I analyzed demographic attributes such as Age, Gender, and Profession.
A histogram of Age revealed that the majority of customers were between 25 and 45 years old, indicating a younger customer base. The Gender distribution was fairly balanced, while the Profession data showed a concentration in fields like Engineering, Healthcare, and Entertainment.
Next, I analyzed spending habits. A histogram of Annual Income showed that most customers earned between $30,000 and $90,000 annually. The Spending Score histogram revealed diverse purchasing behaviors, with some customers scoring very low and others extremely high. A scatter plot of Annual Income vs. Spending Score highlighted potential patterns, with high earners clustering into distinct spending categories.
These insights provided a solid foundation for clustering and revealed initial trends in customer demographics and spending behaviors.
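As a rough sketch of the numbers behind these plots, the snippet below computes histogram counts and a gender split with pandas and NumPy. The data is randomly generated as a stand-in for the real dataset, and the column names and bin edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned customer table (not the real data)
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "gender": rng.choice(["Male", "Female"], size=200),
    "annual_income": rng.integers(20000, 120000, size=200),
    "spending_score": rng.integers(1, 101, size=200),
})

# Age histogram: count customers per age band (left-closed bins)
age_bins = [18, 25, 35, 45, 55, 70]
age_counts = pd.cut(customers["age"], bins=age_bins, right=False).value_counts(sort=False)

# Gender distribution as proportions
gender_share = customers["gender"].value_counts(normalize=True)

# Income histogram: ten equal-width bins
income_counts, income_edges = np.histogram(customers["annual_income"], bins=10)

# Linear relationship between income and spending score
corr = customers["annual_income"].corr(customers["spending_score"])

print(age_counts)
print(gender_share)
print(income_counts)
print(f"income vs. spending correlation: {corr:.2f}")
```

The same counts are what a plotting library would draw as bars. With matplotlib installed, `customers["age"].plot.hist()` and `customers.plot.scatter(x="annual_income", y="spending_score")` give the visual versions.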
Customer Clustering (Segmentation)
To segment the customer base, I applied k-means clustering. Before clustering, I normalized numerical attributes such as "Annual Income" and "Spending Score" to ensure that each feature contributed equally to the clustering process.
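To illustrate why normalization matters here: income is measured in tens of thousands while spending score runs from 1 to 100, so unscaled distances would be dominated by income. The sketch below uses scikit-learn's `StandardScaler` (z-score scaling). The choice of scaler is an assumption, since the write-up does not say which normalization was used, and the five rows are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [annual_income, spending_score]
features = np.array([
    [30000, 20],
    [45000, 80],
    [60000, 50],
    [90000, 90],
    [120000, 10],
], dtype=float)

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

print(scaled.round(2))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```

After scaling, a one-standard-deviation difference in income counts the same as a one-standard-deviation difference in spending score when k-means computes distances.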
The elbow method was used to determine the optimal number of clusters, revealing that four distinct groups would provide the most meaningful segmentation.
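The elbow method can be sketched as fitting k-means for a range of k values and recording the inertia (within-cluster sum of squared distances) for each. The example below runs on synthetic blobs generated with `make_blobs`, not on the store's data, and the range of k and the `random_state` values are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for scaled income/spending features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Fit k-means for each candidate k and store the inertia
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting inertia against k produces a curve that drops steeply at first and then flattens. The "elbow" is the k after which adding more clusters yields only small improvements, and that k is taken as the number of segments.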
After performing k-means clustering, I assigned cluster labels to each customer and visualized the results using a scatter plot. This plot displayed clear segmentation, with each cluster representing a unique group based on income and spending behavior.
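A minimal sketch of the fit-and-label step, assuming scikit-learn's `KMeans` and z-score scaling. The twelve customers are invented so that four groups are easy to see, and the column names are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customers arranged in four obvious income/spending groups
customers = pd.DataFrame({
    "annual_income": [25000, 28000, 30000,   # low income, low spending
                      27000, 31000, 26000,   # low income, high spending
                      88000, 92000, 89000,   # high income, low spending
                      85000, 90000, 95000],  # high income, high spending
    "spending_score": [15, 20, 18,
                       80, 85, 82,
                       12, 15, 14,
                       85, 90, 88],
})

# Scale both features so neither dominates the distance calculation
scaled = StandardScaler().fit_transform(customers[["annual_income", "spending_score"]])

# Fit k-means with four clusters and attach the label to each customer
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
customers["cluster"] = kmeans.fit_predict(scaled)

print(customers)
```

The cluster numbers themselves are arbitrary identifiers, so it is the grouping that matters, not which group is called 0 or 3. Coloring a scatter plot of income against spending score by the `cluster` column gives the segmentation view described above.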
Cluster Analysis
With clusters assigned, I examined the characteristics of each group, comparing attributes like average income, spending score, and demographics (e.g., gender, age, and profession).
I visualized the clusters to highlight differences between groups, using scatter plots and bar charts to make the distinctions clear.
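One way to build such a per-cluster comparison is a grouped summary table, sketched below with pandas. The six labeled customers and the column names are made up for illustration and do not reflect the actual segments.

```python
import pandas as pd

# Made-up customers that already carry a cluster label
customers = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "age": [25, 35, 40, 50, 30, 32],
    "gender": ["Female", "Male", "Female", "Female", "Male", "Male"],
    "annual_income": [30000, 40000, 80000, 90000, 50000, 60000],
    "spending_score": [20, 30, 80, 90, 50, 60],
})

# Numeric profile: size and averages per cluster
profile = customers.groupby("cluster").agg(
    customers=("age", "size"),
    avg_age=("age", "mean"),
    avg_income=("annual_income", "mean"),
    avg_spending=("spending_score", "mean"),
)

# Categorical profile: share of each gender within each cluster
gender_mix = pd.crosstab(customers["cluster"], customers["gender"], normalize="index")

print(profile)
print(gender_mix)
```

The `profile` table feeds directly into bar charts of average income or spending per cluster, and the same `crosstab` call works for Profession in place of gender.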
How My Approach Ties to the Project Goals
Insights and Reporting