US Cost of Living Dataset:
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load the Dataset
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/cost_of_living_us.csv")
Step 3: Initial Exploration
# Display basic information about the dataset
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Category  5 non-null      object
 1   Value     5 non-null      int64
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes
None
Step 4: Missing Values Analysis
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)
Running this code shows no missing values in any column except median_family_income, which has 10.
Output:
Missing Values:
case_id 0
state 0
isMetro 0
areaname 0
county 0
family_member_count 0
housing_cost 0
food_cost 0
transportation_cost 0
healthcare_cost 0
other_necessities_cost 0
childcare_cost 0
taxes 0
total_cost 0
median_family_income 10
dtype: int64
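The 10 missing values in median_family_income need a decision before that column is plotted or modeled. The following is a minimal sketch of two common options, dropping the affected rows or filling them with the column median. The small `sample` frame and its values are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset (illustrative values only)
sample = pd.DataFrame({
    "state": ["AL", "AL", "TX", "TX"],
    "total_cost": [40000.0, 52000.0, 61000.0, 45000.0],
    "median_family_income": [55000.0, np.nan, 72000.0, 68000.0],
})

# Option 1: drop the rows where income is missing
dropped = sample.dropna(subset=["median_family_income"])

# Option 2: fill the missing income with the column median
income_median = sample["median_family_income"].median()
filled = sample.assign(
    median_family_income=sample["median_family_income"].fillna(income_median)
)

print(len(dropped))
print(filled["median_family_income"].isnull().sum())
```

With only 10 missing rows out of 31430, dropping them is unlikely to change the results much, but which option fits best depends on the analysis.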
Step 5: Data Description
code:
# Summary statistics
summary_stats = df.describe()
print("Summary Statistics:")
print(summary_stats)
Data Description output
Summary Statistics:
            case_id  housing_cost     food_cost  transportation_cost  \
count  31430.000000  31430.000000  31430.000000         31430.000000
mean    1589.311804  11073.673539   8287.504557         13593.856381
std      917.218414   4165.606147   3271.140249          1640.456562
min        1.000000   4209.311280   2220.276840          2216.461440
25%      792.000000   8580.000000   5801.424360         12535.159800
50%     1593.000000  10416.000000   8129.156280         13698.164400
75%     2386.000000  12444.000000  10703.624280         14765.758500
max     3171.000000  61735.587600  31178.619600         19816.482000

       healthcare_cost  other_necessities_cost  childcare_cost         taxes  \
count     31430.000000            31430.000000    31430.000000  31430.000000
mean      13394.031748             7015.318377     9879.584233   7657.714782
std        5204.545710             2397.415490     6778.223399   3339.795571
min        3476.379960             2611.642080        0.000000   1027.800756
25%        9667.440000             5286.354120     5341.621590   5597.970360
50%       13082.700000             6733.056120    10166.340120   6898.468860
75%       16657.816800             8413.090230    14276.377800   8790.207270
max       37252.274400            28829.443200    48831.085200  47753.390400

          total_cost  median_family_income
count   31430.000000          31420.000000
mean    70901.683601          68315.997017
std     21846.545235          16886.970245
min     30087.662400          25529.976562
25%     53776.019400          57223.988281
50%     70977.682800          65955.605469
75%     85371.341100          76136.070312
max    223717.548000         177662.468750
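Because describe() puts one column per variable, the printed output wraps into several blocks once there are many numerical columns. One way to make it easier to read is to transpose the result so each variable becomes a row. The sketch below uses a small made-up `sample` frame rather than the real data.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    "housing_cost": [8000.0, 10000.0, 12000.0],
    "food_cost": [5000.0, 8000.0, 11000.0],
})

# Transpose: one row per column of the data, one column per statistic
summary = sample.describe().T
print(summary[["mean", "min", "max"]])
```

On the full dataset the same call would be df.describe().T.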
Step 6: Data Visualization
a. Histogram of Income
code:
# Step 6: Data Visualization
# Print the column names to see the available columns
print(df.columns)
# Histogram of median family income (rows with missing income are dropped)
plt.figure(figsize=(10, 6))
sns.histplot(df['median_family_income'].dropna(), bins=30, kde=True)
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Income Distribution')
plt.show()
Running this code plots the income distribution.
b. Bar Chart of Family Types
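# Bar chart of family types
# Print the column names to see the available columns
print(df.columns)
# Assumes 'family_member_count' is the column that encodes the family type
plt.figure(figsize=(10, 6))
sns.countplot(x='family_member_count', data=df)
plt.xticks(rotation=45)
plt.xlabel('Family Type')
plt.ylabel('Count')
plt.title('Count of Family Types')
plt.show()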
Index(['case_id', 'state', 'isMetro', 'areaname', 'county', 'family_member_count', 'housing_cost', 'food_cost', 'transportation_cost', 'healthcare_cost', 'other_necessities_cost', 'childcare_cost', 'taxes', 'total_cost', 'median_family_income'], dtype='object')
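A bar chart of family types draws one bar per category, with the bar height equal to the number of rows in that category. Those counts can be checked directly with value_counts(). This is a rough sketch: the `sample` frame and the category labels in it are hypothetical placeholders, and it assumes family_member_count is the column that holds the family type.

```python
import pandas as pd

# Hypothetical category labels, for illustration only
sample = pd.DataFrame({
    "family_member_count": ["2p1c", "1p0c", "2p1c", "2p2c", "2p1c", "1p0c"],
})

# Number of rows per family type, most frequent first
counts = sample["family_member_count"].value_counts()
print(counts)
```

Comparing these numbers with the bar heights is a quick way to confirm the chart is plotting the intended column.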
c. Correlation Matrix
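# Correlation matrix (numerical columns only; text columns such as 'state' cannot be correlated)
correlation_matrix = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()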
This step calculates and visualizes the correlation matrix between numerical columns.
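A heatmap with many columns can be hard to scan, so it can help to pull out a single column of the correlation matrix and rank it. The sketch below ranks the other numerical columns by the absolute strength of their correlation with total_cost. The `sample` frame is made up for illustration, and the choice of total_cost as the target is an assumption.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    "state": ["AL", "TX", "CA", "NY"],
    "housing_cost": [8000.0, 10000.0, 16000.0, 18000.0],
    "childcare_cost": [9000.0, 4000.0, 7000.0, 5000.0],
    "total_cost": [50000.0, 58000.0, 80000.0, 90000.0],
})

# Correlate numerical columns only; the text column 'state' is excluded
corr = sample.select_dtypes(include="number").corr()

# Rank the other columns by absolute correlation with total_cost
ranked = (
    corr["total_cost"]
    .drop("total_cost")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.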
These steps provide a basic exploratory analysis of the US Cost of Living Dataset, including data exploration, missing values analysis, data description, and several visualizations. You can further customize and extend this analysis based on your specific research objectives and questions.
This code provides examples of three common types of visualizations:
Histogram: Shows the distribution of a numerical variable, here median family income.
Bar Chart: Visualizes categorical data with bars representing the count of each category.
Correlation Heatmap: Shows the pairwise correlations between numerical variables as a color-coded matrix.