US Cost of Living Dataset

Sofialiaqat
3 min read · Oct 10, 2023

Step 1: Import Necessary Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

Step 2: Load the Dataset

# Load the dataset

df = pd.read_csv("/content/drive/MyDrive/cost_of_living_us.csv")
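
A quick sanity check right after loading can catch a wrong path or a malformed file early. The sketch below is only an illustration: it reads a hypothetical two-row sample from memory instead of the real CSV, and the names `sample_csv` and `sample_df` and all values in it are made up.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for cost_of_living_us.csv.
# The values are invented for illustration only.
sample_csv = io.StringIO(
    "case_id,state,total_cost,median_family_income\n"
    "1,AL,39254.05,73010.41\n"
    "2,AK,57194.32,\n"
)

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample_csv)

# Quick sanity checks right after loading.
print(sample_df.shape)   # (rows, columns)
print(sample_df.head())  # first few rows
```

With the real file, the same two calls (`df.shape` and `df.head()`) would apply to `df` directly.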

Step 3: Initial Exploration

# Display basic information about the dataset

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Category  5 non-null      object
 1   Value     5 non-null      int64
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes
None

Step 4: Missing Values Analysis

# Check for missing values

missing_values = df.isnull().sum()

print("Missing Values:")

print(missing_values)
Running this code shows that only median_family_income has missing values (10); every other column is complete.
Output:

Missing Values:
case_id 0
state 0
isMetro 0
areaname 0
county 0
family_member_count 0
housing_cost 0
food_cost 0
transportation_cost 0
healthcare_cost 0
other_necessities_cost 0
childcare_cost 0
taxes 0
total_cost 0
median_family_income 10
dtype: int64
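
The 10 missing values in median_family_income can be handled in more than one way. The sketch below shows two common options on a toy DataFrame; the values are made up, and only the column names mirror the dataset. Which option suits a given analysis is a judgment call, not something the dataset dictates.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "total_cost": [40000.0, 55000.0, 70000.0, 62000.0],
    "median_family_income": [60000.0, np.nan, 80000.0, 70000.0],
})

# Option 1: drop rows where income is missing.
dropped = toy.dropna(subset=["median_family_income"])

# Option 2: fill missing income with the column median.
income_median = toy["median_family_income"].median()
filled = toy.fillna({"median_family_income": income_median})

print(len(dropped))
print(filled["median_family_income"].tolist())
```

Dropping keeps only observed values, while filling keeps every row at the cost of introducing an estimated value.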

Step 5: Data Description

code:
# Summary statistics

summary_stats = df.describe()

print("Summary Statistics:")

print(summary_stats)

Output:

Summary Statistics:
case_id housing_cost food_cost transportation_cost \

count 31430.000000 31430.000000 31430.000000 31430.000000
mean 1589.311804 11073.673539 8287.504557 13593.856381
std 917.218414 4165.606147 3271.140249 1640.456562
min 1.000000 4209.311280 2220.276840 2216.461440
25% 792.000000 8580.000000 5801.424360 12535.159800
50% 1593.000000 10416.000000 8129.156280 13698.164400
75% 2386.000000 12444.000000 10703.624280 14765.758500
max 3171.000000 61735.587600 31178.619600 19816.482000

healthcare_cost other_necessities_cost childcare_cost taxes \

count 31430.000000 31430.000000 31430.000000 31430.000000
mean 13394.031748 7015.318377 9879.584233 7657.714782
std 5204.545710 2397.415490 6778.223399 3339.795571
min 3476.379960 2611.642080 0.000000 1027.800756
25% 9667.440000 5286.354120 5341.621590 5597.970360
50% 13082.700000 6733.056120 10166.340120 6898.468860
75% 16657.816800 8413.090230 14276.377800 8790.207270
max 37252.274400 28829.443200 48831.085200 47753.390400

total_cost median_family_income

count 31430.000000 31420.000000
mean 70901.683601 68315.997017
std 21846.545235 16886.970245
min 30087.662400 25529.976562
25% 53776.019400 57223.988281
50% 70977.682800 65955.605469
75% 85371.341100 76136.070312
max 223717.548000 177662.468750
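
One thing the summary makes easy to check is whether the individual cost columns add up to total_cost. The sketch below copies the mean values from the output above and sums them. It is a plausibility check on the means only, so it suggests, but does not prove, that total_cost is the row-wise sum of the seven components.

```python
# Mean values copied from the describe() output above.
component_means = {
    "housing_cost": 11073.673539,
    "food_cost": 8287.504557,
    "transportation_cost": 13593.856381,
    "healthcare_cost": 13394.031748,
    "other_necessities_cost": 7015.318377,
    "childcare_cost": 9879.584233,
    "taxes": 7657.714782,
}
total_cost_mean = 70901.683601

# Sum the component means and compare with the reported mean of total_cost.
components_sum = sum(component_means.values())
difference = abs(components_sum - total_cost_mean)

print(round(components_sum, 2), round(difference, 4))
```

The two figures agree to within rounding of the printed values.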

Step 6: Data Visualization

a. Histogram of Income

Code:


# Histogram of median family income

# 'median_family_income' is taken from the column names printed by df.columns

# Print the column names to see the available columns

print(df.columns)

plt.figure(figsize=(10, 6))

# Draw the histogram, skipping the rows where income is missing

sns.histplot(df['median_family_income'].dropna())

plt.xlabel('Income')

plt.ylabel('Frequency')

plt.title('Income Distribution')

plt.show()

Running this code displays a histogram of the income distribution.

b. Bar Chart of Family Types

# Bar chart of family types

# Print the column names to see the available columns

print(df.columns)

plt.figure(figsize=(10, 6))

# Assumption: 'family_member_count' is the column that encodes the family type

sns.countplot(x='family_member_count', data=df)

plt.xticks(rotation=45)

plt.xlabel('Family Type')

plt.ylabel('Count')

plt.title('Count of Family Types')

plt.show()

Index(['case_id', 'state', 'isMetro', 'areaname', 'county', 'family_member_count', 'housing_cost', 'food_cost', 'transportation_cost', 'healthcare_cost', 'other_necessities_cost', 'childcare_cost', 'taxes', 'total_cost', 'median_family_income'], dtype='object')
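
Before plotting a bar chart, it can help to look at the raw category counts that the bars will represent. The sketch below does this with `value_counts()` on a toy column. The labels are hypothetical placeholders, and treating family_member_count as the family-type column is an assumption based on the column list above.

```python
import pandas as pd

# Toy data: the labels below are hypothetical placeholders for family types.
toy = pd.DataFrame({
    "family_member_count": ["1p0c", "2p1c", "1p0c", "2p2c", "2p1c", "1p0c"],
})

# value_counts() returns one count per distinct category,
# sorted from most to least frequent.
counts = toy["family_member_count"].value_counts()

print(counts)
```

These counts are the bar heights a count plot of the same column would show.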

c. Correlation Matrix

# Correlation matrix

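# numeric_only=True skips text columns such as state, areaname and county,

# which would otherwise raise an error in recent pandas versions

correlation_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(10, 6))

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

plt.title('Correlation Matrix')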

plt.show()

This step calculates and visualizes the correlation matrix between numerical columns.
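
A heatmap can be hard to read when there are many columns, so ranking the correlations with a single column of interest is a useful complement. The sketch below does this on a toy DataFrame with made-up values. Selecting numeric columns with `select_dtypes` is one possible way to exclude text columns, not necessarily how the original notebook did it.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AR"],
    "housing_cost": [8000.0, 12000.0, 10000.0, 7000.0],
    "food_cost": [5000.0, 9000.0, 7000.0, 6000.0],
    "total_cost": [40000.0, 70000.0, 55000.0, 42000.0],
})

# Keep only numerical columns so the text column 'state' is excluded.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Rank the other columns by their correlation with total_cost.
ranked = corr["total_cost"].drop("total_cost").sort_values(ascending=False)

print(ranked)
```

On the real dataset the same three lines would apply to `df`, with total_cost or any other numerical column as the target.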

These steps provide a basic exploratory analysis of the US Cost of Living Dataset, covering initial exploration, missing values analysis, summary statistics, and several visualizations. You can customize and extend this analysis based on your own research questions.

This code provides examples of three common types of visualizations:

Histogram: Shows the distribution of a numerical variable by grouping its values into bins.

Bar Chart: Visualizes categorical data with bars representing the count of each category.

Correlation Heatmap: Shows the pairwise correlation between numerical variables as a color-coded grid.
