148. Using pandas for Data Analysis

pandas is one of the most widely used Python libraries for data analysis and manipulation. It provides two core data structures for structured data: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table similar to a database table or a spreadsheet.

Here are 10 Python snippets demonstrating common data analysis tasks using pandas:

1. Creating a DataFrame

Creating a DataFrame from a dictionary of lists.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Miami']
}

df = pd.DataFrame(data)
print(df)

Explanation:

A DataFrame is created from a dictionary, where the keys are the column names and the values are lists of data.
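Once a DataFrame exists, it often helps to confirm its structure before going further. The following is a minimal sketch that reuses the data above and inspects the shape, column labels, and column types:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Miami']
}
df = pd.DataFrame(data)

# shape is a (rows, columns) tuple
print(df.shape)

# Column labels come from the dictionary keys
print(list(df.columns))

# Each column has its own dtype, inferred from the values
print(df.dtypes)
```

All lists in the dictionary must have the same length; otherwise pandas raises a ValueError.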

2. Reading Data from a CSV File

Reading a CSV file into a DataFrame.

import pandas as pd

df = pd.read_csv('data.csv')  # Replace 'data.csv' with your file path
print(df.head())  # Display the first 5 rows of the DataFrame

Explanation:

pd.read_csv() loads data from a CSV file into a DataFrame.
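To try read_csv() without a file on disk, one option is to pass a file-like object instead of a path. The sketch below assumes a small, made-up CSV held in a string (csv_text is a hypothetical name) and also shows the usecols parameter for loading only some columns:

```python
import io

import pandas as pd

# Hypothetical CSV content, used here in place of a real file
csv_text = """Name,Age,City
Alice,24,New York
Bob,27,Los Angeles
Charlie,22,Chicago
"""

# read_csv accepts a file path or any file-like object;
# usecols restricts loading to the listed columns
df = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'])

print(df.head(2))  # head(n) returns the first n rows
```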

3. DataFrame Selection and Indexing

Selecting a single column or multiple columns from a DataFrame.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

Explanation:

Use df['column_name'] to select a single column, which returns a Series, and df[['col1', 'col2']] to select multiple columns, which returns a DataFrame.
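Rows can be selected as well as columns. As a brief sketch on the same data, loc selects by label and iloc selects by integer position:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# Label-based: the row with index label 1, column 'Name'
name = df.loc[1, 'Name']
print(name)

# Position-based: the first two rows, all columns
first_two = df.iloc[:2]
print(first_two)
```

With the default integer index, labels and positions coincide; they differ once the index is reordered or replaced.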

4. Filtering Data

Filtering data based on conditions.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Filter rows where Age is greater than 23
filtered_df = df[df['Age'] > 23]
print(filtered_df)

Explanation:

The expression df['Age'] > 23 evaluates to a boolean Series. Indexing the DataFrame with it, as in df[df['Age'] > 23], keeps only the rows where the condition is True.
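Several conditions can be combined in one filter. The sketch below, on the same data, uses & for element-wise AND and isin() for membership tests; the variable names are illustrative only:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22], 'City': ['New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Each condition needs its own parentheses; & is AND, | is OR
older_outside_ny = df[(df['Age'] > 23) & (df['City'] != 'New York')]
print(older_outside_ny)

# isin() keeps rows whose value appears in the given list
coastal = df[df['City'].isin(['New York', 'Los Angeles'])]
print(coastal)
```

Python's and/or keywords do not work on Series; the parentheses are required because & and | bind more tightly than comparison operators.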

5. Handling Missing Data

Handling missing or NaN values in a DataFrame.

import pandas as pd
import numpy as np

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, np.nan, 22]}
df = pd.DataFrame(data)

# Fill missing values with the mean of the non-missing ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

Explanation:

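fillna(value) replaces NaN values with the specified value. Here the value is the mean of the 'Age' column, which mean() computes over the non-missing entries because it skips NaN by default.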
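Before filling values, it is usually worth checking how much data is missing, and dropping rows is sometimes a reasonable alternative. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, np.nan, 22]}
df = pd.DataFrame(data)

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop rows where 'Age' is missing instead of filling them
complete_rows = df.dropna(subset=['Age'])
print(complete_rows)
```

Whether to fill or drop depends on the analysis; neither call modifies df in place.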

6. Grouping Data

Grouping data by one or more columns and performing aggregation.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 32], 'City': ['NY', 'LA', 'NY', 'LA']}
df = pd.DataFrame(data)

# Group by 'City' and calculate the mean age
grouped = df.groupby('City')['Age'].mean()
print(grouped)

Explanation:

df.groupby('City') groups the data by the 'City' column and allows performing aggregation functions like mean().
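More than one aggregation can be computed per group. The following sketch uses agg() with a list of function names on the same data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 32], 'City': ['NY', 'LA', 'NY', 'LA']}
df = pd.DataFrame(data)

# One row per city, one column per aggregation
summary = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```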

7. Sorting Data

Sorting a DataFrame by one or more columns.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# Sort by Age in ascending order
sorted_df = df.sort_values('Age', ascending=True)
print(sorted_df)

Explanation:

df.sort_values('column_name') sorts the DataFrame by the specified column. Use ascending=False for descending order.
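Sorting by more than one column works by passing lists. As a sketch, assuming a 'City' column like the one in the grouping example, this sorts by city in ascending order and then by age in descending order within each city:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 32], 'City': ['NY', 'LA', 'NY', 'LA']}
df = pd.DataFrame(data)

# One ascending flag per sort column
sorted_df = df.sort_values(['City', 'Age'], ascending=[True, False])

# Sorting keeps the original index labels; reset them if a clean 0..n-1 index is wanted
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```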

8. Applying Functions to Columns

Applying a custom function to each element of a column.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# Create a function to convert age to age group
def categorize_age(age):
    if age < 25:
        return 'Young'
    elif 25 <= age < 30:
        return 'Mid-age'
    else:
        return 'Older'

# Apply the function to the 'Age' column
df['Age Group'] = df['Age'].apply(categorize_age)
print(df)

Explanation:

df['Age'].apply(func) applies a custom function to each element in the 'Age' column.
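For binning numeric values like this, pd.cut() is a vectorized alternative to apply(). The sketch below assumes ages are non-negative and reproduces the same three groups; ages outside the bins would become NaN:

```python
import numpy as np
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 25) -> 'Young', [25, 30) -> 'Mid-age', [30, inf) -> 'Older'
df['Age Group'] = pd.cut(
    df['Age'],
    bins=[0, 25, 30, np.inf],
    labels=['Young', 'Mid-age', 'Older'],
    right=False,
)
print(df)
```

The resulting column has a categorical dtype rather than plain strings, which may or may not be what a given analysis needs.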

9. Merging DataFrames

Merging two DataFrames on a common column.

import pandas as pd

data1 = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [1, 2, 4], 'Age': [24, 27, 22]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Explanation:

pd.merge(df1, df2, on='column_name') merges two DataFrames based on a common column. The how parameter defines the type of join: inner, outer, left, or right. With how='inner', only IDs present in both DataFrames (1 and 2 here) appear in the result.
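To see what the other join types keep, an outer join with indicator=True can help. This sketch reuses the two DataFrames above; the extra _merge column records where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [24, 27, 22]})

# An outer join keeps every ID from both sides and fills the gaps with NaN;
# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(merged)
```

Here ID 3 has no Age and ID 4 has no Name, so those cells are NaN in the result.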

10. Pivot Table

Creating a pivot table to summarize data.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'City': ['NY', 'LA', 'NY', 'LA'], 'Age': [24, 27, 22, 32]}
df = pd.DataFrame(data)

# Create a pivot table with average age by city
pivot_table = pd.pivot_table(df, values='Age', index='City', aggfunc='mean')
print(pivot_table)

Explanation:

pd.pivot_table(df, values='column_name', index='group_column') creates a pivot table that summarizes the values column for each group. The aggfunc parameter selects the aggregation, such as 'mean' (the default), 'sum', or 'count'.
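aggfunc also accepts a list, which gives several summaries in one table. A minimal sketch on the same data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'City': ['NY', 'LA', 'NY', 'LA'], 'Age': [24, 27, 22, 32]}
df = pd.DataFrame(data)

# With a list of functions, the result has two-level columns:
# the function name on top, the values column below
pivot = pd.pivot_table(df, values='Age', index='City', aggfunc=['mean', 'count'])
print(pivot)

# Individual cells are addressed with a (function, column) tuple
print(pivot.loc['NY', ('mean', 'Age')])
```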

Conclusion:

pandas provides a comprehensive set of tools for handling and analyzing structured data. Whether the task is basic manipulation, cleaning, aggregation, or more advanced analysis, pandas lets you focus on the logic of your analysis rather than the implementation details.

