A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). It is similar to a column in a spreadsheet or a SQL table.
import pandas as pd
# Create a Series
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']
s = pd.Series(data, index=index)
print(s)
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a spreadsheet or a SQL table.
# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Pandas provides various functions to read data from different file formats such as CSV, Excel, JSON, etc.
# Read a CSV file
csv_file = 'data.csv'
df = pd.read_csv(csv_file)
print(df.head())
You can select specific columns, rows, or a combination of both from a DataFrame.
# Select a single column
ages = df['Age']
print(ages)
# Select rows based on a condition
young_people = df[df['Age'] < 30]
print(young_people)
Data cleaning involves handling missing values, incorrect data types, and inconsistent data.
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)
# Fill missing values with a specific value
df = df.fillna(0)
Data transformation includes operations such as applying functions to columns, creating new columns, and reshaping data.
# Apply a function to a column
df['Age_in_5_years'] = df['Age'].apply(lambda x: x + 5)
print(df)
Missing values are a common issue in real-world data. You can handle them by dropping rows or columns with missing values or filling them with appropriate values.
# Drop rows with missing values
df = df.dropna()
# Fill missing values with the mean of the column
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
Duplicate rows can distort your analysis. You can remove them using the drop_duplicates()
method.
# Remove duplicate rows
df = df.drop_duplicates()
Data aggregation involves summarizing data by groups. You can use the groupby()
method to group data and apply aggregation functions such as sum()
, mean()
, etc.
# Group data by a column and calculate the mean of another column
grouped = df.groupby('City')['Age'].mean()
print(grouped)
Write clean and readable code by using meaningful variable names, adding comments, and following a consistent coding style.
# Use meaningful variable names
ages = df['Age']
above_30 = ages[ages > 30]
print(above_30)
For large datasets, performance can be a concern. Use vectorized operations instead of loops, and consider using data types that consume less memory.
# Vectorized operation
df['Double_Age'] = df['Age'] * 2
Data wrangling with Pandas is an essential skill for any data analyst or scientist. By understanding the fundamental concepts, usage methods, common practices, and best practices, you can efficiently clean, transform, and analyze data. Pandas provides a rich set of functions and data structures that make data wrangling a breeze.