The data science workflow typically consists of several sequential steps: loading the raw data, exploring it, cleaning and preprocessing it, visualizing it, engineering features, training a model, and evaluating the results.
In Python, we often use the pandas library to load and manipulate tabular data. Here is an example of loading a CSV file:
```python
import pandas as pd

# Load a CSV file
data = pd.read_csv('example.csv')

# Display the first few rows of the data
print(data.head())
```
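In practice, `read_csv` rarely works perfectly on the first try. Here is a sketch of a few commonly useful options; the column name and the extra missing-value strings below are placeholders you should adjust to match your own file:

```python
# A sketch of common read_csv options; 'date_col' and the na_values
# strings are placeholders, so adjust them to match your file.
data = pd.read_csv(
    'example.csv',
    sep=',',                   # column delimiter; use '\t' for TSV files
    parse_dates=['date_col'],  # parse this column as datetimes
    na_values=['NA', '?'],     # extra strings to treat as missing
)
```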
We can use various methods to explore the data, such as checking the data types, shape, and summary statistics.
```python
# Check data types
print(data.dtypes)

# Check the shape of the data
rows, columns = data.shape
if rows > 0 and columns > 0:
    print(f"The data has {rows} rows and {columns} columns.")

# Get summary statistics
print(data.describe())
```
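As a shortcut, `DataFrame.info()` combines several of these checks in one call, reporting column dtypes, non-null counts, and memory usage:

```python
# One-call overview: dtypes, non-null counts, and memory usage
data.info()
```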
Missing values can be handled in several ways, such as dropping rows or columns with missing values or filling them with appropriate values.
```python
# Check for missing values
print(data.isnull().sum())

# Option 1: drop rows that contain any missing values
data = data.dropna()

# Option 2 (instead of dropping): fill missing values with the column mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
```
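For a reusable preprocessing step, scikit-learn's `SimpleImputer` offers another option. This is a minimal sketch, assuming `'column_name'` is a numeric column in your data:

```python
from sklearn.impute import SimpleImputer

# Median imputation is more robust to outliers than the mean;
# 'column_name' is a placeholder for a numeric column.
imputer = SimpleImputer(strategy='median')
data[['column_name']] = imputer.fit_transform(data[['column_name']])
```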
Outliers can be detected using statistical methods like the interquartile range (IQR).
```python
# Compute the quartiles and IQR for the column of interest
Q1 = data['column_name'].quantile(0.25)
Q3 = data['column_name'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within 1.5 * IQR of the quartiles
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data = data[(data['column_name'] >= lower_bound) & (data['column_name'] <= upper_bound)]
```
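The IQR rule is only one such statistical method; a z-score filter is another common choice. Here is a minimal sketch in plain pandas, again assuming `'column_name'` is numeric:

```python
# Z-score method: keep rows within 3 standard deviations of the mean
z_scores = (data['column_name'] - data['column_name'].mean()) / data['column_name'].std()
data = data[z_scores.abs() <= 3]
```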
We can use libraries like matplotlib and seaborn for data visualization.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with a kernel density estimate overlaid
sns.histplot(data['column_name'], kde=True)
plt.show()

# Scatter plot of two columns
sns.scatterplot(x='column1', y='column2', data=data)
plt.show()

# Correlation matrix (numeric_only avoids errors from non-numeric columns)
correlation_matrix = data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
```
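If you are running these scripts non-interactively, you may want to write figures to disk rather than calling `plt.show()`; the filename here is purely illustrative:

```python
# Save the current figure to a file instead of displaying it
plt.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight')
```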
Feature engineering involves creating new features or transforming existing ones.
```python
# Create a new feature by combining two existing features
data['new_feature'] = data['column1'] + data['column2']

# One-hot encoding for categorical variables
data = pd.get_dummies(data, columns=['categorical_column'])
```
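Transforming existing features can matter as much as creating new ones. For right-skewed numeric columns, a log transform is a common choice; this sketch uses a hypothetical column name:

```python
import numpy as np

# log1p computes log(1 + x), which handles zeros safely;
# 'skewed_column' is a placeholder for a right-skewed numeric column.
data['skewed_column_log'] = np.log1p(data['skewed_column'])
```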
Let’s assume we are doing a simple linear regression. We will use the scikit-learn library.
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Separate features and target variable
X = data.drop('target_column', axis=1)
y = data['target_column']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
```
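Once the model is fitted, its learned parameters are available as attributes, which is useful for a quick sanity check:

```python
# Inspect the fitted parameters: one coefficient per feature column
print("Intercept:", model.intercept_)
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.4f}")
```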
We can use different metrics to evaluate the performance of the model.
```python
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
```
The data science workflow is a complex but well-structured process that lets us extract valuable insights from raw data, and Python provides a rich ecosystem of libraries that make each step more accessible. By following the steps outlined in this blog, from loading raw data to evaluating models, you can effectively perform data science tasks in Python. Remember that data cleaning and preprocessing are crucial steps that can significantly affect model performance, and that continuous exploration and experimentation are key to finding the best solutions for your data science problems.