Python is an interpreted, high - level, general - purpose programming language. In data science, basic Python constructs such as variables, data types (e.g., integers, floats, strings), and control flow statements (if - else, for loops, while loops) are essential.
# Variable assignment
x = 10
y = 20
z = x + y
print(z)
# For loop example
numbers = [1, 2, 3, 4, 5]
for num in numbers:
print(num * 2)
my_list = [1, 2, 3, 'apple', 'banana']
print(my_list[3])
my_tuple = (1, 2, 3)
print(my_tuple[1])
my_dict = {'name': 'John', 'age': 30}
print(my_dict['name'])
Object - oriented programming (OOP) in Python allows you to create classes and objects. In data science, OOP can be used to organize code and create reusable components.
class DataPoint:
def __init__(self, x, y):
self.x = x
self.y = y
def distance_from_origin(self):
return (self.x**2 + self.y**2)**0.5
point = DataPoint(3, 4)
print(point.distance_from_origin())
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.mean())
import pandas as pd
data = {'Name': ['John', 'Jane'], 'Age': [30, 25]}
df = pd.DataFrame(data)
print(df)
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.show()
Pandas provides powerful functions for data manipulation and cleaning. For example, you can filter rows based on a condition.
import pandas as pd
data = {'Name': ['John', 'Jane', 'Bob'], 'Age': [30, 25, 40]}
df = pd.DataFrame(data)
filtered_df = df[df['Age'] > 25]
print(filtered_df)
Matplotlib and Seaborn are commonly used for data visualization.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
data = {'Category': ['A', 'B', 'C'], 'Value': [10, 20, 30]}
df = pd.DataFrame(data)
sns.barplot(x='Category', y='Value', data=df)
plt.show()
Pandas can read and write data in various formats such as CSV, Excel, and JSON.
import pandas as pd
# Reading a CSV file
df = pd.read_csv('data.csv')
# Writing a DataFrame to a CSV file
df.to_csv('new_data.csv', index=False)
Missing values are common in real - world data. You can use Pandas to handle them.
import pandas as pd
import numpy as np
data = {'Name': ['John', np.nan, 'Bob'], 'Age': [30, 25, np.nan]}
df = pd.DataFrame(data)
df = df.dropna() # Drop rows with missing values
print(df)
Feature engineering involves creating new features from existing ones. For example, you can create a new feature by combining two existing features.
import pandas as pd
data = {'Height': [170, 180, 165], 'Weight': [70, 80, 65]}
df = pd.DataFrame(data)
df['BMI'] = df['Weight'] / ((df['Height']/100)**2)
print(df)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
squared_arr = arr**2
print(squared_arr)
Use Git for version control. You can create a repository for your data science project and track changes over time.
Document your code using docstrings. For example:
def add_numbers(a, b):
"""
This function adds two numbers.
Args:
a (int or float): The first number.
b (int or float): The second number.
Returns:
int or float: The sum of a and b.
"""
return a + b
Mastering Python for data science is a journey that involves understanding fundamental concepts, learning usage methods, adopting common practices, and following best practices. By leveraging Python’s rich ecosystem of libraries and applying the techniques discussed in this guide, you can become proficient in data science tasks such as data manipulation, visualization, and analysis.