Data Science is one of the most in-demand skills today. Let's walk through a roadmap for learning it from scratch.
## What Is Data Science?

### Definition

Data Science is a combination of:

- Statistics & Mathematics
- Programming
- Domain Knowledge

used to extract insights from data and make data-driven decisions.
### The Data Scientist Role

Responsibilities:

- Collect and clean data
- Exploratory Data Analysis (EDA)
- Build predictive models
- Communicate insights
- Deploy ML models
## Learning Roadmap

### Phase 1: Foundations (1-2 months)

1. Python Basics
   - Variables, data types
   - Control flow
   - Functions
   - OOP basics
2. Basic Statistics
   - Mean, median, mode
   - Standard deviation
   - Probability
   - Distributions
3. Linear Algebra
   - Vectors and matrices
   - Matrix operations
   - Eigenvalues
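The statistics topics above can be tried out in plain Python before reaching for any data libraries; here is a minimal sketch (the numbers are made up) using only the standard library's `statistics` module:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # sum(data) / len(data)
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value
stdev = statistics.pstdev(data)   # population standard deviation

print(mean, median, mode, stdev)  # 5.0 4.5 4 2.0
```

Once these feel familiar, the same quantities come back everywhere in NumPy and pandas as `.mean()`, `.median()`, `.std()`, and so on.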
### Phase 2: Data Analysis (2-3 months)

1. NumPy
   - Arrays
   - Broadcasting
   - Linear algebra operations
2. Pandas
   - DataFrames
   - Data manipulation
   - Groupby, merge, pivot
3. Data Visualization
   - Matplotlib
   - Seaborn
   - Plotly
### Phase 3: Machine Learning (3-4 months)

1. Supervised Learning
   - Linear Regression
   - Logistic Regression
   - Decision Trees
   - Random Forest
   - SVM
2. Unsupervised Learning
   - K-Means Clustering
   - PCA
   - Hierarchical Clustering
3. Model Evaluation
   - Train/test split
   - Cross-validation
   - Metrics (accuracy, precision, recall)
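To make the evaluation metrics above concrete, here is a small sketch (the labels are invented) that computes precision and recall by hand from true/false positive counts and checks them against scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions: 0.75
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 3/4
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 3/4
```

Accuracy alone is misleading on imbalanced data, which is exactly why precision and recall are on the roadmap.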
### Phase 4: Advanced Topics (ongoing)

- Deep Learning (TensorFlow/PyTorch)
- Natural Language Processing
- Computer Vision
- Time Series Analysis
- Big Data (Spark)
## Python for Data Science

### Environment Setup

```bash
# Install Anaconda: download from anaconda.com
# Or install the core libraries with pip:
pip install numpy pandas matplotlib seaborn scikit-learn jupyter

# Start Jupyter
jupyter notebook
```
### NumPy Basics

```python
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Element-wise operations
print(arr + 10)      # [11 12 13 14 15]
print(arr * 2)       # [ 2  4  6  8 10]
print(np.mean(arr))  # 3.0
print(np.std(arr))   # ~1.414

# Matrix operations
print(matrix.shape)  # (2, 3)
print(matrix.T)      # transpose

# Random numbers
np.random.seed(42)
random_arr = np.random.randn(5)  # samples from the standard normal distribution
```
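Broadcasting, listed in the roadmap above but not shown yet, deserves its own small example: when shapes are compatible, NumPy stretches the smaller array across the larger one instead of requiring explicit loops.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

# The 1-D row is broadcast across both rows of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# A column vector of shape (2, 1) broadcasts across the columns instead
col = np.array([[100], [200]])
print(matrix + col)
# [[101 102 103]
#  [204 205 206]]
```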
### Pandas Basics

```python
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'nama': ['Budi', 'Ani', 'Citra'],
    'umur': [25, 23, 28],
    'kota': ['Jakarta', 'Bandung', 'Surabaya']
})

# Basic operations
print(df.head())      # first rows (up to 5)
print(df.info())      # dtypes and non-null counts
print(df.describe())  # summary statistics

# Selection
print(df['nama'])            # single column
print(df[['nama', 'umur']])  # multiple columns
print(df[df['umur'] > 24])   # boolean filter

# Groupby
print(df.groupby('kota')['umur'].mean())

# Handling missing values
df.fillna(0)  # returns a copy with NaN replaced by 0
df.dropna()   # returns a copy with NaN rows dropped

# Read/write data
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
```
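The roadmap also lists merge and pivot alongside groupby; here is a short sketch with made-up order data showing both:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'amount': [100, 150, 200],
})
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'city': ['Jakarta', 'Bandung'],
})

# merge works like a SQL join on a shared key column
merged = orders.merge(customers, on='customer_id', how='left')
print(merged)

# pivot_table aggregates one column, grouped by another
pivot = merged.pivot_table(values='amount', index='city', aggfunc='sum')
print(pivot)  # Bandung: 200, Jakarta: 250
```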
## Data Visualization

### Matplotlib

```python
import numpy as np
import matplotlib.pyplot as plt

# Line plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Line Plot')
plt.show()

# Scatter plot
plt.scatter(x, y)
plt.show()

# Bar plot
categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.bar(categories, values)
plt.show()

# Histogram
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.show()

# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].bar(categories, values)
axes[1, 1].hist(data)
plt.tight_layout()
plt.show()
```
### Seaborn

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset
tips = sns.load_dataset('tips')

# Distribution plot
sns.histplot(tips['total_bill'])
plt.show()

# Box plot
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()

# Scatter with hue
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=tips)
plt.show()

# Correlation heatmap (tips has text columns, so restrict to numeric ones)
correlation = tips.corr(numeric_only=True)
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()

# Pair plot
sns.pairplot(tips, hue='sex')
plt.show()
```
## Machine Learning with Scikit-learn

### Linear Regression

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = df[['feature1', 'feature2']]
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"Coefficients: {model.coef_}")
```
### Classification

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
### Cross-Validation

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
### Feature Importance

```python
# Get feature importances from the trained random forest
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance)

# Visualize
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
```
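The roadmap's unsupervised topics (K-Means, PCA) have no code example so far, so here is a minimal, self-contained sketch on synthetic data; the blob parameters are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two well-separated blobs in 4 dimensions
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 4))
X = np.vstack([blob_a, blob_b])

# K-Means with k=2 should recover the two blobs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:5], labels[-5:])  # each blob maps to a single cluster id

# PCA projects the 4-D data down to 2 components for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```

On real data you rarely know `k` in advance; the elbow method or silhouette scores are the usual ways to pick it.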
## Exploratory Data Analysis (EDA)

### EDA Checklist

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load data
df = pd.read_csv('data.csv')

# 2. Basic info
print(df.shape)
print(df.info())
print(df.describe())

# 3. Check missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # as percentages

# 4. Check duplicates
print(df.duplicated().sum())

# 5. Data types
print(df.dtypes)

# 6. Unique values
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

# 7. Distributions
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.histplot(df[col])
    plt.title(col)
    plt.show()

# 8. Correlations
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# 9. Outliers
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()
```
## Data Preprocessing

### Handling Missing Values

```python
# Check missing values per column
print(df.isnull().sum())

# Drop rows with missing values
df_clean = df.dropna()

# Fill with mean/median (numeric columns)
df['column'] = df['column'].fillna(df['column'].mean())

# Fill with mode (categorical columns)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Forward fill (fillna(method='ffill') is deprecated in modern pandas)
df['column'] = df['column'].ffill()
```
### Feature Encoding

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-hot encoding with pandas
df_encoded = pd.get_dummies(df, columns=['category'])

# Scikit-learn OneHotEncoder (sparse_output replaced sparse in scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['category']])
```
### Feature Scaling

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalization (scale to the 0-1 range)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
```
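One caveat worth adding: fitting a scaler on the full dataset before splitting leaks test-set statistics into training. A scikit-learn `Pipeline` bundles preprocessing and the model so the scaler is fit only on training data; here is a sketch on synthetic, made-up data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on X_train only, inside the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # test accuracy
```

The same pipeline object can be passed straight to `cross_val_score`, which re-fits the scaler on each fold automatically.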
## Project Ideas for Your Portfolio

### Beginner Projects

1. Titanic Survival Prediction
   - Kaggle classic
   - Classification problem
2. House Price Prediction
   - Regression problem
   - Feature engineering
3. Customer Segmentation
   - Clustering
   - RFM analysis
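RFM (Recency, Frequency, Monetary) analysis, mentioned above, boils down to three groupby aggregations over a transactions table; here is a sketch with an invented transaction log:

```python
import pandas as pd

# Hypothetical transaction log
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-15',
                            '2024-02-20', '2024-03-10', '2024-01-05']),
    'amount': [100, 50, 200, 80, 120, 300],
})

now = pd.Timestamp('2024-04-01')
rfm = tx.groupby('customer_id').agg(
    recency=('date', lambda d: (now - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                       # number of purchases
    monetary=('amount', 'sum'),                        # total spend
)
print(rfm)
```

From here, customers are typically binned into score quantiles per column and segmented (e.g. "recent, frequent, high-spend" as the best segment).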
### Intermediate Projects

1. Sentiment Analysis
   - NLP
   - Twitter/review data
2. Stock Price Prediction
   - Time series
   - LSTM
3. Recommendation System
   - Collaborative filtering
   - Content-based
## Resources

### Learning Platforms

Free:

- Kaggle Learn
- Google ML Crash Course
- Fast.ai
- DataCamp (some courses are free)

Paid:

- Coursera (Andrew Ng's courses)
- DataCamp
- Udemy

### Practice

- Kaggle Competitions
- DrivenData
- Analytics Vidhya
- HackerRank
## Conclusion

Data Science is a long journey. Start with Python and statistics, then gradually build up to machine learning. Practicing on real projects is the key.

Written by Hendra Wijaya