
How to Learn Data Science for Beginners

6 minute read

Data Science is one of the most in-demand skills today. Let's walk through a roadmap for starting from zero.

What is Data Science?

Definition

Data Science is a combination of:
- Statistics & Mathematics
- Programming
- Domain Knowledge

These three are combined to extract insights from data and support data-driven decisions.

The Role of a Data Scientist

Responsibilities:
- Collect and clean data
- Exploratory Data Analysis (EDA)
- Build predictive models
- Communicate insights
- Deploy ML models

Learning Roadmap

Phase 1: Foundations (1-2 months)

1. Python Basics
   - Variables, data types
   - Control flow
   - Functions
   - OOP basics
2. Basic Statistics
   - Mean, median, mode
   - Standard deviation
   - Probability
   - Distributions
3. Linear Algebra
   - Vectors and matrices
   - Matrix operations
   - Eigenvalues
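The Phase 1 statistics topics can be tried immediately with Python's built-in statistics module; a minimal sketch on a small made-up sample:

```python
import statistics as st

# A small illustrative sample
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(st.mean(data))    # 5
print(st.median(data))  # 4.5 (average of the two middle values)
print(st.mode(data))    # 4  (most frequent value)
print(st.pstdev(data))  # 2.0 (population standard deviation)
```

Reproducing these by hand (sum the squared deviations, divide by n, take the square root) is a good check that the formulas have sunk in.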

Phase 2: Data Analysis (2-3 months)

1. NumPy
   - Arrays
   - Broadcasting
   - Linear algebra operations
2. Pandas
   - DataFrames
   - Data manipulation
   - Groupby, merge, pivot
3. Data Visualization
   - Matplotlib
   - Seaborn
   - Plotly

Phase 3: Machine Learning (3-4 months)

1. Supervised Learning
   - Linear Regression
   - Logistic Regression
   - Decision Trees
   - Random Forest
   - SVM
2. Unsupervised Learning
   - K-Means Clustering
   - PCA
   - Hierarchical Clustering
3. Model Evaluation
   - Train/test split
   - Cross-validation
   - Metrics (accuracy, precision, recall)
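The unsupervised techniques listed above don't get their own code section later in this post, so here is a minimal scikit-learn sketch on synthetic data (two well-separated blobs, invented here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy data: two blobs in 4-D space
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 4)),
    rng.normal(5, 0.5, size=(50, 4)),
])

# K-Means: partition the points into 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # first blob vs second blob

# PCA: project the 4-D data down to 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (100, 2)
print(pca.explained_variance_ratio_)
```

With real data the clusters are rarely this clean; choosing the number of clusters (elbow method, silhouette score) becomes part of the job.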

Phase 4: Advanced Topics (ongoing)

- Deep Learning (TensorFlow/PyTorch)
- Natural Language Processing
- Computer Vision
- Time Series Analysis
- Big Data (Spark)

Python for Data Science

Setup Environment

# Install Anaconda
# Download from anaconda.com

pip install numpy pandas matplotlib seaborn scikit-learn jupyter

Start Jupyter

jupyter notebook

NumPy Basics

import numpy as np

Create arrays

arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

Array operations

print(arr + 10)      # [11, 12, 13, 14, 15]
print(arr * 2)       # [2, 4, 6, 8, 10]
print(np.mean(arr))  # 3.0
print(np.std(arr))   # 1.414

Matrix operations

print(matrix.shape)  # (2, 3)
print(matrix.T)      # Transpose

Random

np.random.seed(42)
random_arr = np.random.randn(5)  # Normal distribution
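Broadcasting, listed in the Phase 2 roadmap, deserves a quick demo: NumPy stretches arrays of compatible shapes so they can be combined without explicit loops. A minimal sketch:

```python
import numpy as np

# A (3, 1) column and a (3,) row broadcast to a (3, 3) grid
col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])          # shape (3,)

grid = col + row                   # shape (3, 3)
print(grid)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# The same mechanism handles per-column operations on a matrix
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
centered = matrix - matrix.mean(axis=0)  # subtract each column's mean
print(centered)  # each column now has mean 0
```

The rule: dimensions are compared right-to-left, and a dimension of size 1 (or a missing one) is stretched to match.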

Pandas Basics

import pandas as pd

Create DataFrame

df = pd.DataFrame({
    'nama': ['Budi', 'Ani', 'Citra'],
    'umur': [25, 23, 28],
    'kota': ['Jakarta', 'Bandung', 'Surabaya']
})

Basic operations

print(df.head())      # First 5 rows
print(df.info())      # Data types
print(df.describe())  # Statistics

Selection

print(df['nama'])            # Single column
print(df[['nama', 'umur']])  # Multiple columns
print(df[df['umur'] > 24])   # Filter

Read/Write data

df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

Groupby

df.groupby('kota')['umur'].mean()
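The roadmap also lists merge and pivot; a minimal sketch using small made-up DataFrames (the column names and values here are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    'kota': ['Jakarta', 'Bandung', 'Jakarta'],
    'produk': ['A', 'B', 'B'],
    'jumlah': [3, 1, 2],
})
cities = pd.DataFrame({
    'kota': ['Jakarta', 'Bandung'],
    'pulau': ['Jawa', 'Jawa'],
})

# Merge: SQL-style join on the shared 'kota' key
merged = pd.merge(orders, cities, on='kota', how='left')

# Pivot table: rows = kota, columns = produk, values = total jumlah
pivot = merged.pivot_table(index='kota', columns='produk',
                           values='jumlah', aggfunc='sum', fill_value=0)
print(pivot)
# produk   A  B
# kota
# Bandung  0  1
# Jakarta  3  2
```

`how='left'` keeps every order even if a city is missing from the lookup table; `how='inner'` would drop unmatched rows instead.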

Handling missing values

df.fillna(0)
df.dropna()

Data Visualization

Matplotlib

import matplotlib.pyplot as plt

Line plot

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Line Plot')
plt.show()

Scatter plot

plt.scatter(x, y)
plt.show()

Bar plot

categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.bar(categories, values)
plt.show()

Histogram

data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.show()

Multiple subplots

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].bar(categories, values)
axes[1, 1].hist(data)
plt.tight_layout()
plt.show()

Seaborn

import seaborn as sns

Load sample dataset

tips = sns.load_dataset('tips')

Distribution plot

sns.histplot(tips['total_bill'])
plt.show()

Box plot

sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()

Scatter with hue

sns.scatterplot(x='total_bill', y='tip', hue='sex', data=tips)
plt.show()

Heatmap

correlation = tips.corr(numeric_only=True)  # tips has categorical columns
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()

Pair plot

sns.pairplot(tips, hue='sex')
plt.show()

Machine Learning with Scikit-learn

Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Prepare data

X = df[['feature1', 'feature2']]
y = df['target']

Split data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Train model

model = LinearRegression()
model.fit(X_train, y_train)

Predict

y_pred = model.predict(X_test)

Evaluate

print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"Coefficients: {model.coef_}")

Classification

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Train model

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

Predict

y_pred = clf.predict(X_test)

Evaluate

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Cross-Validation

from sklearn.model_selection import cross_val_score

5-fold cross-validation

scores = cross_val_score(clf, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

Feature Importance

# Get feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance)

Visualize

plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()

Exploratory Data Analysis (EDA)

EDA Checklist

# 1. Load data
df = pd.read_csv('data.csv')

2. Basic info

print(df.shape)
print(df.info())
print(df.describe())

3. Check missing values

print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)

4. Check duplicates

print(df.duplicated().sum())

5. Data types

print(df.dtypes)

6. Unique values

for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

7. Distributions

for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.histplot(df[col])
    plt.title(col)
    plt.show()

8. Correlations

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

9. Outliers

for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()

Data Preprocessing

Handling Missing Values

# Check missing
print(df.isnull().sum())

Drop missing

df_clean = df.dropna()

Fill with mean/median

df['column'] = df['column'].fillna(df['column'].mean())

Fill with mode (categorical)

df['category'] = df['category'].fillna(df['category'].mode()[0])

Forward/backward fill

df['column'] = df['column'].ffill()  # fillna(method='ffill') is deprecated
df['column'] = df['column'].bfill()

Feature Encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Label encoding

le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

One-hot encoding

df_encoded = pd.get_dummies(df, columns=['category'])

Scikit-learn OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed in scikit-learn 1.2
encoded = ohe.fit_transform(df[['category']])

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

Standardization (mean=0, std=1)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Normalization (0-1)

minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
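One caveat worth knowing: scalers should be fit on the training set only, or information leaks from the test set into the model. scikit-learn's Pipeline handles this automatically. A sketch on synthetic data (the features and labels are invented here for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with wildly different feature scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 0.01]
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on X_train only during fit(),
# then the same transform is applied to X_test inside score()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The same pipeline object can be passed straight to `cross_val_score`, so each fold gets its own properly fitted scaler.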

Portfolio Project Ideas

Beginner Projects

1. Titanic Survival Prediction
   - Kaggle classic
   - Classification problem
2. House Price Prediction
   - Regression problem
   - Feature engineering
3. Customer Segmentation
   - Clustering
   - RFM analysis

Intermediate Projects

1. Sentiment Analysis
   - NLP
   - Twitter/review data
2. Stock Price Prediction
   - Time series
   - LSTM
3. Recommendation System
   - Collaborative filtering
   - Content-based

Resources

Learning Platforms

Free:
- Kaggle Learn
- Google ML Crash Course
- Fast.ai
- DataCamp (some free content)

Paid:
- Coursera (Andrew Ng courses)
- DataCamp
- Udemy

Practice

- Kaggle Competitions
- DrivenData
- Analytics Vidhya
- HackerRank

Conclusion

Data Science is a long journey. Start with Python and statistics, then gradually build up to machine learning. Practicing on real projects is the key.

Written by

Hendra Wijaya
