
How to Learn Machine Learning for Beginners

8 minute read

Machine Learning is the branch of AI that enables computers to learn from data. Let's work through the fundamentals.

What is Machine Learning?

Definition

Machine Learning means:
- Computers learn from data
- Without being explicitly programmed
- Performance improves with experience
- Pattern recognition & prediction
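
To make "learning from data" concrete, here is a minimal sketch (a toy example of my own using scikit-learn; the data is invented): instead of hand-coding a rule, the model infers one from labeled examples.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied (feature) and pass/fail (label); values are invented
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])

# No explicit "if hours > 3.5" rule is written anywhere;
# the model learns the decision boundary from the examples
model = LogisticRegression().fit(hours, passed)
print(model.predict([[2.0], [5.0]]))  # [0 1]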

Types of Machine Learning

1. Supervised Learning
   - Has labels/targets
   - Learns from examples
   - Classification & regression
2. Unsupervised Learning
   - No labels
   - Finds patterns in the data
   - Clustering & dimensionality reduction
3. Reinforcement Learning
   - Learns from rewards/punishments
   - Trial and error
   - Games, robotics (a minimal sketch follows this list)
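
Reinforcement learning gets no code later in this article, so here is a minimal epsilon-greedy bandit sketch to make the reward and trial-and-error idea concrete (a toy example of mine; the payouts and epsilon value are made up, and this is not any library's RL API):

import numpy as np

# Three "slot machines" with different average payouts, unknown to the agent
rng = np.random.default_rng(42)
true_means = [0.2, 0.5, 0.8]
q_values = np.zeros(3)   # the agent's estimated value of each action
counts = np.zeros(3)
epsilon = 0.1            # fraction of the time spent exploring

for _ in range(1000):
    # Trial and error: usually exploit the best estimate, sometimes explore
    if rng.random() < epsilon:
        action = int(rng.integers(3))
    else:
        action = int(np.argmax(q_values))
    reward = rng.normal(true_means[action], 0.1)  # reward signal
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward
    q_values[action] += (reward - q_values[action]) / counts[action]

print(q_values)  # approaches [0.2, 0.5, 0.8]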

Prerequisites

Mathematics

Linear Algebra:
- Vectors and matrices
- Matrix operations
- Eigenvalues/eigenvectors

Statistics:
- Probability distributions
- Bayes' theorem (see the worked example below)
- Hypothesis testing
- Mean, variance, std
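
Bayes' theorem is worth internalizing early. A short worked example (with invented numbers) showing why a positive test for a rare disease is less alarming than it sounds:

# A disease affects 1% of people; the test is 95% sensitive, 90% specific
p_disease = 0.01
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.10   # 1 - specificity (false positive rate)

# Total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive) = P(pos | disease) * P(disease) / P(pos)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.1%}")  # about 8.8%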

Calculus:
- Derivatives
- Gradient descent (sketched below)
- Chain rule
- Partial derivatives
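
Gradient descent ties these pieces together, so here is a minimal sketch on a toy function of my own choosing:

# Minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3)
w = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (w - 3)         # slope of the loss at the current w
    w -= learning_rate * gradient  # step downhill, against the gradient
print(w)  # converges to 3.0, the minimum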

Programming

Python basics:
- Variables, data types
- Control flow
- Functions
- OOP

Libraries:
- NumPy (arrays)
- Pandas (data manipulation)
- Matplotlib/Seaborn (visualization)
- Scikit-learn (ML algorithms)
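
As a quick, illustrative taste of how these libraries fit together (synthetic data, my own example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50)                          # NumPy: numeric arrays
df = pd.DataFrame({'x': x, 'y': 2 * x + 1})         # Pandas: tabular data
model = LinearRegression().fit(df[['x']], df['y'])  # Scikit-learn: fit a model
plt.scatter(df['x'], df['y'])                       # Matplotlib: visualization
plt.plot(df['x'], model.predict(df[['x']]), color='red')
plt.show()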

Supervised Learning

Classification

# Binary Classification Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))

Regression

# Linear Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 3 + np.random.randn(100)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Common Algorithms

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Decision Tree
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# SVM
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train_scaled, y_train)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)

Unsupervised Learning

Clustering

from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Find optimal K (Elbow method)
inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Gaussian Mixture
gmm = GaussianMixture(n_components=4, random_state=42)
gmm_labels = gmm.fit_predict(X)

Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2f}")

# t-SNE (for visualization)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1])
axes[0].set_title('PCA')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1])
axes[1].set_title('t-SNE')
plt.show()

Model Evaluation

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve
)

# Basic metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# ROC Curve (use the scaled features the model was trained on)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Regression Metrics

from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}") print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}") print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}") print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

Cross-Validation

from sklearn.model_selection import cross_val_score, KFold

# K-Fold Cross Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Feature Engineering

Handling Missing Values

import pandas as pd
from sklearn.impute import SimpleImputer

# Example DataFrame with missing values (illustrative)
df = pd.DataFrame({'age': [25, None, 40], 'income': [50000, 60000, None]})

# Check missing
print(df.isnull().sum())

# Drop
df_clean = df.dropna()

# Impute with mean/median/mode
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

Encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label Encoding (assumes df has a 'category' column)
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['category'])

# Or using sklearn (sparse_output requires scikit-learn >= 1.2;
# older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['category']])

Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler (z-score)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# MinMaxScaler (0-1 range)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)

# RobustScaler (resistant to outliers)
robust = RobustScaler()
X_robust = robust.fit_transform(X)

Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif, RFE

# Select K Best (assumes X is a DataFrame with at least 10 features)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()

# Recursive Feature Elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Feature Importance
rf = RandomForestClassifier()
rf.fit(X, y)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

Hyperparameter Tuning

Grid Search

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_

Random Search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Random Search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")

ML Pipeline

Complete Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Define preprocessing
numeric_features = ['age', 'income']
categorical_features = ['gender', 'city']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Create pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Save model
import joblib
joblib.dump(pipeline, 'model.pkl')

# Load model
loaded_model = joblib.load('model.pkl')

Resources

Learning Path

1. Mathematics fundamentals (2-4 weeks)
2. Python and libraries (2-4 weeks)
3. Supervised learning (4-6 weeks)
4. Unsupervised learning (2-4 weeks)
5. Deep learning basics (4-6 weeks)
6. Projects and practice (ongoing)

Courses

Free:
- Andrew Ng's ML Course (Coursera)
- Fast.ai
- Google ML Crash Course
- Kaggle Learn

Paid:
- DataCamp
- Udemy
- Coursera Specializations

Practice

- Kaggle Competitions
- UCI ML Repository
- Scikit-learn toy datasets
- Real-world projects

Conclusion

Machine Learning is a broad field. Start with the fundamentals, practice on real datasets, and gradually move on to advanced topics like deep learning.

Written by

Hendra Wijaya
