一種元啟發式(Metaheuristic)遺傳算法(Genetic Algorithm)特徵選擇器
介紹
兩種搜尋策略概念
啟發算法（Heuristic）
元啟發算法（Metaheuristic）
特徵選擇（Feature Selection）
包裝式方法（Wrapper Feature Selection）
遺傳算法包裝式特徵選擇實作項目
環境與資料集
包裝式特徵選擇(Wrapper Feature Selection )
遺傳算法（Genetic Algorithm, GA）
初始化
個體評分
交叉(Crossover) 與變異(Mutation)
分析結果
參考資料

資料科學開發

2025-10-25

一種元啟發式(Metaheuristic)遺傳算法(Genetic Algorithm)特徵選擇器

介紹

使用遺傳算法（Genetic Algorithm, GA）這種元啟發式方法，來自動挑選資料集中特徵子集，透過 Wrapper Feature Selection 評估每個子集對模型表現（Accuracy）的好壞，找出「最有效的特徵組合」。

從Genetic Algorithms(GAs)演算法的理論來看，資料集的每個資料行視為個體，而每個列的特徵視為染色體或解方(chromosome/solutions)，每個世代(Generation)的個體為同一個種群（Population），直到種群隨著演算法進行下一個世代(Generation)的迭代，產生新一代種群

在開始介紹遺傳算法（Genetic Algorithm, GA）之前，這裡有兩種搜尋策略概念，分別對應到全局最優解和局部最優解兩種情境，這也是人工智能領域重要的概念之一。

兩種搜尋策略概念

啟發算法（Heuristic）

以簡化規則或經驗法則快速產生可行解，不保證全域最優，但計算速度快，適合特定問題的特徵子集搜尋或初步篩選。

目前已有的演算法

前向逐步選擇（SFS, Sequential Forward Selection）
後向逐步剔除（SBS, Sequential Backward Selection）
貪婪搜索（Greedy Search）
最近鄰搜索
基於重要性排序

元啟發算法（Metaheuristic）

高階啟發式策略，用於指導搜尋過程，結合探索（exploration）與利用（exploitation），具有全局搜尋能力，可避免陷入局部最優。

此外，探索（Exploration）與利用（Exploitation）是一個非常通用的概念，不僅限於 GA 或特徵選擇

探索（Exploration）：嘗試未知的解或新的區域，增加找到全局最優解的機會。
利用（Exploitation）：集中在已知的好解或優勢區域，快速改進並收斂到最佳解。

目前存在的元啟發式方法

遺傳算法（GA, Genetic Algorithm）
粒子群優化（PSO, Particle Swarm Optimization）
蟻群算法（ACO, Ant Colony Optimization）
模擬退火（SA, Simulated Annealing）
差分進化（DE, Differential Evolution）

特徵選擇（Feature Selection）

找出對模型最有貢獻的特徵子集，減少維度、提升效能與可解釋性。

主要類型：

篩選法（Filter）
包裝法（Wrapper）
內嵌法（Embedded）

包裝式方法（Wrapper Feature Selection）

包裝式方法（Wrapper Feature Selection）是一種以模型性能為基準的特徵選擇策略，將每個候選特徵子集當作輸入，訓練模型並計算表現分數（Accuracy、F1、R² 等），判斷哪些特徵組合最有效。

包裝式方法在評估依賴模型表現，直接反映特徵對模型的貢獻，每個特徵子集都需訓練模型，因此計算成本較高

包裝式方法適合與元啟發式方法（Metaheuristic，如 GA、PSO、ACO）結合，用於搜尋最優特徵組合。

範例應用：

遺傳算法包裝式特徵選擇（GA Wrapper Feature Selection）
粒子群優化包裝式特徵選擇（PSO Wrapper Feature Selection）
模擬退火包裝式特徵選擇（SA Wrapper Feature Selection）

遺傳算法包裝式特徵選擇實作項目

環境與資料集

初始載入環境與資料集，並觀察資料集資訊

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd

# 1. 載入數據
data = load_iris()
X = data.data      # 特徵矩陣
y = data.target    # 標籤

print(data.feature_names)
print(X.shape)
print(y)
n_features = X.shape[1] #特徵數量

包裝式特徵選擇(Wrapper Feature Selection )

包裝式特徵選擇負責評估個體的分數表現，使用隨機森林模型進行訓練，並進行五折交叉驗證 5-fold Cross Validation，最終取平均值代表個體分數

feature_mask：個體向量，例如 [0 1 1 0] 用隨機森林模型 (RandomForestClassifier) 進行 5 折交叉驗證回傳平均準確率當作該個體的「適應度」

def fitness_function(feature_mask):
    # feature_mask: 長度 = n_features，元素為0或1
    if np.sum(feature_mask) == 0:
        return 0  # 避免空子集
    selected_X = X[:, feature_mask == 1]
    model = RandomForestClassifier()
    score = cross_val_score(model, selected_X, y, cv=5).mean()
    return score

遺傳算法（Genetic Algorithm, GA）

模仿自然演化的隨機化最佳化演算法，透過個體選擇、交叉與變異，使解空間中的候選解不斷演化，逐步逼近全域最優解。

主要步驟：

初始化種群（Population Initialization）
計算適應度（Fitness Evaluation）
選擇父母（Selection）
基因交叉（Crossover）
基因變異（Mutation）
生成新一代（New Population）
檢查終止條件（Stopping Criteria）

初始化

10個族群
進行5代演化
有 20% 突變機率
population 物件產生族群的隨機特徵

population_size = 10
generations = 5
mutation_rate = 0.2

population = np.random.randint(0, 2, size=(population_size, n_features))

個體評分

迭代每個個體的演化分數表現，然後挑選出最高分個體，評分方式則是 Wrapper Feature Selection 的範疇

for gen in range(generations):
    fitness_scores = np.array([fitness_function(ind) for ind in population])
    
    best_idx = np.argmax(fitness_scores)
    best_individual = population[best_idx]
    print(f"Gen {gen} best score: {fitness_scores[best_idx]} | selected features: {best_individual}")

交叉(Crossover) 與變異(Mutation)

GA演算法的遺傳與變異概念，透過交叉(Crossover) 與變異(Mutation) 來進行演化下一代，其中交叉的為兩個「父母個體」的部分基因，重新組合成兩個「子代個體」，

而變異(Mutation) 則在子代基因中以低機率隨機改變某些基因位（例如將 0 變為 1 或反之），
以增加族群的多樣性、防止演化陷入區域最佳解（local optimum）

具體的交叉方法取決於演算法的設計，此部分則為父母個體1+父母個體2，兩個個體各半特徵來組合

父1: [0, 1 | 0, 1] -> [0, 1]
父2: [1, 0 | 1, 0] -> [1, 0]

這段程式碼主要將族群個體總數各取一半，然後將parents隨機進行配對，取中間位子的索引值，最後將各半的特徵值傳遞給child ，產生兩個 child ，並對 child 進行變異處理。

new_population = []
for _ in range(population_size // 2):
    parents = population[np.random.choice(population_size, 2, replace=False)]
    point = np.random.randint(1, n_features)
    child1 = np.concatenate([parents[0][:point], parents[1][point:]])
    child2 = np.concatenate([parents[1][:point], parents[0][point:]])
    for child in [child1, child2]:
        for i in range(n_features):
            if np.random.rand() < mutation_rate:
                child[i] = 1 - child[i]
        new_population.append(child)
population = np.array(new_population)

分析結果

GA演算法執行結果表明最終迭代到 Gen 4 特徵選擇為[0 0 1 1]，對應資料集則是 petal length 、petal width (cm) ，鳶尾花（Iris）的特徵關聯很好理解，萼片與花瓣是不同的部位，因此關聯特徵需要是相同的財具有關聯性，如需要更其他驗證方式可進行探索性資料分析 (Exploratory Data Analysis)。

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Gen 0 best score: 0.9666666666666668 | selected features: [0 1 1 1]
Gen 1 best score: 0.96 | selected features: [1 1 1 1]
Gen 2 best score: 0.9666666666666668 | selected features: [1 1 1 1]
Gen 3 best score: 0.9666666666666668 | selected features: [0 0 1 1]
Gen 4 best score: 0.9666666666666668 | selected features: [0 0 1 1]

同時，再進行兩次的實驗，依然得出 Gen 4 特徵選擇為[0 0 1 1]，分數表現落在 0.96 ...

Gen 0 best score: 0.96 | selected features: [1 0 1 1]
Gen 1 best score: 0.9666666666666668 | selected features: [0 0 1 1]
Gen 2 best score: 0.9666666666666668 | selected features: [0 0 1 1]
Gen 3 best score: 0.96 | selected features: [0 0 1 1]
Gen 4 best score: 0.9666666666666668 | selected features: [0 0 1 1]

Gen 0 best score: 0.9666666666666668 | selected features: [1 1 1 1]
Gen 1 best score: 0.9533333333333334 | selected features: [0 1 0 1]
Gen 2 best score: 0.96 | selected features: [1 0 1 1]
Gen 3 best score: 0.9533333333333334 | selected features: [1 0 1 1]
Gen 4 best score: 0.9666666666666668 | selected features: [0 0 1 1]

參考資料

Genetic Algorithms 遺傳演算法 https://www.geeksforgeeks.org/dsa/genetic-algorithms/

包裝方法 - 特徵選擇 https://www.geeksforgeeks.org/machine-learning/wrapper-methods-feature-selection/

Random forests and other randomized tree ensembles https://scikit-learn.org/stable/modules/ensemble.html#random-forests-and-other-randomized-tree-ensembles

其他文章

奧卡姆剃刀的哲學理念 (Occam’s Razor)

2025/9/20

趨勢觀點

Stable Diffusion XL 本地部署AI圖像生成流程和技巧

2025/7/21

開發

Vibe Coding 技術債陷阱

2025/7/10

趨勢觀點