Discriminator fails when resuming training because the shapes of binary.weight and category.weight do not match the saved checkpoint

by max-stackoverflow | Posted on 2025-03-16

The training data was originally 256×256. After switching to 384×384 to continue training, loading the Discriminator fails with:

Error loading experiments/checkpoint\6_net_D.pth: Error(s) in loading state_dict for Discriminator:
size mismatch for binary.weight: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 2304]).
size mismatch for category.weight: copying a param with shape torch.Size([40, 1024]) from checkpoint, the shape in current model is torch.Size([40, 2304]).

Problem analysis

Your Discriminator (netD) fails when training resumes because the shapes of binary.weight and category.weight no longer match the ones stored in the checkpoint.

Error message

Error(s) in loading state_dict for Discriminator:
    size mismatch for binary.weight: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 2304]).
    size mismatch for category.weight: copying a param with shape torch.Size([40, 1024]) from checkpoint, the shape in current model is torch.Size([40, 2304]).

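This means the input width of the Discriminator's two linear heads changed, so the weight shapes no longer match:

In the old model (256x256), binary.weight has shape [1, 1024].
In the new model (384x384), binary.weight has shape [1, 2304].

The change comes from:

final_features = final_channels * image_size * image_size
Here image_size is the side length of the feature map after the CNN's 8x downsampling, and the shapes in the error are consistent with final_channels = 1: 1 × 32 × 32 = 1024 at 256x256, and 1 × 48 × 48 = 2304 at 384x384. When the input grows from 256 to 384, final_features grows with it, so self.binary and self.category need more weights.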
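The arithmetic behind the two shapes can be checked with a few lines of plain Python. This is only a sketch: the helper name final_features_for and the default final_channels=1 are assumptions inferred from the error message, not code from the original project.

```python
import math


def final_features_for(image_size: int, final_channels: int = 1, downsample: int = 8) -> int:
    """Flattened feature width for a square input (hypothetical helper)."""
    side = math.ceil(image_size / downsample)  # feature-map side after the CNN
    return final_channels * side * side


for size in (256, 384):
    print(size, final_features_for(size))
```

The two printed widths are the in_features values that appear in the size-mismatch message for the checkpoint and for the current model.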

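Solutions

Method 1: Load only the matching weights

If you want to keep using the old netD weights, skip the layers whose shapes do not match and let them start from fresh initial values: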

checkpoint = torch.load(target_filepath_D, map_location=self.device)
model_dict = self.netD.state_dict()

# Keep only the entries whose name and shape match the current model
pretrained_dict = {k: v for k, v in checkpoint.items() if k in model_dict and model_dict[k].shape == v.shape}

# Merge them into the current state dict and load it
model_dict.update(pretrained_dict)
self.netD.load_state_dict(model_dict)

print(f"Loaded {len(pretrained_dict)}/{len(model_dict)} layers from checkpoint.")

binary.weight and category.weight then keep their new random initialization, and every other weight comes from the checkpoint.
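The same idea can be tried in isolation. The sketch below uses a hypothetical stand-in model (TinyD) and a hypothetical helper (load_matching); neither is part of the original project, and the real Discriminator has more layers.

```python
import torch
import torch.nn as nn


class TinyD(nn.Module):
    """Hypothetical stand-in for the Discriminator: one conv plus two linear heads."""

    def __init__(self, final_features: int, embedding_num: int = 40):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=5, stride=2, padding=2)
        self.binary = nn.Linear(final_features, 1)
        self.category = nn.Linear(final_features, embedding_num)


def load_matching(model: nn.Module, state: dict) -> list:
    """Copy tensors whose name and shape match; return the keys that were skipped."""
    current = model.state_dict()
    matched = {k: v for k, v in state.items()
               if k in current and current[k].shape == v.shape}
    skipped = [k for k in state if k not in matched]
    # strict=False leaves every parameter that is absent from `matched` untouched
    model.load_state_dict(matched, strict=False)
    return skipped


old = TinyD(final_features=1024)  # head shapes of the 256x256 checkpoint
new = TinyD(final_features=2304)  # head shapes of the 384x384 model
skipped = load_matching(new, old.state_dict())
print(skipped)
```

Returning the skipped keys makes it easy to log exactly which tensors were left at their fresh initialization.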

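Method 2: Pin `final_features` by hand

If the shapes of binary.weight and category.weight changed, you can keep the new model's heads identical to the old model's.
Edit the Discriminator so that __init__() fixes final_features at 1024:

final_features = 1024  # fixed head width, independent of image_size
self.binary = nn.Linear(final_features, 1)
self.category = nn.Linear(final_features, embedding_num)

The shapes of binary.weight and category.weight then stay the same, so the old checkpoint loads. Note that this alone only fixes loading: at 384×384 the flattened features still have 2304 elements, so forward() must also bring them down to 1024 (for example by pooling the feature map to 32×32), or the linear layers will raise a shape error.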
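One way to make a pinned width of 1024 usable at both resolutions is to pool the feature map to 32×32 before flattening. This is a sketch under assumptions: PooledHead is a hypothetical module, final_channels = 1 is inferred from the error message, and the reused weights would likely still need fine-tuning because they now see pooled features.

```python
import torch
import torch.nn as nn


class PooledHead(nn.Module):
    """Hypothetical head: pool to 32x32, flatten, then apply the binary Linear layer."""

    def __init__(self, final_channels: int = 1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((32, 32))
        self.binary = nn.Linear(final_channels * 32 * 32, 1)  # in_features = 1024

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        flat = self.pool(features).flatten(1)  # (batch_size, 1024) at any resolution
        return self.binary(flat)


head = PooledHead()
for side in (32, 48):  # feature-map side for 256x256 and 384x384 inputs
    logits = head(torch.randn(16, 1, side, side))
    print(side, tuple(logits.shape))
```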

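Method 3: Retrain the `Discriminator`

If the new input resolution changes what netD has to learn, you may need to retrain the Discriminator:

Reinitialize netD (do not load the old netD weights).
Load only the netG weights and let netD learn from scratch.
Freeze netG at the start of training so netD can learn features at the new resolution.

Modify load_networks() so that netD is left freshly initialized: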

if os.path.exists(target_filepath_G):
    self.netG.load_state_dict(torch.load(target_filepath_G, map_location=self.device), strict=False)
    print(f"Loaded Generator from {target_filepath_G}")

if os.path.exists(target_filepath_D):
    print(f"Skipping Discriminator loading: model structure changed")
else:
    print("Training new Discriminator from scratch.")
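The freezing step mentioned for Method 3 can be sketched as follows. The two nn.Linear modules are hypothetical stand-ins for netG and netD, and set_requires_grad is a helper name chosen for illustration; the real training loop will differ.

```python
import torch
import torch.nn as nn


def set_requires_grad(net: nn.Module, flag: bool) -> None:
    """Enable or disable gradient tracking for every parameter of `net`."""
    for p in net.parameters():
        p.requires_grad_(flag)


netG = nn.Linear(4, 4)  # hypothetical stand-in for the generator
netD = nn.Linear(4, 1)  # hypothetical stand-in for the discriminator

set_requires_grad(netG, False)  # freeze the generator
opt_D = torch.optim.Adam(netD.parameters(), lr=2e-4)

fake = netG(torch.randn(8, 4))
loss = netD(fake).mean()  # placeholder loss, only to drive one update step
opt_D.zero_grad()
loss.backward()
opt_D.step()

g_grads_empty = all(p.grad is None for p in netG.parameters())
print(g_grads_empty)
```

Calling set_requires_grad(netG, True) later resumes joint training.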

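Recommendation

To reuse as much of the old model as possible, use Method 1 (partial weight loading).
To keep the Discriminator heads unchanged, use Method 2 and pin final_features.
If the Discriminator needs to learn features at the new resolution, use Method 3 and retrain netD.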

At 256×256, with final_features fixed at 147456, this error appears:

return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x1024 and 147456x1)

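This error occurs because the in_features of self.binary = nn.Linear(final_features, 1) does not match the width of the features tensor passed into it: the input is 16x1024, but the layer expects 147456 inputs.

Root cause

The output of features = self.model(input) does not match nn.Linear(final_features, 1). Specifically:

With image_size=256, the CNN reduces features to shape (batch_size, final_channels, 32, 32) (because 256 / 8 = 32).
With image_size=384, features has shape (batch_size, final_channels, 48, 48) (because 384 / 8 = 48).
nn.Linear needs a 2D tensor of shape (batch_size, final_features), so features.view(input.shape[0], -1) produces (batch_size, final_channels * height * width). Here that is 1 × 32 × 32 = 1024, whereas 147456 = 384 × 384 is the input area before downsampling, so final_features does not correspond to the actual feature size.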
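The 8x reduction can be verified from the standard convolution output-size formula. The sketch assumes three stride-2 convolutions with kernel size 5 and padding 2, matching the kw and padw values in the Discriminator listing later in this post; conv_out and feature_side are hypothetical helper names.

```python
def conv_out(size: int, kernel: int = 5, stride: int = 2, padding: int = 2) -> int:
    """Output side length of one convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_side(image_size: int, num_stride2: int = 3) -> int:
    """Side length after the stride-2 layers (a stride-1 layer with k=5, p=2 keeps the size)."""
    side = image_size
    for _ in range(num_stride2):
        side = conv_out(side)
    return side


for image_size in (256, 384):
    side = feature_side(image_size)
    print(image_size, side, side * side)
```

With one output channel, side * side is the flattened width that reaches the linear heads.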

Fixes

1. Compute `final_features` dynamically

Derive final_features from the input size:

image_size = math.ceil(image_size / 8)  # the CNN downsamples by a factor of 8
final_features = final_channels * image_size * image_size

final_features then follows image_size.

2. Check `features.view()`

In forward(), print features.shape to confirm the shape:

def forward(self, input):
    features = self.model(input)
    print(f"Feature shape before view: {features.shape}")  # expect (batch_size, C, H, W)
    if self.blur:
        features = self.gaussian_blur(features)
    features = features.view(input.shape[0], -1)
    print(f"Feature shape after view: {features.shape}")  # second dim must equal `final_features`

    binary_logits = self.binary(features)
    category_logits = self.category(features)
    return binary_logits, category_logits

If features.shape[1] * features.shape[2] * features.shape[3] (taken before the view) does not equal final_features, you get a shape mismatch error.

3. Check the `image_size` argument passed to the Discriminator

When you construct the Discriminator:

self.netD = Discriminator(
    input_nc=2 * self.input_nc,
    embedding_num=self.embedding_num,
    ndf=self.ndf,
    final_channels=self.final_channels,
    image_size=self.image_size,  # must match the training data: 384 here, not 256
    blur=self.d_blur,
    norm_layer=nn.BatchNorm2d
)

Make sure image_size=self.image_size is really set to 384 rather than 256.

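Summary

Your problem comes from a mismatch between final_features and the output of features.view(). The fixes are:

Compute final_features dynamically so it adapts to the resolution (256x256 → 32x32, 384x384 → 48x48).
Print features.shape in forward() to confirm the tensor shapes.
Make sure the image_size argument is correct when constructing the Discriminator.

That should resolve the error. Try these steps and see whether it persists.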

Is there a single final_features value that works for both 256×256 and 384×384?

For final_features to work at both 256×256 and 384×384, it has to match the output width of features.view(batch_size, -1) in both cases.

1. Feature dimensions after the CNN

In general, the shape of a CNN's features changes with the input resolution. For example:

A 256×256 input may yield a feature map of shape (batch_size, C, 32, 32) (assuming three stride-2 downsampling steps).
A 384×384 input may yield (batch_size, C, 48, 48).

So features.view(batch_size, -1) gives:

256×256 → batch_size × (C × 32 × 32)
384×384 → batch_size × (C × 48 × 48)

2. Keeping `final_features` constant

We want final_features to be a fixed value N such that:

features.view(batch_size, -1).shape[1] == final_features

That requires the flattened feature width to be equal for 256×256 and 384×384, which does not hold in a standard CNN, because the spatial dimensions (H, W) of the feature map scale with the input size.

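3. Possible solutions

Method 1: Global average pooling (GAP)

Make the features width equal to C at every resolution:

features = F.adaptive_avg_pool2d(features, (1, 1))  # output becomes (batch_size, C, 1, 1)
features = features.view(features.shape[0], -1)  # becomes (batch_size, C)
final_features = C  # C is the channel count of the last conv layer

final_features no longer depends on the input resolution, because (H, W) is always (1, 1).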
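A quick check of this behaviour on random tensors, as a sketch: C = 512 is an arbitrary illustrative channel count and gap_flatten is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def gap_flatten(features: torch.Tensor) -> torch.Tensor:
    """Global average pooling followed by flattening to (batch_size, C)."""
    pooled = F.adaptive_avg_pool2d(features, (1, 1))  # (batch_size, C, 1, 1)
    return pooled.flatten(1)                          # (batch_size, C)


C = 512  # illustrative channel count, not taken from the original model
from_256 = gap_flatten(torch.randn(16, C, 32, 32))  # feature map for a 256x256 input
from_384 = gap_flatten(torch.randn(16, C, 48, 48))  # feature map for a 384x384 input
print(from_256.shape, from_384.shape)
```

Both results have the same second dimension, so one nn.Linear(C, ...) head can serve both resolutions.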

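Method 2: A linear projection to a common width

If the spatial size of features differs, an nn.Linear() layer can project the flattened features to a fixed width. In __init__(), for 256×256 input:

self.fc = nn.Linear(C * 32 * 32, fixed_final_features)

and for 384×384 input:

self.fc_384 = nn.Linear(C * 48 * 48, fixed_final_features)

Then in forward(), apply the layer that matches the input:

features = features.view(features.shape[0], -1)  # (batch_size, C * H * W)
features = self.fc(features)  # or self.fc_384 for 384×384 input

The width that reaches the heads is then the same at both resolutions, although each resolution needs its own projection layer, and those layers share no weights.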

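4. Best option

GAP (Method 1) is the better choice here because it:

Does not depend on the input size (works for 256×256, 384×384, or any other size).
Reduces computation (the feature map shrinks to 1×1, leaving only C values).
Keeps the nn.Linear() input width from changing with the input.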
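To put a rough number on the reduced computation, the sketch below counts the parameters of the two heads with and without GAP. C = 512 is an illustrative assumption; embedding_num = 40 comes from the category.weight shape in the error message; head_params is a hypothetical helper.

```python
def head_params(in_features: int, embedding_num: int = 40) -> int:
    """Weights plus biases of the binary head and the category head."""
    binary = in_features * 1 + 1
    category = in_features * embedding_num + embedding_num
    return binary + category


C = 512  # illustrative channel count
flatten_256 = head_params(C * 32 * 32)  # flattened features, 256x256 input
flatten_384 = head_params(C * 48 * 48)  # flattened features, 384x384 input
with_gap = head_params(C)               # global average pooling
print(flatten_256, flatten_384, with_gap)
```

Under these assumptions the flattened heads hold tens of millions of parameters, while the GAP heads hold about twenty thousand.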

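Recommended final_features setting:

final_features = C  # C is the channel count of the CNN's last layer, e.g. 512

If the last layer outputs C=512 channels, then final_features=512 and nn.Linear(final_features, 1) works at any input resolution.

Conclusion

✔ Recommended: use F.adaptive_avg_pool2d(features, (1, 1)) to make final_features uniform.
✔ Set final_features = C so it applies to every resolution.
✔ If you need a further transformation, use nn.Linear() to map to a fixed size.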

Does "best option" mean that a multiple of 512, such as 1024, would also work?

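Yes, with one condition. With Global Average Pooling (GAP), final_features can be 512 or 1024 and works for both 256×256 and 384×384 input, as long as it equals the channel count C of the last conv layer. It does not have to be a multiple of 512.

Details

1. Why do both 512 and 1024 work?

When we use GAP (Global Average Pooling):

features = F.adaptive_avg_pool2d(features, (1, 1))  # output shape becomes (batch_size, C, 1, 1)
features = features.view(features.shape[0], -1)    # becomes (batch_size, C)

C (the channel count of the last layer) can be 512, 1024, 2048, or any other value. As long as final_features = C, the fully connected layer (nn.Linear) works, for example:

self.fc = nn.Linear(512, 1)  # if C = 512
# or
self.fc = nn.Linear(1024, 1) # if C = 1024

In this Discriminator, C is final_channels, so the conv stack must actually output that many channels; with final_channels = 1, GAP leaves a single feature per image. Either way, final_features stays the same whether the input is 256×256 or 384×384.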

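2. Why does `final_features = 147456` fail?

final_features = 147456 does not match the output of features.view(batch_size, -1). Without GAP, the flattened width depends on the input resolution:

256×256 input → (batch_size, C × 32 × 32)
384×384 input → (batch_size, C × 48 × 48)

147456 is 384 × 384, the input area before the CNN's 8x downsampling. With C = 1, the flattened width is actually 1024 at 256×256 and 2304 at 384×384, so 147456 matches neither, and the linear layer raises: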

RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x1024 and 147456x1)
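The mismatch can be reproduced without the full model. This sketch builds a standalone nn.Linear and feeds it a tensor of a given width; try_head is a hypothetical helper, and the exact wording of the error text may vary between PyTorch versions.

```python
import torch
import torch.nn as nn


def try_head(in_features: int, feature_width: int, batch: int = 16) -> str:
    """Apply nn.Linear(in_features, 1) to a (batch, feature_width) tensor and report the outcome."""
    head = nn.Linear(in_features, 1)
    features = torch.randn(batch, feature_width)
    try:
        head(features)
        return "ok"
    except RuntimeError as err:
        return str(err)


print(try_head(147456, 1024))  # width from the failing setup
print(try_head(1024, 1024))    # width that matches the flattened features
```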

3. Making 512 or 1024 work

To have final_features = 512 or 1024, you need GAP:

features = F.adaptive_avg_pool2d(features, (1, 1))  # (batch_size, C, 1, 1)
features = features.view(features.shape[0], -1)    # (batch_size, C)
self.fc = nn.Linear(C, 1)  # C = 512 or 1024, matching the last conv layer

features then has the same shape for 256×256 and 384×384, so final_features can be 512, 1024, or even 2048 without shape errors.

4. Summary

✅ final_features = 512 works for both 256×256 and 384×384
✅ final_features = 1024 also works for both 256×256 and 384×384
✅ Use GAP (F.adaptive_avg_pool2d(features, (1, 1))) to keep final_features fixed
🚫 Do not let final_features depend on the input resolution, as with values like 147456

So you can choose any of:

final_features = 512
final_features = 1024
final_features = 2048

All of these run correctly, provided the last conv layer outputs the same number of channels.

For a font style transfer task (Noto Sans → Zen Maru Gothic), what final_features value do you recommend? The original author wrote final_features = final_channels * image_size * image_size, so at 384×384 it becomes 147456.

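For font style transfer (Noto Sans → Zen Maru Gothic), it is better not to use final_features = final_channels * image_size * image_size. Use a fixed final_features instead, such as 512 or 1024.

Why should `final_features` not depend on the image size?

With final_features = final_channels * image_size * image_size (image_size being the feature-map side):

At 256×256: final_features = C × 32 × 32
At 384×384: final_features = C × 48 × 48

The input width of the fully connected layer (nn.Linear) then changes with the resolution, so the model cannot share head weights across resolutions, and a mismatch raises RuntimeError: mat1 and mat2 shapes cannot be multiplied.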

Recommended `final_features` setting

1. Use Global Average Pooling (GAP)

Set final_features to a fixed value such as 512 or 1024 so the width stays the same regardless of image size:

features = F.adaptive_avg_pool2d(features, (1, 1))  # becomes (batch_size, C, 1, 1)
features = features.view(features.shape[0], -1)    # becomes (batch_size, C)
self.fc = nn.Linear(C, 1)  # C = 512 or 1024

features then has the same shape for 256×256 and 384×384 input, and no dimension error occurs.

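2. Choosing `final_features`

Font style transfer is a style-learning task and does not need a very wide final_features; too wide a head overfits more easily:

`final_features`   Compute   Suitable for
512                Low       Fast training
1024               Medium    Balancing accuracy and performance
2048               High      Tasks that need more expressive features

For Noto Sans → Zen Maru Gothic, the suggestion is:
✅ final_features = 1024 (suggested default; captures enough style information)
✅ final_features = 512 (for faster training)
🚫 final_features = 147456 (depends on the image size and is error-prone)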

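Conclusion

For the font style transfer task (Noto Sans → Zen Maru Gothic), the suggested final_features values are:

1024 (recommended): fits most cases and can learn enough style features
512 (faster training): choose this to reduce computation
Use Global Average Pooling (GAP) so that final_features does not depend on the image size

Your model then works at both 256×256 and 384×384 without shape errors.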

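Here is the modified Discriminator. The way final_features is computed has been changed so that it works at different resolutions (256×256, 384×384, and so on) and avoids the RuntimeError.

Changes

Use AdaptiveAvgPool2d((1,1))
- features ends up as (batch_size, C, 1, 1) and is then viewed as (batch_size, C), so final_features stays the same at any input resolution.
Derive final_features from the channel count
- final_features is the channel dimension of the pooled features (features.shape[1], equal to final_channels) rather than a hardcoded value such as 2048.
The in_features of both nn.Linear layers is final_features
- self.binary and self.category both use that same width.

Modified `Discriminator`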

import torch
import torch.nn as nn
import torchvision.transforms as T

class Discriminator(nn.Module):
    def __init__(self, input_nc, embedding_num, ndf=64, norm_layer=nn.BatchNorm2d,
                 final_channels=1, blur=False):
        super(Discriminator, self).__init__()

        use_bias = norm_layer != nn.BatchNorm2d
        kw = 5
        padw = 2

        # First stride-2 convolution: halves the spatial size
        sequence = [
            nn.utils.spectral_norm(nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw)),
            nn.LeakyReLU(0.2, True)
        ]

        # Two more stride-2 convolutions: 8x downsampling in total
        nf_mult = 1
        for n in range(1, 3):
            nf_mult_prev = nf_mult
            nf_mult = min(2 ** n, 8)
            sequence += [
                nn.utils.spectral_norm(nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=2, padding=padw, bias=use_bias)),
                norm_layer(ndf * nf_mult),
                nn.LeakyReLU(0.2, True)
            ]

        # Stride-1 convolution that maps to final_channels
        sequence += [
            nn.utils.spectral_norm(nn.Conv2d(ndf * nf_mult, final_channels, kernel_size=kw, stride=1, padding=padw, bias=use_bias)),
            norm_layer(final_channels),
            nn.LeakyReLU(0.2, True)
        ]

        self.model = nn.Sequential(*sequence)
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))  # output is always (batch_size, C, 1, 1)
        self.blur = blur
        self.gaussian_blur = T.GaussianBlur(kernel_size=1, sigma=1.0)  # blur strength; a kernel_size of 1 has no blurring effect

        # After GAP the feature width equals final_channels, so the heads are
        # created here instead of lazily in forward(). Their parameters are then
        # visible to the optimizer and to state_dict from the start.
        # With final_channels=1 the heads see one value per image; a larger
        # final_channels (e.g. 512) is likely needed in practice.
        final_features = final_channels
        self.binary = nn.Linear(final_features, 1)
        self.category = nn.Linear(final_features, embedding_num)

    def forward(self, input):
        features = self.model(input)
        if self.blur:
            features = self.gaussian_blur(features)

        features = self.global_avg_pool(features)  # (batch_size, C, 1, 1)
        features = features.view(features.shape[0], -1)  # (batch_size, C)

        binary_logits = self.binary(features)
        category_logits = self.category(features)
        return binary_logits, category_logits

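Advantages

✅ Supports different resolutions (256×256, 384×384, 512×512)
✅ Avoids the mat1 and mat2 shapes cannot be multiplied error
✅ final_features follows the channel count, with no hardcoded value such as 2048
✅ self.binary and self.category use the correct features width

The Discriminator now runs at different input image sizes. Give it a try! 🚀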

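Modify the following script so that it keeps using the old netD weights, skips the layers whose shapes do not match, and initializes new weights for them: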

Max的程式語言筆記