# Data HyperCleaning This example demonstrates how to use the BOAT library to perform bi-level optimization with data hyper-cleaning. --- ## Step 1: Data Preparation ```python import sys import os import json sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))) import boat_torch as boat import torch from util_file import data_splitting, initialize from torchvision.datasets import MNIST base_folder = os.path.dirname(os.path.abspath(__file__)) parent_folder = os.path.dirname(base_folder) dataset = MNIST(root=os.path.join(parent_folder, "data/"), train=True, download=True) tr, val, test = data_splitting(dataset, 5000, 5000, 10000) tr.data_polluting(0.5) tr.data_flatten() val.data_flatten() test.data_flatten() ``` ### Explanation: - The `MNIST` dataset is loaded from the specified directory. - The `data_splitting` function splits the dataset into 5000 training, 5000 validation, and 10000 test samples. - The `data_polluting` function introduces noise into the training data by randomly changing 50% of the values. - The `data_flatten` function flattens the data to make it suitable for feeding into the model. --- ## Step 2: Model Definition ```python class Net_x(torch.nn.Module): def __init__(self, tr): super(Net_x, self).__init__() self.x = torch.nn.Parameter( torch.zeros(tr.data.shape[0]).to(device).requires_grad_(True) ) def forward(self, y): y = torch.sigmoid(self.x) * y y = y.mean() return y x = Net_x(tr) y = torch.nn.Sequential(torch.nn.Linear(28**2, 10)).to(device) ``` ### Explanation: - **`Net_x`**: A custom PyTorch model with a learnable parameter `x`. This parameter will be optimized as part of the lower-level optimization process. - **`y` model**: A simple neural network with a single linear layer. --- ## Step 3: Optimizer and Initialization ```python x_opt = torch.optim.Adam(x.parameters(), lr=0.01) y_opt = torch.optim.SGD(y.parameters(), lr=0.01) initialize(x) initialize(y) ``` ### Explanation: - **Optimizers**: Adam optimizer is used for the lower-level model (`x`), and SGD is used for the upper-level model (`y`). - **Initialization**: The `initialize` function resets the model parameters before training. --- ## Step 4: Configuration Loading ```python with open(os.path.join(parent_folder, "data_hyper_cleaning/configs/boat_config_dhl.json"), "r") as f: boat_config = json.load(f) with open(os.path.join(parent_folder, "data_hyper_cleaning/configs/loss_config_dhl.json"), "r") as f: loss_config = json.load(f) ``` ### Explanation: - Configuration files for BOAT are loaded, including: - **`boat_config`**: Contains configuration for the optimization process. - **`loss_config`**: Defines the loss functions used for training. --- ## Step 5: Main Function ```python def main(): import argparse parser = argparse.ArgumentParser(description="Data HyperCleaner") parser.add_argument( "--gm_op", type=str, default="NGD", help="omniglot or miniimagenet or tieredImagenet", ) parser.add_argument( "--na_op", type=str, default="RAD", help="convnet for 4 convs or resnet for Residual blocks", ) parser.add_argument( "--fo_op", type=str, default=None, help="convnet for 4 convs or resnet for Residual blocks", ) args = parser.parse_args() gm_op = args.gm_op.split(",") if args.gm_op else None na_op = args.na_op.split(",") if args.na_op else None boat_config["gm_op"] = gm_op boat_config["na_op"] = na_op boat_config["fo_op"] = args.fo_op boat_config["lower_level_model"] = y boat_config["upper_level_model"] = x boat_config["lower_level_opt"] = y_opt boat_config["upper_level_opt"] = x_opt boat_config["lower_level_var"] = list(y.parameters()) boat_config["upper_level_var"] = list(x.parameters()) b_optimizer = boat.Problem(boat_config, loss_config) b_optimizer.build_ll_solver() b_optimizer.build_ul_solver() ul_feed_dict = {"data": val.data.to(device), "target": val.clean_target.to(device)} ll_feed_dict = {"data": tr.data.to(device), "target": tr.dirty_target.to(device)} iterations = 3 for x_itr in range(iterations): b_optimizer.run_iter(ll_feed_dict, ul_feed_dict, current_iter=x_itr) ``` ### Explanation: 1. **Argument Parsing**: - `gm_op`: Specifies the list of the gradient mapping operations, e.g., ["NGD","GDA"]. - `na_op`: Specifies the list of numerical approximation operations, e.g., ["RAD","RGT"]. - `fo_op`: Optionally specifies a first-order gradient method, e.g., “MESO”. 2. **BOAT Configuration**: - Updates the `boat_config` with the parsed arguments and model components. - Initializes the BOAT `Problem` class for optimization. 3. **Iterative Optimization**: - Runs the optimization process for a specified number of iterations (`iterations`). - Computes and prints loss and runtime for each iteration.