wandb

Naver boostcamp -ai tech/week 02

끵뀐꿩긘 2022. 10. 1. 17:35

https://pebpung.github.io/wandb/2021/10/06/WandB-1.html

1. What is WandB? - A powerful MLOps tool · ML감자

A post on how to use WandB to make managing machine learning experiments easier.

pebpung.github.io

What is WandB?

WandB (Weights & Biases) is a machine learning experiment tracking tool that helps you build better models faster.

- Key features

W&B Platform

Experiments
- Provides a dashboard for tracking machine learning experiments.
- While a model trains, Experiments tracks the training logs and visualizes them on the dashboard, so you can quickly tell whether training is going well.
Artifacts
- Dataset version management and model version management.
Tables
- Logs data so it can be visualized and queried in W&B.
Sweeps
- Automatically tunes hyperparameters to optimize the model.
Reports
- Organizes experiments into documents to share with collaborators.

WandB lets you collaborate with others and manage projects efficiently.

It also integrates with many frameworks, which makes it highly extensible.

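Classifying CIFAR-10

Initializing hyperparameters

wandb.init() – Initializes a new W&B run. Each run is a single execution of the training script.
wandb.config – Stores all hyperparameters in a config object. This lets you sort and compare runs by hyperparameter values in the W&B app.

Tracking results

wandb.watch() – Fetches all layer dimensions, gradients, and model parameters and logs them to the dashboard automatically.
wandb.save() – Saves the model checkpoint file to W&B and associates it with the current run.
wandb.log() – Logs metrics (accuracy, loss, and epoch) along with example images and their predicted and true labels. This lets you visualize the network's performance over time.

Installing wandb & initial setup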

# WandB – Install the W&B library
!pip install wandb -q

from __future__ import print_function
import argparse
import random # to set the python random seed
import numpy # to set the numpy random seed
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
# Ignore excessive warnings
import logging
logging.propagate = False 
logging.getLogger().setLevel(logging.ERROR)

# WandB – Import the wandb library
import wandb

Logging in to wandb

# WandB – Login to your wandb account so you can log all your metrics
!wandb login

'''
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
'''

Log in to wandb, copy your API key, and paste it at the prompt.

Defining the model

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # In our constructor, we define our neural network architecture that we'll use in the forward pass.
        # Conv2d() adds a convolution layer that generates 2 dimensional feature maps to learn different aspects of our image
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        
        # Linear(x,y) creates dense, fully connected layers with x inputs and y outputs
        # Linear layers simply output the dot product of our inputs and weights.
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Here we feed the feature maps from the convolutional layers into a max_pool2d layer.
        # The max_pool2d layer reduces the size of the image representation our convolutional layers learnt,
        # and in doing so it reduces the number of parameters and computations the network needs to perform.
        # Finally we apply the relu activation function which gives us max(0, max_pool2d_output)
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        
        # Reshapes x into size (-1, 16 * 5 * 5) so we can feed the convolution layer outputs into our fully connected layer
        x = x.view(-1, 16 * 5 * 5)
        
        # We apply the relu activation function to the output of the first two fully connected layers (this model uses no dropout)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        
        # Finally we apply log_softmax, which returns log-probabilities for each class (0-9); their exponentials add to 1.
        # This output pairs with F.nll_loss in the training and evaluation loops.
        return F.log_softmax(x, dim=1)
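
The `16 * 5 * 5` input size of `fc1` follows from how each layer shrinks a 32x32 CIFAR-10 image. The sketch below reproduces that arithmetic in plain Python; `conv_out` is a helper written for this illustration (not a PyTorch function), and it assumes no padding and that `max_pool2d(x, 2)` uses a stride equal to its kernel size, which is the PyTorch default.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1, kernel_size=5: 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # max_pool2d(..., 2):   28 -> 14
size = conv_out(size, kernel=5)             # conv2, kernel_size=5: 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # max_pool2d(..., 2):   10 -> 5

channels = 16                               # out_channels of conv2
flat_features = channels * size * size      # features fed into fc1
print(size, flat_features)
```

Changing the kernel sizes or the input resolution changes this number, so `fc1` and the `x.view(...)` call would need to be updated together.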

Training loop

def train(args, model, device, train_loader, optimizer, epoch):
    # Switch model to training mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.train()
    
    # We loop over the data iterator, and feed the inputs to the network and adjust the weights.
    for batch_idx, (data, target) in enumerate(train_loader):
        # Only train on the first 21 batches of each epoch to keep the demo fast
        if batch_idx > 20:
            break
        # Load the input features and labels from the training dataset
        data, target = data.to(device), target.to(device)
        
        # Reset the gradients to 0 for all learnable weight parameters
        optimizer.zero_grad()
        
        # Forward pass: Pass image data from training dataset, make predictions about class image belongs to (0-9 in this case)
        output = model(data)
        
        # Define our loss function, and compute the loss
        loss = F.nll_loss(output, target)
        
        # Backward pass: compute the gradients of the loss w.r.t. the model's parameters
        loss.backward()
        
        # Update the neural network weights
        optimizer.step()
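
The model returns log-probabilities from `F.log_softmax`, and the loop scores them with `F.nll_loss`. As a rough illustration of what that pair computes for a single sample, here is a plain-Python sketch; the two functions are simplified stand-ins written for this note (not the PyTorch implementations), and the logits are made-up values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the true class for one sample."""
    return -log_probs[target]


logits = [2.0, 1.0, 0.1]          # hypothetical raw scores for 3 classes
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]

loss = nll_loss(log_probs, target=0)
print(round(sum(probs), 6), round(loss, 4))
```

The loss is small when the true class has a high probability and grows as that probability approaches zero, which is what `loss.backward()` then differentiates.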

Evaluation loop

def test(args, model, device, test_loader, classes):
    # Switch model to evaluation mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.eval()
    test_loss = 0
    correct = 0

    example_images = []
    with torch.no_grad():
        for data, target in test_loader:
            # Load the input features and labels from the test dataset
            data, target = data.to(device), target.to(device)
            
            # Make predictions: Pass image data from test dataset, make predictions about class image belongs to (0-9 in this case)
            output = model(data)
            
            # Compute the loss, summed over the batch
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            
            # Get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
            
            # WandB – Log images in your test dataset automatically, along with predicted and true labels by passing pytorch tensors with image data into wandb.Image
            example_images.append(wandb.Image(
                data[0], caption="Pred: {} Truth: {}".format(classes[pred[0].item()], classes[target[0].item()]))) # image with its predicted and true labels
    
    # WandB – wandb.log(a_dict) logs the keys and values of the dictionary passed in and associates the values with a step.
    # You can log anything by passing it to wandb.log, including histograms, custom matplotlib objects, images, video, text, tables, html, pointclouds and other 3D objects.
    # Here we use it to log test accuracy, loss and some test images (along with their true and predicted labels).
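    # Log example_images, Test Accuracy, and Test Loss with wandb.log
    # (the summed loss is divided by the dataset size to report a per-sample average)
    wandb.log({
        "Examples": example_images,
        "Test Accuracy": 100. * correct / len(test_loader.dataset),
        "Test Loss": test_loss / len(test_loader.dataset)})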
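
The evaluation loop accumulates a summed loss (`reduction='sum'`) and a count of correct predictions across batches. A minimal sketch of turning those running totals into dataset-level metrics is shown below; the per-batch numbers are invented for illustration and do not come from an actual run.

```python
# Hypothetical per-batch results: (summed loss, correct predictions, batch size)
batches = [(12.5, 6, 10), (9.0, 8, 10), (4.2, 4, 5)]

total_loss = sum(loss for loss, _, _ in batches)
total_correct = sum(correct for _, correct, _ in batches)
num_samples = sum(size for _, _, size in batches)

avg_loss = total_loss / num_samples            # per-sample average loss
accuracy = 100.0 * total_correct / num_samples  # percentage of correct predictions
print(f"avg loss {avg_loss:.3f}, accuracy {accuracy:.1f}%")
```

Dividing by the number of samples rather than the number of batches keeps the result correct even when the last batch is smaller than the others.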

Train, Edit, and Retrain

# WandB – Initialize a new run
wandb.init(entity="freshmanbo", project="pytorch-intro") # create a new wandb run
wandb.watch_called = False # Re-run the model without restarting the runtime, unnecessary after our next release

# Manage the hyperparameters through the config object
# WandB – Config is a variable that holds and saves hyperparameters and inputs
config = wandb.config          # Initialize config
config.batch_size = 4          # input batch size for training (default: 64)
config.test_batch_size = 10    # input batch size for testing (default: 1000)
config.epochs = 50             # number of epochs to train (default: 10)
config.lr = 0.1               # learning rate (default: 0.01)
config.momentum = 0.1          # SGD momentum (default: 0.5) 
config.no_cuda = False         # disables CUDA training
config.seed = 42               # random seed (default: 42)
config.log_interval = 10     # how many batches to wait before logging training status

def main():
    use_cuda = not config.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    
    # Set random seeds and deterministic pytorch for reproducibility
    # random.seed(config.seed)       # python random seed
    torch.manual_seed(config.seed) # pytorch random seed
    # numpy.random.seed(config.seed) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Load the dataset: We're training our CNN on CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html)
    # First we define the transformations to apply to our images
    transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    
    # Now we load our training and test datasets and apply the transformations defined above
    train_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=True,
                                              download=True, transform=transform), batch_size=config.batch_size,
                                              shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=False,
                                             download=True, transform=transform), batch_size=config.test_batch_size,
                                             shuffle=False, **kwargs)

    classes = ('plane', 'car', 'bird', 'cat',
               'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

    # Initialize our model, recursively go over all modules and convert their parameters and buffers to CUDA tensors (if device is set to cuda)
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=config.lr,
                          momentum=config.momentum)
    
    # WandB – wandb.watch() automatically fetches all layer dimensions, gradients, model parameters and logs them automatically to your dashboard.
    # Using log="all" log histograms of parameter values in addition to gradients
    wandb.watch(model, log="all") # track the model's gradients, dimensions, and parameters

    for epoch in range(1, config.epochs + 1):
        train(config, model, device, train_loader, optimizer, epoch)
        test(config, model, device, test_loader, classes)
        
    # WandB – Save the model checkpoint. This automatically saves a file to the cloud and associates it with the current run.
    torch.save(model.state_dict(), "model.h5")
    wandb.save('model.h5') # save the checkpoint to W&B

if __name__ == '__main__':
    main()
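
The `transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` step in `main()` subtracts a mean of 0.5 and divides by a standard deviation of 0.5 on each channel. Since `ToTensor()` produces values in [0, 1], the result lies in [-1, 1]. The sketch below shows that arithmetic on single values; `normalize` is a helper written for this illustration, not the torchvision transform itself.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel normalization: (value - mean) / std."""
    return (value - mean) / std


# Pixel values after ToTensor() are in [0, 1]
darkest, mid_gray, brightest = 0.0, 0.5, 1.0
normalized = [normalize(v) for v in (darkest, mid_gray, brightest)]
print(normalized)
```

Note that 0.5 here is a convenient constant rather than the measured per-channel mean and standard deviation of CIFAR-10.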

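The loss and accuracy tracked with wandb.log appear in the specified project, along with the images and their predicted and true labels.

In addition, distribution plots of the gradients and parameters tracked with wandb.watch are generated.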