Reference: "1. WandB란? - 강력한 MLOps Tool" (ML감자) – https://pebpung.github.io/wandb/2021/10/06/WandB-1.html
A post about using WandB to manage machine learning experiments more conveniently.
What is WandB?
WandB (Weights & Biases) is a machine learning experiment tracking tool that helps you build better models faster.
- Key features
W&B Platform
- Experiments
  - Provides a dashboard for tracking machine learning experiments.
  - While a model trains, Experiments tracks the training logs and visualizes them on the dashboard, so you can quickly tell whether training is going well.
- Artifacts
  - Dataset version and model version management.
- Tables
  - Log data, then visualize and query it in W&B.
- Sweeps
  - Automatically tunes hyperparameters to find an optimal configuration (a minimal sweep sketch appears below).
- Reports
  - Organize experiments into documents and share them with collaborators.
WandB makes it easier to collaborate with others and manage projects efficiently.
It also integrates with many frameworks, which makes it highly extensible.
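For example, the Sweeps feature is driven by a small configuration dict plus two calls, wandb.sweep() and wandb.agent(). The sketch below is only an illustration under assumptions of my own: the project name, the parameter ranges, and the placeholder train_fn are not part of this tutorial.

import wandb

def train_fn():
    # Placeholder training function: the sweep agent injects the sampled
    # hyperparameters into wandb.config for each trial.
    with wandb.init() as run:
        lr = wandb.config.lr
        momentum = wandb.config.momentum
        # ... train a model with these values ...
        wandb.log({"Test Accuracy": 0.0})  # dummy value; log the real metric here

sweep_config = {
    "method": "random",  # search strategy: grid, random, or bayes
    "metric": {"name": "Test Accuracy", "goal": "maximize"},
    "parameters": {
        "lr": {"values": [0.1, 0.01, 0.001]},
        "momentum": {"min": 0.1, "max": 0.9},
    },
}
sweep_id = wandb.sweep(sweep_config, project="pytorch-intro")
wandb.agent(sweep_id, function=train_fn, count=5)  # run 5 trials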
Classifying CIFAR-10
Hyperparameter initialization
- wandb.init() – Initializes a new W&B run. Each run is a single execution of the training script.
- wandb.config – Saves all hyperparameters in a config object, so the app can sort and compare runs by their hyperparameter values (a short sketch of both calls follows).
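A minimal sketch of these two calls, with a placeholder project name and hyperparameter values of my own choosing:

import wandb

run = wandb.init(project="pytorch-intro", config={"lr": 0.01, "batch_size": 4})
config = wandb.config                 # the stored hyperparameters
print(config.lr, config.batch_size)   # runs can now be sorted and compared by these values in the app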
Tracking results
- wandb.watch() – Fetches all layer dimensions, gradients, and model parameters and logs them to the dashboard automatically.
- wandb.save() – Saves the model checkpoint.
- wandb.log() – Logs metrics (accuracy, loss, epoch) and example images together with their predicted and true labels, so you can visualize the network's performance over time (see the combined sketch after this list).
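Put together, the calls look roughly like this. This is only a sketch: the tiny nn.Linear model and the metric values are stand-ins, not the CIFAR-10 model built later in this post.

import torch
import torch.nn as nn
import wandb

run = wandb.init(project="pytorch-intro")
model = nn.Linear(10, 2)                      # stand-in model, just for the sketch
wandb.watch(model, log="all")                 # log gradients and parameter histograms for every layer
wandb.log({"Test Accuracy": 50.0, "Test Loss": 1.23})  # placeholder metric values
torch.save(model.state_dict(), "model.h5")
wandb.save("model.h5")                        # upload the checkpoint file and attach it to the run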
Installing & setting up wandb
# WandB – Install the W&B library
!pip install wandb -q
from __future__ import print_function
import argparse
import random # to set the python random seed
import numpy # to set the numpy random seed
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
# Ignore excessive warnings
import logging
logging.propagate = False
logging.getLogger().setLevel(logging.ERROR)
# WandB – Import the wandb library
import wandb
Logging in to wandb
# WandB – Login to your wandb account so you can log all your metrics
!wandb login
'''
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
'''
Log in to wandb and paste your API key when prompted.
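If you would rather not paste the key interactively (for example in an automated job), wandb can also log in programmatically. A small sketch, assuming the key has been exported as the WANDB_API_KEY environment variable:

import os
import wandb

wandb.login(key=os.environ["WANDB_API_KEY"])  # read the API key from the environment instead of stdin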
Model definition
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # In our constructor, we define our neural network architecture that we'll use in the forward pass.
        # Conv2d() adds a convolution layer that generates 2 dimensional feature maps to learn different aspects of our image
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        # Linear(x,y) creates dense, fully connected layers with x inputs and y outputs
        # Linear layers simply output the dot product of our inputs and weights.
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Here we feed the feature maps from the convolutional layers into a max_pool2d layer.
        # The max_pool2d layer reduces the size of the image representation our convolutional layers learnt,
        # and in doing so it reduces the number of parameters and computations the network needs to perform.
        # Finally we apply the relu activation function which gives us max(0, max_pool2d_output)
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        # Reshape x into size (-1, 16 * 5 * 5) so we can feed the convolutional layer outputs into our fully connected layers
        x = x.view(-1, 16 * 5 * 5)
        # Apply the relu activation function to the outputs of the first two fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # Finally we apply log_softmax to obtain log-probabilities over the 10 CIFAR-10 classes.
        return F.log_softmax(x, dim=1)
Training loop
def train(args, model, device, train_loader, optimizer, epoch):
    # Switch model to training mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.train()

    # We loop over the data iterator, and feed the inputs to the network and adjust the weights.
    for batch_idx, (data, target) in enumerate(train_loader):
        # Only train on the first 21 batches of each epoch to keep the demo fast
        if batch_idx > 20:
            break
        # Load the input features and labels from the training dataset
        data, target = data.to(device), target.to(device)
        # Reset the gradients to 0 for all learnable weight parameters
        optimizer.zero_grad()
        # Forward pass: pass image data through the network and predict which of the 10 CIFAR-10 classes it belongs to
        output = model(data)
        # Define our loss function, and compute the loss
        loss = F.nll_loss(output, target)
        # Backward pass: compute the gradients of the loss w.r.t. the model's parameters
        loss.backward()
        # Update the neural network weights
        optimizer.step()
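The train() function above only updates the weights. If you also want a training-loss curve on the dashboard, an extra wandb.log() call can be added inside the batch loop; this is an optional addition, not part of the original script:

        # Optional: place inside the batch loop of train(), right after optimizer.step()
        if batch_idx % args.log_interval == 0:
            wandb.log({"Train Loss": loss.item(), "Epoch": epoch})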
Evaluation loop
def test(args, model, device, test_loader, classes):
    # Switch model to evaluation mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.eval()
    test_loss = 0
    correct = 0
    example_images = []
    with torch.no_grad():
        for data, target in test_loader:
            # Load the input features and labels from the test dataset
            data, target = data.to(device), target.to(device)
            # Make predictions: pass image data through the network and predict which of the 10 CIFAR-10 classes it belongs to
            output = model(data)
            # Compute the loss, summing up the batch losses
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # Get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
            # WandB – Log images in your test dataset automatically, along with predicted and true labels, by passing pytorch tensors with image data into wandb.Image
            example_images.append(wandb.Image(
                data[0], caption="Pred: {} Truth: {}".format(classes[pred[0].item()], classes[target[0]])))  # image plus its predicted and true label

    # Average the summed loss over the test set
    test_loss /= len(test_loader.dataset)

    # WandB – wandb.log(a_dict) logs the keys and values of the dictionary passed in and associates the values with a step.
    # You can log anything by passing it to wandb.log, including histograms, custom matplotlib objects, images, video, text, tables, html, pointclouds and other 3D objects.
    # Here we use it to log example_images, Test Accuracy and Test Loss (along with the true and predicted labels).
    wandb.log({
        "Examples": example_images,
        "Test Accuracy": 100. * correct / len(test_loader.dataset),
        "Test Loss": test_loss})
Train, Edit, and Retrain
# WandB – Initialize a new run
wandb.init(entity="freshmanbo", project="pytorch-intro")  # create a new wandb run
wandb.watch_called = False # Re-run the model without restarting the runtime, unnecessary after our next release
# Manage the hyperparameters in a config object
# WandB – Config is a variable that holds and saves hyperparameters and inputs
config = wandb.config # Initialize config
config.batch_size = 4 # input batch size for training (default: 64)
config.test_batch_size = 10 # input batch size for testing (default: 1000)
config.epochs = 50 # number of epochs to train (default: 10)
config.lr = 0.1 # learning rate (default: 0.01)
config.momentum = 0.1 # SGD momentum (default: 0.5)
config.no_cuda = False # disables CUDA training
config.seed = 42 # random seed (default: 42)
config.log_interval = 10 # how many batches to wait before logging training status
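Setting attributes one by one works, but the same hyperparameters can also be handed to wandb.init() as a single dict. A sketch of the equivalent setup (an alternative to the cell above, not something to run in addition to it):

hyperparameters = dict(batch_size=4, test_batch_size=10, epochs=50, lr=0.1,
                       momentum=0.1, no_cuda=False, seed=42, log_interval=10)
wandb.init(entity="freshmanbo", project="pytorch-intro", config=hyperparameters)
config = wandb.config  # same access pattern as before: config.lr, config.epochs, ...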
def main():
    use_cuda = not config.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

    # Set random seeds and deterministic pytorch for reproducibility
    # random.seed(config.seed)       # python random seed
    torch.manual_seed(config.seed)   # pytorch random seed
    # numpy.random.seed(config.seed) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Load the dataset: we're training our CNN on CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html)
    # First we define the transformations to apply to our images
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    # Now we load our training and test datasets and apply the transformations defined above
    train_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=True,
                                               download=True, transform=transform),
                                               batch_size=config.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=False,
                                              download=True, transform=transform),
                                              batch_size=config.test_batch_size, shuffle=False, **kwargs)
    classes = ('plane', 'car', 'bird', 'cat',
               'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

    # Initialize our model, recursively go over all modules and convert their parameters and buffers to CUDA tensors (if device is set to cuda)
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=config.lr, momentum=config.momentum)

    # WandB – wandb.watch() automatically fetches all layer dimensions, gradients and model parameters and logs them to your dashboard.
    # Using log="all" logs histograms of parameter values in addition to gradients.
    wandb.watch(model, log="all")  # track the model's gradients, layer dimensions, parameters, etc.

    for epoch in range(1, config.epochs + 1):
        train(config, model, device, train_loader, optimizer, epoch)
        test(config, model, device, test_loader, classes)

    # WandB – Save the model checkpoint. This automatically saves a file to the cloud and associates it with the current run.
    torch.save(model.state_dict(), "model.h5")
    wandb.save('model.h5')  # upload the checkpoint to W&B

if __name__ == '__main__':
    main()
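In a notebook it can also help to close the run explicitly once main() returns, so that later cells can start a fresh run. This one line is an optional addition:

wandb.finish()  # flush any buffered logs and mark the current run as finished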
In the W&B project you specified, the loss and accuracy tracked with wandb.log appear, along with the example images labeled with their predictions and ground truth.
Distribution plots of the gradients and parameters tracked with wandb.watch also show up.