Creating new neural network architectures can be quite time-consuming, especially in real-world workflows where numerous models are trained during the experimentation and design phase. In addition to being wasteful, the traditional method of training every new model from scratch slows down the entire design process. In a typical workflow, several models are trained in sequence, each attempting to improve on the one before it. However, because the iterative design approach requires a full cycle of training and evaluation for every model, feedback on whether each change is an improvement is delayed.
The Net2Net procedure offers a solution to this problem. It’s a cool and simple method that helps address the challenges of the iterative design process to some extent.
This article was published as a part of the Data Science Blogathon.
The Net2Net strategy involves a teacher network and a student network. We initialize the student network (the new model) to represent the same function as the teacher network (the previous model). Knowledge is transferred by using the previous model as the base and applying one of the Net2Net methods (Net2WiderNet or Net2DeeperNet). As a result, before any training of the student network, its output matches that of the teacher network, even though the two architectures differ.
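Suppose we have a teacher network, represented by the function y = f(x; θ), where x is the input to the network, y is its output, and θ is the set of its parameters.
Now, we want to initialize a student network, represented by the function g(x; θ′), where θ′ is the set of parameters of the student network.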
The goal is to choose a new set of parameters θ′ for the student network in such a way that, for every input x, the output of the student network matches the output of the teacher network:
∀x, f(x;θ) = g(x;θ′)
There are two ways of using Net2Net: Increase the width or the depth of the network.
In Net2WiderNet, the width of the neural network increases. The method replaces a layer with a wider layer, so the number of units is increased. In convolutional architectures, this means having more channels.
Specifically, if layer i and layer i+1 are both fully connected layers and layer i uses an elementwise nonlinearity, Net2WiderNet allows you to replace layer i with a layer that has more units (wider layer).
The teacher network weights can be represented as W^(i), where i is the layer index. To create a consistent random mapping g^(i) for every layer, use forward inference.
Replicate the current weights of each layer i using the random mapping, and introduce a new weight matrix U^(i) for the wider layer.
Check that the wider layer has been initialized before moving on to the following steps. If it has not, carry out the initialization step again.
Let us examine a particular scenario in which layers i and i+1 are fully connected layers. The original weights are W^(i) ∈ R^(m×n) and W^(i+1) ∈ R^(n×p). The aim is to widen layer i so that it has q outputs, where q > n.
Define a random mapping function g^(i): {1,2,…,q} → {1,2,…,n} that satisfies the following:
For every j ≤ n, g^(i)(j) = j.
For every j > n, g^(i)(j) is a random sample taken from {1,2,…,n}.
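The mapping described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `random_mapping` is hypothetical, and the indices are 0-based here, whereas the text counts units from 1.

```python
import numpy as np


def random_mapping(n, q, rng):
    """Build g as an array of length q (hypothetical helper, 0-based indices).

    The first n entries map every original unit to itself; the remaining
    q - n entries are sampled uniformly from the original units.
    """
    assert q > n, "the wider layer must have more units than the original"
    extra = rng.integers(0, n, size=q - n)  # random picks among the n original units
    return np.concatenate([np.arange(n), extra])


rng = np.random.default_rng(0)
g = random_mapping(n=3, q=5, rng=rng)
print(g)  # the first three entries are 0, 1, 2; the last two are random copies
```

Every new unit is therefore a copy of an existing one, which is what later makes it possible to keep the network's output unchanged.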
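For the wider layer, new weight matrices U^(i) and U^(i+1) are introduced. The random mapping function is used to copy the weights from the original layer into the wider layer: U^(i)[k, j] = W^(i)[k, g(j)].
The replication factor counts how many times a given unit is reproduced in the wider layer. The outgoing weights of each copy are divided by this factor, U^(i+1)[j, h] = W^(i+1)[g(j), h] / |{x : g(x) = g(j)}|, so that the copies together contribute exactly what the original unit did.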
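To make the role of the replication factor concrete, here is a minimal NumPy sketch for two fully connected layers with a ReLU in between. The function name `widen`, the toy shapes, and the use of biases are assumptions made for illustration, not code from the Net2Net paper.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def widen(w1, b1, w2, q, rng):
    """Widen the layer (w1, b1) from n to q units and adjust the next layer w2."""
    n = w1.shape[1]
    # Random mapping g: identity on the first n units, random copies afterwards
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])
    # Replication factor: how many times each original unit appears in g
    counts = np.bincount(g, minlength=n)
    u1 = w1[:, g]                        # copy incoming weights (columns)
    c1 = b1[g]                           # copy biases
    u2 = w2[g, :] / counts[g][:, None]   # split outgoing weights among the copies
    return u1, c1, u2


rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
w2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
u1, c1, u2 = widen(w1, b1, w2, q=5, rng=rng)

x = rng.normal(size=(8, 4))
teacher_out = relu(x @ w1 + b1) @ w2 + b2
student_out = relu(x @ u1 + c1) @ u2 + b2
max_gap = np.max(np.abs(teacher_out - student_out))
print(max_gap)  # should be zero up to floating-point rounding
```

Dividing by the replication factor is the step that keeps the student's output equal to the teacher's; copying the weights without it would count the duplicated units more than once.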
Input

Teacher Network (Original Size)

Layer 1: (W, U)

Layer 2: (W, U)

Layer 3: (W, U)

…

Layer n: (W, U)

Wider Layer: (U, New Connections)

Output
Input: The network’s first input.
Teacher Network: The original neural network, represented by W^(i), with weights for each layer i.
Layers 1–n: The teacher network’s existing layers, each with weights W and extra broader weights U.
Wider Layer: The layer broadened by the Net2WiderNet method includes new connections and weights.
Output: The network’s ultimate output.
# Importing libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def net2wider_net(teacher_model, scale_factor, seed=0):
    """Widen every hidden Dense layer of a Dense-only Sequential model.

    Assumes each Dense layer uses a bias and an elementwise activation.
    """
    rng = np.random.default_rng(seed)
    dense_layers = [l for l in teacher_model.layers if isinstance(l, layers.Dense)]
    # One [kernel, bias] pair per Dense layer
    weights = [list(l.get_weights()) for l in dense_layers]

    # Widen hidden layer i and rescale the incoming weights of layer i + 1
    for i in range(len(dense_layers) - 1):
        kernel, bias = weights[i]
        output_dim = kernel.shape[1]
        # Calculate the new width of the layer based on the scale factor
        widened_dim = int(output_dim * scale_factor)
        # Random mapping: original units map to themselves, new units are random copies
        mapping = np.concatenate([np.arange(output_dim),
                                  rng.integers(0, output_dim, size=widened_dim - output_dim)])
        # Replication factor: how many times each original unit appears
        counts = np.bincount(mapping, minlength=output_dim)
        weights[i] = [kernel[:, mapping], bias[mapping]]
        next_kernel = weights[i + 1][0]
        weights[i + 1][0] = next_kernel[mapping, :] / counts[mapping][:, np.newaxis]

    # Build the student model with the new widths and load the widened weights
    input_dim = weights[0][0].shape[0]
    student_model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for layer, (kernel, bias) in zip(dense_layers, weights):
        new_layer = layers.Dense(kernel.shape[1], activation=layer.activation)
        student_model.add(new_layer)
        new_layer.set_weights([kernel, bias])
    return student_model


# Example usage:
teacher_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Apply Net2WiderNet with a scale factor of 1.5
scale_factor = 1.5
wider_student_model = net2wider_net(teacher_model, scale_factor)
In this experiment, the researchers started with a smaller neural network (teacher network) by reducing the number of convolution channels in each layer. This made the model simpler, with fewer parameters. They trained this smaller network and then used it to speed up the training of a regular-sized network (student network) through Net2WiderNet.
The results showed that the Net2WiderNet approach led to faster convergence compared to other methods. Importantly, despite the faster training, the final accuracy of the model initialized with Net2WiderNet was the same as that of a model trained from scratch. This means that Net2WiderNet allows researchers to reach the same level of accuracy more quickly, saving time in running experiments without sacrificing the final performance of the model.
In the Net2DeeperNet method, the depth of the neural network is increased by converting the existing network into a deeper one. The basic concept is to replace a layer h^(i) = ϕ(h^(i-1)T W^(i)) with two layers, h^(i) = ϕ(U^(i)T ϕ(W^(i)T h^(i-1))), where the new matrix U^(i) is initialized to the identity matrix.
The main constraint is that we increase the depth of the network while keeping its overall structure similar. The reason for increasing the depth is that deeper architectures can represent more complex functions and capture more complex patterns in the data.
The Net2DeeperNet method factorizes a layer L^(i) into two layers: the original layer L^(i) and a new layer initialized as the identity mapping I. This factorization enables a smooth shift to deeper topologies without losing what the network has already learned.
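Original Layer: h^(i) = phi(h^(i-1)T * W^(i))
Net2DeeperNet Transformation: h^(i) = phi(U^(i)T * phi(W^(i)T * h^(i-1))), with U^(i) initialized to I
Inner Layer (copied from the teacher): phi(W^(i)T * h^(i-1))
Outer Layer (new): phi(U^(i)T * ...), which equals phi(I * ...) at initialization
Note: I is the identity matrix, so the new layer starts out as an identity mapping.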
This transformation replaces a single layer h^(i) with two layers, creating a deeper structure while retaining the original network’s general function. The type of the layers involved and the activation function phi determine the precise shape of the transformation.
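The dependence on the activation function can be checked numerically. The sketch below is an illustration with made-up shapes, not code from the paper: it inserts an identity-initialized layer after a fully connected layer and compares the outputs for ReLU and for tanh.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
h_prev = rng.normal(size=(8, 4))   # activations of layer i-1 for 8 samples
w = rng.normal(size=(4, 6))        # teacher weights W^(i)
u = np.eye(6)                      # new layer U^(i), initialized to the identity

# ReLU: relu(relu(z) @ I) == relu(z), because relu(z) is already non-negative
h_teacher = relu(h_prev @ w)
h_student = relu(relu(h_prev @ w) @ u)
relu_preserved = np.allclose(h_teacher, h_student)

# tanh: tanh(tanh(z) @ I) != tanh(z) in general, so the identity trick fails
t_teacher = np.tanh(h_prev @ w)
t_student = np.tanh(np.tanh(h_prev @ w) @ u)
tanh_preserved = np.allclose(t_teacher, t_student)

print(relu_preserved, tanh_preserved)
```

The identity initialization only preserves the function when applying the activation twice gives the same result as applying it once, which holds for ReLU but not for tanh or sigmoid.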
# Importing libraries
import tensorflow as tf
from tensorflow.keras import layers


def net2deeper_net(teacher_model):
    """Add an identity-initialized Dense layer after each ReLU Dense layer.

    Assumes the teacher is a Sequential model made only of Dense layers.
    """
    input_dim = teacher_model.layers[0].get_weights()[0].shape[0]
    student_model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    # Iterate through layers in the teacher model
    for layer in teacher_model.layers:
        output_dim = layer.get_weights()[0].shape[1]
        # Copy the teacher's Dense layer together with its weights
        copied_layer = layers.Dense(output_dim, activation=layer.activation,
                                    use_bias=layer.use_bias)
        student_model.add(copied_layer)
        copied_layer.set_weights(layer.get_weights())
        # relu(I * h) == h when h >= 0, so only deepen after ReLU layers
        if getattr(layer.activation, "__name__", "") == "relu":
            new_layer = layers.Dense(output_dim, activation='relu', use_bias=True,
                                     kernel_initializer=tf.keras.initializers.Identity(),
                                     bias_initializer='zeros')
            student_model.add(new_layer)
    return student_model


# Example usage:
teacher_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Apply Net2DeeperNet
deeper_student_model = net2deeper_net(teacher_model)
In these experiments, the researchers used the Net2DeeperNet method to make the model deeper, focusing on the convolutional layers. They used a standard Inception model as the teacher network and increased the depth of its Inception modules. The convolutional layers in these modules use rectangular kernels and are arranged in pairs: one layer uses a vertical kernel, and the following layer uses a horizontal kernel.
The results indicated that using Net2DeeperNet led to significantly faster improvement in accuracy compared to training from random initialization, both in terms of training and validation accuracy. In simpler terms, they made the Inception model deeper, and it learned more quickly while achieving good accuracy.
Fig: Training Accuracy of Different Methods
Fig: Validation Accuracy of Different Methods
We are developing code for the MNIST dataset. Initially, we create the teacher model and then transfer all the weights to expand the depth of the architecture. Subsequently, we build both the student and deeper student architectures. Finally, we observe the output.
!pip install keras numpy
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.datasets import mnist
#from keras.utils import to_categorical
from tensorflow.keras.utils import to_categorical
import numpy as np
np.random.seed(1337)
input_shape = (28, 28, 1) # Image shape
# Load and preprocess data
(train_x, train_y), (validation_x, validation_y) = mnist.load_data()
# Preprocess input data: reshape and normalize
preprocess_input = lambda x: x.reshape((-1, 28, 28, 1)) / 255.
preprocess_output = lambda y: to_categorical(y)
train_x, validation_x = map(preprocess_input, [train_x, validation_x])
train_y, validation_y = map(preprocess_output, [train_y, validation_y])
# Display data shapes
print("Loading MNIST data...")
print("train_x shape:", train_x.shape, "train_y shape:", train_y.shape)
print("validation_x shape:", validation_x.shape, "validation_y shape", validation_y.shape, "\n")
def wider2net_fc(teacher_w1, teacher_b1, teacher_w2, new_width, init):
    """Get initial weights for a wider, fully connected (dense) layer with a bigger nout,
    by random padding ('randompad') or 'net2wider'.
    # Arguments
    teacher_w1: `weight` of fc layer to become wider, of shape (nin1, nout1)
    teacher_b1: `bias` of fc layer to become wider, of shape (nout1, )
    teacher_w2: `weight` of next connected fc layer, of shape (nin2, nout2)
    new_width: new `nout` for the wider fc layer
    init: initialization algorithm for new weights, either 'randompad' or 'net2wider'
    """
    assert teacher_w1.shape[1] == teacher_w2.shape[0]  # nout1 == nin2 for connected layers
    assert teacher_w1.shape[1] == teacher_b1.shape[0]
    assert new_width > teacher_w1.shape[1]
    # Number of units to add
    n = new_width - teacher_w1.shape[1]
    if init == 'randompad':
        new_w1 = np.random.normal(0, 0.1, size=(teacher_w1.shape[0], n))
        new_b1 = np.ones(n) * 0.1
        new_w2 = np.random.normal(0, 0.1, size=(n, teacher_w2.shape[1]))
    elif init == 'net2wider':
        # Pick which existing units to copy, and count each unit's copies (original included)
        index = np.random.randint(teacher_w1.shape[1], size=n)
        factors = np.bincount(index)[index] + 1.
        new_w1 = teacher_w1[:, index]
        new_b1 = teacher_b1[index]
        new_w2 = teacher_w2[index, :] / factors[:, np.newaxis]
    else:
        raise ValueError("Unsupported weight initializer: %s" % init)
    student_w1 = np.concatenate((teacher_w1, new_w1), axis=1)
    student_w2 = np.concatenate((teacher_w2, new_w2), axis=0)
    if init == 'net2wider':
        # The original units that were copied must be rescaled as well
        student_w2[index, :] = new_w2
    student_b1 = np.concatenate((teacher_b1, new_b1), axis=0)
    return student_w1, student_b1, student_w2
def deeper2net_conv2d(teacher_w):
    """Get initial weights for a deeper conv2d layer by 'net2deeper'.
    # Arguments
    teacher_w: `weight` of previous conv2d layer, of shape (nb_filter, nb_channel, h, w)
    """
    nb_filter, nb_channel, h, w = teacher_w.shape
    student_w = np.zeros((nb_filter, nb_filter, h, w))
    # Identity kernel: a single 1 at the kernel centre, connecting each channel to itself
    for i in range(nb_filter):
        student_w[i, i, (h - 1) // 2, (w - 1) // 2] = 1.
    student_b = np.zeros(nb_filter)
    return student_w, student_b
def copy_weights(teacher_model, student_model, layer_names):
    """Copy weights from teacher_model to student_model,
    for layers listed in layer_names, ensuring compatible shapes."""
    for name in layer_names:
        teacher_layer = teacher_model.get_layer(name)
        student_layer = student_model.get_layer(name)
        if teacher_layer.get_weights()[0].shape == student_layer.get_weights()[0].shape:
            student_layer.set_weights(teacher_layer.get_weights())
            print(f"Weights successfully copied to layer: {name}")
        else:
            print(f"Skipping layer {name} due to incompatible shapes.")
def make_teacher_model(train_data, validation_data):
    """Train a simple CNN as a teacher model."""
    model = Sequential()
    model.add(Conv2D(64, (3, 3), input_shape=input_shape, padding="same", name="conv1"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), name="pool1"))
    model.add(Conv2D(128, (3, 3), padding="same", name="conv2"))
    model.add(MaxPooling2D(name="pool2"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation="relu", name="fc1"))
    model.add(Dense(10, activation="softmax", name="fc2"))
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
    train_x, train_y = train_data
    history = model.fit(train_x, train_y, epochs=1, validation_data=validation_data)
    # Print layer shapes for verification
    # (layer.output.shape is used because newer Keras versions drop layer.output_shape)
    print("Shapes after training:")
    for layer in model.layers:
        print(layer.name, layer.output.shape)
    return model, history
def make_deeper_student_model(teacher_model, train_data, validation_data, init):
    """Train a deeper student model based on teacher_model, with either 'randominit' (baseline)
    or 'net2deeper'
    """
    model = Sequential()
    # Kernel size must be passed as (3, 3); Conv2D(64, 3, 3) would set strides=3 instead
    model.add(Conv2D(64, (3, 3), input_shape=input_shape, padding="same", name="conv1"))
    model.add(MaxPooling2D(name="pool1"))
    model.add(Conv2D(128, (3, 3), padding="same", name="conv2"))
    # Check the dimensions after the second convolutional layer
    model.add(MaxPooling2D(name="pool2"))
    print("Dimensions after pool2:", model.output_shape)
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation="relu", name="fc1"))
    # Add another fc layer to make original fc1 deeper
    if init == "net2deeper":
        # Net2deeper for fc layer with relu is just an identity initializer
        model.add(Dense(128, kernel_initializer="identity", activation="relu", name="fc1deeper"))
    elif init == "randominit":
        model.add(Dense(128, activation="relu", name="fc1deeper"))
    else:
        raise ValueError("Unsupported weight initializer: %s" % init)
    model.add(Dense(10, activation="softmax", name="fc2"))
    # Copy weights for other layers
    copy_weights(teacher_model, model, layer_names=["conv1", "conv2", "fc1", "fc2"])
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
    train_x, train_y = train_data
    history = model.fit(train_x, train_y, epochs=3, validation_data=validation_data)
    return model, history
def net2deeper_experiment():
    train_data = (train_x, train_y)
    validation_data = (validation_x, validation_y)
    print("Experiment of Net2DeeperNet ...")
    # Build teacher model
    teacher_model, teacher_history = make_teacher_model(train_data, validation_data)
    # Build deeper student model with random initialization
    random_student_model, random_student_history = make_deeper_student_model(
        teacher_model, train_data, validation_data, "randominit")
    # Build deeper student model with net2deeper initialization
    net2deeper_student_model, net2deeper_student_history = make_deeper_student_model(
        teacher_model, train_data, validation_data, "net2deeper")


# Run the experiment
net2deeper_experiment()
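The experiment above deepens only a fully connected layer. For convolutional layers, the same idea relies on an identity kernel, which can be checked with a small NumPy sketch. The naive convolution, the helper names, and the channels-last layout (height, width, channels) are assumptions chosen for illustration and differ from the layout used in deeper2net_conv2d.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive zero-padded 'same' convolution.

    x has shape (H, W, C_in); kernel has shape (kh, kw, C_in, C_out).
    """
    kh, kw, c_in, c_out = kernel.shape
    ph, pw = (kh - 1) // 2, (kw - 1) // 2
    padded = np.pad(x, ((ph, kh - 1 - ph), (pw, kw - 1 - pw), (0, 0)))
    out = np.zeros(x.shape[:2] + (c_out,))
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            patch = padded[r:r + kh, c:c + kw, :]
            # Sum over kernel height, kernel width, and input channels
            out[r, c] = np.tensordot(patch, kernel, axes=3)
    return out


def identity_conv_kernel(n_channels, kh=3, kw=3):
    """Kernel with a single 1 at its centre, linking each channel to itself."""
    kernel = np.zeros((kh, kw, n_channels, n_channels))
    for ch in range(n_channels):
        kernel[(kh - 1) // 2, (kw - 1) // 2, ch, ch] = 1.0
    return kernel


rng = np.random.default_rng(0)
feature_map = np.maximum(rng.normal(size=(6, 6, 4)), 0.0)  # post-ReLU activations
identity_out = conv2d_same(feature_map, identity_conv_kernel(4))
conv_preserved = np.allclose(identity_out, feature_map)
print(conv_preserved)
```

Because the feature map is already non-negative after ReLU, applying ReLU again after the identity convolution leaves it unchanged, so the added convolutional layer does not alter the network's output at initialization.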
Because of the function-preserving strategy adopted, the new larger network (student network) performs exactly as well as the old network (teacher network) from the start, rather than going through a period of low performance.
Additionally, compared to randomly initialized networks, Net2Net-trained networks converge to the same accuracy more quickly. In the reported experiments, the final accuracy depended on the size of the network rather than on how it was initialized.
The authors of the paper illustrate the benefits of training with Net2Net when developing new designs and conducting testing through graphs showing the results of tests.
In conclusion, the Net2Net method proves to be valuable for designing neural networks and facilitating effective knowledge transfer during training. The results indicate faster training and a reduction in the time needed to build a new model compared to training it from scratch. The researchers experimented with two types of Net2Net: Net2WiderNet, which increases the width of the neural network, and Net2DeeperNet, which increases the depth while maintaining the initial model's structure. Both methods sped up training without hurting the final accuracy. However, further improvements are necessary for Net2Net to enable more efficient neural network designs, especially as deep learning continues to advance.
A. The Net2Net procedure accelerates training by efficiently transferring knowledge from a smaller network (teacher) to a larger one (student), reducing the need for training the larger network from scratch.
A. Net2Net enables quick exploration of the design space by transforming existing state-of-the-art architectures, allowing for faster experimentation and improved results in deep learning.
A. Net2WiderNet accelerates convergence to the same accuracy as random initialization, while Net2DeeperNet achieves good accuracy much faster than training from random initialization.
A. Net2Net demonstrates the possibility of transferring knowledge rapidly between neural networks, providing a technique for exploring model families more rapidly and reducing the time required for typical machine learning workflows.