
Demystifying Neural Networks: A Deep Dive into the Fundamentals

  • Writer: Samuel Black
  • 20 min read

This blog post aims to unravel the inner workings of neural networks, those powerful machine learning models that have revolutionized fields like image recognition, natural language processing, and many more. We'll embark on a journey through the core concepts, using the classic MNIST dataset of handwritten digits as our guide.

Neural Network - Colabcodes

A First Look at a Neural Network

Neural networks have revolutionized how we approach image recognition, natural language processing, and even autonomous driving. But what exactly is a neural network, and how does it work?

At its core, a neural network is a computational model inspired by the human brain. It consists of layers of interconnected neurons that process and transform data. When trained on labeled data, a neural network learns patterns and makes predictions.

For example, in the case of handwritten digit recognition, a neural network takes an image of a digit and outputs a probability distribution over the 10 possible classes (digits 0-9). It achieves this by passing the image through several layers of mathematical operations, adjusting internal parameters (weights and biases) to minimize classification errors.


Loading the MNIST Dataset with TensorFlow

The MNIST dataset is a benchmark dataset in the field of machine learning, widely used for handwritten digit classification. TensorFlow provides a convenient way to load it using the Keras API.


Importing and Loading the Data

Let's start by loading the dataset:

from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

This will automatically download the dataset (if not already available) and return two tuples:


  • Training set: train_images (60,000 images), train_labels (corresponding digit labels)

  • Test set: test_images (10,000 images), test_labels (corresponding labels)


Each image is a 28×28 grayscale matrix, where pixel values range from 0 to 255.


train_images.shape  # (60000, 28, 28)
len(train_labels)   # 60000
train_labels        # array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)
test_images.shape   # (10000, 28, 28)
len(test_labels)    # 10000
test_labels         # array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

  • train_images.shape → (60000, 28, 28) 60,000 training images, each of size 28×28 pixels (grayscale).

  • len(train_labels) → 60,000 corresponding labels (each label represents a digit from 0-9).

  • train_labels → array([...]) The actual labels for the training images (e.g., [5, 0, 4, ..., 5, 6, 8]).

  • test_images.shape → (10000, 28, 28) 10,000 test images, each of size 28×28 pixels.

  • len(test_labels) → 10,000 corresponding labels for the test set.

  • test_labels → array([...]) The actual labels for the test images (e.g., [7, 2, 1, ..., 4, 5, 6]).


The Network Architecture

Now that we have loaded the MNIST dataset, let's define a simple neural network using TensorFlow’s Keras API. This model consists of fully connected (Dense) layers that learn to classify handwritten digits.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation="relu"),
    layers.Dense(10, activation="softmax")
])

Understanding the Architecture


  1. Input Layer: The model expects an input of 28×28 pixels, which needs to be flattened into a 1D array of 784 values before being passed to the first layer.


  2. Hidden Layer: The hidden layer consists of a Dense (fully connected) layer with 512 neurons. It uses the ReLU (Rectified Linear Unit) activation function, which introduces non-linearity and helps the network learn complex patterns.


  3. Output Layer: The output layer is a Dense layer with 10 neurons, each corresponding to one of the digits (0-9). It uses the softmax activation function, which converts the outputs into probability scores that sum up to 1, helping the model determine the most likely digit.
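To make the input handling explicit, the same architecture can be written with a keras.Input specification up front, so the model is built immediately instead of on first call. A minimal sketch (model_explicit is just an illustrative name):

from tensorflow import keras
from tensorflow.keras import layers

# Equivalent model with an explicit input shape: the Input entry fixes the
# expected 784-value flattened vector up front.
model_explicit = keras.Sequential([
    keras.Input(shape=(28 * 28,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(10, activation="softmax")
])
model_explicit.summary()  # first Dense layer: 784 * 512 + 512 = 401,920 parameters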


Compiling the Model

Before training, we need to compile the model by specifying three key components: the optimizer, the loss function, and the evaluation metric.

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The optimizer used here is RMSprop (Root Mean Square Propagation). It adapts the learning rate dynamically for each parameter, helping the model converge efficiently.

The loss function, sparse categorical cross-entropy, is suitable when dealing with multi-class classification problems where the labels are provided as integers (0-9). If the labels were one-hot encoded, we would use categorical_crossentropy instead.

The evaluation metric is accuracy, which measures how often the predicted labels match the true labels. This helps us track the model’s performance during training and validation.
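For comparison, here is a sketch of the one-hot variant mentioned above (not used in the rest of this post); to_categorical is the standard Keras utility for this:

from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels (e.g. 5 -> [0,0,0,0,0,1,0,0,0,0]) ...
train_labels_onehot = to_categorical(train_labels, num_classes=10)

# ... and compile with categorical_crossentropy instead
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])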


Preparing the Data for Training

Before feeding the images into the neural network, we need to reshape and normalize them for optimal learning.

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

Each 28×28 grayscale image is reshaped into a 1D array of 784 values (since a Dense layer expects a 1D input rather than a 2D matrix).

The pixel values, originally ranging from 0 to 255, are converted to floating-point values between 0 and 1. This normalization improves training stability and ensures faster convergence by preventing large gradient updates.


Training the Model

With our data preprocessed and the model compiled, we can now train the neural network using the fit() method.

model.fit(train_images, train_labels, epochs=5, batch_size=128)

Output:
Epoch 1/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 16s 22ms/step - accuracy: 0.8690 - loss: 0.4443
Epoch 2/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 17s 14ms/step - accuracy: 0.9658 - loss: 0.1152
Epoch 3/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 10s 13ms/step - accuracy: 0.9780 - loss: 0.0727
Epoch 4/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9849 - loss: 0.0518
Epoch 5/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9886 - loss: 0.0383

The epochs parameter is set to 5, meaning the model will iterate over the entire dataset five times. Each epoch consists of multiple updates to the network's parameters to minimize the loss.

The batch_size is 128, meaning the model processes 128 images at a time before updating the weights. Using batch training improves efficiency while stabilizing the gradient updates.
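This also explains the 469/469 in the training log above: 60,000 images split into batches of 128 gives ⌈60,000 / 128⌉ = 469 weight updates per epoch.

import math

# Number of batches (and weight updates) per epoch
steps_per_epoch = math.ceil(60000 / 128)
print(steps_per_epoch)  # 469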


Interpreting the Results


  • In the first epoch, the model achieves 86.9% accuracy, showing that it is already learning meaningful patterns.


  • By the second epoch, accuracy jumps to 96.6%, and the loss drops significantly.


  • After five epochs, the model reaches 98.86% accuracy, demonstrating strong learning performance.


This rapid improvement indicates that our neural network is effectively recognizing handwritten digits.


Making Predictions

Now that the model is trained, let's use it to make predictions on some test images.

test_digits = test_images[0:10]
predictions = model.predict(test_digits)
predictions[0]

Output:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 103ms/step
array([2.4787747e-08, 6.0755856e-10, 2.3643456e-06, 1.5603322e-04,
       7.3208113e-11, 1.9166256e-08, 3.6490741e-12, 9.9984002e-01,
       5.3984682e-08, 1.5870924e-06], dtype=float32)

Each value in the output array represents the predicted probability for a digit (0-9). The model assigns a probability to each digit, indicating how confident it is that the given image belongs to that class.

predictions[0].argmax() #np.int64(7)

For this specific example, the element at index 7 has the highest probability, 9.9984 × 10⁻¹ (about 0.9998), meaning the model predicts that the digit in the image is 7.
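To read off the predictions for all ten test digits at once, argmax can be applied along the class axis and compared with the true labels (a quick sketch):

import numpy as np

# Predicted class for each of the ten test digits
predicted = predictions.argmax(axis=1)
print(predicted)         # predicted digits for test_images[0:10]
print(test_labels[:10])  # ground-truth labels for comparison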


Evaluating the Model

After training, it's important to evaluate the model's performance on unseen data. We do this by testing it on the test set using the evaluate() method.

test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"test_acc: {test_acc}")

Output:
313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9758 - loss: 0.0781
test_acc: 0.9793000221252441

The accuracy on the test set is 97.93%, which means our model correctly classifies 97.93% of the handwritten digits in the test dataset. The loss value of 0.0781 indicates the average error in predictions.

This high accuracy confirms that the model generalizes well to new data, making it a reliable digit classifier.


Data Representations for Neural Networks

Neural networks process data in the form of tensors. Let's explore how different types of data are represented in NumPy and TensorFlow.


Creating a Scalar (0D Tensor)

The value x is a 0D tensor (also called a scalar), meaning it contains just a single number and has no axes.

import numpy as np
x = np.array(12)  # A scalar value
print(x)          # Output: 12
print(x.ndim)     # Output: 0

  • x.ndim returns 0, confirming that the tensor has zero dimensions (it's a single value, not an array).


Vectors (Rank-1 Tensors)

A vector is a rank-1 tensor, meaning it has one axis (one dimension) and consists of an ordered sequence of numbers. In NumPy, we can create a vector using a 1D array.

import numpy as np
x = np.array([3, 7, 2, 9])  # A 1D array (vector)
print(x)         # Output: [3 7 2 9]
print(x.ndim)    # Output: 1

Understanding the Output

  • x is a 1D tensor (a vector) with four elements.

  • x.ndim returns 1, confirming that this tensor has one axis (one dimension).

Vectors are commonly used in machine learning to represent feature sets, weights, and word embeddings.


Matrices (Rank-2 Tensors)

A matrix is a rank-2 tensor, meaning it has two axes (dimensions): rows and columns. Matrices are commonly used in machine learning to represent datasets, images, and weight matrices in neural networks.

import numpy as np
x = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])  # A 2D array (matrix)
print(x)
print(x.ndim)  # Output: 2

  • x is a 2D tensor (matrix) with 3 rows and 3 columns.

  • x.ndim returns 2, confirming that this tensor has two axes (rows and columns).


Matrices are widely used in deep learning for storing datasets, representing images (grayscale), and performing linear transformations.


Rank-3 and Higher-Rank Tensors

A rank-3 tensor has three axes (dimensions). Rank-3 and higher-rank tensors are often used in deep learning to represent:

  • Color images (height × width × color channels)

  • Time series data (samples × timesteps × features)

  • Video data (frames × height × width × channels, a rank-4 tensor)

import numpy as np
x = np.array([[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]],
              [[13, 14, 15], [16, 17, 18]]])  # A 3D tensor
print(x.ndim)   # Output: 3
print(x.shape)  # Output: (3, 2, 3)

Understanding the Output

  • x.ndim returns 3, indicating a rank-3 tensor.

  • x.shape is (3, 2, 3), meaning:

    3 matrices (first dimension)

    2 rows in each matrix (second dimension)

    3 columns in each row (third dimension)

Higher-rank tensors (4D, 5D, etc.) are used in convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for complex data like video sequences and volumetric images.
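For instance, a batch of grayscale images like MNIST is naturally a rank-4 tensor with axes (samples, height, width, channels); a small sketch:

import numpy as np

# A hypothetical batch of 128 grayscale 28x28 images with a single channel
batch = np.zeros((128, 28, 28, 1))
print(batch.ndim)   # 4
print(batch.shape)  # (128, 28, 28, 1)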


The Gears of Neural Networks: Tensor Operations

Neural networks rely on tensors—the fundamental data structures that store and manipulate numerical data. Tensor operations are the real driving force behind deep learning models. These operations enable the transformation, combination, and optimization of data as it flows through the network.

From simple element-wise computations to complex matrix multiplications, tensor operations form the foundation of everything from activations and weight updates to gradient calculations during training.


Element-Wise Operations – ReLU (Rectified Linear Unit)

Element-wise operations allow computations to be applied independently to each element in a tensor. A common example is the ReLU (Rectified Linear Unit) activation function, which replaces negative values with zero while keeping positive values unchanged.

def naive_relu(x):
    assert len(x.shape) == 2  # Ensuring x is a 2D tensor (matrix)
    x = x.copy()  # Avoid modifying the original tensor
    for i in range(x.shape[0]):  # Iterate over rows
        for j in range(x.shape[1]):  # Iterate over columns
            x[i, j] = max(x[i, j], 0)  # Apply ReLU function
    return x

Element-Wise Addition in Neural Networks

Element-wise operations play a crucial role in deep learning computations. One of the simplest yet fundamental operations is element-wise addition, where corresponding elements of two tensors are added together.

def naive_add(x, y):
    assert len(x.shape) == 2  # Ensure both inputs are 2D tensors (matrices)
    assert x.shape == y.shape  # Ensure both tensors have the same shape
    x = x.copy()  # Create a copy to avoid modifying the original tensor
    for i in range(x.shape[0]):  # Iterate over rows
        for j in range(x.shape[1]):  # Iterate over columns
            x[i, j] += y[i, j]  # Perform element-wise addition
    return x

In deep learning, efficiency matters. While naive implementations of tensor operations using explicit loops help us understand the fundamentals, they are significantly slower compared to vectorized operations using NumPy.

The following code demonstrates a vectorized approach to performing element-wise addition and applying the ReLU activation function, measuring how long it takes to run 1,000 iterations.

import time
import numpy as np

x = np.random.random((20, 100))
y = np.random.random((20, 100))

t0 = time.time()
for _ in range(1000):
    z = x + y
    z = np.maximum(z, 0.)

print("Took: {0:.2f} s".format(time.time() - t0))

Output:
Took: 0.01 s

Comparing Naive and Vectorized Implementations

Now, let's compare the performance of the naive (loop-based) approach with the vectorized NumPy approach. The following code measures execution time when using the manually implemented naive_add and naive_relu functions:

t0 = time.time()
for _ in range(1000):
    z = naive_add(x, y)
    z = naive_relu(z)

print("Took: {0:.2f} s".format(time.time() - t0))

Output:
Took: 1.89 s

Broadcasting in Tensor Operations

Broadcasting is a powerful feature in NumPy that allows operations between arrays of different shapes without needing explicit loops or manual reshaping. This is particularly useful in deep learning when dealing with tensor operations efficiently.


Broadcasting automatically expands the dimensions of smaller arrays so that they match the shape of larger ones during operations. Instead of repeating values explicitly, NumPy optimizes memory usage and computation speed.


Example 1: Adding a Scalar to a Matrix

A scalar (single number) can be added to a matrix element-wise:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
y = 2  # Scalar

z = x + y  # Broadcasting
print(z)

Output:
[[3 4 5]
 [6 7 8]]

Example 2: Adding a Vector to a Matrix

A row vector can be added to a matrix:

x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([10, 20, 30])  # Shape (3,)

z = x + y  # Broadcasting
print(z)

Output:
[[11 22 33]
 [14 25 36]]

Optimizing Vector Operations in Neural Networks

In neural networks, efficient computation of mathematical operations is crucial for performance. One such fundamental operation is the dot product of two vectors, which is widely used in various layers, including dense (fully connected) and convolutional layers.


Naive Implementation of Vector Dot Product

A straightforward way to compute the dot product is by using a loop that iterates over the elements of both vectors, multiplying them element-wise and summing the results:

def naive_vector_dot(x, y):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]
    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

While this implementation is simple and easy to understand, it is not optimal for large-scale computations.


Optimized Approach Using NumPy

Instead of manually iterating over the elements, NumPy provides a highly efficient way to compute the dot product:

import numpy as np

def optimized_vector_dot(x, y):
    return np.dot(x, y)

NumPy's np.dot(x, y) is significantly faster than the naive implementation because it leverages highly optimized, low-level BLAS (Basic Linear Algebra Subprograms) routines.


Performance Comparison

To compare the efficiency of both approaches, consider the following benchmark:

import time

# Generate random vectors
x = np.random.random(10000)
y = np.random.random(10000)

# Naive implementation
t0 = time.time()
z_naive = naive_vector_dot(x, y)
print("Naive method took: {:.6f} seconds".format(time.time() - t0))

# NumPy optimized implementation
t0 = time.time()
z_numpy = np.dot(x, y)
print("NumPy method took: {:.6f} seconds".format(time.time() - t0))

# Check if both implementations return the same result
print("Results match:", np.allclose(z_naive, z_numpy))

Output:
Naive method took: 0.014139 seconds
NumPy method took: 0.000409 seconds
Results match: True

Extending the Dot Product to Matrices

In deep learning, matrix-vector multiplications are fundamental operations that occur frequently in fully connected layers, convolutional layers, and attention mechanisms. These operations allow models to process and transform data efficiently.


Naive Matrix-Vector Dot Product

Extending the dot product from vectors to matrices involves computing the dot product of each row in the matrix with the given vector. A simple implementation of this in Python is:

def naive_matrix_vector_dot(x, y):
    assert len(x.shape) == 2          # x must be a matrix
    assert len(y.shape) == 1          # y must be a vector
    assert x.shape[1] == y.shape[0]   # columns of x must match the length of y
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):       # one vector dot product per row of x
        z[i] = naive_vector_dot(x[i, :], y)
    return z

This approach breaks the problem into multiple vector dot products, each computed independently for every row in the matrix. However, this implementation can be slow for large matrices due to explicit loops.


A much faster and more efficient way to perform matrix-vector multiplication is to use NumPy’s np.dot() function, as discussed before.


Performance Comparison

To evaluate the performance of both methods, consider the following benchmark:

import time

# Generate a random matrix and a random vector (shapes chosen for illustration)
x = np.random.random((1000, 1000))
y = np.random.random(1000)

# Naive implementation
t0 = time.time()
z_naive = naive_matrix_vector_dot(x, y)
print("Naive method took: {:.6f} seconds".format(time.time() - t0))

# NumPy optimized implementation
t0 = time.time()
z_numpy = np.dot(x, y)
print("NumPy method took: {:.6f} seconds".format(time.time() - t0))

# Check if both implementations return the same result
print("Results match:", np.allclose(z_naive, z_numpy))

Output:
Naive method took: 0.541131 seconds
NumPy method took: 0.004768 seconds
Results match: True

Understanding Tensor Reshaping in Deep Learning

Reshaping tensors is a crucial operation in deep learning. It allows models to adapt data into the required shape for different computations, such as feeding inputs into a neural network or preparing images for processing.


Why Reshape Tensors?

  • Converts data into a compatible format for machine learning models.

  • Prepares input data for batch processing.

  • Optimizes memory usage by reducing unnecessary dimensions.

  • Enables operations like convolution, matrix multiplication, or vectorization.


Reshaping a Tensor in NumPy

NumPy provides a simple way to reshape tensors using the .reshape() method.

x = np.array([[0, 1, 2], [3, 4, 5]])  
print("Original shape:", x.shape)  

reshaped_x = x.reshape((3, 2))
print("Reshaped shape:", reshaped_x.shape)

Output:
Original shape: (2, 3)
Reshaped shape: (3, 2)

Reshaping tensors is a fundamental operation in deep learning. Whether you're processing images, handling sequential data, or optimizing neural networks, efficient reshaping ensures that data flows smoothly through various computations. Using NumPy and TensorFlow, developers can reshape tensors efficiently to enhance model performance and usability.
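One handy idiom: .reshape() can infer a single dimension when you pass -1, which saves computing sizes by hand. A small sketch:

import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5]])  # shape (2, 3)

# -1 asks NumPy to infer that dimension from the total element count
flat = x.reshape((-1,))    # shape (6,)
cols = x.reshape((3, -1))  # shape (3, 2)
print(flat.shape, cols.shape)  # (6,) (3, 2)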


Reshaping in Deep Learning (TensorFlow/Keras Example)

In deep learning frameworks like TensorFlow and Keras, reshaping is frequently used to convert input images into a suitable format for neural networks.

import tensorflow as tf  

# Create a random tensor (batch of 2 images, 28x28 pixels, 1 channel)
x = tf.random.normal((2, 28, 28, 1))  
print("Original shape:", x.shape)  

# Flatten each image into a 1D vector of 784 values
flattened_x = tf.reshape(x, (2, 784))  
print("Reshaped shape:", flattened_x.shape)

Output:
Original shape: (2, 28, 28, 1)
Reshaped shape: (2, 784)

The Gradient Tape in TensorFlow

tf.GradientTape automates differentiation, crucial for training neural networks. It records operations in the forward pass and computes gradients in the backward pass, making it essential for custom training loops and optimization.


Example: Optimizing a Variable Using Gradients

Let’s say we have a variable w that we want to optimize so as to minimize the function (w − 2)². The gradient of this function with respect to w is 2(w − 2), which we can use to update w.

import tensorflow as tf

# Define a variable
w = tf.Variable(4.0)

# Use GradientTape to compute gradients
with tf.GradientTape() as tape:
    loss = (w - 2) ** 2  # Function to minimize

# Compute the gradient of loss with respect to w
grad = tape.gradient(loss, w)

# Apply gradient descent update step
learning_rate = 0.1
w.assign_sub(learning_rate * grad)  # w = w - learning_rate * grad

print(f"Updated w: {w.numpy()}")  # Moves w closer to 2

Output:
Updated w: 3.5999999046325684

This example demonstrates how tf.GradientTape helps compute gradients and update variables, forming the foundation of backpropagation in deep learning.
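Running that same update in a loop drives w toward the minimum at 2, since each step scales (w − 2) by a factor of 1 − 2 × 0.1 = 0.8. A minimal sketch:

import tensorflow as tf

w = tf.Variable(4.0)
learning_rate = 0.1

# Repeat the record/differentiate/update cycle
for step in range(25):
    with tf.GradientTape() as tape:
        loss = (w - 2) ** 2
    grad = tape.gradient(loss, w)
    w.assign_sub(learning_rate * grad)

print(f"w after 25 steps: {w.numpy():.4f}")  # approximately 2.0076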


Reimplementing Our First Neural Network from Scratch in TensorFlow

To truly understand how a neural network operates, we can build it from scratch using TensorFlow. Below, we implement a simple sequential model and a custom dense layer without relying on Keras' high-level API.


Building a Custom Dense Layer in TensorFlow

To further break down the mechanics of a neural network, we implement a basic dense (fully connected) layer from scratch using TensorFlow. The NaiveDense class replicates the behavior of keras.layers.Dense, handling weight initialization, forward propagation, and activation.

import tensorflow as tf

class NaiveDense:
    def __init__(self, input_size, output_size, activation):
        self.activation = activation  # Store the activation function

        # Initialize weights with small random values
        w_shape = (input_size, output_size)
        w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
        self.W = tf.Variable(w_initial_value)

        # Initialize biases as zeros
        b_shape = (output_size,)
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)

    def __call__(self, inputs):
        """Perform forward pass: output = activation(W * inputs + b)"""
        return self.activation(tf.matmul(inputs, self.W) + self.b)

    @property
    def weights(self):
        """Return the trainable parameters (weights and biases)"""
        return [self.W, self.b]

How It Works:


  • Weight Initialization: The weights W are initialized with small random values to prevent large initial gradients.

  • Bias Initialization: The bias b is initialized as a zero vector.

  • Forward Pass: It computes output = activation(W * inputs + b), where W and b are trainable parameters.

  • Activation Function: Allows for non-linearity, enabling the model to learn complex patterns.


This custom layer can be combined with the NaiveSequential model to construct a fully functional neural network without using high-level Keras APIs.
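As a quick sanity check (a sketch with an arbitrary batch size), passing a random batch through the layer should produce one 512-value activation vector per sample:

import tensorflow as tf

# Pass a random batch of 32 flattened "images" through the custom layer
layer = NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu)
dummy_batch = tf.random.uniform((32, 28 * 28))
output = layer(dummy_batch)
print(output.shape)  # (32, 512)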


Creating a Custom Sequential Model

The NaiveSequential class is a simple implementation of a sequential neural network model. It takes a list of layers and applies them in order to process input data. This structure mimics TensorFlow's Sequential API, making it easy to stack layers and perform forward propagation. Additionally, the weights property allows access to all the weights of the model's layers.

class NaiveSequential:
    def __init__(self, layers):
        """
        Initialize a simple sequential model with a list of layers.

        Args:
            layers (list): A list of layer objects that will be applied in sequence.
        """
        self.layers = layers

    def __call__(self, inputs):
        """
        Perform a forward pass through the sequential model.

        Args:
            inputs (tensor): The input data to be passed through the layers.

        Returns:
            tensor: The final output after passing through all layers.
        """
        x = inputs
        for layer in self.layers:
            x = layer(x)  # Apply each layer sequentially
        return x

    @property
    def weights(self):
        """
        Retrieve the weights of all layers in the model.

        Returns:
            list: A list containing all weight tensors from each layer.
        """
        weights = []
        for layer in self.layers:
            weights += layer.weights  # Collect weights from all layers
        return weights



Now that we have our custom layers and sequential model class, let's put them to work by constructing a simple neural network. This model will classify images from the MNIST dataset, recognizing handwritten digits from 0 to 9.


We define our model using the NaiveSequential class and stack two dense layers:

model = NaiveSequential([
    NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])

The first layer has 512 neurons and uses the ReLU activation function, introducing non-linearity to the model. The second layer has 10 neurons and applies the softmax activation, turning the outputs into probabilities for each digit class.


Implementing a Batch Generator for Efficient Training

When training a neural network, handling large datasets efficiently is crucial. Instead of feeding the entire dataset at once, we process it in smaller batches. This helps with memory efficiency and allows for smoother gradient updates.


The BatchGenerator Class

Our BatchGenerator class takes care of dividing the dataset into manageable chunks:

import math

class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)  # Ensure images and labels are aligned
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)  # Total number of batches

    def next(self):
        """Returns the next batch of images and labels."""
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels
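As a usage sketch (assuming train_images and train_labels have already been reshaped and normalized as shown earlier), the generator yields batches of 128 flattened images:

generator = BatchGenerator(train_images, train_labels, batch_size=128)
images_batch, labels_batch = generator.next()
print(images_batch.shape)     # (128, 784)
print(labels_batch.shape)     # (128,)
print(generator.num_batches)  # 469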

Implementing a Single Training Step

To train a neural network, we perform multiple iterations where we:


  1. Make predictions

  2. Compute the loss

  3. Calculate gradients

  4. Update weights


The one_training_step function encapsulates these steps in TensorFlow:

def one_training_step(model, images_batch, labels_batch):
    with tf.GradientTape() as tape:
        # Forward pass: compute predictions
        predictions = model(images_batch)
        
        # Compute loss for each sample
        per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
            labels_batch, predictions)
        
        # Compute the average loss across the batch
        average_loss = tf.reduce_mean(per_sample_losses)
    
    # Compute gradients of the loss with respect to model weights
    gradients = tape.gradient(average_loss, model.weights)
    
    # Update model weights using the computed gradients
    update_weights(gradients, model.weights)
    
    return average_loss

How It Works

Gradient Tape plays a crucial role in automatic differentiation. It records all operations performed on tensors, allowing us to compute gradients efficiently.

Prediction & Loss Calculation begins with the model generating predictions using model(images_batch). The function then computes the per-sample losses using tf.keras.losses.sparse_categorical_crossentropy, which measures how far the predictions are from the actual labels. To ensure smooth optimization, tf.reduce_mean is used to calculate the average loss across the batch.

Gradient Calculation & Weight Update starts by computing gradients of the loss with respect to the model’s weights using tape.gradient. These gradients indicate the direction in which the weights should be adjusted to minimize the loss. Finally, update_weights applies these computed gradients to update the model’s parameters, ensuring that it learns from the given batch of data.


Updating Model Weights

Once we compute the gradients, the next crucial step in training a neural network is updating the weights. This is where the optimization process happens, allowing the model to learn from the data over multiple iterations.

learning_rate = 1e-3  # Defines how much the weights should change per update

def update_weights(gradients, weights):
    for g, w in zip(gradients, weights):
        # Update each weight using gradient descent
        w.assign_sub(g * learning_rate)

How It Works

By repeatedly applying update_weights after computing gradients, the model gradually refines its parameters, improving its performance over time.


  1. Learning Rate: The learning_rate controls the step size in weight updates. A small value ensures stable learning, while a large value can cause erratic updates.


  2. Gradient Descent: The function iterates through the computed gradients and corresponding model weights.


  3. Weight Update: Each weight is updated using w.assign_sub(g * learning_rate), which subtracts the gradient multiplied by the learning rate. This moves the weights in the direction that minimizes the loss.


Optimizing Model Weights with Stochastic Gradient Descent

Instead of manually updating weights using basic gradient descent, we can leverage TensorFlow's built-in optimizers for more efficient training.

from tensorflow.keras import optimizers

# Initialize the optimizer with a learning rate
optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients, weights):
    # Apply gradients to update model weights
    optimizer.apply_gradients(zip(gradients, weights))

By using TensorFlow's optimizer, we gain access to advanced optimization features like momentum and adaptive learning rates, making training more effective!
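For example, enabling momentum is a one-argument change (an illustrative value; the training run below uses plain SGD):

# SGD with momentum accumulates a velocity term that smooths weight updates
optimizer = optimizers.SGD(learning_rate=1e-3, momentum=0.9)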


Training the Model with Mini-Batch Gradient Descent

The fit function is responsible for training the model over multiple epochs using mini-batch gradient descent. It iterates through the dataset in small batches, updating the model weights step by step.

def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")

This approach ensures the model learns efficiently while keeping training manageable for large datasets!


Training the Model on the MNIST Dataset

Now, let's put everything together and train our neural network on the MNIST dataset! The MNIST dataset consists of 60,000 training images and 10,000 test images of handwritten digits (0-9). We preprocess the data by reshaping it into a flat vector and normalizing pixel values to the range [0,1].

from tensorflow.keras.datasets import mnist

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the training and test images
train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

# Train the model for 10 epochs with a batch size of 128
fit(model, train_images, train_labels, epochs=10, batch_size=128)

Output:
Epoch 0
loss at batch 0: 3.06
loss at batch 100: 2.21
loss at batch 200: 2.18
loss at batch 300: 2.07
loss at batch 400: 2.19

Epoch 1
loss at batch 0: 1.88
loss at batch 100: 1.85
loss at batch 200: 1.81
loss at batch 300: 1.69
loss at batch 400: 1.82

Epoch 2
loss at batch 0: 1.56
loss at batch 100: 1.56
loss at batch 200: 1.49
loss at batch 300: 1.41
loss at batch 400: 1.51

Epoch 3
loss at batch 0: 1.31
loss at batch 100: 1.32
loss at batch 200: 1.23
loss at batch 300: 1.20
loss at batch 400: 1.28

Epoch 4
loss at batch 0: 1.12
loss at batch 100: 1.14
loss at batch 200: 1.04
loss at batch 300: 1.04
loss at batch 400: 1.11

Epoch 5
loss at batch 0: 0.98
loss at batch 100: 1.00
loss at batch 200: 0.90
loss at batch 300: 0.93
loss at batch 400: 0.99

Epoch 6
loss at batch 0: 0.87
loss at batch 100: 0.90
loss at batch 200: 0.80
loss at batch 300: 0.84
loss at batch 400: 0.90

Epoch 7
loss at batch 0: 0.79
loss at batch 100: 0.82
loss at batch 200: 0.72
loss at batch 300: 0.77
loss at batch 400: 0.84

Epoch 8
loss at batch 0: 0.73
loss at batch 100: 0.75
loss at batch 200: 0.66
loss at batch 300: 0.72
loss at batch 400: 0.79

Epoch 9
loss at batch 0: 0.68
loss at batch 100: 0.70
loss at batch 200: 0.61
loss at batch 300: 0.67
loss at batch 400: 0.74

First, we load the MNIST dataset using mnist.load_data(), which gives us training and test sets. Since the images are 28×28 pixel grids, we reshape them into simple 784-dimensional vectors to make them easier for the model to process. To help the model train better, we also scale the pixel values from 0-255 down to a range of 0 to 1. Finally, we call the fit function to train the model for 10 epochs using mini-batches of 128 images at a time. This step-by-step process helps our neural network recognize patterns in handwritten digits and make accurate predictions!


Evaluating the Model

Once training is done, it’s time to see how well our model performs! We pass the test images through the model to get predictions. Since the model outputs probabilities for each digit (0-9), we use np.argmax to find the most likely label. Comparing these predictions to the actual test labels, we calculate the accuracy—how many predictions were correct. In this case, we achieve 81% accuracy, which is a solid start for a simple neural network!

# Generate predictions for the test images using the trained model
predictions = model(test_images)

# Convert TensorFlow tensor to a NumPy array for easier processing
predictions = predictions.numpy()

# Determine the predicted labels by selecting the index of the highest probability
predicted_labels = np.argmax(predictions, axis=1)

# Compare predicted labels with actual test labels to find correct matches
matches = predicted_labels == test_labels

# Calculate and print the accuracy as the mean of correct predictions
print(f"accuracy: {matches.mean():.2f}")

Conclusion

Building a neural network from scratch in TensorFlow has been an insightful journey. We started by understanding the fundamental building blocks, such as tensors and element-wise operations, then moved on to implementing essential neural network components like dense layers, activation functions, and weight updates. We also explored how TensorFlow’s Gradient Tape efficiently computes gradients for backpropagation, making the training process seamless.

Through this hands-on approach, we saw how a model learns patterns in handwritten digits using the MNIST dataset. The training process involved data preprocessing, mini-batch gradient descent, and iterative weight updates, culminating in a model that achieves around 81% accuracy on test data. While this accuracy isn’t state-of-the-art, it demonstrates the core mechanics of deep learning, and there’s room for improvement by using deeper architectures, better optimizers, and regularization techniques.

This project highlights the importance of understanding neural networks beyond high-level frameworks. By reimplementing key components manually, we gain deeper insights into how machine learning models truly work under the hood. Whether you're a beginner or an experienced practitioner, this foundational knowledge will empower you to design and optimize more sophisticated AI models in the future.



 

🚀Let’s Bring Your Deep Learning Project to Life!

If you need freelance support or mentorship in machine learning, deep learning, computer vision, NLP, or data science, ColabCodes is here to help! Whether you're working on a project, need guidance, or want to sharpen your skills, we offer expert assistance tailored to your needs.


👉 Connect with our machine learning freelancers today – only at ColabCodes!

📩 Contact us at contact@colabcodes.com or visit this link for a specific plan.


Let's build something great together!
