# Overview 

In machine learning, **gradient descent** is an optimization technique used to find the model parameters that minimize a cost function.

## Types of gradient descent 
Depending on how much of the dataset is used to compute each update, there are three types of gradient descent algorithms.

1. Batch gradient descent 

2. Stochastic gradient descent 

3. Mini-Batch gradient descent 

In this shot, we'll focus on *stochastic gradient descent*.
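To make the distinction between the three variants concrete, here is a minimal sketch of how many samples each one would use for a single parameter update. The function name `indices_for_update` and the default `mini_batch_size` of 32 are illustrative assumptions, not part of any library.

```python
import random

def indices_for_update(variant, n_samples, rng, mini_batch_size=32):
    """Return the sample indices that one parameter update would use."""
    if variant == "batch":
        # Batch gradient descent: every sample contributes to each update.
        return list(range(n_samples))
    if variant == "stochastic":
        # Stochastic gradient descent: one randomly chosen sample per update.
        return [rng.randrange(n_samples)]
    if variant == "mini-batch":
        # Mini-batch gradient descent: a small random subset per update.
        return rng.sample(range(n_samples), min(mini_batch_size, n_samples))
    raise ValueError(f"unknown variant: {variant}")

rng = random.Random(0)
for variant in ("batch", "stochastic", "mini-batch"):
    print(variant, len(indices_for_update(variant, 1000, rng)))
# prints:
# batch 1000
# stochastic 1
# mini-batch 32
```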

## Stochastic gradient descent 
In standard (batch) gradient descent, we use the entire dataset to calculate the gradient at every iteration. The downside of this approach becomes apparent as the dataset grows: every iteration processes all the samples, and this repeats until a minimum is found, making the algorithm inefficient and resource-intensive.
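As a rough illustration of that cost, the sketch below runs full-batch gradient descent on a toy problem. The one-parameter model `y ≈ theta * x`, the mean squared error cost, and the data values are all assumptions chosen for the example. Note that `batch_gradient` loops over every sample each time it is called.

```python
def batch_gradient(theta, xs, ys):
    """Gradient of the mean squared error for y ≈ theta * x, using every sample."""
    n = len(xs)
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n

def batch_gradient_descent(theta0, learning_rate, no_iterations, xs, ys):
    theta = theta0
    for _ in range(no_iterations):
        # Each iteration touches the whole dataset, so its cost grows with len(xs).
        theta = theta - learning_rate * batch_gradient(theta, xs, ys)
    return theta

# Toy data generated from y = 2x, so the best theta is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta = batch_gradient_descent(0.0, 0.05, 100, xs, ys)
print(theta)  # very close to 2.0
```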

Therefore, a better approach for large datasets is **stochastic gradient descent (SGD)**, in which only a few dataset items (a single item, in the strictest form) are sampled and used for each iteration. The sample is drawn randomly after shuffling the dataset.
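The shuffle-then-sample step could look something like the sketch below. The helper name `sample_after_shuffle`, the `(x, y)` pair layout, and the sample size of 2 are assumptions made for illustration.

```python
import random

def sample_after_shuffle(dataset, sample_size, rng):
    """Shuffle a copy of the dataset and return its first sample_size items."""
    shuffled = list(dataset)  # copy, so the caller's ordering is left untouched
    rng.shuffle(shuffled)
    return shuffled[:sample_size]

# Hypothetical dataset of (x, y) pairs.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
rng = random.Random(0)  # seeded only to make the run repeatable
sample = sample_after_shuffle(dataset, 2, rng)
print(sample)
```

Only the items in `sample` would be used to compute the gradient for that iteration; the next iteration draws a fresh sample.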

Because each update is based on a random sample rather than the full dataset, SGD typically takes more iterations to reach a minimum, and the path it takes to get there is noisier. Each iteration, however, is much cheaper to compute.

This is illustrated in the image below. 
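One way to see where the noise comes from is to compare the gradient computed from each individual sample with the full-batch gradient at the same point. The sketch below assumes a one-parameter model `y ≈ theta * x` with a squared error cost; the data values are made up for illustration.

```python
def per_sample_gradients(theta, xs, ys):
    """Squared-error gradient for y ≈ theta * x, computed separately per sample."""
    return [2 * (theta * x - y) * x for x, y in zip(xs, ys)]

# Illustrative values that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

gradients = per_sample_gradients(0.0, xs, ys)
# The full-batch gradient is the average of the per-sample gradients.
full_gradient = sum(gradients) / len(gradients)

print("full-batch gradient:", full_gradient)
print("single-sample gradients:", gradients)
```

The single-sample gradients scatter around the full-batch value, so an update based on one random sample points in roughly the right direction on average but rarely matches the full-batch step exactly.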



## Code 
Let's look at an example pseudocode implementation of SGD in Python. 

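```python
def SGD(theta0, learning_rate, no_iterations):
    # Start from the initial parameter values.
    theta = theta0
    for _ in range(no_iterations):
        # predict is assumed to return the cost and its gradient,
        # computed on a randomly sampled part of the dataset.
        cost, gradient = predict(theta)
        # Step against the gradient, scaled by the learning rate.
        theta = theta - (learning_rate * gradient)
    return theta
```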

## Explanation 
In the code above, `theta0` is the initial point from which SGD starts, `learning_rate` is the step size applied to each update, and `no_iterations` is the total number of iterations for which SGD will run.

We define a function that takes these three parameters. We assume a `predict` function has been implemented that returns the cost and its gradient at the current `theta`, evaluated on a randomly sampled part of the dataset. At each iteration, we move `theta` a small step against the gradient. Once the iterations are exhausted, the final `theta` is the output.
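To try the loop end to end, here is one possible `predict`, written for a toy one-parameter model `y ≈ theta * x` with a squared error cost. The model, the data, and the seeded random generator are assumptions for this example only, and the loop is repeated (with a `return`) so that the block runs on its own.

```python
import random

# Toy data generated from y = 2x, so the best theta is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
rng = random.Random(0)  # seeded only to make the run repeatable

def predict(theta):
    """Hypothetical predict: cost and gradient on one randomly chosen sample."""
    i = rng.randrange(len(xs))
    error = theta * xs[i] - ys[i]
    cost = error ** 2
    gradient = 2 * error * xs[i]
    return cost, gradient

def SGD(theta0, learning_rate, no_iterations):
    theta = theta0
    for _ in range(no_iterations):
        cost, gradient = predict(theta)
        theta = theta - (learning_rate * gradient)
    return theta

theta = SGD(0.0, 0.01, 500)
print(theta)  # close to 2.0
```

With noise-free data like this, every single-sample update moves `theta` toward 2. On real, noisy data the estimate would keep fluctuating around the minimum, which is why the learning rate is often reduced over time.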



## What is stochastic gradient descent?

SGD optimizes model parameters iteratively using random data samples, increasing efficiency with large datasets but producing noisier paths.