Univariate Linear Regression

Problem statement #

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet.

Can you use the data to help you identify which cities may potentially give your business higher profits?

Note:

| Population of a city (× 10,000), \(x\) | Profit of a restaurant (× $10,000), \(y\) |
|---|---|
| 6.1101 | 17.592 |
| 5.5277 | 9.1302 |
| 8.5186 | 13.662 |
| 7.0032 | 11.854 |
| 5.8598 | 6.8233 |

The number of training examples is denoted \(m\).

In this case (for the table above) \(m = 5\)
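
As a quick sketch (using only the five examples from the table; the full dataset appears later in `main`), the data and \(m\) can be set up in C like this:

// The first five training examples from the table above
float x[] = {6.1101, 5.5277, 8.5186, 7.0032, 5.8598};   // population / 10,000
float y[] = {17.592, 9.1302, 13.662, 11.854, 6.8233};   // profit / $10,000
int m = sizeof(x) / sizeof(x[0]);                        // m = 5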

Model Function #

The model function for linear regression (a function that maps from \(x\) to \(y\)) is represented as

$$f_{w,b}(x^{(i)}) = w*x^{(i)} + b \tag{1}$$


float *compute_model_output(float x[], float w, float b, int m){

  /*
  Computes the predictions of a linear model
  Args:
  x (float[m]) : input data, m examples
  w,b (scalar) : model parameters
  m (scalar)   : number of examples
  Returns
  f_x (float[m]): model predictions (caller must free)
  */

  float *f_x = (float *)malloc(m * sizeof(float));

  for(int i = 0; i < m; i++){
    f_x[i] = x[i] * w + b;
  }
  return f_x;
}
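
A quick usage sketch (the parameter values 1.0 and 0.0 here are arbitrary placeholders, not the fitted ones, and `<stdio.h>`/`<stdlib.h>` are assumed to be included as in the main program below):

// Hypothetical usage of compute_model_output with placeholder parameters
float x_demo[] = {6.1101f, 5.5277f};
float *pred = compute_model_output(x_demo, 1.0f, 0.0f, 2);
printf("f(%.4f) = %.4f\n", x_demo[0], pred[0]);
free(pred);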

Compute Cost #

The cost measures how well our model predicts the target output; in this case the target is the profit of a restaurant.

Gradient descent involves repeated steps that adjust the values of \(w\) and \(b\) to obtain a smaller and smaller cost \(J(w,b)\).

The equation for the cost with one variable is

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$

where

$$f_{w,b}(x^{(i)}) = w*x^{(i)} + b \tag{1}$$

In C, the cost can be computed as follows:


float compute_cost(float x[], float y[], float w, float b, int m){

  /*
  Computes the cost function for linear regression.

  Args:
  x (float[m]) : input data, m examples
  y (float[m]) : target values
  w,b (scalar) : model parameters
  m (scalar)   : number of examples

  Returns
  J_wb (float): the cost of using w,b as the parameters for linear
                regression to fit the data points in x and y
  */

  float J_wb = 0;
  float f_x;

  for (int i = 0; i < m; i++)
  {
    f_x = x[i] * w + b; 
    J_wb += pow((f_x - y[i]), 2);
  }
  J_wb /= (2 * m);

  return J_wb;
}

Cost versus iterations of gradient descent #

A plot of cost versus iterations is a useful measure of progress in gradient descent. Cost should always decrease in successful runs. The change in cost is so rapid initially that it is useful to plot the initial descent on a different scale than the final descent. In the plots below, note the scale of cost on the axes and the iteration step.
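
As a sketch (not from the original post), the cost history could be recorded like this, assuming the compute_cost, compute_gradients and compute_weight_bias functions defined in this post and `<stdlib.h>`:

/* Runs gradient descent and records J(w,b) at every iteration so the
   cost-vs-iteration curve can be printed or plotted afterwards. */
float *gradient_descent_with_history(float x[], float y[], float *w, float *b,
                                     float alpha, int num_iters, int m){
  float *J_hist = (float *)malloc(num_iters * sizeof(float));

  for (int i = 0; i < num_iters; i++){
    J_hist[i] = compute_cost(x, y, *w, *b, m);          // cost before this step

    float *grads = compute_gradients(x, y, *w, *b, m);  // [dj_dw, dj_db]
    float *p = compute_weight_bias(*w, *b, grads[0], grads[1], alpha);

    *w = p[0];
    *b = p[1];

    free(grads);
    free(p);
  }
  return J_hist;   // caller frees; print J_hist[i] against i to see the curve
}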

Plot of cost J(w,b) vs w,b with path of gradient descent #

The plot shows the cost \(J(w,b)\) over a range of \(w\) and \(b\). Cost levels are represented by the contour rings. Overlaid, using red arrows, is the path of gradient descent. Here are some things to note:

Convex Cost surface #

The fact that the cost function squares the loss, \( \sum_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\) ensures that the 'error surface' is convex like a soup bowl. It will always have a minimum that can be reached by following the gradient in all dimensions.
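
One way to see this (a short check, not in the original post): the Hessian of \(J\) is a sum of outer products and is therefore positive semidefinite everywhere,

$$ \nabla^2 J(w,b) = \frac{1}{m}\sum\limits_{i=0}^{m-1} \begin{bmatrix} (x^{(i)})^2 & x^{(i)} \\ x^{(i)} & 1 \end{bmatrix} = \frac{1}{m}\sum\limits_{i=0}^{m-1} \begin{bmatrix} x^{(i)} \\ 1 \end{bmatrix} \begin{bmatrix} x^{(i)} & 1 \end{bmatrix} \succeq 0 $$

so \(J(w,b)\) is convex in \((w,b)\).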

Gradient Descent #

In linear regression, we use the training data to fit the parameters \(w\), \(b\) by minimizing a measure of the error between our predictions \(f_{w,b}(x^{(i)})\) and the actual targets \(y^{(i)}\). This measure is called the cost, \(J(w,b)\), and it is computed over all of the training samples \((x^{(i)}, y^{(i)})\). Gradient descent repeats the following updates until convergence:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \; \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace \end{align*}$$

where the parameters \(w\) and \(b\) are updated simultaneously.

The gradient is defined as:

$$ \begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})*x^{(i)} \tag{4}\\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\ \end{align} $$

Here *simultaneously* means that you calculate the partial derivatives for all the parameters before updating any of them, as sketched below.
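
For example, the update can be written with temporaries so that both gradients are evaluated at the old \((w, b)\) before either parameter changes (a sketch, assuming dj_dw and dj_db were computed from the current w and b, e.g. by the compute_gradients function below):

// Simultaneous update: both gradients were evaluated at the old (w, b)
float tmp_w = w - alpha * dj_dw;
float tmp_b = b - alpha * dj_db;
w = tmp_w;   // only now do the parameters change
b = tmp_b;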


float *compute_gradients(float x[], float y[], float w, float b, int m){

  /*
  Computes the gradients of the cost w.r.t. w and b for linear regression
  Args:
  x (float[m]) : input data, m examples
  y (float[m]) : target values
  w,b (scalar) : model parameters
  m (scalar)   : number of examples
  Returns
  grads (float[2]): [dj_dw, dj_db], the gradients of the cost w.r.t. w and b
                    (caller must free)
  */

  float f_x;
  float dj_dw = 0;
  float dj_db = 0;
  float *grads = (float *)malloc(2 * sizeof(float));

  for (int i = 0; i < m; i++)
  {
    f_x = x[i] * w + b;
    dj_dw += (f_x - y[i]) * x[i];
    dj_db += (f_x - y[i]);
  }
  dj_dw /= m;
  dj_db /= m;

  grads[0] = dj_dw;   //index 0: dj_dw
  grads[1] = dj_db;   //index 1: dj_db

  return grads;
}

Cost vs w, with gradients, b set to 100 #

The plot shows \(\frac{\partial J(w,b)}{\partial w}\) or the slope of the cost curve relative to \(w\) at three points. The derivative is negative. Due to the 'bowl shape', the derivatives will always lead gradient descent toward the bottom where the gradient is zero.

Evaluating our model #

To evaluate the fitted model, we use the coefficient of determination, \(R^2\), given by the following formula:

$$ R^2 = 1 - \frac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}} = 1 - \frac{\sum\limits_{i=0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2}{\sum\limits_{i=0}^{m-1}\left(y^{(i)} - \bar{y}\right)^2} $$

where \(\bar{y}\) is the mean of the target values.
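
The `main` function below declares and calls an `evaluate_model` function whose body is not shown in the post; a minimal sketch that matches that prototype and the formula above could look like this:

float evaluate_model(float x[], float y[], float w, float b, int m){

  /*
  Computes the coefficient of determination (R^2) of the fitted line.
  Args:
  x (float[m]) : input data, m examples
  y (float[m]) : target values
  w,b (scalar) : model parameters
  m (scalar)   : number of examples
  Returns
  R_square (float): 1 - (residual sum of squares / total sum of squares)
  */

  float y_mean = 0;
  for (int i = 0; i < m; i++){
    y_mean += y[i];
  }
  y_mean /= m;

  float ss_res = 0;   // residual sum of squares
  float ss_tot = 0;   // total sum of squares
  for (int i = 0; i < m; i++){
    float f_x = x[i] * w + b;
    ss_res += (f_x - y[i]) * (f_x - y[i]);
    ss_tot += (y[i] - y_mean) * (y[i] - y_mean);
  }

  return 1 - (ss_res / ss_tot);
}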

float *compute_weight_bias(float w, float b, float dj_dw, float dj_db, float alpha){

  /*
  Performs one step of gradient descent on w,b using learning rate alpha

  Args:
  w,b (scalar)  : current values of the model parameters
  alpha (float) : learning rate
  dj_dw (scalar): gradient of the cost w.r.t. w
  dj_db (scalar): gradient of the cost w.r.t. b

  Returns:
  p_hist (float[2]): updated parameters [w, b] (caller must free)
  */

  float *p_hist = (float *)malloc(2 * sizeof(float));

  w = w - (alpha * dj_dw);
  b = b - (alpha * dj_db);

  p_hist[0] = w; //weight
  p_hist[1] = b; //bias

  return p_hist;
}

Learning parameters using batch gradient descent #

You will now find the optimal parameters of a linear regression model using batch gradient descent. Recall that "batch" means all of the training examples are used in each iteration.

Expected Output #

Optimal w, b found by gradient descent:

| w | b |
|---|---|
| -3.216610 | -3.216610 |

We will now use our final parameters \(w\), \(b\) to make predictions for new examples.

recall:

$$ f_{w,b}(x^{(i)}) = w * x^{(i)} + b $$

Let's predict the profit for areas with populations of 35,000 and 75,000 people.

The main function #


#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Function prototypes
float *compute_model_output(float x[], float w, float b, int m);
float compute_cost(float x[], float y[], float w, float b, int m);
float *compute_gradients(float x[], float y[], float w, float b, int m);
float *compute_weight_bias(float w, float b, float dj_dw, float dj_db, float alpha);
float evaluate_model(float x[], float y[], float w, float b, int m);

int main(){

  //initialize w and b
  float w_in = 0;
  float b_in = 0;

  //initialize input, x
  float x[]={6.1101, 5.5277, 8.5186, 7.0032, 5.8598, 
            8.3829, 7.4764, 8.5781, 6.4862, 5.0546, 
            5.7107, 14.164, 5.734, 8.4084, 5.6407, 
            5.3794, 6.3654, 5.1301,6.4296, 7.0708, 6.1891};
  float y[]={17.592, 9.1302, 13.662, 11.854, 6.8233,
            11.886, 4.3483, 12, 6.5987, 3.8166, 3.2522,
            15.505, 3.1551, 7.2258, 0.71618, 3.5129, 
            5.3048, 0.56077, 3.6518, 5.3893, 3.1386};

  //number of examples, m
  int m = sizeof(x) / sizeof(x[0]);
  float alpha = 1e-2;
  int num_iters = 10000;

  float w = 0;
  float b = 0;
  float J_wb = 0;
  float dj_dw = 0;
  float dj_db = 0;

  for (int i = 0; i <= num_iters; i++)
  {
    // Save cost J at each iteration
    J_wb = compute_cost(x, y, w, b, m);
    
    // Calculate the gradients with the current parameters
    float *grads = compute_gradients(x, y, w, b, m);

    dj_dw = grads[0];
    dj_db = grads[1];
    free(grads);

    // Update the parameters using equation (3) above
    float *p_hist = compute_weight_bias(w, b, dj_dw, dj_db, alpha);

    w = p_hist[0];
    b = p_hist[1];
    free(p_hist);

    // R squared score with the updated parameters
    float R_square = evaluate_model(x, y, w, b, m);
    
    // Print progress every 1000 iterations
    if (i % 1000 == 0)
    {
      printf("Iteration: %5d\tCost: %10f\t R_Score: %10f\t dj_dw: %10f\t dj_db:%10f\t w: %10f\tb:%8f\n", i, J_wb, R_square, dj_dw, dj_db, w, b);
    }
  }

  //Predict the profit of a restaurant for two example populations (in units of 10,000 people)
  int eg = 2;
  float popul[] = {3.5, 7.5};

  float *pred = compute_model_output(popul, w, b, eg);
  for (int i = 0; i < eg; i++)
  {
    printf("\nFor population: %f,\t we predict a profit of: $%f\n", popul[i] * 10000, pred[i] * 10000);
  }
  free(pred);

  return 0;
}
