Multiple Variable Linear Regression
Problem statement #
We will use the motivating example of housing price prediction. The training dataset contains four features (size, bedrooms, floors, and age), shown in the table below.
| Size (sqft) | Number of Bedrooms | Number of Floors | Age of Home (years) | Price (1,000s dollars) |
| --- | --- | --- | --- | --- |
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 852 | 2 | 1 | 35 | 178 |
Let's build a linear regression model using these values so you can then predict the price for other houses. For example, a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old.
Matrix \(\mathbf{X}\) containing our examples #
Each row of the dataset represents one example. With \(n\) features and \(m\) training examples, \(\mathbf{X}\) is an input matrix with dimensions \((m, n)\), where \(m\) is the number of rows and \(n\) is the number of columns.
$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \vdots & \vdots & \vdots & \vdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix} $$
Notations
- \(\mathbf{x}^{(i)}\) is the vector containing example \(i\): \(\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots,x^{(i)}_{n-1})\)
- \(x^{(i)}_j\) is element \(j\) in example \(i\). The superscript in parenthesis indicates the example number while the subscript represents an element.
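In C, the matrix \(\mathbf{X}\) maps naturally onto a two-dimensional array: `x[i]` is example \(\mathbf{x}^{(i)}\) and `x[i][j]` is element \(x^{(i)}_j\). Here is a minimal sketch using the three examples from the table above (the array name `x_train` and the small `main` are only illustrative, not part of the program built in this post):

```c
#include <stdio.h>

/* the three training examples from the table: size, bedrooms, floors, age */
static float x_train[3][4] = {
    {2104, 5, 1, 45},
    {1416, 3, 2, 40},
    { 852, 2, 1, 35},
};

int main(void){
    /* x_train[i]    -> example i, the row vector x^(i)             */
    /* x_train[i][j] -> feature j of example i, the element x^(i)_j */
    printf("x^(0)_0 = %.0f (size of the first house)\n", x_train[0][0]);
    printf("x^(2)_3 = %.0f (age of the third house)\n", x_train[2][3]);
    return 0;
}
```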
Parameter vector \(\mathbf{w}\), \(b\) #
- \(\mathbf{w}\) is a vector with \(n\) elements.
  - Each element contains the parameter associated with one feature.
  - In our dataset, \(n\) is 4.
  - Notionally, we draw this as a column vector:
$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots\\ w_{n-1} \end{pmatrix} $$
- \(b\) is a scalar parameter.
Load datasets #
As shown in the problem statement, our dataset contains five columns (the four features plus the price of the house), \(n = 5\), and one hundred examples, \(m = 100\).
In this case, our dataset will have size \((m, n) = (100, 5)\).
$$\mathbf{data} = \begin{pmatrix} x^{(0)}_{0} & x^{(0)}_{1} & x^{(0)}_{2} & \cdots & y^{(0)}_{n-1} \\ x^{(1)}_{0} & x^{(1)}_{1} & x^{(1)}_{2} & \cdots & y^{(1)}_{n-1} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ x^{(m-1)}_{0} & x^{(m-1)}_{1} & x^{(m-1)}_{2} & \cdots & y^{(m-1)}_{n-1} \end{pmatrix} = \begin{pmatrix} x^{(0)}_{0} & x^{(0)}_{1} & x^{(0)}_{2} & x^{(0)}_{3} & y^{(0)}_{4} \\ x^{(1)}_{0} & x^{(1)}_{1} & x^{(1)}_{2} & x^{(1)}_{3} & y^{(1)}_{4} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ x^{(99)}_{0} & x^{(99)}_{1} & x^{(99)}_{2} & x^{(99)}_{3} & y^{(99)}_{4} \end{pmatrix} $$
$$\mathbf{data} \in \mathbb{R}^{m \times n} = \mathbb{R}^{100 \times 5}$$
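The listings later in this post access the data through a `float **` together with globally defined `m` and `n`. As a hedged sketch of how such a \((100, 5)\) table could be read from disk (the function name `load_data`, the file path, and the assumption of one whitespace-separated example per line are all illustrative choices, not fixed by the text; note also that \(n = 5\) here counts every column of the raw table, while the later listings use \(n = 4\) for the feature matrix \(\mathbf{X}\)):

```c
#include <stdio.h>
#include <stdlib.h>

#define m 100   /* number of examples                      */
#define n 5     /* columns: the 4 features plus the price  */

/* Reads an (m, n) table of numbers into a heap-allocated array of row pointers.
   Assumes one example per line with n whitespace-separated values. */
float **load_data(const char *path){
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror("Error opening file");
        return NULL;
    }
    float **data = (float **)malloc(m * sizeof(float *));
    if (data == NULL) {
        perror("Error allocating memory");
        fclose(fp);
        return NULL;
    }
    for (int i = 0; i < m; i++) {
        data[i] = (float *)malloc(n * sizeof(float));
        if (data[i] == NULL) {
            perror("Error allocating memory");
            fclose(fp);
            return NULL;
        }
        for (int j = 0; j < n; j++) {
            /* columns: size, bedrooms, floors, age, price */
            if (fscanf(fp, "%f", &data[i][j]) != 1) {
                fprintf(stderr, "Unexpected file format at row %d\n", i);
                fclose(fp);
                return NULL;
            }
        }
    }
    fclose(fp);
    return data;
}
```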
Load input datasets, \(\mathbf{x}^{(i)}_{j}\) #
The input dataset comprises four features (size, bedrooms, floors, and age), \(n = 4\), and one hundred examples, \(m = 100\).
$$\mathbf{X} = \begin{pmatrix} x^{(0)}_{0} & x^{(0)}_{1} & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_{0} & x^{(1)}_{1} & \cdots & x^{(1)}_{n-1} \\ \vdots & \vdots & \vdots & \vdots \\ x^{(m-1)}_{0} & x^{(m-1)}_{1} & \cdots & x^{(m-1)}_{n-1} \end{pmatrix} = \begin{pmatrix} x^{(0)}_{0} & x^{(0)}_{1} & x^{(0)}_{2} & x^{(0)}_{3} \\ x^{(1)}_{0} & x^{(1)}_{1} & x^{(1)}_{2} & x^{(1)}_{3} \\ \vdots & \vdots & \vdots & \vdots \\ x^{(99)}_{0} & x^{(99)}_{1} & x^{(99)}_{2} & x^{(99)}_{3} \end{pmatrix} $$
$$\mathbf{X} \in \mathbb{R}^{m \times n} = \mathbb{R}^{100 \times 4}$$
Load output datasets, \(\mathbf{y}^{(i)}_{j}\) #
The output dataset comprises only one column (price), \(n = 1\), and one hundred examples, \(m = 100\).
$$\mathbf{Y} = \begin{pmatrix} y^{(0)}_{0} \\ y^{(1)}_{0} \\ \vdots \\ y^{(m-1)}_{0} \end{pmatrix} = \begin{pmatrix} y^{(0)}_{0} \\ y^{(1)}_{0} \\ \vdots \\ y^{(99)}_{0} \end{pmatrix} $$
$$\mathbf{Y} \in \mathbb{R}^{m \times 1} = \mathbb{R}^{100 \times 1}$$
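One possible way to carve \(\mathbf{X}\) and \(\mathbf{Y}\) out of the raw \((100, 5)\) table, matching the shapes above (the function names and the decision to copy rather than alias the columns are choices made for this sketch only):

```c
#include <stdio.h>
#include <stdlib.h>

#define m 100   /* number of examples                               */
#define n 4     /* features of X: size, bedrooms, floors, age;      */
                /* column n (index 4) of the raw table is the price */

/* Copies the first n columns of the raw table into the (m, n) input matrix X. */
float **split_x(float **data){
    float **x = (float **)malloc(m * sizeof(float *));
    if (x == NULL) {
        perror("Error allocating memory");
        return NULL;
    }
    for (int i = 0; i < m; i++) {
        x[i] = (float *)malloc(n * sizeof(float));
        if (x[i] == NULL) {
            perror("Error allocating memory");
            return NULL;
        }
        for (int j = 0; j < n; j++) {
            x[i][j] = data[i][j];
        }
    }
    return x;
}

/* Copies the last column of the raw table (price) into the (m, 1) target vector y. */
void split_y(float **data, float y[m]){
    for (int i = 0; i < m; i++) {
        y[i] = data[i][n];
    }
}
```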
Initialize parameters \(\mathbf{w}\), \(b\) #
- \(\mathbf{w}\) is a vector with \(n = 4\) elements.
  - Each element contains the parameter associated with one feature.
$$\mathbf{w} = \begin{pmatrix} w_{0} \\ w_{1} \\ \vdots\\ w_{n-1} \end{pmatrix} = \begin{pmatrix} w_{0} \\ w_{1} \\ w_{2}\\ w_{3} \end{pmatrix} $$
$$\mathbf{w} \in \mathbb{R}^{n \times 1} = \mathbb{R}^{4 \times 1}$$
- \(b\) is a scalar parameter.
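A concrete sketch of this initialization (starting every parameter at zero is an assumption made here; the post does not prescribe particular starting values):

```c
/* n = 4 weights, one per feature, stored as an (n, 1) column vector, plus the scalar b */
double w[4][1] = { {0.0}, {0.0}, {0.0}, {0.0} };
double b = 0.0;
```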
Model Prediction With Multiple Variables #
The model's prediction with multiple variables is given by the linear model:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = w_0x^{(i)}_0 + w_1x^{(i)}_1 + \cdots + w_{n-1}x^{(i)}_{n-1} + b \tag{1}$$
or, in vector notation:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{2} $$
Applied to all \(m\) examples at once, this becomes
$$ f_{\mathbf{w},b}(\mathbf{X}) = \mathbf{X} \cdot \mathbf{w} + b $$
$$ \underbrace{f_{\mathbf{w},b}(\mathbf{X})}_{m \times 1} = \underbrace{\mathbf{X}}_{m \times n} \cdot \underbrace{\mathbf{w}}_{n \times 1} + \underbrace{b}_{1 \times 1} $$
$$ f_{\mathbf{w},b}(\mathbf{X}) \in \mathbb{R}^{m \times 1} $$
$$f_{\mathbf{w},b}(\mathbf{x}) = \begin{pmatrix} f_{\mathbf{w},b}(\mathbf{x})^{(0)}_{0} \\ f_{\mathbf{w},b}(\mathbf{x})^{(1)}_{0} \\ \vdots \\ f_{\mathbf{w},b}(\mathbf{x})^{(m-1)}_{0} \end{pmatrix} = \begin{pmatrix} f_{\mathbf{w},b}(\mathbf{x})^{(0)}_{0} \\ f_{\mathbf{w},b}(\mathbf{x})^{(1)}_{0} \\ \vdots \\ f_{\mathbf{w},b}(\mathbf{x})^{(99)}_{0} \end{pmatrix} $$
where \(\cdot\) is a vector `dot product`
To demonstrate the dot product, we will implement prediction using (1) and (2).
```c
double *predict(float **x, double w[][1], double b){
    /*
    predict using linear regression
    Args:
      x (ndarray): Shape (m, n) examples with multiple features
      w (ndarray): Shape (n, 1) model parameters
      b (scalar):  model parameter
    Returns:
      f_wb (ndarray): Shape (m, 1) predictions, one per example
    */
    double p;
    double *f_wb = (double *)malloc(m * sizeof(double));
    if (f_wb == NULL) {
        perror("Error allocating memory");
        return NULL;
    }
    for (int i = 0; i < m; i++) {
        /* dot product of example x^(i) with the parameter vector w */
        p = 0;
        for (int k = 0; k < n; k++) {
            p += (double)x[i][k] * w[k][0];
        }
        p += b;
        f_wb[i] = p;
    }
    return f_wb;
}
```
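Below is a small, hypothetical driver showing how `predict` could be called on the three houses from the table. The parameter values are made up purely to exercise the function (they are not trained weights), and `m` and `n` are redefined here just for this tiny demo:

```c
#include <stdio.h>
#include <stdlib.h>

#define m 3   /* the three examples from the table */
#define n 4   /* four input features               */

double *predict(float **x, double w[][1], double b);   /* the listing above */

int main(void){
    static float rows[m][n] = {
        {2104, 5, 1, 45},
        {1416, 3, 2, 40},
        { 852, 2, 1, 35},
    };
    float *x[m] = { rows[0], rows[1], rows[2] };        /* float** view of the data */

    /* made-up parameter values, chosen only to produce some output */
    double w[n][1] = { {0.3}, {20.0}, {-50.0}, {-25.0} };
    double b = 800.0;

    double *f_wb = predict(x, w, b);
    if (f_wb == NULL) {
        return 1;
    }
    for (int i = 0; i < m; i++) {
        printf("predicted price of house %d: %.1f (1,000s dollars)\n", i, f_wb[i]);
    }
    free(f_wb);
    return 0;
}
```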
Compute Cost With Multiple Variables #
The equation for the cost function with multiple variables \(J(\mathbf{w},b)\) is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$ where: $$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4} $$
$$f_{\mathbf{w},b}(\mathbf{X}) \in \mathbb{R}^{m \times 1}$$
In contrast to previous labs, \(\mathbf{w}\) and \(\mathbf{x}^{(i)}\) are vectors rather than scalars supporting multiple features.
```c
double compute_cost(float **x, float y[m], double w[][1], double b){
    /*
    compute cost
    Args:
      x (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,1)): target values
      w (ndarray (n,1)): model parameters
      b (scalar):        model parameter
    Returns:
      cost (scalar): cost
    */
    double cost = 0;
    double p;
    for (int i = 0; i < m; i++) {
        /* prediction f_wb for example i */
        p = 0;
        for (int k = 0; k < n; k++) {
            p += (double)x[i][k] * w[k][0];
        }
        p += b;
        /* accumulate the squared error for example i */
        cost += pow((p - (double)y[i]), 2);
    }
    cost /= (2 * m);
    return cost;
}
```
Gradient Descent With Multiple Variables #
Gradient descent for multiple variables:
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{5} \; & \text{for j = 0..n-1}\newline &b\ \ = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \newline \rbrace \end{align*}$$
where \(n\) is the number of features, the parameters \(w_j\), \(b\) are updated simultaneously, and where
$$ \begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align} $$
- m is the number of training examples in the data set
- \(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\) is the model's prediction, while \(y^{(i)}\) is the target value
Note
- $$\frac{\partial J(\mathbf{w},b)}{\partial \mathbf{w}} \in \mathbb{R}^{1 \times n} = \mathbb{R}^{1 \times 4}$$
$$\frac{\partial J(\mathbf{w},b)}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial J(\mathbf{w},b)}{\partial w_0} , \frac{\partial J(\mathbf{w},b)}{\partial w_1} , \cdots , \frac{\partial J(\mathbf{w},b)}{\partial w_{n-1}} \end{pmatrix} = \begin{pmatrix} \frac{\partial J(\mathbf{w},b)}{\partial w_0} , \frac{\partial J(\mathbf{w},b)}{\partial w_1} , \frac{\partial J(\mathbf{w},b)}{\partial w_2} , \frac{\partial J(\mathbf{w},b)}{\partial w_3} \end{pmatrix} $$
- $$\frac{\partial J(\mathbf{w},b)}{\partial b} \in \mathbb{R}$$
```c
double *compute_gradient(float **x, float y[m], double w[][1], double b){
    /*
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,1)): target values
      w (ndarray (n,1)): model parameters
      b (scalar):        model parameter
    Returns:
      grads (ndarray (n+1, 1)): dj_dw in grads[0..n-1] and dj_db in grads[n], where
        dj_dw is the gradient of the cost w.r.t. the parameters w and
        dj_db is the gradient of the cost w.r.t. the parameter b.
    */
    double *grads = (double *)malloc((n + 1) * sizeof(double));
    if (grads == NULL) {
        perror("Error in allocating memory");
        return NULL;
    }
    double *f_wb = (double *)malloc(m * sizeof(double));
    if (f_wb == NULL) {
        perror("Error in allocating memory");
        free(grads);
        return NULL;
    }
    double p;
    double dj_dw;
    double dj_db = 0;
    /* predictions f_wb and the gradient w.r.t. b, equation (7) */
    for (int i = 0; i < m; i++) {
        p = 0;
        for (int k = 0; k < n; k++) {
            p += (double)x[i][k] * w[k][0];
        }
        p += b;
        f_wb[i] = p;
        dj_db += f_wb[i] - y[i];
    }
    dj_db /= m;
    /* gradient w.r.t. each weight w_j, equation (6) */
    for (int i = 0; i < n; i++) {
        dj_dw = 0;
        for (int j = 0; j < m; j++) {
            dj_dw += (f_wb[j] - y[j]) * x[j][i];
        }
        dj_dw /= m;
        grads[i] = dj_dw;
    }
    grads[n] = dj_db;
    free(f_wb);
    return grads;
}
```
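The repeat-until-convergence loop of equation (5) is not itself shown above; here is a minimal sketch of how it could be layered on top of `compute_gradient`. The function name `gradient_descent`, the fixed iteration count, and the learning-rate argument `alpha` are illustrative choices, and the sketch assumes the same globally defined `m` and `n` used by the listings:

```c
#include <stdlib.h>

/* assumes the same global definitions used by the listings above,
   e.g. #define m 100 and #define n 4 */
double *compute_gradient(float **x, float y[m], double w[][1], double b);

/* Runs the simultaneous update of equation (5) for a fixed number of iterations.
   Updates w in place and returns the final value of b. */
double gradient_descent(float **x, float y[m], double w[][1], double b,
                        double alpha, int num_iters){
    for (int iter = 0; iter < num_iters; iter++) {
        /* grads[0..n-1] holds dJ/dw_j (equation 6), grads[n] holds dJ/db (equation 7) */
        double *grads = compute_gradient(x, y, w, b);
        if (grads == NULL) {
            break;                          /* allocation failed inside compute_gradient */
        }
        for (int j = 0; j < n; j++) {
            w[j][0] -= alpha * grads[j];    /* w_j = w_j - alpha * dJ/dw_j */
        }
        b -= alpha * grads[n];              /* b   = b   - alpha * dJ/db   */
        free(grads);
    }
    return b;
}
```

Because the whole gradient is computed before any parameter is touched, the updates within one iteration are simultaneous, as equation (5) requires.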
Evaluating our model by computing R-squared #
R-squared is the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. It ranges from 0 to 1, where 0 means no relationship and 1 means a perfect fit.
R-squared (the coefficient of determination) tells us how well our model is performing, that is, how well its predictions match the real results. A higher R-squared means our model is doing a better job of predicting.
$$ R^2 = 1 - \frac{\text{Residual Sum of Squares } (\mathbf{ss}_{res})}{\text{Total Sum of Squares } (\mathbf{ss}_{total})} $$
$$ R^2 = 1 - \frac{\mathbf{ss}_{res}}{\mathbf{ss}_{total}} $$
$$ R^2 = 1 - \frac{\sum\limits_{i=0}^{m-1}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}{\sum\limits_{i=0}^{m-1}\left(y^{(i)} - y_{mean}\right)^2} $$
```c
double r_squared(float **x, float y[m], double w[][1], double b){
    /*
    Computes the R-squared score of the model
    Args:
      x (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,1)): target values
      w (ndarray (n,1)): model parameters
      b (scalar):        model parameter
    Returns:
      r_square (scalar): R-squared score, r_square = 1 - rss/tss, where
        rss is the sum of squared differences between actual (y) and predicted (f_wb) values, and
        tss is the total variance of the actual (y) values around their mean (y_mean).
    */
    double r_square;
    double f_wb;
    double y_mean;
    double y_sum = 0;
    double rss = 0;
    double tss = 0;
    for (int i = 0; i < m; i++) {
        /* prediction f_wb for example i */
        f_wb = 0;
        for (int k = 0; k < n; k++) {
            f_wb += (double)x[i][k] * w[k][0];
        }
        f_wb += b;
        rss += pow((y[i] - f_wb), 2);       /* residual sum of squares */
        y_sum += y[i];
    }
    y_mean = y_sum / m;
    for (int i = 0; i < m; i++) {
        tss += pow((y[i] - y_mean), 2);     /* total sum of squares */
    }
    r_square = 1 - (rss / tss);
    return r_square;
}
```