Linear regression is one of the fundamental statistical and machine learning techniques. Whether you want to do statistics, machine learning, or scientific computing, there are good chances that you’ll need it. It’s advisable to learn it first and then proceed towards more complex methods.
Regression analysis is one of the most important fields in statistics and machine learning. There are many regression methods available. Linear regression is one of them.
What Is Regression?
Regression searches for relationships among variables.
For example, you can observe several employees of some company and try to understand how their salaries depend on the features, such as experience, level of education, role, city they work in, and so on.
This is a regression problem where data related to each employee represent one observation. The presumption is that the experience, education, role, and city are the independent features, while the salary depends on them.
Similarly, you can try to establish a mathematical dependence of the prices of houses on their areas, numbers of bedrooms, distances to the city center, and so on.
Generally, in regression analysis, you usually consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that (at least) one of the features depends on the others, you try to establish a relation among them.
In other words, you need to find a function that maps some features or variables to others sufficiently well.
The dependent features are called the dependent variables, outputs, or responses.
The independent features are called the independent variables, inputs, or predictors.
Regression problems usually have one continuous and unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, brand, and so on.
It is a common practice to denote the outputs with 𝑦 and inputs with 𝑥. If there are two or more independent variables, they can be represented as the vector 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of inputs.
When Do You Need Regression?
Typically, you need regression to answer whether and how some phenomenon influences the other or how several variables are related. For example, you can use it to determine if and to what extent the experience or gender impact salaries.
Regression is also useful when you want to forecast a response using a new set of predictors. For example, you could try to predict electricity consumption of a household for the next hour given the outdoor temperature, time of day, and number of residents in that household.
Regression is used in many different fields: economy, computer science, social sciences, and so on. Its importance rises every day with the availability of large amounts of data and increased awareness of the practical value of data.
Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.
When implementing linear regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.
Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.
The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is the weights corresponding to the smallest residuals.
To get the best weights, you usually minimize the sum of squared residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary least squares.
Implementing Linear Regression in Python
It’s time to start implementing linear regression in Python. Basically, all you should do is apply the proper packages and their functions and classes.
Python Packages for Linear Regression
The package NumPy is a fundamental Python scientific package that allows many high-performance operations on single- and multi-dimensional arrays. It also offers many mathematical routines. Of course, it’s open source.
If you’re not familiar with NumPy, you can use the official NumPy User Guide and read Look Ma, No For-Loops: Array Programming With NumPy. In addition, Pure Python vs NumPy vs TensorFlow Performance Comparison can give you a pretty good idea on the performance gains you can achieve when applying NumPy.
The package scikit-learn is a widely used Python library for machine learning, built on top of NumPy and some other packages. It provides the means for preprocessing data, reducing dimensionality, implementing regression, classification, clustering, and more. Like NumPy, scikit-learn is also open source.
You can check the page Generalized Linear Models on the scikit-learn web site to learn more about linear models and get deeper insight into how this package works.
If you want to implement linear regression and need the functionality beyond the scope of scikit-learn, you should consider statsmodels. It’s a powerful Python package for the estimation of statistical models, performing tests, and more. It’s open source as well.
Simple Linear Regression With scikit-learn
Let’s start with the simplest case, which is simple linear regression.
There are five basic steps when you’re implementing linear regression:
1. Import the packages and classes you need.
2. Provide data to work with and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is satisfactory.
5. Apply the model for predictions.
These steps are more or less general for most of the regression approaches and implementations.
Step 1: Import packages and classes
The first step is to import the package numpy and the class LinearRegression from sklearn.linear_model:
import numpy as np from sklearn.linear_model import LinearRegression
Now, you have all the functionalities you need to implement linear regression.
The fundamental data type of NumPy is the array type called numpy.ndarray. The rest of this article uses the term array to refer to instances of the type numpy.ndarray.
The class sklearn.linear_model.LinearRegression will be used to perform linear and polynomial regression and make predictions accordingly.
Step 2: Provide data
The second step is defining data to work with. The inputs (regressors, 𝑥) and output (predictor, 𝑦) should be arrays (the instances of the class numpy.ndarray) or similar objects. This is the simplest way of providing data for regression:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1)) y = np.array([5, 20, 14, 32, 22, 38])
Now, you have two arrays: the input x and output y. You should call .reshape() on x because this array is required to be two-dimensional, or to be more precise, to have one column and as many rows as necessary. That’s exactly what the argument (-1, 1) of .reshape() specifies.
This is how x and y look now:
>>> print(x) [[ 5]     ] >>> print(y) [ 5 20 14 32 22 38]
As you can see, x has two dimensions, and x.shape is (6, 1), while y has a single dimension, and y.shape is (6,).
Step 3: Create a model and fit it
The next step is to create a linear regression model and fit it using the existing data.
Let’s create an instance of the class LinearRegression, which will represent the regression model:
model = LinearRegression()
This statement creates the variable model as the instance of LinearRegression. You can provide several optional parameters to LinearRegression:
fit_intercept is a Boolean (True by default) that decides whether to calculate the intercept 𝑏₀ (True) or consider it equal to zero (False).
normalize is a Boolean (False by default) that decides whether to normalize the input variables (True) or not (False).
copy_X is a Boolean (True by default) that decides whether to copy (True) or overwrite the input variables (False).
n_jobs is an integer or None (default) and represents the number of jobs used in parallel computation. None usually means one job and -1 to use all processors.
This example uses the default values of all parameters.
It’s time to start using the model. First, you need to call .fit() on model:
With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output (x and y) as the arguments. In other words, .fit()fits the model. It returns self, which is the variable model itself. That’s why you can replace the last two statements with this one:
model = LinearRegression().fit(x, y)
This statement does the same thing as the previous two. It’s just shorter.
Step 4: Get results
Once you have your model fitted, you can get the results to check whether the model works satisfactorily and interpret it.
You can obtain the coefficient of determination (𝑅²) with .score() called on model:
>>> r_sq = model.score(x, y) >>> print('coefficient of determination:', r_sq) coefficient of determination: 0.715875613747954
When you’re applying .score(), the arguments are also the predictor x and regressor y, and the return value is 𝑅².
The attributes of model are .intercept_, which represents the coefficient, 𝑏₀ and .coef_, which represents 𝑏₁:
>>>>>> print('intercept:', model.intercept_) intercept: 5.633333333333329 >>> print('slope:', model.coef_) slope: [0.54]
The code above illustrates how to get 𝑏₀ and 𝑏₁. You can notice that .intercept_ is a scalar, while .coef_ is an array.
The value 𝑏₀ = 5.63 (approximately) illustrates that your model predicts the response 5.63 when 𝑥 is zero. The value 𝑏₁ = 0.54 means that the predicted response rises by 0.54 when 𝑥 is increased by one.
You should notice that you can provide y as a two-dimensional array as well. In this case, you’ll get a similar result. This is how it might look:
>>>>>> new_model = LinearRegression().fit(x, y.reshape((-1, 1))) >>> print('intercept:', new_model.intercept_) intercept: [5.63333333] >>> print('slope:', new_model.coef_) slope: [[0.54]]
As you can see, this example is very similar to the previous one, but in this case, .intercept_ is a one-dimensional array with the single element 𝑏₀, and .coef_ is a two-dimensional array with the single element 𝑏₁.
Step 5: Predict response
Once there is a satisfactory model, you can use it for predictions with either existing or new data.
To obtain the predicted response, use .predict():
>>>>>> y_pred = model.predict(x) >>> print('predicted response:', y_pred, sep='\n') predicted response: [ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
When applying .predict(), you pass the regressor as the argument and get the corresponding predicted response.
This is a nearly identical way to predict the response:
>>>>>> y_pred = model.intercept_ + model.coef_ * x >>> print('predicted response:', y_pred, sep='\n') predicted response: [[ 8.33333333] [13.73333333] [19.13333333] [24.53333333] [29.93333333] [35.33333333]]
In this case, you multiply each element of x with model.coef_ and add model.intercept_ to the product.
The output here differs from the previous example only in dimensions. The predicted response is now a two-dimensional array, while in the previous case, it had one dimension.
If you reduce the number of dimensions of x to one, these two approaches will yield the same result. You can do this by replacing x with x.reshape(-1), x.flatten(), or x.ravel() when multiplying it with model.coef_.
In practice, regression models are often applied for forecasts. This means that you can use fitted models to calculate the outputs based on some other, new inputs:
>>>>>> x_new = np.arange(5).reshape((-1, 1)) >>> print(x_new) [    ] >>> y_new = model.predict(x_new) >>> print(y_new) [5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]
Here .predict() is applied to the new regressor x_new and yields the response y_new. This example conveniently uses arange() from numpy to generate an array with the elements from 0 (inclusive) to 5 (exclusive), that is 0, 1, 2, 3, and 4.
You now know what linear regression is and how you can implement it with Python and three open-source packages: NumPy, scikit-learn, and statsmodels.
You use NumPy for handling arrays.
Linear regression is implemented with the following:
scikit-learn if you don’t need detailed results and want to use the approach consistent with other regression techniques
statsmodels if you need the advanced statistical parameters of a model
Both approaches are worth learning how to use and exploring further. The links in this article can be very useful for that.
When performing linear regression in Python, you can follow these steps:
Import the packages and classes you need
Provide data to work with and eventually do appropriate transformations
Create a regression model and fit it with existing data
Check the results of model fitting to know whether the model is satisfactory
Apply the model for predictions
If you have questions or comments, please put them in the contact us form.
Follow me on LinkedIn : https://www.linkedin.com/in/vaibhav-shah-704104212/
Your feedback is appreciated!
Did you find this Blog helpful? Any suggestions for improvement? Please let me know by filling the contact us form or ping me on LinkedIn .