Simple Linear Regression Implementation -Machine Learning Tutorial with Python and R-Part 6



Hello all. Today I'm starting the sixth session of the machine learning and data science tutorial with Python, and we are going to do the practical implementation of simple linear regression in Python. So let's begin.

First of all I need to show you the data set on which we are going to implement the simple linear regression model. I have a data set here with years of experience and salary. The salary changes based on the years of experience, following the market trend. This is the example we discussed in my previous session, where we went through various examples of simple linear regression, and this years-of-experience and salary data set is what we are going to use.

Let me note down the basic steps we are going to follow. First of all we import some packages: numpy, pandas, sklearn (that is, scikit-learn) and matplotlib, which is for the graphics. I will explain these packages as we do the practical work.

Let me do this first step, importing. I've named my file simple_regression.py. Always note that the workspace and the data set should be in the same folder. I'll do a Save As; you can see I have saved it as simple_regression.py, and you can see the salary data set is present in the same folder. We are going to read the salary data set and run simple linear regression on it.

So the first step is the imports: we need to import numpy, we
need to import pandas, and we need to import matplotlib.pyplot as plt. This is the syntax for importing libraries in Python. You can see that an alias name is given here. This is so that we don't need to write numpy in each and every line of code; we can use the alias instead to call the functionality inside those packages. I'm not importing sklearn right now, because sklearn will be imported later, when we are actually creating the classifier, that is, developing the model.

Let me execute this. To execute in Spyder you need to press Ctrl+Enter. This gets executed, and all the packages have been imported.

Now let's go to the second step: we need to read the data set into a data frame, so that once it is in a data frame we can implement our simple linear regression algorithm on it.

In the third step, after reading the data set, we need to split the data into independent and dependent variables. The dependent variable is usually written as a small y, and the independent variables are usually written as X. This means that y is your output, and since y is your output it will always depend on the independent variables, or feature variables; based on those variables, this value is produced. Here the salary will be our y, the dependent variable, and the years of experience will be our feature, the independent variable.

So let us implement these two statements that I have written here. First is reading the data set. Let me create a dataset variable. We will use pandas for this, because the pandas library has those reading
functionalities, whether you want to read from a CSV, Excel, HTML or JSON file. If I type pd. you can see all the methods for reading: there is read_html, read_json, read_csv. Let me type read_csv, because my data set is in a CSV file.

Now we need to understand which parameters read_csv takes. To see them, click on it and press Ctrl+I; after pressing Ctrl+I you will see all the information about read_csv and the parameters it accepts. For the current purpose I'll only use the first parameter, filepath_or_buffer, so I'll provide the path of the file, which is currently in my workspace. We will see the other parameters in different examples, where we will be using other kinds of files.

Let me put in my file name, salary.csv. This is my file path, and this is the file name that is present in my workspace. Let me execute this. It says that the file does not exist. Let me open the folder and check the name of the file. Oh, sorry: the problem is that it is not able to find salary.csv, and the one thing we have missed is that we have not set our workspace.

Always, when you program in Spyder, you need to set up your workspace. You can see that your file, simple_regression.py, is in this machine learning examples folder. To set this directory as your workspace, you need to do just one thing
and that is to press F5. Once you press F5 here, you see that your current directory is set to this machine learning examples folder, the one containing simple_regression.py, and you can also see the path up here, because this is the working directory that has been set. Now when you read the file, you get the data frame.

I did not do this at the start on purpose, because we need to know about the working directory: before you start programming, always set up the working directory, and then you will be able to read any file present inside it. If I show you again, here is the data file, and here is the program file that I am writing. Spyder is now treating this folder as the working directory, so it is able to use salary.csv.

Now that I have read salary.csv, let me show you the data that has come in: we have the years of experience on the left-hand side, and the salary.

The next step, after reading the data set, is to split it into the independent variable X and the dependent variable y. In my data set, years of experience will become our independent variable, whereas salary will become our dependent variable. Let's see how to do that.

It is better to write comments in your code. To write a comment you just use the hash symbol (#). I'll write "divide the data set into X and y", which tells the reader that the data is being divided into an independent variable and a dependent variable.

So let me take X equal to dataset dot, and from this data frame I'm going to
read the first column and the second column. The first column I'm going to use as the independent variable, and the second column will be my dependent variable, which goes on the y-axis. To read them you use iloc. Always remember that iloc takes square brackets. If you are familiar with pandas, iloc is the functionality that lets you pick out, by position, which columns to use for X and which for y.

Inside it I'll give two parameters. The first parameter is about the rows, and the second parameter is about the columns. For the first parameter I put a colon (:), which means all the rows, so we are taking all the rows for X. For the second parameter I specify :-1. On its own, the colon would mean all the columns; the -1 means that we stop before the last column, so we are dropping the salary column. Once we drop it, the only column remaining is years of experience.

Please make a note of this, it is very important to understand: in iloc we have two parameters, one for the rows and the other for the columns. In the first we have specified :, which means take all the rows. In the second we have specified :-1, which means take all the columns except the last one; we drop the salary column because salary will be going on the y-axis.

Then I'll add .values, so that whatever values we are extracting are taken automatically in
whichever format they are in, as an array. Let me execute it. Perfect, this has executed successfully. You can see that X holds all the years of experience, so this is the independent variable that goes on the x-axis.

Now for the y-axis. From the data frame you know that y is the salary column, so we will take the salary part into y. I write y = dataset.iloc again; I take all the rows, and I have to specify the column number. The columns are numbered 0 and 1, so column 1 is the one that specifies our salary. Then .values. Let me execute it and see what I get for y. Perfect, I've got the salary column, and this is my dependent variable, which will be on the y-axis.

So we have completed our third step, dividing the data set into X and y, the independent and dependent variables. I have my X and y here, and that is good.

Now let's go to the next step, where we divide the complete data set into a training set and a test set. Why are we doing this? Through the training data set we define our classifier: we fit a particular regression model on it, and that gives us the model. By classifier I mean the same thing as the model. We fit our linear regression on the training data, and on the test data we predict the values. This fourth step is also simple; let me
show you how we can do it. Let me add a comment that we are splitting the data into training and test sets.

To implement this we use the scikit-learn library, which I think you have heard about. In sklearn there is a module called cross_validation (in newer scikit-learn versions this module is sklearn.model_selection), and from it we import train_test_split. This is a function that divides our data into some amount of training data and some amount of test data, based on the parameters we specify.

After executing that import, I call train_test_split. If I press Ctrl+I on it, you can see its description in the help section. Looking at the parameters, we need the arrays, then test_size or train_size, and random_state. Remember that you can specify either test_size or train_size. Suppose you have 100 records and you want around 70 of them in the training set: you just specify 0.7 as train_size, and the remaining records become the test set. So you only need to specify one of the two, test_size or train_size.

The first parameter is the arrays, so I put X, y. These are the two arrays we need, because X is the independent and y is the dependent variable. For this example we are going to use test_size. Since we have 30 records, as I showed you initially, from those 30 records we will try to do
something like this: take 20 values into the training set and around 10 values into the test set. So let me write 1/3. Our test size is 1/3, that is 1/3 of 30, so 10 values go into the test set.

The next parameter is random_state. I will explain it properly in later videos; for now you need to understand that random_state is a seed value. As the help text says, it is the seed used by the random number generator. To get the same output as I do when you implement this yourself, we just specify it as 0.

Let me execute it. Oops, I made a mistake here: this call actually returns four values, X_train, X_test, y_train and y_test, and I need to assign all of them. Sorry, my mind was elsewhere; sometimes these mistakes happen. So inside train_test_split we have specified X, y, test_size as 1/3, and random_state as 0. Let me execute this and see the output. Fine: we have got a warning, but that is okay; it says one of the modules is deprecated.

Now let me look in the Variable Explorer. There are four variables that I've created: X_train, X_test, y_train and y_test. X_train has 20 values; since I gave test_size as 1/3, the remaining 20 go to the training set, and these are years of experience. Similarly, X_test has the 10 values, which is the 1/3 we specified. Perfect, I've got my train and test sets. Now let's see y
train and y test. In y_train I can see the salaries, and there are 20 values, so this is perfect. In y_test there are the other 10 values, taken from the salary column. So we have divided the data set into training and test sets.

Now let me take the next step: implement our classifier based on simple linear regression. The first four steps you have to do for each and every problem: import the packages, read the data set, split it into independent and dependent variables, and divide the complete data set into training and test sets. So this is done, this is done, this is done, and this is done too.

The fifth step is to implement a classifier based on simple linear regression. I'll write a comment for it. From sklearn we again import something; there is a module called preprocessing... no, not this one, I got mistaken again. The module is called linear_model, and from it I'll import LinearRegression. So: from sklearn.linear_model import LinearRegression.

The next step is defining my simple linear regression classifier. What is LinearRegression doing for us? It calculates beta0 and beta1. You know we had the formula y = mx + c, which I showed in a different notation in my previous video. Let me type it down here. So we had y = beta0 + beta1 *
x. Here beta0 is the intercept and beta1 is the slope. We discussed the formula for beta1 in the previous class; let me write it out: beta1 = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2), where n is the number of data points. This is the method it follows to calculate beta1, and all of it happens inside this package. You don't have to rack your brain writing separate functions to calculate the means and work all of this out; you can directly import LinearRegression and use it.

So I'll call LinearRegression() and check whether any parameters are required. These are all optional parameters with default values. We are just implementing a simple version of simple linear regression, so we will not specify any parameters.

On the next line we fit the model. If you look at the fit function, you can see that it takes an X array and a y array, the y array being the dependent variable. Since we are creating the classifier on the training data set, the parameters here are X_train and y_train, the data you want to use for creating your model. Let me execute this. It has executed successfully; you can see copy_X=True, fit_intercept=True, n_jobs=1, and normalize
is False; these are all the default values of the parameters, which are used if you don't specify anything.

Now my classifier is ready, so I'll try to predict values. On the simple linear regression object I call predict, and in predict I need to specify the test set for which I want to predict values, which is X_test. Let me run this. Perfect, I have got the predicted values.

Let me show you how the comparison is done. For X_test we already have the actual values in y_test, and this is my prediction. You can compare them and judge how good your classifier is, whether it is able to predict your values or not. The first actual value is around 37,700 and here the prediction is around 40,800, which is reasonably near. This next one is pretty cool, because the predicted value is very near the actual value. Similarly, for the other values there are small differences. The seventh row is not that good, but the sixth row is pretty good again, and the third one is the best, I guess, because only a few digits have changed. So our model looks to be a very good one, and it has predicted properly.

Now suppose you want to predict for a value that is not already in the data set. I'll create another prediction variable for that. I want to know what the salary will be for a person with around 11 years of experience; the highest value in the data set is 10.5, so I'll calculate for 11 years. I pass in 11 (newer scikit-learn versions expect a 2-D input here, such as [[11]]) and predict. What I see for eleven years is around 129,600, so
this is the value for eleven years of experience: a person with eleven years of experience will be getting this much money. Perfect, we have been able to predict it properly.

The next thing is how to see the simple linear regression in a graph. In my previous class I showed you that there are some points scattered in the graph, and we plot a straight line, the best fit line, which passes through these points. Let us implement it.

The sixth step is to plot the graph for simple linear regression. We have already imported matplotlib.pyplot, which is used for the graphics. First we call plt.scatter, which scatters our training set, given by X_train and y_train. Those are the points that will be visible. I'll also specify the color as red.

After scattering the data, the next thing is to plot our best fit line across those points. Is it called plotted? Let me check. Oh sorry, it is plot. In plt.plot we pass X_train and the predicted values for X_train. Always remember that the best fit line in this case comes from the model we created, and we fitted that model on X_train and y_train. So we predict on the X_train values and plot the best fit line. I'll just copy the predict call; here we are predicting, but we have to predict based on the
X_train. Perfect. So this plots the best fit line, which passes through the scattered points of X_train and y_train. The next thing I have to do is show the plot, with plt.show(). Now let me see whether the graph is created properly. I press Ctrl+Enter and... whoa, this graph looks awesome. These are the scattered points from X_train and y_train, and on the X_train values we have predicted a best fit line, which looks pretty good.

So this is simple linear regression. If you want to practice some more, on Kaggle you can find a lot of data sets and a lot of scenarios with different kinds of dependent and independent variables, and based on those you can predict the best fit line. If I calculate for 11 years, the point will come somewhere here on the line. This is how simple linear regression is done.

That's it for simple linear regression. In the next class we will be doing multiple linear regression. Till then, have fun, enjoy the class, practice some more, and do well. Thank you.
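
The steps walked through in this session can be sketched in one short script. This is only a minimal sketch under stated assumptions: it builds a made-up 30-row data frame in place of salary.csv (the column names YearsExperience and Salary and the generated numbers are hypothetical), it imports train_test_split from sklearn.model_selection as current scikit-learn versions require, and names such as regressor and y_predict_11 are illustrative rather than the ones typed in the video. It also cross-checks the fitted slope and intercept against the closed-form beta1 and beta0 formulas.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for salary.csv (30 rows, 2 columns).
# With the real file you would use: dataset = pd.read_csv("salary.csv")
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.1, 10.5, 30), 1)
salary = 26000 + 9400 * years + rng.normal(0, 5000, size=30)
dataset = pd.DataFrame({"YearsExperience": years, "Salary": salary})

# Divide the data set into X (independent) and y (dependent).
X = dataset.iloc[:, :-1].values  # all rows, all columns but the last -> (30, 1)
y = dataset.iloc[:, 1].values    # all rows, column 1 (salary) -> (30,)

# Split into training and test sets: 1/3 of 30 rows -> 10 test, 20 train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

# Fit the model on the training data only.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict for the test set and for 11 years (the input must be 2-D).
y_predict = regressor.predict(X_test)
y_predict_11 = regressor.predict([[11]])

# Cross-check: closed-form slope and intercept on the training data.
x = X_train.ravel()
n = len(x)
beta1 = (n * np.sum(x * y_train) - np.sum(x) * np.sum(y_train)) / (
    n * np.sum(x ** 2) - np.sum(x) ** 2
)
beta0 = y_train.mean() - beta1 * x.mean()

print("slope:", regressor.coef_[0], "closed form:", beta1)
print("intercept:", regressor.intercept_, "closed form:", beta0)
print("predicted salary for 11 years:", y_predict_11[0])

# Plotting (needs matplotlib.pyplot imported as plt):
# plt.scatter(X_train, y_train, color="red")
# plt.plot(X_train, regressor.predict(X_train))
# plt.show()
```

The plotting calls are left as comments so that the script runs without a display; uncommenting them should reproduce the scatter plot and best fit line described above.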

24 thoughts on “Simple Linear Regression Implementation -Machine Learning Tutorial with Python and R-Part 6”

  1. Hi krish,
    your tutorials are good.
    please provide the tutorial for handling missing values if i have more nan values in my dataset how to handle.

  2. No module named 'sklearn.cross_validation' issue. Just change sklearn.cross_validation by sklearn.model_selection

  3. While implementing the classifier, getting the error
    , Found input variables with inconsistent numbers of samples: What is the issue ?

  4. Hey Krish, it's a great video!!! I have a little doubt. Actually, I'm new to machine learning and have seen many tutorials on this topic. In some, the linear regression problem is solved using sklearn, pandas libraries while in some, only numpy library with basic python programming is used to solve. Which one method is better to use??

  5. I have multiple columns in my csv file. for selecting x variable what will the syntax of code. Please solve it. how do i select only one column for x variable?

  6. Hi Krish, great video so far

    Just a simple question.

    I currently have 2 CSVs of the same size, one training data set with output values provided, test set with no output values.

    How do I run simple linear regression dataset splitting based on your approach? Do i combine the 2 CSVs together?
    If so, how do I split the combined dataset?
    (X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/2, random_state = 0))

  7. i have one question for you, suppose so many variables is there in given data sets, then i select particular 2 or 3 variables how to code in python [x=dataset.iloc[:,:-1].values]

  8. Hi sir i have a one problem in import , from sklearn.cress_validations import train_test_split is not imported

  9. Thanks for the tutorial
    I try to run the code on python 27 but am having this import error: No module named sklearn_cross_validation

  10. Nice video sir,you've really explained the concepts very well.But now I would like to know that if we want to change the parameters of linear regression from default values,so that it shows less error.Then,how could we do this and what values should we've to give to the parameters.
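
Several of the questions above ask how to choose one column, or two or three columns, for X when the CSV has many columns. A minimal sketch, assuming a made-up data frame (every column name and value here is hypothetical), could look like this:

```python
import pandas as pd

# Assumption: a small made-up data frame with several columns.
dataset = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5],
    "Age": [22, 24, 27, 30],
    "Certifications": [0, 1, 1, 3],
    "Salary": [39000, 43000, 54000, 61000],
})

# One column by position; the list [0] keeps the result 2-D, shape (4, 1).
X_one = dataset.iloc[:, [0]].values

# Two chosen columns by position, shape (4, 2).
X_some = dataset.iloc[:, [0, 2]].values

# The same idea by column name.
X_named = dataset[["YearsExperience", "Age"]].values

# The dependent variable as a 1-D array, shape (4,).
y = dataset["Salary"].values

print(X_one.shape, X_some.shape, X_named.shape, y.shape)
```

Passing a list of positions or names keeps X two-dimensional, which is the shape scikit-learn estimators expect for the feature array.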
