Machine Learning Tutorial Python – 2: Linear Regression Single Variable

Today we are going to write Python code to predict home prices using a machine learning technique called simple linear regression. In this table I have the prices of homes based on their area, in my neighborhood in Monroe Township, New Jersey. Using this data we'll build a machine learning model that can tell me the prices of homes whose areas are 3,300 square feet and 5,000 square feet.

You can plot the available prices and areas in the form of a scatter plot like this, where the red markers show the available data points. Now we can draw this blue line, which best fits these data points. Once I have this line, that is, once I have the linear equation, I can tell the price of any home. Basically I can say, OK, a 3,300 square foot home is going to be this price.

Now you might ask, how did you come up with this blue line? This line is not going through all these data points, and there are a number of ways you can draw different lines, for example this red one and this orange one. So why did I choose the blue line? What we do is calculate this delta, which is the error between the actual data point and the data point predicted by your linear equation. We square the individual errors, sum them up, and try to minimize that sum. We do this procedure for all these lines. I repeated it for the orange, red, and blue lines, and what I found was that the blue line gave me the minimum error, hence I chose that line.

If you remember your algebra class from your school days, you have probably learned linear equations, which look like y = mx + b, where m is the slope or gradient and b is the intercept. In our case m is still the slope, but y is the price and x is the area. Area is called an independent variable, whereas price is called a dependent variable, because we are calculating price based on area.

Now we are going to write Python code for doing home price predictions. Here I have launched my Jupyter notebook and I have imported some useful libraries. The most important one here is from sklearn import linear_model. sklearn is the library; it is also called scikit-learn, so if you Google it you will find that this is the library we are using. It comes with the Anaconda installation, so once you have installed Anaconda you should have this library available for import.

I have the prices available in the form of this CSV file, so the first thing I am going to do is load these prices into a pandas data frame with pd.read_csv. The name of the file is homeprices.csv, and now I have the data frame. If you don't know about pandas and data frames, I recommend watching my tutorials on pandas, because pandas is going to be extremely useful in your machine learning journey.

Once I have a data frame, the next thing I am going to do is plot a scatter plot, just to get an idea of the distribution of my data points. As you all know if you have used Jupyter notebooks before, you have to use the %matplotlib inline magic in order to draw the plots. First I'm going to call plt.scatter to plot area versus price. Now I'll make some modifications: I will set the color to red and the marker to plus. I'm just making my chart a little fancy. Then I'm also setting the x and y labels, because you can see that they are not available right now. The x label is the area in square feet and the y label is the price in US dollars.

Once I look at this plot, I get the idea that the distribution is suitable for a linear regression model, and hence I will now go ahead and use linear regression. First you need to create a linear regression object. You can see that from the sklearn Python module I have already imported linear_model, and here is LinearRegression, so I will create an object for linear regression and then I will fit my data. Fitting the data means you are training the linear regression model using the available data points. The first argument has to be something like a 2D array, so you can supply a data frame here; I am going to supply a data frame which contains only the area. The second argument is the y axis on your plot, which is your price. When I execute this, it works without any error, which means this linear regression model is now ready to predict prices.

So let's do our prediction. What we wanted to predict was the price of a home whose area is 3,300 square feet, and you can see that it predicted this price. Now you might be wondering why it came up with this price, so let's look at some of the internal details. When I trained my linear regression object using the available data, it calculated the coefficient and the intercept. If you go back to our mathematical equation, you know that for any linear equation there is a slope and an intercept; the slope is also called the coefficient. So my model calculated the values of m and b. Let's see what those values are. When you evaluate reg.coef_, it shows the value of the coefficient, which is m, and when you evaluate reg.intercept_, it shows the value of b.

So now the equation for price is m times area plus b, and we have m, the area, and b, so let's see what value it gives: y = m * x + b. Here m is this number, so let me just copy it, times x, where x is the area you want to predict for, which is 3,300, and your intercept is this. When you execute this you get this value, so now you know how the model was able to predict the value right here. Similarly, the second value we wanted to predict was for 5,000 square feet. For 5,000 you can just do this, and you get this value right here. This is pretty amazing: now you have a model which you can use to predict your home prices.

You might have a CSV file like this, where you have a list of areas available and you want to predict the prices for these homes. Until now we were giving the area individually and predicting the price. What I want to do now is generate another CSV file where I have this list of areas and the corresponding predicted prices. For this I first create a data frame using read_csv, so I have the list of areas available. Now I will use the regression model to actually predict the prices: I'll just supply the data frame here, and that will return the prices. You can store the prices in a variable p, and then in your original data frame you can create a new column. When you do it like this, it creates a new column in your data frame, and you can assign p to it. Now when you print your data frame you can see the prices are available, and then you can just use the to_csv method to export the values to prediction.csv. If I open my prediction.csv, you will find that I have areas and prices. It actually exported the index as well; if you don't want that, pass index=False. If you do that and execute this again, and then open prediction.csv, you won't find the index. It will be just area and prices, as you can see here. So once you have built this model, you can apply it to a huge CSV file and come up with a list of predictions.

Now, going back to our original example, let me read my original prices again and do a fit on them. What I want to show you is what my linear equation line looks like. For that I'm again plotting a scatter plot. These lines of code just plot the scatter plot, and I added one more line where I use the data frame's area, predict the prices, and plot them on the y axis. Let's see what happened here: it complains that the area is not defined, so that has to be df.area here. Let's go step by step. Here I have my scatter plot, and what I'm doing is calling plt.plot: on my x-axis I want df.area, and on my y-axis I want the prediction for the area, like this. So it shows the visual representation of my linear equation here.

That's all I had for this tutorial. I have an exercise for you: given Canada's adjusted net national income per capita, predict the net income in the year 2020. I have provided a CSV file in the exercise folder. By the way, the link to the Jupyter notebook is available in the video description below. Go to GitHub and download my notebook, study it first, and then download the exercise folder. In the exercise folder you will find this CSV file, which has Canada's per capita income for the years 1970 to 2016, and your job is to find the predicted income in the year 2020. I highly recommend that you do the exercise, because just by watching the video you are probably not going to learn that much. I mean, you'll learn something, but it's not very effective, so it's better that you do some practice as we go through these tutorials. I'll make sure I provide a simple exercise at the end of every tutorial.

So again, just to summarize, this tutorial was all about building a simple linear regression model using one variable, and in the future we are going to cover slightly more complex linear regression models. Thank you, bye.
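
The steps walked through in this tutorial (fit on area and price, predict, check the result against the slope and intercept, then export a batch of predictions) can be sketched roughly as below. This is a minimal sketch, not the notebook from the video: the five rows are assumed stand-ins for the table shown on screen, since the transcript does not list them, and the data is built inline rather than read from a CSV file. The names query, manual, and csv_text are illustrative only.

```python
import pandas as pd
from sklearn import linear_model

# Assumed stand-in for the table shown in the video; the transcript
# itself does not give these rows.
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000, 565000, 610000, 680000, 725000],
})

# Train the model. X must be 2D, hence the double brackets around "area".
reg = linear_model.LinearRegression()
reg.fit(df[["area"]], df["price"])

m = reg.coef_[0]     # slope (coefficient)
b = reg.intercept_   # intercept

# predict() also expects 2D input; a one-column DataFrame works.
query = pd.DataFrame({"area": [3300, 5000]})
predicted = reg.predict(query)

# The same numbers recomputed by hand from y = m * x + b.
manual = [m * area + b for area in query["area"]]

# Attach the predictions as a new column and export without the index.
query["price"] = predicted
csv_text = query.to_csv(index=False)  # pass a file path to write to disk instead
print(csv_text)
```

Note that in recent scikit-learn versions a bare scalar such as reg.predict(3300) raises "ValueError: Expected 2D array, got scalar array instead", which is why the sketch passes a one-column data frame; a nested list such as reg.predict([[3300]]) also has the required 2D shape.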

48 thoughts on “Machine Learning Tutorial Python – 2: Linear Regression Single Variable”

  1. It is giving an error in reg.predict:
    ValueError: Expected 2D array, got scalar array instead:
    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

  2. What if we use print(reg.predict(arr)) where arr = np.array(2600)? The output should be 550000, but the answer comes out as 533664.38356164. Can you tell me how?

  3. plt.xlabel('Year')
    plt.ylabel('Income per Capital in US $')
    plt.scatter(df.year,df.income,color = 'red', marker = '+')

    reg = linear_model.LinearRegression()
    reg.fit(df[['year']],df.income)

    df output

    year income
    0 1970 3399.299037
    1 1971 3768.297935
    2 1972 4251.175484
    3 1973 4804.463248
    4 1974 5576.514583
    5 1975 5998.144346
    6 1976 7062.131392
    7 1977 7100.126170
    8 1978 7247.967035

    I am getting below error,
    Expected 2D array, got scalar array instead:

  4. Hi, I have uploaded my query on your GitHub repository as shailendrarg. The predict[[2100]] is not working for me. Please help me.
    I am learning through your videos and it's really solid content for learning.

  5. Thanks for the tutorial! Just a small piece of feedback: you may want to discuss the functions and their parameters, namely fit(X, y) and predict(X).

  6. Thanks for all the informative videos..
    Very interesting & awesome Lectures for beginners!
    Machine Learning made simple and exciting

  7. Can you do this with the train_test_split method? And also, how do we calculate R squared, mean variance and score? Please advise. Thanks.

  8. Thank you very much for opening the door to what looks like a daunting task of hands-on machine learning. You made it so easy and refined. Kudos!

  9. I have seen various videos, but you have definitely explained it in the best possible way.
    Can you please show how to calculate R squared or something like that?

  10. Why is area taken as a 2D array in fit(df[['area']],df.price)?
    Please reply

    Love your Tutorials
    They are amazing…..

  11. import pandas as pd
    from sklearn import linear_model


  12. The lesson is great. It would be more fun if the exercise asked people to build a simple linear regression model with just numpy.

  13. First of all, I want to congratulate you on your good tutorials. However, I have a problem with scikit-learn-0.20.2. When I reach reg.predict(3300) I get ValueError: Expected 2D array, got scalar array instead:

  14. I don't want to name names, but most of the YouTube channels on machine learning for beginners are not clear and specific. I fell in love with "CODE BASICS". Sir, I'm ready to take your machine learning course. Will you provide it?

  15. reg.predict(3000) did not work. I checked the comments and got the answer. Special thanks to the tutor and all the others.

  16. Dear sir ,
    when we enter : d["prices"] = p
    why doesn't the prices column get added to the "area.csv" file, as "d" represents the same file (and also contains the prices column in the output shown)?
    Can you kindly explain this ?

  17. I have read and watched a lot of articles and videos on Google and YouTube. Everyone explained what machine learning is, the ML types,
    their algorithms and blah blah, but actually no one explained how to do it. This is a great channel that I found very useful, where everything is explained in a very simple way. Thank you very much.

  18. Man, I hope everyone can explain like you!
    I would be happy to do my homework if my teacher could explain things this clearly.
    2020 net income = 41288.69409442, anyone?
    What did you get?

  19. What does fit() mean? When you skip the fit() method and call predict(), it gives an error. In the video you said it means training, but what does that mean? What does it actually do?
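
Two of the comments above ask what fit() actually does and whether the model can be built without scikit-learn. For a single variable, the least-squares line that the tutorial describes (square the individual errors, sum them, minimize the sum) has a simple closed form, and the sketch below computes it in plain Python. It is only an illustration of the idea, not scikit-learn's internal code: the helper names fit_line and sum_squared_error are hypothetical, the data rows are assumed stand-ins for the table in the video, and the two alternative lines are made-up candidates playing the role of the red and orange lines.

```python
def sum_squared_error(xs, ys, m, b):
    """Sum of squared differences between actual y and predicted m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Return the slope m and intercept b that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Assumed stand-in data; the transcript does not list the table rows.
areas = [2600, 3000, 3200, 3600, 4000]
prices = [550000, 565000, 610000, 680000, 725000]

m, b = fit_line(areas, prices)
best_error = sum_squared_error(areas, prices, m, b)

# Made-up alternative lines (slope, intercept) for comparison.
candidates = [(100.0, 300000.0), (150.0, 130000.0)]
candidate_errors = [sum_squared_error(areas, prices, cm, cb) for cm, cb in candidates]

print("slope:", m, "intercept:", b)
print("squared error of fitted line:", best_error)
print("squared errors of other lines:", candidate_errors)
print("prediction for 2600 sq ft:", m * 2600 + b)
```

On this assumed data the fitted line gives roughly 533664.38 for a 2600 square foot home rather than the 550000 in the table, which matches the number reported in comment 2: the line minimizes the total squared error across all points, so it does not have to pass through any individual point.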
