Machine Learning Tutorial Python – 3: Linear Regression Multiple Variables



We are going to look into linear regression with multiple variables, also known as multiple linear regression (and often loosely called multivariate regression). Using this, we are going to predict home prices in Monroe Township, New Jersey. Here in this table I have various metrics available, such as area, bedrooms, and age, and these are the factors that the ultimate price depends on. In the previous tutorial we looked at simple linear regression with one variable, where price depended only on area. Now we are making our problem a little more complex by adding bedrooms and age, because as we know, in real life the home price depends on multiple factors, not just the square-foot area. After we build a model, we are going to predict the prices of these two homes.

Before we tackle any machine learning problem, the first thing we need to do is carefully analyze our data set, that is, our training data. When I look at this data set, the first thing I notice is that there is a data point missing here, so we have to do something to handle it. Another thing I notice is that there is a linear relationship between each of these factors and our target variable, which is price. For example, as the home gets older, the price tends to go down. Here I have a 3,200 square foot home that is 18 years old, and the price is more than 600,000, whereas here I have a slightly bigger home, 3,600 square feet, but since it is older, its price is lower than that of the smaller home. Similarly, as the area and the number of bedrooms go up, the price also tends to go up. So overall, by this analysis, we can say that for this data set we can safely use linear regression.

If we go into the math a little bit, our linear equation will look something like this, where price depends on three factors: area, bedrooms, and age. These three factors are called independent variables, or features. "Features" is a term that you will hear often while going through machine learning material; it is nothing but an independent variable. The price is the dependent variable, m1, m2, and m3 are called coefficients, and b is the intercept. This can be generalized into an equation with n independent variables. I just have three here, but you can have more than three.

The topics we are going to cover in this tutorial are these: first we look into how to handle that missing data point, and then I will build a model in Python code.

All right, now let's go back to our Jupyter notebook. Jupyter notebook is what I'm using to write my code; you can use PyCharm or any other IDE of your choice. I have a CSV file here with all my home prices along with these factors. The first thing I have done here is import the necessary modules, and then I'm going to load my data from homeprices.csv into a pandas data frame. If you don't know about pandas data frames, I have a separate pandas tutorial. Pandas is extremely useful for machine learning, so I highly recommend it if you don't know about it already.

So now I have my data frame ready, and I can see this null data point, and I need to handle it. The way I will handle it is to take the median of this entire column and put it here. Since the number of bedrooms is missing, taking the median seems like a safe assumption. So the first thing I will do is calculate the median. How do you calculate the median of bedrooms? df.bedrooms gives you a pandas series, and when you call median() on it, it gives you the median. The result is 3.5, because the median here is the average of the two middle values. I want to keep it a whole number, so I will import the math module and say median_bedrooms is equal to math.floor of that value, and the median bedrooms comes out as three.

The way I fill this column's NA values is with the fillna function, which is available on a pandas series. df.bedrooms gives you one column, which is nothing but a pandas series. On that I call fillna, and I want to fill all NA values with this median number. You can see that I got a new series where the NaN value is now replaced with the median number. I need to assign this back to the original column so that my data frame gets updated, and when I print my data frame I can see that it looks much better.

So our data pre-processing step is over. To summarize: before applying any machine learning model, you need to pre-process your data. You need to clean it, because data is always messy and there are problems with it. You fix the errors, prepare your data, and only then train your actual machine learning model on it.

Now my data frame looks good and I am all set to train my model. The first thing I am going to do is create a linear regression object. I have imported linear_model here, and using it I will create a LinearRegression class object. The object got created fine. Now I will call the fit method. You will use this fit method often; it is used to train your model on your training set. Here my training set is a data frame. If you want to create a data frame from a subset of columns of your existing data frame, you can use this syntax with double brackets. My independent variables are area, bedrooms, and age, and my target variable is price. When I execute this, my model is ready.

Once it is ready, it's a good idea to take a look at the coefficients, which are stored in coef_. The coefficients I got are these. Just to summarize what they are: in this equation m1, m2, and m3 are the coefficients, so this is m1, this is m2, and this is m3. Now you will wonder where my intercept is. The intercept is stored in another variable called intercept_, so this number is b. Once we have m1, m2, m3, and b, and you have area, bedrooms, and age, which are your independent variables, you can calculate the price.

So here I can call predict. I can predict the price of a home that is 3,000 square feet, has three bedrooms, and is forty years old. So 3000, 3, and 40 is what I am going to supply here, and I find the price to be this much. You can see that the price of the 3,000 square foot home in the table is 565,000, but here we got much less, and the reason, I think, is mainly age: this home is forty years old, whereas that one is fifteen years old. If I make it 15, I get something closer. It is a little bit skewed, but you get a ballpark number.

So the price is this, and now I want to know how it calculated this price. It is the first coefficient multiplied by your square-foot area, plus the second coefficient multiplied by your bedrooms, plus the third coefficient multiplied by the age. Here it is missing the intercept, so we need to add that too. When you do that, you get the same price. Because of rounding it is not showing the full value, but it is essentially the same price.

Similarly, you can predict the price of the home where the values of those factors are 2500, 4, and 5: 2,500 square feet, four bedrooms, and only five years old. This home is very new, so I expect the price to be a little higher than average. Here you can see that roughly $580,000 is the price for this home. The home that was 2,600 square feet was priced at 550,000, but this home's price is higher, because that home was twenty years old whereas this home is only five years old, and it has one more bedroom; that one had only three bedrooms. So you can see how different factors play a significant role.

That's all I had. Now we are going to do an exercise, so that you can practice what you have learned in this tutorial. I have this CSV file, which contains hiring data for a firm. Based on the experience, the written test score, and the personal interview score, a candidate's salary is decided, and I have some past data. Using this data, you have to build a model for your HR department where they can feed in the experience and the various scores and get some idea of what kind of salary they need to offer a candidate.

In this data set you will find a few interesting things. First of all, in experience, these two cells don't have values, so you can just assume them to be zero. Here too there is one data point missing; here you shouldn't assume it to be zero, so maybe take a median. Also, linear regression models usually work on numbers, whereas here I have a string. The experience is again a number; it's just that our data set contains the word for that particular number. For this you can use Python's word2number module to convert that string into a number, so just do a pip install and use the module.

Using this, you are going to figure out the salary for two candidates who have these statistics. One candidate has two years of experience and scored nine and six on the two tests. The second candidate has twelve years of experience and scored ten out of ten on both. You have to find the recommended salary for these two candidates.

If you go to my GitHub page, I have all the tutorials available there, so you should go and download them. This notebook is whatever I just went through, and in the exercise folder I have a CSV file which you should download and practice on yourself. Then you can compare your answer to this notebook, where I have provided the entire solution. But don't get tempted and start by looking at the solution; first you should try to solve it on your own. All right, that's all I had for this tutorial. Thank you very much.
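
The walkthrough above can be sketched end to end as follows. This is a minimal sketch, not the author's notebook: the table is not part of the transcript, so the five rows below are an assumed stand-in for homeprices.csv, chosen to be consistent with the figures mentioned in the video and in the comments. The variable names new_homes, predicted, and manual are illustrative.

```python
import math

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed stand-in for homeprices.csv (the real table is not in the transcript).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3, 4, np.nan, 3, 5],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

# Fill the missing bedroom count with the median, rounded down to a whole number.
median_bedrooms = math.floor(df["bedrooms"].median())
df["bedrooms"] = df["bedrooms"].fillna(median_bedrooms)

# Train on the three features; the double brackets select a DataFrame of columns.
features = ["area", "bedrooms", "age"]
reg = LinearRegression()
reg.fit(df[features], df["price"])

# predict expects 2D input: one row per home, one column per feature.
new_homes = pd.DataFrame([[3000, 3, 40], [2500, 4, 5]], columns=features)
predicted = reg.predict(new_homes)

# The same number by hand: m1*area + m2*bedrooms + m3*age + b
manual = float(np.dot(reg.coef_, [3000, 3, 40]) + reg.intercept_)
print(predicted, manual)
```

With these assumed rows, the first prediction comes out at about 444,400, which agrees with the value quoted in the comments below. A different copy of the CSV, or a different way of filling the missing value, would give different coefficients.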

40 thoughts on “Machine Learning Tutorial Python – 3: Linear Regression Multiple Variables”

  1. Super explanation in a simplified way. Please make a video on regularisation, i.e. GridSearchCV utilization in linear and logistic regression.

  2. I'm getting this error " Found input variables with inconsistent numbers of samples: [3, 353] " the model I'm trying to build has 3 variables and 353 values in each variable. That shouldn't be a problem though. Any ideas?
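
One possible cause of this error, assuming the three variables were passed as three arrays of 353 values each, is that the input ends up with shape (3, 353), while scikit-learn expects one row per sample, that is (353, 3). The sketch below reproduces the message with random stand-in data (not the commenter's data) and shows that transposing fixes the shape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in data: three variables with 353 values each, stacked as rows.
stacked = rng.random((3, 353))   # shape (3, 353): features as rows
y = rng.random(353)

# Fitting on the stacked array fails because it looks like 3 samples, not 353.
try:
    LinearRegression().fit(stacked, y)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

# scikit-learn wants shape (n_samples, n_features), so transpose first.
X = stacked.T                    # shape (353, 3)
reg = LinearRegression().fit(X, y)
print(error_message)
print(X.shape, reg.coef_.shape)
```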

  3. ! pip install word2number

    from word2number import w2n

    df.experience = df['experience'].apply(w2n.word_to_num)

    df

  4. 10:28
    reg.predict([[3000, 3, 40]])
    Here, area = 3000, bedrooms = 3, age = 40. The output is an array with one value:
    array([444400.])
    Notice how this can be calculated manually:
    137.25*3000 + -26025*3 + -6825*40 + 383724.99999999983
    444399.9999999998, which is approximately equal to 444400.
    Brilliant, codebasics!! Comprehensive. This simplified version of yours is highly commendable.
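
The manual calculation in this comment can be written as a small helper. This is only a sketch: the coefficients and intercept are copied from the comment, and the function name predict_price is made up for illustration.

```python
# Coefficients (m1, m2, m3) and intercept (b) as quoted in the comment above.
coefficients = [137.25, -26025.0, -6825.0]
intercept = 383724.99999999983


def predict_price(area, bedrooms, age):
    """Return m1*area + m2*bedrooms + m3*age + b for one home."""
    values = [area, bedrooms, age]
    return sum(m * x for m, x in zip(coefficients, values)) + intercept


price = predict_price(3000, 3, 40)
print(round(price))
```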

  5. 9:00
    Now, we can find the predicted price of a house using the values we obtained. We have the equation,
    price = m1*area + m2*bedrooms + m3*age + b
    Now that we know the coefficients, m1, m2, m3, and the intercept, b, we can find the price.
    Given the values of the three features, the predict method, called as reg.predict([[area, bedrooms, age]]), gives us the predicted price, which we can save in a variable such as price.

  6. 8:06
    reg.intercept_
    This is the code to find the value of the intercept. And the following is the output: This is the value of b,
    in the equation, (price = m1*area + m2*bedrooms + m3*age + b)
    383724.99999999983

  7. 7:50
    price = m1 * area + m2 * bedrooms + m3 * age + b <== CODE (spaced out)
    dep. var = (coef_1 * AREA) + (coef_2 * BEDROOMS) + (coef_3 * AGE) + INTERCEPT <== PSEUDOCODE
    <<<<<<<<<<<<<<<<<<<<<<<<<<
    price = dependent variable (target)
    m1 = coefficient_1 (weight of feature_1)
    m2 = coefficient_2 (weight of feature_2)
    m3 = coefficient_3 (weight of feature_3)
    b = intercept
    >>>>>>>>>>>>>>>>>
    area = independent_variable_1 (= feature_1)
    bedrooms = independent_variable_2 (= feature_2)
    age = independent_variable_3 (= feature_3)
    >>>>>>>>>>>>>>>>>>>>>>>>>
    price = m1*area + m2*bedrooms + m3*age + b
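
The video mentions that this equation generalizes to n independent variables. Written out, with y for the target and x_1 through x_n for the features (these symbol names are an assumption; the video only names m1, m2, m3, and b), it would look like this:

```latex
y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b
```

For the home price example, n = 3 with x_1 = area, x_2 = bedrooms, and x_3 = age.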

  8. 5:50

    from sklearn import linear_model

    reg = linear_model.LinearRegression()

    reg.fit(df[[ 'area', 'bedrooms', 'age' ]], df.price)

  9. Instead of using an external library, we can also use the pandas replace method with a dict as the parameter:
    word2number = {'zero':0,'one':1,'two':2,'three':3,'four':4,'five':5,'six':6,'seven':7,'eight':8,'nine':9,'ten':10,'eleven':11}
    d.replace({"experience": word2number},inplace=True)
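
A variant of the same idea, sketched with Series.map instead of replace. The sample column below is hypothetical (the real hiring.csv is not shown on this page), and missing experience is treated as zero, as the exercise suggests.

```python
import pandas as pd

# Hypothetical sample of the experience column, with one missing value.
df = pd.DataFrame({"experience": [None, "five", "two", "eleven"]})

word_to_number = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
}

# Treat missing experience as "zero", then look each word up in the dict.
# Words that are not in the dict would become NaN, so extend it as needed.
df["experience"] = df["experience"].fillna("zero").map(word_to_number)
print(df)
```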

  10. I could not install the word2number package on Anaconda. Instead I used the following function, which I found on Stack Overflow:
    https://stackoverflow.com/questions/493174/is-there-a-way-to-convert-number-words-to-integers
    after that I used the following
    df.experience=df.experience.fillna('zero')
    df.experience = df.experience.apply(num2int)
    the other procedure is similar to the one explained in video
    However I get a different answer
    53290.89225 and
    92268.07223
    Is this normal? How do I check for the prediction error?
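
One way to check prediction error is to compare predictions with the known targets. The sketch below does this on assumed toy home-price rows (not the hiring data) using the mean absolute error and the R^2 score. Because it scores the same rows the model was trained on, it is an optimistic estimate; held-out data gives a fairer picture. Differences between answers can also come from preprocessing choices, such as how missing values were filled.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Assumed toy rows, with the missing bedroom value already filled in.
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3, 4, 3, 3, 5],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

X = df[["area", "bedrooms", "age"]]
y = df["price"]
reg = LinearRegression().fit(X, y)

predictions = reg.predict(X)
mae = mean_absolute_error(y, predictions)  # average absolute miss, in price units
r2 = reg.score(X, y)                       # R^2: 1.0 would be a perfect fit
print(f"MAE: {mae:.0f}, R^2: {r2:.3f}")
```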

  11. SOLUTION:
    AttributeError: module 'word2number' has no attribute 'word_to_num'
    from word2number import w2n  # import it this way
    df.experience = df.experience.apply(w2n.word_to_num)

    df.experience

  12. getting this error in pycharm
    AttributeError: module 'word2number' has no attribute 'word_to_num'

  13. I got (498408.25158031) for [3000, 3, 40], which is very different from yours. Why does this happen?

  14. What if one of my input features reduces the price as its value increases, for example the distance from the home to the metro?

  15. C:\Users\MM\Anaconda3\Scripts>pip install word2number

    pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.

    Collecting word2number

    Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/word2number/

    Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/word2number/

    Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/word2number/

    Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/word2number/

    Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/word2number/

    Could not fetch URL https://pypi.org/simple/word2number/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/word2number/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) – skipping

    Could not find a version that satisfies the requirement word2number (from versions: )

    No matching distribution found for word2number

    pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.

    Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) – skipping

    pls help here..

    thanks in advance

  16. Thanks a lot for your video series, very helpful! I've got one question if you don't mind.
    07:00 Why are you using double brackets here?
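
A short sketch of what the double brackets do, using a made-up two-row frame: single brackets return one column as a Series (1D), while double brackets pass a list of column names and return a DataFrame (2D), which is the shape fit expects for its features.

```python
import pandas as pd

# Made-up rows, just to show the shapes.
df = pd.DataFrame({
    "area": [2600, 3000],
    "bedrooms": [3, 4],
    "price": [550000, 565000],
})

single = df["area"]                # one column name -> Series, shape (2,)
double = df[["area", "bedrooms"]]  # list of column names -> DataFrame, shape (2, 2)

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```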

  17. import pandas as pd
    import numpy as np
    from sklearn import linear_model
    from word2number import w2n

    dt=pd.read_csv('hiring.csv')
    dt
    import math
    median_test=math.floor(dt.test_score.median())
    median_test
    dt.test_score=dt.test_score.fillna(median_test)
    dt
    dt.experience=dt.experience.fillna('zero')
    dt
    dt.experience=dt.experience.apply(w2n.word_to_num)
    dt
    reg=linear_model.LinearRegression()
    reg.fit(dt[['experience','test_score','interview_score']],dt.salary)
    reg.predict([[2,9,6]])
    reg.predict([[12,10,10]])

  18. Hi sir, great video, and very useful.
    I am facing an issue:
    'from word2number import w2n' gives the error
    No module named 'word2number'

    Please help.
    Thanks in advance.

  19. Hey man, thanks for the video. I do not know why, but I am getting different reg.coef_ values than yours. Can you guess why?

  20. Hey sir
    I downloaded the ML file that you uploaded to GitHub. When I used the fillna method on the experience column, I got an error: my compiler says the DataFrame has no attribute experience. How do I resolve this error? Please help me.
    I am using a.experiance=a.experiance.fillna("zero")

  21. After going through many youtubers, tutors, I was not able to start with the ML concepts,coding. Your content is awesome, very easy to understand. Thanks for these videos & your time. Appreciated…

  22. You are really awesome…. Please make more videos….. I can easily understand your tutorial….
