Regression Intro – Practical Machine Learning Tutorial with Python p.2


Alright, so now we're going to get started with setting up a simple linear regression example. The first thing we need to make sure we have is scikit-learn, pandas, and Quandl. So open up a terminal or command prompt, whatever you use, and pip install sklearn, pip install quandl, and pip install pandas. Once you have all of those, you're good to go, so go ahead and pause the video and pick back up once you have them installed.

Okay, so once you have those, let's get started with a simple example. We're starting with regression, and the idea of regression is to take continuous data and figure out a best fit line to that data. Basically, we're trying to model the data, and with simple linear regression that model is just a straight line. As you might remember from school, the equation of a line is y = mx + b; if you have m and b, then for any x you can figure out y. So the whole point of regression is to find out what m and b are. We'll talk more about that down the line.

A lot of people use regression with stock prices, so that's what we're going to do here, at least in this one. The idea is that this is continuous data: you've got months and months of stock prices, and each price sits on its own unique day, but all of the data is one dataset together, as opposed to classification, where each group of data has its own unique label. With supervised machine learning, basically everything boils down to features and labels. Features are your attributes, in this case the continuous data. So let's go ahead and get started, and we'll talk a little more about features as we go.

First of all, import pandas as pd, and then import Quandl with a capital Q (in newer versions of the package the module name is lowercase, so import quandl). Then we'll say df, for dataframe, equals quandl.get() with the ticker. You can get this from Quandl: go to quandl.com, use the search, and look for something like "google stock". We're using the free WIKI dataset; when you search you'll find all kinds of different datasets, but we're looking just for the WIKI one. Once you pick a dataset, you can either download it directly or, more importantly, grab the Quandl code, click on Python, and it shows you the exact statement to get it. If you have an account, you can make basically unlimited requests for free data. We're not going to use an account or an auth token here; without an account, I think you're limited to something like 50 calls a day. We're only using Quandl fairly short term, so you really don't need to create an account, but if you like Quandl you might as well make one at some point.

So anyway, quandl.get() with "WIKI/GOOGL" as the ticker, and then we can simply print df.head() just so we can see what we're working with. We'll see that basically each column here is a feature: the open, high, low, and close are all features.
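Here is a minimal sketch of that setup, assuming the PyPI package names scikit-learn, pandas, and quandl, and that the free WIKI/GOOGL dataset is still reachable (several commenters below note you may now need an API key):

    # From a terminal or command prompt (not the Python shell):
    #   pip install scikit-learn pandas quandl

    import pandas as pd
    import quandl   # newer releases use a lowercase module name

    # Pull Alphabet/Google daily prices from Quandl's free WIKI dataset
    df = quandl.get("WIKI/GOOGL")

    # Each column (Open, High, Low, Close, Volume, and the Adj. versions) is a candidate feature
    print(df.head())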
In machine learning you can have all the features you want, but you want meaningful features, features that actually have something to do with your data. Some people are pretty avid believers in ideas like pattern recognition with stock prices, and that might be you, but do you need every single one of these open, high, low, close columns to do pattern recognition? No.

You'll also notice we've got open, high, low, close, and volume, plus the adjusted versions. "Adjusted" means adjusted for things like stock splits. In a stock split, maybe your company has 10 shares and each share is $1,000, and you decide you want people to be able to buy shares for less than $1,000, so you say, BAM, every share is now two shares. Now you have 20 total shares and the share price is $500. Adjusted prices account for that, so it doesn't look like the stock price dropped from $1,000 to $500. Those adjusted columns are the ones we're going to use.

Again, each of these columns is closely related to the others; the correlation between any two of these price columns is super high. So would you use every one of them? Does each additional one really bring that much meaningful data? No. One thing to always think about with features, though, is the relationship between columns. When we get into something like deep learning, and some of the other algorithms, you can start to discover relationships between attributes, but with simple regression, no: you want to simplify your data as much as possible. You want as many meaningful features as you can get, but useless features, as we'll show throughout this series, can cause a lot of trouble for your machine learning classifiers, especially the simpler supervised ones.

Anyway, let's close out of that and grab some features. First we're going to pare this down: we're going to say df equals df with a list of all the columns we want to keep, which is the adjusted open, high, low, close, and volume. So now we've recreated our dataframe to be just the adjusted open, high, low, close, and volume columns.
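In code, that pruning looks roughly like this; the column names, with a space after "Adj.", are the ones Quandl's WIKI dataset uses:

    # Keep only the split-adjusted columns
    df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
    print(df.head())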
Like I was saying, some of these columns are relatively worthless on their own, but they do have useful relationships. For example, what's interesting about high and low is that the margin between them tells us a little about volatility for the day. Also, the open price is the starting price for the day, and its relationship to the close price tells us whether the price went up or down, and by how much. So those relationships are very valuable, but a simple linear regression is not going to seek them out; it's just going to work with whatever features you feed it. What we need to do is define those relationships ourselves and use them as our features, rather than a handful of almost redundant prices that aren't going to give us anything extra.

First, let's do the high minus low percent, which is something like the percent volatility for the day. We'll define a new column called HL_PCT. The percent change here would be the high minus the low, divided by the low, times 100; in the code as typed it ends up being df['Adj. High'] minus df['Adj. Close'], divided by df['Adj. Close'], times 100 (note the adjusted close rather than the adjusted low), computed on a per-row basis, this column minus that column. You can multiply by 100 or not; the classifier really isn't going to care about that, we're just doing it for ourselves.

Then we want the daily percent change, the daily move. Normal percent change is the new minus the old, divided by the old, times 100, so for us that's the adjusted close minus the adjusted open, divided by the adjusted open, times 100. We'll call this column PCT_change. You could actually divide by the close again, and the classifier wouldn't really care as long as everything is normalized consistently, but dividing by the open is the proper way to do it.

Once we have that data, we redefine the dataframe to contain only the columns we actually care about: the adjusted close, the high-low percent, the percent change, and the volume. Volume is just how many trades occurred that day, and it's also somewhat related to volatility, so you could make more features out of some relationship there, but we'll keep this pretty simple. For now we just print df.head() to make sure everything worked out, and sure enough it did: we have all the numbers we're interested in.

So we've got our features, and eventually the adjusted close may actually end up being our label, but we'll get to that. Something to think about between now and the next tutorial: features are the attributes that make up the label, and the label is, hopefully, some sort of prediction into the future. So will the adjusted close column actually be a feature, or will it be a label as it stands right now? Think about that, and in the next tutorial we'll pick it up and start getting closer to actually making real predictions with this data. If you have any questions or comments, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.
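A sketch of those two derived features and the final column selection, following the video (which computes HL_PCT against the adjusted close rather than the adjusted low):

    # Rough per-day volatility as a percent of the close.
    # (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] would be the literal high-low spread.
    df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0

    # Daily percent move: (new - old) / old * 100
    df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

    # Keep only the columns we actually care about going forward
    df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
    print(df.head())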

100 thoughts on “Regression Intro – Practical Machine Learning Tutorial with Python p.2”

  1. If anyone's getting a "No module named 'Quandl'" error I fixed mine by changing all the 'Quandl's to lowercase.

  2. hey man, i'm facing an SSL certificate error, plz tell me how to resolve it. i have tried tons of ways but failed

  3. Can you please redo this video? A lot has changed apparently and I get soooo many errors from the beginning of the tutorial

  4. Hey, I came from your python introduction/basics playlist and man, ur amazing!! Thanks a lot!
    I wanted to know what things I should know (like installing modules and all) to start this series…

  5. for folks having trouble, you'll need to create an account on quandl to use the data. Once you create an account you get an apikey. Run this command before importing data: quandl.ApiConfig.api_key = "YOURAPIKEY"

  6. When I try to install pandas in cmd I get a message that says "Requirement already satisfied", then I try to run the code in python and it throws "no module named pandas". What can I do?

  7. import pandas as pandas

    import quandl

    df = quandl.get("WIKI/GOOGL")

    print(df.head())

    #quandl should be lowercase
    3 April 2019

  8. !pip install sklearn

    !python -m pip install --upgrade pip

    !pip install --upgrade pandas

    !pip install Quandl

    import pandas as pd

    import quandl

    df = quandl.get('WIKI/GOOGL')

    df.head(15)

    df.tail()

    df.columns

    use_columns = ['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']

    df = df[use_columns]

    df.head()

    rename_cols = ['Adj_Open', 'Adj_High', 'Adj_Low', 'Adj_Close', 'Adj_Volume']

    df.columns = rename_cols

    df.columns

    df['HL_PCT'] = (df.Adj_High - df.Adj_Low)/df.Adj_Low*100.0

    df.head()

    df['PCT_change'] = (df.Adj_Close - df.Adj_Open)/df.Adj_Open*100.0

    df

  9. For some reason I just cannot import quandl. I've tried lower case and upper case, and I'm still getting ModuleNotFoundError
    the pip install process went smoothly, I have no idea why I can't import it now. Has anyone had a similar problem or know how to fix this? Please help

  10. Please, can anyone tell me which editor he is using to type the code, where the results then appear in the Python shell?

  11. "So the equation of a line… well, we'll talk about more down the line." Heh, nice pun man. Made me chuckle

  12. these equations work fine but when I return my statements, all the columns for the PC variables are 0. Any suggestions?

  13. guys , I have LimitExceededError, could you please send a download file, so I won't get that error, thanks

  14. I feel like im really lost here when i watch these, should i watch your pandas/ other python library tutorials to get a better understanding of these functions you are implementing?

  15. Quandl's a piece of shit. I ran it once then it told me I had exceeded my 50 a day. So I signed up for a free account but there's just more and more layers of nonsense once you have the free account. Not a good learning tool.

  16. Hi @sendex, can you share the 'WIKI/GOOGL' dataframe somewhere in Google Drive maybe? I try to find it in Quandl and fail to do so. I am a newbie and really interested in learning python machine learning. I think the best way to learn is to follow tutorial one by one.

  17. Hi Guys,

    I need your help!!

    WIKI database is no longer updating stock prices. I don't know how the Quandl API works, but I found a webpage which provided code to import data from quandl:

    import quandl

    quandl.ApiConfig.api_key = 'INSERT YOU API KEY HERE'

    data = quandl.get_table('WIKI/PRICES', ticker = ['AAPL', 'MSFT', 'WMT'],

    qopts = { 'columns': ['ticker', 'date', 'adj_close'] },

    date = { 'gte': '2015-12-31', 'lte': '2016-12-31' },

    paginate=True)

    data.head()

    I want to know which quandl dataset is free so I can import data in python. I also want to know how python already knows my API key, because even I don't know what my API key is!! But as per the code above, python is able to import data without me entering my API key!!

  18. Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-l3KAiT/sklearn/

  19. i can't install quandl.. "could not find a version that satisfies the requirement quand1" what's this error .. and how can i fix it??

  20. if pip install is not working, go to This PC, then Properties, then Advanced system settings (left corner), then Environment Variables; under Path, add a new entry "C:\Users\yashc\AppData\Local\Programs\Python\Python37-32\Scripts" (your Scripts location)

  21. OK, I dont know, but I am unable to get the full dataset, after 'open, high and low' there is 3 dots followed by "Adj. Low Adj. Close Adj. Volume" and there is not Adj. High, which causes KeyError or something.. If anybody by any mistake getting typeerror, then they can do this , "df = pd.DataFrame([['Adj.Open','Adj.High','Adj.Low','Adj.Close','Adj.Volume']])" which has solved the problem for me but not the KeyError error errrr… -_- otherwise good tutorial.. ty (pd is import pandas as pd)

  22. i was somehow exceeding the 50 limit without actually using it 50 times. I looked at it and it is very hit and miss to get around, so I saved "df" to a pickle file:

    import pickle
    import quandl

    try:
        # to save it
        df = quandl.get('WIKI/GOOGL')
        pickle_out = open("df", "wb")
        pickle.dump(df, pickle_out)
        pickle_out.close()
    except:
        # to load it
        pickle_in = open("df", "rb")
        df = pickle.load(pickle_in)

    so now you can bypass quandl.get and test as many times as you want. i hope this helps

  23. 7:32 : In creating features, shouldn't High-Low percentage be 'High – Low' instead of 'High – Close'. Just an observation.

  24. is there any way to fix the quandl not found error, or any method to use a similar dataset some other way? because i am a complete beginner to python and quandl is not found. I tried almost every way from the internet. Please help @sentdex

  25. File "<ipython-input-2-ae39b1471d40>", line 5

    df['label'] = df(forecast_col.shift(-forecasts_out))

    ^

    SyntaxError: invalid syntax

    anyone please help me on this???

  26. this thing is not working, i tried lower case for quandl and it's still giving me an error, and the error is talking about quandle.get…someone pls help me out

  27. So is the point of having PCT_Change to add another pattern/feature for the model to take in as an input? Or am I missing something crucial…

  28. I'm happy that people like you exist . In Indian engineering colleges we dont have courses on ML in the computer science stream . Thanks a lot !

  29. df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume',]]

    df['HL_PCT']=(df['Adj. High']-df['Adj. Close'])/df['Adj. Close']*100

    df['PCT_CHANGES']=(df['Adj. Close']-df['Adj. Open'])/df['Adj. Open']*100

    df = df[['Adj. Close','HL_PCT','PCT_CHANGES','Adj. Volume']]

    print(df.head())

    ---------------------------------------------------------------------------

    TypeError Traceback (most recent call last)

    <ipython-input-21-979b6d156e02> in <module>

    ----> 1 df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume',]]

    2

    3 df['HL_PCT']=(df['Adj. High']-df['Adj. Close'])/df['Adj. Close']*100

    4 df['PCT_CHANGES']=(df['Adj. Close']-df['Adj. Open'])/df['Adj. Open']*100

    5

    TypeError: list indices must be integers or slices, not list

    anyone can help on this???

  30. Can't fetch data from 'WIKI/GOOGL'

    Following error shown:
    ————————————
    Exception has occurred: QuandlError

    (Status 403) Something went wrong. Please try again. If you continue to have problems, please contact us at [email protected]

  31. To avoid "limit exceeded" messages when using quandl, I broke down and signed up for a free account to get my own api key. You can call this out in your code with the directive:

    import quandl
    quandl.ApiConfig.api_key = "xxxxxxxxx"

    If you store it as an os environment variable (say, QUANDL_API_KEY), you won't have to put your key in every script. Instead, just add the lines:

    import os
    import quandl
    quandl.ApiConfig.api_key = os.environ['QUANDL_API_KEY']

    On Linux or Mac, you can add the following line to your .profile (or .bash_profile, depending on your distro):

    export QUANDL_API_KEY="xxxxxxxxx"

    On Windows you'll need to open "Advanced System Settings" and click "Environment Variables" to add a new User variable for your Windows user.

  32. >>> import pandas as pd

    >>> import Quandl

    Traceback (most recent call last):

    File "<pyshell#2>", line 1, in <module>

    import Quandl

    ModuleNotFoundError: No module named 'Quandl'

    >>>
    why does this error occur?!

  33. if any of you guys are having this error (list indices must be integers or slices, not str), make sure you didn't type this like me.

    The Wrong ONE !!

    df = [['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

    The True ONE!
    df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

  34. You are writing all these commands in the default python IDE but can I also execute them in PyCharm? I installed all the pip modules after a lot of effort due to errors.

  35. I am getting an error in this, can anybody please help, including the author?

    KeyError Traceback (most recent call last)

    <ipython-input-9-f1c7b3306cde> in <module>

    3 df=quandl.get('WIKI/GOOGL')

    4 print(df.head())

    ----> 5 df= df[['Adj.Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

    6 df['HL_PCT']=(df['Adj. High']-df['Adj. Close'])/(df['Adj. Close'])*100

    7 df['PCT_Change']=(df['Adj. Close']-df['Adj. Open'])/(df['Adj. Open'])*100

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)

    2984 if is_iterator(key):

    2985 key = list(key)

    -> 2986 indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)

    2987

    2988 # take() does not accept boolean indexers

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)

    1283 # When setting, missing keys are not allowed, even with .loc:

    1284 kwargs = {"raise_missing": True if is_setter else raise_missing}

    -> 1285 return self._get_listlike_indexer(obj, axis, **kwargs)[1]

    1286 else:

    1287 try:

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)

    1090

    1091 self._validate_read_indexer(

    -> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing

    1093 )

    1094 return keyarr, indexer

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)

    1183 if not (self.name == "loc" and not raise_missing):

    1184 not_found = list(set(key) - set(ax))

    -> 1185 raise KeyError("{} not in index".format(not_found))

    1186

    1187 # we skip the warning on Categorical/Interval

    KeyError: "['Adj.Open'] not in index"

  36. When I use df['Adj. Open'] etc. it gives me the error: "list indices must be integers or slices, not str"
    Halp, am I doing something stupid or has there been a change that I'm not aware of? I wrote everything exactly like mr sent, python 3.7

  37. Also, the line df = [['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume',]] really just redefines df into a list of those strings…
