Hi Braden - I've been wrestling with how to speed up some conditional changes to a large 1M-row x 100-column data frame - the entire df is transformed to True/False according to a condition, which is a function of a quantile for each column. I had to use iteritems so I could go column-wise. I did not need df.index but used np.where. It took 2.89 secs for the 1M x 100 data frame. I expected something more like 30 secs, but still, wow! Thank you for your video - it's the best.
Thank you a lot! I have two files with huge amounts of data to compare against one another, and it has taken me over 2 hours to loop through and do all the checks. Tomorrow I'll try one of your techniques.
What was the result?
Savage! Best loop explanation on youtube!
Not just improvements for large data sets, but also for operations that need to happen many times per second.
May I clarify: at 1:57, is the returned row data a pandas Series or a dictionary? Thank you. Fantastic guide.
Good question. row is a pandas Series.
How do I select only the first value of the dataframe that satisfies the condition? E.g. if the condition is satisfied by the second row, I need to ignore it. In your example you convert to 1 if the value is greater than 0.5 - if the condition is satisfied by later rows as well, those also need to be ignored. I want to select only the first row where the condition is satisfied.
Very good tutorial. If we want to access the previous row inside an apply function, how can we do that? If I have an index defined, can I do row.index - 1? And if I don't have an index defined, how do we know the location of the current row?
Finally a clear and concise explanation. Many thanks!!
By the way, if I need to make a calculation between two sequential rows, say A[i] + B[i] / C[i+1] * D[i+1], how could this be done with the fast methods?
Thanks again
Thanks! This is a great question. You should be able to accomplish this with the pandas shift() method. I'll work on an example I can share with you tomorrow! 😀
@@ChartExplorers Eagerly waiting for it.
@@damon1236 Thanks for the reminder! Alright here is what you can do for a vectorized solution.
df['new_col'] = (df['A'] + df['B']) / (df['C'].shift(-1) * df['D'].shift(-1))
The first part, (df['A'] + df['B']), is straightforward: it adds the two columns together. The second part, (df['C'].shift(-1) * df['D'].shift(-1)), uses the shift method to move all the values in the column up one row (I know it is weird to use -1 to shift up, haha). We take the sum from the first part and divide it by the second part to get our answer :)
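If you want to see what shift(-1) actually does, here is a small toy example (the column names A-D and the numbers are made up just for illustration):

import pandas as pd

# toy frame just to illustrate shift(-1); real column names may differ
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})

# shift(-1) moves each column up one row, so row i sees row i+1's values
df['new_col'] = (df['A'] + df['B']) / (df['C'].shift(-1) * df['D'].shift(-1))
# the last row has no "next" row, so new_col ends with NaN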
Wow... thanks!!! I'll stay tuned here for many other little treasures... please think about JSON manipulation with very nested structures next time 😉
@@oswaldocastro9600 Good idea! thanks for your support :)
I was wondering what's the most "pandas" (and fast) way to do the following.
Each row of my table is associated with a patient, and each column with a certain connectivity measure (200 of them). For each column n with label tag[n] I compute the average avg[n] = df[tag[n]].mean() and the standard deviation std[n] = sqrt(df[tag[n]].var()).
I would like to run the following pseudocode:
for row in df:
for n in range(number of columns):
row[n] = (row[n] - avg[n])/std[n]
How'd you do that?
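One possible vectorized sketch for that (assuming all the connectivity columns are numeric): subtracting df.mean() and dividing by df.std() broadcasts the per-column values across every row, so no explicit loop is needed.

import numpy as np
import pandas as pd

# toy stand-in: 4 patients x 3 connectivity measures
df = pd.DataFrame(np.random.rand(4, 3), columns=['m1', 'm2', 'm3'])

# column-wise z-score for all rows at once;
# df.mean() and df.std() return per-column values that broadcast across rows
z = (df - df.mean()) / df.std()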
Great video. Helped a bunch.
Any idea how to apply these methods to more columns with changing outputs - and write the results back into the same column?
Thanks. Possibly, could you provide me with an example? If it is easier you could join my discord server here (so our discussion is not confined to small comment boxes). discord.gg/EEajBNpeK9
Finally I get a good reason to master numpy, thank you so much!
It's taken me a long time to realize how amazing numpy is. Now if I know how to do something in numpy I will!
I'm trying to flatten a JSON using pandas... I'm trying to create a generic script which can continuously iterate and flatten the dicts and lists, but I'm still struggling with that. If you can help in this case, can you give me an idea of how I can achieve this?
How do I change pandas & numpy code into PySpark? Or how do I boost the performance of pandas and numpy probability or matrix calculations?
Great stuff! What approach would you recommend when not comparing to a static value, but to another DataFrame, where I just want to test whether df1['col'] == df2['col']?
Thanks again for sharing the knowledge....
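One rough sketch for that case (assuming both frames have the same length and row order, so a positional comparison makes sense):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col': [1, 5, 3, 0]})

# element-wise, positional comparison of the two columns via .values
df1['match'] = np.where(df1['col'].values == df2['col'].values, 1, 0)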
I want to know how to write a large dataframe (35K rows, 1200 columns) to CSV. The data is strings and numbers. It takes 3 min to write. Is there any way to make it faster?
This is a good question. I'll need to look into this in more depth... Here are a couple options that should speed up the output process:
1) Ensure your pandas is the most up to date
2) Try using numpy's savetxt numpy.org/doc/stable/reference/generated/numpy.savetxt.html
3) Consider using pickle instead of csv (this might not be possible depending on your situation). The usage mirrors csv - just replace csv with pickle:
df.to_pickle('file_name.pkl')
and
pd.read_pickle('file_name.pkl')
I noticed that the focus was on specific columns. What if I want results from all the columns?
Excellent, bro. Can you please help me with my situation? I have to first split a column on the ',' separator, then check each string: if a string is an A value it should be placed in the A column, and if it is a B value it should go in the B column.
Thanks ALI. Could you give me some examples of what the first 5 rows look like from the column you need to split and filter.
For example:
row0: 'a, b'
row1: ' b,a'
row2: 'b,b'
row3: 'a,b'
row4: 'a,b,c'
@@ChartExplorers Thanks for the reply. Below is an example - ColA and ColB are auto-generated based on Col1:

        Col1          ColA       ColB
Row1    29A,45B       29A        45B
Row2    1,33A,44B     33A        44B
Row3    12A,35A,2B    12A,35A    2B
@@malikhan0123 Try this out and please let me know how it works. If you have a very large df and want to optimize this code further then jump onto this discord server for my channel and we can discuss things further. Or if you want further clarification on what the code is doing.
discord.gg/mCRwew6n
import pandas as pd
import re
# create a function to get the A or B values.
# If a row has more than one A (or B) value, multiple matches are
# returned; join them so the cell_value is a string rather
# than a list.
def get_a_or_b(a_or_b, x):
    r = re.compile(f".*{a_or_b}", re.I)  # re.I makes it case insensitive
    list_ab = list(filter(r.match, x))
    cell_value = ','.join([str(elem) for elem in list_ab])
    return cell_value

# create practice DF
df = pd.DataFrame({'need_to_split': ['29A,45B', '1,33A,44B', '12A,35A,2B']})

# split the column on ,
df['split'] = df['need_to_split'].str.split(',')

# Create col_a by passing A to the function we created
df['col_a'] = df['split'].apply(lambda x: get_a_or_b('A', x))

# Create col_b by passing B to the function we created
df['col_b'] = df['split'].apply(lambda x: get_a_or_b('B', x))
@@ChartExplorers Thanks a lot Brayden, it's working on my data set. I just need to clean the data before feeding it in, as this throws an error if a cell is empty. I haven't used the re library in the past - can you make a video on that and explain this code? Thank you once again.
@@malikhan0123 I have written it down to create a video on regular expressions using the re library which I should be able to get out in the next 10 days or so. I'll either make a video or do a write up that explains what the code is doing in the example I gave you. Happy Analyzing!
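In the meantime, a minimal guard for the empty-cell problem mentioned above (assuming the empty cells come through as NaN): replace them with an empty string before splitting, so get_a_or_b always receives a list.

# assumed cleanup step before the split shown earlier
df['split'] = df['need_to_split'].fillna('').str.split(',')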
Very cool showing the performance differences, waiting for your Lambda video
Super comparison.
Thanks!
How did you find out/learn which code returns faster? I hope my question makes sense.
I want a new column, say B, in which I want the values of the next row of A.
E.g. if column A has 1,2,3,4,5, I should get 2,3,4,5 in column B. Can you help me with that?
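A minimal sketch of that: shift(-1) pulls each value up one row, and the last row of B is left as NaN because there is no next value.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df['B'] = df['A'].shift(-1)  # B holds the next row's value of A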
How can we filter the first non-null value in every row of a dataframe? Pls reply
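One possible sketch (assuming you want a single value per row): back-fill along the columns so each row's first cell ends up holding its first non-null value.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 7, np.nan],
                   'b': [3, np.nan, np.nan],
                   'c': [5, 9, 4]})

# bfill(axis=1) pulls the next non-null value leftwards,
# so column 0 then contains the first non-null value of each row
first_non_null = df.bfill(axis=1).iloc[:, 0]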
can you exclude the 'else' from the list comprehension? I can't get it to work without the else part.
You can, check out example 1 on this website. If that doesn't work you can send me your list comprehension and I'll see if I can figure it out.
@@ChartExplorers thanks, which website ?
@@DJBONEZDIEZ haha, well that's awkward I didn't paste in the website!
stackoverflow.com/questions/12709062/python-lambda-with-if-but-without-else
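The gist, as a small sketch: an if at the end of a comprehension is a filter and needs no else, while an inline if/else expression at the front always needs both branches.

values = [0.2, 0.7, 0.4, 0.9]

filtered = [x for x in values if x > 0.5]         # filter only, no else needed
labelled = [1 if x > 0.5 else 0 for x in values]  # conditional expression needs else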
You are making these time comparisons on different columns. Although the times may be similar, I feel showing the calculation comparisons on one particular column would have been a better approach.
Can you do multiple ifs in the last one
Thank you!
I've been struggling with filling in a recommendation column based on a check of three other columns for weeks! Your iloc_example was the piece that I was missing.
Thanks very much, just saved me a huge amount of time analysing :D
You're welcome!
I need help in Python - can we connect?
How can we update rows dynamically on all columns, rather than mentioning the name of each column? I have the same scenario but the number of columns is 4000+.
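One rough sketch of doing it for every column at once without naming them (assuming all columns are numeric, and using the same 0.5 threshold as in the video):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 4), columns=['c1', 'c2', 'c3', 'c4'])

# np.where operates on the whole 2-D array, so no column names are needed
out = pd.DataFrame(np.where(df.values > 0.5, 1, 0),
                   index=df.index, columns=df.columns)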
What if instead of a particular column (A or B, ...), I want to loop over each row to find the largest value in each row?
In your example we could use a built in method (max). I think you will find this video particularly helpful ruclips.net/video/n2ff3rcLgnE/видео.html
Hi, I have a .csv file with a text column and multiple rows. I wish to execute a function on each row of the text and have a resultant column as well. Can I use the iterations methods shown in the video to accomplish this?
Hi Blessen Koshy, it depends on what method you are using and what function you are applying to your text column. Can you provide more details?
@@ChartExplorers thanks for the quick response. I have a column with rows consisting of text (say 200-500 chars on average). I'd like to manipulate data - like executing a summarize function, append text, remove text etc. and create a new column to compare changes. Is this straightforward? Please let me know if you'd need to take this conversation via a private communication thread.
this was a great video. Thanks for the clear demonstration of all these different options and the pros/cons
Thank you! I'm glad you found it helpful!
@@ChartExplorers ur stuff is acc quality content. Props to you bro
@@ultimategamer7859 Thanks!!
Great presentation. Exactly what I was after.
Hey Brayden, super helpful video! I noticed though, after running some tests, that "np.where" doesn't seem to work when using int values in the dataframe. I found that initializing a numpy array beforehand and then using that, instead of directly calling df['column_name'].values in the "np.where" method, works, but I'm not exactly sure why. If you have a reasoning behind this I'd love to hear it!
Can you give an example?
I will be sure to use np.where more often from now on
Thank you
Hi! Thanks for such a nice explanation. Can you please make a video on how to iterate over each row and find the max/min value for each row across multiple columns? Suppose I have a dataframe showing Jan - Dec (in rows) data for 5 scenarios (in columns). Now I would like to know which scenario has the max/min value for each row. Thanks in advance
Hi Niladri, Thanks for the suggestion! I'll start working on that video right away. I'll post it here in the comments when I'm finished.
Here is a video I created on how to get the min & max values of each row. It also discusses how to get the index location of those values as well. ruclips.net/video/n2ff3rcLgnE/видео.html
@@ChartExplorers Awesome! And THANK YOU for considering my request. Wish you all the very best 👍
How can I iterate through a column?
How about a string? Which method will be fastest?
I'm glad you brought this up. This is something that I should have discussed in the video. If possible, use pandas' built-in string methods (.str). When this is not possible, try to use np.where() or np.select() (these methods still work with strings). Most other things you should be able to cover with .apply and a lambda function. And if none of those work, use .iloc - but you should never have to use df.iterrows, even with strings.
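A quick sketch of those string-friendly options (the fruit data is made up just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'Banana', 'cherry', 'banana']})

# pandas built-in string methods
df['is_banana'] = df['fruit'].str.lower() == 'banana'

# np.where also works fine on string columns
df['label'] = np.where(df['fruit'].str.startswith('a'), 'A-fruit', 'other')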
@@ChartExplorers Thanks for answering so promptly. This has been driving me crazy. Turns out there's been a "standoff" between Numpy vs other camps. One of the posts at this (warning: lengthy) discussion at StackOverflow explains:
stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur
"There is a disagreement between Numpy and native python on what should happen when you compare a strings to numpy's numeric types. Notice the left operand is python's turf, a primitive string, and the middle operation is python's turf, but the right operand is numpy's turf. Should you return a Python style Scalar or a Numpy style ndarray of Boolean? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.
Should it be elementwise comparison or Scalar if item exists in the array?
If your code or library is using the in or == operators to compare a python string to numpy ndarrays, they aren't compatible, so when you try it, it returns a scalar, but only for now. The Warning indicates that in the future this behavior might change so your code pukes all over the carpet if python/numpy decide to adopt the Numpy style."
Submitted Bug reports:
Numpy and Python are in a standoff, for now the operation returns a scalar, but in the future it may change.
github.com/numpy/numpy/issues/6784
github.com/pandas-dev/pandas/issues/7830"
EXTREMELY HELPFUL!
Glad it helped!
This changed my life. Thanks!
Mine Too!! 😛
Thank you for this, very helpful. I'm trying to apply this in a particular way. I have two data frames: A and B. Data frame A contains 37 rows and 1 column, called Indicator. Data frame B has many rows and 2 columns: ID and Indicator. What I want to do is iterate over data frame A's Indicator column and count the number of IDs in data frame B that match that indicator. I was successful in iterating and doing the nested group by; however, the group by value does not reset and only returns the very first iteration for all 37 indicators. What's missing?
Thanks
Thanks. Can you explain it once again for beginners, please?
This was enlightening. Many thanks!
Very cool, thanks for taking the time to make this.
You are awesome! Such useful content!
Thanks, I'm glad you liked it!!
Thanks, very useful. Please can you talk about geopandas and spatial data?
I'm glad you found this useful. Geospatial data is a little out of my wheel house, but I'll see what I can do 😀
@@ChartExplorers Thanks a lot for your interaction, and keep up the good work.
Great video, thanks!
That's amazing thanks!
I'm just wondering what the drawbacks are? Why is it mentioned on SO that "iterrows should only be used in very, very specific cases"?(stackoverflow.com/a/24871316/11901732) Is it appropriate to be used in sequential search such as the case of a for-loop? Thanks
Hi Harper, good question. The drawback to iterrows is that it is SLOW... like really slow. It's not that iterrows is bad (even though the way some people talk about it you would think it was the worst thing you could ever use).
Code that runs slow is not really a big deal if you have a small data set. In fact, there have been times where I have spent more time coming up with a vectorized solution than it would have taken to write and run an iterrows solution.
You have probably heard about code being Pythonic (or writing code in the Python style and using Python methods to solve problems). In the same way some code is not "Pythonic" using iterrows is not very "data scientistic."
I think the moral of the story is that using iterrows is fine if you only work with small datasets, are not going to reuse the code, and don't do anything computationally expensive. Some people would disagree with me on this point, but there is no sense spending 5 minutes optimizing code that you are only going to use once to get a 30 second speed increase. On the other hand, if you want to keep improving as a programmer / analyst / data scientist, I think we need to constantly be learning and implementing faster, better solutions - even if there is a bit of a learning curve.
Useful content nicely explained! Thank you for sharing.
Thanks!
Thanks for this great video! Would you be able to use one of these faster methods to loop over an array or dataframe in chunks of a specific number of rows, create an XML file, and then start the process again? What I am doing is building XML files from a large dataframe (it can be an array or series or whatever). The data is just one column. Each row is being put into the XML file as a value. I want to do 50K at a time, write the file, and then continue, with the last file not having to be exactly 50K but whatever is left. Thanks,
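One rough chunking sketch (the toy column and the XML-writing step are placeholders - swap in your own data and writer):

import pandas as pd

df = pd.DataFrame({'value': range(120_000)})  # stand-in for the real data
chunk_size = 50_000

for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]  # the last chunk is whatever is left
    # write this chunk out with your own XML builder;
    # pandas >= 1.3 also ships a built-in DataFrame.to_xml if its format suits you
    chunk.to_xml(f"part_{i}.xml", parser="etree")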
Great video!
Great video! While the lambda & numpy methods are time-efficient, how would we use conditions based on different columns with these methods? For example: I'm trying to create a sequence number based on other variables, using "k" as an incrementor when a condition is met. I'm not able to incorporate this with lambda/numpy, but I can easily do it with index/iterrows().
My Code:
c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
cumm = 0
df_temp['Seq'] = 0
for i, row in df_temp.iterrows():
    if c5[i] == True:
        df_temp['Seq'][i] = cumm
        cumm = cumm + 1
    else:
        df_temp['Seq'][i] = cumm
Thanks! I like this question. There are some situations where you need to iterate through a DataFrame (this might be one of them). I did some digging and I couldn't find anything that uses an incrementor with lambda or np.where, np.select. I'm going to keep looking into this, but for now this is the best solution I came up with.
Note: When you are testing if something is true and you have a Boolean (True/False) variable you don't need to use "== True".
Also, there is a shorthand for cumm = cumm + 1: you can instead write cumm += 1 (you probably knew this already, but just in case).
If you use df.index instead of df.iterrows() you should get a slight performance boost.
c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
cumm = 0
df_temp['Seq'] = 0
for i in df_temp.index:
    if c5[i]:  # removed == True
        df_temp.at[i, 'Seq'] = cumm  # .at avoids chained assignment
        cumm += 1  # was cumm = cumm + 1
    else:
        df_temp.at[i, 'Seq'] = cumm
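For this particular pattern there may also be a fully vectorized route - a rough sketch, assuming Seq really is just a running count of the matches seen in the rows above (continuing from the variables defined above):

# Seq[i] = number of rows above row i where c5 was True
df_temp['Seq'] = c5.cumsum().shift(1, fill_value=0).astype(int)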
@@ChartExplorers Exactly! About 90% of the time when I have to loop, it's always related to iterating over a dataframe & lambda/np just doesn't do the job. Thanks for looking it up, let me know if you come up with something.
Thanks a lot for your note. I'm aware of both of them but I'm intentionally using them as I have to deploy the code & it's easier to explain that way to the client.
@@ChartExplorers To add to this topic: let's assume you have one operation that requires 'iteration', and other operations that can use 'apply'. Is it better to piggyback all operations onto one loop, or is it better to use 'apply' or 'where' whenever possible? BTW: I like the presentation a lot, nicely done!
Hi, thank you so much for the video. I was working on a python simulation where I iterate through rows like in your first example and that takes me over 3 hours to run. Can't wait to try with your methods.
How would you go about dealing with a cell whose value depends on a cell from the previous iteration? Like A2 = 1 if B1 > 5?
I'm so curious how this will turn out for you. Please let me know.
I am interpreting "a cell at the previous iteration" to mean the cell above it? Is that correct?
If you are trying to compare to the previous row you can use the pandas shift method pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html
import numpy as np
import pandas as pd

# Practice dataframe
df = pd.DataFrame({"col1": [6, 2, 5, 13, 6],
"col2": [3, 2, 8, 6, 4],
"col3": [7, 2, 6, 9, 5]})
# Shift values by one
previous_values = df['col1'].shift(1, fill_value=0)
# Compare previous values to current value if the
# previous value was greater than 5 set it equal
# to 1 else set it equal to the value that it currently is
df['col1'] = np.where(previous_values > 5, 1, df['col1'])
# Example for entire dataframe not just a column
previous_values = df.shift(1)
new_values = np.where(previous_values > 5, 1, df)
df = pd.DataFrame(new_values, columns=df.columns)
There are a couple of things to note. When we use shift there will be a row of missing values. If we shift the rows up by one, the bottom row will be missing. If we shift the rows down by one, the top row will be missing. We can fill this in with whatever value is most appropriate, or leave it as NaN if your analysis allows NaN.
We are using np.where, which takes the condition we are evaluating (is the previous value greater than 5). Then we pass in the value we want the cell to be if the condition is true (in this case 1). Note, we also need to pass in the value we want the cell to be if the condition is not true. In this example I pass in the original cell value (do not change the value).
In your example you said A2 = 1 if B1 > 5. If you wanted to compare the current cell (A2) to the previous value of another column, you could just change the column we use when we create the previous_values variable.
# create previous values for col2 instead of col1
previous_values = df['col2'].shift(1, fill_value=0)
Let me know if I explained things well enough (these topics are more advanced). And please let me know if you get a boost in performance!
@@ChartExplorers Thank you so much for the detailed answer !! I love that channel, definitely subscribing !!
I just tried it with shift. Tell me if I am wrong, but I think in my example below the interdependency makes it impossible without a loop. There are some dependencies on values on rows above, some on values on the same row. And I feel like you can't fill a whole column without completing the rows one by one. I added a more detailed example below.
Please let me know if you see something wrong in my implementation of .shift. I am also gonna try using NumPy arrays and also with Numba for just in time compiling.
The table looks like this:
Columns: date, prices, qty, collateral, LTV, margin call, margin release, margin termination
Date, price, and qty are inputs.
Margin call, release, termination are booleans.
Cash collateral is 0 at first then its value derives from the previous line eg: if margin call at T-1 is 1: =price(t-1)*qty(t-1)-cash collateral(t-1)
LTV is 45% at first then some percentage of price and cash collateral (based on T, not T-1)
Margin calls, release, and termination are 0 or 1 based on some functions of cash collateral, price, LTV.
So before, I had something like this (it works but is slow - I am running that for a thousand rows and redoing it for 200 stocks on 3000 dates, so roughly 600M iterations - so far 4 hours):
df.loc[0, 'cash']=0
df.loc[0, 'LTV'] = 0.45
df.loc[0, 'margin call'] = 1 if df.loc[0, 'LTV'] > 0.5 else 0
...
for i in range(1, len(df)):
    if df.loc[i-1, 'margincall'] == 1:
        df.loc[i, 'cash'] = df.loc[i-1, 'LTV'] * ...
    elif ...
        ...
    df.loc[i, 'LTV'] = 100 - df.loc[i, 'cash'] / ...
Now I have something like this:
if df['margincall'].shift(1, fill_value=0) == 1:
    df['cash'] = df['LTV'].shift(1, fill_value=0.45) * ...
elif ...
    ...
df['LTV'] = 100 - df['cash'] / ...
Thanks again for your answer !! And your videos !!
Hi @V1ct0r 6l0r1eux, this may be a case where you need to loop. I have to head out to work and then I have a few other things I need to get done today. After that I will try to see if there is anything I can figure out. Would you be able to send me the first 50 rows or so of your data? If you are not allowed to for whatever reason, could you make a fake df with the columns you described above, fill it with some fake data, and send it my way? I want to make sure I am practicing on the best data possible. If you can do that, my email is bradonvalgardson@gmail.com.
@@ChartExplorers Thanks, looks like it might be that one mandatory looping case indeed. I made it work by switching to numpy arrays - about 10 times faster. I'm still dealing with a couple of issues to get numba working, which should cut the runtime by another 3-5x. It's fast enough for what I need it to do now, so don't spend time on this, but for the sake of curiosity I can send you the code tomorrow at your email. Thanks again for replying so fast and making those awesome videos.
Thanks for the video bro..
Comparing 2 different columns, I need to create a binary column. How do I do it?
Hi Subhankar. Thanks for the question. Are your columns numeric? If so you can just compare the columns directly df['binary'] = df['col1'] > df['col2'] (you can replace the > with any comparison operator). Will this work in your situation?
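A small sketch of that idea, with the True/False result converted to 1/0 (the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'col1': [3, 7, 2], 'col2': [5, 1, 2]})

# the comparison gives True/False; astype(int) turns it into 1/0
df['binary'] = (df['col1'] > df['col2']).astype(int)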
Hi Subhankar Roy, I made this video addressing your question ruclips.net/video/2xwto0MK_9U/видео.html
Excellent thanks a lot mate
You're very welcome!
very very useful..thanks
I'm glad you found it helpful!
Great explanation, thanks!
Thank you so much!!
Thanks! :)
You're Welcome :)
where's the lambda video!!??
Thank you so much
thank you!!!
Cool bro ❤
Awesome!
You just missed the itertuples...
dude... I thought you were going to actually show how to make changes to the data frame with loop, not to show which type of change is faster..... too bad.
Is that not what I do? Each example shows how to change the dataframe.
If you were thinking something more helpful I would love to get your feedback so I can create better content.
Magic
its just so painful
What's painful?
@@ChartExplorers Pandas data frames - I've been struggling with an assignment which requires me to use pandas dfs for over a week.
@@iskandarrasimoon734 Oh man! that stinks. Let me know if you have any specific questions. If I can I'd be happy to try and help you.
@@ChartExplorers Is there any way I could contact you?
@@iskandarrasimoon734 if you have discord discord.gg/EEajBNpeK9 or my email is bradonvalgardson@gmail.com
manufacturer = (Ford, Ford, Ford, Chevrolet, Chevrolet, Chevrolet, Hyundai, Kia, nan, nan, nan, nan, nan, nan, nan)
model = (Focus, Fiesta, F150, Captiva, Bolt, Silverado1500, Tucson, Sorento, F150, Fiesta, F150, Silverado1500, Captiva, F150, Focus, Tucson)

Here is a sample of the dataset I'm working with. The code I posted does a good job but it's very slow. The model Focus has the manufacturer Ford at first, but afterward it has nan, so the code scans back and picks the name from previously declared observations.
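One possible vectorized sketch for that fill-in (assuming each model always maps to a single manufacturer that appears somewhere in the data):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['Ford', 'Ford', 'Chevrolet', np.nan, np.nan],
    'model': ['Focus', 'F150', 'Captiva', 'F150', 'Focus'],
})

# build a model -> manufacturer lookup from the rows that do have a value,
# then fill the missing manufacturers by mapping the model column through it
lookup = (df.dropna(subset=['manufacturer'])
            .drop_duplicates('model')
            .set_index('model')['manufacturer'])
df['manufacturer'] = df['manufacturer'].fillna(df['model'].map(lookup))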