Loop / Iterate over pandas DataFrame (2020)

  • Published: 26 Dec 2024

Comments • 136

  • @arnoldrosielle5290 3 years ago +7

    Hi Braden - I've been wrestling with how to speed up some conditional changes to a large 1M-row x 100-column DataFrame, so that the entire df is transformed to True/False according to the condition. The condition is a function of a quantile for each column. I had to use iteritems so I could go column-wise. I did not need df.index but used np.where. It took 2.89 secs for the 1M x 100 DataFrame (it seemed more like 30 secs), but still, wow! Thank you for your video - it's the best.
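
    A minimal sketch of the column-wise quantile transform described above (the shape and the 0.9 threshold are assumptions for illustration, not from the comment):
    import numpy as np
    import pandas as pd

    # Assumed example data: 100k rows x 100 columns of random floats
    df = pd.DataFrame(np.random.rand(100_000, 100))

    # df.quantile(0.9) returns one quantile per column, and the comparison
    # broadcasts column-wise, so the whole frame becomes True/False at once
    mask = df > df.quantile(0.9)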

  • @soreanustefanalexandru8063 9 months ago

    Thank you a lot. I have two files with huge amounts of data to compare against one another, and it has taken me over 2 hours to loop through and do all the checks. Tomorrow I'll try one of your techniques.

    • @rshah645 9 months ago

      What was the result?

  • @marcellosoto 1 year ago +1

    Savage! Best loop explanation on YouTube!

  • @chapmansbg 2 years ago

    Not just improvements for large data sets, but also for operations that need to happen many times per second.

  • @ap5672 3 years ago +1

    May I clarify: at 1:57, is the returned row data a pandas Series or a dictionary? Thank you. Fantastic guide.

  • @farheensulthana7324 2 years ago

    How do I select only the first row of the DataFrame where a condition is satisfied? E.g. if the condition is also satisfied on the second row, I need to ignore that. In your example (convert to 1 if the value is greater than 0.5), if the condition is satisfied on later rows too, those also need to be ignored; select only the first row where the condition is satisfied.

  • @ItsCloudHub 11 months ago

    Very good tutorial. If we want to access the previous row inside an apply function, how can we do that? If I have an index defined, can I do row.index - 1? And if I don't have an index defined, how do we know the current location of the current row?

  • @oswaldocastro9600 3 years ago +2

    Finally a clear and concise explanation. Many thanks!!
    By the way, if I need to make a calculation between two sequential rows, like A[i] + B[i] / C[i+1] * D[i+1], how could this be done with the fast methods?
    Thanks again

    • @ChartExplorers 3 years ago +4

      Thanks! This is a great question. You should be able to accomplish this with the pandas shift() method. I'll work on an example I can share with you tomorrow! 😀

    • @damon1236 3 years ago

      @@ChartExplorers Eagerly waiting for it.

    • @ChartExplorers 3 years ago +3

      @@damon1236 Thanks for the reminder! Alright, here is what you can do for a vectorized solution:
      df['new_col'] = (df['A'] + df['B']) / (df['C'].shift(-1) * df['D'].shift(-1))
      The first part:
      (df['A'] + df['B'])
      is straightforward: it adds the two columns together.
      The second part:
      (df['C'].shift(-1) * df['D'].shift(-1))
      uses the shift method to move all the values in each column up one (I know it is weird to use -1 to shift up, haha).
      We take the sum from the first part and divide it by the second part to get our answer :)
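
      For anyone trying the reply above, here is the same idea as a small self-contained sketch (the column values are made up for illustration):
      import pandas as pd

      # Made-up values for the four columns in the question
      df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                         'C': [7, 8, 9], 'D': [1, 2, 3]})

      # shift(-1) lines row i up with row i+1 of C and D; the last
      # row has no "next" row, so new_col is NaN there
      df['new_col'] = (df['A'] + df['B']) / (df['C'].shift(-1) * df['D'].shift(-1))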

    • @oswaldocastro9600 3 years ago +2

      Wow... thanks!!! I'll stay tuned here for many other little treasures... think about JSON manipulation with very nested structures next time please 😉

    • @ChartExplorers 3 years ago +1

      @@oswaldocastro9600 Good idea! Thanks for your support :)

  • @andreamarino95 9 months ago

    I was wondering what the most "pandas" (and fastest) way is to do the following.
    Each row of my table is associated with a patient, while each column is a certain connectivity measure (200 of them). For each column n with label tag[n] I compute the average avg[n] = df[tag[n]].mean() and the standard deviation std[n] = sqrt(df[tag[n]].var()).
    I would like to run the following pseudocode:
    for row in df:
        for n in range(number of columns):
            row[n] = (row[n] - avg[n]) / std[n]
    How would you do that?
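
    The pseudocode above is a per-column z-score, which pandas can broadcast without any explicit loop; a minimal sketch (the column names are made up):
    import pandas as pd

    # Assumed example: rows are patients, columns are connectivity measures
    df = pd.DataFrame({'m1': [1.0, 2.0, 3.0], 'm2': [10.0, 20.0, 30.0]})

    # Subtraction and division broadcast per column, so each column is
    # centered by its own mean and scaled by its own (sample) std
    df = (df - df.mean()) / df.std()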

  • @Borzacchinni 3 years ago +1

    Great video. Helped a bunch.
    Any idea how to apply these methods to more columns with changing output, writing the results back into the same column?

    • @ChartExplorers 3 years ago

      Thanks. Possibly, could you provide me with an example? If it is easier you could join my discord server here (so our discussion is not confined to small comment boxes). discord.gg/EEajBNpeK9

  • @juanbetancourt5106 3 years ago

    Finally I get a good reason to master numpy, thank you so much!

    • @ChartExplorers 3 years ago +1

      It's taken me a long time to realize how amazing numpy is. Now, if I know how to do something in numpy, I will!

  • @syedronaldo4758 10 months ago

    I'm trying to flatten a JSON using pandas. I'm trying to create a generic script which can continuously iterate and flatten the dicts and lists, but I'm still struggling with it. If you can help in this case, can you give me an idea of how I can achieve this?

  • @shubhamaggarwal3676 3 years ago

    How do I convert pandas & numpy code into PySpark? Or, how do I boost the performance of pandas and numpy probability or matrix calculations?

  • @jhillyt 2 years ago +1

    Great stuff! What approach would you recommend when not comparing to a static value, but to another DataFrame, where I just want to test df1['col'] == df2['col']?
    Thanks again for sharing the knowledge...
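
    This is not covered in the video, but one common pattern for that comparison (assuming the two frames have the same length) is a sketch like this:
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'col': [1, 2, 3]})
    df2 = pd.DataFrame({'col': [1, 0, 3]})

    # .values sidesteps index alignment if the two frames have different indexes
    df1['match'] = np.where(df1['col'].values == df2['col'].values, 1, 0)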

  • @prashanthardikar9895 3 years ago +1

    I want to write a large dataframe (35K rows, 1200 columns) to csv. The data is string and numeric. It takes 3 min to write. Is there any way to make it faster?

    • @ChartExplorers 3 years ago

      This is a good question. I'll need to look into this in more depth... Here are a couple of options that should speed up the output process:
      1) Ensure your pandas is up to date
      2) Try using numpy's savetxt: numpy.org/doc/stable/reference/generated/numpy.savetxt.html
      3) Consider using pickle instead of csv (this might not be possible depending on your situation). It works much like csv; note that to_pickle is a DataFrame method:
      df.to_pickle('file_name.pkl')
      and
      pd.read_pickle('file_name.pkl')
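
      A quick sketch of that pickle round trip (the file name is just an example):
      import pandas as pd

      df = pd.DataFrame({'a': [1, 2, 3]})
      df.to_pickle('file_name.pkl')          # write: to_pickle is a DataFrame method
      df2 = pd.read_pickle('file_name.pkl')  # read back: read_pickle is a top-level function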

  • @neemiahluis6449 2 years ago

    I noticed that the focus was on specific columns. What if I want results from all the columns?

  • @malikhan0123 4 years ago +1

    Excellent, bro. Can you please help me with my situation? I have to first split a column on the ',' separator, then check each string: if the string is an A type it should be placed in column A, else if it is a B type it should be set in column B.

    • @ChartExplorers 4 years ago +1

      Thanks ALI. Could you give me some examples of what the first 5 rows look like from the column you need to split and filter.
      For example:
      row0: 'a, b'
      row1: ' b,a'
      row2: 'b,b'
      row3: 'a,b'
      row4: 'a,b,c'

    • @malikhan0123 4 years ago

      @@ChartExplorers Thanks for the reply. Below is an example; ColA and ColB are auto-generated based on Col1:
      Col1        ColA     ColB
      Row1: 29A,45B      29A      45B
      Row2: 1,33A,44B    33A      44B
      Row3: 12A,35A,2B   12A,35A  2B
      ...

    • @ChartExplorers 4 years ago +1

      @@malikhan0123 Try this out and please let me know how it works. If you have a very large df and want to optimize this code further, or if you want further clarification on what the code is doing, jump onto the Discord server for my channel and we can discuss things further.
      discord.gg/mCRwew6n
      import pandas as pd
      import re

      # Create a function to get the A or B values.
      # If there are more than 2 A's or B's a list will
      # be returned; get the cell_value as a string rather
      # than a list.
      def get_a_or_b(a_or_b, x):
          r = re.compile(f".*{a_or_b}", re.I)  # re.I means it is case insensitive
          list_ab = list(filter(r.match, x))
          cell_value = ','.join([str(elem) for elem in list_ab])
          return cell_value

      # create practice DF
      df = pd.DataFrame({'need_to_split': ['29A,45B', '1,33A,44B', '12A,35A,2B']})
      # split the column on ,
      df['split'] = df['need_to_split'].str.split(',')
      # Create col_a by passing A to the function we created
      df['col_a'] = df['split'].apply(lambda x: get_a_or_b('A', x))
      # Create col_b by passing B to the function we created
      df['col_b'] = df['split'].apply(lambda x: get_a_or_b('B', x))

    • @malikhan0123 4 years ago +1

      @@ChartExplorers Thanks a lot, Brayn. It's working on my data set; I just need to clean the data before feeding it in, as this throws an error if a cell is empty. I haven't used the re library in the past; can you make a video on that and explain this code? Thank you once again.

    • @ChartExplorers 4 years ago +1

      @@malikhan0123 I have written it down to create a video on regular expressions using the re library which I should be able to get out in the next 10 days or so. I'll either make a video or do a write up that explains what the code is doing in the example I gave you. Happy Analyzing!

  • @mikeyg8631 3 years ago

    Very cool showing the performance differences, waiting for your Lambda video

  • @danielflorea3001 2 years ago +1

    Super comparison.

  • @qasimzaheer1140 2 years ago

    How did you find out/learn which code runs faster? I hope my question makes sense.

  • @khatri2046 3 years ago

    I want a new column, say B, in which I want the values of the next row of A.
    E.g. if column A has 1, 2, 3, 4, 5, I will get 2, 3, 4, 5 in column B. Can you help me with that?
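
    This is not answered in the thread, but a minimal sketch of one way to do it with shift:
    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
    # shift(-1) pulls each next row's value up; the last row becomes NaN
    df['B'] = df['A'].shift(-1)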

  • @AshisRaj 3 years ago

    How can we filter the first non-null value in every row of a dataframe? Pls reply

  • @DJBONEZDIEZ 3 years ago +1

    Can you exclude the 'else' from the list comprehension? I can't get it to work without the else part.

    • @ChartExplorers 3 years ago

      You can, check out example 1 on this website. If that doesn't work you can send me your list comprehension and I'll see if I can figure it out.

    • @DJBONEZDIEZ 3 years ago

      @@ChartExplorers thanks, which website?

    • @ChartExplorers 3 years ago +1

      @@DJBONEZDIEZ haha, well that's awkward I didn't paste in the website!
      stackoverflow.com/questions/12709062/python-lambda-with-if-but-without-else
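
      For reference, a tiny sketch of the pattern from that link: without an else the if filters, with an else it becomes a conditional expression:
      nums = [0.2, 0.7, 0.4, 0.9]
      big = [x for x in nums if x > 0.5]           # no else: the if filters at the end
      flags = [1 if x > 0.5 else 0 for x in nums]  # with else: the if moves to the front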

  • @tahiliani22 1 year ago

    You are making these time comparisons on different columns. Although the times may be similar, I feel showing the calculation comparisons on one particular column would have been a better way.

  • @MADgamer9212 3 years ago

    Can you do multiple ifs in the last one?

  • @chuckt8246 3 years ago

    Thank you!
    I've been struggling for weeks with filling in a recommendation column from a check on three other columns! Your iloc_example was the piece that I was missing.

  • @kolinsin23 3 years ago +2

    Thanks very much, just saved me a huge amount of time analysing :D

  • @triumphmusic7495 2 years ago

    I need help in Python. Can we connect?

  • @mustqueembutt9001 2 years ago

    How can we update rows dynamically across all columns, rather than mentioning the name of each column? I have the same scenario, but the number of columns is 4000+.

  • @SuperVarunsharma 3 years ago +1

    What if, instead of a particular column (A or B, ...), I want to loop over each row to find the largest value in that row?

    • @ChartExplorers 3 years ago

      In your example we could use a built-in method (max). I think you will find this video particularly helpful ruclips.net/video/n2ff3rcLgnE/видео.html
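
      For reference, the row-wise built-ins look like this (a minimal sketch):
      import pandas as pd

      df = pd.DataFrame({'A': [1, 5], 'B': [4, 2]})
      df['which'] = df[['A', 'B']].idxmax(axis=1)   # column name of the max
      df['row_max'] = df[['A', 'B']].max(axis=1)    # largest value in each row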

  • @Alpacastan21m 3 years ago +1

    Hi, I have a .csv file with a text column and multiple rows. I wish to execute a function on each row of the text and have a resultant column as well. Can I use the iteration methods shown in the video to accomplish this?

    • @ChartExplorers 3 years ago

      Hi Blessen Koshy, it depends on what method you are using and what function you are applying to your text column. Can you provide more details?

    • @Alpacastan21m 3 years ago

      @@ChartExplorers thanks for the quick response. I have a column with rows consisting of text (say 200-500 chars on average). I'd like to manipulate the data (executing a summarize function, appending text, removing text, etc.) and create a new column to compare changes. Is this straightforward? Please let me know if you'd need to take this conversation to a private communication thread.

  • @xingtinggong344 3 years ago

    This was a great video. Thanks for the clear demonstration of all these different options and the pros/cons.

    • @ChartExplorers 3 years ago

      Thank you! I'm glad you found it helpful!

    • @ultimategamer7859 3 years ago +1

      @@ChartExplorers ur stuff is acc quality content. Props to you bro

    • @ChartExplorers 3 years ago

      @@ultimategamer7859 Thanks!!

  • @Zenoandturtle 1 year ago

    Great presentation. Exactly what I was after.

  • @TiR_official 2 years ago +1

    Hey Brayden, super helpful video! I noticed though, after running some tests, that using "np.where" does not like to work when using int values in the dataframe. I found that initializing a numpy array beforehand and then using that instead of directly calling df['column_name'].values in the "np.where" method works, but I'm not exactly sure why. If you have a reasoning behind this I'd love to hear it!

  • @immanuelsuleiman7550 2 years ago

    I will be sure to use np.where more often from now on
    Thank you

  • @iamniladri 4 years ago +1

    Hi! Thanks for such a nice explanation. Can you please make a video on how to iterate over each row and find the max/min value for each row across multiple columns? Suppose I have a dataframe showing Jan - Dec (in rows) data for 5 scenarios (in columns). Now I would like to know which scenario has the max/min value for each row. Thanks in advance

    • @ChartExplorers 4 years ago +1

      Hi Niladri, Thanks for the suggestion! I'll start working on that video right away. I'll post it here in the comments when I'm finished.

    • @ChartExplorers 4 years ago +1

      Here is a video I created on how to get the min & max values of each row. It also discusses how to get the index location of those values as well. ruclips.net/video/n2ff3rcLgnE/видео.html

    • @iamniladri 4 years ago

      @@ChartExplorers Awesome! And THANK YOU for considering my request. Wish you all the very best 👍

  • @derekhan8396 2 years ago

    How can I iterate through a column?

  • @tomparatube6506 4 years ago +1

    How about a string? Which method will be fastest?

    • @ChartExplorers 4 years ago +1

      I'm glad you brought this up. This is something that I should have discussed in the video. If possible, use pandas' built-in string methods (.str). When this is not possible, try to use np.where() or np.select() (these methods still work with strings). Most other things you should be able to cover with .apply and a lambda function. And if none of those work, use .iloc, but you should never have to use df.iterrows, even with strings.
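
      A small sketch of that hierarchy on a string column (the example strings are made up):
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({'name': ['apple pie', 'banana', 'cherry']})

      # 1) built-in vectorized string methods
      df['has_a'] = df['name'].str.contains('a')

      # 2) np.where / np.select also accept string conditions and outputs
      df['label'] = np.where(df['name'].str.startswith('a'), 'A-word', 'other')

      # 3) fall back to apply + a lambda for anything more complex
      df['upper'] = df['name'].apply(lambda s: s.upper())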

    • @tomparatube6506 4 years ago

      @@ChartExplorers Thanks for answering so promptly. This has been driving me crazy. Turns out there's been a "standoff" between the Numpy and other camps. One of the posts in this (warning: lengthy) discussion at StackOverflow explains:
      stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur
      "There is a disagreement between Numpy and native Python on what should happen when you compare strings to Numpy's numeric types. Notice the left operand is Python's turf, a primitive string, and the middle operation is Python's turf, but the right operand is Numpy's turf. Should you return a Python-style scalar or a Numpy-style ndarray of Boolean? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.
      Should it be elementwise comparison, or a scalar if the item exists in the array?
      If your code or library is using the in or == operators to compare a Python string to Numpy ndarrays, they aren't compatible, so when you try it, it returns a scalar, but only for now. The warning indicates that in the future this behavior might change, so your code pukes all over the carpet if Python/Numpy decide to adopt the Numpy style."
      Submitted bug reports (Numpy and Python are in a standoff; for now the operation returns a scalar, but in the future it may change):
      github.com/numpy/numpy/issues/6784
      github.com/pandas-dev/pandas/issues/7830

  • @dabombguitars6897 3 years ago +1

    EXTREMELY HELPFUL!

  • @wheatonrecurrence9525 3 years ago

    This changed my life. Thanks!

  • @ronaldsanchez2506 2 years ago

    Thank you for this, very helpful. I'm trying to apply this in a particular way. I have two data frames: A and B. Data frame A contains 37 rows and 1 column, called Indicator. Data frame B has many rows and 2 columns: ID and Indicator. What I want to do is iterate over data frame A's Indicator column and count the number of IDs in data frame B that match that indicator. I was successful in iterating and doing the nested group by; however, the group by value does not reset and only returns the very first iteration for all 37 indicators. What's missing?

  • @stanTrX 2 months ago

    Thanks. Can you explain it once again for beginners, please?

  • @KarlosEPM 3 years ago

    This was enlightening. Many thanks!

  • @fraggotmode 3 years ago

    Very cool, thanks for taking the time to make this.

  • @cairun 3 years ago +1

    You are awesome! Such useful content!

  • @farouksouissi3421 4 years ago +1

    Thanks, very useful. Please can you talk about geopandas and spatial data?

    • @ChartExplorers 4 years ago

      I'm glad you found this useful. Geospatial data is a little out of my wheelhouse, but I'll see what I can do 😀

    • @farouksouissi3421 4 years ago +1

      @@ChartExplorers Thanks a lot for your interaction, and good continuation.

  • @stephaniakossman2923 3 years ago +1

    Great video, thanks!

  • @imad_uddin 3 years ago +1

    That's amazing thanks!

  • @harperguo373 3 years ago

    I'm just wondering what the drawbacks are. Why is it mentioned on SO that "iterrows should only be used in very, very specific cases"? (stackoverflow.com/a/24871316/11901732) Is it appropriate to use in a sequential search, such as the case of a for-loop? Thanks

    • @ChartExplorers 3 years ago +1

      Hi Harper, good question. The drawback to iterrows is that it is SLOW... like, really slow. It's not that iterrows is bad (even though, the way some people talk about it, you would think it was the worst thing you could ever use).
      Code that runs slowly is not really a big deal if you have a small data set. In fact, there have been times where I have spent more time coming up with a vectorized solution than the time it would have taken to write and run an iterrows solution.
      You have probably heard about code being Pythonic (written in the Python style, using Python methods to solve problems). In the same way that some code is not "Pythonic," using iterrows is not very "data scientistic."
      I think the moral of the story is that using iterrows is fine if you only work with small datasets, are not going to reuse the code, and don't do anything computationally expensive. Some people would disagree with me on this point, but there is no sense spending 5 minutes optimizing code that you are only going to use once to get a 30-second speed increase. On the other hand, if you want to keep improving as a programmer / analyst / data scientist, I think we need to constantly be learning and implementing faster, better solutions - even if there is a bit of a learning curve.
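
      To make the gap concrete, here is a tiny benchmark sketch (timings will vary by machine; the column and sizes are made up):
      import time
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({'a': np.random.rand(100_000)})

      start = time.perf_counter()
      out = [row['a'] * 2 for _, row in df.iterrows()]  # one Python-level step per row
      t_loop = time.perf_counter() - start

      start = time.perf_counter()
      res = df['a'] * 2                                 # a single vectorized operation
      t_vec = time.perf_counter() - start

      print(f"iterrows: {t_loop:.3f}s, vectorized: {t_vec:.5f}s")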

  • @user-wc7em8kf9d 4 years ago +1

    Useful content nicely explained! Thank you for sharing.

  • @TheCDLSchool 3 years ago

    Thanks for this great video! Would you be able to use one of these faster methods to loop over an array or dataframe in chunks of a specific # of rows, create an XML file, and then start the process again? What I am doing is building XML files from a large dataframe (it can be an array or series or whatever). The data is just one column. Each row is being put into the XML file as a value. I want to do 50K at a time, write the file, and then continue, with the last file not having exactly 50K but whatever is left. Thanks,

  • @kennethstephani692 1 year ago

    Great video!

  • @akshitmiglani5419 4 years ago +1

    Great video! The lambda & numpy methods are time efficient, but how would we use conditions based on different columns in these methods? For example: I'm trying to create a sequence number based on other variables, using "k" as an incrementor when a condition is met. I'm not able to incorporate this in lambda/numpy, but I can easily do it with index/iterrows().
    My code:
    c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
    cumm = 0
    df_temp['Seq'] = 0
    for i, row in df_temp.iterrows():
        if c5[i] == True:
            df_temp['Seq'][i] = cumm
            cumm = cumm + 1
        else:
            df_temp['Seq'][i] = cumm

    • @ChartExplorers 4 years ago +1

      Thanks! I like this question. There are some situations where you need to iterate through a DataFrame (this might be one of them). I did some digging and I couldn't find anything that uses an incrementor with lambda or np.where / np.select. I'm going to keep looking into this, but for now this is the best solution I came up with.
      Note: when you are testing if something is true and you have a Boolean (True/False) variable, you don't need to use "== True".
      Also, there is a shorthand for cumm = cumm + 1: you can instead use cumm += 1 (you probably knew this already, but just in case).
      If you use df.index instead of df.iterrows() you should get a slight performance boost.
      c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
      cumm = 0
      df_temp['Seq'] = 0
      for i in df_temp.index:
          if c5[i]:  # Removed == True
              df_temp['Seq'].iat[i] = cumm
              cumm += 1  # was cumm = cumm + 1
          else:
              df_temp['Seq'].iat[i] = cumm

    • @akshitmiglani5419 4 years ago +2

      @@ChartExplorers Exactly! About 90% of the time when I have to loop, it's related to iterating over a dataframe, and lambda/np just doesn't do the job. Thanks for looking it up; let me know if you come up with something.
      Thanks a lot for your note. I'm aware of both of them, but I'm intentionally using them as I have to deploy the code & it's easier to explain that way to the client.

    • @fredericnicholson80 2 years ago

      @@ChartExplorers To add to this topic: let's assume you have one operation that requires 'iteration', and other operations that can use 'apply'. Is it better to piggyback all operations onto one loop, or is it better to use 'apply' or 'where' whenever possible? BTW: I like the presentation a lot, nicely done!
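
      For what it's worth, the incrementor in this thread can be vectorized, because Seq on each row just counts how many matches occurred strictly before it; a hedged sketch (the example strings are made up):
      import pandas as pd

      df_temp = pd.DataFrame({'DESCRIPTION': ['ISIN x', 'foo', 'IS INf y', 'bar']})
      c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')

      # Seq[i] = number of True values before row i = a shifted cumulative sum
      df_temp['Seq'] = c5.cumsum().shift(1, fill_value=0)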

  • @Vctrlreux 3 years ago

    Hi, thank you so much for the video. I was working on a Python simulation where I iterate through rows like in your first example, and it takes over 3 hours to run. Can't wait to try your methods.
    How would you go about dealing with a cell whose value depends on a cell from the previous iteration, like A2 = 1 if B1 > 5?

    • @ChartExplorers 3 years ago +1

      I'm so curious how this will turn out for you. Please let me know.
      I am interpreting "a cell at the previous iteration" to mean the cell above it? Is that correct?
      If you are trying to compare to the previous row you can use the pandas shift method: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html
      # Practice dataframe
      df = pd.DataFrame({"col1": [6, 2, 5, 13, 6],
                         "col2": [3, 2, 8, 6, 4],
                         "col3": [7, 2, 6, 9, 5]})
      # Shift values by one
      previous_values = df['col1'].shift(1, fill_value=0)
      # Compare previous values to the current value: if the
      # previous value was greater than 5, set the cell equal
      # to 1, else keep the value it currently has
      df['col1'] = np.where(previous_values > 5, 1, df['col1'])
      # Example for the entire dataframe, not just a column
      previous_values = df.shift(1)
      new_values = np.where(previous_values > 5, 1, df)
      df = pd.DataFrame(new_values, columns=df.columns)
      There are a couple of things to note. When we use shift there will be a row of missing values. If we shift the rows up by one, the bottom row will be missing. If we shift the rows down by one, the top row will be missing. We can fill this in with whatever value is most appropriate, or leave it as NaN if your analysis allows NaN.
      We are using np.where, which takes the condition we are evaluating (is the previous value greater than 5?), then the value we want the cell to be if the condition is true (in this case 1). Note, we also need to pass in the value we want the cell to be if the condition is not true. In this example I pass in the original cell value (do not change the value).
      In your example you said A2 = 1 if B1 > 5. If you wanted to compare the current cell (A2) to the previous value of another column, you could just change the column we use when we create the previous_values variable.
      # create previous values for col2 instead of col1
      previous_values = df['col2'].shift(1, fill_value=0)
      Let me know if I explained things well enough (these topics are more advanced). And please let me know if you get a boost in performance!

    • @Vctrlreux 3 years ago +2

      @@ChartExplorers Thank you so much for the detailed answer!! I love this channel, definitely subscribing!!
      I just tried it with shift. Tell me if I am wrong, but I think in my example below the interdependency makes it impossible without a loop. There are some dependencies on values in the rows above, and some on values in the same row. And I feel like you can't fill a whole column without completing the rows one by one. I added a more detailed example below.
      Please let me know if you see something wrong in my implementation of .shift. I am also going to try using NumPy arrays, and Numba for just-in-time compiling.
      The table looks like this:
      Columns: date, prices, qty, collateral, LTV, margin call, margin release, margin termination
      Date, price, and qty are inputs.
      Margin call, release, and termination are booleans.
      Cash collateral is 0 at first, then its value derives from the previous line, e.g. if margin call at T-1 is 1: = price(t-1) * qty(t-1) - cash collateral(t-1)
      LTV is 45% at first, then some percentage of price and cash collateral (based on T, not T-1).
      Margin call, release, and termination are 0 or 1 based on some functions of cash collateral, price, and LTV.
      So before I had something like this (works but slow - I am running it for a thousand rows and redoing that for 200 stocks on 3000 dates, so around 600M iterations; so far 4 hours):
      df.loc[0, 'cash'] = 0
      df.loc[0, 'LTV'] = 0.45
      df.loc[0, 'margin call'] = 1 if df.loc[0, 'LTV'] > 0.5 else 0
      ...
      for i in range(1, len(df)):
          if df.loc[i-1, 'margincall'] == 1:
              df.loc[i, 'cash'] = df.loc[i-1, 'LTV'] * ...
          elif ...
          ...
          df.loc[i, 'LTV'] = 100 - df.loc[i, 'cash'] / ...
      Now I have something like this:
      if df['margincall'].shift(1, fill_value=0) == 1:
          df['cash'] = df['LTV'].shift(1, fill_value=0.45) * ...
      elif ...
      ...
      df['LTV'] = 100 - df['cash'] / ...
      Thanks again for your answer!! And your videos!!

    • @ChartExplorers 3 years ago +1

      Hi @V1ct0r 6l0r1eux, this may be a case where you need to loop. I have to head out to work and then I have a few other things I need to get done today; after that I will try to see if there is anything I can figure out. Would you be able to send me the first 50 rows or so of your data? If you are not allowed to for whatever reason, could you make a fake df with the columns you described above, with some fake data in it, and send it my way? I want to make sure I am practicing on the best data possible. If you can do that, my email is bradonvalgardson@gmail.com.

    • @Vctrlreux 3 years ago

      @@ChartExplorers Thanks, looks like it might be that one mandatory looping case indeed. I made it work by switching to numpy arrays - about 10 times faster. Still dealing with a couple of issues to use numba, which should cut the load by another 3-5 times. It's fast enough for what I need it to do now, so don't spend time on this, but for the sake of curiosity I can send you the code tomorrow at your email. Thanks again for replying so fast and making those awesome videos.

  • @dhirajp4677 4 years ago +1

    Thanks for the video bro..

  • @movie_emotion6010 4 years ago +1

    Comparing 2 different columns, I need to create a binary column. How do I do it?

    • @ChartExplorers 4 years ago

      Hi Subhankar. Thanks for the question. Are your columns numeric? If so you can just compare the columns directly df['binary'] = df['col1'] > df['col2'] (you can replace the > with any comparison operator). Will this work in your situation?

    • @ChartExplorers 4 years ago

      Hi Subhankar Roy, I made this video addressing your question ruclips.net/video/2xwto0MK_9U/видео.html

  • @marco_6145 4 years ago +1

    Excellent thanks a lot mate

  • @mrmuranga 4 years ago +1

    Very, very useful... thanks!

  • @joegalbraith1585 4 years ago

    Great explanation, thanks!

  • @jongcheulkim7284 3 years ago

    Thank you so much!!

  • @מוגוגוגו 3 years ago +2

    Thanks! :)

  • @aaditt8120 1 year ago

    where's the lambda video!!??

  • @emanalshoala6004 1 year ago

    Thank you so much

  • @thuy7671 3 years ago

    thank you!!!

  • @benravenhill484 3 months ago

    Cool bro ❤

  • @rshah645 9 months ago

    Awesome!

  • @alipiofernandes1 1 year ago

    You just missed the itertuples...

  • @luisweber4399 3 years ago +1

    Dude... I thought you were going to actually show how to make changes to the data frame with a loop, not show which type of change is faster..... too bad.

    • @ChartExplorers 3 years ago

      Is that not what I do? Each example shows how to change the dataframe.

    • @ChartExplorers 3 years ago

      If you were expecting something different, I would love to get your feedback so I can create better content.

  • @ramishnishah3411 3 years ago

    Magic

  • @iskandarrasimoon734 3 years ago

    It's just so painful.

    • @ChartExplorers 3 years ago

      What's painful?

    • @iskandarrasimoon734 3 years ago

      @@ChartExplorers Pandas data frames. I've been struggling with an assignment which requires me to use pandas dfs for over a week.

    • @ChartExplorers 3 years ago

      @@iskandarrasimoon734 Oh man, that stinks! Let me know if you have any specific questions. If I can, I'd be happy to try and help you.

    • @iskandarrasimoon734 3 years ago

      @@ChartExplorers Is there any way I could contact you?

    • @ChartExplorers 3 years ago

      @@iskandarrasimoon734 if you have discord discord.gg/EEajBNpeK9 or my email is bradonvalgardson@gmail.com

  • @andersolsson4919 3 years ago

    manufacturer = (Ford, Ford, Ford, Chevrolet, Chevrolet, Chevrolet, Hyundai, Kia, nan, nan, nan, nan, nan, nan, nan)
    model = (Focus, Fiesta, F150, Captiva, Bolt, Silverado1500, Tucson, Sorento, F150, Fiesta, F150, Silverado1500, Captiva, F150, Focus, Tucson)
    ----
    Here is a sample of the dataset I'm working with. The code I posted does a good job, but it's very slow. The model Focus has the manufacturer Ford at first, but afterward it has nan, so the code scans and picks the name from previously declared observations.
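
    One vectorized way to do the lookup described above (the column names are taken from the sample; the map-based approach is an assumption, not the commenter's code):
    import numpy as np
    import pandas as pd

    # Tiny stand-in for the sample data above
    df = pd.DataFrame({
        'manufacturer': ['Ford', 'Chevrolet', np.nan, np.nan],
        'model':        ['Focus', 'Bolt', 'Focus', 'Bolt'],
    })

    # Build a model -> manufacturer lookup from rows where it is known,
    # then fill the missing values from that lookup
    known = df.dropna(subset=['manufacturer'])
    mapping = dict(zip(known['model'], known['manufacturer']))
    df['manufacturer'] = df['manufacturer'].fillna(df['model'].map(mapping))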