
python - Efficient conditional selection with masks in very large dataframe

I have a dataframe with some 2 million rows like this:

                    dt   num
0  2019-05-12 10:17:00   135
1  2018-01-16 21:32:00     5
2  2017-11-30 22:29:00   135
3  2017-10-05 16:59:00    19
4  2017-08-07 05:26:00     5
5  2017-06-12 17:47:00    18

For each distinct value in column 'num', I need to find the corresponding minimum value of column 'dt'.

I am doing it with a list comprehension that builds a boolean mask for each value and then takes the minimum:

[(num_i, df[df.num == num_i].dt.min()) for num_i in set(df.num)]

It works, but it takes a really long time. Is there a less time-consuming way to solve it?


Edit: thanks to all (@It_is_Chris, @papke, @paul-brennan)! I was planning to make a time comparison, but the groupby solution provided runs in seconds versus close to one hour...



1 Answer


@It_is_Chris was exactly right: a plain groupby is the fast way to do this.
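For reference, the groupby approach being referred to is presumably along these lines (a sketch, not the exact code from the comment):

min_dt_per_num = df.groupby('num')['dt'].min()   # Series: min dt per num
# or, as (num, dt) rows matching the list-comprehension output:
min_dt_per_num = df.groupby('num')['dt'].min().reset_index()

If you have more cores available, you can also parallelize the job with the groupby-apply trick: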

import pandas as pd
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    # Run func on each group in a separate worker process,
    # then stitch the per-group results back together.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)

So pass df.groupby(df['num']) in as dfGrouped, and define func to do whatever per-group computation you need.
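For the question's task, a hypothetical per-group function might look like the sketch below (group_min is my name, not from the thread). Note that Pool.map needs a picklable top-level function, so a lambda won't work, and on platforms that spawn worker processes the call should run under an if __name__ == '__main__' guard:

def group_min(group):
    # One-row frame: this group's num and its earliest dt.
    return pd.DataFrame({'num': [group['num'].iloc[0]],
                         'dt': [group['dt'].min()]})

if __name__ == '__main__':
    result = applyParallel(df.groupby(df['num']), group_min)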

