apache spark - Is there a way to slice dataframe based on index in pyspark?

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

In Python or R, there are ways to slice a DataFrame by index.

For example, in pandas:

df.iloc[5:10,:]

Is there a similar way in PySpark to slice data based on the position of rows?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:39:47+0000

Short Answer

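If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between. Note that between is inclusive on both bounds, so the call below keeps ids 5 through 10, whereas the pandas slice 5:10 excludes position 10: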

from pyspark.sql.functions import col
df.where(col("id").between(5, 10))

If you don't already have an index column, you can add one yourself and then use the code above. This requires a well-defined ordering of your data based on one or more other columns (for example, orderBy("someColumn")); without one, row positions are not meaningful.

Full Explanation

No, it is not easily possible to slice a Spark DataFrame by index unless the index is already present as a column.

Spark DataFrames are inherently unordered and do not support random access; there is no concept of a built-in index as there is in pandas. Each row is treated as an independent collection of structured data, which is what allows distributed parallel processing: any executor can take any chunk of the data and process it without regard for the order of the rows.

It is of course possible to perform operations that do involve ordering (window functions such as lead and lag, for example), but these are slower because they require Spark to shuffle data between executors. (Shuffling is typically one of the slowest parts of a Spark job.)
