
apache spark - first_value windowing function in pyspark

I am using pyspark 1.5, getting my data from Hive tables, and trying to use windowing functions.

According to this, there exists an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it in pyspark anywhere.

Is there a way to implement this given that pyspark won't allow UserDefinedAggregateFunctions (UDAFs)?


1 Answer


Spark >= 2.0:

first takes an optional ignorenulls argument which can mimic the behavior of first_value:

df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv"))
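
For example, here is a minimal end-to-end sketch (assuming a Spark 2.x SparkSession bound to spark; the sample data mirrors the Spark < 2.0 snippet below):

from pyspark.sql.functions import col, first
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [("a", None), ("a", 1), ("a", -1), ("b", 3)], ["k", "v"]
)

w = Window.partitionBy("k").orderBy("v")

# ignorenulls=True skips leading nulls within the window frame,
# mimicking Hive's first_value(v, TRUE).
df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv")).show()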

Spark < 2.0:

The available function is called first and can be used as follows:

from pyspark.sql.functions import col, first
from pyspark.sql.window import Window

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

# Nulls sort first under the default ascending order.
w = Window.partitionBy("k").orderBy("v")

df.select(col("k"), first("v").over(w).alias("fv"))

Since nulls sort first under the default ascending order, plain first returns null here for group a. If you want to ignore nulls, you'll have to use Hive UDFs directly:

df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")
