python - How to use aggregated data in a PySpark UDF

I am trying to calculate the speed of a vehicle from GPS records. I have the original GPS points and the matched route for each trip in two separate dataframes. The speed is calculated for each original GPS point by taking the distance between two consecutive original GPS points along the matched route and dividing it by the time difference between them.

I have a UDF for calculating the speed, which takes the lat, long, and timestamp of two consecutive GPS points plus the matched route as parameters. My question is: how can I pass the route to this UDF?
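To make the calculation concrete, here is a minimal plain-Python sketch of what such a speed function might look like. It is not my actual UDF: the names (`speed_mps`, `cumulative_route_m`, `nearest_vertex`) are hypothetical, the route is assumed to be an ordered list of (lat, lon) vertices, timestamps are assumed to be in seconds, and each GPS point is simply snapped to its nearest route vertex, which a real map-matching setup would probably do more precisely.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (lat, lon) in degrees

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def cumulative_route_m(route: List[Point]) -> List[float]:
    """Distance from the start of the route to each vertex, in metres."""
    dists = [0.0]
    for prev, cur in zip(route, route[1:]):
        dists.append(dists[-1] + haversine_m(prev, cur))
    return dists


def nearest_vertex(route: List[Point], p: Point) -> int:
    """Index of the route vertex closest to p (assumed snapping rule)."""
    return min(range(len(route)), key=lambda i: haversine_m(route[i], p))


def speed_mps(p1: Point, t1: float, p2: Point, t2: float,
              route: List[Point]) -> float:
    """Speed in m/s: distance along the route between p1 and p2, over t2 - t1."""
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    cum = cumulative_route_m(route)
    i = nearest_vertex(route, p1)
    j = nearest_vertex(route, p2)
    return abs(cum[j] - cum[i]) / dt
```

The point of the sketch is that `route` is a per-trip value, while the other arguments are per-GPS-point values, which is why passing it to a row-wise UDF is awkward.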

The two options I have tried are joining the two dataframes, and broadcasting the route dataframe after converting it into a map. With billions of GPS points, neither is viable: the joined dataframe requires a vast amount of memory (the route of a trip is repeated for each GPS point of that trip), and collecting all the routes onto one node is simply impossible.

