Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
140 views
in Technique by (71.8m points)

python - How to use aggregated data in a pyspark udf

I am trying to calculate the speed of a vehicle using GPS records. I have the original GPS points and the matched route for each trip in two separate data frames. The speed at each original GPS point is the distance between two consecutive original GPS points, measured along the matched route, divided by the time difference between them.

I have a UDF for calculating the speed, which takes the latitude, longitude, and timestamp of two consecutive GPS points, plus the matched route, as parameters. My question is: how can I pass the route to this UDF?
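For illustration only, the per-point calculation described above might look roughly like the plain-Python sketch below. It is not the actual UDF: the function names, the route layout (a list of `(lat, lon)` vertices in travel order), timestamps in seconds, and snapping each GPS point to its nearest route vertex are all assumptions made to keep the example short.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_vertex(route, lat, lon):
    """Index of the route vertex closest to (lat, lon). Hypothetical helper."""
    return min(range(len(route)), key=lambda i: haversine_m(lat, lon, *route[i]))


def speed_along_route(lat1, lon1, ts1, lat2, lon2, ts2, route):
    """Speed in m/s between two consecutive GPS points, measured along the route.

    route: list of (lat, lon) vertices in travel order (assumed layout).
    ts1, ts2: timestamps in seconds (assumed unit).
    Returns None when the time difference is not positive.
    """
    dt = ts2 - ts1
    if dt <= 0:
        return None
    # Snap both points to route vertices (a simplification of real map matching).
    i, j = sorted((nearest_vertex(route, lat1, lon1), nearest_vertex(route, lat2, lon2)))
    # Sum the lengths of the route segments between the two snapped vertices.
    dist = sum(haversine_m(*route[k], *route[k + 1]) for k in range(i, j))
    return dist / dt
```

A call such as `speed_along_route(0.0001, 0.0, 0, 0.0001, 0.002, 10, [(0.0, 0.0), (0.0, 0.001), (0.0, 0.002)])` measures the distance over the route vertices rather than the straight line between the two GPS points. The `route` argument is exactly the piece that has to reach each executor, which is the difficulty this question is about.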

The two options I have tried are joining the two dataframes, and broadcasting the route dataframe after converting it into a map. With billions of GPS points, neither is viable: the joined dataframe requires a vast amount of memory (the route of a trip is repeated for each GPS point of that trip), and collecting all the routes onto one node is simply impossible.


Fight the dragon too long and you become the dragon yourself; gaze too long into the abyss and the abyss will gaze back at you…

1 Answer

0 votes
by (71.8m points)
Waiting for an expert to answer.


...