Welcome to the OStack Knowledge Sharing Community for programmers and developers: Open, Learn, and Share
You are welcome to ask questions or share your answers with others


Recent questions tagged pyspark

0 votes
943 views
1 answer
    My question is triggered by the use case of calculating the differences between consecutive rows in a spark ... this can cause serious performance degradation.
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
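A minimal sketch of the usual lag-over-window approach to consecutive-row differences; the column names and ordering key are assumptions, not taken from the question:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (2, 12.5), (3, 11.0)], ["id", "value"])

    # A window without partitionBy pulls every row into one partition, which is
    # the performance degradation the excerpt warns about; partition by a key when possible.
    w = Window.orderBy("id")
    diffs = df.withColumn("diff", F.col("value") - F.lag("value").over(w))
    diffs.show()
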
0 votes
1.3k views
1 answer
    I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more ... want these new columns to be named as well.
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
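A hedged sketch of one way to get several named columns without explode (which multiplies rows): split the source column and alias each piece. The delimiter and column names here are illustrative only:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a_1",), ("b_2",)], ["raw"])

    parts = F.split(F.col("raw"), "_")
    named = df.select(
        parts.getItem(0).alias("letter"),            # each piece becomes its own named column
        parts.getItem(1).cast("int").alias("number"),
    )
    named.show()
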
0 votes
855 views
1 answer
    I started getting the following error anytime I try to collect my RDDs. It happened after I installed Java 10.1 So of ... 'new' is not defined >>> sc.stop()
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
821 views
1 answer
    I'm trying to use Spark dataframes instead of RDDs since they appear to be more high-level than RDDs and tend to ... and I should just go back to using RDDs.
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
1.1k views
1 answer
    Consider the following DataFrame: #+------+---+ #|letter|rpt| #+------+---+ #| X| 3| ... a way to replicate this behavior using the spark DataFrame functions?
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
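A minimal sketch of replicating each row rpt times with built-in DataFrame functions (array_repeat and explode need Spark 2.4+); the column names follow the sample frame in the excerpt:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("X", 3), ("Y", 1)], ["letter", "rpt"])

    # array_repeat builds an array of `rpt` copies; explode turns it back into rows
    replicated = df.withColumn("letter", F.explode(F.expr("array_repeat(letter, rpt)")))
    replicated.show()
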
0 votes
1.3k views
1 answer
    Is there an equivalent of Pandas Melt Function in Apache Spark in PySpark or at least in Scala? I was ... Spark for the entire dataset. Thanks in advance.
asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)
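A hedged sketch of a melt-like reshape using the SQL stack() function; the id and value column names are placeholders, not the asker's schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10, 20), (2, 30, 40)], ["id", "a", "b"])

    # stack(n, label1, col1, label2, col2, ...) emits one row per label/value pair
    melted = df.select("id", F.expr("stack(2, 'a', a, 'b', b) as (variable, value)"))
    melted.show()
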
0 votes
810 views
1 answer
    When I login to my edge node and run the below command, my application is submitted successfully and completes ... -to-run-spark-submit-on-remote-server-though-shell-action...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
963 views
1 answer
    My Scenario I have a spark data frame in a AWS glue job with 4 million records I need to write it as a ... questions/65832736/writing-large-spark-data-frame-as-parquet-to-s3-bucket...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
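A rough sketch (not the asker's Glue job) of spreading a large write across many tasks before writing parquet to S3; the bucket path and partition count are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(4_000_000)  # stand-in for the 4 million-record frame

    # More partitions mean more, smaller parquet files written in parallel
    (df.repartition(200)
       .write.mode("overwrite")
       .parquet("s3://my-bucket/output/"))  # hypothetical bucket and prefix
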
0 votes
1.5k views
1 answer
    I am trying to increase the heartbeat interval parameter in pyspark configuration but keep getting this error. Is there any ... -must-be-no-less-than-the-value-of-spark-execu...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
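A hedged sketch of the usual fix for that error: spark.executor.heartbeatInterval must stay well below spark.network.timeout, so raise the two together. The values are illustrative:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.executor.heartbeatInterval", "60s")
        .config("spark.network.timeout", "600s")  # keep this comfortably larger than the heartbeat
        .getOrCreate()
    )
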
0 votes
956 views
1 answer
    I'm trying to capture the string representation generated by the show() function as suggested here ... dataframe-show-string-representation-fails-with-showstringinteger-boolean-boo...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
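A hedged sketch of the trick referenced there, which calls the JVM DataFrame's showString through the private _jdf handle (the three-argument form matches Spark 2.3+ and is not a stable API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

    text = df._jdf.showString(20, 20, False)  # numRows, truncate width, vertical
    print(text)
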
0 votes
944 views
1 answer
    Closed. This question needs details or clarity. It is not currently accepting answers. question from:https://stackoverflow.com/questions/65841356/how-to-pair-rows-with-the-same-id...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
866 views
1 answer
    I am running a spark standalone cluster. My os is centos7 on master as well as on worker. Have set ... https://stackoverflow.com/questions/65842650/spark-worker-just-cannot-connect...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
885 views
1 answer
    I need to encode parquet files which are produced by my pyspark script, so that the encoding is ... .com/questions/65844890/spark-parquet-compression-and-encoding-schemes...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
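A hedged sketch of the knobs exposed from PySpark: compression is a write option, while Parquet encoding details such as dictionary encoding go through the Hadoop configuration. The path and settings are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Parquet writer property; dictionary encoding is on by default but can be pinned explicitly
    spark.sparkContext._jsc.hadoopConfiguration().set("parquet.enable.dictionary", "true")

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
    df.write.option("compression", "snappy").mode("overwrite").parquet("/tmp/encoded_out")
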
0 votes
970 views
1 answer
    I am new to coding and would like to know where "0" holding the database name in {0} is supposed to be in ... -in-a-from-clause-in-sql-query-like-in-python-string-formating...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
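A minimal sketch of what the {0} placeholder does: Python's str.format substitutes the database name into the query string before it reaches spark.sql. The database and table names are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    db_name = "sales_db"  # hypothetical database
    query = "SELECT * FROM {0}.transactions".format(db_name)  # {0} is replaced by sales_db
    result = spark.sql(query)
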
0 votes
757 views
1 answer
    I have a dataset that is around 190GB that was partitioned into 1000 partitions. my EMR cluster allows a ... /65866586/optimizing-spark-resources-to-avoid-memory-and-space-usage...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
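A rough, hedged sketch of the settings usually tuned for a job like this; the numbers are illustrative, not sized for the asker's 190GB dataset or EMR cluster, and executor memory/cores are normally fixed at submit time rather than in code:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.executor.memory", "16g")
        .config("spark.executor.cores", "4")
        .config("spark.sql.shuffle.partitions", "1000")  # roughly match the 1000 input partitions
        .getOrCreate()
    )
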
0 votes
851 views
1 answer
    I am trying to split a column of total count into different ranges of columns using pyspark. I am ... stackoverflow.com/questions/65867294/spark-count-records-into-specified-ranges...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
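A hedged sketch of bucketing a count column into ranges with when/otherwise; the boundaries and column names are assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(5,), (42,), (130,)], ["total_count"])

    bucketed = df.withColumn(
        "range",
        F.when(F.col("total_count") < 10, "0-9")
         .when(F.col("total_count") < 100, "10-99")
         .otherwise("100+"),
    )
    bucketed.groupBy("range").count().show()
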
0 votes
920 views
1 answer
    I have a bit of a question around PySpark. After aggregating, I have really skewed data (some ... //stackoverflow.com/questions/65869200/repartitioning-skewed-dataframes-in-spark...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
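A hedged sketch of one common remedy, salting the skewed key so a single hot value spreads across several partitions; the key name and salt range are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("hot", 1)] * 5 + [("rare", 2)], ["key", "value"])

    salted = df.withColumn("salt", (F.rand() * 16).cast("int"))
    balanced = salted.repartition("key", "salt")  # each hot key now lands in up to 16 partitions
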
0 votes
817 views
1 answer
    My setup is simple, centos master, centos worker. In master spark-env.sh export STANDALONE_SPARK_MASTER_HOST= ... -initially-connecting-and-then-disconnecting-trying-to-reconnect...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
830 views
1 answer
    When I run the following command: spark-submit --name "My app" --master "local[*]" --py-files main ... questions/65873182/why-driver-memory-is-not-in-my-spark-context-configuration...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
904 views
1 answer
    I get the following failed error for some of my tasks when running my job. But the job finishes successfully on ... .com/questions/65889696/spark-exit-status-134-what-does-it-mean...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
955 views
1 answer
    I am new to snowflake. I'm writing a spark df to snowflake, using this code. var = dict(sfUrl=" ... ://stackoverflow.com/questions/65901227/from-spark-to-snowflake-data-types...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
801 views
1 answer
    Closed. This question needs to be more focused. It is not currently accepting answers. question from:https:// ... -it-appropriate-to-use-a-udf-vs-using-spark-functionality...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
0 votes
1.1k views
1 answer
    Im using pyspark and I have a large data source that I want to repartition specifying the files size per partition ... /65912908/how-to-specify-file-size-using-repartition-in-spark...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
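A hedged sketch: Spark cannot target a byte size directly, but the maxRecordsPerFile write option caps rows per output file, which approximates a size once the average row size is known. The numbers and path are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000_000)

    (df.write
       .option("maxRecordsPerFile", 1_000_000)  # tune from average row size to hit a target file size
       .mode("overwrite")
       .parquet("/tmp/sized_output"))
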
0 votes
783 views
1 answer
    I have a Dataset below like: +----------------------------------+------------ ... ://stackoverflow.com/questions/65915468/how-to-perform-group-by-and-aggregate-operation-on-spark...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
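A minimal group-by and aggregate sketch; the grouping and value column names are assumptions, not the asker's schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 2.0)], ["group_col", "amount"])

    agg = df.groupBy("group_col").agg(
        F.sum("amount").alias("total"),
        F.count("*").alias("n_rows"),
    )
    agg.show()
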
0 votes
1.4k views
1 answer
    When trying to run the following code: val1_index = df_playlists['pid'].isin(val1_playlist[0]) I received this ... /questions/65915669/why-cant-pandass-isin-work-with-numpy-int64...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
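A hedged guess at the usual cause: Series.isin expects an iterable, and val1_playlist[0] is a single numpy.int64 scalar, so wrapping it in a list avoids the error. The sample data is made up:

    import pandas as pd

    df_playlists = pd.DataFrame({"pid": [1, 2, 3]})
    val1_playlist = [2]  # stand-in for the real playlist values

    val1_index = df_playlists["pid"].isin([val1_playlist[0]])  # list-wrap the scalar
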
0 votes
917 views
1 answer
    According to this question - --files option in pyspark not working the sc.addFiles option should work for accessing files ... way-to-access-file-contents-in-both-the-driver-and-ex...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
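A minimal sketch of the addFile/SparkFiles pattern the linked question discusses, where the same file is readable on the driver and inside tasks; the file path is hypothetical:

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    sc.addFile("/tmp/lookup.txt")  # hypothetical local file distributed to every node

    def read_on_executor(_):
        # SparkFiles.get resolves the node-local copy wherever the task runs
        with open(SparkFiles.get("lookup.txt")) as f:
            return f.read()

    print(sc.parallelize([1]).map(read_on_executor).first())
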
0 votes
833 views
1 answer
    I have a list of data frames, on each location of a list, I have one dataframe I need to ... stackoverflow.com/questions/65923884/make-single-dataframe-from-list-of-dataframes...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
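A minimal sketch of folding a list of DataFrames into one with unionByName; the sample frames are placeholders and assume matching schemas:

    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.getOrCreate()
    dfs = [
        spark.createDataFrame([(1, "a")], ["id", "name"]),
        spark.createDataFrame([(2, "b")], ["id", "name"]),
    ]

    combined = reduce(DataFrame.unionByName, dfs)  # pairwise union across the whole list
    combined.show()
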
0 votes
965 views
1 answer
    I am trying to run spark-shell command locally and I am getting below error java.net.BindException: ... stackoverflow.com/questions/65928852/spark-shell-command-failing-on-local...
asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)
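A hedged guess at the common fix for that BindException, shown here as the PySpark equivalent: bind the driver to localhost explicitly (or set SPARK_LOCAL_IP in the environment). The addresses are the usual loopback values:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.driver.bindAddress", "127.0.0.1")
        .config("spark.driver.host", "localhost")
        .getOrCreate()
    )
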

...