I have a DataFrame named df2 with two columns (DEST_COUNTRY_NAME, count).
I created a new DataFrame from it as follows:
df3 = df2.groupBy("DEST_COUNTRY_NAME").sum('count')
I intend to rename the "sum(count)" column to "destination_total":
df5 = df3.selectExpr("cast(DEST_COUNTRY_NAME as string) DEST_COUNTRY_NAME", "cast(sum(count) as int) destination_total")
However, this raises the following error:
Py4JJavaError Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,**kw)
62 try:
---> 63 return f(*a,**kw)
64 except py4j.protocol.Py4JJavaError as e:
Py4JJavaError: An error occurred while calling o1195.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;
'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]
+- AnalysisBarrier
+- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]
+- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]
+- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:122)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:127)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3296)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1342)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
AnalysisException Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,**kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: "cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;\n'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]\n+- AnalysisBarrier\n +- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]\n +- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]\n +- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv\n"
I intend to rename the "sum(count)" column to "destination_total". How can I fix this? I am working with Spark, not Pandas.
4 Answers
iezvtpos1#
Assuming your DataFrame has only two columns, here are two ways to rename it.
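A minimal sketch of two such approaches, assuming the aggregated column is literally named sum(count) as in the question:

# Option 1: rename the aggregated column in place.
df5 = df3.withColumnRenamed("sum(count)", "destination_total")

# Option 2: rename every column positionally -- this is where the
# two-column assumption matters, since toDF takes one name per column.
df5 = df3.toDF("DEST_COUNTRY_NAME", "destination_total")

Note that the original selectExpr also works once the column name is escaped in backticks, so that sum(count) is parsed as a literal column name rather than a fresh aggregation (which is exactly what the "cannot resolve '`count`'" error is complaining about):

df5 = df3.selectExpr("DEST_COUNTRY_NAME", "cast(`sum(count)` as int) destination_total")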
You can also rename the column by calling the alias function on it, as shown below:
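A sketch along those lines, using col to reference the aggregated column by its literal name (column names taken from the question):

from pyspark.sql.functions import col

# alias() renames the column as part of the select itself.
df5 = df3.select(col("DEST_COUNTRY_NAME"), col("sum(count)").alias("destination_total"))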
P.S.: Don't forget to import col.
woobm2wo2#
bfrts1fy3#
You can also perform the aggregation with agg instead of calling sum directly.
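A minimal sketch of that approach, aliasing the sum inside agg so the column comes out already named destination_total and no separate rename step is needed:

from pyspark.sql import functions as F

# agg() accepts arbitrary aggregate expressions, so the alias can be
# applied at aggregation time instead of renaming afterwards.
df3 = df2.groupBy("DEST_COUNTRY_NAME").agg(F.sum("count").alias("destination_total"))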
k5ifujac4#