I have a sample Spark DataFrame that I created from a pandas DataFrame:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# create a sample pandas DataFrame first, then create a Spark DataFrame from it
pdf = pd.DataFrame([[1, "hello world. lets shine and spread happiness"], [2, "not so sure"], [2, "cool i like it"], [2, "cool i like it"], [2, "cool i like it"]],
                   columns=['input1', 'input2'])
df = spark.createDataFrame(pdf)  # this is the Spark DataFrame
Now the DataFrame has the following data types:
df.printSchema()
root
|-- input1: long (nullable = true)
|-- input2: string (nullable = true)
If I convert it back to pandas with:
pandas_df = df.toPandas()
and then print the dtypes, the second column comes back as object rather than a string type.
pandas_df.dtypes
input1 int64
input2 object
dtype: object
How can I correctly convert the string type in Spark to a string type in pandas?
1 Answer
To convert the column to a pandas string dtype, you can use StringDtype:
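A minimal sketch of what that cast might look like, assuming pandas 1.0 or later (where StringDtype is available). To keep it runnable without a Spark session, pandas_df here is a hand-built stand-in for the result of df.toPandas(), with input2 forced to object dtype; the sample rows are shortened from the question.

```python
import pandas as pd

# Stand-in for df.toPandas(): the string column arrives as object dtype.
pandas_df = pd.DataFrame(
    {
        "input1": [1, 2, 2],
        "input2": pd.Series(
            ["hello world", "not so sure", "cool i like it"], dtype=object
        ),
    }
)

# Cast the object column to the pandas string extension dtype.
pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())

print(pandas_df.dtypes)
```

The same cast can also be written with the alias astype("string"). Note that toPandas() itself still returns object for string columns, so the cast has to be applied afterwards on each column that needs it.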