How to convert a string column in a Spark DataFrame to a string column in a pandas DataFrame

jhdbpxl9 posted on 2021-05-27 in Spark

I have a sample Spark DataFrame that I created from a pandas DataFrame:

from pyspark.sql import SparkSession

import pandas as pd

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# create a sample pandas dataframe first, then create the spark dataframe from it

pdf = pd.DataFrame([[1,"hello world. lets shine and spread happiness"],[2,"not so sure"],[2,"cool i like it"],[2,"cool i like it"],[2,"cool i like it"]]
                   , columns = ['input1','input2'])
df = spark.createDataFrame(pdf) # this is spark df

The Spark DataFrame now has the following schema:

df.printSchema()

root
 |-- input1: long (nullable = true)
 |-- input2: string (nullable = true)

I then convert it to pandas:

pandas_df = df.toPandas()

When I print the dtypes, the second column is reported as object rather than a string type.

pandas_df.dtypes
input1     int64
input2    object
dtype: object

How do I correctly convert the Spark string type to a pandas string type?

8iwquhpp #1

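pandas stores Python strings in object columns by default, which is why toPandas() reports object. To get the dedicated string dtype, cast the column to StringDtype (available since pandas 1.0):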

pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())
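
To check the cast without starting a Spark session, here is a minimal sketch that uses a hand-built pandas DataFrame as a stand-in for the result of df.toPandas(). The stand-in frame and the names before and after are assumptions for illustration; the text column is forced to object dtype to match what the question reports.

```python
import pandas as pd

# Stand-in for df.toPandas() (assumption): the text column is given object
# dtype explicitly, matching the dtypes shown in the question.
pandas_df = pd.DataFrame({
    "input1": [1, 2, 2],
    "input2": pd.Series(
        ["hello world. lets shine and spread happiness",
         "not so sure",
         "cool i like it"],
        dtype=object,
    ),
})

before = pandas_df["input2"].dtype  # object

# Cast the column to the pandas string extension dtype.
pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())
after = pandas_df["input2"].dtype

print(before, "->", after)
```

The values themselves are unchanged; only the column dtype differs. If every column should be moved to the best available extension dtype in one go, pandas_df.convert_dtypes() is an alternative worth trying.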
