Read in CSV in PySpark with correct data types

Posted by vsdwdz23 on 2023-07-31 in Spark


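When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only contain integers and a timestamp type. More specifically, the CSV looks like this: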

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I found code that should work in this question, but when I execute it, all entries are returned as NULL.
I create a custom schema with the following code:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

# PySpark syntax: a Python list, instantiated types, and True (not Scala's Array/true)
customSchema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", TimestampType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransKey", IntegerType(), True),
])

Then I read in the CSV:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

which returns:

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|Transkey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a crucial step? I suspect the date column is the root of the problem. Note: I am running this in Google Colab.


Source: https://stackoverflow.com/questions/53013173/read-in-csv-in-pyspark-with-correct-datatypes

2 Answers


brvekthn1#

Here you go!

Sample file contents:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

Load the file above into a DataFrame:

PATH_TO_FILE = "file:///u/vikrant/LocalTestDateFile"

# On Spark 2.0+ the built-in "csv" format can be used instead of "com.databricks.spark.csv"
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("delimiter", ",") \
  .load(PATH_TO_FILE)

Your date will be loaded as a string column, but when you cast it to a date type, values in this format are treated as NULL.

from pyspark.sql.functions import col
df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.mm.yyyy to yyyy-mm-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

A Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))

Now call this function on your DataFrame column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

Here is the schema:

root
 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)
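
One caveat with the UDF approach is that `datetime.strptime` raises an exception when it receives `None` or a value in another format, which would fail the Spark job on nulls or dirty rows. The sketch below is one possible null-safe variant, followed by an alternative that avoids a Python UDF altogether by using Spark's built-in `to_date` with a pattern (available with a format argument from Spark 2.2 onward, as far as I know). The helper names `to_iso_date`, `convert_with_udf`, and `convert_with_to_date` are made up for illustration, and the column name `TransDate` is taken from the sample file.

```python
from datetime import datetime
from typing import Optional


def to_iso_date(value: Optional[str]) -> Optional[str]:
    """Convert 'dd.mm.yyyy' text to 'yyyy-mm-dd'; return None if it cannot be parsed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value.strip(), "%d.%m.%Y").strftime("%Y-%m-%d")
    except ValueError:
        # Unparseable input becomes null instead of failing the whole job.
        return None


def convert_with_udf(df, column="TransDate"):
    """Null-safe variant of the UDF approach (hypothetical helper name)."""
    # Imported lazily so to_iso_date stays usable without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DateType, StringType

    to_iso_date_udf = udf(to_iso_date, StringType())
    return df.withColumn(column, to_iso_date_udf(col(column)).cast(DateType()))


def convert_with_to_date(df, column="TransDate"):
    """Parse the string column with Spark's built-in to_date, no Python UDF."""
    from pyspark.sql.functions import col, to_date

    return df.withColumn(column, to_date(col(column), "dd.MM.yyyy"))


if __name__ == "__main__":
    print(to_iso_date("15.11.2005"))  # 2005-11-15
    print(to_iso_date(None))          # None
```

Either helper would be called as `newdf = convert_with_to_date(df)`. The built-in version is usually preferable when it is available, since it keeps the parsing inside the JVM rather than shipping every row through Python.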

Let me know if this works for you.


raogr8fs2#

You can specify an option('dateFormat', 'd.M.y') on the DataFrameReader to parse dates in a specific format.

# Pattern is day.month.year to match values such as "15.11.2005"
df = spark.read.format("csv").option("header", "true").option("dateFormat", "d.M.y").schema(my_schema).load("path_to_csv")
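
The snippet above leaves `my_schema` undefined, so here is a minimal sketch of what it could look like for the sample file. It assumes the column types from the inferred schema in the first answer, with `TransDate` declared as a date, and it uses a DDL-formatted schema string, which `DataFrameReader.schema()` accepts in addition to a `StructType` (from Spark 2.3 onward, to my knowledge). The names `COLUMNS`, `SCHEMA_DDL`, and `read_transactions` are hypothetical. Two details seem worth noting: the schema lists all seven CSV columns, since with a user-supplied schema the reader matches fields to columns by position, and `dateFormat` applies to date columns, while a `TimestampType` column would use the `timestampFormat` option instead.

```python
# Column names come from the sample CSV header; the types mirror the inferred
# schema shown in the first answer, with TransDate declared as DATE (assumption).
COLUMNS = [
    ("Customer", "INT"),
    ("TransDate", "DATE"),
    ("Quantity", "INT"),
    ("PurchAmount", "DOUBLE"),
    ("Cost", "INT"),
    ("TransID", "INT"),
    ("TransKey", "INT"),
]

# Build a DDL-formatted schema string, e.g. "Customer INT, TransDate DATE, ..."
SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in COLUMNS)


def read_transactions(spark, path):
    """Read the CSV with an explicit schema and a day.month.year date pattern.

    `spark` is an existing SparkSession; `path` points at the CSV file.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("dateFormat", "dd.MM.yyyy")
        .schema(SCHEMA_DDL)
        .load(path)
    )


if __name__ == "__main__":
    print(SCHEMA_DDL)
```

With a running session this would be used as `df = read_transactions(spark, "myData.csv")`, followed by `df.printSchema()` to confirm the column types.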
