PySpark: regexp_extract

Posted by 5cg8jx4n on 2023-04-13 in Spark

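My dataset has a column named "description" that packs several pieces of information into one string, like this:

| description |
| --------------|
| questionA : put returns between paragraphs questionB : indent code by 4 spaces questionC : for linebreak add 2 spaces at end |
| questionA : add language identifier questionB : create code fences questionC : to highlight code |

I want to use the regexp_extract function to pull out the questionB response, so that I get the following:

| description | regex |
| --------------|--------------|
| questionA : put returns between paragraphs questionB : indent code by 4 spaces questionC : for linebreak add 2 spaces at end | indent code by 4 spaces |
| questionA : add language identifier questionB : create code fences questionC : to highlight code | create code fences |

How can I do this with regexp_extract?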


Answer #1, by dkqlctbz

You can try:

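from pyspark.sql import functions as F

# Capture the text between "questionB : " and " questionC :" (group 1).
# regexp_extract returns an empty string when the pattern does not match.
df = df.withColumn(
    'regex',
    F.regexp_extract('description', 'questionB : (.+) questionC :', 1)
)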
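
The pattern above expects exactly one space on each side of every colon. If the real data is less regular (for example `questionB:` with no spaces), a more tolerant pattern may help. The sketch below is an assumption rather than part of the answer: it checks such a pattern with Python's built-in `re` module as a stand-in for Spark's Java regex engine, which should treat this particular pattern the same way. The names `PATTERN` and `extract_question_b` are hypothetical.

```python
import re

# Hypothetical, whitespace-tolerant variant of the answer's pattern:
#   \s*    allows zero or more spaces around each colon
#   (.+?)  is lazy, so it stops at the first "questionC" rather than the last
PATTERN = r"questionB\s*:\s*(.+?)\s*questionC\s*:"


def extract_question_b(description: str) -> str:
    """Return the questionB response, or "" when the pattern does not match.

    Returning "" on a miss mirrors what regexp_extract does in Spark.
    """
    match = re.search(PATTERN, description)
    return match.group(1) if match else ""


rows = [
    "questionA : put returns between paragraphs questionB : indent code by 4 spaces "
    "questionC : for linebreak add 2 spaces at end",
    "questionA : add language identifier questionB : create code fences "
    "questionC : to highlight code",
]

for row in rows:
    print(extract_question_b(row))
```

If the pattern behaves as wanted here, the same string can be passed to `F.regexp_extract('description', PATTERN, 1)`; keeping it a raw string avoids Python interpreting the backslashes.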

Answer #2, by von4xj4u

You can call the split function twice to get the desired result:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, trim

spark = SparkSession.builder.appName("test").getOrCreate()
data = [
    (1,
     "questionA : put returns between paragraphs questionB : indent code by 4 spaces questionC : for linebreak add 2 spaces at end"),
    (2, "questionA : add language identifier questionB : create code fences questionC : to highlight code"),
]
df = spark.createDataFrame(data, ['id', 'description'])

# The first split keeps everything after "questionB : "; the second keeps what
# precedes the next "question" marker, and trim() removes the trailing space.
df.withColumn("regex", split("description", "questionB : ").getItem(1)) \
    .withColumn("regex", trim(split("regex", "question").getItem(0))).show(truncate=False)

Result:

+---+----------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |description                                                                                                                 |regex                  |
+---+----------------------------------------------------------------------------------------------------------------------------+-----------------------+
|1  |questionA : put returns between paragraphs questionB : indent code by 4 spaces questionC : for linebreak add 2 spaces at end|indent code by 4 spaces|
|2  |questionA : add language identifier questionB : create code fences questionC : to highlight code                            |create code fences     |
+---+----------------------------------------------------------------------------------------------------------------------------+-----------------------+
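
To make the two split steps easier to follow, here is a minimal plain-Python sketch of the same logic applied to a single string. It is an illustration only, not Spark code: `split_twice` is a hypothetical helper, Python's `str.split` takes a literal separator whereas Spark's `split` takes a regex (the two agree for these separators), and the sketch assumes the `"questionB : "` marker is present.

```python
def split_twice(description: str) -> str:
    # Step 1: keep what follows "questionB : " (like split(...).getItem(1)).
    after_b = description.split("questionB : ")[1]
    # Step 2: keep what precedes the next "question" marker (like getItem(0)),
    # then strip the surrounding whitespace (like trim()).
    return after_b.split("question")[0].strip()


samples = [
    "questionA : put returns between paragraphs questionB : indent code by 4 spaces "
    "questionC : for linebreak add 2 spaces at end",
    "questionA : add language identifier questionB : create code fences "
    "questionC : to highlight code",
]

for sample in samples:
    print(split_twice(sample))
```

One design consequence worth noting: because the second split cuts at the bare word "question", a response that itself contains that word would be truncated at that point.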
