PySpark: keep records with a specific prefix and filter out all-numeric values

bgibtngc  posted on 2023-01-25  in  Spark

I have a PySpark DataFrame that looks like this:

serial_number
000001234
000002887
00008765
0745-218
01-7865
040/7868L
0000124
00002364
01231325246
068775H

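I only want to keep records that start with the prefix 0 (a leading 0) and are not purely numeric. That is, besides digits they must also contain letters and/or special characters. Expected output: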

serial_number
0745-218
01-7865
040/7868L
068775H

I tried a few regular expressions such as ^0[^0], but they also accept all-digit input.
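The reason the attempted pattern lets all-digit values through can be checked locally. The sketch below uses Python's re module rather than Spark, with a few sample values copied from the question; the variable names are only illustrative. The pattern constrains just the first two characters, so any value whose second character is not 0 passes, digits included.

```python
import re

serials = ["000001234", "0745-218", "01231325246", "068775H"]

# "^0[^0]" only requires that the second character is not "0";
# it places no condition on the rest of the string.
attempt = re.compile(r"^0[^0]")
accepted = [s for s in serials if attempt.search(s)]

print(accepted)  # ['0745-218', '01231325246', '068775H']
```

The all-digit value 01231325246 is accepted because its second character is 1, which is why an explicit "contains a non-digit" condition is needed.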


rpppsulh  #1

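Use rlike, as in the code below. The first condition requires a leading 0, the second requires at least one non-digit character.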

from pyspark.sql.functions import col
df.where(col('serial_number').rlike(r'^0') & col('serial_number').rlike(r'\D')).show()
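
Spark's rlike performs a partial (find-anywhere) match, much like Python's re.search, so the logic of the two conditions can be sanity-checked without a Spark session. This is a minimal sketch under that assumption, using the sample values from the question; `keep` and `combined` are hypothetical names introduced here for illustration.

```python
import re

serial_numbers = [
    "000001234", "000002887", "00008765", "0745-218", "01-7865",
    "040/7868L", "0000124", "00002364", "01231325246", "068775H",
]

def keep(value: str) -> bool:
    """Mirror the two rlike conditions: leading 0 and at least one non-digit."""
    return bool(re.search(r"^0", value)) and bool(re.search(r"\D", value))

kept = [s for s in serial_numbers if keep(s)]
print(kept)  # ['0745-218', '01-7865', '040/7868L', '068775H']

# The two conditions folded into one pattern: a leading 0, then any
# characters, then a non-digit somewhere later in the string.
combined = re.compile(r"^0.*\D")
kept_combined = [s for s in serial_numbers if combined.search(s)]
print(kept_combined == kept)  # True
```

For values without line breaks, the single pattern ^0.*\D should select the same rows when passed to rlike, which saves one column expression.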

ajsxfq5m  #2

Following 何阮's answer:

import re

# A leading 0, optional digits, then at least one non-digit character.
COMPILED = re.compile(r"0\d*[^\d]+\d*")

serial_numbers = [
    "000001234",
    "000002887",
    "00008765",
    "0745-218",
    "01-7865",
    "040/7868L",
    "0000124",
    "00002364",
    "01231325246",
    "068775H"
]

matching_numbers = [number for number in serial_numbers if COMPILED.match(number)]

print(matching_numbers)

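The ^ anchor is not needed, because re.match always matches from the start of the string.
\d is shorthand for the digit class [0-9]; for Python 3 str patterns it also matches other Unicode decimal digits unless re.ASCII is set.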
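
Both remarks above can be checked directly. The following sketch is illustrative only (the test strings and variable names are assumptions, not part of the answer): it contrasts re.match with re.search, and shows what \d covers for str patterns in Python 3.

```python
import re

pattern = r"0\d*\D"

# re.match anchors at the start of the string; re.search scans the whole string.
match_result = bool(re.match(pattern, "A0745-218"))    # False: string starts with "A"
search_result = bool(re.search(pattern, "A0745-218"))  # True: "0745-" is found later

# For str patterns, \d matches any Unicode decimal digit unless re.ASCII is set.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = bool(re.match(r"\d", arabic_three))          # True
ascii_digit = bool(re.match(r"\d", arabic_three, re.ASCII))  # False
class_digit = bool(re.match(r"[0-9]", arabic_three))         # False

print(match_result, search_result, unicode_digit, ascii_digit, class_digit)
```

For serial numbers made of ASCII characters the distinction does not change the result, but [0-9] or re.ASCII makes the intent explicit.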


gwbalxhn  #3

import re

serial_numbers = [
    "000001234",
    "000002887",
    "00008765",
    "0745-218",
    "01-7865",
    "040/7868L",
    "0000124",
    "00002364",
    "01231325246",
    "068775H"
]

# A leading 0, then optional digits, then at least one non-digit character.
pattern = r"^0[0-9]*[^0-9]+"

matching_numbers = [number for number in serial_numbers if re.match(pattern, number)]

print(matching_numbers)
