Regex: split street addresses without any spaces between words [closed]

Posted by ozxc1zmp on 2023-08-08 in Other


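I have a list of 1,000 street addresses with no spacing between the words. I want to split them into street, city, state, and ZIP code. The only distinguishing feature is the change in capitalization.
This is what I have: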

OneMainStreetAnytownVA20394
200SideStAnotherCityMD12345
2BigAvenueNWMetropolisDC33224-4039

This is what I want:

One Main Street, Anytown, VA, 20394
200 Side St, Another City, MD, 12345
2 Big Avenue NW, Metropolis, DC, 33224-4039

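I've played around with regular expressions in Notepad++ and a few online tools, but nothing seems to work reliably. Interestingly, Google Maps can usually resolve these addresses, but I haven't found a way to automate that process. Any ideas?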

regex

Source: https://stackoverflow.com/questions/76733093/split-street-addresses-without-any-spaces-between-words

1 Answer


agyaoht71#

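I'm not sure this is a regex problem. Here is a simple parsing state machine that splits the words apart, with one exception. It will still need a database to determine where the street address ends and the city begins.
The one case that fails is "NWMetropolis". One potential solution is to assume every capital letter stands alone, which would give you "N W Metropolis". As I said, this is a hard problem.
I'll also point out that very few address lists are this clean. People are inconsistent about both spelling and capitalization.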
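The "every capital letter stands alone" idea mentioned above can be sketched with a regular expression. This is only an illustration, not the state machine from this answer: the name `split_every_capital` and the pattern are assumptions made for the example.

```python
import re

# Each capital letter starts a new token; runs of digits and hyphens stay together.
TOKEN = re.compile(r"[A-Z][a-z]*|[a-z]+|[0-9-]+")

def split_every_capital(s):
    """Split s so that every capital letter begins its own word."""
    return TOKEN.findall(s)

print(split_every_capital("2BigAvenueNWMetropolisDC33224-4039"))
# ['2', 'Big', 'Avenue', 'N', 'W', 'Metropolis', 'D', 'C', '33224-4039']
```

Note the trade-off: "NW" is separated from "Metropolis", but the state code "DC" is split into "D" and "C" as well, so the pieces would have to be rejoined afterwards.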

tests = [
  'OneMainStreetAnytownVA20394',
  '200SideStAnotherCityMD12345',
  '2BigAvenueNWMetropolisDC33224-4039'
]

def classify(c):
    if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
        return 'cap'
    if c in 'abcdefghijklmnopqrstuvwxyz':
        return 'low'
    if c in '0123456789-':
        return 'num'
    print("UNKNOWN")

def parse(s):
    words = []
    state = 'none' # 'cap', 'low', 'num'

    gather = ''
    for c in s:
        new = classify(c)
        if new == state or (new=='low' and state=='cap'):
            gather += c
        else:
            # Skip the empty string gathered before the first character.
            if gather:
                words.append(gather)
            gather = c
        state = new

    if gather:
        words.append(gather)

    return ' '.join(words)

for t in tests:
    print(t, parse(t))

Output:

OneMainStreetAnytownVA20394 One Main Street Anytown VA 20394
200SideStAnotherCityMD12345 200 Side St Another City MD 12345
2BigAvenueNWMetropolisDC33224-4039 2 Big Avenue NWMetropolis DC 33224-4039

You can get closer to the desired output by replacing the return statement with:

return ' '.join(words[:-2]) + f", {words[-2]} {words[-1]}"
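Separating the street from the city still needs a lookup of known city names. Here is a minimal sketch of that step, assuming a hypothetical `KNOWN_CITIES` set standing in for a real city database and a hypothetical helper `format_address` that takes the already-split words. It also assumes the last two words are always the state and the ZIP code.

```python
# Hypothetical stand-in for a real city database; these names come
# from the sample addresses only.
KNOWN_CITIES = {"Anytown", "Another City", "Metropolis"}

def format_address(words, cities=KNOWN_CITIES):
    """Format split words as 'street, city, state, zip'.

    Assumes the last two words are the state and the ZIP code, and that
    the city is the longest run of words just before them that appears
    in `cities`. Returns None when no known city matches.
    """
    if len(words) < 4:
        return None
    *body, state, zipcode = words
    # Try the longest candidate city first, leaving at least one word for the street.
    for start in range(1, len(body)):
        city = " ".join(body[start:])
        if city in cities:
            street = " ".join(body[:start])
            return f"{street}, {city}, {state}, {zipcode}"
    return None

print(format_address(["200", "Side", "St", "Another", "City", "MD", "12345"]))
# 200 Side St, Another City, MD, 12345
```

A word list such as `["2", "Big", "Avenue", "NWMetropolis", "DC", "33224-4039"]` returns `None` here, because "NWMetropolis" is not a known city; the directional would have to be split off first.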

