我尝试使用REGEX从Pandas Dataframe 的文本块中提取连接字符串。
我的REGEX在www.example.com上运行REGEX101.com(见下面的截图)。链接到我保存的测试:https://regex101.com/r/ILnpS0/1
当我尝试在Pandas Dataframe 中运行REGEX时,我没有得到任何REGEX匹配/提取(但没有错误),尽管在REGEX 101上得到了匹配。https://colab.research.google.com/drive/1WAMlGkHAOqe38Lzo_K0KHwD_ynVJyIq1?usp=sharing
因此,问题似乎是Pandas如何解释我的REGEX
有人能确定为什么我没有得到任何正则表达式匹配使用Pandas?
REGEX逻辑我的REGEX由3个组组成
(?=Source = DB2.Database)(.*?)(?=\]\))
组1:(?=Source = DB2.Database) 是一个“Lookbehind”,它查找文本“Source = DB2.Database”,即我的连接字符串的开头。
群组2:(.?)* 会寻找任何字符,并做为第一个和第三个群组之间的跨距。
组3:(?=])) 是一个查看Assert,用于标识连接字符串的结尾)
**其他测试:**当我运行一个简化版本的REGEX(DB2.Database)时,我得到了预期的匹配。这个例子也在上面链接的笔记本中。
我的代码(与链接的Colab笔记本相同)
import pandas as pd
myDF = pd.DataFrame({'conn_str':['''{'expression': 'let\n Source = Snowflake.Databases("whitehouse.australia-east.azure.snowflakecomputing.com","USER"),\n WH_DW_Database = Source{[Name="WHOUSE_DW",Kind="Database"]}[Data],\n DWH_Schema = SPARK_DW_Database{[Name="DWH",Kind="Schema"]}[Data],\n D_ACCOUNT_CURR_View = DWH_Schema{[Name="D_ACCOUNT_CURR",Kind="View"]}[Data],\n #"Filtered Rows" = Table.SelectRows(D_ACCOUNT_CURR_View, each ([PAYMENT_TYPE] = "POSTPAID") and ([ACCOUNT_SEGMENT] <> "Consumer") ),\n #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"DESCRIPTION", "ACCOUNT_NUMBER"})\nin\n #"Removed Other Columns"'}''','''{'expression': 'let\n Source = DB2.Database("69.699.69.69", "WHUDB", [HierarchicalNavigation=true, Implementation="Microsoft", Query="SELECT\n base.HEAD_PARTY_NO,\n base.HEAD_PARTY_NAME,\n usg.BILL_MONTH,\n base.CUSTOMER_NUMBER,\n base.ACCOUNT_NUMBER,\n base.CHARGE_ARRANGEMENT_NUMBER,\n usg.DATA_MB,\n usg.DATA_MB/1024 as Data_GB,\n base.PRODUCT_DESCRIPTION,\nbase.LINE_DESCRIPTION\n\nFROM PRODUCT.MOBILE_ACTIVE_BASE base\nLEFT JOIN PRODUCT.MOBILE_USAGE_SUMMARY usg\n\nON\n base.CHARGE_ARRANGEMENT_NUMBER = usg.CHARGE_ARRANGEMENT_NUMBER\n\nand \nbase.CHARGE_ARRANGEMENT_ID = usg.CHARGE_ARRANGEMENT_ID\n\nWHERE base.PRODUCT_DESCRIPTION LIKE \'%Share%\' \n--AND (base.HEAD_PARTY_NO = 71474425 or base.HEAD_PARTY_NO = 73314303)\nAND usg.BILL_MONTH BETWEEN (current_date - 5 MONTHS) and CURRENT_DATE \nOrder by base.ACCOUNT_NUMBER,Data_MB desc with ur"]),\n #"Added Custom1" = Table.AddColumn(Source, "Line Number", each Text.Middle([CHARGE_ARRANGEMENT_NUMBER],1,14)),\n #"Renamed Columns" = Table.RenameColumns(#"Added Custom1",{{"LINE_DESCRIPTION", "Line Description"}, {"BILL_MONTH", "Bill Month"}}),\n #"Filtered Rows" = Table.SelectRows(#"Renamed Columns", each ([PRODUCT_DESCRIPTION] <> "Sharer Unlimited NZ & Aus mins + Unlimited NZ & Aus texts" and [PRODUCT_DESCRIPTION] <> "Sharer with Data Stretch"))\nin\n #"Filtered Rows"'}''']})
myDF
#why isn't this working?
#this regex works on REGEX 101 : https://regex101.com/r/ILnpS0/1
regex_db =r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF['SQLDB connection2'] = myDF['conn_str'].str.extract(regex_db ,expand=True)
myDF
#This is a simplified version of the above REGEX, and works to extracts the text "DB2.Database"
#This works fine
regex_db2 =r'(DB2.Database)'
myDF['SQLDB connection1'] = myDF['conn_str'].str.extract(regex_db2 ,expand=True)
myDF
我哪里做错了有什么建议吗?
1条答案
按热度按时间ygya80vv1#
试着在dot all模式下运行你的正则表达式,这样
.*
就可以跨换行符匹配: