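The following Excel VBA code uses a regular expression to extract section numbers from HTML files. However, the pattern contains a negative lookbehind, which the VBA regex engine (VBScript.RegExp) does not support: "(?<!tbl"")>(\d(\.\d)+)<"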
Sub GetAllSectionNumbers()
    LRb = Cells(Rows.Count, "B").End(xlUp).Row
    Range("B7:C" & LRb).ClearContents

    Dim fileDialog As fileDialog
    Set fileDialog = Application.fileDialog(msoFileDialogOpen)
    fileDialog.AllowMultiSelect = True
    fileDialog.Title = "Select HTML files"
    fileDialog.Filters.Clear
    fileDialog.Filters.Add "HTML files", "*.htm;*.html", 1
    If fileDialog.Show <> -1 Then Exit Sub

    Dim file As Variant
    For Each file In fileDialog.SelectedItems
        Dim fileContents As String
        Open file For Input As #1
        fileContents = Input$(LOF(1), 1)
        Close #1

        Dim regex As Object
        Set regex = CreateObject("VBScript.RegExp")
        regex.Pattern = "(?<!tbl"")>(\d(\.\d)+)<"
        regex.Global = True
        regex.IgnoreCase = True
        regex.MultiLine = True
        TRET = regex.Pattern

        filePath = file
        fileFolder = Left(filePath, InStrRev(filePath, "\"))
        fileNameSource = Mid(filePath, InStrRev(filePath, "\") + 1, 100)

        Dim match As Object
        Set match = regex.Execute(fileContents)
        Dim i As Long
        For i = 0 To match.Count - 1
            LRb = Cells(Rows.Count, "B").End(xlUp).Row + 1
            Range("B" & LRb).Value = match.Item(i).SubMatches(0)
            Range("C" & LRb).Value = fileNameSource
        Next i
    Next file

    MsgBox "Done!"
End Sub
Is there an alternative regex solution for this problem?
1 Answer
When you are extracting, the usual approach is the "best regex trick ever": match what you do not need, and match *and capture* what you do need.
The regex for this specific case looks like this:
In code, it looks like this:
Next, in your code, check whether the match.SubMatches(0) value is actually present; if it is, take it, because that is what you need. See the regex demo.
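As a rough illustration of the trick, here is a minimal sketch; it is not necessarily the exact pattern from the answer. The helper name ExtractSectionNumbers is hypothetical, and using \d+ instead of the question's \d (so that multi-digit parts such as 1.10 also match) is an assumption. The left branch of the alternation consumes any number that directly follows tbl"> without capturing it, while the right branch captures every other number in group 1.

```vba
' Hypothetical helper (not from the original post): returns the section
' numbers found in html, skipping any number that directly follows tbl">.
Function ExtractSectionNumbers(ByVal html As String) As Collection
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")

    ' Left branch:  match and discard numbers preceded by tbl">.
    ' Right branch: match and capture all other numbers in group 1.
    ' Assumption: \d+ is used so that multi-digit parts also match.
    regex.Pattern = "tbl"">\d+(?:\.\d+)+<|>(\d+(?:\.\d+)+)<"
    regex.Global = True
    regex.IgnoreCase = True

    Dim results As New Collection
    Dim m As Object
    For Each m In regex.Execute(html)
        ' Group 1 is empty whenever the left (unwanted) branch matched,
        ' so only non-empty submatches are kept.
        If Len(m.SubMatches(0)) > 0 Then results.Add m.SubMatches(0)
    Next m

    Set ExtractSectionNumbers = results
End Function
```

In the loop from the question, the same Len(...) > 0 check would wrap the two Range(...).Value assignments, so that rows are written only for matches where group 1 actually captured something.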