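How can I use regular expressions in XSLT to manipulate the text of an element while still processing its child nodes and their attributes (using a TEI Stylesheets profile)?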

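I am currently writing a profile for the TEI XSLT Stylesheets (https://tei-c.org/release/doc/tei-xsl/) to customise the transformation from MS Word docx to TEI-conformant XML (and further on to valid HTML). One specific transformation I need to customise: I have a bunch of texts that refer to a specific archive of video sources. In the text these references look like [box: 001 roll: 01 start: 00:01:00.00]. I want to use a regular expression to find these references and generate a TEI-conformant tei:media element inside a tei:figure element. This works well when the reference is in its own paragraph. But different authors put references inside their text paragraphs (element tei:p). Here the challenge begins, because these paragraphs may contain other elements such as tei:note or tei:hi that should be kept intact and processed adequately. Unfortunately, the XSLT instruction xsl:analyze-string creates substrings, so you cannot use xsl:apply-templates on them, only xsl:copy-of. This works for xsl:matching-substring, but as described above the xsl:non-matching-substring contains some other elements (with attributes) that should be processed.
The TEI Stylesheets transformations are quite complex and run in several passes. At the stage where I want to intervene with my profile, I already have a tei element p for my paragraphs. For example: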

<p>This is my paragraph with a note <note place="foot">This is my note</note> and it is <hi rend="italic">important</hi> that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive [box: 001 roll: 01 start: 00:01:10.12] that should be transformed into a valid tei:media element.</p>

My transformation so far (simplified):

<xsl:template match="tei:p" mode="pass2">
  <p>
  <xsl:choose>
   <xsl:when test="matches(., '\[[Bb]ox:.+?\]')">
    <xsl:analyze-string select="." regex="\[box: (\d+) roll: (\d+) start: ((\d\d):(\d\d):(\d\d)\.(\d\d))\]">
     <xsl:matching-substring>
      <xsl:element name="ref">
       <xsl:attribute name="target">
        <xsl:value-of select="concat('https://path-to-video-page/',regex-group(1),'-',regex-group(2),'/',regex-group(4),'-',regex-group(5),'-',regex-group(6),'-',regex-group(7))"/>
       </xsl:attribute>
       <xsl:value-of select="concat('(box: ',regex-group(1),' roll: ',regex-group(2),' @ ',regex-group(4),'h ',regex-group(5),'m ',regex-group(6),'s)')"/>
      </xsl:element>
      
      <figure place="margin">
       <xsl:element name="head">
        <xsl:value-of select="concat('Sequence from box: ',regex-group(1),' roll: ',regex-group(2))"/>
       </xsl:element>
       <xsl:element name="media">
        <xsl:attribute name="mimeType">video/mp4</xsl:attribute>
         <xsl:attribute name="url">
          <xsl:value-of select="concat('https://path-to-video/',regex-group(1),'-',regex-group(2),'.mp4')"/>
         </xsl:attribute>
         <xsl:attribute name="start">
           <xsl:value-of select="regex-group(3)"/>
         </xsl:attribute>
       </xsl:element>
      </figure>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
      <xsl:copy-of select="."/>
     </xsl:non-matching-substring>
    </xsl:analyze-string>
   </xsl:when>
   <xsl:otherwise>
    <xsl:apply-templates mode="pass2"/>
   </xsl:otherwise>
  </xsl:choose>
  </p>
 </xsl:template>

Result:

<p>This is my paragraph with a note This is my note and it is important that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive <ref target="https://path-to-video-page/001-01/00-01-10-12">(box: 001 roll: 01 @ 00h 01m 10s)</ref>
<figure rend="margin">
   <head rend="none">Sequence from box: 001 roll: 01</head>
   <media mimeType="video/mp4" url="path-to-video/001-01.mp4" start="00:01:10.12"/>
</figure> that should be transformed into a valid tei:media element.</p>

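Now I am stuck. Is it possible to manipulate the matching content of the text in a p element with a regular expression while keeping the "node character" of the non-matching parts for further processing? Or have I reached a dead end and should I stop using XML for this purpose? The alternative I am considering is to leave the references as text in the XML and post-process the resulting XML/HTML files with a Python script. But it would be more elegant to do everything in XSLT if possible.
Thanks for any advice, Olaf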

The solution is simple: change the template match to

xsl:template match="tei:p//text()"

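When applied to tei:p, xsl:analyze-string reduces the whole element to its string value, which the regular expression then parses, so all child elements are lost. Matching only the text nodes with tei:p//text() preserves the rest of the element structure of tei:p as well as its parent, ancestor and sibling elements. xsl:analyze-string then operates on the text only and leaves everything else to other templates or to the default identity transformation.
Many tutorials and examples for xsl:analyze-string apply it to a whole element, because they only want to extract some information for further processing and leave the original element behind. If you want to use xsl:analyze-string to alter the text of an element that you want to keep as an element, it is important to apply it to the text nodes only.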
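For illustration, here is a minimal sketch of how the text-node template could look as a standalone XSLT 2.0 stylesheet. It reuses the regular expression and URL patterns from the question. The identity template and the match="/" entry point are assumptions added only so the sketch can run outside the TEI Stylesheets, where the pass2 machinery is already provided, and the input p is assumed to be in the TEI namespace.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- Hypothetical entry point so the sketch runs on its own;
       in a real profile the TEI Stylesheets start pass2 themselves. -->
  <xsl:template match="/">
    <xsl:apply-templates mode="pass2"/>
  </xsl:template>

  <!-- Identity template (assumption): stands in for the existing pass2
       templates that copy or process note, hi and their attributes. -->
  <xsl:template match="@*|node()" mode="pass2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="pass2"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text nodes below tei:p are analysed, so the element
       structure around them is never flattened to a string. -->
  <xsl:template match="tei:p//text()" mode="pass2">
    <xsl:analyze-string select="."
        regex="\[box: (\d+) roll: (\d+) start: ((\d\d):(\d\d):(\d\d)\.(\d\d))\]">
      <xsl:matching-substring>
        <!-- Groups: 1 = box, 2 = roll, 3 = full timecode, 4-7 = its parts -->
        <ref target="{concat('https://path-to-video-page/', regex-group(1), '-', regex-group(2), '/', regex-group(4), '-', regex-group(5), '-', regex-group(6), '-', regex-group(7))}">
          <xsl:value-of select="concat('(box: ', regex-group(1), ' roll: ', regex-group(2), ' @ ', regex-group(4), 'h ', regex-group(5), 'm ', regex-group(6), 's)')"/>
        </ref>
        <figure place="margin">
          <head>
            <xsl:value-of select="concat('Sequence from box: ', regex-group(1), ' roll: ', regex-group(2))"/>
          </head>
          <media mimeType="video/mp4"
                 url="{concat('https://path-to-video/', regex-group(1), '-', regex-group(2), '.mp4')}"
                 start="{regex-group(3)}"/>
        </figure>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Plain text between references is passed through unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because the match pattern is tei:p//text(), text inside tei:note or tei:hi within a paragraph is analysed as well. A reference that is split across several text nodes, for example by inline markup inside the brackets, would not be matched by this approach.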
Thanks to @Martin Honnen for the suggestion in a comment on my question.
