Regular expression: Java, XML, Pig

mzillmmw posted on 2021-06-21 in Pig

Help! A few minutes of your time could save me hours!
I am using Pig to extract some information from XML shaped like this:

<Content>
<Name ><\Name>
<Data ><\Data>
<Data ><\Data>
<\Content>

So I used:

abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content') AS (content: chararray);

I only need a few specific pieces of information, and I don't know whether something like this is possible:

abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content','Data') AS (content: chararray,data: chararray);

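But I would like to avoid that. After XMLLoader, I have managed to extract the other information with regexes, except for the following (just one example of a possible character combination):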

<Data Name="Buffer">{&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.}<\Data>

My regular expressions:

1. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\Data\\> -- Unexpected character D at <\Data>
2. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\Data\\> -- I got nothing
3. \<Data Name=\"Buffer\"\>\{(.*)\}\<\\Data\> -- Unexpected character < at \<Data Name..
4. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\\Data\\> -- Unexpected character D at <\\\Data>

What I want to get:

&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.

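Edit:
I just realized I had made a huge mistake (the closing tag is </Data>, with a forward slash, not <\Data>) and found the answer: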

<Data Name=\\"Buffer\\">\\{(.*)\\}</Data\\>
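For anyone who wants to check the pattern outside Pig, here is a minimal Java sketch. It assumes that Pig hands the unescaped pattern to java.util.regex, so attempt 2 above ends up containing \D (the non-digit class) where the slash should be. The class name BufferRegexDemo, the helper extract, and the shortened SAMPLE record are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BufferRegexDemo {

    // Hypothetical sample record; the payload is shortened from the one in the question.
    static final String SAMPLE =
            "<Data Name=\"Buffer\">{&quot;$type&quot;System.Collections.Generic}</Data>";

    // Returns capture group 1 if the regex is found in the input, otherwise null.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Corrected pattern: the closing tag uses a forward slash.
        String fixed = "<Data Name=\"Buffer\">\\{(.*)\\}</Data>";
        // Attempt 2 after unescaping: "\D" matches one non-digit character,
        // so "<\Data>" would need "<", any non-digit, then "ata>" and cannot match "</Data>".
        String attempt2 = "<Data Name=\"Buffer\">\\{(.*)\\}<\\Data>";

        System.out.println(extract(fixed, SAMPLE));    // the text between the braces
        System.out.println(extract(attempt2, SAMPLE)); // null, i.e. "I got nothing"
    }
}
```

In Pig the same pattern would presumably go into REGEX_EXTRACT, with the backslashes doubled as in the line above.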

bttbmeg0 #1

A better way to parse this XML is to use the Java XPath API.
The following prints:
XXXYYYZZZ
111222333

import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ParseXML {

    public static void main(String... args) throws Exception {
        String input = "<Content><Data Name=\"Buffer\">XXXYYYZZZ</Data><Data Name=\"Buffer\">111222333</Data></Content>";
        String xpathExpression = "//Data[@Name='Buffer']";
        NodeList result = parseXML(input, xpathExpression);
        for (Node node : new NodeListWrapper(result)) {
            // getTextContent() on the element returns its text and is safe for empty elements.
            System.out.println(node.getTextContent());
        }
    }

    private static NodeList parseXML(String input, String xpathExpression) throws Exception {
        // Parse the XML string into a DOM tree, then evaluate the XPath expression against it.
        Document document = createDocument(input);
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression expression = xpath.compile(xpathExpression);
        return (NodeList) expression.evaluate(document, XPathConstants.NODESET);
    }

    private static Document createDocument(String input) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
        return documentBuilder.parse(new InputSource(new StringReader(input)));
    }

}

class NodeListWrapper extends AbstractList<Node> {
    private final NodeList nodeList;

    public NodeListWrapper(NodeList nodeList) {
        this.nodeList = nodeList;
    }

    @Override
    public Node get(int n) {
        return nodeList.item(n);
    }

    @Override
    public int size() {
        return nodeList.getLength();
    }
}

I have uploaded the source code for my answer to GitHub here.
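As a rough sketch of how the XPath approach might apply to the Buffer record from the question: the XML parser decodes &quot; into a real double quote, so the result differs slightly from the raw text a regex would capture. The class name BufferXPathDemo, the helper bufferPayload, and the brace stripping are assumptions for illustration and are not part of the code above.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BufferXPathDemo {

    // Returns the text of the first <Data Name="Buffer"> element without its
    // surrounding braces, or an empty string if there is no such element.
    static String bufferPayload(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first matching node.
            String text = xpath.evaluate("//Data[@Name='Buffer']", doc).trim();
            if (text.length() >= 2 && text.startsWith("{") && text.endsWith("}")) {
                return text.substring(1, text.length() - 1);
            }
            return text;
        } catch (Exception e) {
            // Wrap parser and XPath failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not extract Buffer payload", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical record, shortened from the example in the question.
        String xml = "<Content><Name>n</Name>"
                + "<Data Name=\"Buffer\">{&quot;$type&quot;System.Collections.Generic}</Data>"
                + "</Content>";
        // The entity &quot; comes back as a literal double quote.
        System.out.println(bufferPayload(xml));
    }
}
```

If the raw &quot; text is what you need downstream, the regex route keeps it untouched, whereas this route gives you the decoded value.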
