iis: Reading a large XML file from a web server without splitting it into smaller chunks

esyap4oy, posted on 2022-11-12 in Other

I am downloading a file from a third-party server like this:

Try
    req = DirectCast(HttpWebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)
    req.Timeout = 100000 '100 seconds
    Resp = DirectCast(req.GetResponse(), HttpWebResponse)
    reader = New StreamReader(Resp.GetResponseStream)
    responseString = reader.ReadToEnd()
Catch ex As Exception

End Try

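The file my.xml is 1.2 GB, and I get the error "Exception of type 'System.OutOfMemoryException' was thrown". When I open Windows Task Manager, I see that memory usage is only at 70% of the total available memory, and the IIS worker process does not grow to use all of the system memory. Then I found this: https://learn.microsoft.com/en-us/archive/blogs/tom/chat-question-memory-limits-for-32-bit-and-64-bit-processes, so failing at 70% sounds about right.
So now I am considering splitting the file into more manageable chunks. But how can I do that *without creating separate files*? Is there a way to load, say, 100 MB into memory at a time (respecting XML node boundaries), or to read X XML nodes at a time?
When I search Google for "read large XML file from web server without splitting into smaller chunks", all I get is file-splitting tools.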

Update 1

Following Lex Li's suggestion, I searched and found this tutorial: https://learn.microsoft.com/en-us/dotnet/standard/linq/perform-streaming-transform-large-xml-documents
So I translated the code, and it works the same way as in the tutorial:

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While reader.Read()

            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Customer" Then

                While reader.Read()

                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Name" Then
                        name = TryCast(XElement.ReadFrom(reader), XElement)
                        Exit While
                    End If
                End While

                While reader.Read()
                    If reader.NodeType = XmlNodeType.EndElement Then Exit While

                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Item" Then
                        item = TryCast(XElement.ReadFrom(reader), XElement)

                        If item IsNot Nothing Then
                            Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                            tempRoot.Add(item)
                            Yield item
                        End If
                    End If
                End While
            End If
        End While
    End Using
End Function

Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement) = From el In StreamCustomerItem("https://www.example.com/source.xml") Select New XElement("Item", New XElement("Customer", CStr(el.Parent.Element("Name"))), New XElement(el.Element("Key")))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True

    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
        xw.WriteStartElement("Root")

        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next

        xw.WriteEndElement()
    End Using

End Sub

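The example above transforms source.xml into Output.xml, but I just want to read each product node exactly as it is (no transformation needed), one node at a time, so that I can handle large XML files.
I tried rewriting it to extract values from my own XML, which has the structure shown further below. As a first step, I tried to get something out of my XML file like this: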

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Id" Then
                name = TryCast(XElement.ReadFrom(reader), XElement)
                item = TryCast(XElement.ReadFrom(reader), XElement)

                If item IsNot Nothing Then
                    Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                    tempRoot.Add(item)
                    Yield item
                End If

                Exit While
            End If
        End While
    End Using
End Function

Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement)

    srcTree = From el In StreamCustomerItem("https://www.example.com/mysource.xml")
              Select New XElement("product", New XElement("product", CStr(el.Parent.Element("Id"))))

    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True

    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
        xw.WriteStartElement("Root")

        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next

        xw.WriteEndElement()
    End Using

End Sub

This only writes <Root /> to my Output.xml.
mysource.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<products>
    <product>
        <Id>
            <![CDATA[122854]]>
        </Id>
        <Type>
            <![CDATA[restaurant]]>
        </Type>
        <features>
            <wifi>
                <![CDATA[included]]>
            </wifi>
        </features>         
    </product>
</products>

So, to summarize my question: how can I read individual product nodes as-is from "mysource.xml" without loading the entire file into memory?

Update 2

Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()

        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                Dim el As XElement = TryCast(XElement.ReadFrom(reader), XElement)
                If el IsNot Nothing Then Yield el
            Else
                reader.Read()
            End If
        End While
    End Using
End Function            

Private Shared Sub Main()
    Dim products As IEnumerable(Of XElement) = From el In StreamCustomerItem("source.xml") Select el

    For Each product As XElement In products
        ' process each `product` element here
        Console.WriteLine(product)
    Next
End Sub

My full test file is shared via OnionShare (download it with Tor Browser):
http://jkntfybog2s5cc754sn7mujvyaawdqxd4q5imss66x3hsos34rrbjrid.onion Key: yellow building

zf9nrax1

Answer #1

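The important thing is to make sure you never load the whole file, but instead "stream" everything (in the general sense: bytes, characters, XML nodes, and so on) from start to finish (that is, from server to client).
For network bytes, this means you must use the raw Stream object.
For XML nodes, this means you can use an XmlReader (*not* an XmlDocument, which loads a full Document Object Model from the stream). In this case you can use XmlTextReader, which "represents a reader that provides fast, non-cached, forward-only access to XML data".
Here is a piece of C# code (easily converted to VB.NET) that does this, yet still builds a small intermediate XML document for each product in the multi-GB file, using the XmlReader methods ReadInnerXml and/or ReadOuterXml: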

using System;
using System.Net;
using System.Xml;

var req = (HttpWebRequest)WebRequest.Create("https://www.yourserver.com/spotahome_1.xml");
using (var resp = req.GetResponse())
using (var stream = resp.GetResponseStream())
using (var xml = new XmlTextReader(stream))
{
    var count = 0;
    xml.MoveToContent();
    while (!xml.EOF)
    {
        if (xml.NodeType == XmlNodeType.Element && xml.Name == "product")
        {
            // using XmlDocument is ok here since we know
            // a product is not too big
            // but we could continue with the reader too
            var product = new XmlDocument();
            // ReadOuterXml already moves the reader past </product>,
            // so Read() must not be called again in this branch,
            // otherwise an adjacent <product> could be skipped
            product.LoadXml(xml.ReadOuterXml());
            Console.WriteLine(count++);
        }
        else
        {
            xml.Read();
        }
    }
}
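Since the question is in VB.NET, here is a rough VB.NET sketch of the same streaming idea. It is only an illustration under a few assumptions: the URL is the placeholder from the question, the module name StreamProducts is made up, and XmlReader.Create is swapped in for XmlTextReader (Microsoft's documentation recommends it for new code). The loop checks EOF instead of calling Read() on every pass, because ReadOuterXml already advances the reader past the product it returns.

```vbnet
Imports System
Imports System.Net
Imports System.Xml

Module StreamProducts
    Sub Main()
        ' Assumption: placeholder URL taken from the question.
        Dim req = DirectCast(WebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)

        Using resp As WebResponse = req.GetResponse()
            Using stream As IO.Stream = resp.GetResponseStream()
                Using xml As XmlReader = XmlReader.Create(stream)
                    Dim count As Integer = 0
                    xml.MoveToContent()

                    While Not xml.EOF
                        If xml.NodeType = XmlNodeType.Element AndAlso xml.Name = "product" Then
                            ' Only this one product is materialized in memory.
                            ' ReadOuterXml leaves the reader on the node after </product>,
                            ' so there is no extra Read() in this branch.
                            Dim product As New XmlDocument()
                            product.LoadXml(xml.ReadOuterXml())
                            count += 1
                        Else
                            xml.Read()
                        End If
                    End While

                    Console.WriteLine(count)
                End Using
            End Using
        End Using
    End Sub
End Module
```

At no point does the whole 1.2 GB response exist as a single string, which is what ReadToEnd() in the question forces.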

PS: Ideally you would use async/await code with the asynchronous counterparts ReadInnerXmlAsync/ReadOuterXmlAsync, but that is another story and is easy to set up.
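For what the async variant might look like in VB.NET, here is a minimal sketch. It assumes HttpClient is available (on .NET Framework that needs a reference to System.Net.Http), and the names CountProductsAsync and StreamProductsAsync are hypothetical. The key detail is that the reader's async methods only work when XmlReaderSettings.Async is set to True.

```vbnet
Imports System
Imports System.Net.Http
Imports System.Threading.Tasks
Imports System.Xml

Module StreamProductsAsync
    ' Hypothetical helper: counts <product> elements without buffering the whole file.
    Async Function CountProductsAsync(url As String) As Task(Of Integer)
        Dim settings As New XmlReaderSettings()
        settings.Async = True ' required before calling any *Async reader method

        Dim count As Integer = 0
        Using client As New HttpClient()
            ' GetStreamAsync hands back the response body as a stream.
            Using stream = Await client.GetStreamAsync(url)
                Using xml As XmlReader = XmlReader.Create(stream, settings)
                    Await xml.MoveToContentAsync()

                    While Not xml.EOF
                        If xml.NodeType = XmlNodeType.Element AndAlso xml.Name = "product" Then
                            ' One product as a string; the reader is now past </product>.
                            Dim outerXml As String = Await xml.ReadOuterXmlAsync()
                            If outerXml.Length > 0 Then count += 1
                        Else
                            Await xml.ReadAsync()
                        End If
                    End While
                End Using
            End Using
        End Using
        Return count
    End Function

    Sub Main()
        ' Assumption: placeholder URL taken from the question.
        Dim total = CountProductsAsync("https://www.example.com/my.xml").GetAwaiter().GetResult()
        Console.WriteLine(total)
    End Sub
End Module
```

Inside an ASP.NET page or handler you would Await the function rather than block on GetResult().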

s2j5cfk0

Answer #2

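Have you checked out this document from Microsoft? https://learn.microsoft.com/en-us/dotnet/standard/linq/stream-xml-fragments-xmlreader
I had a similar problem, but while reading a very large JSON file. What I did was read up to the token that marks the start of a product, and then iterate over those tokens. That way you never load the whole file into memory. I believe the same approach can be applied to XML.
Hope that helps.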
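Applied to the XML in the question, the "jump to the start of each product" idea could look roughly like the sketch below. The names ProductStream and StreamProducts are made up for illustration, and the file name mysource.xml is taken from the question. ReadToFollowing moves to the next product start tag, and ReadSubtree hands back a reader limited to that one element, so each product can be loaded as a small XElement on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml
Imports System.Xml.Linq

Module ProductStream
    ' Yields one <product> element at a time; the full document is never loaded.
    Iterator Function StreamProducts(uri As String) As IEnumerable(Of XElement)
        Using reader As XmlReader = XmlReader.Create(uri)
            ' Advance to the next <product> start tag, if there is one.
            While reader.ReadToFollowing("product")
                ' The subtree reader only sees the current <product> element.
                Using subtree As XmlReader = reader.ReadSubtree()
                    Yield XElement.Load(subtree)
                End Using
                ' Closing the subtree reader leaves the outer reader on </product>,
                ' so the next ReadToFollowing call cannot skip a sibling product.
            End While
        End Using
    End Function

    Sub Main()
        For Each product As XElement In StreamProducts("mysource.xml")
            ' Trim() removes any whitespace surrounding the CDATA value.
            Console.WriteLine(product.Element("Id").Value.Trim())
        Next
    End Sub
End Module
```

Because the function is an iterator, nothing is read from the source until the For Each loop asks for the next product.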

yquaqz18

Answer #3

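This is an old-school approach, but I usually keep track of the XPath of the current position in the XML file, and then use that XPath to decide what to do with each value.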

Imports System
Imports System.Xml

Module Program
  Sub Main(args As String())
    Dim filename = "C:\Junk\Junk.xml"    
    Using reader As XmlReader = XmlReader.Create(filename)
      Dim xpath = ""
      Dim currentProduct As Product = Nothing
      Do While reader.Read
        Select Case reader.NodeType
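          Case XmlNodeType.Element
            ' Empty elements (<x/>) produce no EndElement node, so they must
            ' neither extend the path nor be mistaken for a new product.
            If Not reader.IsEmptyElement Then
              xpath &= "/" & reader.Name
              If xpath = "/products/product" Then
                ' A new product starts: print the previous one first.
                If currentProduct IsNot Nothing Then
                  Console.WriteLine(currentProduct)
                End If
                currentProduct = New Product
              End If
            End If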
          Case XmlNodeType.EndElement
            xpath = xpath.Substring(0, xpath.LastIndexOf("/"))
          Case XmlNodeType.CDATA
            Select Case xpath
              Case "/products/product/Id"
                currentProduct.Id = reader.Value
              Case "/products/product/Type"
                currentProduct.ProductType = reader.Value
              Case "/products/product/features/wifi"
                If reader.Value = "included" Then
                  currentProduct.Wifi = True
                End If
            End Select
        End Select
      Loop
      If currentProduct IsNot Nothing Then
        Console.WriteLine(currentProduct)
      End If
    End Using
    Console.WriteLine("FINISHED")
  End Sub

  Class Product
    Public Property Id As String
    Public Property ProductType As String
    Public Property Wifi As Boolean
    Public Overrides Function ToString() As String
      Return $"{Id}-{ProductType}-{Wifi}"
    End Function    
  End Class
End Module
