How to extract highlighted text from a PDF using iTextSharp?

Posted by 4xrmg8kj on 2022-12-24 in .NET


According to the following post: iTextSharp PDF Reading highlighted text (highlight annotations) using C#
this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

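extracts the PDF annotations. But why does the following, otherwise identical, code not work for highlights (specifically, PdfName.HIGHLIGHT does not work)?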

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}


Source: https://stackoverflow.com/questions/26652411/how-to-extract-highlighed-text-from-pdf-using-itextsharp

2 Answers

t9aqgxwy1#

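Please take a look at Table 30 in ISO-32000-1 (a.k.a. the PDF reference). It is titled "Entries in a page object". Among these entries, you will find a key named Annots. Its value is:
(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").
You will not find an entry with a key such as Highlight, so it is only normal that the returned array is null when you have this line: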

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you did before:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

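Now you need to loop over this array and look for annotations whose Subtype equals Highlight. This type of annotation is listed in Table 169 of ISO-32000-1, titled "Annotation types".
In other words, your assumption that a page dictionary contains an entry with the key Highlight was wrong. If you read the whole specification, you will also discover another false assumption: you assume that the highlighted text is stored in the Contents entry of the annotation. This reveals a lack of understanding of the nature of annotations versus page content.
The text you are looking for is stored in the content stream of the page. The content stream of a page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (in its QuadPoints array) and use these coordinates to parse the text that is present in the page content at those coordinates.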
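
As a rough illustration of the coordinate step, a QuadPoints array holds 8 numbers per highlighted quadrilateral (x1 y1 x2 y2 x3 y3 x4 y4), so a highlight spanning several lines carries several quadrilaterals. The sketch below is plain C# with no iTextSharp dependency; the names QuadPointsHelper, Box, and ToBoxes are hypothetical and not part of any library. It assumes the numbers have already been read from the annotation into a float array, and it takes the min/max of each quadrilateral so the result does not depend on the order in which a viewer wrote the four corners.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class QuadPointsHelper
{
    // Axis-aligned box: lower-left (Llx, Lly) and upper-right (Urx, Ury).
    public struct Box
    {
        public float Llx, Lly, Urx, Ury;
    }

    // Splits a flat QuadPoints array (8 numbers per quadrilateral)
    // into one bounding box per quadrilateral.
    public static List<Box> ToBoxes(float[] quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException(nameof(quadPoints));
        if (quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints length must be a multiple of 8.");

        var boxes = new List<Box>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Visit the four (x, y) corners of this quadrilateral.
            for (int p = 0; p < 8; p += 2)
            {
                float x = quadPoints[q + p];
                float y = quadPoints[q + p + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }

            boxes.Add(new Box { Llx = minX, Lly = minY, Urx = maxX, Ury = maxY });
        }
        return boxes;
    }
}
```

Each resulting box could then be used as the region for a text extraction pass over the page content, one pass per quadrilateral, which should track a multi-line highlight more closely than a single bounding rectangle around the whole annotation.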

Posted 2022-12-24

jv4diomz2#

Here is a complete example of extracting highlighted text using iTextSharp:

// Requires: using System; using System.IO;
// using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public void GetRectAnno()
{
    string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
    string filePath = appRootDir + "/PDFs/" + "anot.pdf";

    try
    {
        using (PdfReader reader = new PdfReader(filePath))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue; // this page has no annotations

                foreach (PdfObject annot in annots.ArrayList)
                {
                    // Resolve the indirect reference to the annotation dictionary
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                    PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);

                    // Only process highlight annotations (also safe when Subtype is missing)
                    if (!PdfName.HIGHLIGHT.Equals(subType))
                        continue;

                    // Rect is the bounding box of the whole annotation: llx, lly, urx, ury.
                    // For a multi-line highlight it can cover more text than the QuadPoints do.
                    PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                    Console.WriteLine("Highlight at Rectangle {0} with QuadPoints {1}",
                        coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                    Rectangle rect = new Rectangle(
                        coordinates.GetAsNumber(0).FloatValue,
                        coordinates.GetAsNumber(1).FloatValue,
                        coordinates.GetAsNumber(2).FloatValue,
                        coordinates.GetAsNumber(3).FloatValue);

                    // Extract only the text rendered inside the rectangle
                    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                    ITextExtractionStrategy strategy =
                        new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                    // Show the extracted text on the console
                    Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                }
            }
        }
    }
    catch (Exception ex)
    {
        // Report the failure instead of silently swallowing it
        Console.Error.WriteLine("Failed to extract highlighted text: " + ex.Message);
    }
}

Posted 2022-12-24
