Ruby: parsing a PDF to get the text between the address and the date

0pizxfdo, posted on 2023-05-28 in Ruby

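I want to extract one piece of information (the Rel type) from a parsed PDF. The problem is that the text sometimes ends up wedged between the address and the date. I have attached an image of the PDF page. When I parse the page, I get the following text.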

Court Order -  01/30/2023    400 Block HANLON WAY\n                                                                                             Own\n                                                                                             Recognizance\n\n

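How can I get rel_type = Court Order - Own Recognizance? I have tried the regex below, but it only gives me the complete rel type when the value is not split across several lines.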

rel_typesmacthes = total_page.scan(/([MF])\s+(\d{2})\s+(\d{3})\s+([A-Z]{3}|t Specif t Spee)?\s+([A-Z]{3}|t Specif| t Spee)?\s+(.+)?(\d{1,2}\/\d{1,2}\/\d{2}\s+\d{1,2}:\d{2}\s+(?:am|pm))\s+(.+)?\s+(?:(\d{1,2}\/\d{1,2}\/2023))?/)

Here is the page text, so you can see how it looks:

Name                                       BookDate Time                DateOfBirth              Booking #            Bail Amount

                    MARK,DEONTAE                                   1/28/23 9:40 am               02/25/1998              CC23NM711        $0 Race          Gen    Height   Weight    Hair  Eyes     Job Description    Arrest Date Time   Rel type       Release Date Arrest Location

BLACK           M      72       159     BLK    BRO                     1/28/23 8:26 am                                 800 Block VAQUEROS AVE RODEO   Arrest Type

     On View

        Charge                                     Charge Description
              148(A)(1) PC                         OBSTRUCT/ETC PUB OFCR/ETC

        Charge                                     Charge Description

              273.6(A) PC                          VIO ORD:PREVNT DOMES VIOL   Arrest Type

     Parole Hold

        Charge                                     Charge Description

              3000.08 PC                           VIOLATION OF PAROLE
        Charge                                     Charge Description

              AB109                                AB109 REALIGNMENT

                        Name                                       BookDate Time                DateOfBirth              Booking #        Bail Amount
                MENDOZA-FREGOZA,LIDIO                              1/28/23 3:25 pm               10/20/1989              CC23NM719        $0

Race          Gen    Height   Weight    Hair  Eyes     Job Description Arrest Date Time   Rel type       Release Date Arrest Location HISPANIC        M      67       175     BLK    BRO                     1/28/23 2:42 pm    Court Order -  01/30/2023    400 Block HANLON WAY
                                                                                             Own
                                                                                             Recognizance

  Arrest Type
     Bench Warrant

        Charge                                     Charge Description

              10851(A) VC                          VEHICLE THEFT

        Charge                                     Charge Description
              466 PC                               POSSESS BURGLARY TOOLS

        Charge                                     Charge Description

              496D(A) PC                           POSS STOLEN VEH/VES/ETC

        Charge                                     Charge Description
              594(A) PC                            VANDALISM

        Charge                                     Charge Description

              978.5 PC                             BENCH WARRANT:FTA:FELONY

Answer 1 (mzsu5hc0)

From your data, I get:

s = ["Race          Gen    Height   Weight    Hair  Eyes     Job Description Arrest Date Time   Rel type       Release Date Arrest Location HISPANIC        M      67       175     BLK    BRO                     1/28/23 2:42 pm    Court Order -  01/30/2023    400 Block HANLON WAY",
     "                                                                                               Own",
     "                                                                                             Recognizance"]

To read the text from the PDF and process it line by line, as in the array shown above:

require 'pdf/reader'

reader = PDF::Reader.new(pdf_file_path)

reader.pages.each do |page|
  page.text.lines.each do |line|
    # handle the page text here, one line at a time
  end
end
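
One way to get from the page text to small arrays like `s` is to group each header line with the non-blank lines directly below it. This is only a sketch: `rel_type_blocks` is a hypothetical helper, not part of pdf-reader, and it assumes that the wrapped fragments follow the header line without a blank line in between, as they do in the sample above.

```ruby
# Hypothetical helper (an assumption, not a pdf-reader API): returns one array
# per line containing the header label, holding that line plus the non-blank
# lines that follow it directly.
def rel_type_blocks(page_text, header = "Rel type")
  lines = page_text.lines.map(&:chomp)
  header_indexes = lines.each_index.select { |i| lines[i].include?(header) }
  header_indexes.map do |i|
    [lines[i]] + lines[(i + 1)..].take_while { |l| !l.strip.empty? }
  end
end

# Shortened stand-in for the page text in the question.
sample_lines = [
  "Time   Rel type       Release Date Arrest Location 2:42 pm    Court Order -  01/30/2023",
  "       Own",
  "       Recognizance",
  "",
  "  Arrest Type"
]

blocks = rel_type_blocks(sample_lines.join("\n"))
puts blocks.length
puts blocks.first
```

With real data you would pass `page.text` instead of the joined sample lines.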

Scan the line for the column header and record where it starts and stops:

# offset(0) returns [begin, end] of the match; both variables are nil if there is no match
start, stop = s[0].match(/Rel type/)&.offset(0)

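Check whether the "Rel type" header was found. If it was, split the current line and take the data text from it and from the following lines:

if start
  # everything after the last header label is the data row
  data = s[0].scan(/Rel type.*Arrest Location(.*)/)
  puts data[0][0][90..105].strip
  puts s[1][start..-1].strip
  puts s[2][start..-1].strip
end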

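Initially I thought of using start and stop to determine whether a column contains text on several lines. But the columns are fixed (in my experience scraping PDFs), so the position in the text should be the same on every line and you can use a fixed offset. If that is not the case, use start and stop.
Running this code gives: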

Court Order -
Own
Recognizance
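
To end up with the single value the question asks for, the fragments can be joined. The following is a minimal sketch rather than a definitive implementation. `rel_type_from` is a hypothetical helper, and it assumes the layout of the sample: the header and the data row share one line, the first fragment sits between the arrest time and a release date in MM/DD/YYYY form, and continuation lines are blank up to the "Rel type" column.

```ruby
# Hypothetical helper (not from the original answer): joins a wrapped
# "Rel type" cell into one string, or returns nil if nothing is found.
def rel_type_from(lines)
  first_line = lines.first.to_s
  start = first_line.index("Rel type")
  return nil unless start

  # Data row: whatever follows the last header label on the same line.
  row = first_line[/Arrest Location(.*)/, 1].to_s
  # First fragment: the text between the arrest time and the release date.
  head = row[/\d{1,2}:\d{2}\s+(?:am|pm)\s+(.*?)\s+\d{1,2}\/\d{1,2}\/\d{4}/, 1]
  return nil unless head

  # Continuation lines: blank up to the "Rel type" column, text after it.
  tail = lines.drop(1).take_while do |line|
    line.length > start && line[0...start].strip.empty? && !line.strip.empty?
  end

  [head, *tail.map(&:strip)].join(" ")
end

# Shortened stand-in for the array s above.
record = [
  "Arrest Date Time   Rel type       Release Date Arrest Location " \
    "1/28/23 2:42 pm    Court Order -  01/30/2023    400 Block HANLON WAY",
  (" " * 24) + "Own",
  (" " * 22) + "Recognizance"
]

puts rel_type_from(record)
```

A record that has a rel type but no release date would not match the regex in this sketch, so that case would need its own handling.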
