如何使用ApacheDrill从amazons3查询数据?

rsl1atfo  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(425)

使用spark将astextfile保存到s3之后,再像hadoop那样格式化。文件在bucket中的格式是这样的。

格式为年/月/日期/时间戳,数据为第0000部分,文件为json格式。
我配置钻并指向我的桶名

{
  "type": "file",
  "enabled": true,
  "connection": "s3://com.giaosudau.win-bid",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}

我选择的存储是确定的。但不知道如何选择和获取日期、小时或文件的数据。

我也尝试了一些方法,比如选择一个文件,但没有成功:

此处为文件示例

{"auctionId":"xx","bidRequestString":{"id":"xx","timestamp":"2015-07-29T08:31:00.413Z","isTest":false,"url":"http://www.222.3232.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.34343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"332","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=34343\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"23848834:strategy","biddingMainAccount":"1212312","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"123232","xchg":"9849839"},"winPrice":"100USD/1M"}
{"auctionId":"343","bidRequestString":{"id":"344","timestamp":"2015-07-28T08:31:00.413Z","isTest":false,"url":"http://www.xx.xx.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"342342342","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=xxx\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"829838828928932:strategy","biddingMainAccount":"3434","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"cc","xchg":"cc"},"winPrice":"100USD/1M"}

如何从中获取数据?

xxb16uws

xxb16uws1#

下面是一些使用数据和本地文件系统而不是s3的示例。我认为您的文件没有任何扩展名,但是如果不是这样,如果您的文件有.json扩展名,您可以这样查询它们(使用表别名t来解决歧义):

use dfs.`root`;
+-------+---------------------------------------+
|  ok   |                summary                |
+-------+---------------------------------------+
| true  | Default schema changed to [dfs.root]  |
+-------+---------------------------------------+
1 row selected (0.079 seconds)
select t.auctionID from dfs.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.093 seconds)

如果您的文件没有扩展名,请修改存储插件配置,将defaultinputformat设置为“json”;否则,默认情况下,drill将假定文件为parquet格式。例如,我在根工作区中设置defaultinputformat:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
. . .

查询没有扩展名的文件时,需要指定在其中定义defaultinputformat的工作区,在本例中为root:

select t.auctionID from dfs.`root`.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.095 seconds)

请参阅关于处理类型差异的文档:https://drill.apache.org/docs/json-data-model/#handling-类型差异

相关问题