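I'm having trouble understanding the arguments to Pig's JsonLoader function. The JSON object is fairly large, and the part that gives me trouble is everything inside the "entities" field. If I remove that field, I can get JsonLoader() to work. Can someone help me work out the schema for this part? Here is the JSON for one tweet: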
{
"contributors": null,
"truncated": false,
"text": "North Korea Says US 'Hell-Bent on Regime Change': North Korea says US 'hell-bent on regime change' and threate... http://t.co/FM4GhdQAcG",
"in_reply_to_status_id": null,
"id": 452128135731884000,
"favorite_count": 0,
"source": "<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
"retweeted": false,
"coordinates": null,
"entities": {
"symbols": [],
"user_mentions": [],
"hashtags": [],
"urls": [
{
"url": "http://t.co/FM4GhdQAcG",
"indices": [
114,
136
],
"expanded_url": "http://abcn.ws/1jb6ANh",
"display_url": "abcn.ws/1jb6ANh"
}
]
},
"in_reply_to_screen_name": null,
"id_str": "452128135731884033",
"retweet_count": 0,
"in_reply_to_user_id": null,
"favorited": false,
"user": {
"follow_request_sent": null,
"profile_use_background_image": true,
"default_profile_image": false,
"id": 1484045802,
"profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/450180280033091584/ukwF1xQ1.jpeg",
"verified": false,
"profile_image_url_https": "https://pbs.twimg.com/profile_images/450177921198465024/5EbZX19P_normal.jpeg",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"followers_count": 178,
"profile_sidebar_border_color": "000000",
"id_str": "1484045802",
"profile_background_color": "FF3333",
"listed_count": 0,
"is_translation_enabled": false,
"utc_offset": -10800,
"statuses_count": 2900,
"description": "Unico Menor Con Flow Mi Watsshat 18297015049",
"friends_count": 103,
"location": "santo domingo",
"profile_link_color": "FF3333",
"profile_image_url": "http://pbs.twimg.com/profile_images/450177921198465024/5EbZX19P_normal.jpeg",
"following": null,
"geo_enabled": false,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/1484045802/1396166038",
"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/450180280033091584/ukwF1xQ1.jpeg",
"name": "Nïñø Mälø",
"lang": "es",
"profile_background_tile": true,
"favourites_count": 2,
"screen_name": "YeralMueka",
"notifications": null,
"url": "https://www.facebook.com/YeralMueka",
"created_at": "Wed Jun 05 04:41:09 +0000 2013",
"contributors_enabled": false,
"time_zone": "Santiago",
"protected": false,
"default_profile": false,
"is_translator": false
},
"geo": null,
"in_reply_to_user_id_str": null,
"possibly_sensitive": true,
"lang": "en",
"created_at": "Fri Apr 04 16:58:42 +0000 2014",
"filter_level": "medium",
"in_reply_to_status_id_str": null,
"place": null
}
2 Answers
Answer 1
You can use Twitter's elephant-bird library: https://github.com/kevinweil/elephant-bird
Here is an example that uses a custom JsonLoader to load JSON without specifying a schema: https://gist.github.com/neilkod/2898455
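To illustrate what loading without a schema means in practice, here is a minimal Python sketch. It is not elephant-bird's API and not Pig code; it only shows the idea of parsing each line into a generic nested map and reaching the `entities` part by key lookups instead of a declared schema. The function name `expanded_urls` and the trimmed sample line are made up for the illustration.

```python
import json

def expanded_urls(line):
    """Parse one JSON tweet and return the expanded URLs under entities.urls."""
    tweet = json.loads(line)
    # Look up nested keys defensively: any level may be absent or null.
    entities = tweet.get("entities") or {}
    urls = entities.get("urls") or []
    return [u.get("expanded_url") for u in urls]

# Hypothetical sample line, trimmed from the tweet in the question.
sample = (
    '{"id_str": "452128135731884033", "entities": {"hashtags": [], '
    '"urls": [{"url": "http://t.co/FM4GhdQAcG", "indices": [114, 136], '
    '"expanded_url": "http://abcn.ws/1jb6ANh"}]}}'
)
print(expanded_urls(sample))
```

Because nothing is declared up front, a tweet with no `entities` field at all simply yields an empty list rather than a load error.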
Answer 2
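I have also worked with Twitter tweets, and from that I learned that tweets sometimes differ in their fields (some tweets contain more fields than others); in other words, tweets are not uniformly structured. If your input is structured, you can use JsonLoader in Pig; otherwise you cannot. To handle that, define your own UDF in Pig. To create a UDF in Pig, follow the link below.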
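Separately from any linked material, the idea of coping with tweets whose fields vary can be sketched in plain Python as follows. This is only an illustration of the logic a custom function might contain, not a registered Pig UDF; the field list `FIELDS` and the names `get_path` and `normalize_tweet` are assumptions chosen for the example.

```python
import json

# Hypothetical fixed set of fields to keep; dotted paths descend into nested objects.
FIELDS = ["id_str", "text", "lang", "user.screen_name", "place.full_name"]

def get_path(obj, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj

def normalize_tweet(line):
    """Turn one JSON tweet into a fixed-length tuple, using None for absent fields."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None  # skip lines that are not valid JSON
    return tuple(get_path(tweet, f) for f in FIELDS)

# Hypothetical input: "lang" is missing and "place" is null, as in many real tweets.
row = normalize_tweet(
    '{"id_str": "1", "text": "hi", "user": {"screen_name": "YeralMueka"}, "place": null}'
)
print(row)
```

Every record comes out with the same number of columns regardless of which fields the tweet carried, which is the property a fixed-schema loader needs.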