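I have spent the whole day trying to work out how to do this. I have these two files:
user.plt: contains a user's timestamped GPS trajectory.
label.txt: contains information about the travel modes used to cover the user's trips.
The first file (user.plt) is 7-field, comma-separated data that looks like this: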
lat,lon,constant,alt,ndays,date,time
39.921712,116.472343,0,13,39298.1462037037,2007-08-04,03:30:32
39.921705,116.472343,0,13,39298.1462152778,2007-08-04,03:30:33
39.863516,116.373796,0,115,39753.1872916667,2008-11-01,04:29:42
39.863471,116.373711,0,112,39753.1873032407,2008-11-01,04:29:43
39.991778,116.333088,0,223,39753.2128240741,2008-11-01,05:06:28
39.991776,116.333031,0,223,39753.2128472222,2008-11-01,05:06:30
39.991568,116.331501,0,95,39756.4298611111,2008-11-04,10:19:00
39.99156,116.331508,0,95,39756.4298726852,2008-11-04,10:19:01
39.975891,116.333441,0,-98,39756.4312615741,2008-11-04,10:21:01
39.915171,116.455808,0,656,39756.4601157407,2008-11-04,11:02:34
39.915369,116.455791,0,620,39756.4601273148,2008-11-04,11:02:35
39.912271,116.470686,0,95,39756.4653587963,2008-11-04,11:10:07
39.912088,116.469958,0,246,39756.4681481481,2008-11-04,11:14:08
39.912106,116.469936,0,246,39756.4681597222,2008-11-04,11:14:09
39.912189,116.465108,0,184,39756.4741666667,2008-11-04,11:22:48
39.975859,116.334063,0,279,39756.6100115741,2008-11-04,14:38:25
39.975978,116.334041,0,272,39756.6100231481,2008-11-04,14:38:26
39.991336,116.331886,0,115,39756.6112847222,2008-11-04,14:40:15
39.991581,116.33131,0,164,39756.6123148148,2008-11-04,14:41:44
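A minimal sketch of reading rows in this layout with Python's standard csv module. It assumes the file starts with the header line shown above, and it combines the date and time fields into a single datetime for later comparisons. The name parse_plt is hypothetical.

```python
import csv
import io
from datetime import datetime

def parse_plt(lines):
    """Yield (timestamp, row) pairs from user.plt-style lines with a header row."""
    for row in csv.DictReader(lines):
        # Combine the separate date and time fields into one datetime.
        ts = datetime.strptime(row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S")
        yield ts, row

# Two rows copied from the sample above; a real run would pass an open file.
sample = io.StringIO(
    "lat,lon,constant,alt,ndays,date,time\n"
    "39.921712,116.472343,0,13,39298.1462037037,2007-08-04,03:30:32\n"
    "39.863516,116.373796,0,115,39753.1872916667,2008-11-01,04:29:42\n"
)
points = list(parse_plt(sample))
print(points[1][0])  # 2008-11-01 04:29:42
```

The row values stay as strings, so lat and lon can be written back out unchanged.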
The second file (label.txt) is tab-separated data with 3 columns of trip information, and looks like this:
Start Time End Time Transportation Mode
2008/11/01 03:59:27 2008/11/01 04:30:18 train
2008/11/01 04:35:38 2008/11/01 05:06:30 taxi
2008/11/04 10:18:55 2008/11/04 10:21:11 subway
2008/11/04 11:02:34 2008/11/04 11:10:08 taxi
2008/11/04 11:14:08 2008/11/04 11:22:48 walk
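A minimal sketch of parsing this file, assuming the three columns really are separated by tab characters and the first line is the header. Note that the timestamps use slashes here, unlike the dashes in user.plt. The names parse_labels and LABEL_TIME_FORMAT are hypothetical.

```python
import csv
import io
from datetime import datetime

LABEL_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # label.txt uses slashes, unlike user.plt

def parse_labels(lines):
    """Return a list of (start, end, mode) tuples from tab-separated label lines."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header row
    return [
        (datetime.strptime(start, LABEL_TIME_FORMAT),
         datetime.strptime(end, LABEL_TIME_FORMAT),
         mode)
        for start, end, mode in reader
    ]

# Two rows copied from the sample above; a real run would pass an open file.
sample = io.StringIO(
    "Start Time\tEnd Time\tTransportation Mode\n"
    "2008/11/01 03:59:27\t2008/11/01 04:30:18\ttrain\n"
    "2008/11/01 04:35:38\t2008/11/01 05:06:30\ttaxi\n"
)
labels = parse_labels(sample)
print(labels[0][2], labels[0][1] - labels[0][0])  # train 0:30:51
```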
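I am looking for a way to annotate the contents of user.plt with the trip and travel mode for each time period, and write the result to a CSV file, as follows:
- Read one line of label.txt (i.e. the travel mode information for a particular trip). Create two fields: trip_id, initialised to 1, and segment_id, also initialised to 1.
- Read every line of user.plt whose date and time fall within the Start Time/End Time interval from label.txt (i.e. get the GPS trace of the trip).
- Read the next line of label.txt.
- If the difference between the previous line's End Time and the current line's Start Time is less than 30 minutes (i.e. same trip, new segment), keep trip_id at 1 and update segment_id to 2.
- If the difference between the previous line's End Time and the current line's Start Time is greater than 30 minutes (i.e. new trip, new segment), update trip_id = 2 and segment_id = 1.
- Each time, write the values to the CSV file in the following format: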
trip_id, segment_id, lat, lon, date, time, transportation-mode
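The trip_id/segment_id rule in the steps above can be sketched in plain Python as follows. This assumes the label rows are already sorted by start time, and it treats a gap of exactly 30 minutes as a new trip, which the steps leave unspecified. The names assign_ids and MAX_GAP are hypothetical.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=30)

def assign_ids(labels):
    """Number (start, end, mode) rows, sorted by start, with trip and segment ids."""
    out = []
    trip_id, segment_id = 1, 1
    prev_end = None
    for start, end, mode in labels:
        if prev_end is not None:
            if start - prev_end < MAX_GAP:
                segment_id += 1   # same trip, new segment
            else:
                trip_id += 1      # new trip, so segments restart at 1
                segment_id = 1
        out.append((trip_id, segment_id, start, end, mode))
        prev_end = end
    return out

def t(text):
    return datetime.strptime(text, "%Y/%m/%d %H:%M:%S")

# The five rows of label.txt from the sample above.
labels = [
    (t("2008/11/01 03:59:27"), t("2008/11/01 04:30:18"), "train"),
    (t("2008/11/01 04:35:38"), t("2008/11/01 05:06:30"), "taxi"),
    (t("2008/11/04 10:18:55"), t("2008/11/04 10:21:11"), "subway"),
    (t("2008/11/04 11:02:34"), t("2008/11/04 11:10:08"), "taxi"),
    (t("2008/11/04 11:14:08"), t("2008/11/04 11:22:48"), "walk"),
]
ids = [(trip, seg, mode) for trip, seg, _, _, mode in assign_ids(labels)]
print(ids)
# [(1, 1, 'train'), (1, 2, 'taxi'), (2, 1, 'subway'), (3, 1, 'taxi'), (3, 2, 'walk')]
```

The ids are attached to the label rows rather than to the GPS points, so each point only needs to be matched to its interval afterwards.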
Expected output
Given the 2 input files above, the expected CSV file (processed.csv) is:
trip_id,segment_id,lat,lon,date,time,transportation-mode
1,1,39.863516,116.373796,2008-11-01,04:29:42,train
1,1,39.863471,116.373711,2008-11-01,04:29:43,train
1,2,39.991778,116.333088,2008-11-01,05:06:28,taxi
1,2,39.991776,116.333031,2008-11-01,05:06:30,taxi
2,1,39.991568,116.331501,2008-11-04,10:19:00,subway
2,1,39.99156,116.331508,2008-11-04,10:19:01,subway
2,1,39.975891,116.333441,2008-11-04,10:21:01,subway
3,1,39.915171,116.455808,2008-11-04,11:02:34,taxi
3,1,39.915369,116.455791,2008-11-04,11:02:35,taxi
3,1,39.912271,116.470686,2008-11-04,11:10:07,taxi
3,2,39.912088,116.469958,2008-11-04,11:14:08,walk
3,2,39.912106,116.469936,2008-11-04,11:14:09,walk
3,2,39.912189,116.465108,2008-11-04,11:22:48,walk
Note: not every line of user.plt has corresponding trip information in label.txt. Those lines are ignored and are not needed.
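A minimal sketch of that filtering step, assuming the segments already carry their trip_id and segment_id and that both ends of each interval are inclusive (the expected output keeps the 05:06:30 point, which equals a taxi End Time). The name annotate and the shortened data are hypothetical, and the output goes to an in-memory buffer rather than processed.csv.

```python
import csv
import io
from datetime import datetime

def annotate(points, segments):
    """Yield output rows for points that fall inside a segment; skip the rest."""
    for ts, lat, lon in points:
        for trip_id, segment_id, start, end, mode in segments:
            if start <= ts <= end:  # both ends inclusive
                yield [trip_id, segment_id, lat, lon,
                       ts.date().isoformat(), ts.time().isoformat(), mode]
                break

segments = [
    (1, 1, datetime(2008, 11, 1, 3, 59, 27), datetime(2008, 11, 1, 4, 30, 18), "train"),
    (1, 2, datetime(2008, 11, 1, 4, 35, 38), datetime(2008, 11, 1, 5, 6, 30), "taxi"),
]
points = [
    (datetime(2007, 8, 4, 3, 30, 32), "39.921712", "116.472343"),   # no label: dropped
    (datetime(2008, 11, 1, 4, 29, 42), "39.863516", "116.373796"),  # train
    (datetime(2008, 11, 1, 5, 6, 30), "39.991776", "116.333031"),   # taxi, on the End Time
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["trip_id", "segment_id", "lat", "lon", "date", "time", "transportation-mode"])
writer.writerows(annotate(points, segments))
rows = buf.getvalue().splitlines()
print(rows[1])  # 1,1,39.863516,116.373796,2008-11-01,04:29:42,train
```

The inner loop scans every segment for every point, which is fine for a handful of labels; a sorted lookup would be worth considering for long label files.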
Edit
Below I provide the data in dictionary form, as suggested in the comments.
user.plt:
{'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
'10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
'14:38:25','14:38:26','14:40:15','14:41:44']}
label.txt:
{'Start Time': ['2008/11/01 03:59:27',
'2008/11/01 04:35:38',
'2008/11/04 10:18:55',
'2008/11/04 11:02:34',
'2008/11/04 11:14:08'],
'End Time': ['2008/11/01 04:30:18',
'2008/11/01 05:06:30',
'2008/11/04 10:21:11',
'2008/11/04 11:10:08',
'2008/11/04 11:22:48'],
'Transportation Mode': ['train', 'taxi', 'subway', 'taxi', 'walk']}
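With the data in dictionary form, one possible pandas sketch of the whole task looks like this. It uses a shortened copy of the user.plt dictionary to stay brief, assumes the label intervals do not overlap, and treats a gap of exactly 30 minutes as the same trip. The approach (cumsum for trip_id, groupby cumcount for segment_id, merge_asof for the interval match) is one option among several, not a reference implementation.

```python
import pandas as pd

# Shortened copy of the user.plt dictionary (only the columns needed for output).
user = pd.DataFrame({
    "lat": [39.921712, 39.863516, 39.991776, 39.991568, 39.915171, 39.912088, 39.975859],
    "lon": [116.472343, 116.373796, 116.333031, 116.331501, 116.455808, 116.469958, 116.334063],
    "date": ["2007-08-04", "2008-11-01", "2008-11-01", "2008-11-04",
             "2008-11-04", "2008-11-04", "2008-11-04"],
    "time": ["03:30:32", "04:29:42", "05:06:30", "10:19:00",
             "11:02:34", "11:14:08", "14:38:25"],
})
labels = pd.DataFrame({
    "Start Time": ["2008/11/01 03:59:27", "2008/11/01 04:35:38", "2008/11/04 10:18:55",
                   "2008/11/04 11:02:34", "2008/11/04 11:14:08"],
    "End Time": ["2008/11/01 04:30:18", "2008/11/01 05:06:30", "2008/11/04 10:21:11",
                 "2008/11/04 11:10:08", "2008/11/04 11:22:48"],
    "Transportation Mode": ["train", "taxi", "subway", "taxi", "walk"],
})

user["ts"] = pd.to_datetime(user["date"] + " " + user["time"])
for col in ("Start Time", "End Time"):
    labels[col] = pd.to_datetime(labels[col], format="%Y/%m/%d %H:%M:%S")

# A new trip starts on the first row or after a gap of more than 30 minutes.
gap = labels["Start Time"] - labels["End Time"].shift()
new_trip = gap.isna() | (gap > pd.Timedelta(minutes=30))
labels["trip_id"] = new_trip.cumsum()
labels["segment_id"] = labels.groupby("trip_id").cumcount() + 1

# Attach to each point the latest label starting at or before it,
# then keep only points that are also at or before that label's End Time.
merged = pd.merge_asof(user.sort_values("ts"), labels.sort_values("Start Time"),
                       left_on="ts", right_on="Start Time")
merged = merged[merged["ts"] <= merged["End Time"]]

out = merged.rename(columns={"Transportation Mode": "transportation-mode"})[
    ["trip_id", "segment_id", "lat", "lon", "date", "time", "transportation-mode"]
].astype({"trip_id": int, "segment_id": int})
print(out.to_csv(index=False))
```

Points with no matching label get missing values from merge_asof and fail the End Time comparison, so they drop out without a separate filter.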
2 Answers
luaexgnf1#
If you want to use pandas together with pyjanitor:
Output:
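92dk7w1h2#
Here is a pure pandas solution (which also makes sure the dtypes are correct). It was tested in a Jupyter notebook, hence display(user_df); if you are using another IDE, you can use print(user_df) instead:
Terminal result:
That said (well, written), I am not 100% sure whether this meets your exact requirements, or even whether it is efficient. I worked through the problem manually, step by step, so some form of validation from the pandas heavyweights would be most welcome.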