从两个单独的文件中检索数据并写入第三个csv文件

ego6inou  于 2023-06-03  发布在  其他
关注(0)|答案(2)|浏览(434)

我整天都在想如何完成这项任务。我有这两个文件:

  1. user.plt:包含用户的带时间戳的GPS轨迹。
  2. label.txt:包含有关用于覆盖用户行程的旅行模式的信息。
    第一个文件(user.plt)是一个7字段逗号分隔的数据,如下所示:
lat,lon,constant,alt,ndays,date,time
39.921712,116.472343,0,13,39298.1462037037,2007-08-04,03:30:32
39.921705,116.472343,0,13,39298.1462152778,2007-08-04,03:30:33
39.863516,116.373796,0,115,39753.1872916667,2008-11-01,04:29:42
39.863471,116.373711,0,112,39753.1873032407,2008-11-01,04:29:43
39.991778,116.333088,0,223,39753.2128240741,2008-11-01,05:06:28
39.991776,116.333031,0,223,39753.2128472222,2008-11-01,05:06:30
39.991568,116.331501,0,95,39756.4298611111,2008-11-04,10:19:00
39.99156,116.331508,0,95,39756.4298726852,2008-11-04,10:19:01
39.975891,116.333441,0,-98,39756.4312615741,2008-11-04,10:21:01
39.915171,116.455808,0,656,39756.4601157407,2008-11-04,11:02:34
39.915369,116.455791,0,620,39756.4601273148,2008-11-04,11:02:35
39.912271,116.470686,0,95,39756.4653587963,2008-11-04,11:10:07
39.912088,116.469958,0,246,39756.4681481481,2008-11-04,11:14:08
39.912106,116.469936,0,246,39756.4681597222,2008-11-04,11:14:09
39.912189,116.465108,0,184,39756.4741666667,2008-11-04,11:22:48
39.975859,116.334063,0,279,39756.6100115741,2008-11-04,14:38:25
39.975978,116.334041,0,272,39756.6100231481,2008-11-04,14:38:26
39.991336,116.331886,0,115,39756.6112847222,2008-11-04,14:40:15
39.991581,116.33131,0,164,39756.6123148148,2008-11-04,14:41:44

第二个文件(label.txt)是一个由3列用户行程信息组成的制表符,看起来像这样:

Start Time           End Time    Transportation Mode
2008/11/01 03:59:27 2008/11/01 04:30:18 train
2008/11/01 04:35:38 2008/11/01 05:06:30 taxi
2008/11/04 10:18:55 2008/11/04 10:21:11 subway
2008/11/04 11:02:34 2008/11/04 11:10:08 taxi
2008/11/04 11:14:08 2008/11/04 11:22:48 walk

我正在寻找一种方法来读取user.plt的内容为每个时期的旅行与旅行模式注解,并写入一个CSV文件如下:

  • 读取label.txt的1行(即特定行程的行程模式信息)。创建两个字段trip_id,初始化为1segment_id也初始化为1
  • 读取user.plt中日期和时间在label.txt的开始时间/结束时间间隔内的每一行(即获取旅行的GPS轨迹)。
  • 读取label.txt的下一行。
  • 如果前一行的结束时间与当前行的开始时间之间的差小于30分钟(即,相同行程,新段),将trip_id保留为1,将segment_id更新为2
  • 如果前一行的结束时间与当前行的开始时间之间的差大于30分钟(则为新行程、新段),则更新trip_id = 2segment_id = 1
  • 每次将值写入CSV文件,格式如下:

trip_id, segment_id, lat, lon, date, time, transportation-mode

预期效果

给定上述2个输入文件,预期的CSV文件(processed.csv)为:

trip_id,segment_id,lat,lon,date,time,transportation-mode
1,1,39.863516,116.373796,2008-11-01,04:29:42,train
1,1,39.863471,116.373711,2008-11-01,04:29:43,train
1,2,39.991778,116.333088,2008-11-01,05:06:28,taxi
1,2,39.991776,116.333031,2008-11-01,05:06:30,taxi
2,1,39.991568,116.331501,2008-11-04,10:19:00,subway
2,1,39.99156,116.331508,2008-11-04,10:19:01,subway
2,1,39.975891,116.333441,2008-11-04,10:21:01,subway
3,1,39.915171,116.455808,2008-11-04,11:02:34,taxi
3,1,39.915369,116.455791,2008-11-04,11:02:35,taxi
3,1,39.912271,116.470686,2008-11-04,11:10:07,taxi
3,2,39.912088,116.469958,2008-11-04,11:14:08,walk
3,2,39.912106,116.469936,2008-11-04,11:14:09,walk
3,2,39.912189,116.465108,2008-11-04,11:22:48,walk

注意:并非user.plt的所有行在label.txt中都有相应的跳闸信息。这些行将被忽略且不需要。

编辑

下面我以字典的形式给予了评论中建议的数据。

  1. user.plt
{'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
  39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
  39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
 'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
  116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
  116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
 'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
 'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
  39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
  39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
  39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
 'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
 'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
  '10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
  '14:38:25','14:38:26','14:40:15','14:41:44']}
  1. label.txt
{'Start Time': ['2008/11/01 03:59:27',
  '2008/11/01 04:35:38',
  '2008/11/04 10:18:55',
  '2008/11/04 11:02:34',
  '2008/11/04 11:14:08'],
 'End Time': ['2008/11/01 04:30:18',
  '2008/11/01 05:06:30',
  '2008/11/04 10:21:11',
  '2008/11/04 11:10:08',
  '2008/11/04 11:22:48'],
 'Transportation Mode': ['train', 'taxi', 'subway', 'taxi', 'walk']}
luaexgnf

luaexgnf1#

如果您想将pandaspyjanitorflavor 一起使用:

import pandas as pd
import janitor

user = pd.read_csv("user.plt", parse_dates={"Trip Time": ["date", "time"]})
label = pd.read_table("label.txt", parse_dates=["Start Time", "End Time"])

df = user.conditional_join(label, 
                          ("Trip Time", "Start Time", ">="), 
                          ("Trip Time", "End Time", "<="))

df["diff (min)"] = ((df["End Time"].shift() - df["Start Time"])
                             .dt.total_seconds().div(60).abs())

df["trip_id"] = ((df["diff (min)"].shift().gt(30) & df["diff (min)"].lt(30))
                               .shift(-1, fill_value=False).cumsum().add(1))

df["segment_id"] = ((~df.duplicated(subset=["trip_id", "Transportation Mode"]))
                                              .groupby(df['trip_id']).cumsum())

use_cols = ["diff (min)", "trip_id", "segment_id",
            "lat", "lon", "Trip Time", "Transportation Mode"]

df = df[use_cols]

# df.to_csv("processed.csv", index=False) #uncomment this line to make a fresh `.csv`

输出:

print(df)

    diff (min)  trip_id  segment_id   lat    lon           Trip Time Transportation Mode
0          NaN        1           1 39.86 116.37 2008-11-01 04:29:42               train
1        30.85        2           1 39.86 116.37 2008-11-01 04:29:43               train
2         5.33        2           2 39.99 116.33 2008-11-01 05:06:28                taxi
3        30.87        2           2 39.99 116.33 2008-11-01 05:06:30                taxi
4      4632.42        3           1 39.99 116.33 2008-11-04 10:19:00              subway
5         2.27        3           1 39.99 116.33 2008-11-04 10:19:01              subway
6         2.27        3           1 39.98 116.33 2008-11-04 10:21:01              subway
7        41.38        4           1 39.92 116.46 2008-11-04 11:02:34                taxi
8         7.57        4           1 39.92 116.46 2008-11-04 11:02:35                taxi
9         7.57        4           1 39.91 116.47 2008-11-04 11:10:07                taxi
10        4.00        4           2 39.91 116.47 2008-11-04 11:14:08                walk
11        8.67        4           2 39.91 116.47 2008-11-04 11:14:09                walk
12        8.67        4           2 39.91 116.47 2008-11-04 11:22:48                walk
92dk7w1h

92dk7w1h2#

这里是一个纯pandas解决方案(确保dtypes也是正确的)。这是在Jupyter notebook中测试的,因此display(user_df),如果使用其他IDE,您也可以执行print(user_df)

import pandas as pd

user_df = pd.DataFrame({'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
  39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
  39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
 'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
  116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
  116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
 'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
 'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
  39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
  39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
  39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
 'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
 'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
  '10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
  '14:38:25','14:38:26','14:40:15','14:41:44']})

label_df = pd.DataFrame({'StartTime': ['2008/11/01 03:59:27',
  '2008/11/01 04:35:38',
  '2008/11/04 10:18:55',
  '2008/11/04 11:02:34',
  '2008/11/04 11:14:08'],
 'EndTime': ['2008/11/01 04:30:18',
  '2008/11/01 05:06:30',
  '2008/11/04 10:21:11',
  '2008/11/04 11:10:08',
  '2008/11/04 11:22:48'],
 'TransportationMode': ['train', 'taxi', 'subway', 'taxi', 'walk']})

# display(user_df)
# display(label_df)
    
user_df['timestamp'] = user_df.apply(lambda row: pd.to_datetime(str(row.date) +' ' + str(row.time), format='ISO8601'), axis=1)
label_df['start_time'] = label_df.apply(lambda row: pd.to_datetime(row['StartTime'], format='ISO8601'), axis=1)
label_df['end_time'] = label_df.apply(lambda row: pd.to_datetime(row['EndTime'], format='ISO8601'), axis=1)

label_df['trip_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (1 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 2), axis=1)
label_df['segment_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (2 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 1), axis=1)
label_df

user_df['transportation_mode'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['TransportationMode'].values, axis=1)
user_df['trip_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['trip_id'].values, axis=1)
user_df['segment_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['segment_id'].values, axis=1)
display(user_df[user_df.transportation_mode.str.len() > 0])

终端结果:

lat     lon     constant    alt     ndays   date    time    timestamp   transportation_mode     trip_id     segment_id
2   39.863516   116.373796  0   115     39753.187292    2008-11-01  04:29:42    2008-11-01 04:29:42     [train]     [1]     [1]
3   39.863471   116.373711  0   112     39753.187303    2008-11-01  04:29:43    2008-11-01 04:29:43     [train]     [1]     [1]
4   39.991778   116.333088  0   223     39753.212824    2008-11-01  05:06:28    2008-11-01 05:06:28     [taxi]  [1]     [2]
5   39.991776   116.333031  0   223     39753.212847    2008-11-01  05:06:30    2008-11-01 05:06:30     [taxi]  [1]     [2]
6   39.991568   116.331501  0   95  39756.429861    2008-11-04  10:19:00    2008-11-04 10:19:00     [subway]    [2]     [1]
7   39.991560   116.331508  0   95  39756.429873    2008-11-04  10:19:01    2008-11-04 10:19:01     [subway]    [2]     [1]
8   39.975891   116.333441  0   -98     39756.431262    2008-11-04  10:21:01    2008-11-04 10:21:01     [subway]    [2]     [1]
9   39.915171   116.455808  0   656     39756.460116    2008-11-04  11:02:34    2008-11-04 11:02:34     [taxi]  [2]     [1]
10  39.915369   116.455791  0   620     39756.460127    2008-11-04  11:02:35    2008-11-04 11:02:35     [taxi]  [2]     [1]
11  39.912271   116.470686  0   95  39756.465359    2008-11-04  11:10:07    2008-11-04 11:10:07     [taxi]  [2]     [1]
12  39.912088   116.469958  0   246     39756.468148    2008-11-04  11:14:08    2008-11-04 11:14:08     [walk]  [1]     [2]
13  39.912106   116.469936  0   246     39756.468160    2008-11-04  11:14:09    2008-11-04 11:14:09     [walk]  [1]     [2]
14  39.912189   116.465108  0   184     39756.474167    2008-11-04  11:22:48    2008-11-04 11:22:48     [walk]  [1]     [2]

话虽如此(好吧,书面),我不能100%确定这是否满足您的确切要求,或者即使它是有效的-一步一步地,手动通过问题,所以来自pandas重量级人物的某种形式的验证将是最受欢迎的。

相关问题