Spark: reading files while excluding a pattern

c86crjj0  posted on 2021-05-29  in Hadoop
df = sc.textFile("hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/*/part-*.gz")

I use this code to read all the gz files under the path

hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/

This path contains 24 directories, named 00 through 23. How can I read the files but exclude directory 23?

drwxr-xr-x   - algo algo          0 2018-08-29 23:07 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/00
drwxr-xr-x   - algo algo          0 2018-08-29 23:11 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/01
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/02
drwxr-xr-x   - algo algo          0 2018-08-29 23:23 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/03
drwxr-xr-x   - algo algo          0 2018-08-29 23:13 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/04
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/05
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/06
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/07
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/08
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/09
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/10
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/11
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/12
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/13
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/14
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/15
drwxr-xr-x   - algo algo          0 2018-08-29 23:20 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/16
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/17
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/18
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/19
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/20
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/21
drwxr-xr-x   - algo algo          0 2018-08-29 23:15 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/22
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/23
at0kjp5o  #1

It's a bit of a workaround, but hopefully it works for you.

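import os
base = 'hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/'
# `hadoop fs -ls` prints one entry per line; the path is the last whitespace-separated field
lines = os.popen('hadoop fs -ls ' + base).readlines()
dirs = [x.split()[-1] for x in lines if x.startswith('d')]  # directories only, skips the "Found N items" header
# drop hour 23 and keep the original part-*.gz glob under each remaining directory
paths = [d + '/part-*.gz' for d in dirs if not d.endswith('/23')]
rdd = sc.textFile(','.join(paths))  # textFile accepts a comma-separated list of paths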
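If you would rather not shell out to `hadoop fs -ls`, here is a rough sketch that builds the same comma-separated path list in pure Python. It assumes the hour directories are always the two-digit names 00 to 23 shown in the question; `build_input_paths`, `excluded_hours`, and `file_glob` are names made up for this example, not part of any Spark API. The last line also shows an equivalent single glob, since Hadoop path globs support character ranges like `[0-9]` and `{a,b}` alternation; treat that variant as something to verify on your cluster.

```python
def build_input_paths(base, excluded_hours=("23",), file_glob="part-*.gz"):
    """Return a comma-separated path string covering hours 00-23 minus the excluded ones.

    Hypothetical helper: assumes hour directories are zero-padded two-digit names.
    """
    excluded = set(excluded_hours)
    hours = ["%02d" % h for h in range(24)]        # "00", "01", ..., "23"
    kept = [h for h in hours if h not in excluded]  # drop the unwanted hours
    root = base.rstrip("/")
    return ",".join("%s/%s/%s" % (root, h, file_glob) for h in kept)


BASE = "hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828"

input_paths = build_input_paths(BASE)

# With a live SparkContext `sc`, the string can be passed straight to textFile:
# rdd = sc.textFile(input_paths)

# Alternative (assumption to verify): one Hadoop glob matching only 00-22.
GLOB_PATH = BASE + "/{0[0-9],1[0-9],2[0-2]}/part-*.gz"
# rdd = sc.textFile(GLOB_PATH)
```

The list version is easier to adapt if you later need to skip several hours, e.g. `build_input_paths(BASE, excluded_hours=("00", "23"))`; the glob version keeps the call to a single pattern.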
