I want to run NLTK over text files stored on S3 as part of an ETL process in AWS Glue. [https://drive.google.com/drive/folders/1ne16uvnhd0k6790cxsimdyrrcgzeatjn?usp=sharing][1] is the external Python dependency I use for the ETL job.
I have already modified the data.py file inside the NLTK package, following [https://stackoverflow.com/a/45069242/12927963][2], but I used "/tmp/nltk_data" as the path variable.
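(For reference, the same override can usually be made at runtime without patching data.py, since nltk.data.path is an ordinary Python list; a minimal sketch, assuming the "/tmp/nltk_data" directory mentioned above:)

import nltk

# Prepend the custom directory so NLTK searches it first.
# "/tmp/nltk_data" is the directory assumed in this question.
nltk.data.path.insert(0, "/tmp/nltk_data")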
Here is my sample script:
import pyspark
import nltk
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.ml.feature import NGram
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode,regexp_replace
from nltk.corpus import stopwords
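# Replace NLTK's data search path so resources are looked up in /tmp only.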
nltk.data.path = ["/tmp"]
from nltk.stem import WordNetLemmatizer
sparkcontext = SparkContext.getOrCreate()
gluecontext = GlueContext(SparkContext.getOrCreate())
sqlcontext = SQLContext(gluecontext)
glueJob = Job(gluecontext)
## Initialize the Glue job
glueJob.init("NLTKjob")
# Lemmatization function
def lemma(x):
    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize(x)
## Reading Files from S3 to Spark RDD.
init_RDD = sparkcontext.textFile("<s3 file path>")
print(init_RDD)
print("Type of init_RDD is :: ",type(init_RDD))
## Perform lemmatization on the RDD.
RDD_lem_words = init_RDD.map(lemma)
print(RDD_lem_words)
print("Type of RDD_lem_words is ::",type(RDD_lem_words))
RDD_len_list = RDD_lem_words.collect()
print(RDD_len_list)
print("Type Of RDD_len_list is ::",type(RDD_len_list))
glueJob.commit()
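(A side note on the script above: WordNetLemmatizer.lemmatize() expects a single word, while sparkcontext.textFile() yields whole lines, so a per-word variant might look like the sketch below. lemma_line and the whitespace split are my own illustration, not part of the original job, and it would still hit the same data-path error.)

from nltk.stem import WordNetLemmatizer

# Hypothetical per-word variant: tokenize each line on whitespace,
# then lemmatize every token.
def lemma_line(line):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in line.split()]

RDD_lem_words = init_RDD.map(lemma_line)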
But I get an error from NLTK, with the following log:
nltk.download('wordnet')
Attempted to load corpora/wordnet.zip/wordnet/
Searched in:
- '/home/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
Is "/tmp/nltk_data" the correct path for NLTK data in AWS Glue?
If I don't include the NLTK data in the external Python zip, it gets downloaded to "/home/nltk_data", but the Spark application cannot access it.
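(What I am effectively trying to achieve is something like the sketch below: download the corpus into a writable location when the job starts and point NLTK at it. The download_dir argument is part of nltk.download(), and "/tmp/nltk_data" is the directory assumed throughout this question.)

import nltk

# Download WordNet into a writable directory at job start
# and make sure NLTK searches that directory afterwards.
nltk.download('wordnet', download_dir='/tmp/nltk_data')
nltk.data.path.append('/tmp/nltk_data')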
Please help me with this use case.

  [1]: https://drive.google.com/drive/folders/1ne16uvnhd0k6790cxsimdyrrcgzeatjn?usp=sharing
  [2]: https://stackoverflow.com/a/45069242/12927963