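I deployed the following Colab Python code (see the link below) to Dataproc on Google Cloud. It works only when the input list is an array with a single item. When the input list contains two items, the PySpark job dies at the line "for r in result.collect()" in the get_similarity method below, with the following error: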
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:745)
input_list=["no error"] <---- works
input_list=["this", "throws EOF error"] <---- does not work
Link to the Colab that uses Spark NLP for sentence similarity: https://colab.research.google.com/github/johnsnowlabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/sentence_similarity.ipynb#scrollto=6e0y5wt4
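import numpy as np
import pandas as pd

# `spark` (the SparkSession) and `light_pipeline` are defined earlier in the notebook.
def get_similarity(input_list):
    df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
    result = light_pipeline.transform(df)
    embeddings = []
    # Pull every row to the driver and keep the first sentence embedding of each.
    for r in result.collect():
        embeddings.append(r.sentence_embeddings[0].embeddings)
    embeddings_matrix = np.array(embeddings)
    # Pairwise dot products between all sentence embeddings.
    return np.matmul(embeddings_matrix, embeddings_matrix.transpose())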
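To narrow the problem down, the NumPy step at the end of get_similarity can be checked separately from Spark. The sketch below is only a stand-in, assuming made-up 3-dimensional vectors in place of real sentence embeddings; the names `similarity_matrix` and `fake_embeddings` are hypothetical and do not appear in the notebook.

```python
import numpy as np


def similarity_matrix(embeddings):
    """Return the pairwise dot products between embedding vectors."""
    m = np.asarray(embeddings, dtype=float)
    if m.ndim != 2:
        raise ValueError("expected a 2-D array: one row per sentence")
    return m @ m.T


# Hypothetical vectors standing in for r.sentence_embeddings[0].embeddings
# (assumption: real embeddings are much longer, but the shape logic is the same).
fake_embeddings = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
sim = similarity_matrix(fake_embeddings)
print(sim.shape)
```

If this runs cleanly with two rows, the matrix math is probably not what breaks with a two-item input_list, which would be consistent with the job dying earlier, at result.collect().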
I tried changing "dfs.datanode.max.transfer.threads" to 8192 in the Hadoop cluster configuration, but it still did not work:
hadoop_config.set('dfs.datanode.max.transfer.threads', "8192")
How can I make this code work when input_list contains more than one item?