pyspark: Read a notebook in as plain text

Posted by fumotvh3 on 2023-10-15 in Spark


I'm new to Synapse but a long-time Python user. I have a Synapse notebook that contains SQL queries. I document the fields in that notebook inside the cells, like this:

SELECT
    Field1    --| Some description of field 1
    Field2    --| Some description of field 2
    FieldN    --| Some description of field n
FROM SomeTable

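I want to read that notebook into another notebook as a plain-text string, use the re package in pyspark to parse the SQL blocks, and output the documentation as a table of fields and descriptions. What I can't figure out is: how do I read a notebook from my workspace into another notebook as a string?
Suppose I have a notebook named MyNotebook in a directory named Gold. In Python on a local machine, I would write something like this: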

with open('Gold/MyNotebook') as f:
    contents = f.readlines()

But in Synapse this raises an error:

File ~/cluster-env/env/lib/python3.10/site-packages/IPython/core/interactiveshell.py:282, in _modified_open(file, *args, **kwargs)
    275 if file in {0, 1, 2}:
    276     raise ValueError(
    277         f"IPython won't let you open fd={file} by default "
    278         "as it is likely to crash IPython. If you know what you are doing, "
    279         "you can use builtins' open."
    280     )
--> 282 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'Gold/MyNotebook'

pyspark

Source: https://stackoverflow.com/questions/77136909/read-a-notebook-in-as-plain-text

1 answer


lh80um4z1#

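As @user238607 said, you cannot read a notebook's contents directly from within the workspace. He also suggested exporting the notebook with %notebook and parsing the result. However, Azure Synapse does not support that command, so you can instead export your notebook to .html and upload it to an ADLS account.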

First, export the notebook and upload it to the ADLS account.

Now read it in Synapse.
I used fsspec to read the file and BeautifulSoup to parse it.
Here is the code.

import fsspec
from bs4 import BeautifulSoup

# Open the exported notebook in ADLS; replace the account name and key with your own.
fsspec_handle = fsspec.open('abfs://data/notebook/Notebook.html', account_name="Storage_acc_name", account_key="your_acc_key", mode="r")

with fsspec_handle.open() as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

# In the exported HTML, each cell's source sits in a div with this class.
cell_elements = soup.find_all('div', class_='jp-InputArea-editor')

# Collect the text of every cell.
q = []
for cell in cell_elements:
    cell_content = cell.get_text()
    q.append(cell_content.strip())

print(q)

This produces the list of parsed queries, one string per cell.
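To try the extraction step without access to the storage account, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser and runs on an inline string. The class name jp-InputArea-editor comes from the code above; the CellTextExtractor class and the sample_html markup are hypothetical and only approximate what an exported notebook might look like.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains target_class."""

    def __init__(self, target_class="jp-InputArea-editor"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching div (counts nested divs)
        self.cells = []     # one stripped string per matching div
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a cell
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a cell
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:  # leaving the cell: store its text
                self.cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


# Hypothetical markup, assumed for illustration only.
sample_html = """
<div class="jp-Cell-inputWrapper">
  <div class="jp-InputArea-editor"><pre>SELECT
    Field1    --| Some description of field 1
FROM SomeTable</pre></div>
</div>
<div class="jp-InputArea-editor"><pre>print("hello")</pre></div>
"""

parser = CellTextExtractor()
parser.feed(sample_html)
parser.close()
cells = parser.cells
print(cells)
```

Here cells plays the role of q above. For a real export, BeautifulSoup as used in the answer is likely the more robust choice, since it tolerates malformed markup.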

Now create the table.

import re
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Field Name", StringType(), nullable=False),
    StructField("Description", StringType(), nullable=False)
])

# Capture "FieldName    --| description" pairs, up to the end of the line.
pattern = r'\b(\w+)\s+--\|\s+(.*?)\n'

# `spark` is the SparkSession that the Synapse notebook provides.
for each in q:
    matches = re.findall(pattern, each)
    if not matches:
        continue  # skip cells that contain no documented fields
    field_table = pd.DataFrame(matches, columns=["Field Name", "Description"])
    spark.createDataFrame(field_table, schema=schema).show()

Output:
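For the sample query in the question, the pattern should capture one (field, description) pair per documented line. A minimal sketch in plain Python, with no Spark session needed, where sample_cell is a hypothetical stand-in for one element of q:

```python
import re

# Hypothetical stand-in for one entry of q: the sample query from the question.
sample_cell = """SELECT
    Field1    --| Some description of field 1
    Field2    --| Some description of field 2
    FieldN    --| Some description of field n
FROM SomeTable"""

# Same pattern as in the answer: a word, whitespace, the "--|" marker,
# then the rest of the line as the description.
pattern = r'\b(\w+)\s+--\|\s+(.*?)\n'

rows = re.findall(pattern, sample_cell)
for name, description in rows:
    print(f"{name}: {description}")
# Field1: Some description of field 1
# Field2: Some description of field 2
# FieldN: Some description of field n
```

Because the pattern ends in \n, a documented field on the very last line of a cell, with no trailing newline, would not match. That is not a problem here since FROM SomeTable follows the field list.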

Answered 2023-10-15
