pandas: How to query a subset of data from a file on Azure Data Lake

Posted by rlcwz9us on 2023-11-15 in Other

My data lake contains JSON files. Each file is a pandas DataFrame serialized as JSON, so a file might contain the following:

[
   {
      "col1":"1",
      "col2":"3"
   },
   {
      "col1":"2",
      "col2":"4"
   }
]

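This corresponds to the following DataFrame:

| col1 | col2 |
|------|------|
| 1    | 3    |
| 2    | 4    |

I use DataLakeServiceClient from azure.storage.filedatalake to work with the data lake files.

What I want is for query_file to extract only part of the data before I read it. What query text do I have to provide to achieve this? Is it possible in the first place?

The example below returns the entire file content as a string, which I then read into a pandas DataFrame with read_json. I know I could subset the DataFrame after reading it, but I want the selection to happen at query time.

Example code: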

import io
import pandas as pd

from azure.storage.filedatalake import DataLakeServiceClient, DelimitedJsonDialect
from azure.identity import ClientSecretCredential

adls_client = DataLakeServiceClient(account_url='...', credential=ClientSecretCredential(...))
filesystem_client = adls_client.get_file_system_client('...')
file_client = filesystem_client.get_file_client('test_dataframe.json')

input_format = DelimitedJsonDialect(has_header=True)
reader = file_client.query_file(
    'SELECT * from DataLakeStorage', # the problem is here
    file_format=input_format
)
json_str = reader.readall().decode('utf8')
df = pd.read_json(io.StringIO(json_str))

I tried SELECT col1 from DataLakeStorage, but it returns {}\n.
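
One thing that may be worth checking, offered as an assumption and not a confirmed diagnosis: DelimitedJsonDialect describes JSON records separated by a delimiter (a newline by default), whereas the sample file is a single JSON array. The sketch below only shows how the sample content could be rewritten with one record per line; the names array_json and jsonl_text are made up for illustration, and whether the column query then returns data has not been tested here.

```python
import io
import json

import pandas as pd

# Sample content from the question: one JSON array holding every record.
array_json = '[{"col1":"1","col2":"3"},{"col1":"2","col2":"4"}]'

# Rewrite the array as line-delimited JSON: one object per line.
records = json.loads(array_json)
jsonl_text = "\n".join(json.dumps(record) for record in records)
print(jsonl_text)

# pandas can read the line-delimited form back with lines=True.
df_back = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(df_back)
```

Uploading the rewritten file and running query_file against it is left out because it needs a live storage account.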

Source: https://stackoverflow.com/questions/77416517/how-to-query-subset-of-data-from-a-file-on-azure-datalake

1 Answer

svgewumm1#

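Instead of putting SELECT col1 from DataLakeStorage in the query, you can use selected_df = df['col1'] in pandas to get only col1 from the JSON file. To keep only the rows that match a condition, for example those where col1 is greater than 1, you can use selected_rows = df.query('col1 > 1'). Note that both selections run on the client, after the whole file has been read with SELECT *.
Here is the complete code: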

import io
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient, DelimitedJsonDialect
from azure.identity import ClientSecretCredential

client_id = "<clientId>"
client_secret = "<clientSecret>"
tenant_id = "<tenantId>"
account_name = "<storageAccountName>"
file_system_name = "<containerName>"
file_path = "<directory>/<filename>.json"
credential = ClientSecretCredential(
    client_id=client_id,
    client_secret=client_secret,
    tenant_id=tenant_id
)
account_url = f'https://{account_name}.dfs.core.windows.net'
adls_client = DataLakeServiceClient(account_url=account_url, credential=credential)
filesystem_client = adls_client.get_file_system_client(file_system_name)
file_client = filesystem_client.get_file_client(file_path)
input_format = DelimitedJsonDialect(has_header=True)
reader = file_client.query_file('SELECT * from DataLakeStorage', file_format=input_format)
json_str = reader.readall().decode('utf8')
df = pd.read_json(io.StringIO(json_str))

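# Column selection: returns the col1 column as a Series.
selected_df = df['col1']
# Row filter: keeps the rows whose col1 value is greater than 1.
selected_rows = df.query('col1 > 1')
print("Rows where col1 > 1:")
print(selected_rows)
print("col1:")
print(selected_df)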

You will get output like the following:

[Screenshot of the printed output]
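
The selection step in the answer can be tried without a storage account. The following is a minimal sketch that assumes json_str holds the sample file content from the question, in place of the string returned by reader.readall(). Because the values are stored as strings in the JSON, it converts col1 explicitly with pd.to_numeric; pandas may already infer integers on read, and the explicit conversion simply makes the numeric comparison independent of that.

```python
import io

import pandas as pd

# Assumption: stands in for reader.readall().decode('utf8') from the answer.
json_str = '[{"col1":"1","col2":"3"},{"col1":"2","col2":"4"}]'

df = pd.read_json(io.StringIO(json_str))

# Make sure col1 is numeric before comparing it against a number.
df["col1"] = pd.to_numeric(df["col1"])

selected_df = df["col1"]               # the col1 column as a Series
selected_rows = df.query("col1 > 1")   # rows whose col1 is greater than 1

print("Rows where col1 > 1:")
print(selected_rows)
print("col1:")
print(selected_df)
```

With the sample data, the row filter keeps only the second record (col1 equal to 2), not the first row.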
