How to load PDF files from Azure Blob Storage with the LangChain PyPDFLoader

gk7wooem · posted 2023-10-22 in Other

I'm currently trying to implement LangChain functionality to process PDF documents. I have a bunch of PDF files stored in Azure Blob Storage, and I'm trying to load them into an Azure ML Notebook with the LangChain PyPDFLoader, but I can't get it to work. There's no problem when the PDFs are stored locally, but to scale I have to connect to the blob store. I haven't found any documentation on the LangChain or Azure sites, so I'm wondering whether any of you have run into a similar problem.
Thank you.
Here is a sample of the code I'm trying:

from azureml.fsspec import AzureMachineLearningFileSystem
fs = AzureMachineLearningFileSystem("<path to datastore>")

from langchain.document_loaders import PyPDFLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)
    data = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Another example:

from langchain.document_loaders import UnstructuredFileLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
documents = loader.load() 

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject
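
For context, both tracebacks say the same thing: PyPDFLoader and UnstructuredFileLoader expect a local file path (str or os.PathLike), not a file-like stream. A minimal workaround that stays with the fsspec handle from the question is to copy the bytes to a temporary file and point the loader at that path (the temp-file handling here is an illustration, not the only option):

import os
import tempfile
from langchain.document_loaders import PyPDFLoader

with fs.open('*/.../file.pdf', 'rb') as fd:
    # Copy the blob's bytes to a local temp file so the loader gets a real path
    tmp_path = os.path.join(tempfile.mkdtemp(), 'file.pdf')
    with open(tmp_path, 'wb') as out:
        out.write(fd.read())

loader = PyPDFLoader(tmp_path)
data = loader.load()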

gblwokeq 1#

If you still need an answer: you have to turn the blob data into BytesIO objects and save them locally (temporarily or permanently) before processing the files. Here is how I did it:

import io

def az_load_files(storage_acc_name, container_name, filenames=None):
    container_client = get_blob_container_client(container_name, storage_acc_name)
    blob_data = []
    for filename in filenames:
        blob_client = container_client.get_blob_client(filename)
        if blob_client.exists():
            # Download each blob into an in-memory BytesIO buffer
            blob_data.append(io.BytesIO(blob_client.download_blob().readall()))
    return blob_data
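
The get_blob_container_client helper is not shown in the answer; a minimal sketch with the azure-storage-blob SDK might look like this (authenticating with DefaultAzureCredential is an assumption, a connection string or account key would work just as well):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

def get_blob_container_client(container_name, storage_acc_name):
    # Build the account URL and authenticate with Azure AD
    account_url = f"https://{storage_acc_name}.blob.core.windows.net"
    return ContainerClient(account_url, container_name,
                           credential=DefaultAzureCredential())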

Then create a temporary folder for the BytesIO objects so they can be read back and "converted" into their respective document types:

import os
import tempfile

# ss maps 'selected_files' to the original filenames and
# 'loaded_files' to the BytesIO objects returned by az_load_files
temp_pdfs = []
temp_dir = tempfile.mkdtemp()
for i, byteio in enumerate(ss['loaded_files']):
    file_path = os.path.join(temp_dir, ss['selected_files'][i])
    with open(file_path, 'wb') as file:
        file.write(byteio.getbuffer())
    temp_pdfs.append(file_path)
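
For reference, ss above might be populated like this (the storage account, container, and blob names are hypothetical):

filenames = ['file1.pdf', 'file2.pdf']  # hypothetical blob names
ss = {
    'selected_files': filenames,
    'loaded_files': az_load_files('mystorageacct', 'docs-container', filenames),
}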

Finally, use DirectoryLoader to load documents of any supported type:

from langchain.text_splitter import RecursiveCharacterTextSplitter 
from langchain.document_loaders import (
  PyPDFLoader,
  DirectoryLoader,
  CSVLoader,
  Docx2txtLoader,
  TextLoader,
  UnstructuredExcelLoader,
  UnstructuredHTMLLoader,
  UnstructuredPowerPointLoader,
  UnstructuredMarkdownLoader,
  JSONLoader
)

file_type_mappings = {
    '*.txt': TextLoader,
    '*.pdf': PyPDFLoader,
    '*.csv': CSVLoader,
    '*.docx': Docx2txtLoader,
    '*.xls': UnstructuredExcelLoader,
    '*.xlsx': UnstructuredExcelLoader,
    '*.html': UnstructuredHTMLLoader,
    '*.pptx': UnstructuredPowerPointLoader,
    '*.ppt': UnstructuredPowerPointLoader,
    '*.md': UnstructuredMarkdownLoader,
    '*.json': JSONLoader,
}

docs = []

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=200)

for glob_pattern, loader_cls in file_type_mappings.items():
    try:
        # JSONLoader needs a jq schema; the other loaders take no extra kwargs
        loader_kwargs = {'jq_schema': '.', 'text_content': False} if loader_cls == JSONLoader else None
        loader_dir = DirectoryLoader(
            temp_dir, glob=glob_pattern, loader_cls=loader_cls, loader_kwargs=loader_kwargs)
        documents = loader_dir.load_and_split()
        # Each glob pattern contributes its own split-up texts
        docs += text_splitter.split_documents(documents)
    except Exception:
        # Skip file types that fail to load instead of aborting the whole run
        continue
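
Once docs is built, the temporary copies are no longer needed; a small cleanup step (not part of the original answer):

import shutil

print(f'Loaded {len(docs)} document chunks')
shutil.rmtree(temp_dir, ignore_errors=True)  # remove the local temp copies of the blobs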
