How to chat with my data in Python using Microsoft Azure OpenAI and Azure Cognitive Search

41zrol4v · published 2023-11-21 · in Python

I have written code that extracts text from PDF documents and converts it into vectors using Azure OpenAI's text-embedding-ada-002 model. These vectors are then stored in a Microsoft Azure Cognitive Search index, where they can be queried. Now, however, I would like to use Azure OpenAI to interact with this data and retrieve generated answers. My code works fine so far, but I don't know how to interact in Python, through Azure OpenAI, with my custom data in Azure Cognitive Search.
Here is my code:

# imports assumed by the snippets below (PyMuPDF, openai<1.0, langchain,
# azure-search-documents >= 11.4, and google.colab for Drive access)
import os

import fitz  # PyMuPDF
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from google.colab import drive
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

OPENAI_API_BASE = "https://xxxxx.openai.azure.com"
OPENAI_API_KEY = "xxxxxx"
OPENAI_API_VERSION = "2023-05-15"

openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = OPENAI_API_BASE
openai.api_version = OPENAI_API_VERSION

AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT = "https://xxxxxx.search.windows.net"
AZURE_COGNITIVE_SEARCH_API_KEY = "xxxxxxx"
AZURE_COGNITIVE_SEARCH_INDEX_NAME = "test"
AZURE_COGNITIVE_SEARCH_CREDENTIAL = AzureKeyCredential(AZURE_COGNITIVE_SEARCH_API_KEY)

llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)
embeddings = OpenAIEmbeddings(deployment_id="ada002", chunk_size=1, openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)

acs = AzureSearch(azure_search_endpoint=AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
                  azure_search_key = AZURE_COGNITIVE_SEARCH_API_KEY,
                  index_name = AZURE_COGNITIVE_SEARCH_INDEX_NAME,
                  embedding_function = embeddings.embed_query)

def generate_embeddings(s):
  # important: engine must be the name of my deployment!
  response = openai.Embedding.create(
      input=s,
      engine="ada002"
  )

  embeddings = response['data'][0]['embedding']

  return embeddings

def generate_tokens(s, f):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
  splits = text_splitter.split_text(s)
  i = 0

  documents = []
  for split in splits:
    metadata = {}
    metadata["index"] = i
    metadata["file_source"] = f
    i = i+1

    new_doc = Document(page_content=split, metadata=metadata)
    documents.append(new_doc)
    #documents = text_splitter.create_documents(splits)

  return documents

drive.mount('/content/drive')
folder = "/content/drive/docs/pdf/"

for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    if os.path.isfile(file_path):
        print(f"Processing file: {file_path}")

        # reset per file, so each document's chunks contain only its own text
        doc_content = ''
        doc = fitz.open(file_path)
        for page in doc: # iterate the document pages
          doc_content += page.get_text() # get plain text encoded as UTF-8

        d = generate_tokens(doc_content, file_path)
        print(d)

        acs.add_documents(documents=d)
    
        print("Done.")

query = "What are the advantages of an open-source ai model?"
search_client = SearchClient(AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT, AZURE_COGNITIVE_SEARCH_INDEX_NAME, credential=AZURE_COGNITIVE_SEARCH_CREDENTIAL)

# build the vector query from the question's embedding
# (VectorizedQuery comes from azure.search.documents.models in azure-search-documents >= 11.4)
vector_query = VectorizedQuery(vector=generate_embeddings(query),
                               k_nearest_neighbors=3,
                               fields="content_vector")

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["content_vector", "metadata"],
)

print(results)

for result in results:
  print(result)

The fields in my Azure Cognitive Search index are content_vector (for the vectors) and content (for the plain text content). I have looked at a lot of GitHub code, including code published by Microsoft, and I know this can be done, but I apparently have some trouble understanding how this part is implemented.
So I am looking for some help/hints on how to extend this code so that I can interact with the content in Azure Cognitive Search through Azure OpenAI chat.

xggvc2p6

What your code does so far is perform a similarity search in Azure Cognitive Search and find the relevant data related to your question.
The next step is to pass the query and this relevant data to the LLM so that it can create an answer to the query based on that data. The way to do this is to create a prompt, fill it with this information, and then send it to the LLM to answer the query.
Here is some code to do just that:
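To make the first step concrete: the vector search ranks stored embeddings by similarity to the query embedding and keeps the top k. A minimal local sketch of that idea, using made-up 3-dimensional vectors (real text-embedding-ada-002 vectors have 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# hypothetical documents with toy embeddings
documents = {
    "open-source models can be audited": [0.9, 0.1, 0.2],
    "closed models hide their weights":  [0.2, 0.8, 0.3],
    "bananas are rich in potassium":     [0.1, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.25]  # embedding of the user question

# rank documents by similarity and keep the top k, like a vector query with k=2
top_k = sorted(documents.items(),
               key=lambda item: cosine_similarity(query_vector, item[1]),
               reverse=True)[:2]
for text, _ in top_k:
    print(text)
```

Azure Cognitive Search does this (at scale, with approximate nearest-neighbor indexes) when you pass a vector query against the content_vector field.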

# "content" field contains the text content of your data. make sure that it is retrieved.
results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["content", "content_vector", "metadata"],
)

context = ""
for result in results:
  context += result["content"] + "\n\n"

# setup prompt template
from langchain.prompts import PromptTemplate
from langchain.schema import HumanMessage

template = """
Use the following pieces of context to answer the question at the end. Question is enclosed in <question></question>.
Do keep the following things in mind when answering the question:
- If you don't know the answer, just say that you don't know, don't try to make up an answer.
- Keep the answer as concise as possible.
- Use only the context to answer the question. Context is enclosed in <context></context>
- If the answer is not found in context, simply output "I'm sorry but I do not know the answer to your question.".

<context>{context}</context>
<question>{question}</question>
"""
prompt_template = PromptTemplate.from_template(template)

# initialize LLM
llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION, temperature=0)
prompt = prompt_template.format(context=context, question= query)
message = HumanMessage(content=prompt)
result = llm([message])
print(result.content)

This is the classic Retrieval Augmented Generation (RAG) technique. I used it to create a simple application for querying the Azure documentation in natural language. The code above is based on the code I wrote for that application. You can read more about the application and view its source code here: https://github.com/gmantri/azure-docs-copilot
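The RAG control flow described above can also be sketched end-to-end with stand-in components. Nothing here touches Azure; `retrieve()` and `llm()` are hypothetical placeholders for the vector search and for AzureChatOpenAI, just to make the retrieve → build prompt → generate sequence explicit:

```python
# toy corpus standing in for the chunks indexed in Azure Cognitive Search
CORPUS = {
    "doc1": "Open-source AI models can be inspected, audited and self-hosted.",
    "doc2": "Proprietary models are accessed only through a vendor API.",
}

def retrieve(question):
    # stand-in for the vector search: return every chunk sharing a keyword
    # with the question (a real system ranks by embedding similarity)
    words = set(question.lower().split())
    return [text for text in CORPUS.values()
            if words & set(text.lower().split())]

def llm(prompt):
    # stand-in for AzureChatOpenAI: just echo the prompt it was given
    return "Answer based on: " + prompt

question = "Why use open-source models?"
context = "\n\n".join(retrieve(question))
prompt = f"<context>{context}</context>\n<question>{question}</question>"
print(llm(prompt))
```

Swapping the stand-ins for the real SearchClient query and the AzureChatOpenAI call gives exactly the pipeline in the answer above.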
