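I have written code that extracts text from PDF documents and converts it into vectors with the text-embedding-ada-002 model from Azure OpenAI. These vectors are then stored in a Microsoft Azure Cognitive Search index, where they can be queried. Now I want to use Azure OpenAI to interact with this data and get generated answers back. My code works fine up to this point, but I do not know how to interact, from Python, with my custom data in Azure Cognitive Search through Azure OpenAI.
Here is my code: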
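# Import paths assume the legacy langchain layout and openai<1.0, matching the calls below.
import os
import fitz  # PyMuPDF
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from google.colab import drive
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

OPENAI_API_BASE = "https://xxxxx.openai.azure.com"
OPENAI_API_KEY = "xxxxxx"
OPENAI_API_VERSION = "2023-05-15"
openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = OPENAI_API_BASE
openai.api_version = OPENAI_API_VERSION
AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT = "https://xxxxxx.search.windows.net"
AZURE_COGNITIVE_SEARCH_API_KEY = "xxxxxxx"
AZURE_COGNITIVE_SEARCH_INDEX_NAME = "test"
AZURE_COGNITIVE_SEARCH_CREDENTIAL = AzureKeyCredential(AZURE_COGNITIVE_SEARCH_API_KEY)
llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)
embeddings = OpenAIEmbeddings(deployment_id="ada002", chunk_size=1, openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)
acs = AzureSearch(azure_search_endpoint=AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
                  azure_search_key=AZURE_COGNITIVE_SEARCH_API_KEY,
                  index_name=AZURE_COGNITIVE_SEARCH_INDEX_NAME,
                  embedding_function=embeddings.embed_query)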
def generate_embeddings(s):
    # Important: engine must be the name of my deployment!
    response = openai.Embedding.create(
        input=s,
        engine="ada002"
    )
    embeddings = response['data'][0]['embedding']
    return embeddings
def generate_tokens(s, f):
    # Split the text into overlapping chunks and wrap each one in a Document.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    splits = text_splitter.split_text(s)
    documents = []
    for i, split in enumerate(splits):
        metadata = {"index": i, "file_source": f}
        documents.append(Document(page_content=split, metadata=metadata))
    #documents = text_splitter.create_documents(splits)
    return documents
drive.mount('/content/drive')
folder = "/content/drive/docs/pdf/"
for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    if os.path.isfile(file_path):
        print(f"Processing file: {file_path}")
        doc = fitz.open(file_path)
        # Start from an empty string for every file so that chunks never mix
        # text from different PDFs or repeat pages that were already added.
        doc_content = ''
        for page in doc:  # iterate the document pages
            doc_content += page.get_text()  # get the page's plain text
        d = generate_tokens(doc_content, file_path)
        print(d)
        acs.add_documents(documents=d)
print("Done.")
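# Assumes azure-search-documents >= 11.4.0, where VectorizedQuery is available.
from azure.search.documents.models import VectorizedQuery

query = "What are the advantages of an open-source ai model?"
# Embed the question with the same deployment used for the documents.
# k_nearest_neighbors=3 is an arbitrary illustrative choice.
vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="content_vector")
search_client = SearchClient(AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT, AZURE_COGNITIVE_SEARCH_INDEX_NAME, credential=AZURE_COGNITIVE_SEARCH_CREDENTIAL)
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["content_vector", "metadata"],
)
print(results)
for result in results:
    print(result)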
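The fields in Azure Cognitive Search are content_vector (for the vectors) and content (for the plain-text content). I have looked at a lot of code on GitHub, including code published by Microsoft, so I know this can be done, but I am clearly having trouble understanding how this part is implemented.
So I am looking for help or hints on how to extend this code so that I can interact with the content in Azure Cognitive Search through Azure OpenAI chat.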
1 Answer
What your code does so far is run a similarity search in Azure Cognitive Search and find the data that is relevant to your question.
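To make "similarity search" concrete, here is a toy sketch of the idea in plain Python: score every stored vector against the query vector and keep the best matches. The function names and the brute-force cosine scoring are illustrative assumptions; the real service does this inside the index, using its own algorithm and whatever metric the index is configured with.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: Sequence[float],
          indexed: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In your pipeline the vectors would come from the ada002 deployment and the texts would be the chunks stored in the content field.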
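The next step is to pass the query and this relevant data to the LLM so that it can answer the query based on that data. You do this by creating a prompt, filling it with the query and the retrieved data, and sending it to the LLM.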
Here is some code that does this:
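This is the classic Retrieval Augmented Generation (RAG) technique. I used it to build a simple application for querying the Azure documentation in natural language. The code above is based on the code I wrote for that application. You can read more about the application and see its source code here: https://github.com/gmantri/azure-docs-copilot.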
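The prompt-filling step described in this answer can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from the linked repository: the template wording and the names build_prompt, answer, retrieve and complete are all hypothetical. The two callables stand in for the search call and the chat model, so the sketch runs without any Azure resources.

```python
from typing import Callable, Iterable, List

# Hypothetical template; the exact wording is a matter of taste.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def build_prompt(question: str, passages: Iterable[str]) -> str:
    """Fill the template with the retrieved passages and the user's question."""
    context = "\n---\n".join(p.strip() for p in passages if p.strip())
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(question: str,
           retrieve: Callable[[str], List[str]],
           complete: Callable[[str], str]) -> str:
    """Retrieve relevant text, build the prompt, and ask the model."""
    passages = retrieve(question)
    return complete(build_prompt(question, passages))


# Stand-ins for the real services, used here only to show the flow.
demo = answer(
    "What are the advantages of an open-source ai model?",
    retrieve=lambda q: ["Open-source models can be inspected and fine-tuned."],
    complete=lambda prompt: f"(model reply to a prompt of {len(prompt)} characters)",
)
print(demo)
```

With the objects from the question, retrieve could wrap the vector search and return the content field of each hit, and complete could wrap a call to the gpt35 chat deployment. Keeping both as plain callables makes it easy to swap either side without touching the prompt logic.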