haystack WhisperTranscriber to add filename to document metadata

Posted by dkqlctbz 4 months ago in Other


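It would be great to have an option to add the file name to the metadata of the documents that WhisperTranscriber creates. Currently there is no good way to do this. It would be very helpful when building a RAG pipeline in which you want to query videos and also cite the source video in the response.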

haystack

Source: https://github.com/deepset-ai/haystack/issues/5716

2 answers


liwlm1x91#

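Additional findings with @anakin87:
It seems that even if we try to add meta through the indexing pipeline, as shown below, the metadata is ignored. I think this may be because the root node (Whisper) ignores the metadata.
Indexing pipeline: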

whisper = WhisperTranscriber(api_key=api_key)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

# One meta dict per file; even in this form, the Whisper node still ignores meta.
indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])


disho6za2#

As Tuana said, meta is ignored.
For example, look at the run method:
haystack/haystack/nodes/audio/whisper_transcriber.py
Lines 176 to 186 in a5b8156
        :param meta: Ignored
        """
        transcribed_documents: List[Document] = []
        if file_paths:
            for file_path in file_paths:
                transcription = self.transcribe(file_path)
                d = Document.from_dict(transcription, field_map={"text": "content"})
                transcribed_documents.append(d)

        output = {"documents": transcribed_documents}
        return output, "output_1"
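
Since the run method above produces one document per file path, in order, one possible workaround is to attach the file name (and any per-file meta) after transcription, before the documents go on to the preprocessor. The following is only a minimal sketch under that assumption: `attach_file_meta` and `FakeDocument` are hypothetical names, not part of haystack, and the stand-in class models only the `content` and `meta` attributes so that the example runs without haystack installed.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Union


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a haystack Document (only content and meta)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def attach_file_meta(
    documents: List[Any],
    file_paths: Sequence[Union[str, Path]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Any]:
    """Add the source file name and optional per-file meta to each document.

    Assumes documents[i] was produced from file_paths[i], i.e. exactly one
    document per file, in the same order.
    """
    if len(documents) != len(file_paths):
        raise ValueError("expected exactly one document per file path")
    if meta is None:
        meta = [{} for _ in file_paths]
    elif isinstance(meta, dict):
        # A single dict is applied to every file.
        meta = [meta for _ in file_paths]
    elif len(meta) != len(file_paths):
        raise ValueError("meta must be one dict or one dict per file path")

    for doc, path, extra in zip(documents, file_paths, meta):
        doc.meta.update(extra)
        doc.meta["file_name"] = Path(path).name
    return documents


# Example with stand-in documents and made-up paths and URLs.
docs = [FakeDocument("first transcript"), FakeDocument("second transcript")]
attach_file_meta(
    docs,
    ["audio/a.mp3", "audio/b.mp3"],
    [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
)
```

With real haystack documents, the same helper could presumably be applied to the `documents` list returned by the Whisper node, provided each document exposes a mutable `meta` dict; that part is untested here.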
