Python: scispaCy for biomedical named entity recognition (NER)

Posted by zlhcx6iw on 2023-02-07 in Python

How do I label entities using scispacy?

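When I try to use scispacy for NER, it tags biomedical entities as ENTITY but does not label them as gene, protein, and so on. How do I get typed labels with scispacy? Or is scispacy unable to label the data this way? A Jupyter notebook snippet was attached for reference.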

cfh9epnr #1

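The models en_core_sci_sm, en_core_sci_md, and en_core_sci_lg do not assign types to their entities; every entity gets the generic ENTITY label. If you need typed entities, use one of these models:

  • en_ner_craft_md
  • en_ner_jnlpba_md
  • en_ner_bc5cdr_md
  • en_ner_bionlp13cg_md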

Each model has its own set of entity types; see https://allenai.github.io/scispacy/ for more information.
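To make the "each model has its own entity types" point concrete, here is a minimal sketch of a lookup that tells you which model to install for a given label. The label sets for en_ner_craft_md, en_ner_jnlpba_md, and en_ner_bc5cdr_md are assumptions based on the scispaCy documentation, and `MODEL_LABELS` and `models_for_label` are hypothetical names used only for illustration. Check the real labels on your installed version with `nlp.get_pipe('ner').labels`.

```python
# Assumed label sets per model; verify them with nlp.get_pipe('ner').labels.
# The en_ner_bionlp13cg_md labels match the list printed later in this thread.
MODEL_LABELS = {
    "en_ner_craft_md": {"GGP", "SO", "TAXON", "CHEBI", "GO", "CL"},
    "en_ner_jnlpba_md": {"DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN"},
    "en_ner_bc5cdr_md": {"DISEASE", "CHEMICAL"},
    "en_ner_bionlp13cg_md": {
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI_TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    },
}


def models_for_label(label):
    """Return the sorted names of the models whose label set contains `label`."""
    return sorted(name for name, labels in MODEL_LABELS.items() if label in labels)


print(models_for_label("PROTEIN"))               # ['en_ner_jnlpba_md']
print(models_for_label("GENE_OR_GENE_PRODUCT"))  # ['en_ner_bionlp13cg_md']
```

For the gene/protein case in the question, this suggests en_ner_jnlpba_md or en_ner_bionlp13cg_md as starting points, assuming the label sets above are accurate for your version.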

cbwuti44 #2

You can filter on the label "GENE_OR_GENE_PRODUCT" to get all gene names.

import spacy
import scispacy
import en_ner_bionlp13cg_md

document = "We aimed to prospectively compare the risk of early progression according to circulating ESR1 mutations, CA-15.3, and circulating cell-free DNA in MBC patients treated with a first-line aromatase inhibitor (AI)"

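# Load the BioNLP13CG model, which assigns typed labels to entities
nlp = spacy.load("en_ner_bionlp13cg_md")
# Print only the entities labeled as genes or gene products
for ent in nlp(document).ents:
    if ent.label_ == "GENE_OR_GENE_PRODUCT":
        print(ent.text)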
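If you want the filtered names as a list rather than printed one by one, the same label check can be wrapped in a small helper. This is a minimal sketch, not part of scispacy: `texts_with_label` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for spaCy `Span` objects (only `.text` and `.label_` are used) so the snippet runs without a model installed. The labels on the stand-ins are made up for illustration and are not real model output.

```python
from types import SimpleNamespace


def texts_with_label(ents, wanted_labels):
    """Return unique entity texts whose label_ is in wanted_labels, in first-seen order."""
    seen = set()
    result = []
    for ent in ents:
        if ent.label_ in wanted_labels and ent.text not in seen:
            seen.add(ent.text)
            result.append(ent.text)
    return result


# Hypothetical stand-ins for spaCy Span objects; in real code pass nlp(document).ents
fake_ents = [
    SimpleNamespace(text="ESR1", label_="GENE_OR_GENE_PRODUCT"),
    SimpleNamespace(text="MBC", label_="CANCER"),
    SimpleNamespace(text="ESR1", label_="GENE_OR_GENE_PRODUCT"),
    SimpleNamespace(text="aromatase", label_="GENE_OR_GENE_PRODUCT"),
]

print(texts_with_label(fake_ents, {"GENE_OR_GENE_PRODUCT"}))  # ['ESR1', 'aromatase']
```

Passing a set of labels lets you collect several types at once, for example `{"GENE_OR_GENE_PRODUCT", "SIMPLE_CHEMICAL"}`.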

fjaof16o #3

Install the required modules

!pip install spacy
# Uncomment the next two lines if scispacy and the model are not installed yet
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
# !pip install scispacy

Load the packages

import scispacy, spacy
sci_nlp = spacy.load("en_ner_bionlp13cg_md")

Components of the NLP object

sci_nlp.component_names

Explore the entity labels

# Number the labels starting from 1
for c, label in enumerate(sci_nlp.get_pipe('ner').labels, start=1):
    print(c, "<==>", label)
# output
1 <==> AMINO_ACID
2 <==> ANATOMICAL_SYSTEM
3 <==> CANCER
4 <==> CELL
5 <==> CELLULAR_COMPONENT
6 <==> DEVELOPING_ANATOMICAL_STRUCTURE
7 <==> GENE_OR_GENE_PRODUCT
8 <==> IMMATERIAL_ANATOMICAL_ENTITY
9 <==> MULTI_TISSUE_STRUCTURE
10 <==> ORGAN
11 <==> ORGANISM
12 <==> ORGANISM_SUBDIVISION
13 <==> ORGANISM_SUBSTANCE
14 <==> PATHOLOGICAL_FORMATION
15 <==> SIMPLE_CHEMICAL
16 <==> TISSUE

x = "Med7 — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7: a transferable clinical natural language processing model for electronic health records."

docx = sci_nlp(x)
for ent in docx.ents:
    print(ent.text,ent.label_)
#output
Med7 GENE_OR_GENE_PRODUCT
Med7 GENE_OR_GENE_PRODUCT
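In the output above, "Med7" is tagged twice as GENE_OR_GENE_PRODUCT, even though the sentence describes it as an information extraction model. Tallying mentions per (text, label) pair makes repeated or suspicious tags like this easier to spot. The sketch below is only illustrative: `count_mentions` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the two entities printed above so the snippet runs without the model.

```python
from collections import Counter
from types import SimpleNamespace


def count_mentions(ents):
    """Count how often each (text, label) pair occurs among the entities."""
    return Counter((ent.text, ent.label_) for ent in ents)


# Stand-ins mirroring the two entities printed above; in real code pass docx.ents
stand_in_ents = [
    SimpleNamespace(text="Med7", label_="GENE_OR_GENE_PRODUCT"),
    SimpleNamespace(text="Med7", label_="GENE_OR_GENE_PRODUCT"),
]

for (text, label), n in count_mentions(stand_in_ents).most_common():
    print(text, label, n)
```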

Visualization

from spacy import displacy
displacy.render(docx,style='ent',jupyter=True)

#output
Med7 GENE_OR_GENE_PRODUCT — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7 GENE_OR_GENE_PRODUCT : a transferable clinical natural language processing model for electronic health records.
