Unix AWK: extracting stems and suffixes

Posted by t1qtbnec on 2022-11-04 in Unix

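I have a semicolon-delimited CSV file. It contains a Danish dictionary from which I need to extract stems and suffixes. I need to do this with AWK!
The file: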

adelig;adelig;adj.;1
adelig;adelige;adj.;2
adelig;adeligt;adj.;3
adelig;adeligst;adj.;5
voksen;voksen;adj.;1
voksen;voksne;adj.;2
voksen;voksent;adj.;3
voksen;voksnest;adj.;5
virkemiddel;virkemiddel;sb.;1
virkemiddel;virkemidlet;sb.;2
virkemiddel;virkemidlets;sb.;3
virkemiddel;virkemiddels;sb.;4
virkemiddel;virkemidlerne;sb.;5
virkemiddel;virkemidlernes;sb.;6
virkemiddel;virkemiddel;sb.;7
virkemiddel;virkemidler;sb.;7
virkemiddel;virkemiddels;sb.;8
virkemiddel;virkemidlers;sb.;8

Expected output:

adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers

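The fourth column is the form number. If a form is missing, its suffix is replaced by an asterisk, e.g. adelig;adelig; ,e,t,*,st
If a form number is repeated, the suffixes are separated by semicolons, e.g. virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers
I started writing this code, but I could not work out an algorithm that handles multiple possible stems.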

BEGIN{
FS=";"
}

{

    lemm=$1;
    form=$2;

    if(match(form, lemm) > 0)
    {
        root=lemm;
        sub(root,"",form);
        suf[$1]=suf[$1]","form;
    }
    else
    {
        split($1,a,"");
        split($2,b,"");

        s="";
        for(i in a)
        { 
            if(b[i]!=a[i])
            {
                break;
            }
            s = s "" a[i];
        }
    }
    root=s;

}
Answer #1 (aemubtdh)

Here is some awk code that finds the length of the common prefix and builds the list of suffixes.


#!/usr/bin/gawk -f

BEGIN { FS = OFS = ";" }
{ words[$1] = words[$1] FS $2 }
END {
    for (word in words) {
        sub("^"FS, "", words[word])
        num_words = split(words[word], these_words)
        prefix_length = common_prefix_length(these_words, num_words)

        suffixes = ""
        sep = ""
        for (i=1; i<=num_words; i++) {
            suffixes = suffixes sep substr(these_words[i],prefix_length+1)
            sep = ","
        }
        print word, substr(these_words[1], 1, prefix_length), suffixes
    }
}

function common_prefix_length(w, n                 ,i,j,minlen, char) {
    minlen = length(w[1])
    for (i=2; i<=n; i++) 
        if (length(w[i]) < minlen)
            minlen = length(w[i])

    for (i=1; i <= minlen; i++) {
        char = substr(w[1], i, 1)
        for (j=2; j <= n; j++)
            if (substr(w[j], i, 1) != char)
                return i-1
    }
    return minlen
}

With your input, the output is

voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
adelig;adelig;,e,t,st
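
The output above lists every suffix in input order and does not yet use the fourth column. As a minimal sketch of how the form numbers could drive the `*` and `;` placeholders that the question asks for, the Python below places each suffix in the slot given by its form number. The helper name `format_slots` and the sample list are illustrative and not part of the answer.

```python
def format_slots(forms, prefix_len):
    """forms: list of (word, form_number) pairs for one lemma.

    Returns one comma-separated slot per form number from 1 to the highest
    number seen: '*' for a missing number, ';' between suffixes that share
    the same number.
    """
    by_number = {}
    for word, number in forms:
        # Strip the common prefix and file the suffix under its form number.
        by_number.setdefault(number, []).append(word[prefix_len:])
    highest = max(by_number)
    slots = []
    for number in range(1, highest + 1):
        slots.append(";".join(by_number.get(number, ["*"])))
    return ",".join(slots)


adelig = [("adelig", 1), ("adelige", 2), ("adeligt", 3), ("adeligst", 5)]
print(format_slots(adelig, len("adelig")))  # prints ",e,t,*,st"
```

Here the empty first slot comes out as an empty string rather than the space shown in the question; whether to print a space there is a formatting choice left open.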
Answer #2 (v09wglhw)

This may be a good starting point in Python; it uses os.path.commonprefix to get the stem from a list of words.

import csv
import os

file = "a"
data = {}  # lemma -> list of inflected forms, in input order

with open(file, newline="") as f:
    csv_reader = csv.DictReader(
        f,
        delimiter=";",
        fieldnames=['common', 'word', 'type', 'num'],
    )
    for row in csv_reader:
        # Collect every form under its lemma, including the first
        # form of each new lemma.
        data.setdefault(row['common'], []).append(row['word'])

for word, values in data.items():
    common = os.path.commonprefix(values)
    suffixes = [i[len(common):] for i in values]
    # An empty suffix (form identical to the stem) is shown as '*'.
    suffixes = [i if len(i) else '*' for i in suffixes]
    print("%s;%s;%s" % (word, common, ','.join(suffixes)))

It returns:

adelig;adelig;*,e,t,st
voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
Answer #3 (eqoofvh9)

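Three solutions in TXR. The first uses the extraction language to build an explicit, structure-based data model and then processes the structures: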

@(do
   (defstruct inflection ()
     word type index)

   (defstruct dict-entry ()
     root variants max-index))
@(collect :vars (dict))
@  (all)
@word;@(skip)
@  (and)
@    (collect :gap 0 :vars (infl))
@word;@variant;@type;@index
@      (bind infl @(new inflection word variant type type index (toint index)))
@    (end)
@    (bind dict @(new dict-entry root word variants infl
                      max-index [find-max-key infl > .index]))
@  (end)
@(end)
@(do (each ((d dict))
       (let* ((vs (mapcar .word d.variants))
              (prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) vs))
              (plen (len prefix))
              (prefix [(first vs) 0..plen]))
         (put-string `@{d.root};@prefix; `)
         (each ((i (range 2 d.max-index)))
           (let ((vlist [keepql i d.variants .index]))
             (put-char #\,)
             (put-string
               (if (null vlist)
                 "*"
                 [cat-str (mapcar (ret [@1.word plen..:]) vlist) ";"]))))
         (put-line))))

Run:

$ txr stems.txr data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers

Note the subtle difference:

virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
                    ^

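This semicolon does not appear in the originally requested output; no rationale for excluding it was given, so for now it is treated as a typo.
The expression (ret [@1 0..(mismatch @1 @2)]) produces a two-argument function that returns the common prefix of a pair of strings. To get the common prefix of a list of strings, we use this function as the kernel of reduce-left.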
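
For readers less familiar with TXR, the same fold can be sketched in Python: a two-argument function computes the common prefix of a pair of strings, and `functools.reduce` applies it across the list. The function names are illustrative, and the sketch assumes a non-empty list of words.

```python
from functools import reduce


def common_prefix_pair(a, b):
    """Return the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_prefix(words):
    """Fold the pairwise function over a non-empty list of strings."""
    return reduce(common_prefix_pair, words)


print(common_prefix(["voksen", "voksne", "voksent", "voksnest"]))  # prints "voks"
```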
The second version, without the data structuring. It produces the same output on data:

@(repeat)
@  (all)
@word;@(skip)
@  (and)
@    (collect :gap 0)
@word;@variant;@type;@strindex
@      (bind index @(toint strindex))
@    (end)
@    (do
       (let* ((prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) variant))
              (plen (len prefix))
              (max-index [find-max index])
              (v-i-pairs (zip variant index)))
        (put-string `@word;@prefix; `)
        (each ((i (range 2 max-index)))
          (let ((vlist [keepql i v-i-pairs second]))
            (put-char #\,)
            (put-string
              (cat-str (or (mapcar (aret [@1 plen..:]) vlist)
                           '("*"))
                       ";"))))
        (put-line)))
@  (end)
@(end)

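A pure TXR Lisp solution that does not use the extraction language. One giant expression reads the input lines, splits them, converts the fourth field to an integer, groups the entries by root, and so on: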

(flow
  (get-lines)
  (keep-matches (`@a;@b;@c;@d` @1)
    (list a b c (toint d)))
  (partition-by first)
  (mapcar transpose)
  (mapdo (tb ((word variant type index))
           (let* ((prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) variant))
                  (plen (len prefix))
                  (max-index [find-max index])
                  (v-i-pairs (zip variant index)))
             (put-string `@(first word);@prefix; `)
             (each ((i (range 2 max-index)))
               (let ((vlist [keepql i v-i-pairs second]))
                 (put-char #\,)
                 (put-string
                   (cat-str (or (mapcar (aret [@1 plen..:]) vlist)
                                '("*"))
                            ";"))))
             (put-line)))))

Run:

$ txr stems3.tl < data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
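
The shape of that pipeline (parse each line, convert the fourth field, partition consecutive records by lemma, transpose each group into columns) can be sketched in Python for comparison. This only illustrates the data flow and is not a translation of the TXR code. The function name `grouped_columns` is hypothetical, and the sketch assumes that all records for one lemma are adjacent in the input.

```python
from itertools import groupby


def grouped_columns(lines):
    """Yield (lemma, variants, types, indices) for each run of equal lemmas."""
    records = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) != 4:
            continue  # skip lines that do not have exactly four fields
        lemma, variant, pos, index = fields
        records.append((lemma, variant, pos, int(index)))
    for lemma, group in groupby(records, key=lambda r: r[0]):
        # zip(*group) transposes the rows of one group into columns.
        _, variants, types, indices = zip(*group)
        yield lemma, list(variants), list(types), list(indices)


sample = [
    "adelig;adelig;adj.;1",
    "adelig;adelige;adj.;2",
    "voksen;voksen;adj.;1",
]
for entry in grouped_columns(sample):
    print(entry)
```

Each yielded tuple holds the variant, type, and index columns side by side, so the prefix and per-index suffix steps can work on plain lists.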
Answer #4 (xienkqul)

This is the code with which I get the expected result. The comments in the code mark the main changes to glenn's code.

BEGIN {
FS=OFS=";"
}

{ 
    words[$1";"$3] = words[$1";"$3] FS $2;
    num[$1";"$3]=num[$1";"$3] $4 FS; #Array to store numbers in the fourth column by two ID's
}

END {
    for (item in words) {
        sub("^"FS, "", words[item]);
        words_n = split(words[item], extrac);
        split(num[item],numbers); #Extract numbers one by one, in order to compare them.
        split(item,cab,";");
        long = extract_stem(extrac, words_n);

        suffix = "";
        sep = ",";

        for (i=1; i<=words_n; i++)
        {
            suf=substr(extrac[i],long+1)
            if(suf!="") #Avoid null values from suffixes.
            {
                suffix = suffix sep suf;
            }

            if(numbers[i]==numbers[i+1]) #Compare numbers with the next number
            {
                sep=";";
            }
            else if((numbers[i+1]-numbers[i])!= 1) #Subtract numbers to its previous number
            {
                sep=",*,";
            }
            else
            {
                sep=",";
            }
        }
        print cab[1], substr(extrac[1], 1, long), " "suffix
    }
}

function extract_stem(wrd, nmr ,i,j,min, chr) { #This is the magic of glenn jackman!
    min = length(wrd[1])
    for (i=2; i<=nmr; i++)
    {
        if (length(wrd[i]) < min)
        {
            min = length(wrd[i]);
        }
    }

    for (i=1; i <= min; i++)
    {
        chr = substr(wrd[1], i, 1)
        for (j=2; j <= nmr; j++)
        {
            if (substr(wrd[j], i, 1) != chr)
            {
                return i-1;
            }
        }
    }
    return min
}

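I had to modify the code because there was a case I had not considered: the same lemma can occur with more than one part of speech, as with sb. and vb. below.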

abe;abe;sb.;1
abe;aben;sb.;2
abe;abens;sb.;3
abe;abes;sb.;4
abe;aberne;sb.;5
abe;abernes;sb.;6
abe;aber;sb.;7
abe;abers;sb.;8
abe;abe;vb.;1
abe;ab;vb.;2
abe;abet;vb.;3
abe;aber;vb.;4
abe;abede;vb.;6
abe;abes;vb.;7
abe;abedes;vb.;8
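
Keying the arrays on the lemma plus the third column, as the code above does with `$1";"$3`, keeps the sb. and vb. paradigms of abe apart. A minimal Python sketch of the same idea uses a tuple as the dictionary key; the function name `group_by_lemma_and_type` is illustrative.

```python
def group_by_lemma_and_type(lines):
    """Map (lemma, part_of_speech) to a list of (form, form_number) pairs."""
    groups = {}
    for line in lines:
        lemma, form, pos, number = line.strip().split(";")
        # The compound key separates paradigms that share a lemma.
        groups.setdefault((lemma, pos), []).append((form, int(number)))
    return groups


sample = [
    "abe;abe;sb.;1",
    "abe;aben;sb.;2",
    "abe;abe;vb.;1",
    "abe;ab;vb.;2",
]
for key, forms in group_by_lemma_and_type(sample).items():
    print(key, forms)
```

With a single key on the lemma alone, all four sample rows would land in one group and the common prefix would be computed across both paradigms.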
