regex: Finding emojis in strings

Posted by 4xrmg8kj on 2023-04-22 in Other


I'm trying to find and replace emojis in a string. This is my approach so far, using a regexp.

const replaceEmojis = function (string) {
    String.prototype.regexIndexOf = function (regex, startpos) {
        const indexOf = this.substring(startpos || 0).search(regex);
        return (indexOf >= 0) ? (indexOf + (startpos || 0)) : indexOf;
    }
    // generate regexp
    let regexp;
    try {
        regexp = new RegExp('\\p{Emoji}', "gu");
    } catch (e) {
        // for Firefox <3: fallback when \p{...} is not supported
        regexp = new RegExp(`(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])`, 'g');
    }

    // get indices of all emojis
    function getIndicesOf(searchStr, str) {
        let index, indices = [];

        function getIndex(startIndex) {
            index = str.regexIndexOf(searchStr, startIndex);
            if (index === -1) return;
            indices.push(index);
            getIndex(index + 1)
        }

        getIndex(0);

        return indices;
    }

    const emojisAt = getIndicesOf(regexp, string);

    // replace emojis with SVGs
    emojisAt.forEach(index => {
        // got nothing here yet
        // const unicode = staticHTML.charCodeAt(index); //.toString(16);
    })
}

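The problem with this is that I only get an array of the indices at which emojis occur in the string. With the indices alone I can't replace them, because I don't know how many UTF-16 code units each one occupies. Also, to replace them I need to know which emoji I'm replacing.
So, is there a way to also get the length of each emoji? Or is there a better (and maybe simpler) way to replace emojis than mine?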

regex

Source: https://stackoverflow.com/questions/61213515/finding-emojis-in-strings

3 Answers


ecbunoof #1

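Okay, it turns out I just had a bit of a mental block.
To find the emojis I don't need the indices, as WolverinDEV mentioned. However, using string.replace with /\p{Emoji}/gu alone doesn't work, because it splits 🙋🏻‍♂️ into 🙋, 🏻, and ♂. So I adjusted the regexp to account for this: /[\p{Emoji}\u200d]+/gu. Now the emoji is returned whole, because the zero-width joiner is included.
Here's what I ended up with (if anyone cares):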

const replaceEmojis = function (string) {
    // match() returns null when nothing matches, so fall back to an empty array
    const emojis = string.match(/[\p{Emoji}\u200d]+/gu) || [];
    // console.log(emojis);

    // replace emojis with SVGs
    emojis.forEach(emoji => {
        // get the unicodes of the emoji
        let unicode = "";

        function getNextChar(pointer) {
            const subUnicode = emoji.codePointAt(pointer);
            if (!subUnicode) return;
            unicode += '-' + subUnicode.toString(16);
            getNextChar(++pointer);
        }

        getNextChar(0);

        unicode = unicode.slice(1); // remove the leading dash '-'
        console.log(unicode.toUpperCase());

        // replace emoji here
        // string = string.replace(emoji, `<svg src='path/to/svg/${unicode}.svg'>`)
    })

    return string;
}

This still needs work (for example, the printed code points include low surrogates), but basically it works.

Edit:

First improvement:

You may not need this, but to get rid of the low surrogates, add a condition in getNextChar():

if (!(subUnicode >= 56320 && subUnicode <= 57343)) unicode += '-' + subUnicode.toString(16);

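This appends the code only when it is not a low surrogate (U+DC00 to U+DFFF, i.e. 56320 to 57343).

Second improvement:

Add variation selector 16 (U+FE0F) to the regexp so that more emojis are selected as a whole: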

/[\p{Emoji}\u200d\ufe0f]+/gu
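
As an alternative to filtering low surrogates by hand, the string can be walked by code point instead of by UTF-16 code unit. The following is only a minimal sketch; the helper name emojiToCodePoints is made up for illustration and is not part of the answer above.

```javascript
// Hypothetical helper: turn one matched emoji into a hyphen-joined list of
// hexadecimal code points, e.g. for building an SVG file name.
function emojiToCodePoints(emoji) {
  // Array.from iterates a string by code point, so a surrogate pair
  // arrives as a single element and no low surrogate is listed separately.
  return Array.from(emoji, (ch) => ch.codePointAt(0).toString(16).toUpperCase()).join('-');
}

// U+1F64B U+1F3FB U+200D U+2642 U+FE0F (the emoji used as the example above)
const raisedHand = '\u{1F64B}\u{1F3FB}\u200D\u2642\uFE0F';
console.log(emojiToCodePoints(raisedHand)); // 1F64B-1F3FB-200D-2642-FE0F
```

Because the iteration never lands in the middle of a surrogate pair, no range check on 56320 to 57343 is needed with this approach.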


tkclm6bt #2

You already have a working RegExp, so you can just use String.replace:

string.replace(regexp, my_emojy => { 
    return "<an emoji was here>";
});

So you don't need to find any indices at all.
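
If the position and length are still wanted, the replacer callback of String.prototype.replace also receives the offset of each match, and the matched text carries its own length. A rough sketch, assuming the pattern from the first answer; the names emojiPattern, replaceEmojis and toReplacement are made up for illustration:

```javascript
// Pattern taken from the first answer; note that \p{Emoji} also matches
// digits, '#' and '*', so the sample text below avoids them.
const emojiPattern = /[\p{Emoji}\u200d\ufe0f]+/gu;

// Hypothetical wrapper: toReplacement gets the emoji, its index in the
// original string, and its length in UTF-16 code units.
function replaceEmojis(text, toReplacement) {
  return text.replace(emojiPattern, (match, offset) =>
    toReplacement(match, offset, match.length));
}

const seen = [];
const replaced = replaceEmojis('hi \u{1F600}!', (emoji, index, length) => {
  seen.push({ emoji, index, length });
  return '<emoji>';
});

console.log(replaced); // hi <emoji>!
console.log(seen);     // one entry: index 3, length 2
```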


tf7tbtn2 #3

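First of all: \p{Emoji} is not what you need.

Which length-one characters does `\p{Emoji}` match?

I assume we're working within the first *Unicode plane*, which contains all the characters we "commonly" use. That is more than 65,500 *code points*, so let's use JavaScript to list the ones that match \p{Emoji}: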

console.log(...(new Array(2 ** 16)).fill(null).reduce((characters, _, i) => characters.concat(String.fromCodePoint(i)), '').match(/\p{Emoji}/gu));

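Fortunately, it's easy to pick out of that result the characters we care about here, the ones that match \p{Emoji} but that we don't want to treat as emojis: #*0123456789.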
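
The same check can be narrowed to the ASCII range to make those characters stand out. This is only a small sketch; the variable name asciiEmojiMatches is made up for illustration.

```javascript
// Collect every ASCII character (U+0000 to U+007F) that \p{Emoji} matches.
const asciiEmojiMatches = Array.from({ length: 128 }, (_, i) => String.fromCodePoint(i))
  .filter((ch) => /\p{Emoji}/u.test(ch))
  .join('');

console.log(asciiEmojiMatches); // #*0123456789
```

These characters carry the Emoji property because they are the bases of keycap sequences such as 1️⃣, which is why a plain \p{Emoji} match also hits ordinary digits.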

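How to match emojis correctly

Actually, the *Unicode property* Emoji isn't meant for this; see Unicode® Standard Annex #44 - UNICODE CHARACTER DATABASE - Property Definitions (Emoji Data). Yes, it does match emojis, but we're also asking it to match several emojis combined into one. That is a job for a different regular expression, the one described in Unicode® Technical Standard #51 - UNICODE EMOJI - EBNF and Regex.
Based on it, we can build this ugly but working emoji regex: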

const emojiRegex = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;
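
To make the pattern easier to read, it can be assembled from named pieces. This is a sketch of the same structure, not a different algorithm; the names flag, element and emojiSequence are made up for illustration, and non-capturing groups are used where the original has capturing ones.

```javascript
// A flag is a pair of regional indicator symbols.
const flag = String.raw`\p{RI}\p{RI}`;

// One emoji element: a base character followed by an optional suffix, which is
// a skin-tone modifier, a variation selector (optionally with a keycap mark),
// or a run of tag characters closed by the cancel tag.
const element = String.raw`\p{Emoji}(?:\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?`;

// Either a flag, or an element followed by any number of
// zero-width-joiner (U+200D) continuations.
const emojiSequence = new RegExp(
  `${flag}|${element}(?:\\u200D(?:${flag}|${element}))*`,
  'gu'
);

// Two flags, one keycap, and one joined sequence, built from escapes.
const twoFlags = '\u{1F1E6}\u{1F1E8}\u{1F1E6}\u{1F1E9}';
const keycapOne = '1\uFE0F\u20E3';
const joined = '\u{1F64B}\u{1F3FB}\u200D\u2642\uFE0F';

console.log(twoFlags.match(emojiSequence).length); // 2
console.log(keycapOne.match(emojiSequence).length); // 1
console.log(joined.match(emojiSequence).length); // 1
```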

Answer

Putting it all together:

// Characters that match \p{Emoji} but should be left alone when they stand on their own
const emojiBlacklist = '#*0123456789';
const emojiRegex = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;

function replaceAllEmojis(string) {
  const emojis = (string.match(emojiRegex) || []).filter(emoji => !emojiBlacklist.includes(emoji));

  if (emojis.length === 0) {
    return string; // Nothing to do here.
  }

  let noEmojis = string;

  for (const emoji of emojis) {
    noEmojis = noEmojis.replace(emoji, '');
  }

  return noEmojis;
}

// Mixed:
console.log(replaceAllEmojis('🥰🥰🥰🥰🥰😍😘😗☺😚🏴󠁧󠁢󠁳󠁣󠁴󠁿🥲123 !@#$%^asd⚕♻⚜🔱🔰☑✔❌〽✳©®™🇦🇨🇦🇩🇦🇪🇦🇱🇦🇲🇦🇴#️⃣*️⃣0️⃣1️⃣2️⃣🤷‍♂️🤷🏼‍♂️🤷🏿‍♂️🧑🏿‍❤️‍💋‍🧑🏼🧑🏿‍❤️‍💋‍🧑🏽'));

// No-emojis only:
console.log(replaceAllEmojis('#*0123456789'));

// Emojis only:
console.log(replaceAllEmojis('🥰🏴󠁧󠁢󠁳󠁣󠁴󠁿🔱🔰✔❌🧑🏿‍❤️‍💋‍🧑🏽'));

This implementation is just a demo; tweak or improve it as needed.
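
One possible tweak is to do the work in a single pass with a replacer callback, so that each match is replaced exactly where it was found instead of being searched for again. A sketch under that assumption; the names keepAsText and stripEmojis are made up for illustration.

```javascript
// Standalone characters that match \p{Emoji} but should stay in the text.
const keepAsText = '#*0123456789';
// Same pattern as in the answer above.
const emojiRegex = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;

// Hypothetical single-pass variant: keep bare digits, '#' and '*',
// and swap every other match for the given replacement.
function stripEmojis(text, replacement = '') {
  return text.replace(emojiRegex, (match) =>
    keepAsText.includes(match) ? match : replacement);
}

console.log(stripEmojis('a1#\u{1F600}b'));       // a1#b
console.log(stripEmojis('1\uFE0F\u20E3 ok'));    // " ok" (the keycap is removed)
console.log(stripEmojis('x\u{1F600}y', '[e]'));  // x[e]y
```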
