Getting matching and non-matching words from strings in pandas / Python

Posted by rt4zxlrg on 2023-11-15 in Python

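I want to compare strings for similarity within a single pandas column. Based on the value in the row above or below, I want to derive one string containing all the matching words and a second string containing all the non-matching words.
Whether to match against the row above or the row below should depend on which of the two shares more words with the current row.
Here is a small example:
Before:
| Product Description |
| -- |
| Red HMS Carabiner |
| Blue HMS Carabiner |
| HMS Carabiner Orange |
| Liquid Chalk - 100ml |
| Liquid Chalk - 100ml (Case of 10) |
After:
| Product Description | Variance |
| -- | -- |
| HMS Carabiner | Red |
| HMS Carabiner | Blue |
| HMS Carabiner | Orange |
| Liquid Chalk - 100ml | NaN |
| Liquid Chalk - 100ml | (Case of 10) |
I am not sure where to start, so I apologize for not including an attempted solution.
Thanks in advance.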

fnatzsnv #1

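1. Compare the set of words in each description with the previous product and with the next product
1. Pick the neighbor that has the larger intersection
1. Take the words that are not in the intersection as the variance
This works for your example, but it may give odd output when a product description contains the same word more than once.
I also assume that if a description has no words in common with either neighbor, it belongs to no group and therefore has no variance.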

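import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "Product Description": [
            "Red HMS Carabiner", "Blue HMS Carabiner", "HMS Carabiner Orange",
            "Liquid Chalk - 100ml", "Liquid Chalk - 100ml (Case of 10)"
        ]
    }
)

def get_intersection(descr1, descr2):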
    if pd.isna(descr1) or pd.isna(descr2):
        return set()
    return set(descr1.split()).intersection(set(descr2.split()))

def get_unique_words(descr, intersection):
    unique_words = " ".join(
        word for word in descr.split() if word not in intersection
    )
    if len(unique_words) > 0:
        return unique_words

def get_unique_description(row):
    if len(row["next_product_intersection"]) == 0 and len(row["prev_product_intersection"]) == 0:
        return row["Product Description"]
    
    if len(row["next_product_intersection"]) >= len(row["prev_product_intersection"]):
        return row["next_product_unique_words"]
    
    return row["prev_product_unique_words"]

df["next_product"] = df["Product Description"].shift(-1)
df["prev_product"] = df["Product Description"].shift(1)

df["next_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["next_product"]),
    axis=1
)
df["prev_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["prev_product"]),
    axis=1
)

df["next_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["next_product_intersection"]),
    axis=1
)
df["prev_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["prev_product_intersection"]),
    axis=1
)

df["Variance"] = df.apply(get_unique_description, axis=1)
df = df[["Product Description", "Variance"]]
print(df)

Output:

                 Product Description      Variance
0                  Red HMS Carabiner           Red
1                 Blue HMS Carabiner          Blue
2               HMS Carabiner Orange        Orange
3               Liquid Chalk - 100ml          None
4  Liquid Chalk - 100ml (Case of 10)  (Case of 10)
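The question also asks for a string of the matching words, which the output above does not include. As a minimal sketch (the helper name `split_common_and_variance` is hypothetical and not part of the answer), the same set-based comparison can return both parts for one description and a chosen neighbor, keeping the original word order:

```python
def split_common_and_variance(descr, neighbor):
    """Split `descr` into the words it shares with `neighbor` and the rest.

    Hypothetical helper for illustration; word order of `descr` is preserved.
    """
    neighbor_words = set(neighbor.split())
    words = descr.split()
    common = " ".join(w for w in words if w in neighbor_words)
    variance = " ".join(w for w in words if w not in neighbor_words)
    return common, variance


print(split_common_and_variance("Red HMS Carabiner", "Blue HMS Carabiner"))
# ('HMS Carabiner', 'Red')
print(split_common_and_variance("Liquid Chalk - 100ml (Case of 10)", "Liquid Chalk - 100ml"))
# ('Liquid Chalk - 100ml', '(Case of 10)')
```

Choosing which neighbor to pass in would still follow the larger-intersection rule described in the steps above.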

fnvucqvd #2

Here is one way to solve your problem:

import difflib
import pandas as pd

def get_products_variants(df: pd.DataFrame) -> pd.DataFrame:
    """
    Get variants for every product description.

    Parameters
    ----------
    df : pd.DataFrame
        A `pandas.DataFrame` with a column named "Product Description",
        that contains every existing product description.

    Returns
    -------
    pd.DataFrame
        A `pandas.DataFrame` with the columns:
            - "Product Description": The common name between variants of same product.
            - "Variance": The product variant.

    Examples
    --------
    >>> test_df = pd.DataFrame(
    ...     {
    ...         "Product Description": [
    ...             "Red HMS Carabiner", "Blue HMS Carabiner", "HMS Carabiner Orange",
    ...             "Liquid Chalk - 100ml", "Liquid Chalk - 100ml (Case of 10)"
    ...         ]
    ...     }
    ... )
    >>> get_products_variants(test_df)
         Product Description      Variance
    0          HMS Carabiner           Red
    1          HMS Carabiner          Blue
    2          HMS Carabiner        Orange
    9   Liquid Chalk - 100ml              
    10  Liquid Chalk - 100ml  (Case of 10)
    """
    new_data = []

    for description in df["Product Description"].values:
        # Split product descriptions into different words
        words = description.split(" ")
        # Find product descriptions that are closely related to the current
        # product description.
        matches = difflib.get_close_matches(description, df["Product Description"].values)

        # Iterate the descriptions that are similar to current product description
        # to find the words that exist in all product descriptions
        for descr in matches:
            descr_words = descr.split(" ")
            _words = []
            for word in descr_words:
                if word in words:
                    _words.append(word)
            words = _words

        # This new description is comprised of words that exist in all product descriptions
        # that have similar names.
        new_description = " ".join(words)

        # Find the words that are unique to each product description.
        # These words will be the variants
        for descr in matches:
            variance = descr.replace(new_description, "").strip()
            new_data.append([new_description, variance])

    # Create new dataframe with results
    return pd.DataFrame(new_data, columns=["Product Description", "Variance"]).drop_duplicates()

df = pd.DataFrame(
    {
        "Product Description": [
            "Red HMS Carabiner", "Blue HMS Carabiner", "HMS Carabiner Orange",
            "Liquid Chalk - 100ml", "Liquid Chalk - 100ml (Case of 10)"
        ]
    }
)
new_df = get_products_variants(df)
new_df
# Returns:
#
#      Product Description      Variance
# 0          HMS Carabiner           Red
# 1          HMS Carabiner          Blue
# 2          HMS Carabiner        Orange
# 9   Liquid Chalk - 100ml              
# 10  Liquid Chalk - 100ml  (Case of 10)


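Output:

| | Product Description | Variance |
| -- | -- | -- |
| 0 | HMS Carabiner | Red |
| 1 | HMS Carabiner | Blue |
| 2 | HMS Carabiner | Orange |
| 9 | Liquid Chalk - 100ml | |
| 10 | Liquid Chalk - 100ml | (Case of 10) |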
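One case that may be worth checking is a description in which the shared words are not adjacent. The function removes the common name with `str.replace`, which only matches a contiguous substring. The sketch below compares that with a word-level filter; `strip_common_words` and the input "HMS Red Carabiner" are hypothetical and do not come from the question:

```python
def strip_common_words(descr, common):
    """Drop every word of `descr` that also appears in `common` (hypothetical helper)."""
    common_words = set(common.split(" "))
    return " ".join(w for w in descr.split(" ") if w not in common_words)


descr = "HMS Red Carabiner"  # made-up description with the variant in the middle
common = "HMS Carabiner"     # common name shared by the product group

# Substring removal finds nothing to remove, because "HMS Carabiner"
# does not occur contiguously in the description.
print(repr(descr.replace(common, "").strip()))  # 'HMS Red Carabiner'

# Word-level filtering still isolates the variant.
print(repr(strip_common_words(descr, common)))  # 'Red'
```

The word-level version has its own trade-off: it also drops a variant word that happens to appear in the common name.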

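Notes

To find similar product descriptions, we use the get_close_matches function from the difflib module, which is part of the Python standard library. Depending on the product descriptions, get_close_matches may not find every similar description for a given product. In other words, the solution above is not guaranteed to work in all cases.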
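One concrete reason matches can be missed is that `difflib.get_close_matches` returns at most `n` results (default 3) scoring at least `cutoff` (default 0.6). The sketch below uses a made-up group of four variants to show the effect of `n`; the extra colors are assumptions for illustration only:

```python
import difflib

# Hypothetical product group with four variants
descriptions = [
    "Red HMS Carabiner",
    "Blue HMS Carabiner",
    "Green HMS Carabiner",
    "Black HMS Carabiner",
]

# With the defaults (n=3, cutoff=0.6) at most three matches are returned,
# so one member of the group is left out.
default_matches = difflib.get_close_matches("Red HMS Carabiner", descriptions)

# Raising n to the number of candidates lets every description that
# clears the cutoff through.
all_matches = difflib.get_close_matches(
    "Red HMS Carabiner", descriptions, n=len(descriptions), cutoff=0.6
)

print(len(default_matches), len(all_matches))  # 3 4
```

Passing `n=len(df)` inside `get_products_variants` would be one way to apply this, though very different descriptions can still fall below `cutoff`.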
