MySQL query to get signal names while excluding anything inside URLs?

vwoqyblh  posted 12 months ago  in MySQL

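I'm trying to build lists of different technical terms from my forum: models, signal numbers, specific chipsets, and so on. The hardest part is the signals. Signals follow a format like PPV_SIGNAL_NAME or PP5_VDDIO_XO: groups of characters joined by underscores.
I wrote this MySQL query to pull every signal from every message in every thread on my forum and combine them into one CSV file, which works great for me. However, it also picks up junk from URLs, because URLs often have underscores between words. :(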

(SELECT DISTINCT TRIM(
    CASE
        WHEN CHAR_LENGTH(REGEXP_SUBSTR(message, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')) >= 3
        THEN REGEXP_SUBSTR(message, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')
    END
) FROM xf_post)
UNION
(SELECT DISTINCT TRIM(
    CASE
        WHEN CHAR_LENGTH(REGEXP_SUBSTR(title, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')) >= 3
        THEN REGEXP_SUBSTR(title, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')
    END
) FROM xf_thread)

INTO OUTFILE 'signal_names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';

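So if a message contains a URL like https://google.com/pixel_phone_4_sale/discount.html, the query puts pixel_phone_4_sale in my list, even though I don't want it there.
I can tell the query to ignore any message that contains https by adding WHEN message NOT LIKE '%https://%', but then it also misses every signal in that message. Say someone writes "If you want to read about PM_PWRMGMT_SLP_R, check out this website: https://power_rail_education.com": the whole message is excluded because it contains a URL.
Is what I'm asking for possible? Or should I give in and do it by hand? There are over 9,000 entries.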
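A small Python sketch makes the two failure modes concrete. The sample messages are made up for illustration, and the pattern mirrors the one in the query above (written with a non-capturing group so that matches come back whole); it is not the forum's actual data.

```python
import re

# Same shape as the pattern in the SQL query (assumed equivalent for illustration)
SIGNAL = re.compile(r'[A-Za-z0-9]{1,7}(?:_[A-Za-z0-9]{1,7})+')

# Hypothetical sample messages
messages = [
    "Selling mine: https://google.com/pixel_phone_4_sale/discount.html",
    "If you want to read about PM_PWRMGMT_SLP_R, check out this website: "
    "https://power_rail_education.com",
    "PP5_VDDIO_XO is shorted to ground",
]

# Naive extraction: underscore-joined URL path segments match as well
naive = [m.group(0) for msg in messages for m in SIGNAL.finditer(msg)]

# Whole-message filter: any message that contains a URL is skipped entirely,
# so real signals in that message are lost too
filtered = [
    m.group(0)
    for msg in messages
    if 'https://' not in msg
    for m in SIGNAL.finditer(msg)
]

print(naive)
print(filtered)
```

The naive list contains pixel_phone_4_sale alongside the real signals, while the filtered list loses PM_PWRMGMT_SLP_R because its message happens to mention a URL.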
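Edit: I caved. Insisting on doing this in MySQL was pig-headed, stubborn, and stupid of me. I'm trying again with Python. Let's see if it works. I'll report back if it does!!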

import pandas as pd
import pymysql
from bs4 import BeautifulSoup
import html
import re

# Function to remove bold & italics tags and URLs
def clean_text(text):
    text = re.sub(r'\[/?[bi]\]', '', text)  # Remove BBCode tags
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text = re.sub(url_pattern, '', text)    # Remove URLs
    return text

# Function to remove HTML tags
def remove_html_tags(text):
    return BeautifulSoup(text, 'html.parser').get_text()

# Function to decode HTML entities
def decode_html_entities(text):
    return html.unescape(text)

# Database connection parameters
db_params = {
    'host': '127.0.0.1',
    'user': 'me',
    'password': 'lookitsme',
    'db': 'randomdb'
}

# Initialize a list to store DataFrames
dataframes = []

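# Connect to the database before entering the try block, so that a failed
# connect does not reach the finally clause with `connection` undefined
connection = pymysql.connect(**db_params)

try: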

    # Fetch the list of thread_ids
    thread_ids_query = "SELECT thread_id FROM xf_thread;"
    thread_ids_df = pd.read_sql(thread_ids_query, connection)

    # Loop through each thread_id
    for thread_id in thread_ids_df['thread_id']:
        # Dynamic SQL query for each thread
        thread_query = f"""
        SELECT
            p.thread_id,
            t.title,
            p.post_id,
            p.message
        FROM
            xf_post AS p
        JOIN
            xf_thread AS t ON p.thread_id = t.thread_id
        WHERE
            p.thread_id = {thread_id};
        """
        # Fetch data for each thread
        df = pd.read_sql(thread_query, connection)

        # Data Cleaning
        df['message'] = df['message'].apply(remove_html_tags)
        df['message'] = df['message'].apply(decode_html_entities)
        df['message'] = df['message'].apply(clean_text)
        df['title'] = df['title'].apply(remove_html_tags)
        df['title'] = df['title'].apply(decode_html_entities)
        df['title'] = df['title'].apply(clean_text)

        # Append the DataFrame to the list
        dataframes.append(df)

finally:
    # Close the database connection
    connection.close()

# Function to find matches using regular expression
def find_regex_matches(text, pattern):
    if pd.isna(text):
        return []
    return re.findall(pattern, text)

# Concatenate all DataFrames in the list
all_threads_df = pd.concat(dataframes, ignore_index=True)

# Regular expression pattern
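# Non-capturing group (?:...) so that re.findall returns the whole match
# rather than only the last "_XXX" group
regex_pattern = r'[A-Za-z0-9]{1,10}(?:_[A-Za-z0-9]{1,10})+'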

# Search in 'title' and 'message' columns
titles_matches = all_threads_df['title'].apply(find_regex_matches, pattern=regex_pattern)
messages_matches = all_threads_df['message'].apply(find_regex_matches, pattern=regex_pattern)

# Flatten the lists and get distinct values
all_matches = set()
for matches_list in titles_matches:
    all_matches.update(matches_list)
for matches_list in messages_matches:
    all_matches.update(matches_list)

# Convert to DataFrame
distinct_matches_df = pd.DataFrame(list(all_matches), columns=['Distinct Values'])

# Save to CSV
distinct_matches_df.to_csv('signal_names.csv', index=False)

print("Distinct matches saved to signal_names.csv")
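
One detail in a script like this deserves a check: when a pattern contains a single capturing group, `re.findall` returns the text of that group instead of the whole match. The sketch below compares the two forms on a made-up sample string; the variable names are only for illustration.

```python
import re

text = "PM_PWRMGMT_SLP_R and PP5_VDDIO_XO"

# Capturing group: findall returns only the last repetition of the group
capturing = r'[A-Za-z0-9]{1,10}(_[A-Za-z0-9]{1,10})+'
# Non-capturing group: findall returns the full match
non_capturing = r'[A-Za-z0-9]{1,10}(?:_[A-Za-z0-9]{1,10})+'

with_group = re.findall(capturing, text)
without_group = re.findall(non_capturing, text)

print(with_group)     # ['_R', '_XO']
print(without_group)  # ['PM_PWRMGMT_SLP_R', 'PP5_VDDIO_XO']
```

Using `(?:...)`, or iterating with `re.finditer` and taking `m.group(0)`, keeps the full signal names.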

xqk2d5yq  #1

One way to remove all URLs from a passage of text is to regexp_replace() them with an empty string, and then apply regexp_substr() to the result:

select regexp_substr(regexp_replace(
       'If you want to read about PM_PWRMGMT_SLP_R, check out this website: https://power_rail_education.com',
                      'https://[a-z_0-9/.]+',''),'[A-Za-z0-9]+(_[A-Za-z0-9]+)+')

See this small snippet for a demo: https://extendsclass.com/mysql/e1ea79e
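
For comparison, the same strip-then-extract idea can be sketched in Python. One caveat on the SQL side: REGEXP_SUBSTR() returns a single occurrence per call (the first by default), so a message that mentions several signals would yield only one of them, whereas `re.findall` returns all of them. The URL pattern here is an assumption and broader than the one in the query above (http or https, any run of non-space characters), and `extract_signals` is a hypothetical helper name.

```python
import re

# Assumed URL pattern: http or https followed by any run of non-space characters
URL = re.compile(r'https?://\S+')
# Same signal shape as in the answer's query, with a non-capturing group
SIGNAL = re.compile(r'[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+')

def extract_signals(text):
    """Strip URLs first, then return every signal-like token that remains."""
    return SIGNAL.findall(URL.sub('', text))

msg = ("If you want to read about PM_PWRMGMT_SLP_R and PP5_VDDIO_XO, check out "
       "this website: https://power_rail_education.com")
signals = extract_signals(msg)
print(signals)  # ['PM_PWRMGMT_SLP_R', 'PP5_VDDIO_XO']
```

Removing the URL before matching keeps the rest of the message intact, so signals that share a message with a link are still collected.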
