I'm trying to build lists of the different technical terms used on my forum: model numbers, signal names, specific chipsets, and so on. The hardest part is the signals. Signals come in this format: PPV_SIGNAL_NAME or PP5_VDDIO_XO, i.e. tokens joined by underscores.
I wrote this MySQL query to grab every signal from every message in every post on my forum and merge them into one CSV file, and it works great for me. However, it also picks up junk from URLs, because URLs often have underscores between words. :(
(SELECT DISTINCT TRIM(
    CASE
        WHEN CHAR_LENGTH(REGEXP_SUBSTR(message, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')) >= 3
        THEN REGEXP_SUBSTR(message, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')
    END
) FROM xf_post)
UNION
(SELECT DISTINCT TRIM(
    CASE
        WHEN CHAR_LENGTH(REGEXP_SUBSTR(title, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')) >= 3
        THEN REGEXP_SUBSTR(title, '[A-Za-z0-9]{1,7}(_[A-Za-z0-9]{1,7})+')
    END
) FROM xf_thread)
INTO OUTFILE 'signal_names.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
So if there's a URL like https://google.com/pixel_phone_4_sale/discount.html, it puts pixel_phone_4_sale in my list, even though I don't want it there.
I could tell the query to ignore any message containing https by adding WHEN message NOT LIKE '%https://%', but that would also miss every signal in those messages. Say someone writes "If you want to read about PM_PWRMGMT_SLP_R, check out this site: https://power_rail_education.com"; the whole message would be excluded just because it contains a URL.
Is what I'm asking even possible? Or should I give in and do it all by hand? There are over 9,000 entries.
EDIT: I caved. Doing this in MySQL was too pigheaded, stubborn, and stupid of me. I'm trying again with Python. Let's see if this works. Will report back if it does!!
编辑:我洞穴,这样做在mysql是太Pig头,固执,和愚蠢的我。我尝试再次与python。让我们看看这是否工作。将报告回来,如果它做!!
import pandas as pd
import pymysql
from bs4 import BeautifulSoup
import html
import re

# Function to remove bold & italic BBCode tags and URLs
def clean_text(text):
    text = re.sub(r'\[/?[bi]\]', '', text)  # Remove BBCode tags
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text = re.sub(url_pattern, '', text)  # Remove URLs
    return text

# Function to remove HTML tags
def remove_html_tags(text):
    return BeautifulSoup(text, 'html.parser').get_text()

# Function to decode HTML entities
def decode_html_entities(text):
    return html.unescape(text)

# Database connection parameters
db_params = {
    'host': '127.0.0.1',
    'user': 'me',
    'password': 'lookitsme',
    'db': 'randomdb'
}

# Initialize a list to store DataFrames
dataframes = []

# Connect to the database (outside the try so the finally
# block never hits an undefined `connection`)
connection = pymysql.connect(**db_params)
try:
    # Fetch the list of thread_ids
    thread_ids_query = "SELECT thread_id FROM xf_thread;"
    thread_ids_df = pd.read_sql(thread_ids_query, connection)
    # Loop through each thread_id
    for thread_id in thread_ids_df['thread_id']:
        # Query for one thread (thread_id comes from the database itself)
        thread_query = f"""
            SELECT
                p.thread_id,
                t.title,
                p.post_id,
                p.message
            FROM
                xf_post AS p
            JOIN
                xf_thread AS t ON p.thread_id = t.thread_id
            WHERE
                p.thread_id = {thread_id};
        """
        # Fetch data for the thread
        df = pd.read_sql(thread_query, connection)
        # Data cleaning: strip HTML, decode entities, drop BBCode and URLs
        for col in ('message', 'title'):
            df[col] = df[col].apply(remove_html_tags)
            df[col] = df[col].apply(decode_html_entities)
            df[col] = df[col].apply(clean_text)
        # Append the DataFrame to the list
        dataframes.append(df)
finally:
    # Close the database connection
    connection.close()

# Function to find matches using a regular expression
def find_regex_matches(text, pattern):
    if pd.isna(text):
        return []
    return re.findall(pattern, text)

# Concatenate all DataFrames in the list
all_threads_df = pd.concat(dataframes, ignore_index=True)

# Regular expression pattern; the group must be non-capturing (?:...)
# or re.findall would return only the last "_suffix" group of each
# signal name instead of the whole match
regex_pattern = r'[A-Za-z0-9]{1,10}(?:_[A-Za-z0-9]{1,10})+'

# Search in 'title' and 'message' columns
titles_matches = all_threads_df['title'].apply(find_regex_matches, pattern=regex_pattern)
messages_matches = all_threads_df['message'].apply(find_regex_matches, pattern=regex_pattern)

# Flatten the lists and get distinct values
all_matches = set()
for matches_list in titles_matches:
    all_matches.update(matches_list)
for matches_list in messages_matches:
    all_matches.update(matches_list)

# Convert to a DataFrame (sorted for deterministic output)
distinct_matches_df = pd.DataFrame(sorted(all_matches), columns=['Distinct Values'])

# Save to CSV
distinct_matches_df.to_csv('signal_names.csv', index=False)
print("Distinct matches saved to signal_names.csv")
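One subtlety worth knowing about the extraction pattern: when a regex contains a capturing group, `re.findall` returns only the group's text, not the full match, so a pattern like `[A-Za-z0-9]{1,10}(_[A-Za-z0-9]{1,10})+` yields just the last underscore-suffix of each signal. A minimal sketch comparing the two forms (sample text is made up):

```python
import re

text = "check PM_PWRMGMT_SLP_R and PP5_VDDIO_XO here"

# With a plain capturing group, findall returns only the group's text,
# i.e. just the final "_suffix" piece of each signal name.
capturing = re.findall(r'[A-Za-z0-9]{1,10}(_[A-Za-z0-9]{1,10})+', text)
print(capturing)  # ['_R', '_XO']

# A non-capturing group (?:...) makes findall return the full match.
non_capturing = re.findall(r'[A-Za-z0-9]{1,10}(?:_[A-Za-z0-9]{1,10})+', text)
print(non_capturing)  # ['PM_PWRMGMT_SLP_R', 'PP5_VDDIO_XO']
```

`re.finditer` with `match.group(0)` is an equivalent alternative when you want the full match regardless of grouping.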
1 Answer
One way to handle this is to first remove all URLs from the text by regexp_replace()-ing them with an empty string, and only then apply the regexp_substr() function.
See a demo of this approach in this little snippet: https://extendsclass.com/mysql/e1ea79e.
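The same replace-then-extract idea carries over directly to the Python version: strip URLs first, then run the signal regex on what's left. A small sketch with a hypothetical message (the URL regex here is deliberately simple, matching anything after `http(s)://` up to whitespace):

```python
import re

# Hypothetical message mixing a real signal name with an underscore-laden URL.
message = ("If you want to read about PM_PWRMGMT_SLP_R, check out this site: "
           "https://power_rail_education.com")

# Step 1: replace URLs with an empty string (mirrors regexp_replace()).
no_urls = re.sub(r'https?://\S+', '', message)

# Step 2: extract the remaining underscore-joined tokens
# (mirrors regexp_substr(), with a non-capturing group for findall).
signals = re.findall(r'[A-Za-z0-9]{1,10}(?:_[A-Za-z0-9]{1,10})+', no_urls)
print(signals)  # ['PM_PWRMGMT_SLP_R']
```

Because the URL is deleted before extraction, the signal in the surrounding prose survives while pixel_phone_4_sale-style junk never reaches the list.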