regex: Can't write the correct set of regular expressions

Posted by lymgl2op 11 months ago in Other


I'm using:

Python 3.11.1
Windows 10 Pro
requests 2.31.0
beautifulsoup4 4.12.2
pandas 2.1.2
Jupyter (I'm writing in Jupyter, but I'll finish the code in PyCharm)

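I'm fetching text via an HTTP request from my college's website, where they publish the class schedule. I get the data, but in a scattered order, as you can see in the image and the text file (links below). I can't write a regular expression that makes the text readable. Please help me solve this.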

from bs4 import BeautifulSoup
import re
import requests
import pandas as pd

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSNR-Gvp7MBcYQo0GM5nU3UC7DSIGMCwKq-eQGIY_alqORpe1pvZ00PI63wNuOyiJbZI_AP6nSeWWop/pubhtml'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')  # name the parser explicitly

table = soup.find_all('table')[0]

world_titles = table.find_all('td')

world_table_titles = [title.text.strip() for title in world_titles]

# here I can't write reg exp to make text readable
clean_titles = [re.sub(r'[",\s]+', '', title) for title in world_table_titles]

print(clean_titles)

I'd be glad to get pointers on how to turn the text into a readable form like this:

ПОНЕДЕЛЬНИК

Время АСОИ-1-23
8:00    Физика (пр) Нарманбетова Г. Ж
9:30    Математика Алыкулова К.Б
11:00   Основы экономики, менеджмента и маркетинга Алтыбаева Ш.И.
12:40   Русский язык Омуркулова Г.М.

I know it's a lot to ask, but I'm really stuck.
text file, regular_expression (image)
I've tried YouTube tutorials, regex101, and AI chatbots, but nothing helped.

regex

Source: https://stackoverflow.com/questions/77466248/%d0%a1ant-write-the-correct-set-of-regular-expressions

1 Answer


wmomyfyw1#

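I think this is an XY problem.
Assuming you want to load a Google Sheet (from its pubhtml URL) in order to query specific data, maybe you should consider using pandas' read_html with some *post-processing*?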

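# Read a raw version of the Google Sheet HTML
# (`page` is the requests response from the question's code)
from io import StringIO
import pandas as pd

# Skip the leading rows and the first column, drop empty rows,
# then promote the row labelled 4 to be the column headers
tmp = (pd.read_html(StringIO(page.text))[0].iloc[4:, 1:]
           .dropna(how="all").T.set_index(4).T)

# De-duplicate the headers (optional): repeated names get -1, -2, ...
s = tmp.columns.to_series()
tmp.columns = (s.str.cat(s.groupby(level=0).cumcount().add(1)
                .astype(str), sep="-").where(s.duplicated(keep=False), s))

# Index by day ("Время-1") and time slot ("Время-2")
df = tmp.set_index(["Время-1", "Время-2"]).rename_axis(columns=None)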

This gives a DataFrame with a hierarchical index, and loc returns the expected output:

df.loc["ПОНЕДЕЛЬНИК", "АСОИ-1-23"]

Время-2
8:00                        Физика (пр) Нарманбетова Г. Ж.
9:30                              Математика Алыкулова К.Б
11:00    Основы экономики, менеджмента и маркетинга Алт...
12:40                         Русский язык Омуркулова Г.М.
Name: АСОИ-1-23, dtype: object
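If the goal is the plain-text layout from the question, a selection like the one above can be printed line by line. This is only a minimal sketch: `format_schedule` is a hypothetical helper (not part of pandas), and a plain dict stands in for the Series returned by `loc`, since both expose `.items()`.

```python
def format_schedule(day, group, lessons):
    """Render one day's lessons for one group as plain text.

    `lessons` is any object with .items() yielding (time, lesson) pairs,
    e.g. a dict or a pandas Series indexed by time slot.
    """
    lines = [day, "", f"Время {group}"]
    for time, lesson in lessons.items():
        # pad the time to 8 characters so the lessons line up
        lines.append(f"{time:<8}{lesson}")
    return "\n".join(lines)


# Hypothetical stand-in for df.loc["ПОНЕДЕЛЬНИК", "АСОИ-1-23"]
monday = {
    "8:00": "Физика (пр) Нарманбетова Г. Ж.",
    "9:30": "Математика Алыкулова К.Б",
}
print(format_schedule("ПОНЕДЕЛЬНИК", "АСОИ-1-23", monday))
```

With the real data you would probably pass `df.loc[day, group].dropna()` so that empty slots are skipped.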

Output (*the whole table*):

print(df)

                          АСОИ-1-23 ауд.-1  ...      ЭУБДМ-1-23 ауд.-15
Время-1     Время-2                         ...                        
ПОНЕДЕЛЬНИК 8:00     Физика (пр)...    338  ...  Физика (пр)...     338
            9:30     Математика ...    407  ...  Математика ...     407
            11:00    Основы экон...    411  ...  Основы экон...     411
...                             ...    ...  ...             ...     ...
ПЯТНИЦА     9:30     Физика(лек)...    338  ...  Физика(лек)...     338
            11:00    Введение в ...    422  ...             NaN     NaN
            12:40    Кыргыз тил ...    405  ...  Кыргыз тил ...     405

[20 rows x 30 columns]
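The `-1`, `-2`, ... suffixes on names such as `Время` and `ауд.` come from the header de-duplication step. As a rough plain-Python sketch of that renaming rule (the helper name `dedupe_labels` and the short header list are made up for illustration; the answer itself does this with a pandas chain):

```python
from collections import Counter


def dedupe_labels(labels):
    """Suffix repeated labels with -1, -2, ...; leave unique labels alone."""
    totals = Counter(labels)   # how often each label occurs overall
    seen = Counter()           # how often each label has occurred so far
    result = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            result.append(f"{label}-{seen[label]}")
        else:
            result.append(label)
    return result


# Hypothetical miniature of the sheet's header row
headers = ["Время", "Время", "АСОИ-1-23", "ауд.", "ЭУБДМ-1-23", "ауд."]
print(dedupe_labels(headers))
```

Unique column names matter here because `set_index(["Время-1", "Время-2"])` needs to refer to each of the two `Время` columns unambiguously.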

Just for fun, if you also want to clone the formatting, you can use Styler:

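import numpy as np

def fmt_outeridx(ser):
    # same CSS for every label of the styled index level
    return ["""background-color: #00ffff; font-weight: bold;
            font-size: 14pt;text-align: center;""" for _ in ser]

def fmt_aya(ser):
    # highlight the room ("ауд.") columns in each row
    return np.where(ser.index.str.startswith("ауд"),
                    "background-color:#ffff99", "")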

(
    df.style
        .set_properties(**{"font-weight": "bold",
            "border": "1px solid", "text-align": "center"})
        .apply_index(fmt_outeridx, axis=0, level=0)
        .apply_index(fmt_outeridx, axis=1)
        .apply(fmt_aya, axis=1)
)
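As a side note on the regex from the question: `re.sub(r'[",\s]+', '', title)` deletes every whitespace run and every comma, so words end up glued together and commas that belong to subject names disappear. If you'd rather stay with the BeautifulSoup approach, here is a minimal sketch that collapses whitespace instead of removing it (`clean_cell` and the sample string are hypothetical, not taken from the sheet):

```python
import re


def clean_cell(text):
    """Drop double quotes and collapse each whitespace run to one space."""
    text = text.replace('"', "")
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw cell text with a line break and doubled spaces
raw = 'Физика  (пр)\n Нарманбетова Г. Ж'

glued = re.sub(r'[",\s]+', '', raw)   # the pattern from the question
print(glued)            # words run together
print(clean_cell(raw))  # words stay separated by single spaces
```

This only fixes the text inside each cell. The scattered order is a separate issue, which is why reading the table with `read_html` is likely the simpler route.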
