如何将mysqldump导入Pandas

qnzebej0  于 2022-12-16  发布在  Mysql
关注(0)|答案(5)|浏览(139)

我很感兴趣,如果有一个简单的方法来导入mysqldump到Pandas。
我有几个小(~ 110 MB)表,我想有他们作为数据框。
我希望避免把数据放回数据库,因为这将需要安装/连接到这样的数据库。我有。sql文件,并希望将包含的表导入到Pandas。是否有任何模块存在做这件事?
如果版本控制很重要,那么.sql文件都将“MySQL dump 10.13 Distrib 5.6.13,for Win32(x86)”列为生成转储的系统。

事后的背景

我在一台没有数据库连接的计算机上本地工作。我工作的正常流程是给出一个。tsv,.csv或json,并做一些分析,这将返回。一个新的第三方提供了他们所有的数据在.sql格式,这打破了我的工作流程,因为我需要大量的开销,使其成为一种格式,我的程序可以作为输入。我们最终要求他们以不同的格式发送数据,但出于商业/声誉原因,我们想先找个变通办法。

编辑:以下是包含两个表的MYSQLDump文件示例。

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/

/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values 

(2,'Liq','Beer','2.5000','300'),

(7,'Liq','Water','3.5000','230'),

(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values 

('Pizza','300'),

('Hotdogs','200'),

('Beer','300'),

('Water','230'),

('Soda','399'),

('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
wi3ka0sx

wi3ka0sx1#

没有

Pandas没有“原生”的方式来阅读mysqldump而不通过数据库。
有一个可能的变通办法,但在我看来这是一个非常糟糕的主意。

变通方案(不建议用于生产)

当然,您可以使用预处理器解析mysqldump文件中的数据。
MySQLdump文件通常包含大量我们在加载Pandas Dataframe 时不感兴趣的额外数据,因此我们需要对其进行预处理,去除噪声,甚至重新格式化行以使其符合要求。
使用StringIO,我们可以读取文件,处理数据,然后再将其馈送到the pandas.read_csv funcion

from StringIO import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()
        
    fast_forward = True
    with open(dump_filename, 'rb') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            data = re.findall('\([^\)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            if line.endswith(';'):
                break
    sio.pos = 0
    return sio

现在我们有了一个读取数据并将其格式化为CSV文件的函数,可以使用pandas.read_csv()读取它

import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)

结果:

Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'

以及

ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'

关于流处理的说明

这种方法被称为流处理,令人难以置信地精简,几乎完全不占用内存。总的来说,使用这种方法将csv文件更高效地读入Pandas是个好主意。

我建议不要解析mysqldump文件

xmq68pz9

xmq68pz92#

一种方法是export mysqldump to sqlite(例如run this shell script),然后读取sqlite文件/数据库。
请参见文档的SQL部分:

pd.read_sql_table(table_name, sqlite_file)
  • 另一种选择是直接在mysql数据库上运行read_sql... *
b4wnujal

b4wnujal3#

I found myself in a similar situation to yours, and the answer from @ firelynx was really helpful!
But since I had only limited knowledge of the tables included in the file, I extended the script by adding the header generation (pandas picks it up automatically), as well as searching for all the tables within the dump file. As a result, I ended up with a following script, that indeed works extremely fast. I switched to io.StringIO , and save the resulting tables as table_name.csv files.
P.S. I also support the advise against relying on this approach, and provide the code just for illustration purposes :)
So, first thing first, we can augment the read_dump function like this

from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0 # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode==0:
                continue

            # Filling up the headers
            elif read_mode==1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases 
                    #(e.g. when simply a key is defined, or no key is defined)
                    read_mode=0
                    sio.seek(sio.tell()-1) # delete last comma
                    sio.write('\n')
                    continue
                colheader = re.findall('`([\w_]+)`',line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data -same as @firelynx's code
            elif read_mode ==2:
                data = re.findall('\([^\)]*\)', line)
                try:
                    # ...
                except IndexError:
                    pass
                if line.endswith(';'):
                    break
    sio.seek(0)
    with open (target_table+'.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd,-1)
    return # or simply return sio itself

To find the list of tables we can use the following function:

def find_tables(dump_filename):
    table_list=[]

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall('create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)

    return table_list

Then just combine the two, for example in a .py script that you'll run like
python this_script.py mysqldump_name.sql [table_name]

import os.path
def main():
    try:
        if len(sys.argv)>=2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv)==2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list)>0:
                    print('Found tables: ',str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv)==3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)
k3bvogb1

k3bvogb14#

我想分享我对这个问题的解决方案,并征求反馈:

import pandas as pd
import re
import os.path
import csv
import logging
import sys

def convert_dump_to_intermediate_csv(dump_filename, csv_header, csv_out_put_file, delete_csv_file_after_read=True):
    """
    :param dump_filename: five an mysql export dump (mysqldump...syntax)
    :param csv_header: the very first line in the csv file which should appear, give a string separated by coma
    :param csv_out_put_file: the name of the csv file
    :param delete_csv_file_after_read: if you set this to False, no new records will be written as the file exists.
    :return: returns a pandas dataframe for further analysis.
    """
    with open(dump_filename, 'r') as f:
        for line in f:
            pre_compiled_all_values_per_line = re.compile('(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=\;)')
            result = pre_compiled_all_values_per_line.finditer(line)
            for element in result:
                values_only = element.groups('values')[0]
                value_compile = re.compile('\(.*?\)')
                all_identified = value_compile.finditer(values_only)
                for single_values in all_identified:
                    string_to_split = single_values.group(0)[1:-1]
                    string_array = string_to_split.split(",")

                    if not os.path.exists(csv_out_put_file):
                        with open(csv_out_put_file, 'w', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(csv_header.split(","))
                            writer.writerow(string_array)
                    else:
                        with open(csv_out_put_file, 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(string_array)
    df = pd.read_csv(csv_out_put_file)
    if delete_csv_file_after_read:
        os.remove(csv_out_put_file)
    return df

if __name__ == "__main__":
    log_name = 'test.log'
    LOGGER = logging.getLogger(log_name)
    LOGGER.setLevel(logging.DEBUG)
    LOGGER.addHandler(logging.NullHandler())
    FORMATTER = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    SCREEN_HANDLER = logging.StreamHandler(stream=sys.stdout)
    SCREEN_HANDLER.setFormatter(FORMATTER)
    LOGGER.addHandler(SCREEN_HANDLER)

    dump_filename = 'test_sql.sql'
    header_of_csv_file = "A,B,C,D,E,F,G,H,I" # i did not identify the columns in the table definition...
    csv_output_file = 'test.csv'
    pandas_df = convert_dump_to_intermediate_csv(dump_filename, header_of_csv_file, csv_output_file, delete_csv_file_after_read=False)
    LOGGER.debug(pandas_df)

当然,记录器部分可以移除......

pgpifvop

pgpifvop5#

我在一台没有数据库连接的本地计算机上工作。我工作的正常流程是给一个.tsv
尝试mysqltotsv pypi模块:

pip3 install --user mysqltotsv
python3 mysql-to-tsv.py --file dump.sql --outdir out1

这将在out1目录中生成多个.tsv文件(MySQL转储中的每个表对应一个.tsv文件),然后通过加载TSV文件继续Pandas的正常工作流程。

相关问题