pandas: How to create a plot from multiple dictionaries

6za6bjd0  posted on 2023-04-19  in Other

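I am trying to make a scatter plot from the items in dictionaries, and I need to use seaborn for the comparison.
For each animal, the listed values need to be compared in the plot against the base pair amounts [1000, 2000, 3000].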

x     y
1000    53
2000    69
3000     0
import seaborn as sns

dict_1={'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2={'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

sns.set_theme()

sns.relplot(
    data=dict_1,
    x="organism", y="CpG sites")

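Technical note: the first dictionary is the original sequence, and the second dictionary is a randomized sequence with the same ACGT content. The listed values, which are the number of CG repeats, need to be compared in the plot. For cat, CG is repeated 53 times in the first 1000 bp of the original sequence and 40 times in the randomized sequence; in 2000 bp it is repeated 69 times in the original sequence and 39 times in the randomized one, and so on.
For example: the 'CG value' listed in the dictionary instead of 'tip' (x), and the base pairs per 1000 instead of 'total_bill' (y).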

up9lanfz


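  • The easiest approach is to combine the dictionaries into a pandas.DataFrame, and then update the df with the additional details that organize the data.
  • If the values in the dictionaries have unequal lengths (as shown in a comment), use Creating dataframe from a dictionary where entries have different lengths
  • Create a DataFrame for each dict, as shown in the linked answer, then use pd.concat again to combine the DataFrames.
    • Tested in python 3.11.2, pandas 2.0.0, seaborn 0.12.2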
import pandas as pd
import seaborn as sns

# update data in dictionaries from a comment
original_sequence = {'cat': [67, 17, 0], 'cheetah': [67, 17, 11], 'chlamydia': [67, 17, 27, 37, 17], 'polarbear': [67, 17, 27, 37, 32, 0]}
randomized_sequence = {'cat': [71, 61, 0], 'cheetah': [58, 56, 26], 'chlamydia': [47, 43, 44, 42, 29], 'polarbear': [52, 44, 54, 43, 42, 1]}

# list of dicts
list_of_dicts = [original_sequence, randomized_sequence]

# combine the dicts into dataframes, assign a new column to distinguish each sequence, reset the index and use it as the base pair amount
df = (pd.concat([pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
                 .assign(Sequence=i) for i, data in enumerate(list_of_dicts)], ignore_index=False)
      .reset_index()
      .rename({'index': 'CG Amount'}, axis=1))

# Update the CG Amount column to correspond to the actual numbers
df['CG Amount'] = df['CG Amount'].add(1).mul(1000)

# seaborn works with DataFrames in a long form, so melt
df = df.melt(id_vars=['Sequence', 'CG Amount'], var_name='Organism', value_name='Repeats')
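
The nested comprehension above can be hard to follow. As a rough sketch of the same idea, and assuming that wrapping each list in a pd.Series is acceptable (this is an alternative to the approach in the linked answer, not the code tested above), the shorter lists are padded with NaN automatically. The helper name dicts_to_long and its step parameter are made up for illustration.

```python
import pandas as pd

original_sequence = {'cat': [67, 17, 0], 'cheetah': [67, 17, 11], 'chlamydia': [67, 17, 27, 37, 17], 'polarbear': [67, 17, 27, 37, 32, 0]}
randomized_sequence = {'cat': [71, 61, 0], 'cheetah': [58, 56, 26], 'chlamydia': [47, 43, 44, 42, 29], 'polarbear': [52, 44, 54, 43, 42, 1]}


def dicts_to_long(list_of_dicts, step=1000):
    """Hypothetical helper: build one long-form DataFrame from dicts of unequal-length lists."""
    frames = []
    for i, data in enumerate(list_of_dicts):
        # pd.Series aligns on the positional index, so shorter lists are padded with NaN
        wide = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})
        # turn the 0-based position into the base pair amount (1000, 2000, ...)
        wide.insert(0, 'CG Amount', (wide.index.to_numpy() + 1) * step)
        # mark which dictionary the rows came from (0 = first, 1 = second)
        wide.insert(0, 'Sequence', i)
        frames.append(wide)
    wide_all = pd.concat(frames, ignore_index=True)
    # seaborn expects long-form data, so melt the organism columns into rows
    return wide_all.melt(id_vars=['Sequence', 'CG Amount'], var_name='Organism', value_name='Repeats')


long_df = dicts_to_long([original_sequence, randomized_sequence])
print(long_df.head())
```

The resulting long_df has the same columns as the melted df above, so it can be passed to the same seaborn calls.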

scatter

g = sns.relplot(data=df, x='CG Amount', y='Repeats', hue='Sequence', col='Organism')

bar

  • If you are comparing two sequences at discrete intervals, a bar plot seems like a better option.
g = sns.catplot(data=df, kind='bar', x='CG Amount', y='Repeats', hue='Sequence', col='Organism', col_wrap=2)
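
In both plots the legend shows the Sequence values 0 and 1, which come from enumerate. A small sketch, assuming 0 is the original sequence and 1 the randomized one (the order of list_of_dicts), maps them to readable labels before plotting. The labels 'original' and 'randomized' and the small stand-in DataFrame are illustrative only.

```python
import pandas as pd

# small stand-in for the long-form df (values taken from the 'cat' rows)
df = pd.DataFrame({
    'Sequence': [0, 0, 0, 1, 1, 1],
    'CG Amount': [1000, 2000, 3000, 1000, 2000, 3000],
    'Organism': ['cat'] * 6,
    'Repeats': [67.0, 17.0, 0.0, 71.0, 61.0, 0.0],
})

# assumed meaning of the enumerate() positions: 0 -> original, 1 -> randomized
labels = {0: 'original', 1: 'randomized'}
df['Sequence'] = df['Sequence'].map(labels)

print(df['Sequence'].unique())
```

The relabeled df can be passed to sns.relplot or sns.catplot in the same way; hue='Sequence' then uses the text labels in the legend.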

df before .melt

    CG Amount   cat  cheetah  chlamydia  polarbear  Sequence
0        1000  67.0     67.0       67.0         67         0
1        2000  17.0     17.0       17.0         17         0
2        3000   0.0     11.0       27.0         27         0
3        4000   NaN      NaN       37.0         37         0
4        5000   NaN      NaN       17.0         32         0
5        6000   NaN      NaN        NaN          0         0
6        1000  71.0     58.0       47.0         52         1
7        2000  61.0     56.0       43.0         44         1
8        3000   0.0     26.0       44.0         54         1
9        4000   NaN      NaN       42.0         43         1
10       5000   NaN      NaN       29.0         42         1
11       6000   NaN      NaN        NaN          1         1

df.head() after .melt

   Sequence  CG Amount Organism  Repeats
0         0       1000      cat     67.0
1         0       2000      cat     17.0
2         0       3000      cat      0.0
3         0       4000      cat      NaN
4         0       5000      cat      NaN

df.tail() after .melt

    Sequence  CG Amount   Organism  Repeats
43         1       2000  polarbear     44.0
44         1       3000  polarbear     54.0
45         1       4000  polarbear     43.0
46         1       5000  polarbear     42.0
47         1       6000  polarbear      1.0

Notes

  • If the values in the dictionaries all have the same length, use the following code to create the df
dict_1 = {'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2 = {'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

list_of_dicts = [dict_1, dict_2]

df = (pd.concat([pd.DataFrame(d, index=range(1000, 4000, 1000)).assign(Sequence=i) for i, d in enumerate(list_of_dicts)],
                ignore_index=False)
      .reset_index()
      .rename({'index': 'CG Amount'}, axis=1))
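
For the equal-length case the df is still in wide form, so it needs the same reshaping before it can be plotted. The following sketch repeats the construction and applies the melt call from the main answer; the only new name is long_df.

```python
import pandas as pd

dict_1 = {'cat': [53, 69, 0], 'cheetah': [65, 52, 28]}
dict_2 = {'cat': [40, 39, 10], 'cheetah': [35, 62, 88]}

list_of_dicts = [dict_1, dict_2]

# one DataFrame per dict, indexed by base pair amount and tagged with its position
df = (pd.concat([pd.DataFrame(d, index=range(1000, 4000, 1000)).assign(Sequence=i)
                 for i, d in enumerate(list_of_dicts)], ignore_index=False)
      .reset_index()
      .rename({'index': 'CG Amount'}, axis=1))

# reshape to long form: one row per (Sequence, CG Amount, Organism)
long_df = df.melt(id_vars=['Sequence', 'CG Amount'], var_name='Organism', value_name='Repeats')
print(long_df)
```

Because no list is shorter than the others, no NaN values appear and Repeats stays an integer column.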
