excel 不能替换重复值与新值在xlsx与Pandas

1rhkuytd  于 2023-01-03  发布在  其他
关注(0)|答案(2)|浏览(157)

我有一个包含太多数据的xlsx文件。但是,数据包含名为UniversalIDS的列中的duplicate值,我想将其替换为随机生成的IDSPandas
到目前为止,我已经尝试了不同的情况下,我谷歌,但没有工作.例如,我尝试了这个:

import pandas as pd
import uuid

df = pd.read_excel('C:/Users/Nervous/Downloads/ss.xlsx')

df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()

df.to_excel("C:/Users/Nervous/Downloads/tt.xlsx")

print("done!")

此外,我尝试了在这个网站上看到的其他替代品,例如:

df2 = df.assign(UniversalIDS=df['UniversalIDS'].where(
    ~df.duplicated(['UniversalIDS']), uuid.uuid4()))

这个也不管用

df.loc[df["UniversalIDS"].duplicated(), "test"]

这是xlsx data的一个片段:

UniversalIDS
f6112cd7-0868-4cc9-b5ab-d7381cc23bdf
f3e75641-f328-429f-ae32-41399d8a0bf0
08dccc5c-5774-4614-925c-ad9373821075
79a8ebed-154c-47c7-b16d-cbba5d8469eb
396f8e63-1950-4c36-9524-c1dec8728ffd
62cba3bd-a855-4141-8460-7ff838ecea62
b7f4f753-b479-413a-abcd-34d08436eb85
c0fd6e61-edb1-4dce-94ac-7d0529860e1f
c42f8c98-c285-4552-af9f-2f4d8e89b9e8
8cb77021-eb4f-4cfa-a4a3-e649f6a53c03
cb7f4b8d-976a-4481-919e-c8ff7d384cc6
e15fd2bb-5409-4a8b-9fdc-bf1e5862db58
27b97893-aae7-4a9a-aae1-0fc21a099209
1abc2c2f-94f2-4546-b912-b85bc4ed0cb8
6bf264fb-1b82-48e3-966a-14e48a61a63e
9653faeb-7b3d-408e-93e3-bc729f259c75
a09f3eb6-0059-4a77-bf2f-4f7436508ba8
65e06948-2e6c-413f-a768-c3faf8108a6c
291ff491-4ff0-4fb2-b095-b3e66f2d7ca0
653535c7-0389-4077-8e72-3835fbd72d4d
61408fc8-4f45-48e0-b83a-40b6bfd76ad5
3ae8d547-bf4b-42ac-b083-a1662f1a5c82
4955c673-c5da-464c-8e14-a897df0774eb
a39bad90-5235-4679-945e-534bb47b8347
264a1f6e-adf4-45a7-b1b1-e6f3fc073447
a855025b-ee84-46d5-aedb-cbac9a5b1920
71b16a5b-3f6d-4d30-8a65-203959fe87a2
4f3f86f2-4e61-475a-bc1d-eb2112f23953
59da45de-c192-4885-8a55-9138ca49b33a
8f41df73-d9dc-4663-9f64-d090d7c5ca77
84f7103f-e9de-444f-b046-c02d75af0ed1
2738f733-7438-494c-9368-5fb700df93d1
777a3cd7-19ae-4181-b91d-9be8eaf30523
b6083731-a43e-4b5a-ac9a-94a3202103e7
f22873c1-6811-4025-8f0d-47d72d49e499
f262c369-f44a-4b90-8219-d29b33bc14e8
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54

如上文所示,在UniversalIDS列中存在重复值,同样值得一提的是,在数据中存在其他列,但为了简单起见,将导致问题的列删除。
因此我的问题是如何用新的唯一ID替换UniversalIDS列中的重复值?

dldeef67

dldeef671#

你的表情:

df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()

是正确的python,但是它为所有重复的元素设置了一个uuid,这意味着元素在执行后仍然会重复。您应该创建一个具有不同uuid的Series:

df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = pd.Series([uuid.uuid4() for _ in df.duplicated(subset=["UniversalIDS"])])
gk7wooem

gk7wooem2#

我发现基本上与以前的答案相同,当我重现你的问题,但想出了一个略有不同的解决方案,所以张贴供参考:

for idx, _ in df[df.duplicated(subset=["UniversalIDS"])].iterrows():
    df.at[idx, 'UniversalIDS'] = uuid.uuid4()

相关问题