Predicting a person's gender from census data in R

e1xvtsh3, posted 2023-02-27 in Other

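I have three data frames: one with customer names, one with female names, and one with male names. If a customer's name appears in the male names data frame, its gender should be assigned as male, and likewise for female. But if a name appears in both the male and female data frames, I have to use the counts to decide the gender.
For example: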

male_names <- data.frame(
  Name = c("Ajit","Binod","Chinmay","Dhiraj","Eshan","Krishna"),  
  count = c(4,2,3,4,2,7)
)

female_names <- data.frame(
  Name = c("Amita","Binita","Cherry","Deepika","Krishna"), 
  count = c(4,1,2,3,2)
)

customer_names <- data.frame(
  Name = c("Ajit","Binita","Dhiraj","Krishna")
)

How can I do this?


Answer #1 (osh3o9ms)

Here is my approach with dplyr:

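library(dplyr)

female_names %>%
  full_join(male_names, by = "Name") %>%   # count.x = female count, count.y = male count
  replace(is.na(.), 0) %>%                 # a name missing from one list counts as 0
  mutate(Name, gender = ifelse(count.x > count.y, "female", "male"), .keep = "none") %>%
  right_join(customer_names, by = "Name")  # keep only the customers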

Result:

     Name gender
1  Binita female
2 Krishna   male
3    Ajit   male
4  Dhiraj   male

Answer #2 (lztngnrs)

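I suggest merging the female and male names first and then comparing the counts, so that you know which gender to assign to each name. Then join your customers to that table. I use data.table here, but that is just my preference.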

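library(data.table)

# count.x = male count, count.y = female count; NA counts are ignored via na.rm
gender_by_name <- setDT(merge(male_names, female_names, by = "Name", all = TRUE))[
  , .(gender = ifelse(sum(count.x, -count.y, na.rm = TRUE) > 0, "male", "female")), by = Name]
# Look up each customer; note that setDT() converts customer_names to a data.table in place
gender_by_name[setDT(customer_names), on = .(Name)]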

Result:

      Name gender
1:    Ajit   male
2:  Binita female
3:  Dhiraj   male
4: Krishna   male

Answer #3 (kknvjkwl)

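If you just want the gender with the highest count for each name:
[Edit: I realized this keeps both genders when the counts are equal, but the question does not say what should happen in that case.]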

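library(dplyr)

bind_rows(male = male_names, female = female_names, .id = "gender") |>
  group_by(Name) |>
  filter(count == max(count)) |>   # keep the gender(s) with the highest count per name
  merge(customer_names)            # keep only the customers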

     Name gender count
1    Ajit   male     4
2  Binita female     1
3  Dhiraj   male     4
4 Krishna   male     7
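
If tied counts matter for your data, one option is to collapse each name to a single row and mark ties explicitly. The following is only a sketch under assumptions the question does not state: the name "Kiran" is a hypothetical entry added to produce a tie, and treating a tie as `NA` is just one possible policy.

```r
library(dplyr)

# Data from the question, plus a hypothetical tied name ("Kiran")
# added only to illustrate the tie case.
male_names <- data.frame(
  Name = c("Ajit", "Binod", "Chinmay", "Dhiraj", "Eshan", "Krishna", "Kiran"),
  count = c(4, 2, 3, 4, 2, 7, 3)
)
female_names <- data.frame(
  Name = c("Amita", "Binita", "Cherry", "Deepika", "Krishna", "Kiran"),
  count = c(4, 1, 2, 3, 2, 3)
)
customer_names <- data.frame(
  Name = c("Ajit", "Binita", "Dhiraj", "Krishna", "Kiran")
)

gender_by_name <- bind_rows(male = male_names, female = female_names, .id = "gender") |>
  group_by(Name) |>
  summarise(
    # Assumed policy: if more than one gender shares the maximum count,
    # the gender is left as NA; otherwise take the gender with the top count.
    gender = if (sum(count == max(count)) > 1) NA_character_ else gender[which.max(count)],
    .groups = "drop"
  )

# One row per customer, in the original customer order
result <- left_join(customer_names, gender_by_name, by = "Name")
print(result)
```

With this policy "Kiran" ends up with a missing gender, while the other customers get the same answer as above. A different rule, such as defaulting ties to one gender, only requires changing the `if` branch.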

But I would think carefully about what you are going to do with this data, and whether this is an appropriate way to guess gender.
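
One way to keep that uncertainty visible is to carry the share of each count along with the guess, and to leave ambiguous names unassigned. This is a sketch rather than a recommendation: the `threshold` value of 0.8 and the column names `male_share`, `count_male`, and `count_female` are assumptions made for illustration.

```r
library(dplyr)

# Data from the question
male_names <- data.frame(
  Name = c("Ajit", "Binod", "Chinmay", "Dhiraj", "Eshan", "Krishna"),
  count = c(4, 2, 3, 4, 2, 7)
)
female_names <- data.frame(
  Name = c("Amita", "Binita", "Cherry", "Deepika", "Krishna"),
  count = c(4, 1, 2, 3, 2)
)
customer_names <- data.frame(
  Name = c("Ajit", "Binita", "Dhiraj", "Krishna")
)

# Hypothetical cutoff: only assign a gender when one side holds at least
# this share of the total count for the name.
threshold <- 0.8

name_shares <- full_join(male_names, female_names, by = "Name",
                         suffix = c("_male", "_female")) |>
  mutate(
    # A name missing from one list counts as 0 there
    count_male = coalesce(count_male, 0),
    count_female = coalesce(count_female, 0),
    male_share = count_male / (count_male + count_female),
    gender = case_when(
      male_share >= threshold ~ "male",
      male_share <= 1 - threshold ~ "female",
      TRUE ~ NA_character_   # too ambiguous to call under this cutoff
    )
  ) |>
  select(Name, male_share, gender)

result <- left_join(customer_names, name_shares, by = "Name")
print(result)
```

Under this cutoff "Krishna" (7 male against 2 female, a male share of about 0.78) is left unassigned instead of being forced to "male". Whether that is preferable depends on what the gender column is used for.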
