如何在R中的所有组中查找具有共同重复值的行

ekqde3dh  于 2023-04-03  发布在  其他
关注(0)|答案(1)|浏览(104)

我最近需要编写R代码来识别包含所有年份访问的地区的行。有三个主要列需要考虑:

  • year
  • original_district数据收集时的地区名称
  • current_district居住在最初调查的同一区域的地区的当前名称

我需要我的代码尽可能通用,以便它可以处理任意数量的年份,并且它需要标识与original_districtcurrent_district的任意组合匹配的任意行。
这听起来并不复杂,但有一个增加的复杂性:如果一个地区在数据收集和现在之间的时间内被分割,则current_district可能包含多个值。例如,original_district可能等于 * Mountain District *,而current_district可能等于 * "Northern Mountain District;Southern Mountain District" *
我包含了一个解决方案来识别下面重复值的行。如果你能帮助我识别所有组中出现的确切值,那将非常有帮助。
为了让这个问题更容易理解,我创建了一个例子,使用了多年来参加的喜剧节目,头条新闻和开场白列表。

library(tidyverse)
# Set the random seed for predictability
set.seed(2023)

# Define comics that are 100% visited as a headliner each year.
# Mitch Hedberg is removed after 2005.
always_comics <- 
  c("Norm MacDonald", "Maria Bamford", "Mitch Hedberg")

# Define the other comics that might be seen
other_comics <- 
  c("Dave Attell", "John Mulaney", "Tig Notaro", "Eddie Izzard", 
    "Doug Stanhope", "Jim Jefferies", "Gilbert Gottfried", "Patrice O'Neal", 
    "Chelsea Peretti", "Jim Gaffigan", "Amy Schumer", "Eddie Murphy", "Sarah Silverman",
    "Bill Burr", "Shane Gillis", "Mark Normand", "tony hinchcliffe", "Ellen DeGeneres",
    "Nicole Byer", "Michelle Wolf", "Jenny Slate", "Sarah Sherman", "Katherine Ryan") %>% 
  str_to_title() %>%
  unique() %>%
  sort()

# Initialize vectors for years and headliners
years <- numeric(0)
headliners <- character(0)

years_of_shows <- 
  seq(2001, 2019, 4)

for(i in 1:length(years_of_shows)){
  # We always see "Norm MacDonald", "Maria Bamford", "Mitch Hedberg" every year, 
  # but determine a random number of additional shows for each year.
  n_shows = sample(2:4, size = 1)
  
  # Concatenate the correct number of years to the years vector
  years <- c(years, rep(years_of_shows[i], (3 + n_shows)))
  
  # Concatenate the headliners
  headliners <- 
    c(
      headliners, 
      c(always_comics, sample(other_comics, n_shows)) %>% sample(size = n_shows + 3)
    )
}

# Create a dataframe of the headliners and the years attended.
comedy_shows_df <-
  tibble(
    year = years,
    headliner = headliners,
    openers = NA_character_
  )

# Now add openers.  These can be any comic.
all_comics <-
  c(always_comics, other_comics) %>% unique()

for(i in 1:nrow(comedy_shows_df)){
  # Choose a random number of openers
  n_openers = sample(3:5, size = 1, prob = c(1,3,2))
  # Sample the openers
  openers = sample(all_comics, size = n_openers)
  
  # Make so Mitch headberg always performs alone.
  if("Mitch Hedberg" %in% comedy_shows_df$headliner[i]){
    openers = "Mitch Hedberg"
  }
  # Mitch Hedberg died in 2005, so remove him from latter years
  if("Mitch Hedberg" %in% openers & comedy_shows_df$year[i] > 2005){
    openers = str_replace_all(openers, "Mitch Hedberg", "Norm MacDonald")
  }
  # The above two IF statement ensure that Mitch Hedberg shouldn't end up in the 
  # final result.
  
  # If the headliner shows up as an opener, make that person the only opener 
  # (i.e. they performed alone)
  if(comedy_shows_df$headliner[i] %in% openers){
    openers = comedy_shows_df$headliner[i]
  }
  
  # Save the list of openers as a semi-colon separated string
  comedy_shows_df$openers[i] <- 
    openers %>%
    unique() %>%
    paste(collapse = ";")
}

# Mitch Hedberg died in 2005, so remove him from latter years
comedy_shows_df %<>%
  filter(
    !(str_detect(headliner, "Hedberg") & years > 2005)
  )

comedy_shows_df %<>%
  mutate(
    show_id = 1:nrow(.),
    .before = 1
  )

comedy_shows_df

好了,现在我有了一个你可以重新创建的数据集。接下来,让我们检查我的解决方案,或者发布你自己的!

fnx2tebb

fnx2tebb1#

溶液

我的解决方案包括拆分以分号分隔的列表字符串,按该列取消嵌套,长轴旋转,然后减少结果。

library(tidyverse)
comedy_shows_df_long <-
  comedy_shows_df %>%
  # First, split the semi-colon separated list
  mutate(
    openers =
      str_split(openers, pattern = ";")
  ) %>% 
  # Next, unnest the split column
  unnest(openers) %>%
  # After that, since headliner and opener represent 
  # the same thing (a comic), stack the columns using `pivot_longer`
  pivot_longer(
    cols = c(headliner, openers)
  ) %>% 
  rename(
    comedian = value
  )
comedy_shows_df_long
# Next, split the dataframe by comedian and year, then intersect ALL of the 
# results using the `Reduce` function
# This results in the names of the comedians seen every single year.
comics_seen_every_year <-
  Reduce(
    f = intersect, 
    x = 
      split(
        x = comedy_shows_df_long$comedian, 
        f = comedy_shows_df_long$year
      )
  ) %>%
  sort()

comics_seen_every_year

接下来,识别这些喜剧演员表演的独特节目。这将向您显示show_id,以及在该节目中作为头条新闻或开场白表演的重复漫画。
一个三个三个一个
下面是原始 Dataframe comedy_shows_df的屏幕截图。

这里是一个简化结果的副本,显示了每次表演中重复的表演者,repeated_comics_with_show_id_df

这是最终结果,原始数据框被过滤为仅包含含有重复漫画filtered_comedy_shows_df的行

相关问题