我最近需要编写R代码来识别包含所有年份访问的地区的行。有三个主要列需要考虑:
year
original_district
数据收集时的地区名称current_district
居住在最初调查的同一区域的地区的当前名称
我需要我的代码尽可能通用,以便它可以处理任意数量的年份,并且它需要标识与original_district
和current_district
的任意组合匹配的任意行。
这听起来并不复杂,但有一个增加的复杂性:如果一个地区在数据收集和现在之间的时间内被分割,则current_district
可能包含多个值。例如,original_district
可能等于 * Mountain District
*,而current_district
可能等于 * "Northern Mountain District;Southern Mountain District"
*
我包含了一个解决方案来识别下面重复值的行。如果你能帮助我识别所有组中出现的确切值,那将非常有帮助。
为了让这个问题更容易理解,我创建了一个例子,使用了多年来参加的喜剧节目,头条新闻和开场白列表。
library(tidyverse)
# Set the random seed for predictability
set.seed(2023)
# Define comics that are 100% visited as a headliner each year.
# Mitch Hedberg is removed after 2005.
always_comics <-
c("Norm MacDonald", "Maria Bamford", "Mitch Hedberg")
# Define the other comics that might be seen
other_comics <-
c("Dave Attell", "John Mulaney", "Tig Notaro", "Eddie Izzard",
"Doug Stanhope", "Jim Jefferies", "Gilbert Gottfried", "Patrice O'Neal",
"Chelsea Peretti", "Jim Gaffigan", "Amy Schumer", "Eddie Murphy", "Sarah Silverman",
"Bill Burr", "Shane Gillis", "Mark Normand", "tony hinchcliffe", "Ellen DeGeneres",
"Nicole Byer", "Michelle Wolf", "Jenny Slate", "Sarah Sherman", "Katherine Ryan") %>%
str_to_title() %>%
unique() %>%
sort()
# Initialize vectors for years and headliners
years <- numeric(0)
headliners <- character(0)
years_of_shows <-
seq(2001, 2019, 4)
for(i in 1:length(years_of_shows)){
# We always see "Norm MacDonald", "Maria Bamford", "Mitch Hedberg" every year,
# but determine a random number of additional shows for each year.
n_shows = sample(2:4, size = 1)
# Concatenate the correct number of years to the years vector
years <- c(years, rep(years_of_shows[i], (3 + n_shows)))
# Concatenate the headliners
headliners <-
c(
headliners,
c(always_comics, sample(other_comics, n_shows)) %>% sample(size = n_shows + 3)
)
}
# Create a dataframe of the headliners and the years attended.
comedy_shows_df <-
tibble(
year = years,
headliner = headliners,
openers = NA_character_
)
# Now add openers. These can be any comic.
all_comics <-
c(always_comics, other_comics) %>% unique()
for(i in 1:nrow(comedy_shows_df)){
# Choose a random number of openers
n_openers = sample(3:5, size = 1, prob = c(1,3,2))
# Sample the openers
openers = sample(all_comics, size = n_openers)
# Make so Mitch headberg always performs alone.
if("Mitch Hedberg" %in% comedy_shows_df$headliner[i]){
openers = "Mitch Hedberg"
}
# Mitch Hedberg died in 2005, so remove him from latter years
if("Mitch Hedberg" %in% openers & comedy_shows_df$year[i] > 2005){
openers = str_replace_all(openers, "Mitch Hedberg", "Norm MacDonald")
}
# The above two IF statement ensure that Mitch Hedberg shouldn't end up in the
# final result.
# If the headliner shows up as an opener, make that person the only opener
# (i.e. they performed alone)
if(comedy_shows_df$headliner[i] %in% openers){
openers = comedy_shows_df$headliner[i]
}
# Save the list of openers as a semi-colon separated string
comedy_shows_df$openers[i] <-
openers %>%
unique() %>%
paste(collapse = ";")
}
# Mitch Hedberg died in 2005, so remove him from latter years
comedy_shows_df %<>%
filter(
!(str_detect(headliner, "Hedberg") & years > 2005)
)
comedy_shows_df %<>%
mutate(
show_id = 1:nrow(.),
.before = 1
)
comedy_shows_df
好了,现在我有了一个你可以重新创建的数据集。接下来,让我们检查我的解决方案,或者发布你自己的!
1条答案
按热度按时间fnx2tebb1#
溶液
我的解决方案包括拆分以分号分隔的列表字符串,按该列取消嵌套,长轴旋转,然后减少结果。
接下来,识别这些喜剧演员表演的独特节目。这将向您显示show_id,以及在该节目中作为头条新闻或开场白表演的重复漫画。
一个三个三个一个
下面是原始 Dataframe
comedy_shows_df
的屏幕截图。这里是一个简化结果的副本,显示了每次表演中重复的表演者,
repeated_comics_with_show_id_df
。这是最终结果,原始数据框被过滤为仅包含含有重复漫画
filtered_comedy_shows_df
的行