R equivalent of the SAS FIRST. and LAST. variables

uxh89sit  posted on 2023-07-31 in Other

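Does anybody know the best R alternative to SAS's first. or last. operators? I have not found one.
SAS has the FIRST. and LAST. automatic variables, which identify the first and last records in a group of rows that share the same value of a particular variable; so in the following dataset FIRST.model and LAST.model are defined: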

Model,SaleID,First.Model,Last.Model
Explorer,1,1,0
Explorer,2,0,0
Explorer,3,0,0
Explorer,4,0,1
Civic,5,1,0
Civic,6,0,0
Civic,7,0,1

okxuctiv1#

It sounds like you are looking for !duplicated, with the fromLast argument set to FALSE (for first) or TRUE (for last).

d <- datasets::Puromycin

d$state
# [1] treated   treated   treated   treated   treated   treated   treated  
# [8] treated   treated   treated   treated   treated   untreated untreated
#[15] untreated untreated untreated untreated untreated untreated untreated
#[22] untreated untreated
#Levels: treated untreated
!duplicated(d$state)
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!duplicated(d$state,fromLast=TRUE)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This function has some caveats and edge-case behaviors, which are described in its help file (?duplicated).
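As a minimal sketch of how this maps onto the data in the question (the data frame name sales is made up for illustration). Note the assumption: duplicated looks at the whole vector, not at consecutive runs, so the flags only match SAS when each Model's rows sit together, as they would after sorting.

```r
# Toy data copied from the question; `sales` is a hypothetical name
sales <- data.frame(
  Model  = c(rep("Explorer", 4), rep("Civic", 3)),
  SaleID = 1:7
)

# 1/0 flags in the style of SAS FIRST.Model / LAST.Model
# (assumes the rows of each Model are contiguous)
sales$First.Model <- as.integer(!duplicated(sales$Model))
sales$Last.Model  <- as.integer(!duplicated(sales$Model, fromLast = TRUE))
sales
```

Dropping as.integer() keeps the flags as logicals, which is usually more convenient for subsetting in R.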

dzhpxtsq2#

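Update (read this first)

If you are really only interested in the row indices, using split and range directly may be all you need. The following assumes that the row names of your dataset are sequential numbers, but it could probably be adapted to other cases.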

irisFirstLast <- sapply(split(iris, iris$Species), 
                        function(x) range(as.numeric(rownames(x))))
irisFirstLast              ## Just the indices
#      setosa versicolor virginica
# [1,]      1         51       101
# [2,]     50        100       150
iris[irisFirstLast[1, ], ] ## `1` would represent "first"
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
iris[irisFirstLast, ]      ## nothing would represent both first and last
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 50           5.0         3.3          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 100          5.7         2.8          4.1         1.3 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
# 150          5.9         3.0          5.1         1.8  virginica

d <- datasets::Puromycin   
dFirstLast <- sapply(split(d, d$state), 
                     function(x) range(as.numeric(rownames(x))))
dFirstLast
#      treated untreated
# [1,]       1        13
# [2,]      12        23
d[dFirstLast[2, ], ]       ## `2` would represent `last`
#    conc rate     state
# 12  1.1  200   treated
# 23  1.1  160 untreated

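If you are working with named rows, the general approach is the same, but you have to pick out the first and last row names yourself instead of using range. Here is the general pattern: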

datasetFirstLast <- sapply(split(dataset, dataset$groupingvariable), 
                           function(x) c(rownames(x)[1], 
                                         rownames(x)[length(rownames(x))]))
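To make the pattern concrete, here is a small sketch using the built-in mtcars data, which has named rows; the choice of cyl as the grouping variable and the name carsFirstLast are just for illustration.

```r
# Illustration only: mtcars has named rows; cyl is used as the grouping variable
carsFirstLast <- sapply(split(mtcars, mtcars$cyl),
                        function(x) c(first = rownames(x)[1],
                                      last  = rownames(x)[nrow(x)]))
carsFirstLast                        # 2 x 3 character matrix of row names

mtcars[carsFirstLast["first", ], ]   # first row of each cyl group
mtcars[carsFirstLast["last", ], ]    # last row of each cyl group
```

Naming the two elements first and last lets you index the result by name rather than by position.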

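Initial answer (edited)

If you are interested in extracting the rows themselves, rather than using the row numbers for some other purpose, you can also look into data.table. Here are some examples: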

library(data.table)
DT <- data.table(iris, key="Species")
DT[J(unique(Species)), mult = "first"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2: versicolor          7.0         3.2          4.7         1.4
# 3:  virginica          6.3         3.3          6.0         2.5
DT[J(unique(Species)), mult = "last"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.0         3.3          1.4         0.2
# 2: versicolor          5.7         2.8          4.1         1.3
# 3:  virginica          5.9         3.0          5.1         1.8
DT[, .SD[c(1,.N)], by=Species]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2:     setosa          5.0         3.3          1.4         0.2
# 3: versicolor          7.0         3.2          4.7         1.4
# 4: versicolor          5.7         2.8          4.1         1.3
# 5:  virginica          6.3         3.3          6.0         2.5
# 6:  virginica          5.9         3.0          5.1         1.8


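The last approach is very convenient. For example, if you want the first three and the last three rows of each group, you can use: DT[, .SD[c(1:3, (.N-2):.N)], by=Species] (FYI: .N is the number of rows in each group).
Other useful approaches include: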

DT[, tail(.SD, 2), by = Species] ## last two rows of each group
DT[, head(.SD, 4), by = Species] ## first four rows of each group

bkhjykvo3#

The head and tail functions with the n=1 option, combined with by, are a good way to do this. See R for SAS and SPSS Users (Robert Muenchen). Create a data frame of the variable(s) of interest, then, for last:

dfby<- data.frame(df$var1, df$var2)
mylastList<-by(df,dfby,tail, n=1)
#turn into a dataframe
mylastDF<-do.call(rbind,mylastList)
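The snippet above uses placeholder names (df, var1, var2). A runnable sketch of the same idea on the built-in iris data, grouping by Species only, might look like this; the names lastList, lastDF and firstDF are made up here.

```r
# Last row of each Species, via by() + tail()
lastList <- by(iris, iris$Species, tail, n = 1)
lastDF   <- do.call(rbind, lastList)   # turn the list of 1-row data frames into one data frame
lastDF

# First rows work the same way with head()
firstDF <- do.call(rbind, by(iris, iris$Species, head, n = 1))
firstDF
```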

bxpogfeg4#

Here is a dplyr solution:

# input
dataset <- structure(list(Model = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("Civic", "Explorer"), class = "factor"), SaleID = 1:7), .Names = c("Model", 
"SaleID"), class = "data.frame", row.names = c(NA, -7L))

# code 
library(dplyr)

dataset %>% 

  group_by(Model) %>%

  mutate(
          "First"        = row_number() == min( row_number() ),
          "Last"         = row_number() == max( row_number() )
  )

# output:

     Model SaleID First  Last
    <fctr>  <int> <lgl> <lgl>
1 Explorer      1  TRUE FALSE
2 Explorer      2 FALSE FALSE
3 Explorer      3 FALSE FALSE
4 Explorer      4 FALSE  TRUE
5    Civic      5  TRUE FALSE
6    Civic      6 FALSE FALSE
7    Civic      7 FALSE  TRUE

PS: If you do not have dplyr installed, run:

install.packages("dplyr")

ao218c7q5#

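The function below is based on @Joe's description of First / Last.
The function returns a list of vectors.
Each list entry corresponds to a column of the data frame (i.e. a feature or variable of the dataset).
Within a given list entry are the indices of the first (or last) occurrence of each distinct value in that column.

Example usage: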

# Pass in your data frame, and indicate whether or not you want to find Last or find First. 
# Assign to the appropriate variable
first <- findFirstLast(myDF)
last  <- findFirstLast(myDF, findFirst=FALSE)


Example using data(iris)

data(iris)
first <- findFirstLast(iris)
last  <- findFirstLast(iris, findFirst=FALSE)

First and last observation of each species:

first$Species
 #    setosa versicolor  virginica 
 #        1         51        101 

 last$Species
 #    setosa versicolor  virginica 
 #        50        100        150

Grab the full row for the first observation of each species

iris[first$Species, ]
#      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#  1            5.1         3.5          1.4         0.2     setosa
#  51           7.0         3.2          4.7         1.4 versicolor
#  101          6.3         3.3          6.0         2.5  virginica

CODE FOR FUNCTION findFirstLast():

findFirstLast <- function(myDF, findFirst=TRUE) {
  # myDF should be a data frame or matrix 

    # By default, this function finds the first occurrence of each unique value in a column.
    # To find the last instead, set findFirst to FALSE. This gives `maxOrMin` a value of -1:
    #    finding the min of the negative indices is the same as finding the max of the positive indices.
    maxOrMin <- ifelse(findFirst, 1, -1) 

    # For each column in myDF, make a list of all unique values (`levs`) and iterate over that list, 
    #   finding the min (or max) of all the indices where that given value appears within the column.
    # Note: apply() coerces a data frame to a matrix first, and it returns a list only when the
    #   columns have differing numbers of unique values (otherwise the result is a matrix).
    apply(myDF, 2, function(colm) {
        levs <- unique(colm)
        sapply(levs, function(lev) {
          inds <- which(colm==lev)
          ifelse(length(inds)==0, NA, maxOrMin*min(inds*maxOrMin) ) 
        })   
      })
  }
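If you only need the indices for a single grouping column, a shorter base R sketch of the same idea is possible with tapply. This is an alternative to the function above rather than a rewrite of it, and the names rowIdx, firstIdx and lastIdx are made up here.

```r
# Row positions, grouped by Species; min gives the first row, max the last
rowIdx   <- seq_len(nrow(iris))
firstIdx <- tapply(rowIdx, iris$Species, min)
lastIdx  <- tapply(rowIdx, iris$Species, max)

firstIdx
lastIdx
iris[firstIdx, ]   # full rows for the first observation of each species
```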

7kjnsjlb6#

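You can try this function, which creates the first and last flags and handles NA the way SAS does.
R dplyr::arrange: for local data, NA is always sorted to the end, even when the variable is wrapped with dplyr::desc().
SAS PROC SORT: missing values of numeric variables are smaller than all numbers. Missing values of character variables are smaller than any printable character value.
The function sorts the data the way SAS does, treating missing values as the smallest, and creates first and last variables for each sort variable.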

library(dplyr, warn.conflicts = FALSE)

#' Sort data rows and create first and last like SAS
#'
#' @param data input data
#' @param ... variables for sort
#' @param first_last logical value, whether or not create the first and last variables
#' @param first_prefix character string of prefix for creating first variable
#' @param last_prefix character string of prefix for creating last variable
sas_sort <- function(data, ...,
                     first_last = TRUE,
                     first_prefix = ".first.",
                     last_prefix = ".last.") {
  stopifnot(!missing(data), is.data.frame(data))
  
  if (dplyr::is.grouped_df(data)) {
    message(
      "the input data is grouped by `",
      dplyr::group_vars(data),
      "`, and will be ungrouped "
    )
    data <- data %>% dplyr::ungroup()
  }
  
  dots <- rlang::enexprs(...)
  if (length(dots) == 0) {
    stop("argument `...` is empty")
  }
  sort <- vector("list")
  for (i in seq_along(dots)) {
    dot <- dots[[i]]
    dot_str <- deparse(dot)
    if (stringr::str_detect(dot_str, "((dplyr::)?desc\\()(.+)(\\))")) {
      sort <- append(sort, dot)
    } else {
      sort <- append(sort, rlang::parse_expr(paste0("!is.na(", dot_str, ")")))
      sort <- append(sort, dot)
    }
  }
  
  data <- data %>% dplyr::arrange(!!!sort)
  
  if (first_last) {
    for (i in seq_along(dots)) {
      dot <- dots[[i]]
      dot_str <- deparse(dot)
      if (stringr::str_detect(dot_str, "((dplyr::)?desc\\()(.+)(\\))")) {
        dot_str <- stringr::str_extract(dot_str, "((dplyr::)?desc\\()(.+)(\\))", group = 3)
        dot <- rlang::sym(dot_str)
      }
      
      first <- paste0(first_prefix, dot_str)
      last <- paste0(last_prefix, dot_str)
      
      data <- data %>%
        dplyr::group_by(!!dot, .add = TRUE) %>%
        dplyr::arrange(!!!sort) %>%
        dplyr::mutate(
          !!first := dplyr::row_number() == 1L,
          !!last := dplyr::row_number() == dplyr::n()
        )
    }
  }
  
  data %>% dplyr::ungroup()
}

# this data is from SAS Programmer’s Guide: Essentials
# FIRST. and LAST. DATA Step Variables
# Example 1: Grouping Observations by State, City, and ZIP Code

zip <- tibble::tribble(
  ~State,      ~City, ~ZipCode,      ~Street,
  "AZ",   "Tucson",   85730L, "Domenic Ln",
  "AZ",   "Tucson",   85730L, "Gleeson Pl",
  "FL", "Lakeland",   33801L, "French Ave",
  "FL", "Lakeland",   33809L,   "Egret Dr",
  "FL",    "Miami",   33133L,    "Rice St",
  "FL",    "Miami",   33133L, "Thomas Ave",
  "FL",    "Miami",   33133L,  "Surrey Dr",
  "FL",    "Miami",   33133L,  "Trade Ave",
  "FL",    "Miami",   33146L,  "Nervia St",
  "FL",    "Miami",   33146L, "Corsica St"
)

zip_sort_r <- sas_sort(zip, State, City, ZipCode,
                           first_prefix = "first_",
                           last_prefix = "last_")

zip_sort_r
#> # A tibble: 10 × 10
#>    State City     ZipCode Street     first_State last_State first_City last_City
#>    <chr> <chr>      <int> <chr>      <lgl>       <lgl>      <lgl>      <lgl>    
#>  1 AZ    Tucson     85730 Domenic Ln TRUE        FALSE      TRUE       FALSE    
#>  2 AZ    Tucson     85730 Gleeson Pl FALSE       TRUE       FALSE      TRUE     
#>  3 FL    Lakeland   33801 French Ave TRUE        FALSE      TRUE       FALSE    
#>  4 FL    Lakeland   33809 Egret Dr   FALSE       FALSE      FALSE      TRUE     
#>  5 FL    Miami      33133 Rice St    FALSE       FALSE      TRUE       FALSE    
#>  6 FL    Miami      33133 Thomas Ave FALSE       FALSE      FALSE      FALSE    
#>  7 FL    Miami      33133 Surrey Dr  FALSE       FALSE      FALSE      FALSE    
#>  8 FL    Miami      33133 Trade Ave  FALSE       FALSE      FALSE      FALSE    
#>  9 FL    Miami      33146 Nervia St  FALSE       FALSE      FALSE      FALSE    
#> 10 FL    Miami      33146 Corsica St FALSE       TRUE       FALSE      TRUE     
#> # ℹ 2 more variables: first_ZipCode <lgl>, last_ZipCode <lgl>

df <- tibble::tribble(
  ~x, ~y, ~z,
  "b", 2L, NA,
  NA, 1L, NA,
  NA, 2L, NA,
  NA, NA, NA,
  NA, NA, "a",
  "a", NA, "a",
  "a", 1L, "a",
  "a", 2L, "b",
  "b", NA, NA,
  "b", 1L, "b",
  "a", NA, NA,
  NA, 1L, "b",
  "b", NA, "b",
  "a", 2L, "a",
  "b", 2L, "b",
  NA, 2L, "b",
  NA, 1L, "a",
  "b", 1L, NA,
  "a", NA, "b",
  "b", NA, "a",
  "a", 2L, NA,
  "a", 1L, "b",
  "a", 1L, NA,
  "b", 1L, "a",
  "b", 2L, "a",
  NA, NA, "b",
  NA, 2L, "a"
)

sort1 <- sas_sort(df,x,y,z)
sort1
#> # A tibble: 27 × 9
#>    x         y z     .first.x .last.x .first.y .last.y .first.z .last.z
#>    <chr> <int> <chr> <lgl>    <lgl>   <lgl>    <lgl>   <lgl>    <lgl>  
#>  1 <NA>     NA <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#>  2 <NA>     NA a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  3 <NA>     NA b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  4 <NA>      1 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  5 <NA>      1 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  6 <NA>      1 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  7 <NA>      2 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  8 <NA>      2 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  9 <NA>      2 b     FALSE    TRUE    FALSE    TRUE    TRUE     TRUE   
#> 10 a        NA <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#> # ℹ 17 more rows

sort2 <- sas_sort(df, x, dplyr::desc(y), z)
sort2
#> # A tibble: 27 × 9
#>    x         y z     .first.x .last.x .first.y .last.y .first.z .last.z
#>    <chr> <int> <chr> <lgl>    <lgl>   <lgl>    <lgl>   <lgl>    <lgl>  
#>  1 <NA>      2 <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#>  2 <NA>      2 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  3 <NA>      2 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  4 <NA>      1 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  5 <NA>      1 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  6 <NA>      1 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  7 <NA>     NA <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  8 <NA>     NA a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  9 <NA>     NA b     FALSE    TRUE    FALSE    TRUE    TRUE     TRUE   
#> 10 a         2 <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#> # ℹ 17 more rows

# delete the first and last
delete_first_last <- sas_sort(df, x, dplyr::desc(y), z) %>%
  dplyr::select(
    -dplyr::starts_with(".first."),
    -dplyr::starts_with(".last.")
  )

delete_first_last
#> # A tibble: 27 × 3
#>    x         y z    
#>    <chr> <int> <chr>
#>  1 <NA>      2 <NA> 
#>  2 <NA>      2 a    
#>  3 <NA>      2 b    
#>  4 <NA>      1 <NA> 
#>  5 <NA>      1 a    
#>  6 <NA>      1 b    
#>  7 <NA>     NA <NA> 
#>  8 <NA>     NA a    
#>  9 <NA>     NA b    
#> 10 a         2 <NA> 
#> # ℹ 17 more rows

Created on 2023-07-19 with reprex v2.0.2
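To see the target ordering in isolation, here is a small base R sketch. It is not how sas_sort is implemented (that uses an extra !is.na() sort key inside dplyr::arrange); it only illustrates "missing values sort as the smallest" with order() and its na.last argument on a made-up vector x.

```r
x <- c("b", NA, "a", NA)

x[order(x)]                                      # default na.last = TRUE: NA at the end
x[order(x, na.last = FALSE)]                     # NA first, where PROC SORT would place it
x[order(x, decreasing = TRUE, na.last = TRUE)]   # descending: NA (the smallest) stays last
```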
