R: How can I determine whether a column is quantitative or categorical data? [closed]

tjrkku2a  posted on 2023-05-20 in Other

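**Closed.** This question is not about programming or software development. It is not currently accepting answers.

This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 4 days ago.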
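If I have a file with many columns, and the data are all numbers, how can I tell whether a particular column is categorical or quantitative data? Is there a field of study for this kind of problem? If not, what heuristics can be used to decide?
Some heuristics I can think of: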

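Probably categorical data

  • Count the unique values; if the count is < some_threshold, the column is more likely to be categorical.
  • If the data are highly concentrated (low standard deviation)
  • If the unique values are highly sequential and start at 1
  • If all values in the column have a fixed length (could be an ID or a date)
  • If it has a very small p-value when tested against Benford's Law
  • If it has a very small p-value in a chi-squared test against the outcome column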
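
A rough sketch of three of the signals above (few unique values, fixed length, sequential from 1) might look like the following. The helper name categorical_signals, its max_unique argument, and the sample vectors are all made up for illustration; the threshold of 10 is an arbitrary assumption, not a recommended value.

```r
# Hypothetical helper (not from the post): summarise some of the
# "probably categorical" signals for one numeric column.
categorical_signals <- function(x, max_unique = 10) {
  x <- x[!is.na(x)]
  u <- sort(unique(x))
  list(
    few_unique   = length(u) < max_unique,                        # small number of distinct values
    fixed_width  = length(unique(nchar(as.character(x)))) == 1,  # every value prints at the same length
    seq_from_one = length(u) > 0 && u[1] == 1 && all(diff(u) == 1)  # 1, 2, 3, ... with no gaps
  )
}

# Made-up example columns
codes <- c(1, 2, 3, 1, 2, 3, 2, 1)
sizes <- c(10.5, 220.1, 3.7, 48, 1999, 7.25, 63.4, 151, 0.9, 12.8, 305.6, 88)

str(categorical_signals(codes))
str(categorical_signals(sizes))
```

Each signal is only a hint; how to combine them into a decision is left open here, as it is in the list above.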

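Probably quantitative data

  • If the column has floating-point numbers
  • If the column has sparse values
  • If the column has negative values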
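
Two of these signals are easy to check directly. The sketch below is illustrative only: quantitative_signals is a hypothetical helper name and the vectors are made up. The "sparse values" signal is left out because the list does not define it precisely.

```r
# Hypothetical helper (not from the post): flag two of the
# "probably quantitative" signals for one numeric column.
quantitative_signals <- function(x) {
  x <- x[!is.na(x)]
  c(
    has_fraction = any(x != round(x)),  # at least one non-whole number
    has_negative = any(x < 0)           # at least one value below zero
  )
}

# Made-up example columns
temps <- c(-3.5, 0, 12.25, 7)
codes <- c(1, 2, 3, 1, 2)

quantitative_signals(temps)
quantitative_signals(codes)
```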

Other

  • Maybe quantitative columns are more likely to sit next to other quantitative columns (and vice versa)

I am using R, but the question does not need to be R-specific.
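
Since R is mentioned, the chi-squared heuristic from the list could be sketched with chisq.test() from base R. The data here are made up, and constructed so that the coded column is perfectly tied to the outcome; real data will be far less clear-cut.

```r
# Made-up data: a numerically coded column that lines up exactly with the outcome.
code    <- rep(1:3, each = 20)
outcome <- rep(c("a", "b", "c"), each = 20)

# Treat the numeric codes as categories and test independence from the outcome.
test <- chisq.test(table(code, outcome))
test$p.value
```

A small p-value only says that the column, treated as categories, is associated with the outcome. It does not by itself show that the column is categorical, so this is best read as one signal among several.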

waxmsbnn1#

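The str() and class() approach shown in this answer assumes that someone has coded the data correctly.
Perhaps you are suggesting that the data were not coded or labeled correctly: everything was entered as numbers, and some of it is really categorical. In that case, I do not know how anyone could tell with certainty. Categorical data can have decimal places and can be negative.
The question I would ask myself in that situation is: what difference does it make how I treat the data?
If you are interested in the second scenario, perhaps you should ask your question on Stack Exchange.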
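
One way to see what difference the treatment makes is to fit the same model twice. This is only an illustration: it uses R's built-in mtcars data and the cyl column, neither of which is part of this answer.

```r
# Same column, two treatments: a single numeric slope vs. one coefficient per level.
fit_numeric <- lm(mpg ~ cyl, data = mtcars)
fit_factor  <- lm(mpg ~ factor(cyl), data = mtcars)

length(coef(fit_numeric))  # intercept + one slope
length(coef(fit_factor))   # intercept + one coefficient per non-reference level
```

Treated as numeric, cyl is forced onto a straight line; treated as a factor, each cylinder count gets its own estimate.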

my.data <- read.table(text = '
    aa     bb      cc     dd
    10    100    1000      1
    20    200    2000      2
    30    300    3000      3
    40    400    4000      4
    50    500    5000      5
    60    600    6000      6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))

my.data

# one way
str(my.data)

'data.frame':   6 obs. of  4 variables:
 $ aa: num  10 20 30 40 50 60
 $ bb: chr  "100" "200" "300" "400" ...
 $ cc: num  1000 2000 3000 4000 5000 6000
 $ dd: chr  "1" "2" "3" "4" ...

Here is a way to record the information:

my.class <- rep('empty', ncol(my.data))

for(i in 1:ncol(my.data)) {
    my.class[i] <- class(my.data[,i])
}

> my.class
[1] "numeric"   "character" "numeric"   "character"

EDIT

Here is a way to record the class of each column without using a for-loop:

my.class <- sapply(my.data, class)
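
A possible refinement, sketched under the assumption that the data frame may also hold columns with more than one class (the date-time column named when below is made up and is not in the data above): class() then returns several values, sapply() falls back to a list, and vapply() with the first class keeps the result a plain character vector.

```r
# Made-up variant of the example data with an extra date-time column.
my.data <- data.frame(
  aa   = c(10, 20, 30),
  bb   = c("100", "200", "300"),
  when = as.POSIXct(c("2023-05-20", "2023-05-21", "2023-05-22"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# class() returns two values for the date-time column, so keep only the first.
first.class <- vapply(my.data, function(col) class(col)[1], character(1))
first.class

# Columns stored as numeric are the ones where the question is still open.
names(first.class)[first.class == "numeric"]
```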

xuo3flqw2#

Here is a first cut at an R function that uses most of the suggested heuristics:

require( "hablar" )
require( "DescTools" )

# unique.p - threshold for unique values as a proportion of total cases
# unique.n - if unique values of x < unique.n then classify as factor
# first.n - if 90% of cases are contained within the first.n levels then classify as a factor 
# max.v - if x is an integer and the variance of x is below max.v, a note is printed (the final decision below does not use it)
# b.to.f - convert binary variables (x in 0,1) to factors? 


is_factor <- function( x, unique.p=0.10, unique.n=(length(x)*unique.p),  
                       first.n=25, max.v=2, b.to.f=FALSE )
{
  cat( paste0( "\n-----------------  ", deparse(substitute(x)), "\n\n" ) )

  # exclude NA, NaN, and Inf

  if( is.numeric(x) | is.logical(x) )
  {  x <- x[ is.finite(x) ] }

  if( is.character(x) )
  {  
    x[ x == "NaN" | x == "Inf" ] <- NA
    x <- na.omit(x)
  }
 

  n <- length(x)

  if( n == 0 )
  {
    cat( "The variable is empty (all NAs)" )
    return(FALSE)
  }

  cat( paste0( "Valid N (after NA drop) = ", n, "\n" ) )
  cat( paste0( "Unique levels/values of x = ", length(unique(x)), "\n" ) )
  cat( paste0( "unique.n argument = ", unique.n, "\n" ) )
  cat( paste0( "unique.p argument = ", unique.p, "\n\n" ) )

  if( "factor" %in% class(x) )
  { 
    cat( "has class FACTOR \n" )
    cat( paste0( "Values of x: \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n\n" ) )
    cat( "####   IS FACTOR   #### \n\n\n" )
    return(TRUE) 
  }

  if( "logical" %in% class(x) )
  { 
    cat( "has class LOGICAL: is NOT a factor \n\n" )
    cat( paste0( "Values of x: \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n\n" ) )
    return(FALSE) 
  }

  if( any( DescTools::IsDate(x) ) )
  {
    x.dates <- x[ DescTools::IsDate(x) ]
    cat( "x has class DATE: is NOT a factor \n" )
    cat( paste0( "Values of x: \n", paste( head( unique(x.dates), 10 ), collapse=",\n" ) ) )
    return(FALSE)
  }
  
  if( "character" %in% class(x) )
  {
    cat( "x has class CHARACTER: \n\n" )

    # constant column: all values of x are identical
    if( length(unique(x)) == 1 )
    { 
      cat( "All values of x are the same: \n" )
      cat( paste0( "Values of x: \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n" ) )
      if( b.to.f )
      { 
        cat( "Convert binary to factor is set to TRUE \n\n" )
        cat( "####   IS FACTOR   #### \n\n\n" )
        return(TRUE) 
      }
      cat( "Convert binary to factor is set to FALSE \n\n" )
      return(FALSE) 
    }

    # strings with same length (standardized categories) 
    #  but keep the total levels low so it doesn't flag IDs
    is.same <- length( unique( nchar(x) ) ) == 1  & length(unique(x)) < ( n * unique.p )
    
    if( is.same )
    { 
      cat( "All strings have the same number of characters \n\n" )
      cat( paste0( "Values of x (first 10): \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n\n" ) ) 
    }
    
    # small number of unique cases
    n.unique <- length( unique( x ) ) 
    
    # small prop of total cases unique
    p.unique <- length( unique( x ) ) / n
    
    is.small.unique.n <- n.unique <= unique.n & p.unique <= unique.p

    if( is.small.unique.n )
    { 
      cat( "x has a small number & proportion of unique cases\n" )
      cat( paste0( "N < ", unique.n, " & prop < ", unique.p, "\n" ) )
      cat( paste0( "Number of unique values of x: ", length(unique(x)), "\n" ) )
      cat( paste0( "Values of x (first 10): \n", paste( head( sort(unique(x)), 10 ), collapse=",\n" ), "\n\n" ) )
    }
   
    # most common levels account for large portion of total
    # (base R only, so no pipe operator is needed;
    #  decreasing=TRUE sorts the most frequent levels first)
    
    level.counts <- sort( table(x), decreasing=TRUE )
    first.n.total <- sum( head( level.counts, first.n ) )
    total.p <- first.n.total / n
    is.large.p.total <- total.p > 0.90

    first.n.levels <- names( head( level.counts, first.n ) )

    if( is.large.p.total )
    { 
      cat( paste0( "First ", first.n, " levels accounts for > 90% of total cases \n" ) ) 
      cat( paste0( "First N levels: \n", paste( first.n.levels, collapse=",\n" ), "\n\n" ) )
    }
    
    # if it meets any criteria return factor
    if( is.same | is.small.unique.n | is.large.p.total )
    { 
      cat( "####   IS FACTOR   #### \n\n\n" )
      return(TRUE) 
    }
  }
  
  # only test integers 
  x <- hablar::retype(x)

  if( "numeric" %in% class(x) )
  { 
    cat( "x is non-integer number: NOT a factor \n\n" )
    cat( paste0( "Values of x (first 10): \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n\n" ) )
    return(FALSE)
  }
  
  if( "integer" %in% class(x) )
  {
    cat( "x has class INTEGER: \n\n" )

    # is a logical vector
    if( all( x %in% c(0,1) ) | length(unique(x))==1 )
    { 
      cat( "All values of x are 0/1 or a single value: \n" )
      cat( paste0( "Values of x: \n", paste( head( unique(x), 10 ), collapse=",\n" ), "\n" ) )
      if( b.to.f )
      { 
        cat( "Convert binary to factor is set to TRUE \n\n" )
        cat( "####   IS FACTOR   #### \n\n\n" )
        return(TRUE) 
      }
      cat( "Convert binary to factor is set to FALSE \n\n" )
      return(FALSE) 
    }
    
    # has negative values 
    if( any( x < 0 ) )
    { 
      cat( "Contains negative integers \n" )
      cat( paste0( "Range x: ", paste( range(x), collapse=" to " ), "\n\n" ) )
      return(FALSE) 
    }
    
    # small number of unique values
    n.unique <- length( unique( x ) ) 
    
    # small prop of total cases unique
    p.unique <- length( unique( x ) ) / n
    
    is.small.unique.n <- n.unique <= unique.n & p.unique <= unique.p

    if( is.small.unique.n )
    { 
      cat( "x has a small number & proportion of unique cases \n" )
      cat( paste0( "unique(x) < ", unique.n, " & unique(x)/length(x) < ", unique.p, " \n" ) )
      cat( paste0( "Number of unique values of x: ", length(unique(x)), "\n" ) )
      cat( paste0( "Values of x (first 10): \n", paste( head( sort(unique(x)), 10 ), collapse=",\n" ), "\n\n" ) )
    }
    
    # starts with 1 and is an approximate sequence
    starts.with.one <- min(x) == 1 
    width.of.range.x <- max(x) - min(x) + 1
    is.approx.seq <- length(unique(x)) / width.of.range.x > 0.8
    
    is.seq.from.one <- starts.with.one & is.approx.seq

    if( is.seq.from.one )
    { cat( "x is an approximate sequence of integers starting with one \n\n" ) }
    
    # is a true sequence, e.g. 9,10,11,12
    is.true.seq <- length(unique(x)) == width.of.range.x & 
                   length(unique(x))/length(x) < unique.p

    if( is.true.seq )
    { 
      cat( "x is a true sequence of integers \n" )
      cat( paste0( "Values: \n", paste( sort(unique(x)), collapse=",\n" ), "\n\n" ) )
    }
    
    # equal intervals between all numbers
    is.equal.intervals <- length( unique( x[-1] - x[-length(x)] ) ) == 1
    
    if( is.equal.intervals )
    {
      cat( "All values of x have equal intervals between them \n" )
      cat( paste0( "Values: ", paste( head(sort(unique(x))), collapse="," ), "\n\n" ) )
    }

    # small variance
    is.small.var <- var(x) < max.v

    if( is.small.var )
    { cat( paste0( "The variance of x is below ", max.v, "\n\n" ) ) }
    
    # if it meets any criteria return factor
    if( is.small.unique.n | is.seq.from.one | is.true.seq | is.equal.intervals )
    { 
      cat( "####   IS FACTOR   #### \n\n\n" )
      return(TRUE) 
    }  
  }

  cat( "There are a large number of unique values: x is NOT a factor \n" )
  cat( paste0( "Number of unique values of x: ", length(unique(x)), "\n" ) )
  cat( paste0( "Values of x (first 10): \n", paste( head( sort(unique(x)), 10 ), collapse=",\n" ), "\n\n" ) )
  return( FALSE )
}
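
To make one of the integer checks easier to follow, here is a simplified, stand-alone sketch of the "approximate sequence starting with one" test. The name is_seq_from_one and the min.fill argument are hypothetical; the 0.8 cutoff is the one used inside is_factor. It uses R's built-in mtcars data.

```r
# Simplified version of the "approximate sequence from one" check.
is_seq_from_one <- function(x, min.fill = 0.8) {
  u <- unique(x)
  width <- max(x) - min(x) + 1   # how many integers the range spans
  min(x) == 1 && length(u) / width > min.fill
}

is_seq_from_one(c(1, 2, 3, 4, 5, 5, 3))  # 5 of 5 integers in the range are present
is_seq_from_one(mtcars$carb)             # 6 of 8 integers present: 0.75, below the cutoff
is_seq_from_one(mtcars$gear)             # range starts at 3, not 1
```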

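Sample data set: mtcars
Potential factors might be:

  • cyl (number of cylinders)
  • gear (number of forward gears)
  • carb (number of carburetors)
  • vs (V-shaped or straight engine, coded 0/1)
  • am (automatic or manual transmission, coded 0/1)

cyl and gear are flagged as factors. carb has 6 unique values, a unique proportion of 6/32 ≈ 19%, which is above the 10% threshold set by unique.p.
These parameters are sensitive to sample size. For example, 50 unique state codes make up a large share of the values in a data set with a few hundred addresses, but the number of states does not grow with the data, so the proportion of unique values naturally shrinks as the data set grows. Small demo data sets like this one are especially sensitive.
If you want binary variables to be flagged as factors, set the argument b.to.f to TRUE; in this example those are vs and am.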
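
The sample-size sensitivity described above can be illustrated with a small deterministic sketch. prop_unique is a hypothetical helper; it cycles through R's built-in state.abb vector (50 state abbreviations) instead of using real addresses.

```r
# Proportion of unique values when 50 state codes are spread over n rows.
prop_unique <- function(n) {
  states <- rep(state.abb, length.out = n)
  length(unique(states)) / n
}

prop_unique(200)    # 50 / 200
prop_unique(5000)   # 50 / 5000

# Compare against the default unique.p threshold of 0.10
unique.p <- 0.10
prop_unique(200)  <= unique.p
prop_unique(5000) <= unique.p
```

The same column fails the proportion test at 200 rows and passes it at 5000, which is why thresholds like unique.p probably need tuning to the size of the data.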

#   mpg Miles/(US) gallon
#   cyl Number of cylinders
#   disp    Displacement (cu.in.)
#   hp  Gross horsepower
#   drat    Rear axle ratio
#   wt  Weight (1000 lbs)
#   qsec    1/4 mile time
#   vs  Engine (0 = V-shaped, 1 = straight)
#   am  Transmission (0 = automatic, 1 = manual)
#   gear    Number of forward gears
#   carb    Number of carburetors

> head( mtcars )
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> lapply( mtcars, is_factor )

-----------------  mpg

Valid N (after NA drop) = 32
Unique levels/values of x = 25
unique.n argument = 3.2
unique.p argument = 0.1

x is non-integer number: NOT a factor 

Values of x (first 10): 
21,
22.8,
21.4,
18.7,
18.1,
14.3,
24.4,
19.2,
17.8,
16.4

-----------------  cyl

Valid N (after NA drop) = 32
Unique levels/values of x = 3
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

x has a small number & proportion of unique cases 
unique(x) < 3.2 & unique(x)/length(x) < 0.1 
Number of unique values of x: 3
Values of x (first 10): 
4,
6,
8

####   IS FACTOR   #### 


-----------------  disp

Valid N (after NA drop) = 32
Unique levels/values of x = 27
unique.n argument = 3.2
unique.p argument = 0.1

x is non-integer number: NOT a factor 

Values of x (first 10): 
160,
108,
258,
360,
225,
146.7,
140.8,
167.6,
275.8,
472

-----------------  hp

Valid N (after NA drop) = 32
Unique levels/values of x = 22
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

There are a large number of unique values: x is NOT a factor 
Number of unique values of x: 22
Values of x (first 10): 
52,
62,
65,
66,
91,
93,
95,
97,
105,
109

-----------------  drat

Valid N (after NA drop) = 32
Unique levels/values of x = 22
unique.n argument = 3.2
unique.p argument = 0.1

x is non-integer number: NOT a factor 

Values of x (first 10): 
3.9,
3.85,
3.08,
3.15,
2.76,
3.21,
3.69,
3.92,
3.07,
2.93
 

-----------------  wt

Valid N (after NA drop) = 32
Unique levels/values of x = 29
unique.n argument = 3.2
unique.p argument = 0.1

x is non-integer number: NOT a factor 

Values of x (first 10): 
2.62,
2.875,
2.32,
3.215,
3.44,
3.46,
3.57,
3.19,
3.15,
4.07

-----------------  qsec

Valid N (after NA drop) = 32
Unique levels/values of x = 30
unique.n argument = 3.2
unique.p argument = 0.1

x is non-integer number: NOT a factor 

Values of x (first 10): 
16.46,
17.02,
18.61,
19.44,
20.22,
15.84,
20,
22.9,
18.3,
18.9

-----------------  vs

Valid N (after NA drop) = 32
Unique levels/values of x = 2
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

All values of x are 0/1 or a single value: 
Values of x: 
0,
1
Convert binary to factor is set to FALSE 

-----------------  am

Valid N (after NA drop) = 32
Unique levels/values of x = 2
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

All values of x are 0/1 or a single value: 
Values of x: 
1,
0
Convert binary to factor is set to FALSE 

-----------------  gear

Valid N (after NA drop) = 32
Unique levels/values of x = 3
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

x has a small number & proportion of unique cases 
unique(x) < 3.2 & unique(x)/length(x) < 0.1 
Number of unique values of x: 3
Values of x (first 10): 
3,
4,
5

x is a true sequence of integers 
Values: 
3,
4,
5

The variance of x is below 2

####   IS FACTOR   #### 


-----------------  carb

Valid N (after NA drop) = 32
Unique levels/values of x = 6
unique.n argument = 3.2
unique.p argument = 0.1

x has class INTEGER: 

There are a large number of unique values: x is NOT a factor 
Number of unique values of x: 6
Values of x (first 10): 
1,
2,
3,
4,
6,
8

$mpg
[1] FALSE

$cyl
[1] TRUE

$disp
[1] FALSE

$hp
[1] FALSE

$drat
[1] FALSE

$wt
[1] FALSE

$qsec
[1] FALSE

$vs
[1] FALSE

$am
[1] FALSE

$gear
[1] TRUE

$carb
[1] FALSE
