R: problem reclaiming memory after gc() when using the arrow library

Posted by wfveoks0 on 2023-06-27 in Other

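I am using R version 4.1.3 on Windows 10 and have run into a memory usage problem.
My program needs the arrow and dplyr libraries. When I compare the memory reported by the Windows Task Manager with the value returned by memory.size(max=F), the Task Manager figure is much larger: 243.5 MB of RAM according to Windows, against 75.77 MB according to memory.size(max=F).
This is despite the fact that I removed the objects I had created with rm() and then called gc() to reclaim the memory they used.
Below is the R code I use to reproduce the problem, first with its output and then without: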

- Code with output

> gc(verbose = TRUE)
Garbage collection 2 = 0+0+2 (level 2) ... 
14.2 Mbytes of cons cells used (41%)
3.9 Mbytes of vectors used (6%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 264908 14.2     648748 34.7   401965 21.5
Vcells 500529  3.9    8388608 64.0  1671274 12.8
> 
> # basic memory
> memory.size(max=F)
[1] 28.78
> 
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> 
> # Memory after loading the arrow library with memory.size
> memory.size(max=F)
[1] 51.32
> 
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> 
> # Memory after loading the dplyr library with memory.size
> memory.size(max=F)
[1] 90.2
> 
> df <- data.frame(
+   col1 = rnorm(1000000),
+   col2 = rnorm(1000000),
+   col3 = runif(1000000),
+   col4 = sample(1:999, size = 1000000, replace = T),
+   col5 = sample(c("GroupA", "GroupB"), size = 1000000, replace = T),
+   col6 = sample(c("TypeA", "TypeB"), size = 1000000, replace = T)
+ )
> 
> # Memory after df object creation
> memory.size(max=F)
[1] 132.83
> 
> arrow::write_dataset(
+   df,
+   paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"),
+   format = "parquet"
+ )
> 
> # Memory after writing to disk
> memory.size(max=F)
[1] 120.07
> 
> rm(df)
> 
> # Memory after deletion df
> memory.size(max=F)
[1] 120.07
> 
> gc(verbose = TRUE)
Garbage collection 15 = 9+2+4 (level 2) ... 
45.0 Mbytes of cons cells used (61%)
38.0 Mbytes of vectors used (49%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  842160   45    1380031 73.8  1380031 73.8
Vcells 4976056   38   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 101.27
> 
> gc(verbose = TRUE)
Garbage collection 16 = 9+2+5 (level 2) ... 
45.0 Mbytes of cons cells used (61%)
11.3 Mbytes of vectors used (15%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  842053 45.0    1380031 73.8  1380031 73.8
Vcells 1475891 11.3   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 74.34
> 
> ds <- arrow::open_dataset(paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"))
> 
> # Memory after ds creation
> memory.size(max=F)
[1] 79.02
> 
> req <-
+   ds %>%
+   collect()
> 
> # Memory after req creation
> memory.size(max=F)
[1] 84.45
> 
> rm(req)
> 
> # Memory after deletion of req
> memory.size(max=F)
[1] 84.45
> 
> gc(verbose = TRUE)
Garbage collection 17 = 9+2+6 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927293 49.6    1797205 96.0  1380031 73.8
Vcells 1627658 12.5   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.77
> 
> gc(verbose = TRUE)
Garbage collection 18 = 9+2+7 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927239 49.6    1797205 96.0  1380031 73.8
Vcells 1627568 12.5   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.77
> 
> rm(ds)
> 
> # Memory after deletion of ds
> memory.size(max=F)
[1] 75.77
> 
> gc(verbose = TRUE)
Garbage collection 19 = 9+2+8 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927149 49.6    1797205 96.0  1380031 73.8
Vcells 1627532 12.5   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.77
> 
> gc(verbose = TRUE)
Garbage collection 20 = 9+2+9 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927146 49.6    1797205 96.0  1380031 73.8
Vcells 1627527 12.5   10146329 77.5  8388368 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.77
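
To put the final readings of the transcript above side by side, the sketch below simply redoes the arithmetic in R. All three inputs are copied from the post (the last gc() table, the last memory.size() call, and the 243.5 MB Task Manager figure). The variable names are made up for illustration, and the sketch assumes all three figures use the same megabyte unit, which may not be exactly true for Task Manager.

```r
# Readings copied from the post; variable names are illustrative only.
gc_used_mb      <- c(Ncells = 49.6, Vcells = 12.5)  # "used (Mb)" column of the last gc()
memory_size_mb  <- 75.77                            # last memory.size(max=F)
task_manager_mb <- 243.5                            # Windows Task Manager figure

# Memory used by R objects, as reported by gc()
r_heap_mb <- sum(gc_used_mb)

# Gaps between the three measurements
gap_memory_size_vs_heap <- memory_size_mb - r_heap_mb
gap_task_manager_vs_memory_size <- task_manager_mb - memory_size_mb

print(round(c(
  r_heap_mb = r_heap_mb,
  memory_size_minus_heap = gap_memory_size_vs_heap,
  task_manager_minus_memory_size = gap_task_manager_vs_memory_size
), 2))
```

In a live session the first value could be read directly with sum(gc()[, 2]), since the second column of the matrix returned by gc() is the "used" size in Mb.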

- Code without output

gc(verbose = TRUE)

# basic memory
memory.size(max=F)

library(arrow)

# Memory after loading the arrow library with memory.size
memory.size(max=F)

library(dplyr)

# Memory after loading the dplyr library with memory.size
memory.size(max=F)

df <- data.frame(
  col1 = rnorm(1000000),
  col2 = rnorm(1000000),
  col3 = runif(1000000),
  col4 = sample(1:999, size = 1000000, replace = T),
  col5 = sample(c("GroupA", "GroupB"), size = 1000000, replace = T),
  col6 = sample(c("TypeA", "TypeB"), size = 1000000, replace = T)
)

# Memory after df object creation
memory.size(max=F)

arrow::write_dataset(
  df,
  paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"),
  format = "parquet"
)

# Memory after writing to disk
memory.size(max=F)

rm(df)

# Memory after deletion df
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)

ds <- arrow::open_dataset(paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"))

# Memory after ds creation
memory.size(max=F)

req <-
  ds %>%
  collect()

# Memory after req creation
memory.size(max=F)

rm(req)

# Memory after deletion of req
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)

rm(ds)

# Memory after deletion of ds
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)

gc(verbose = TRUE)

# Memory after gc(verbose = TRUE)
memory.size(max=F)
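
The script above repeats the same measurement after every step. As a possible simplification, the sketch below wraps the readings in one function that returns a single row per call. The helper name mem_snapshot and its column names are hypothetical and not part of the original post. Note also that it always calls gc(), so unlike the script it cannot show the state before a collection. It assumes that memory.size() is only meaningful on Windows with R older than 4.2.0, and that arrow may or may not be installed.

```r
# Hypothetical helper (not from the post): one row of memory readings per call.
mem_snapshot <- function(label) {
  g <- gc()  # runs a collection and returns the usage matrix

  # memory.size() is Windows-specific and only a stub from R 4.2.0 onwards,
  # so it is called only where it is meaningful; otherwise NA is recorded.
  win_mb <- if (.Platform$OS.type == "windows" && getRversion() < "4.2.0") {
    utils::memory.size(max = FALSE)
  } else {
    NA_real_
  }

  # Arrow's default memory pool is queried only when the package is installed.
  has_arrow <- requireNamespace("arrow", quietly = TRUE)
  pool <- if (has_arrow) arrow::default_memory_pool() else NULL

  data.frame(
    step = label,
    r_heap_mb = sum(g[, 2]),  # "used (Mb)" for Ncells + Vcells
    memory_size_mb = win_mb,
    arrow_allocated_mb = if (has_arrow) pool$bytes_allocated / 1024^2 else NA_real_,
    arrow_max_mb = if (has_arrow) pool$max_memory / 1024^2 else NA_real_
  )
}

# Usage: collect one row per step, then print the table.
readings <- mem_snapshot("start")
x <- rnorm(1e6)
readings <- rbind(readings, mem_snapshot("after creating x"))
rm(x)
readings <- rbind(readings, mem_snapshot("after rm(x)"))
print(readings)
```

Keeping every reading in one data frame makes it easier to see which step changes which counter than scrolling through separate console outputs.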

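Do you think this difference in memory is normal? Could it be caused by the libraries I use and/or by bad practices on my part in R?
I would like to understand why the Windows Task Manager and R's memory.size(max=F) function report different memory usage.
Thank you for your help. I am happy to provide any further information you may need.
Best regards,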

Answer #1, by 2izufjch

As a complement, I used default_memory_pool()$bytes_allocated and default_memory_pool()$max_memory. Here is what they returned:

> gc(verbose = TRUE)
Garbage collection 2 = 0+0+2 (level 2) ... 
14.2 Mbytes of cons cells used (41%)
3.9 Mbytes of vectors used (6%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 264908 14.2     648748 34.7   401965 21.5
Vcells 500529  3.9    8388608 64.0  1671274 12.8
> 
> # basic memory
> memory.size(max=F)
[1] 28.78
> 
> library(arrow, warn.conflicts = FALSE)
> 
> # Memory after loading the arrow library with memory.size
> memory.size(max=F)
[1] 51.01
> 
> # bytes_allocated after loading the arrow library
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after loading the arrow library
> default_memory_pool()$max_memory
[1] 0
> 
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> 
> # Memory after loading the dplyr library with memory.size
> memory.size(max=F)
[1] 90.74
> 
> # bytes_allocated after loading the dplyr library
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after loading the dplyr library
> default_memory_pool()$max_memory
[1] 0
> 
> df <- data.frame(
+   col1 = rnorm(1000000),
+   col2 = rnorm(1000000),
+   col3 = runif(1000000),
+   col4 = sample(1:999, size = 1000000, replace = T),
+   col5 = sample(c("GroupA", "GroupB"), size = 1000000, replace = T),
+   col6 = sample(c("TypeA", "TypeB"), size = 1000000, replace = T)
+ )
> 
> # Memory after df object creation
> memory.size(max=F)
[1] 133.23
> 
> # bytes_allocated after df object creation
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after df object creation
> default_memory_pool()$max_memory
[1] 0
> 
> arrow::write_dataset(
+   df,
+   paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"),
+   format = "parquet"
+ )
> 
> # Memory after writing to disk
> memory.size(max=F)
[1] 120.07
> 
> # bytes_allocated after writing to disk
> default_memory_pool()$bytes_allocated
[1] 19000128
> 
> # max_memory after writing to disk
> default_memory_pool()$max_memory
[1] 27126592
> 
> rm(df)
> 
> # Memory after deletion df
> memory.size(max=F)
[1] 120.07
> 
> # bytes_allocated after deletion df
> default_memory_pool()$bytes_allocated
[1] 19000128
> 
> # max_memory after deletion df
> default_memory_pool()$max_memory
[1] 27126592
> 
> gc(verbose = TRUE)
Garbage collection 15 = 9+2+4 (level 2) ... 
45.0 Mbytes of cons cells used (61%)
38.0 Mbytes of vectors used (49%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  842008   45    1387691 74.2  1387691 74.2
Vcells 4975717   38   10146329 77.5  8388601 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 101.29
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 27126592
> 
> gc(verbose = TRUE)
Garbage collection 16 = 9+2+5 (level 2) ... 
45.0 Mbytes of cons cells used (61%)
11.3 Mbytes of vectors used (15%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  841895 45.0    1387691 74.2  1387691 74.2
Vcells 1475542 11.3   10146329 77.5  8388601 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 74.35
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 27126592
> 
> ds <- arrow::open_dataset(paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"))
> 
> # Memory after ds creation
> memory.size(max=F)
[1] 79.01
> 
> # bytes_allocated after ds creation
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after ds creation
> default_memory_pool()$max_memory
[1] 27126592
> 
> req <-
+   ds %>%
+   collect()
> 
> # Memory after req creation
> memory.size(max=F)
[1] 84.46
> 
> # bytes_allocated after req creation
> default_memory_pool()$bytes_allocated
[1] 47504192
> 
> # max_memory after req creation
> default_memory_pool()$max_memory
[1] 83176320
> 
> rm(req)
> 
> # Memory after deletion req
> memory.size(max=F)
[1] 84.47
> 
> # bytes_allocated after deletion req
> default_memory_pool()$bytes_allocated
[1] 47504192
> 
> # max_memory after deletion req
> default_memory_pool()$max_memory
[1] 83176320
> 
> gc(verbose = TRUE)
Garbage collection 17 = 9+2+6 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927153 49.6    1792975 95.8  1387691 74.2
Vcells 1627339 12.5   10146329 77.5  8388601 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.8
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 83176320
> 
> gc(verbose = TRUE)
Garbage collection 18 = 9+2+7 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  927081 49.6    1792975 95.8  1387691 74.2
Vcells 1627219 12.5   10146329 77.5  8388601 64.0
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 83176320
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.8
> 
> rm(ds)
> 
> # Memory after deletion of ds
> memory.size(max=F)
[1] 75.8
> 
> # bytes_allocated after deletion of ds
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after deletion of ds
> default_memory_pool()$max_memory
[1] 83176320
> 
> gc(verbose = TRUE)
Garbage collection 19 = 9+2+8 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  926997 49.6    1792975 95.8  1387691 74.2
Vcells 1627193 12.5   10146329 77.5  8388601 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.8
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 83176320
> 
> gc(verbose = TRUE)
Garbage collection 20 = 9+2+9 (level 2) ... 
49.6 Mbytes of cons cells used (52%)
12.5 Mbytes of vectors used (16%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  926988 49.6    1792975 95.8  1387691 74.2
Vcells 1627178 12.5   10146329 77.5  8388601 64.0
> 
> # Memory after gc(verbose = TRUE)
> memory.size(max=F)
[1] 75.8
> 
> # bytes_allocated after gc(verbose = TRUE)
> default_memory_pool()$bytes_allocated
[1] 0
> 
> # max_memory after gc(verbose = TRUE)
> default_memory_pool()$max_memory
[1] 83176320

1 - After loading all the required libraries:

memory.size() = 90.74
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 0

2 - After creating the df object with data.frame:

memory.size() = 133.23
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 0

No arrow function has been called yet, which I think explains why the values of $bytes_allocated and $max_memory are unchanged?

3 - After calling arrow::write_dataset:

memory.size() = 120.07
default_memory_pool()$bytes_allocated = 19000128
default_memory_pool()$max_memory = 27126592

Calling an arrow function changes the values of $bytes_allocated and $max_memory.

4 - After deleting the df object and calling gc():

memory.size() = 74.35
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 27126592

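I do not understand why default_memory_pool()$bytes_allocated is 0 after df is deleted, given that it was 0 when df was created and 19000128 after arrow::write_dataset. Shouldn't it still be 19000128?

5 - After creating the ds object with arrow::open_dataset: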

memory.size() = 79.01
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 27126592

Calling an arrow function to create ds does not change the values of $bytes_allocated and $max_memory. Why not?

6 - After reading the contents of ds into the req object with collect():

memory.size() = 84.46
default_memory_pool()$bytes_allocated = 47504192
default_memory_pool()$max_memory = 83176320

Here, calling an arrow function once again changes the values of $bytes_allocated and $max_memory. Why is that?

7 - After deleting the req object and calling gc():

memory.size() = 75.8
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 83176320

Deleting the req object changes the value of $bytes_allocated.

8 - After deleting the ds object and calling gc():

memory.size() = 75.8
default_memory_pool()$bytes_allocated = 0
default_memory_pool()$max_memory = 83176320

I do not really understand how $bytes_allocated and $max_memory work. Could you explain?
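
To make the eight readings above easier to compare with memory.size(), the sketch below converts the pool values from bytes to Mb and restates the pattern visible in the transcript. The numbers are copied from steps 1 to 8; the variable names are illustrative. It assumes that memory.size() reports Mb of 1048576 bytes, as its documentation states. The final checks only describe what these particular readings show (a counter that goes back to 0 and a maximum that never decreases) and are not a statement about arrow's internals.

```r
# Pool readings copied from steps 1-8 of the post (bytes); names are illustrative.
bytes_allocated <- c(0, 0, 19000128, 0, 0, 47504192, 0, 0)
max_memory      <- c(0, 0, 27126592, 27126592, 27126592, 83176320, 83176320, 83176320)

# Convert to Mb of 1048576 bytes so the values line up with memory.size().
pool_mb <- data.frame(
  step = 1:8,
  allocated_mb = round(bytes_allocated / 1024^2, 2),
  max_mb = round(max_memory / 1024^2, 2)
)
print(pool_mb)

# In these readings max_memory never decreases, which is consistent with
# it acting as a high-water mark rather than a current value.
stopifnot(!is.unsorted(max_memory))

# max_memory is always at least the largest bytes_allocated seen so far,
# and can exceed it because peaks may occur between two readings.
stopifnot(all(max_memory >= cummax(bytes_allocated)))
```

Read this way, the 19000128 bytes seen after write_dataset are about 18.12 Mb and the 47504192 bytes seen after collect() are about 45.3 Mb, which can be set against the memory.size() values recorded at the same steps.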
