R/SQL:查找表是否包含某个值

omqzjyyz  于 2023-06-19  发布在  其他
关注(0)|答案(2)|浏览(126)

我在做NetezzaSQL。
假设我有3个表位于一个服务器上(例如在真实的生活中,可能存在许多这样表,且每个表具有许多行和许多列):

  • 表1:col 1、col 2、col 3
  • 表2:col 4、col 5、col 6
  • 表3:col 7、col 8、col 9
  • 等等的。
    **我的问题:**我试图找出哪些表中的哪些列包含值%LIKE%“ABC”(即如果列中的行包含%LIKE%“ABC”值)

我想,也许一起使用R/SQL可能是解决这个问题的更好方法,因为我不认为有一种方法可以专门使用Netezza SQL(也许这是可能的动态查询,但这看起来非常困难。

**步骤1:**我已经在一个表中有了这个信息(使用Netezza内置的information_schema.column表创建):

# I re-created this for the stackoverflow example

table_name = c("table1","table1", "table1", "table2","table2", "table2", "table3", "table3", "table3")
col_name = c("col1","col2", "col3", "col4", "col5", "col6","col7", "col8", "col9")

summary = data.frame(table_name, col_name)

  table_name col_name
1     table1     col1
2     table1     col2
3     table1     col3
4     table2     col4
5     table2     col5
6     table2     col6
7     table3     col7
8     table3     col8
9     table3     col9

**步骤2:**我创建了一列SQL查询,我将发送:

# generate a column of SQL queries to be sent
summary$queries = SQL(paste0("select * from ", summary$table_name, " where ", "summary$col_name " ," LIKE %abc%"))

table_name col_name                                                 queries
1     table1     col1 select * from table1 where summary$col_name  LIKE %abc%
2     table1     col2 select * from table1 where summary$col_name  LIKE %abc%
3     table1     col3 select * from table1 where summary$col_name  LIKE %abc%
4     table2     col4 select * from table2 where summary$col_name  LIKE %abc%
5     table2     col5 select * from table2 where summary$col_name  LIKE %abc%
6     table2     col6 select * from table2 where summary$col_name  LIKE %abc%
7     table3     col7 select * from table3 where summary$col_name  LIKE %abc%
8     table3     col8 select * from table3 where summary$col_name  LIKE %abc%
9     table3     col9 select * from table3 where summary$col_name  LIKE %abc%

**第三步:**接下来,我写了下面的R代码:

#import libraries
library(odbc)
library(DBI)

# connect
my_con <- dbConnect(odbc:: odbc()...)

# create for loop

my_list = list()

for (i in 1:nrow(summary)) {
  tryCatch({
    is_null_i = dbExecute(my_con, summary$queries[i])
    ifelse(is.null(is_null_i) == "FALSE", my_list[[i]] = is_null_i, NA)
  }, error = function(e) {
    # ignore error
  })
}

**我的问题:**虽然每个单独的dbExecute(my_con, summary$queries[i])看起来都在运行,但is_null_i并没有看到被创建。

最后,我会寻找这样的输出:

# hypothetical output tables and columns that contain the word %ABC%:

  table_name col_name
1     table1     col1
2     table1     col3
3     table3     col7
4     table3     col8

”””有人可以告诉我一个更好的方法来做这个吗?**也许可以只在Netezza这样做?
谢谢!

yhived7q

yhived7q1#

library(tidyverse)
library(dbplyr)

我们可以使用DuckDB创建一个小型的内存中测试设置

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")

假设我们有以下三个表:

table1 <- tibble(
  col1 = c("A", "B", "aBc"),
  col2 = c("A", "B", "abc"),
  col3 = c("A", "B", "ABC")
)

table2 <- tibble(
  col4 = c("A", "B", "aBc"),
  col5 = c("A", "B", "ABC"),
  col6 = c("A", "B", "abc"),
  col7 = c("A", "B", "ABC")
)

table3 <- tibble(
  col8 = c("A", "B", "ABC"),
  col9 = c("A", "B", "abc"),
)

copy_to(con, table1)
copy_to(con, table2)
copy_to(con, table3)

因此期望输出为

table      column
table1     col3
table2     col5
table2     col7
table3     col8

如你所知,我们可以列出数据库“服务器”上的所有表

DBI::dbListTables(con)
#> [1] "table1" "table2" "table3"

理论上我们可以使用for循环,但对于这类操作,我更喜欢将内容保存在dataframe中,并依赖于不同的dplyr工具。首先,我们可以使用enframe将表名列表移动到tibble中,并建立到不同远程表的连接。可以这样做:

(remote_data <- DBI::dbListTables(con) |> 
  enframe(value = "table_name") |> 
  rowwise() |> 
  mutate(
    data = list(tbl(con, table_name))
  )
)
#> # A tibble: 3 × 3
#> # Rowwise: 
#>    name table_name data          
#>   <int> <chr>      <list>        
#> 1     1 table1     <tbl_dck_[,3]>
#> 2     2 table2     <tbl_dck_[,4]>
#> 3     3 table3     <tbl_dck_[,2]>

现在我们有了一个列,其中包含到duckdb表的连接列表。我们可以依靠dbplyr包为我们做翻译,而不是手动构造查询。我们先从其中一张table开始

remote_data[["data"]][[1]] |> 
  summarise(
    across(everything(), \(column) any(str_detect(column, "ABC")))
  ) |> 
  pivot_longer(everything())
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:   SQL [3 x 2]
#> # Database: DuckDB 0.5.1 [unknown@Linux 5.15.90.1-microsoft-standard-WSL2:R 4.2.2/:memory:]
#>   name  value
#>   <chr> <lgl>
#> 1 col1  FALSE
#> 2 col2  FALSE
#> 3 col3  TRUE

现在,我们可以像这样对所有远程表重复该操作

(results <- remote_data |> 
  reframe(
    table_name, 
    data |> 
      summarise(
        across(everything(), \(column) any(str_detect(column, "ABC")))
      ) |> 
      pivot_longer(
        everything(), 
        names_to = "column_name", 
        values_to = "contains_value"
      ) |> 
      collect()
  ) |> 
  ungroup()
)
#> # A tibble: 9 × 3
#>   table_name column_name contains_value
#>   <chr>      <chr>       <lgl>         
#> 1 table1     col1        FALSE         
#> 2 table1     col2        FALSE         
#> 3 table1     col3        TRUE          
#> 4 table2     col4        FALSE         
#> 5 table2     col5        TRUE          
#> 6 table2     col6        FALSE         
#> 7 table2     col7        TRUE          
#> 8 table3     col8        TRUE          
#> 9 table3     col9        FALSE

为了得到预期的输出,我们只需要过滤和选择

results |> 
  filter(contains_value) |> 
  select(-contains_value)
#> # A tibble: 4 × 2
#>   table_name column_name
#>   <chr>      <chr>      
#> 1 table1     col3       
#> 2 table2     col5       
#> 3 table2     col7       
#> 4 table3     col8

创建于2023-06-12带有reprex v2.0.2

fcy6dtqo

fcy6dtqo2#

library(tidyverse)
library(dbplyr)

## setup ###########
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
table1 <- tibble(
  col1 = c("A", "B", "aBc"),
  col2 = c("A", "B", "abc"),
  col3 = c("A", "B", "ABC")
)

table2 <- tibble(
  col4 = c("A", "B", "aBc"),
  col5 = c("A", "B", "ABC"),
  col6 = c("A", "B", "abc"),
  col7 = c("A", "B", "ABC")
)

table3 <- tibble(
  col8 = c("A", "B", "ABC"),
  col9 = c("A", "B", "abc"),
)
copy_to(con, table1)
copy_to(con, table2)
copy_to(con, table3)
####################

如果您不想依赖dbplyr的翻译,另一种选择是自己构造查询。我会建议使用胶水包中的glue_sql。使用paste构造查询通常是一种不好的做法,因为可能会遇到SQL注入的问题。

(queries_df <- DBI::dbListTables(con) |> 
  enframe(value = "table_name") |> 
  group_by(table_name) |> 
  reframe(
    column_name = DBI::dbListFields(con, table_name),
  ) |> 
  ungroup() |> 
  mutate(
    query = glue::glue_sql(
      "select * from {`table_name`} where {`column_name`} like '%ABC%'",
      .con = con
    ) |> as.character()
  )
 
)
#> # A tibble: 9 × 3
#>   table_name column_name query                                       
#>   <chr>      <chr>       <chr>                                       
#> 1 table1     col1        select * from table1 where col1 like '%ABC%'
#> 2 table1     col2        select * from table1 where col2 like '%ABC%'
#> 3 table1     col3        select * from table1 where col3 like '%ABC%'
#> 4 table2     col4        select * from table2 where col4 like '%ABC%'
#> 5 table2     col5        select * from table2 where col5 like '%ABC%'
#> 6 table2     col6        select * from table2 where col6 like '%ABC%'
#> 7 table2     col7        select * from table2 where col7 like '%ABC%'
#> 8 table3     col8        select * from table3 where col8 like '%ABC%'
#> 9 table3     col9        select * from table3 where col9 like '%ABC%'

然后你可以发送查询与DBI::dbGetQuery。这里你可以使用for循环,但是使用map可能更容易,因为我们已经有了表格格式的数据。

(results_df <- queries_df |> 
  mutate(
    results = map(query, \(.x) DBI::dbGetQuery(con, .x))
  )
)
#> # A tibble: 9 × 4
#>   table_name column_name query                                        results
#>   <chr>      <chr>       <chr>                                        <list> 
#> 1 table1     col1        select * from table1 where col1 like '%ABC%' <df>   
#> 2 table1     col2        select * from table1 where col2 like '%ABC%' <df>   
#> 3 table1     col3        select * from table1 where col3 like '%ABC%' <df>   
#> 4 table2     col4        select * from table2 where col4 like '%ABC%' <df>   
#> 5 table2     col5        select * from table2 where col5 like '%ABC%' <df>   
#> 6 table2     col6        select * from table2 where col6 like '%ABC%' <df>   
#> 7 table2     col7        select * from table2 where col7 like '%ABC%' <df>   
#> 8 table3     col8        select * from table3 where col8 like '%ABC%' <df>   
#> 9 table3     col9        select * from table3 where col9 like '%ABC%' <df>

然后,您可以得到如下所示的预期结果:

results_df |> 
  filter(map_int(results, nrow) != 0) |> 
  select(table_name, column_name)
#> # A tibble: 4 × 2
#>   table_name column_name
#>   <chr>      <chr>      
#> 1 table1     col3       
#> 2 table2     col5       
#> 3 table2     col7       
#> 4 table3     col8

创建于2023-06-12带有reprex v2.0.2

相关问题