Read a SQL Server table in PySpark (Databricks) with conditions, not the entire table

Asked by 4dc9hkyq on 2023-11-16 · Spark

Is there any way to read data into a PySpark DataFrame from a SQL Server table based on a condition, e.g. read only the rows where the column 'time_stamp' equals the current date?

Alternatively, I want to translate:

select * from table_name where time_stamp=cast(getdate() as date)

into a PySpark DataFrame.

I am using:

remote_table = (spark.read.format("sqlserver")
    .option("host", "host_name")
    .option("user", "user_name")
    .option("password", "password")
    .option("database", "database_name")
    .option("dbtable", "dbo.table_name")
    .load())

which reads the entire table 'table_name'. I just need to read the rows that satisfy a condition, like a WHERE clause in SQL.

35g0bw711#
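
One option is to push the filter into SQL Server itself by wrapping the SELECT in a subquery and passing it as the table argument to spark.read.jdbc. The snippet below is a sketch with placeholder connection details and an illustrative date-range filter on a hypothetical t_date column; adapt the query and column names to your own table: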

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("DateRangeQuery").getOrCreate()

# Define your JDBC connection properties
connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "url": "jdbc:sqlserver://<server>:<port>;databaseName=<database>"
}

# Define the date range
t_start = "2023-01-01"
t_end = "2023-12-31"

# Define your SQL query with the date range filter
sql_query = f"(SELECT * FROM table_name WHERE t_date BETWEEN '{t_start}' AND '{t_end}') AS custom_query"

# Read the filtered data using the SQL query
df = spark.read.jdbc(url=connection_properties["url"], table=sql_query, properties=connection_properties)

# Show the filtered DataFrame
df.show()

# Stop the Spark session
spark.stop()
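
For the current-date filter from the question, the same subquery approach can reuse the original T-SQL directly. Alternatively, staying with the Databricks sqlserver connector from the question, a filter() on the loaded DataFrame should be pushed down to SQL Server as a WHERE clause, so only today's rows are transferred. A minimal sketch, assuming the placeholder connection details above (host_name, user_name, etc. are not real values):

# Variant 1: push the exact T-SQL from the question into a subquery.
sql_query = "(SELECT * FROM table_name WHERE time_stamp = CAST(GETDATE() AS date)) AS today_rows"
df_today = spark.read.jdbc(url=connection_properties["url"], table=sql_query, properties=connection_properties)

# Variant 2: load with the sqlserver connector and rely on predicate pushdown.
from pyspark.sql import functions as F

df_today = (spark.read.format("sqlserver")
    .option("host", "host_name")          # placeholder values, as in the question
    .option("user", "user_name")
    .option("password", "password")
    .option("database", "database_name")
    .option("dbtable", "dbo.table_name")
    .load()
    # current_date() is folded to a literal, so this simple equality filter
    # should be pushed down to SQL Server and only matching rows read.
    .filter(F.col("time_stamp") == F.current_date()))

df_today.show()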
