
Apache Spark is an open source data-processing engine for large data sets, designed to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data, graph data, machine learning, and artificial intelligence (AI) applications. Its in-memory analytics engine can process data 10 to 100 times faster than disk-based alternatives, and it is backed by one of the largest open source communities in big data. In this post we load data from a MySQL database into Spark using sparklyr's spark_read_jdbc() and look at the performance implications of how the data is loaded.

For a fully reproducible example, we will use a local MySQL server instance, which is very accessible thanks to its open-source nature. If you are interested only in the Spark loading part, feel free to skip this paragraph.

We will use the nycflights13::flights dataset and call the newly created table test_table:

test_df <- nycflights13::flights

# Create a connection to database `testdb`
# (the driver and credentials below are illustrative; adjust them to your setup)
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  dbname = "testdb",
  host = "localhost",
  user = "rstudio",
  password = "pass"
)

# Write our `test_df` into a table called `test_table`
DBI::dbWriteTable(con, "test_table", test_df, overwrite = TRUE)

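With the table in place, the data can be loaded into Spark over JDBC using sparklyr's spark_read_jdbc(). A minimal sketch of loading the whole test_table might look as follows; the Spark connection settings, JDBC URL, driver class, and credentials are illustrative assumptions:

library(sparklyr)

# Connect to a local Spark instance
# (assumes the MySQL JDBC driver jar is on the Spark classpath, e.g. added
#  via the `sparklyr.jars.default` configuration option)
sc <- spark_connect(master = "local")

# Connection options shared by the JDBC reads below (values are placeholders)
jdbcConnectionOpts <- list(
  url      = "jdbc:mysql://localhost:3306/testdb",
  driver   = "com.mysql.cj.jdbc.Driver",
  user     = "rstudio",
  password = "pass"
)

# Data options: read the whole `test_table`
jdbcDataOpts <- list(dbtable = "test_table")

test_spk <- spark_read_jdbc(
  sc,
  name    = "test_table_spk",
  options = c(jdbcConnectionOpts, jdbcDataOpts)
)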

We mentioned above that apart from just loading a table, we can also choose to execute a SQL query and use its result as the source for our Spark DataFrame. Note that the only element that changed is the jdbcDataOpts list, which now contains a query element instead of a dbtable element. A bit more on that and on some performance implications below.

The memory argument to spark_read_jdbc() can prove very important when performance is of interest. When using the default memory = TRUE, the table in the Spark SQL context is cached using CACHE TABLE and a SELECT count(*) FROM query is executed on the cached table. This forces Spark to perform the action of loading the entire table into memory. Depending on our use case, it might be much more beneficial to use memory = FALSE and only cache into Spark memory the parts of the table (or processed results) that we need, as the most time-costly operations usually are data transfers over the network.
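As a rough sketch combining both ideas, we can load the result of a query without eagerly caching it; the SQL query, table names, and aggregation below are illustrative assumptions, and sc and jdbcConnectionOpts are reused from the sketch above:

library(sparklyr)
library(dplyr)

# Data options: a `query` element replaces the `dbtable` element
# (the `query` option is supported by the Spark JDBC source in more recent Spark versions)
jdbcDataOpts <- list(
  query = "SELECT origin, dest, distance FROM test_table WHERE distance > 1000"
)

long_flights_spk <- spark_read_jdbc(
  sc,
  name    = "long_flights",
  options = c(jdbcConnectionOpts, jdbcDataOpts),
  memory  = FALSE  # do not eagerly cache the whole result into Spark memory
)

# Cache only the processed results we actually need, e.g. a small aggregate
origin_counts <- sdf_register(count(long_flights_spk, origin), "origin_counts")
tbl_cache(sc, "origin_counts")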
Transferring as little data as possible from the database into Spark memory may bring significant performance benefits. This is a bit difficult to show with our toy example, as everything is physically happening inside the same container (and therefore the same file system), but differences can be observed even with this setup and our small dataset using microbenchmark::microbenchmark().
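A sketch of how such a comparison could be set up with microbenchmark; the helper functions, the aggregation, and the number of repetitions are illustrative assumptions, and sc and jdbcConnectionOpts are again reused from the sketches above:

library(sparklyr)
library(dplyr)

# Eager load: the whole table is cached into Spark memory before we aggregate
eager_load <- function() {
  spark_read_jdbc(
    sc, name = "test_eager", overwrite = TRUE,
    options = c(jdbcConnectionOpts, list(dbtable = "test_table")),
    memory = TRUE
  ) %>%
    count(origin) %>%
    collect()
}

# Lazy load: nothing is cached; only the aggregated result is transferred back
lazy_load <- function() {
  spark_read_jdbc(
    sc, name = "test_lazy", overwrite = TRUE,
    options = c(jdbcConnectionOpts, list(dbtable = "test_table")),
    memory = FALSE
  ) %>%
    count(origin) %>%
    collect()
}

microbenchmark::microbenchmark(
  eager = eager_load(),
  lazy  = lazy_load(),
  times = 10
)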
We see that the “lazy” approach that does not cache the entire table into memory has yielded the result around 41% faster. This is of course by no means a relevant benchmark for real-life data loads, but it can provide some insight into optimizing the loads.