Last week I was presenting at Oracle Code in New York. I’ve presented at a few Oracle Code events over the past 12 months and it is always interesting to meet and talk with developers from around the World.
The title of my presentation this time was ‘SQL: The one language to rule all your data’.
I’ve given this presentation a few times at different events (POUG, OOW, Oracle Code). I take the contents of this presentation for granted and that most people know these things. But the opposite is true. Well a lot of people do know these things, but a magnitude more do not seem to know.
For example, at last weeks Oracle Code event, I had about 100 people in the room. I started out by asking the attendees ‘How many of you write SQL every day?’. About 90% put up their hand. Then a few minutes later after I start talking about various statistical functions in the database, I then ask them to ‘Count how many statistical functions they have used?’ I then asked them to raise their hands if they use over five statistical functions. About eight people put up their hands. Then I asked how many people use over ten functions. To my surprise only one (yes one) person put up their hand.
The first half of the presentation talks about statistical, analytical and machine learning in the database.
The second half covers some (not all) of the various data types and locations of data that can be accessed from the database.
The presentation then concludes with the title of the presentation about SQL being the one language to rule all your data.
Based on last weeks experience, it looks like a lot more people need to hear it !
Hopefully I’ll get the chance to share this presentation with other events and Oracle User Group conferences.
Two of the key take away messages are:
- Google makes us stupid
- We need to RTFM more often
Here is a link to the slides on SlideShare
And I recorded a short video about the presentation with Bob from OTN/ODC.
With each release of the Oracle Database we get new Machine Learning features, under the umbrella term of Oracle Advanced Analytics option (OAA).
With Oracle 18c we get the following new features, that include new machine learning algorithms, improvements to machine learning algorithms, and meta-data improvements for registering new R based algorithms.
These new OAA features include:
New Time-Series function : This new function forecasts target value based solely on a known history of target values and uses the popular auto-regressive modelling method.
New Model Detail Views : Previously you could inspect the details of a model using a function. This is being phased out and replaced by model view, with the format DM$VA
New Neural Networks Algorithm : With the growing interest in deep learning, Oracle have now included a neural network algorithm into the database, thus providing SQL and PL/SQL interfaces to all for easy of use and easy of integration into applications.
New Random Forest Algorithm : Random Forests has been proven over the past few years to be very accurate for certain types of classification problems. This algorithm has now been included in the database, with SQL and PL/SQL interfaces.
Improved Sampling for Association Rules : A new specialised sampling approach is introduced for Association Rules. This is to improve performance, while maintaining accuracy, for large/big data sets.
Algorithm Meta Data Registration : Simplifies the integration of new algorithms in the R extensibility framework. This feature allows a uniform consistent approach of registering new algorithm functions and their settings.
New Exponential Smoothing Algorithm : This allows for users to make predictions from time series data, and includes 14 models, including the popular Holt (trend) and Holt-Winters (trend and seasonality) models, and the ability to handle irregular time series intervals.
New CUR Decomposition-based Algorithm for Attribute and Row Importance : Most algorithms focus on identifying columns or rows that are important within their data sets. This algorithm has the added feature of also identifying important rows.
As you can see there are a lot of machine learning new features in Oracle 18c. Each one of these new features will be explored in more detail in separate blog posts.
If you used other languages, including Oracle PL/SQL, more than likely you will have experienced having to play buffering the number of records that are returned from a cursor. Typically this is needed when you are processing more than a few hundred records. The default buffering size is relatively small and by increasing the size of the number of records to be buffered can dramatically improve the performance of your code.
As with all things in coding and IT, the phrase “It Depends” applies here and changing the buffering size may not be what you need and my not help you to gain optimal performance for your code.
There are lots and lots of examples of how to test this in PL/SQL and other languages, but what I’m going to show you here in this blog post is to change the buffering size when using Python to process data in an Oracle Database using the Oracle Python library cx_Oracle.
Let us begin with taking the defaults and seeing what happens. In this first scenario the default buffering is used. Here we execute a query and the process the records in a FOR loop (yes these is a row-by-row, slow-by-slow approach.
import time i = 0 # define a cursor to use with the connection cur2 = con.cursor() # execute a query returning the results to the cursor print("Starting cursor at", time.ctime()) cur2.execute('select * from sh.customers') print("Finished cursor at", time.ctime()) # for each row returned to the cursor, print the record print("Starting for loop", time.ctime()) t0 = time.time() for row in cur2: i = i+1 if (i%10000) == 0: print(i,"records processed", time.ctime()) t1 = time.time() print("Finished for loop at", time.ctime()) print("Number of records counted = ", i) ttime = t1 - t0 print("in ", ttime, "seconds.")
This gives us the following output.
Starting cursor at 10:11:43 Finished cursor at 10:11:43 Starting for loop 10:11:43 10000 records processed 10:11:49 20000 records processed 10:11:54 30000 records processed 10:11:59 40000 records processed 10:12:05 50000 records processed 10:12:09 Finished for loop at 10:12:11 Number of records counted = 55500 in 28.398550033569336 seconds.
Processing the data this way takes approx. 28 seconds and this corresponds to the buffering of approx 50-75 records at a time. This involves many, many, many round trips to the the database to retrieve this data. This default processing might be fine when our query is only retrieving a small number of records, but as our data set or results set from the query increases so does the time it takes to process the query.
But we have a simple way of reducing the time taken, as the number of records in our results set increases. We can do this by increasing the number of records that are buffered. This can be done by changing the size of the ‘arrysize’ for the cursor definition. This reduces the number of “roundtrips” made to the database, often reducing networks load and reducing the number of context switches on the database server.
The following gives an example of same code with one additional line.
cur2.arraysize = 500
Here is the full code example.
# Test : Change the arraysize and see what impact that has import time i = 0 # define a cursor to use with the connection cur2 = con.cursor() cur2.arraysize = 500 # execute a query returning the results to the cursor print("Starting cursor at", time.ctime()) cur2.execute('select * from sh.customers') print("Finished cursor at", time.ctime()) # for each row returned to the cursor, print the record print("Starting for loop", time.ctime()) t0 = time.time() for row in cur2: i = i+1 if (i%10000) == 0: print(i,"records processed", time.ctime()) t1 = time.time() print("Finished for loop at", time.ctime()) print("Number of records counted = ", i) ttime = t1 - t0 print("in ", ttime, "seconds.")
Now the response time to process all the records is.
Starting cursor at 10:13:02
Finished cursor at 10:13:02
Starting for loop 10:13:02
10000 records processed 10:13:04
20000 records processed 10:13:06
30000 records processed 10:13:08
40000 records processed 10:13:10
50000 records processed 10:13:12
Finished for loop at 10:13:13
Number of records counted = 55500
in 11.780734777450562 seconds.
All done in just under 12 seconds, compared to 28 seconds previously.
Here is another alternative way of processing the data and retrieves the entire results set, using the ‘fetchall’ command, and stores it located in ‘res’.
The 18c Oracle DBaaS is now available. This is the only place that Oracle 18c will be available until later in 2018. So if you want to try it out, then you are going to need to get some Oracle Cloud credits, or you may already have a paying account for Oracle Cloud.
The following outlines the steps you need to go through to gets Oracle 18c setup.
1. Log into your Oracle Cloud
Log into your Oracle Cloud environment. Depending on your access path you will get to your dashboard.
Select Create Instance from the dashboard.
2. Create a new Database
From the list of services to create, select Database.
3. Click ‘Create Instance’
4. Enter the Database Instance details
Enter the details for your new Oracle 18c Database. I’ve called mine ‘db18c’.
Then for the Software Release dropdown list, select ‘Oracle Database 18c’.
Next select the Software Edition from the dropdown list.
5. Fill in the Instance Details
Fill in the details for ‘DB Name’, ‘PDB Name’, ‘Administration Password’, ‘Confirm Password’, setup the SSH Public Key, and then decide if you need the Backup and Recovery option.
6. Create the DBaaS
Double check everything and when ready click on the ‘Create’ button.
7. Wait for Everything to be Create
Now is the time to be patient and wait while your cloud service is created.
I’ve created two different version of the 18c Oracle DBaaS. The Enterprise Edition to 30 minutes to complete and the High Performance service too 47 minutes.
No it’s time to go play.
On Friday afternoon (16th February) we started to see tweets and blog posts from people in Oracle saying that Oracle 18c was now available. But is only available on Oracle Cloud and Engineered Systems.
It looks like we will have to wait until the Autumn before we can install it ourselves on our own servers 😦
Here is the link to the official announcement for Oracle 18c.
Oracle 18c is really Oracle 18.104.22.168. The next full new release of the Oracle database is expected to be Oracle 19.
The new features and incremental enhancements in Oracle 18c are:
- Memory Optimized Fetches
- Exadata RAC Optimizations
- High Availability
- Online Partition Merge
- Improved Machine Learning (OAA)
- Polymorphic Table Functions
- Spatial and Graph
- More JSON improvements
- Private Temporary Tablespaces
- New mode for Connection Manager
And now the all important links to the documentation.
To give Oracle 18c a try you will need to go to cloud.oracle.com and select Database from the drop down list from the Platform menu. Yes you are going to need an Oracle Cloud account and some money or some free credit. Go and get some free cloud credits at the upcoming Oracle Code events.
If you want a ‘free’ way of trying out Oracle 18c, you can use Oracle Live SQL. They have setup some examples of the new features for you to try.
NOTE: Oracle 18c is not Autonomous. Check out Tim Hall’s blog posts about this. The Autonomous Oracle Database is something different, and we will be hearing more about this going forward.
Is Python the new R?
Maybe, maybe not, but that I’m finding in recent months is more companies are asking me to use Python instead of R for some of my work.
In this blog post I will walk through the steps of setting up the Oracle driver for Python, called cx_Oracle. The documentation for this drive is good and detailed with plenty of examples available on GitHub. Hopefully there isn’t anything new in this post, but it is my experiences and what I did.
1. Install Oracle Client
The Python driver requires Oracle Client software to be installed. Go here, download and install. It’s a straightforward install. Make sure the directories are added to the search path.
2. Download and install cx_Oracle
You can use pip3 to do this.
pip3 install cx_Oracle Collecting cx_Oracle Downloading cx_Oracle-6.1.tar.gz (232kB) 100% |████████████████████████████████| 235kB 679kB/s Building wheels for collected packages: cx-Oracle Running setup.py bdist_wheel for cx-Oracle ... done Stored in directory: /Users/brendan.tierney/Library/Caches/pip/wheels/0d/c4/b5/5a4d976432f3b045c3f019cbf6b5ba202b1cc4a36406c6c453 Successfully built cx-Oracle Installing collected packages: cx-Oracle Successfully installed cx-Oracle-6.1
3. Create a connection in Python
Now we can create a connection. When you see some text enclosed in angled brackets <>, you will need to enter your detailed for your schema and database server.
# import the Oracle Python library import cx_Oracle # define the login details p_username = "" p_password = "" p_host = "" p_service = "" p_port = "1521" # create the connection con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port) # an alternative way to create the connection # con = cx_Oracle.connect('/@/:1521') # print some details about the connection and the library print("Database version:", con.version) print("Oracle Python version:", cx_Oracle.version) Database version: 22.214.171.124.0 Oracle Python version: 6.1
4. Query some data and return results to Python
In this example the query returns the list of tables in the schema.
# define a cursor to use with the connection cur = con.cursor() # execute a query returning the results to the cursor cur.execute('select table_name from user_tables') # for each row returned to the cursor, print the record for row in cur: print("Table: ", row) Table: ('DECISION_TREE_MODEL_SETTINGS',) Table: ('INSUR_CUST_LTV_SAMPLE',) Table: ('ODMR_CARS_DATA',)
Now list the Views available in the schema.
# define a second cursor cur2 = con.cursor() # return the list of Views in the schema to the cursor cur2.execute('select view_name from user_views') # display the list of Views for result_name in cur2: print("View: ", result_name) View: ('MINING_DATA_APPLY_V',) View: ('MINING_DATA_BUILD_V',) View: ('MINING_DATA_TEST_V',) View: ('MINING_DATA_TEXT_APPLY_V',) View: ('MINING_DATA_TEXT_BUILD_V',) View: ('MINING_DATA_TEXT_TEST_V',)
5. Query some data and return to a Panda in Python
Pandas are commonly used for storing, structuring and processing data in Python, using a data frame format. The following returns the results from a query and stores the results in a panda.
# in this example the results of a query are loaded into a Panda # load the pandas library import pandas as pd # execute the query and return results into the panda called df df = pd.read_sql_query("SELECT * from INSUR_CUST_LTV_SAMPLE", con) # print the records returned by query and stored in panda print(df.head()) CUSTOMER_ID LAST FIRST STATE REGION SEX PROFESSION \ 0 CU13388 LEIF ARNOLD MI Midwest M PROF-2 1 CU13386 ALVA VERNON OK Midwest M PROF-18 2 CU6607 HECTOR SUMMERS MI Midwest M Veterinarian 3 CU7331 PATRICK GARRETT CA West M PROF-46 4 CU2624 CAITLYN LOVE NY NorthEast F Clerical BUY_INSURANCE AGE HAS_CHILDREN ... MONTHLY_CHECKS_WRITTEN \ 0 No 70 0 ... 0 1 No 24 0 ... 9 2 No 30 1 ... 2 3 No 43 0 ... 4 4 No 27 1 ... 4 MORTGAGE_AMOUNT N_TRANS_ATM N_MORTGAGES N_TRANS_TELLER \ 0 0 3 0 0 1 3000 4 1 1 2 980 4 1 3 3 0 2 0 1 4 5000 4 1 2 CREDIT_CARD_LIMITS N_TRANS_KIOSK N_TRANS_WEB_BANK LTV LTV_BIN 0 2500 1 0 17621.00 MEDIUM 1 2500 1 450 22183.00 HIGH 2 500 1 250 18805.25 MEDIUM 3 800 1 0 22574.75 HIGH 4 3000 2 1500 17217.25 MEDIUM [5 rows x 31 columns]
6. Wrapping it up and closing things
Finally we need to wrap thing up and close our cursors and our connection to the database.
# close the cursors cur2.close() cur.close() # close the connection to the database con.close()
Watch out for more blog posts on using Python with Oracle, Oracle Data Mining and Oracle R Enterprise.
Here is a series of videos by CJ Date on Relational Theory.
An Introduction to set theory (videos on O’Reilly website)
Nullology (videos on O’Reilly website)
Nulls, Three-Valued Logic, and Missing Information (videos on O’Reilly website)
The Closed World Assumption (videos on O’Reilly website)
View Updating (videos on O’Reilly website)