Over the past few days I’ve been doing a bit more playing around with Python, this time creating a word cloud. Yes, there are lots of examples out there that show this, but none of them worked for me. This could be due to those examples using an older version of Python, libraries/packages that no longer exist, and so on. There are lots of possible reasons. So I had to piece it together, and the code given below is what I ended up with. Some steps could probably be skipped, but this is what worked for me.
Step 1 – Read in the data
In my example I wanted to create a word cloud for a website, so I picked my own blog for this exercise/example. The following code is used to read the website (a list of all packages used is given at the end).
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.oralytics.com/"
html = urlopen(url).read()
print(html)
The last line above, print(html), isn’t needed, but I used it to inspect what HTML was read from the webpage.
Step 2 – Extract just the Text from the webpage
The Beautiful Soup library has some useful functions for processing HTML. There are many alternative ways of doing this processing, but this is the approach that I liked.
The first step is to convert the downloaded HTML into BeautifulSoup format. When you view this converted data you will notice how everything is nicely laid out.
The second step is to remove some of the scripts from the code.
soup = BeautifulSoup(html, 'html.parser')   # name the parser explicitly to avoid a warning
print(soup)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()   # rip it out

print(soup)
Step 3 – Extract plain text and remove whitespacing
The first line in the following extracts just the plain text, and the remaining lines remove leading and trailing spaces, compact multi-headlines into a line each and drop blank lines.
text = soup.get_text()
print(text)

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (note the two-space separator)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Step 4 – Remove stop words, tokenise and convert to lower case
As the heading says, this code removes the standard stop words for the English language, removes numbers and punctuation, tokenises the text into individual words, and then converts all words to lower case.
# download and print the stop words for the English language
from nltk.corpus import stopwords
#nltk.download('stopwords')   # uncomment on the first run to download the stop words
stop_words = set(stopwords.words('english'))
print(stop_words)

# tokenise the data set
from nltk.tokenize import sent_tokenize, word_tokenize
#nltk.download('punkt')   # word_tokenize needs the punkt models on the first run
words = word_tokenize(text)
print(words)

# remove punctuation and numbers
wordsFiltered = [word.lower() for word in words if word.isalpha()]
print(wordsFiltered)

# remove stop words from the tokenised data set
# (reusing the stop_words set is faster than calling stopwords.words('english') per word)
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(filtered_words)
Step 5 – Create the Word Cloud
Finally we can create a word cloud based on the finalised data set of tokenised words. Here we use the WordCloud library to create the word cloud and then the matplotlib library to display the image.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=1000, margin=10, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate(' '.join(filtered_words))

plt.figure(figsize=(20, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
#wc.to_file("/wordcloud.png")
We get the following word cloud.
Step 6 – Word Cloud based on frequency counts
Another alternative when using the WordCloud library is to generate the word cloud based on frequency counts. For this you need to build up a table containing two items: the first item is the distinct word/token and the second is the number of times that word/token appears in the text. The following code builds up this frequency count and then generates the word cloud from it.
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1
print(cnt)

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=1000, margin=10, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20, 10))
plt.imshow(wc)
#plt.axis("off")
plt.show()
Now we get the following word cloud.
When you examine these word clouds you can easily guess what the main contents of my blog are: Machine Learning, Oracle SQL and coding.
What Python Packages did I use?
Here is the list of Python libraries that I used in the above code. You can use pip3 to install the non-standard ones into your environment.
- nltk
- urlopen (from urllib.request, part of the Python standard library)
- BeautifulSoup (installed as beautifulsoup4)
- wordcloud
- matplotlib
- Counter (from collections, part of the Python standard library)
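For reference, the non-standard packages can be installed in one go (assuming the usual PyPI package names):

pip3 install nltk beautifulsoup4 wordcloud matplotlib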
I almost forgot, but my 4th book has been published!
It is titled ‘Data Science’ and is published by MIT Press as part of their Essential Knowledge series, and is co-written with John Kelleher.
It is available on Amazon in print, Kindle and Audio formats.
This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenges. The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, and even how much we pay for health insurance.
Go check it out.
One of the new features of the Autonomous Data Warehouse Cloud (ADWC) service is Oracle Machine Learning. This is a Zeppelin based notebook for your machine learning on ADWC. Check out my previous blog post about this.
In order to be able to use this new product and the in-database machine learning in ADWC, you will need your database user to have certain privileges. The first step in this is to create a typical user for accessing the ADWC and grant it the necessary OML privileges.
To do this open the ADWC console and then open the Service Console.
This will then open a new admin page which contains a link for ‘Manage Oracle ML User’. Click on this.
You can then enter the Username, Password and other details for the user, and then click Create.
This will then create a new user that is specific for Oracle Machine Learning. This new user will be granted the DWROLE, which contains the basic schema privileges and the privileges required to run the in-database machine learning algorithms. For those who are familiar with the Oracle Data Mining/Oracle Advanced Analytics option in the Enterprise Edition of the Oracle database, you will see that these privileges are very similar.
You can examine the privileges granted to this DWROLE in the database as an administrator. When you do, you will see the following:
CREATE ANALYTIC VIEW
CREATE ATTRIBUTE DIMENSION
ALTER SESSION
CREATE HIERARCHY
CREATE JOB
CREATE MINING MODEL
CREATE PROCEDURE
CREATE SEQUENCE
CREATE SESSION
CREATE SYNONYM
CREATE TABLE
CREATE TRIGGER
CREATE TYPE
CREATE VIEW
READ, WRITE ON directory DATA_PUMP_DIR
EXECUTE privilege on the PL/SQL package DBMS_CLOUD
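If you want to check this for yourself, queries along the following lines, run as an administrator against the standard data dictionary views, should list what DWROLE has been granted:

-- system privileges granted to DWROLE
SELECT privilege
FROM   dba_sys_privs
WHERE  grantee = 'DWROLE'
ORDER  BY privilege;

-- object privileges, e.g. the DATA_PUMP_DIR and DBMS_CLOUD grants
SELECT privilege, owner, table_name
FROM   dba_tab_privs
WHERE  grantee = 'DWROLE';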
Over the past 6-8 months I’ve been working on a project with the Veterinary School of Medicine, part of University of Dublin. The project was focused on using Machine Learning to find patterns in blood tests and x-rays from dogs suffering with Irritable Bowel Syndrome (IBS).
Over the past five years the Veterinary School of Medicine has built up a very large data set of blood samples, x-rays, family history of the dogs, eating habits of the dogs, etc.
This project has finally been completed, and it is only now that I can share with you some of the results and lessons we learned during the project.
The first part of the project was getting all this historical data loaded into an Oracle 12.2c database. This was a relatively simple task, but it did take a bit of coding to get the data model correct and to perform the necessary data transformations and integration, as illustrated in the following diagram.
Once the data was loaded into the database we could start using the in-database Machine Learning algorithms to find patterns that indicated if a dog was suffering from IBS, and ideally if they were in the early stages of IBS. The treatments for early detection had a higher success rate.
We began using the GUI Oracle Data Miner tool, part of SQL Developer, to build the predictive model workflows.
But the results we were getting were very disappointing. We were getting an average predictive accuracy of 56% across all the models. This is not good, and not much better than flipping a coin.
We were a bit stuck at this point about what to do next, and then Oracle’s Cloud Engagement team heard about our troubles and suggested that we join the Oracle 18c beta. They got us set up and running with a beta Oracle 18c cloud instance in no time, and within a couple of days we were generating new machine learning models.
Oracle 18c has a number of new features that Oracle thought would be useful to us. Firstly, they had a new and improved Neural Networks algorithm that had better accuracy when working with images stored as BLOBs, plus there were a number of new SQL analytic functions. One particular function was TO_DOG_YEAR(). A year in a dog’s life is not 365 days, and the length of a dog year changes with the age and the breed of the dog. Some recent research indicates the geographic location and origin of the dog also play a part.
The syntax of this function is along the following lines (the names of the first two parameters are approximate, based on the description below):

TO_DOG_YEAR( p_dob     DATE,
             p_female  BOOLEAN,
             NLS_BREED STRING )
Oracle has been working with veterinary schools around the world on this problem, hence the introduction of this new SQL analytic function in Oracle 18c. This function accepts the DOB of the dog (DATE), whether the dog is female, and the breed of the dog. It then calculates the appropriate age of the dog, down to the nearest day. We built this into our machine learning workflow, and were very surprised by the outcomes. The predictive accuracy of the models went from 56% to 93%. That is an amazing jump. Perhaps too amazing. But after a few days of extra validation we concluded the difference was down to the new TO_DOG_YEAR() function and its ability to so accurately calculate the age of the dog.
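To give a flavour of how we used it, the transformation step called the function along these lines (the table and column names here are made up for illustration):

SELECT dog_id,
       TO_DOG_YEAR(date_of_birth, female_ind, breed) AS dog_age
FROM   dog_details;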
In the last few weeks we have noticed, after the latest Oracle 18c patch was automatically applied, that this function now has an additional parameter, OWN_SMOKE, which seems to indicate if the dog is owned by someone who smokes. This will indeed affect the age of the animal. We haven’t had a chance to try this new parameter yet, but hope to soon.
The following diagram shows the updated workflow along with the transformation node that uses the TO_DOG_YEAR() function.
If you do a bit of googling you will find lots of research by various veterinary schools around the world, who have spent a lot of time researching the various aspects that feed into this calculation. It was this research, and Oracle’s involvement in previous research, that resulted in the TO_DOG_YEAR() function being included in Oracle 18c.
A more detailed research paper is going to be published in the International Journal of Veterinary Science and Medicine in June 2018 (Volume 6, Issue 1). This paper will explain in more detail the effects that age, breed and sex have on the accuracy of the machine learning models.
We have also been asked to submit our project to Oracle Open World, and it is currently being considered for early selection. This would allow OOW to include this project in their promotional material.
With the recent release of Oracle’s Autonomous Data Warehouse Cloud (ADWC), Oracle has given data scientists a new tool for data discovery and machine learning on the ADWC. Oracle Machine Learning is based on Apache Zeppelin and gives us a new machine learning tool for accessing the in-database machine learning algorithms and in-database statistical functions.
Oracle Machine Learning (OML) SQL notebooks provide easy access to Oracle’s parallelized, scalable in-database implementations of a library of Oracle Advanced Analytics’ machine learning algorithms (classification, regression, anomaly detection, clustering, associations, attribute importance, feature extraction, time series, etc.), SQL, PL/SQL and Oracle’s statistical and analytical SQL functions. Oracle Machine Learning SQL notebooks and Oracle Advanced Analytics’ library of machine learning SQL functions, combined with PL/SQL, allow companies to automate their discovery of new insights, generate predictions and add “AI” to data viz dashboards and enterprise applications.
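To give a feel for the PL/SQL side of this, here is a minimal sketch of building a classification model in-database (it uses the familiar Oracle sample data names, and assumes the settings table already exists):

BEGIN
  -- build a classification model over the sample build view
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DEMO_CLASS_MODEL',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'MINING_DATA_BUILD_V',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'AFFINITY_CARD',
    settings_table_name => 'DEMO_MODEL_SETTINGS');
END;
/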
The key features of Oracle Machine Learning include:
- Collaborative SQL notebook UI for data scientists
- Packaged with Oracle Autonomous Data Warehouse Cloud
- Easy access to shared notebooks, templates, permissions, scheduler, etc.
- Access to 30+ parallel, scalable in-database implementations of machine learning algorithms
- SQL and PL/SQL scripting languages supported
- Enables and Supports Deployments of Enterprise Machine Learning Methodologies in ADWC
Here is a list of key resources for Oracle Machine Learning:
- Oracle Machine Learning Notebooks
- Video overview of Oracle Machine Learning
- Download sample Oracle Machine Learning notebooks
- Quick Start Tutorial for getting started with Oracle Machine Learning
- Documentation: Using Oracle Machine Learning
Is Python the new R?
Maybe, maybe not, but what I’m finding in recent months is that more companies are asking me to use Python instead of R for some of my work.
In this blog post I will walk through the steps of setting up the Oracle driver for Python, called cx_Oracle. The documentation for this driver is good and detailed, with plenty of examples available on GitHub. Hopefully there isn’t anything new in this post; it is just my experiences and what I did.
1. Install Oracle Client
The Python driver requires the Oracle Client software to be installed. Go here to download and install it. It’s a straightforward install. Make sure the directories are added to the search path.
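As a rough sketch, on Linux with the Instant Client unzipped into a directory of your choosing (the path and version below are only examples), that means something like:

# example only - adjust the version and install location to match your download
export PATH=/opt/oracle/instantclient_12_2:$PATH
export LD_LIBRARY_PATH=/opt/oracle/instantclient_12_2:$LD_LIBRARY_PATH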
2. Download and install cx_Oracle
You can use pip3 to do this.
pip3 install cx_Oracle
Collecting cx_Oracle
  Downloading cx_Oracle-6.1.tar.gz (232kB)
    100% |████████████████████████████████| 235kB 679kB/s
Building wheels for collected packages: cx-Oracle
  Running setup.py bdist_wheel for cx-Oracle ... done
  Stored in directory: /Users/brendan.tierney/Library/Caches/pip/wheels/0d/c4/b5/5a4d976432f3b045c3f019cbf6b5ba202b1cc4a36406c6c453
Successfully built cx-Oracle
Installing collected packages: cx-Oracle
Successfully installed cx-Oracle-6.1
3. Create a connection in Python
Now we can create a connection. Where you see some text enclosed in angled brackets <>, you will need to enter the details for your schema and database server.
# import the Oracle Python library
import cx_Oracle

# define the login details
p_username = "<username>"
p_password = "<password>"
p_host = "<host>"
p_service = "<service_name>"
p_port = "1521"

# create the connection (EZConnect format: host:port/service_name)
con = cx_Oracle.connect(user=p_username, password=p_password,
                        dsn=p_host+":"+p_port+"/"+p_service)

# an alternative way to create the connection
# con = cx_Oracle.connect('<username>/<password>@<host>:1521/<service_name>')

# print some details about the connection and the library
print("Database version:", con.version)
print("Oracle Python version:", cx_Oracle.version)

Database version: 12.1.0.2.0
Oracle Python version: 6.1
4. Query some data and return results to Python
In this example the query returns the list of tables in the schema.
# define a cursor to use with the connection
cur = con.cursor()

# execute a query returning the results to the cursor
cur.execute('select table_name from user_tables')

# for each row returned to the cursor, print the record
for row in cur:
    print("Table: ", row)

Table:  ('DECISION_TREE_MODEL_SETTINGS',)
Table:  ('INSUR_CUST_LTV_SAMPLE',)
Table:  ('ODMR_CARS_DATA',)
Now list the Views available in the schema.
# define a second cursor
cur2 = con.cursor()

# return the list of Views in the schema to the cursor
cur2.execute('select view_name from user_views')

# display the list of Views
for result_name in cur2:
    print("View: ", result_name)

View:  ('MINING_DATA_APPLY_V',)
View:  ('MINING_DATA_BUILD_V',)
View:  ('MINING_DATA_TEST_V',)
View:  ('MINING_DATA_TEXT_APPLY_V',)
View:  ('MINING_DATA_TEXT_BUILD_V',)
View:  ('MINING_DATA_TEXT_TEST_V',)
5. Query some data and return to a pandas DataFrame in Python
The pandas library is commonly used for storing, structuring and processing data in Python, using a data frame format. The following runs a query and stores the results in a pandas DataFrame.
# in this example the results of a query are loaded into a pandas DataFrame

# load the pandas library
import pandas as pd

# execute the query and return the results into the DataFrame called df
df = pd.read_sql_query("SELECT * from INSUR_CUST_LTV_SAMPLE", con)

# print the first records returned by the query and stored in the DataFrame
print(df.head())

  CUSTOMER_ID     LAST    FIRST STATE     REGION SEX    PROFESSION  \
0     CU13388     LEIF   ARNOLD    MI    Midwest   M        PROF-2
1     CU13386     ALVA   VERNON    OK    Midwest   M       PROF-18
2      CU6607   HECTOR  SUMMERS    MI    Midwest   M  Veterinarian
3      CU7331  PATRICK  GARRETT    CA       West   M       PROF-46
4      CU2624  CAITLYN     LOVE    NY  NorthEast   F      Clerical

  BUY_INSURANCE  AGE  HAS_CHILDREN ...  MONTHLY_CHECKS_WRITTEN  \
0            No   70             0 ...                       0
1            No   24             0 ...                       9
2            No   30             1 ...                       2
3            No   43             0 ...                       4
4            No   27             1 ...                       4

   MORTGAGE_AMOUNT  N_TRANS_ATM  N_MORTGAGES  N_TRANS_TELLER  \
0                0            3            0               0
1             3000            4            1               1
2              980            4            1               3
3                0            2            0               1
4             5000            4            1               2

   CREDIT_CARD_LIMITS  N_TRANS_KIOSK  N_TRANS_WEB_BANK       LTV LTV_BIN
0                2500              1                 0  17621.00  MEDIUM
1                2500              1               450  22183.00    HIGH
2                 500              1               250  18805.25  MEDIUM
3                 800              1                 0  22574.75    HIGH
4                3000              2              1500  17217.25  MEDIUM

[5 rows x 31 columns]
6. Wrapping it up and closing things
Finally we need to wrap things up and close our cursors and our connection to the database.
# close the cursors
cur2.close()
cur.close()

# close the connection to the database
con.close()
Watch out for more blog posts on using Python with Oracle, Oracle Data Mining and Oracle R Enterprise.