In my previous blog post I introduced the new Explicit Semantic Analysis (ESA) algorithm and gave an example of how you can build an ESA model and use it. Check out this link for that blog post.
In this blog post I will show you how you can manually create an ESA model. The reason that I’m showing you this way is that the workflow (in ODMr and it’s scheduler) may not be for everyone. You may want to automate the creation or recreation of the ESA model from time to time based on certain business requirements.
In my previous blog post I showed how you can setup a training data set. This comes with ODMr 4.2 but you may need to expand this data set or to use an alternative data set that is more in keeping with your domain.
Setup the ODM Settings table
As with all ODM algorithms we need to create a settings table. This settings table allows us to store the various parameters and their values, that will be used by the algorithm.
-- Create the settings table CREATE TABLE ESA_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(30)); -- Populate the settings table -- Specify ESA. By default, Naive Bayes is used for classification. -- Specify ADP. By default, ADP is not used. Need to turn this on. BEGIN INSERT INTO ESA_settings (setting_name, setting_value) VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_explicit_semantic_analys); INSERT INTO ESA_settings (setting_name, setting_value) VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on); INSERT INTO ESA_settings (setting_name, setting_value) VALUES (odms_sampling,odms_sampling_disable); commit; END;
These are the minimum number of parameter setting needed to run the ESA algorithm. The other ESA algorithm setting include:
Setup the Oracle Text Policy
You also need to setup an Oracle Text Policy and a lexer for the Stopwords.
DECLARE v_policy_name varchar2(30); v_lexer_name varchar2(3) BEGIN v_policy_name := 'ESA_TEXT_POLICY'; v_lexer_name := 'ESA_LEXER'; ctx_ddl.create_preference(v_lexer_name, 'BASIC_LEXER'); v_stoplist_name := 'CTXSYS.DEFAULT_STOPLIST'; -- default stop list ctx_ddl.create_policy(policy_name => v_policy_name, lexer => v_lexer_name, stoplist => v_stoplist_name); END;
Create the ESA model
Once we have the settings table created with the parameter values set for the algorithm and the Oracle Text policy created, we can now create the model.
To ensure that the Oracle Text Policy is applied to the text we want to analyse we need to create a transformation list and add the Text Policy to it.
We can then pass the text transformation list as a parameter to the CREATE_MODEL, procedure.
DECLARE v_xlst dbms_data_mining_transform.TRANSFORM_LIST; v_policy_name VARCHAR2(130) := 'ESA_TEXT_POLICY'; v_model_name varchar2(50) := 'ESA_MODEL_DEMO_2'; BEGIN v_xlst := dbms_data_mining_transform.TRANSFORM_LIST(); DBMS_DATA_MINING_TRANSFORM.SET_TRANSFORM(v_xlst, '"TEXT"', NULL, '"TEXT"', '"TEXT"', 'TEXT(POLICY_NAME:'||v_policy_name||')(MAX_FEATURES:3000)(MIN_DOCUMENTS:1)(TOKEN_TYPE:NORMAL)'); DBMS_DATA_MINING.DROP_MODEL(v_model_name, TRUE); DBMS_DATA_MINING.CREATE_MODEL( model_name => v_model_name, mining_function => DBMS_DATA_MINING.FEATURE_EXTRACTION, data_table_name => 'WIKISAMPLE', case_id_column_name => 'TITLE', target_column_name => NULL, settings_table_name => 'ESA_SETTINGS', xform_list => v_xlst); END;
NOTE: Yes we could have merged all of the above code into one PL/SQL block.
Use the ESA model
We can now use the FEATURE_COMPARE function to use the model we just created, just like I did in my previous blog post.
SELECT FEATURE_COMPARE(ESA_MODEL_DEMO_2 USING 'Oracle Database is the best available for managing your data' text AND USING 'The SQL language is the one language that all databases have in common' text) similarity FROM DUAL;
Go give the ESA algorithm a go and see where you could apply it within your applications.
In this 5th blog post in my series on using the capabilities of Oracle Text, Oracle R Enterprise and Oracle Data Mining to process documents and text, I will have a look at some of the machine learning features of Oracle Text.
Oracle Text comes with a number of machine learning algorithms. These can be divided into two types. The first is called ‘Supervised Learning’ where we have two machine learning algorithms for classification type of problem. The second type is called ‘Unsupervised Learning’ where we have the ability to use clustering machine learning algorithms to look for patterns in our text documents and to find similarities between documents based on their contents.
It is this second type of document clustering that I will work through in this blog post.
When using clustering with text documents, the machine learning algorithm will look for patterns that are common between the documents. These patterns will include the words used, the frequency of the words, the position or ordering of these words, the co-occurance of words, etc. Yes this is a large an complex task and that is why we need a machine learning algorithm to help us.
With Oracle Text we only have one clustering machine learning algorithm available to use. When we move onto using the Oracle Advanced Analytics Option (Oracle Data Mining and Oracle R Enterprise) we more algorithms available to us.
With Oracle Text the clustering algorithm is called k-Means. In a way the actual algorithm is unimportant as it is the only one available to us when using Oracle Text. To use this algorithm we have the CTX_CLS.CLUSTERING procedure. This procedure takes the documents we want to compare and will then identify the clusters (using hierarchical clustering) and will then tells us, for each document, what clusters the documents belong to and they probability value. With clustering a document (or a record) can belong to many clusters. Typically in the text books we see clusters that are very distinct and are clearly separated from each other. When you work on real data this is never the case. We will have many over lapping clusters and a data point/record can belong to one or more clusters. This is why we need the probability vale. We can use this to determine what cluster our record belongs to most and what other clusters it is associated with.
Using the example documents that I have been using during this series of blog posts we can use the CTX_CLS.CLUSTERING algorithm to cluster and identify similarities in these documents.
We need to setup the parameters that will be used by the CTX_CLS.CLUSTERING procedure. Tell it to use the k-Means algorithm and then the number of clusters to generate. As with all Oracle Text procedures or algorithms there are a number of settings you can configure or you can just accept the default values.
exec ctx_ddl.drop_preference('Cluster_My_Documents'); exec ctx_ddl.create_preference('Cluster_My_Documents','KMEAN_CLUSTERING'); exec ctx_ddl.set_attribute('Cluster_My_Documents','CLUSTER_NUM','3');
The code above is an example of the basics of what you need to setup for clustering. Other attribute or cluster parameter setting available to you include,
MAX_DOCTERMS, MAX_FEATURES, THEME_ON, TOKEN_ON, STEM_ON, MEMORY_SIZE and SECTION_WEIGHT.
Now we can run the CTX_CLS.CLUSTERING procedure on our documents. This procedure has the following parameters.
– The Oracle Text Index Name
– Document Id Column Name
– Document Assignment (cluster assignment) Table Name. This table will be created if it doesn’t already exist
– Cluster Description Table Name. This table will be created if it doesn’t already exist.
– Name of the Oracle Text Preference (list)
exec ctx_cls.clustering( 'MY_DOCUMENTS_OT_IDX', 'DOC_PK', 'OT_CLUSTER_RESULTS', 'DOC_CLUSTER_DETAILS', 'Cluster_My_Documents');
When the procedure has completed we can now examine the OT_CLUSTER_RESULTS and the DOC_CLUSTER_DETAILS tables. The first of these (OT_CLUSTER_RESULTS) allows us to see what documents have been clustered together. The following is what was produced for my documents.
SELECT d.doc_pk, d.doc_title, r.clusterid, r.score FROM my_documents d, ot_cluster_results r WHERE d.doc_pk = r.docid;
We can see that two of the documents have been grouped into the same cluster (ClusterId=2). If you have a look back at what these documents are about then you can see that yes these are very similar. For the other two documents we can see that they have been clustered into separate clusters (ClusterId=4 & 5). The clustering algorithms have said that they are different types of documents. Again when you examine these documents you will see that they are talking about different topics. So the clustering process worked !
You can also explore the various features of the clusters by looking that he DOC_CLUSTER_DETAILS table. Although the details in this table are not overly useful but it will give you some insight into what clusters the k-Means algorithm has produced.
Hopefully I’ve shown you how easy it is to setup and use the clustering feature of Oracle Text.
WARNING: Before using the Clustering or Classification with Oracle Text, you need to check with your local Oracle Sales representative about if there is licence implication. There seems to be some mentions the the algorithms used are those that come with Oracle Data Mining. Oracle Data Mining is a licence cost option for the database. So make sure you check before you go using these features.
This is the fourth blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Make sure to check out the previous blog posts as each one builds upon each other.
In this blog post, I will have an initial look at how you can use Oracle Text to perform document classification. In my next blog post, in the series, I will look at how you can use Oracle Data Mining with Oracle Text to perform classification.
The area of document classification using Oracle Text is a well trodden field and there are lots and lots of material out there to assist you. This blog post will look at the core steps you need to follow and how Oracle Text can help you with classifying your documents or text objects in a table.
When you use Oracle Text for documentation classification the simplest approach is to use ‘Rule-based Classification’. With this approach you will defined a set of rules, when applied to the document will determine classification that will be assigned to the document.
There is a little bit of setup and configuration needed to make this happen. This includes the following.
- Create a table that will store you document. See my previous blog posts in the series to see an example of one that is used to store the text from webpages.
- Create a rules table. This will contain the classification label and then a set of rules that will be used by Oracle Text to determine that classification to assign to the document. These are in the format similar to what you might see in the WHERE clause of a SELECT statement. You will need follow the rules and syntax of CTXRULES to make sure your rules fire correctly.
- Create a CTXRULE index on the rules table you created in the previous step.
- Create a table that will be a link table between the table that contains your documents and the table that contains your categories.
When you have these steps completed you can now start classifying your documents. The following example illustrates using these steps using the text documents I setup in my previous blog posts.
Here is the structure of my documents table. I had also created an Oracle Text
CTXSYS.CONTEXT index on the DOC_TEXT attribute.
create table MY_DOCUMENTS ( doc_pk NUMBER(10) PRIMARY KEY, doc_title VARCHAR2(100), doc_extracted DATE, data_source VARCHAR2(200), doc_text CLOB );
The next step is to create a table that contains our categories and rules. The structure of this table is very simple, and the following is an example.
create table DOCUMENT_CATEGORIES ( doc_cat_pk NUMBER(10) PRIMARY KEY, doc_category VARCHAR2(40), doc_cat_query VARCHAR2(2000) ); create sequence doc_cat_seq;
Now we can create the table that will store the identified document categories/classifications for each of out documents. This is a link table that contains the primary keys from the MY_DOCUMENTS and the MY_DOCUMENT_CATEGORIES tables.
create table MY_DOC_CAT ( doc_pk NUMBER(10), doc_cat_pk NUMBER(10) );
Queries for CTXRULE are similar to those of CONTAINS queries. Basic phrasing within quotes is supported, as are the following CONTAINS operators: ABOUT, AND, NEAR, NOT, OR, STEM, WITHIN, and THESAURUS. The following statements contain my rules.
insert into document_categories values (doc_cat_seq.nextval, 'OAA','Oracle Advanced Analytics'); insert into document_categories values (doc_cat_seq.nextval, 'Oracle Data Mining','ODM or Oracle Data Mining'); insert into document_categories values (doc_cat_seq.nextval, 'Oracle Data Miner','ODMr or Oracle Data Miner or SQL Developer'); insert into document_categories values (doc_cat_seq.nextval, 'R Technologies','Oracle R Enterprise or ROacle or ORAACH or R');
We are now ready to create the Oracle Text CTXRULE index.
create index doc_cat_idx on document_categories(doc_cat_query) indextype is ctxsys.ctxrule;
Our next step is to apply the rules and to generate the categories/classifications. We have two scenarios to deal with here. The first is how do we do this for our existing records and the second to how can you do this ongoing as new documents get loaded into the MY_DOCUMENTS table.
For the first scenario, where the documents already exist in our table, we can can use a procedure, just like the following.
DECLARE v_document MY_DOCUMENTS.DOC_TEXT%TYPE; v_doc MY_DOCUMENTS.DOC_PK%TYPE; BEGIN for doc in (select doc_pk, doc_text from my_documents) loop v_document := doc.doc_text; v_doc := doc.doc_pk; for c in (select doc_cat_pk from document_categories where matches(doc_cat_query, v_document) > 0 ) loop insert into my_doc_cat values (doc.doc_pk, c.doc_cat_pk); end loop; end loop; END; /
Let us have a look at the categories/classifications that were generated.
select a.doc_title, c.doc_cat_pk, b.doc_category from my_documents a, document_categories b, my_doc_cat c where a.doc_pk = c.doc_pk and c.doc_cat_pk = b.doc_cat_pk order by a.doc_pk, c.doc_cat_pk;
We can see the the categorisation/classification actually gives us the results we would have expected of these documents/web pages.
Now we can look at how to generate these these categories/classifications on an on going basis. For this we will need a database trigger on the MY_DOCUMENTS table. Something like the following should do the trick.
CREATE or REPLACE TRIGGER t_cat_doc before insert on MY_DOCUMENTS for each row BEGIN for c in (select doc_cat_pk from document_categories where matches(doc_cat_query, :new.doc_text)>0) loop insert into my_doc_cat values (:new.doc_pk, c.doc_cat_pk); end loop; END;
At this point we have now worked through how to build and use Oracle Text to perform Rule based document categorisation/classification.
In addition to this type of classification, Oracle Text also has uses some machine learning algorithms to classify documents. These include using Decision Trees, Support Vector Machines and Clustering. It is important to note that these are not the machine learning algorithms that come as part of Oracle Data Mining. Look out of my other blog posts that cover these topics.
A project that I’ve been working on for a while now involves the use of Oracle Text, Oracle R Enterprise and Oracle Data Mining. Oracle Text comes with your Oracle Database licence. Oracle R Enterprise and Oracle Data Mining are part of the Oracle Advanced Analytics (extra cost) option.
What I will be doing over the course of 4 or maybe 5 blog posts is how these products can work together to help you gain a grater insight into your data, and part of your data being large text items like free format text, documents (in various forms e.g. html, xml, pdf, ms word), etc.
Unfortunately I cannot show you examples from the actual project I’ve been working on (and still am, from time to time). But what I can do is to show you how products and components can work together.
In this blog post I will just do some data setup. As with all project scenarios there can be many ways of performing the same tasks. Some might be better than others. But what I will be showing you is for demonstration purposes.
The scenario: The scenario for this blog post is that I want to extract text from some webpages and store them in a table in my schema. I then want to use Oracle Text to search the text from these webpages.
Schema setup: We need to create a table that will store the text from the webpages. We also want to create an Oracle Text index so that this text is searchable.
drop sequence my_doc_seq; create sequence my_doc_seq; drop table my_documents; create table my_documents ( doc_pk number(10) primary key, doc_title varchar2(100), doc_extracted date, data_source varchar2(200), doc_text clob); create index my_documents_ot_idx on my_documents(doc_text) indextype is CTXSYS.CONTEXT;
In the table we have a number of descriptive attributes and then a club for storing the website text. We will only be storing the website text and not the html document (More on that later). In order to make the website text searchable in the DOC_TEXT attribute we need to create an Oracle Text index of type CONTEXT.
There are a few challenges with using this type of index. For example when you insert a new record or update the DOC_TEXT attribute, the new values/text will not be reflected instantly, just like we are use to with traditional indexes. Instead you have to decide when you want to index to be updated. For example, if you would like the index to be updated after each commit then you can create the index using the following.
create index my_documents_ot_idx on my_documents(doc_text) indextype is CTXSYS.CONTEXT parameters ('sync (on commit)');
Depending on the number of documents you have being committed to the DB, this might not be for you. You need to find the balance. Alternatively you could schedule the index to be updated by passing an interval to the ‘sync’ in the above command. Alternatively you might want to use DBMS_JOB to schedule the update.
To manually sync (or via DBMS_JOB) the index, assuming we used the first ‘create index’ statement, we would need to run the following.
This function just adds the new documents to the index. This can, over time, lead to some fragmentation of the index, and will require it to the re-organised on a semi-regular basis. Perhaps you can schedule this to happen every night, or once a week, or whatever makes sense to you.
BEGIN CTX_DDL.OPTIMIZE_INDEX('my_documents_ot_idx','FULL'); END;
(I could talk a lot more about setting up some basics of Oracle Text, the indexes, etc. But I’ll leave that for another day or you can read some of the many blog posts that already exist on the topic.)
Extracting text from a webpage using R: Some time ago I wrote a blog post on using some of the text mining features and packages in R to produce a word cloud based on some of the Oracle Advanced Analytics webpages.
I’m going to use the same webpages and some of the same code/functions/packages here.
The first task you need to do is to get your hands on the ‘htmlToText function. You can download the htmlToText function on github. This function requires the ‘Curl’ and ‘XML’ R packages. So you may need to install these.
I also use the str_replace_all function (“stringer’ R package) to remove some of the html that remains, to remove some special quotes and to replace and occurrences of ‘&’ with ‘and’.
# Load the function and required R packages
data1 <- str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , "") data1 <- str_replace_all(data1, "&", "and") data2 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and") data2 <- str_replace_all(data2, "&", "and") data3 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and") data3 <- str_replace_all(data3, "&", "and") data4 <- str_replace_all(str_replace_all(htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html"), "[\r\n\t\"\'\u201C\u201D]" , ""), "&", "and") data4 <- str_replace_all(data4, "&", "and")
We now have the text extracted and cleaned up.
Create a data frame to contain all our data: Now that we have the text extracted, we can prepare the other data items we need to insert the data into our table (‘my_documents’). The first stept is to construct a data frame to contain all the data.
data_source = c("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html", "http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html", "http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html", "http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html") doc_title = c("OAA_OVERVIEW", "OAA_ODM", "R_TECHNOLOGIES", "OAA_ORE") doc_extracted = Sys.Date() data_text <- c(data1, data2, data3, data4) my_docs <- data.frame(doc_title, doc_extracted, data_source, data_text)
Insert the data into our database table: With the data in our data fram (my_docs) we can now use this data to insert into our database table. There are a number of ways of doing this in R. What I’m going to show you here is how to do it using Oracle R Enterprise (ORE). The thing with ORE is that there is no explicit functionality for inserting and updating records in a database table. What you need to do is to construct, in my case, the insert statement and then use ore.exec to execute this statement in the database.