{"id":158,"date":"2017-12-06T17:24:03","date_gmt":"2017-12-06T17:24:03","guid":{"rendered":"http:\/\/eipsoftware.com\/musings\/?p=158"},"modified":"2018-02-01T19:21:33","modified_gmt":"2018-02-01T19:21:33","slug":"class-for-parsing-ngrams","status":"publish","type":"post","link":"https:\/\/eipsoftware.com\/musings\/class-for-parsing-ngrams\/","title":{"rendered":"Class for Parsing NGrams"},"content":{"rendered":"<h4>Train an NGram Model<\/h4>\n<p>This class takes the ngrams, reads from and writes to the SQLite database, and trains the model using the Katz back-off method to estimate the likelihood of the next word.<\/p>\n<p>See the code below.<\/p>\n<p><!--more--><\/p>\n<pre class=\"lang:r decode:true \">#ngramTrainer Class\r\nlibrary(methods)\r\nlibrary(stringr)\r\nlibrary(dplyr)   # mutate, group_by, select, case_when\r\n# ----------------------------------------------------------\r\n#' class for parsing the ngrams for training\r\n#' @param ngrams character list of ngrams\r\n#' @param ngramsparsed data.frame with ngrams separated by word\r\n#'\r\n\r\nngramTrainer &lt;- setRefClass(\"ngramTrainer\", fields = list(ngrams = \"character\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,ngramsparsed = \"data.frame\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,ngramguesses = \"data.frame\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,ngramscores = \"data.frame\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,ngramrow = \"data.frame\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,ngramsize = \"numeric\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,sqlitedb = \"character\"\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t  ,sqlModel = \"sqlTrainer\"\r\n\t\t\t\t\t\t))\r\n\r\n\r\nngramTrainer$methods(\r\n\t# ----------------------------------------------------------\r\n\t#' initialize method for prepping the ngrams for training\r\n\tinitialize = function(ngramslist = \"list\", ngramdbase = \"character\")\r\n\t{\r\n\t\tngrams &lt;&lt;- unlist(ngramslist)\r\n\t\tsqlitedb &lt;&lt;- ngramdbase\r\n\r\n\t\t# convert into dataframe for processing\r\n\t\ttempmatrix &lt;- matrix(.self$ngrams, ncol = 1 ,nrow = 
length(.self$ngrams))\r\n\t\tngramsparsed &lt;&lt;- data.frame(ngram = tempmatrix, stringsAsFactors = FALSE)\r\n\t\trm(tempmatrix)\r\n\r\n\t\t#split into individual words\r\n\t\tsplit_words &lt;- str_split(.self$ngramsparsed[,1] ,\"_\" ,simplify = TRUE)\r\n\r\n\t\t#edit the dataframe so each word goes into its own column\r\n\t\tngramsparsed &lt;&lt;- ngramsparsed %&gt;%\r\n\t\t\tmutate(word_1 = split_words[,1]\r\n\t\t\t\t   ,word_2 = split_words[,2]\r\n\t\t\t\t   ,word_3 = split_words[,3]\r\n\t\t\t\t   ,word_4 = split_words[,4])\r\n\r\n\t\trm(split_words)\r\n\r\n\t\t#set the database location\r\n\t\tsqlModel$database &lt;&lt;- .self$sqlitedb\r\n\t}\r\n\t,runQueries = function(currentrow = \"data.frame\")\r\n\t{\r\n\t\tngramrow &lt;&lt;- currentrow\r\n\t\t#message(c(\"processing: \",ngramrow$ngram))\r\n\r\n\t\t#run query (unigrams back off to the bigram query)\r\n\t\tsqlModel$query &lt;&lt;- switch(currentrow$V1,\r\n\t\t\t\t\t\t\t\t  \"1\" = sqlModel$queryNgramTwo(.self$ngramrow)\r\n\t\t\t\t\t\t\t\t  ,\"2\" = sqlModel$queryNgramTwo(.self$ngramrow)\r\n\t\t\t\t\t\t\t\t  ,\"3\" = sqlModel$queryNgramThree(.self$ngramrow)\r\n\t\t\t\t\t\t\t\t  ,\"4\" = sqlModel$queryNgramFour(.self$ngramrow)\r\n\t\t\t\t\t\t\t\t  )\r\n\r\n\t\tqueryresults &lt;- sqlModel$selectQuery()\r\n\t\treturn(queryresults)\r\n\t}\r\n\t,processQueryResults = function()\r\n\t{\t#process query\r\n\t\tmessage(\"..... 
processing query results\")\r\n\t\tngramguesses &lt;&lt;- ngramguesses %&gt;% group_by(ngram) %&gt;%\r\n\t\t\t\t\t\tmutate(frequency_relative = frequency \/ sum(frequency)\r\n\t\t\t\t\t\t\t   ,frequency_log = log10(frequency \/ sum(frequency))\r\n\t\t\t\t\t\t\t   ,frequency_log_scores = case_when(V1 == 1 ~ frequency_log * 0.2\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t ,V1 == 2 ~ frequency_log * 0.4\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t ,V1 == 3 ~ frequency_log * 0.6\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t ,V1 == 4 ~ frequency_log * 0.8\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t ,TRUE ~ 0\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t )\r\n\t\t\t\t\t\t\t   ,ranking = rank(-frequency_log_scores, ties.method = \"min\")\r\n\t\t\t\t\t\t\t   )\r\n\t\t# TO-DO multiply by weighting factors\r\n\r\n\t}\r\n\t,scoreQueryResults = function(currentrow = \"data.frame\")\r\n\t{\r\n\t\t#score results\r\n\t\trowid &lt;- match(currentrow$ngram, .self$ngramguesses$dbase_ngram)\r\n\t\t#message(c(\"measureResults: \", currentrow$ngram, \" rank: \", .self$ngramguesses[rowid, 8] ))\r\n\t\treturn(as.data.frame(c(.self$ngramguesses[rowid, \"ranking\"]\r\n\t\t\t   ,.self$ngramguesses[rowid, \"frequency_relative\"]\r\n\t\t\t   ,.self$ngramguesses[rowid, \"frequency_log\"])))\r\n\t}\r\n\t,storeQueryResults = function()\r\n\t{\r\n\t\t#store results\r\n\t\tmessage(\"..... 
storing score results\")\r\n\t\tngramscores &lt;&lt;- ngramguesses %&gt;% select(ngram\r\n\t\t\t\t\t\t\t\t\t\t\t\t, ranking\r\n\t\t\t\t\t\t\t\t\t\t\t\t, frequency_relative\r\n\t\t\t\t\t\t\t\t\t\t\t\t, frequency_log\r\n\t\t\t\t\t\t\t\t\t\t\t\t, ngram_size = V1)\r\n\r\n\t\tqueryresults &lt;- sqlModel$writeTable(output_dataframe = .self$ngramscores\r\n\t\t\t\t\t\t\t\t\t\t\t, remote_table = \"ngrams_scores\")\r\n\t}\r\n\r\n\t,ngramSize = function(ngramrow = \"data.frame\")\r\n\t{\r\n\t\tngramsize &lt;&lt;- sum( nzchar(ngramrow$word_1)\r\n\t\t\t\t\t\t  +nzchar(ngramrow$word_2)\r\n\t\t\t\t\t\t  +nzchar(ngramrow$word_3)\r\n\t\t\t\t\t\t  +nzchar(ngramrow$word_4)\r\n\t\t\t\t\t\t  )\r\n\t\treturn(.self$ngramsize)\r\n\t}\r\n\r\n)<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Train an NGram Model This class takes the ngrams, reads from and writes to the SQLite database, and trains the model using the Katz back-off method to estimate the likelihood of the next word. See the code below.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[4,5,6],"tags":[38,30,35,36],"series":[],"class_list":["post-158","post","type-post","status-publish","format-standard","hentry","category-code","category-r","category-sql","tag-ngram","tag-code","tag-r","tag-class"],"_links":{"self":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/158","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/comments?post=158"}],"version-history":[{"count":2,"href":"https:\/\/eipsoftware.com\/musing
s\/wp-json\/wp\/v2\/posts\/158\/revisions"}],"predecessor-version":[{"id":160,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/158\/revisions\/160"}],"wp:attachment":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/media?parent=158"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/categories?post=158"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/tags?post=158"},{"taxonomy":"series","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/series?post=158"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}