This project is read-only.

SOP based tools/usage in language "n-grams"

May 26, 2015 at 10:34 PM
Below are excerpts of the suggestions from Sanmayce on the "B-Tree... dictionary" article on CodeProject:

Hi Gerardo,
A very thorough approach, though sadly too complicated for my simplistic taste. Could you write a simplified (external-only) version in the form of a useful console tool?
That way everyone would be able not only to test/benchmark your code but to use it in practice, you know.

As for the benchmark you did, it fails to show the power of your code; my suggestion is to use some heavy loads instead.
Since I am interested in putting all n-grams of the English language into an external B-tree, you could consider using some of the following datasets:
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

After downloading and concatenating them:
01/10/2015 12:41 PM 10,624,363,237 Google_Books_corpus_version_20130501_English_All_Nodes.txt
01/10/2015 02:38 PM 1,844,711,941 Google_Books_corpus_version_20130501_English_All_Nodes.txt.graffith

01/13/2015 08:59 AM 179,736,720,202 Google_Books_corpus_version_20130501_English_All_Arcs.txt
01/15/2015 05:19 AM 23,990,734,563 Google_Books_corpus_version_20130501_English_All_Arcs.txt.graffith

01/18/2015 04:46 AM 298,223,429,647 Google_Books_corpus_version_20130501_English_All_BiArcs.txt
01/18/2015 11:07 PM 32,885,642,660 Google_Books_corpus_version_20130501_English_All_BiArcs.txt.graffith

01/22/2015 04:19 PM 302,743,777,792 Google_Books_corpus_version_20130501_English_All_TriArcs.txt
01/23/2015 07:23 AM 28,396,779,848 Google_Books_corpus_version_20130501_English_All_TriArcs.txt.graffith

"Google_Books_corpus_version_20130501_English_All_Nodes.txt":
LineWordreporter: Encountered lines in all files: 46,104,611
LineWordreporter: Encountered words in all files: 178,441,681
LineWordreporter: Longest line: 4,901
LineWordreporter: Longest word: 123

"Google_Books_corpus_version_20130501_English_All_Arcs.txt":
LineWordreporter: Encountered lines in all files: 918,860,187
LineWordreporter: Encountered words in all files: 7,419,031,777
LineWordreporter: Longest line: 4,244
LineWordreporter: Longest word: 217

"Google_Books_corpus_version_20130501_English_All_BiArcs.txt":
LineWordreporter: Encountered lines in all files: 1,783,018,535
LineWordreporter: Encountered words in all files: 20,599,208,820
LineWordreporter: Longest line: 3,722
LineWordreporter: Longest word: 217

"Google_Books_corpus_version_20130501_English_All_TriArcs.txt":
LineWordreporter: Encountered lines in all files: 1,876,974,527
LineWordreporter: Encountered words in all files: 28,304,385,066
LineWordreporter: Longest line: 3,346
LineWordreporter: Longest word: 394
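The LineWordreporter figures above (line count, word count, longest line, longest word) can all be gathered in a single pass over the input. The sketch below is only an illustration of how such a reporter might work, assuming whitespace-separated words; it is not Leprechaun's or LineWordreporter's actual code, and the struct and function names are made up:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Statistics in the spirit of the LineWordreporter dumps above. */
struct line_word_stats {
    unsigned long long lines, words;
    size_t longest_line, longest_word;
};

/* One pass over the stream: count lines and whitespace-separated
 * words, tracking the longest of each along the way. */
static void report_lines_words(FILE *in, struct line_word_stats *s)
{
    size_t line_len = 0, word_len = 0;
    int c;

    memset(s, 0, sizeof *s);
    while ((c = fgetc(in)) != EOF) {
        if (c == '\n') {
            s->lines++;
            if (line_len > s->longest_line) s->longest_line = line_len;
            line_len = 0;
        } else {
            line_len++;
        }
        if (isspace(c)) {
            if (word_len) {
                s->words++;
                if (word_len > s->longest_word) s->longest_word = word_len;
                word_len = 0;
            }
        } else {
            word_len++;
        }
    }
    /* Flush a final word/line not terminated by whitespace. */
    if (word_len) {
        s->words++;
        if (word_len > s->longest_word) s->longest_word = word_len;
    }
    if (line_len) {
        s->lines++;
        if (line_len > s->longest_line) s->longest_line = line_len;
    }
}
```

On corpora of hundreds of gigabytes, the same logic would of course be applied over buffered block reads rather than per-character fgetc, but the bookkeeping is identical.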


I write only in C and have written one superb tool (implementing a B-tree of order 3) called Leprechaun. It suffers from several limitations, but my goal was to write a useful tool working at uncompromising speeds. If you are interested, you can freely use any part of it; I would be glad if you could come up with some new functionality and ideas given your experience. That, and your open attitude, made me write these suggestions.
After several years of wrestling with n-grams I came up with one super-useful (in my view) dump/visualization, called Pagoda, but constructing it takes a lot of time. Do you have any quick idea how your B-Tree Sorted Dictionary could help in dumping, for example, the word 'exascale' as shown in the next screenshot:
http://www.sanmayce.com/GW_r1+++_4-GrammingC_balloon.png
For more info:
http://forum.thefreedictionary.com/postst31183p3_MASAKARI--The-people-s-choice--General-Purpose-Grade--English-wordlist.aspx
The thing that disturbs my peace is the lack of practical tools helping users harness the power of the B-tree. My dream is to have all the useful English phrases (of order 1 to 9, i.e. consisting of one to nine words) and to offer a tool helping people quickly check their phrases against those of a given corpus. That way phrase-checking will be available - far more powerful than "spell-checking", which in reality is just 1-gram (phrase order 1) checking.
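The phrase-checking idea boils down to membership lookup in a sorted collection of known phrases. A minimal sketch, assuming the corpus has already been reduced to a sorted phrase list (the tiny in-memory list here is purely hypothetical; a real checker would query an external B-tree holding billions of n-grams):

```c
#include <string.h>

/* Hypothetical sorted list of known phrases (orders 1..2 here);
 * a real phrase-checker would consult millions of n-grams on disk. */
static const char *corpus[] = {
    "exascale",
    "exascale computing",
    "in practice",
    "phrase checking",
};

/* Binary-search the sorted phrase list: the essence of checking a
 * user's phrase against the phrases of a given corpus. */
static int phrase_known(const char *phrase)
{
    size_t lo = 0, hi = sizeof corpus / sizeof corpus[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(phrase, corpus[mid]);
        if (cmp == 0) return 1;   /* phrase attested in the corpus */
        if (cmp < 0) hi = mid;
        else         lo = mid + 1;
    }
    return 0;                     /* unknown phrase */
}
```

With an order 1-to-9 corpus, a checker would split the user's text into 1- to 9-word windows and run this lookup on each; spell-checking is just the degenerate 1-gram case of the same loop.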

May 27, 2015 at 12:06 AM
I am glad that my call for collaboration didn't go unheard.
To me, the most important thing is not the results one has achieved but the intentions; results are a good thing for sure, but not as much as the desire and hope for something better.
My skills and expertise are fully amateurish, but that is okay as long as the inner fire (after the well-established 'hellfire') is burning.

Reading the https://sop.codeplex.com/ overview made me realize how far behind modern models my poor skillset has fallen.

I am so tightly attached to simple coding that anything outside C frightens me, and on top of that I fear that all my life I will wrestle with simple things, never able to step up and deliver something more substantial. Never mind, just a few thoughts.

My attempt to taste the superspeeds coming out of a B-tree was embodied in the Leprechaun console tool using order 3; I sacrificed the other orders because I wanted hand-optimized code dealing with both scenarios - internal & external RAM. In physical RAM the speeds are very sweet.
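For readers unfamiliar with the term, a B-tree of order 3 means every node holds at most two keys and three children (a 2-3 tree). The sketch below shows only the node shape and a search, under that assumption; the field and function names are illustrative and not taken from Leprechaun's source:

```c
#include <stddef.h>
#include <string.h>

/* An order-3 B-tree node: at most 2 keys and 3 children (a 2-3
 * tree). Leaves simply have all child pointers set to NULL. */
struct btree3_node {
    int nkeys;                    /* 1 or 2 */
    const char *keys[2];
    struct btree3_node *child[3];
};

/* Search: at each node compare against up to two keys; on a miss,
 * descend into the child between the two nearest keys. */
static const char *btree3_search(const struct btree3_node *n,
                                 const char *key)
{
    while (n) {
        int i = 0;
        while (i < n->nkeys) {
            int cmp = strcmp(key, n->keys[i]);
            if (cmp == 0) return n->keys[i];
            if (cmp < 0) break;
            i++;
        }
        n = n->child[i];
    }
    return NULL;                  /* key not in the tree */
}
```

Because each node stays tiny and fixed-size, nodes map cleanly onto both cache lines in RAM and fixed-size records on disk, which is presumably why one hand-optimized code path can serve both the internal and external scenarios.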
For what it is worth, here are some old dumps:
http://www.sanmayce.com/Downloads/index.html#Leprechaun

By the way, the latest Leprechaun C source code is in the L_512passes.zip package inside this archive:
http://www.sanmayce.com/_GW.zip

In short, it's been 13 years since I got the idea to "rip" all the English texts available to me and create a phrase-checker in the spirit of open-source licenselessness, i.e. a tool/package that is 100% FREE. So many years, and I am still at the drawing board, meh.