
"de-duping" lists (finding unique entries)

PostPosted: Fri Jan 12, 2007 12:30 pm
by dpayer
Though the new database plugin may make this obsolete (SQL statements would solve the problem with very little code), this was a very good exercise for me: I had a group of log files I had to extract information from, and I resolved to use NB as my tool to do it. (I could have used Excel and taken a bit less time!)

I had 90 megs of log files from a mail server that had massive duplication of usernames/passwords. I had to find unique entries so I could move all accounts to a new server.

With the following code I was able to read each line of the log files and check whether the portion of the line holding the email address contained a unique value. I did this by creating an array for each letter of the alphabet and for 0-9, and putting any entry from the log not already found in the matching array into that array. That means looping through the array once for every line of the log. This is a slow way to do it, but it works.

I attempted to read an entire file into a variable, but importing 6+ meg files that way overwhelms the application. Also, you can't tell what is going on during the import, since there is no feedback until it is done.
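For readers who want to see the approach end to end, here is the same scheme sketched in Python rather than NB (this is an illustration, not the original code; the file names and the `dedupe` function are made up). It reads line by line instead of loading whole files, takes the email as everything before the first comma, and keeps one per-first-character bucket that is scanned linearly, just as described above:

```python
def dedupe(log_files, out_path):
    """Write each log line whose email is new to out_path; return unique count."""
    buckets = {}  # first character of the email -> list of emails seen so far
    with open(out_path, "w") as out:
        for path in log_files:
            with open(path) as f:
                for line in f:  # line by line: no huge in-memory variable
                    email = line.split(",", 1)[0].strip()
                    if not email:
                        continue
                    bucket = buckets.setdefault(email[0].upper(), [])
                    if email not in bucket:  # linear scan of the small bucket
                        bucket.append(email)
                        out.write(line if line.endswith("\n") else line + "\n")
    return sum(len(b) for b in buckets.values())
```

Splitting the entries across 36 buckets keyed on the first character is what keeps the linear scans tolerable: each new line is only compared against entries that share its first letter.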

Here is the code:

NOTE: I had several files with information, and I accessed them one by one using a loop. The names of the files were held in an array called [data23file]. The file being examined and parsed for unique entries is [data23file[data23loopPOSITION]].

:Preload the arrays so every letter of the alphabet and digit 0-9 has an
:entry in its array: [STARTINGwithA] = "A", [STARTINGwithB] = "B", etc.
SetVar "[AlphaNumericConstants]" "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
Loop "1" "36" "[ANCloopcounter]"
  SubStr "[AlphaNumericConstants]" "[ANCloopcounter]" "1" "[ANCtemporaryVARIABLE]"
  SetVar "[STARTINGwith[ANCtemporaryVARIABLE]]" "[ANCtemporaryVARIABLE]"
EndLoop

:We are looping through the individual data files.
FileLen "[data23file[data23loopPOSITION]]" "[lengthCURRENTdtaFILE]"
Loop "1" "[lengthCURRENTdtaFILE]" "[CURRENTdtaFILEloopPOSITION]"
  :WORKINGlineDATA is the line being analyzed for uniqueness.
  FileRead "[data23file[data23loopPOSITION]]" "[CURRENTdtaFILEloopPOSITION]" "[WORKINGlineDATA]"

  :The email address is everything before the first comma on the line.
  SubStr "[WORKINGlineDATA]" "1" "1" "[WORKINGlineDATAfirstLETTER]"
  SearchStr "," "[WORKINGlineDATA]" "[endPOSITIONofEMAILaddress]"
  Math "[endPOSITIONofEMAILaddress]-1" "0" "[emailLENGTH]"
  SubStr "[WORKINGlineDATA]" "1" "[emailLENGTH]" "[WorkingEmailAddress]"

  :Check the preloaded array for this letter to see how many entries the
  :variable holds - additional entries are added further down in the code.


  :"Letter loop" - step through the [STARTINGwith<letter>] array for this
  :first letter, checking [WorkingEmailAddress] against each stored entry.
  SetVar "[emailmatch]" "NO"
  :(The body of the letter loop was trimmed from this post; inside it, a
  :matching entry runs the next line, then GotoLine breaks out of the loop.)
  SetVar "[emailmatch]" "YES"

  :If the email is not in the array (meaning it is unique), add it to the
  :right array at the right point.

  If "[emailmatch]" "=" "NO"
    :Write the unique entry to the file on disk.
    FileWrite "[pubdir]UNIQUEemailaddresses.txt" "Append" "[WORKINGlineDATA]"
  EndIf

  :The next line is the target of the GotoLine statement that breaks out
  :of the letter loop.

  :Display [NumberUNIQUEaddresses] in the user interface.
  FileLen "[pubdir]UNIQUEemailaddresses.txt" "[NumberUNIQUEaddresses]"
EndLoop
:(The EndLoop for the outer data-file loop described in the NOTE above goes here.)


AlertBox "Finished" "All files finished. See product in uniqueemailaddresses.txt"

As a tool, NB was probably not the fastest one to work on this but it was a way to create clearly understandable rules to follow. For me, it didn't matter how long it took to process, as long as it did the job reliably.
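Most of that processing time comes from the linear scan of a letter bucket for every one of the half-million lines. For anyone tackling the same job where speed does matter, a hash set turns the membership test into an effectively constant-time lookup; a sketch in Python (an illustration, not NB code; `dedupe_fast` is a made-up name):

```python
def dedupe_fast(lines):
    """Return the lines whose email (text before the first comma) is new."""
    seen = set()
    unique = []
    for line in lines:
        email = line.split(",", 1)[0].strip()
        if email and email not in seen:  # O(1) average membership test
            seen.add(email)
            unique.append(line)
    return unique
```

The logic is identical to the letter-bucket version; only the lookup structure changes, which is what collapses hours of scanning into seconds.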

In fact, it took about 16 hours of computing time (!) to do both the initial round of manipulation (not shown here) plus the parsing to make my final list.

90 megs of log files, about half a million lines of logs.

Final result: just under 600 usernames/passwords.

Still, I consider NB a great tool for an entry-level developer like myself. I hope my annotated code will help someone else solve a similar problem.

David P.

PostPosted: Mon Jan 15, 2007 12:01 pm
by TMcD
I really like NB too!

In addition to your post, I'd like to reference a previous post (also in the "suggestions" section.)

(A reference to speed.)