Programming Forums
User Name Password Register
 

RSS Feed
FORUM INDEX | TODAY'S POSTS | UNANSWERED THREADS | ADVANCED SEARCH

Reply
 
Thread Tools Display Modes
Old Jun 20th, 2006, 1:19 PM   #1
Arla
Hobbyist Programmer
 
Arla's Avatar
 
Join Date: Mar 2005
Posts: 227
Rep Power: 4 Arla is on a distinguished road
Question about analyzing 6 million records

So this isn't so much a how do I question, more a how would you type question?

I have approx 6 million customer records, and I need to check for "duplicates" now, this is currently based on Name and address.

I have 10 text files, as each is read in, I check the record I just read (concatonated name and address fields) against everything that's already been read, if it's already there I flag the record as a duplicate, if not I add it to the end of the array.

I'm using the IndexOf function to check for duplicates. This isn't quick... it's not horribly slow, but it's going to take quite a while to processes all 6 million records, anyone have other ideas? Without resorting to a database (which I'd like to do, but would take me quite a bit longer to get the coding done).
Arla is online now   Reply With Quote
Old Jun 20th, 2006, 1:37 PM   #2
Infinite Recursion
Programming Guru
 
Infinite Recursion's Avatar
 
Join Date: Jul 2004
Location: United States
Posts: 3,467
Rep Power: 8 Infinite Recursion is on a distinguished road
Send a message via MSN to Infinite Recursion Send a message via Yahoo to Infinite Recursion
Is this program going to be ran often, or on occasion to synchronize, etc?

I would merge the 10 files together, then use Perl to strip the duplicates. Or Sed as an alternative:

[PHP]
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
[/PHP]
__________________
http://jasonpowers.net

"There are a thousand hacking at the branches of evil to one who is striking at the root."
Infinite Recursion is offline   Reply With Quote
Old Jun 20th, 2006, 1:38 PM   #3
randum77
Programmer
 
randum77's Avatar
 
Join Date: Jun 2006
Location: Fayettehell, NC
Posts: 56
Rep Power: 3 randum77 is on a distinguished road
Well crap, that sounds pretty intense. I will say right off, i am not a programming guru, but will offer a quick thoery. Bassically just build an array that would hold the 2 values you speak of and have each name compared to the array, if it's not found as a duplicate, append it. Like i said, i'm just learning programming, but that's the first thing to come to mind.
__________________
_Marshall_

"America has bred a society that is innocent and incapable of accepting responsibility, but yet, is able to place blame on others without guilt."
randum77 is offline   Reply With Quote
Old Jun 20th, 2006, 2:15 PM   #4
Arla
Hobbyist Programmer
 
Arla's Avatar
 
Join Date: Mar 2005
Posts: 227
Rep Power: 4 Arla is on a distinguished road
Infinite, thanks, not sure I can do that though, because while I want to strip out the duplicates, I also need to flag them as such (so I know that a duplicate was found).

Currently Randum77 I have basically what you suggest, I have a massive array, and I read a line from the file, compare it to everything in the array (actually just using IndexOf function because that's rather more efficient than using a for loop) and then if it was found, just flag the found record as a duplicate (i.e. that customer has another record) if not found, then I add the new record into the array and move on.

I'll have to do some reading up on Sed I don't know much about it so am unclear on what I could do with it.
Arla is online now   Reply With Quote
Old Jun 20th, 2006, 2:20 PM   #5
Dameon
Troll
 
Dameon's Avatar
 
Join Date: Apr 2005
Location: Texas
Posts: 732
Rep Power: 4 Dameon is on a distinguished road
Hmmm...I'd suggest the Hashtable class. Your mileage may vary, however.

For a longer term solution, though, I'd move to a relational database. I'd trust the performance of MySQL with 6 million rows a lot more than C#.
__________________
MD5(sig) = bcef75433db02e9ad9bf81d6f7c5c270
Dameon is offline   Reply With Quote
Old Jun 20th, 2006, 2:27 PM   #6
andro
Professional Programmer
 
Join Date: Oct 2005
Location: California
Posts: 312
Rep Power: 3 andro is on a distinguished road
Send a message via AIM to andro
Quote:
Originally Posted by Infinite Recursion
Is this program going to be ran often, or on occasion to synchronize, etc?

I would merge the 10 files together, then use Perl to strip the duplicates. Or Sed as an alternative:

[PHP]
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
[/PHP]

cat * | sort | uniq

?
andro is offline   Reply With Quote
Old Jun 20th, 2006, 2:33 PM   #7
Infinite Recursion
Programming Guru
 
Infinite Recursion's Avatar
 
Join Date: Jul 2004
Location: United States
Posts: 3,467
Rep Power: 8 Infinite Recursion is on a distinguished road
Send a message via MSN to Infinite Recursion Send a message via Yahoo to Infinite Recursion
Andro, that is much easier... I like to do things the hard way I guesss...

It boils down to if the OP needs the information to stay in place or not... sorting would rearrange data. Also, if he/she is even using Linux vs Windows.

Then the new found issue of needing to know if duplicates existed prior to their removal... to do this with Andro's suggestion, you could use uniq's -d -c -u flags... see here: http://www-128.ibm.com/developerwork...l-tiptex6.html


I saw the code below on http://www.perl.com/doc/FMTEYEWTK/regexps.html
It may help, if you are entertaining the idea of using Perl.

#!/usr/bin/perl -00 -n
while ( /\b(\w+)(\s+\1)+\b/gi ) { 
print "dup $1 at paragraph $.\n";
}
This now yields: 
dup at paragraph 10 
dup at paragraph 33
__________________
http://jasonpowers.net

"There are a thousand hacking at the branches of evil to one who is striking at the root."

Last edited by Infinite Recursion; Jun 20th, 2006 at 2:48 PM.
Infinite Recursion is offline   Reply With Quote
Old Jun 20th, 2006, 2:43 PM   #8
Arla
Hobbyist Programmer
 
Arla's Avatar
 
Join Date: Mar 2005
Posts: 227
Rep Power: 4 Arla is on a distinguished road
Dameon,

Thanks a bunch, Hashtable is much better than my less elegant Array solution, finished my "test" run of 500k in a couple of minutes instead of running for more than an hour (as it had been doing).
Arla is online now   Reply With Quote
Old Jun 20th, 2006, 4:07 PM   #9
Dameon
Troll
 
Dameon's Avatar
 
Join Date: Apr 2005
Location: Texas
Posts: 732
Rep Power: 4 Dameon is on a distinguished road
Yes, and it should scale better than ArrayList as well. I'm curious to hear what kind of speeds you get.

Hashtable lookups, in an ideal case, should be fairly close to constant time. However, to prevent the hashtable having to expand partway through (which entails rehashing every single item within it), be sure to put a fairly liberal estimate of how many slots you will need in the constructor. Keep in mind that the default load factor in C# 2.0 is 0.72 according to this, so it expands when it gets 72 percent full.
__________________
MD5(sig) = bcef75433db02e9ad9bf81d6f7c5c270
Dameon is offline   Reply With Quote
Old Jun 21st, 2006, 5:05 PM   #10
Arla
Hobbyist Programmer
 
Arla's Avatar
 
Join Date: Mar 2005
Posts: 227
Rep Power: 4 Arla is on a distinguished road
Okay, so some information on speeds

Running as default just using "new hashtable()" with no size (couldn't find anything that would actually tell me what the default initial capacity setting is) gives me the following

Each text file I load is approximately 510k records (including any duplicates),

For one run
File 0 took 0:24s to load and left the hashtable with 505734 records
File 1 took 0:45s to load and left the hashtable with 1007982 records
File 2 took 1:12s to load and left the hashtable with 1506624 records
File 3 took 1:27s to load and left the hashtable with 2001634 records
File 4 took 1:28s to load and left the hashtable with 2492918 records
File 5 took 2:43s to load and left the hashtable with 2981337 records
File 6 took 2:49s to load and left the hashtable with 3466315 records
File 7 took 2:48s to load and left the hashtable with 3947574 records
File 8 took 2:49s to load and left the hashtable with 4425733 records
File 9 took 2:54s to load and left the hashtable with 4900196 records

Total time is 19:19s

Changing the hashtable to have a "size" of 4900196 (not sure if Microsoft was intelligent enough to apply the inverse load size to say if I expect 4.9m rows, then I need more rows for the required free space) I get

File 0 took 2:39s to load and left the hashtable with 505734 records
File 1 took 2:40s to load and left the hashtable with 1007982 records
File 2 took 2:40s to load and left the hashtable with 1506624 records
File 3 took 2:42s to load and left the hashtable with 2001634 records
File 4 took 2:44s to load and left the hashtable with 2492918 records
File 5 took 2:43s to load and left the hashtable with 2981337 records
File 6 took 2:49s to load and left the hashtable with 3466315 records
File 7 took 2:48s to load and left the hashtable with 3947574 records
File 8 took 2:48s to load and left the hashtable with 4425733 records
File 9 took 2:54s to load and left the hashtable with 4900196 records

So as can be seen setting the default size actually makes the program perform much worse, presumably because I'm not doing a straight "load" and then search, I'm searching the table while loading so while perhaps less efficient at the end of the day, it's far more efficient at the start (because I'm only searching a few records).
Arla is online now   Reply With Quote
Reply

Bookmarks

« Previous Thread in Forum | Next Thread in Forum »

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump




DaniWeb IT Discussion Community
All times are GMT -5. The time now is 6:53 PM.

Powered by vBulletin® Version 3.7.0, Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Copyright ©2007 DaniWeb® LLC