![]() |
Question about analyzing 6 million records
So this isn't so much a how do I question, more a how would you type question?
I have approx 6 million customer records, and I need to check for "duplicates" now, this is currently based on Name and address. I have 10 text files, as each is read in, I check the record I just read (concatonated name and address fields) against everything that's already been read, if it's already there I flag the record as a duplicate, if not I add it to the end of the array. I'm using the IndexOf function to check for duplicates. This isn't quick... it's not horribly slow, but it's going to take quite a while to processes all 6 million records, anyone have other ideas? Without resorting to a database (which I'd like to do, but would take me quite a bit longer to get the coding done). |
Is this program going to be ran often, or on occasion to synchronize, etc?
I would merge the 10 files together, then use Perl to strip the duplicates. Or Sed as an alternative: [PHP] # delete duplicate, nonconsecutive lines from a file. Beware not to # overflow the buffer size of the hold space, or else use GNU sed. sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' [/PHP] |
Well crap, that sounds pretty intense. I will say right off, i am not a programming guru, but will offer a quick thoery. Bassically just build an array that would hold the 2 values you speak of and have each name compared to the array, if it's not found as a duplicate, append it. Like i said, i'm just learning programming, but that's the first thing to come to mind.
|
Infinite, thanks, not sure I can do that though, because while I want to strip out the duplicates, I also need to flag them as such (so I know that a duplicate was found).
Currently Randum77 I have basically what you suggest, I have a massive array, and I read a line from the file, compare it to everything in the array (actually just using IndexOf function because that's rather more efficient than using a for loop) and then if it was found, just flag the found record as a duplicate (i.e. that customer has another record) if not found, then I add the new record into the array and move on. I'll have to do some reading up on Sed I don't know much about it so am unclear on what I could do with it. |
Hmmm...I'd suggest the Hashtable class. Your mileage may vary, however.
For a longer term solution, though, I'd move to a relational database. I'd trust the performance of MySQL with 6 million rows a lot more than C#. |
Quote:
:
cat * | sort | uniq? |
Andro, that is much easier... I like to do things the hard way I guesss... :)
It boils down to if the OP needs the information to stay in place or not... sorting would rearrange data. Also, if he/she is even using Linux vs Windows. Then the new found issue of needing to know if duplicates existed prior to their removal... to do this with Andro's suggestion, you could use uniq's -d -c -u flags... see here: http://www-128.ibm.com/developerwork...l-tiptex6.html I saw the code below on http://www.perl.com/doc/FMTEYEWTK/regexps.html It may help, if you are entertaining the idea of using Perl. :
#!/usr/bin/perl -00 -n |
Dameon,
Thanks a bunch, Hashtable is much better than my less elegant Array solution, finished my "test" run of 500k in a couple of minutes instead of running for more than an hour (as it had been doing). |
Yes, and it should scale better than ArrayList as well. I'm curious to hear what kind of speeds you get.
Hashtable lookups, in an ideal case, should be fairly close to constant time. However, to prevent the hashtable having to expand partway through (which entails rehashing every single item within it), be sure to put a fairly liberal estimate of how many slots you will need in the constructor. Keep in mind that the default load factor in C# 2.0 is 0.72 according to this, so it expands when it gets 72 percent full. |
Okay, so some information on speeds
Running as default just using "new hashtable()" with no size (couldn't find anything that would actually tell me what the default initial capacity setting is) gives me the following Each text file I load is approximately 510k records (including any duplicates), For one run File 0 took 0:24s to load and left the hashtable with 505734 records File 1 took 0:45s to load and left the hashtable with 1007982 records File 2 took 1:12s to load and left the hashtable with 1506624 records File 3 took 1:27s to load and left the hashtable with 2001634 records File 4 took 1:28s to load and left the hashtable with 2492918 records File 5 took 2:43s to load and left the hashtable with 2981337 records File 6 took 2:49s to load and left the hashtable with 3466315 records File 7 took 2:48s to load and left the hashtable with 3947574 records File 8 took 2:49s to load and left the hashtable with 4425733 records File 9 took 2:54s to load and left the hashtable with 4900196 records Total time is 19:19s Changing the hashtable to have a "size" of 4900196 (not sure if Microsoft was intelligent enough to apply the inverse load size to say if I expect 4.9m rows, then I need more rows for the required free space) I get File 0 took 2:39s to load and left the hashtable with 505734 records File 1 took 2:40s to load and left the hashtable with 1007982 records File 2 took 2:40s to load and left the hashtable with 1506624 records File 3 took 2:42s to load and left the hashtable with 2001634 records File 4 took 2:44s to load and left the hashtable with 2492918 records File 5 took 2:43s to load and left the hashtable with 2981337 records File 6 took 2:49s to load and left the hashtable with 3466315 records File 7 took 2:48s to load and left the hashtable with 3947574 records File 8 took 2:48s to load and left the hashtable with 4425733 records File 9 took 2:54s to load and left the hashtable with 4900196 records So as can be seen setting the default size actually makes the program perform much worse, presumably because I'm not doing a straight "load" and then search, I'm searching the table while loading so while perhaps less efficient at the end of the day, it's far more efficient at the start (because I'm only searching a few records). |
| All times are GMT -5. The time now is 8:00 AM. |
Powered by vBulletin® Version 3.7.0, Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Copyright ©2007 DaniWeb® LLC