Programming Forums


jim mcnamara Aug 23rd, 2006 5:43 PM

Let's try efficiency - a coding challenge
 
Suppose that we have a really big file of numbers, say 20 MB, with the
numbers in a single column. It has duplicates. Let's call this file bigfile.

You are given another file, which is much smaller, say 20000 lines. We'll
call it smallfile. It is also a list of numbers, with no duplicates.

Requirements Statement:
Create a new file based on the data in bigfile.
1. the new file will contain no lines found in smallfile
2. the new file will have no duplicates.

This code snippet, while it works, takes a large number of operations (file I/Os and searches), because it reads and rewrites bigfile once for every line in smallfile:

Code:

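# One full pass over bigfile for every line of smallfile.
# Note: this overwrites bigfile in place.
while read -r number
do
        # -x matches whole lines and -F treats the number as a literal string,
        # so removing 12 does not also remove 123
        grep -vxF -- "$number" bigfile > newfile
        mv newfile bigfile
done < smallfile
sort -u bigfile > newfile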


Code:

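grep -vxF -f smallfile bigfile | sort -u > newfile
is a possibility, if your grep doesn't barf on more than 2048000 bytes in the -f file (XOPEN limit). Note that -f has to come right before smallfile, and -x with -F makes grep compare whole lines as literal strings.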

But we're in another forum... and we're trying something.

So... based on the name of the forum (sed & awk, in case you forgot), how
would you create a nice efficient chunk of code that meets the Requirements
Statement above? Efficient means the fewest passes through bigfile.
Say 2-3 maybe. Forget grep & sort.

In other words, you have to create a result set from bigfile "minus"
smallfile that is a unique list.


Go for it. And think associative arrays.
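
To make the associative array hint concrete, here is one possible sketch, certainly not the only answer. The function name minus_unique is made up for illustration. It assumes one number per line with no stray whitespace, a non-empty smallfile, and enough memory to hold the distinct values. It reads smallfile once and bigfile once, and it prints each surviving line the first time it shows up, so the output keeps bigfile's order rather than being sorted.

```shell
# Hypothetical helper: print the lines of the second file that do not appear
# in the first file, each one only once.
#   $1 = exclusion list (smallfile), $2 = data file (bigfile)
minus_unique() {
    awk '
        # First file only (NR equals FNR there): store each line as a key.
        # Assumes the first file is not empty.
        NR == FNR { skip[$0]; next }

        # Second file: print a line if it is not an excluded key and has
        # not been printed before (seen counts occurrences).
        !($0 in skip) && !seen[$0]++
    ' "$1" "$2"
}
```

It would be called as: minus_unique smallfile bigfile > newfile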

