Suppose that we have a really big file of numbers, say 20 MB of numbers in a
single column. It has duplicates. Let's call this file bigfile.
You are given another file, which is much smaller, say 20000 lines. We'll
call it smallfile. It is also a list of numbers, with no duplicates.
Requirements Statement:
Create a new file based on the data in bigfile.
1. the new file will contain no lines found in smallfile
2. the new file will have no duplicates.
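This code snippet works, but it takes a large number of operations (file
I/Os and searches), because it rewrites all of bigfile once for every line
in smallfile:
while read -r number
do
    # -x matches whole lines only (so 12 does not also remove 123);
    # -F treats the number as a fixed string, not a regex
    grep -vxF "$number" bigfile > newfile
    mv newfile bigfile
done < smallfile
sort -u bigfile > newfile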
grep -v -x -F -f smallfile bigfile | sort -u > newfile
is also a possibility, if your grep doesn't barf on more than 2048000 bytes in the -f file (XOPEN limit).
But we're in another forum... and we're trying something.
So... based on the name of the forum (sed & awk, in case you forgot), how
would you create a nice efficient chunk of code that meets the Requirements
Statement above? Efficient means the fewest passes through bigfile.
Say 2-3 maybe. Forget grep & sort.
In other words, you have to create a resultset from bigfile "minus"
smallfile that is a unique list.
Go for it. And think associative arrays.
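
For what it's worth, here is one way the associative-array hint might play out. Treat it as a rough sketch, not the definitive answer: the array names skip and seen and the sample numbers are made up for illustration, and it works in a scratch directory (via mktemp -d, assumed to be available) so it can't clobber a real bigfile. It reads smallfile once and bigfile once.

```shell
# Work in a throwaway directory so real files are never touched.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data, just to make the sketch runnable.
printf '%s\n' 5 3 5 7 3 9 7 1 > bigfile
printf '%s\n' 3 9 > smallfile

# NR == FNR is true only while awk is reading the first file on the
# command line, which here is smallfile.
awk '
    NR == FNR { skip[$0]; next }     # smallfile: store each line as an array key
    !($0 in skip) && !seen[$0]++     # bigfile: print if not in skip and not yet seen
' smallfile bigfile > newfile

cat newfile
```

Lines come out in the order they first appear in bigfile, so no sort is needed. Memory use grows with the number of distinct values rather than with the size of bigfile. One caveat: if smallfile is empty, NR == FNR stays true while bigfile is read, so that case would need separate handling.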