![]() |
|
![]() |
|
|
Thread Tools | Display Modes |
|
|
#1 |
|
Newbie
Join Date: Jul 2005
Posts: 4
Rep Power: 0
![]() |
SCRIPT TO TRACE 100MB FILE(One million records)
Hi,
I have to sort & search for a particular pattern in an 100MB file. ( 1 million records). Can you please provide an efficient solution to this? Thanks in advance. Regards, Pavan |
|
|
|
|
|
#2 | |
|
Professional Programmer
Join Date: May 2005
Location: Bad Nauheim, Germany
Posts: 436
Rep Power: 4
![]() |
Quote:
__________________
-Steven "Is this a piece of your brain?" - Basil Fawlty |
|
|
|
|
|
|
#3 |
|
Newbie
Join Date: Jul 2005
Posts: 4
Rep Power: 0
![]() |
The file contains only Mobile numbers. It requires search the file and find-out the particular number existing or not. The script has already written. The problem i am facing.,awk script shooting CPU usage. Here OS is Linux.
|
|
|
|
|
|
#4 |
|
Resident Grouch
![]() ![]() ![]() ![]() ![]() ![]() Join Date: Jun 2005
Posts: 6,453
Rep Power: 10
![]() |
If the file is sorted, the search should be relatively efficient. I suppose a germane question is: how often is the file modified and how extensive are the modifications? Problems such as this need to be addressed in terms of their specifics whereas general solutions can be applied to less recalcitrant things.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code. Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers |
|
|
|
|
|
#5 |
|
Newbie
Join Date: Jul 2005
Posts: 4
Rep Power: 0
![]() |
Yes. first i am doing sorting and then applying binary search logic. File content is fixed. It's exactly 100MB size and contains all mobile numbers.,each line one mobile number.
The problem is once if i run the script.,it's shooting CPU usage., but it works fine. |
|
|
|
|
|
#6 |
|
Resident Grouch
![]() ![]() ![]() ![]() ![]() ![]() Join Date: Jun 2005
Posts: 6,453
Rep Power: 10
![]() |
If the file is modified rarely or in a limited manner, I would think an insertion procedure could maintain the sort, as one would do with a linked list. If it's modified grossly, then a heap sort might be in order. There are other methods that require more storage and an initial organization, but are amenable to more effective operations subsequently. A database is a good example of a sophisticated and complex process that yields ultimate gains. My initial impression is that you should deep-six the general-purpose utility approach and go for an application dedicated to the problem. One hundred megs is not a breathtaking amount these days, but one needs to realize the problems that can arise if improper amounts of cache can lead to thrashing (page swapping unduly) and other undesirable thangys. The fact that each item is on one line is immaterial. A "line" is the result of just another delimiter. It's possible that a mere modification of your approach is in order. Consider a silly extreme: you perform a bubble sort on 10 million items when the data has been modified only slightly or not at all, then you perform a binary search. If something like this is an invariant. then there is much room for improvement. Again, specific suggestions can't be made in the absence of specific information regarding the process as well as the mere organization of the data.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code. Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers |
|
|
|
|
|
#7 |
|
Hobbyist Programmer
Join Date: Jun 2005
Location: New Mexico
Posts: 228
Rep Power: 4
![]() |
You can use grep "123-333-1234" <filename>, it is slow.
We have a C routine that can do a find in about 20 I/O operations or less searching a fixed record length file - kind of a binary search on a file - This is a simplified version - i tested it on a file with 8 million records. test results: kcsdev:/home/jmcnama> time ffind 000000000 8000000 /tmp/test real 0m0.21s user 0m0.00s sys 0m0.01s kcsdev:/home/jmcnama> echo $? 1 kcsdev:/home/jmcnama> time ffind 000050000 8000000 /tmp/test real 0m0.08s user 0m0.01s sys 0m0.02s kcsdev:/home/jmcnama> echo $? 0 kcsdev:/home/jmcnama> time ffind 000000001 8000000 /tmp/test real 0m0.07s user 0m0.00s sys 0m0.02s code: /*****************************************
* ffind.c --
* find a value in a file using random access
* usage ffind <value> <nrec> <filename>
* requires a fixed length record file
* possibly with a newline for each record,
* reclen= data + newline
* returns:
* 0 if the value is found,
* 1 if not found
**********************************/
#define RECL 10
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
FILE *in=NULL;
char _Wvalue[RECL*2]={0x0};
void usage(void) /* help */
{
fprintf(stderr,"%s\n",
"usage: ffind <value to find> <number records> <filename>");
exit(EXIT_FAILURE);
}
char *setfilepos(long pos) /* read from in the file*/
{
if(fseek(in,pos*RECL,SEEK_SET)!=0)
{ /* note we do not check for EINTR which is possible */
perror("Parameter error - incorrect number of records");
exit(EXIT_FAILURE);
}
memset(_Wvalue,0x0,sizeof(_Wvalue));
if(fgets(_Wvalue,sizeof(_Wvalue),in)==NULL)
{
if(!feof(in))
{
perror("File read error");
exit(EXIT_FAILURE);
}
}
return _Wvalue;
}
int compar(const char *key,long rec) /* compare key & test value */
{
char value[RECL+1]={0x0};
char *p=NULL;
strcpy(value,setfilepos(rec));
p=strchr(value,'\n');
if(p!=NULL) *p=0x0;
return strcmp(key,value);
}
/* binary search against a fixed Rec Len file */
int fbsearch(const char *key, long nrecs)
{
int retval=0;
long offset=0;
long guess=0;
nrecs--;
guess=(nrecs - offset)/2;
for(;;)
{
retval=compar(key,guess);
if(retval)
{
if(retval>0)
{
offset=guess+1;
}
else
{
nrecs=guess-1;
}
if(offset > nrecs) break;
guess=offset + (nrecs-offset)/2;
continue;
}
break;
}
return retval;
}
int main(int argc, char *argv[])
{
int result=0;
if(argc<4)
{
usage();
}
in=fopen(argv[3],"r");
if(in==NULL)
{
perror("Error opening input file");
exit(EXIT_FAILURE);
}
result=fbsearch(argv[1],atol(argv[2]));
return (result!=0);
} |
|
|
|
|
|
#8 |
|
Newbie
Join Date: Jul 2005
Posts: 4
Rep Power: 0
![]() |
yes JIM. Thanks for the same.
|
|
|
|
![]() |
| Bookmarks |
| Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
| Thread Tools | |
| Display Modes | |
|
|