Programming Forums

Programming Forums (http://www.programmingforums.org/forumindex.php)
-   Sed and Awk (http://www.programmingforums.org/forum22.html)
-   -   SCRIPT TO TRACE 100MB FILE(One million records) (http://www.programmingforums.org/showthread.php?t=5149)

pavant Jul 28th, 2005 4:54 PM

SCRIPT TO TRACE 100MB FILE(One million records)
 
Hi,

I have to sort & search for a particular pattern in an 100MB file. ( 1 million records).
Can you please provide an efficient solution to this?

Thanks in advance.

Regards,
Pavan

stevengs Jul 28th, 2005 5:00 PM

Quote:

Originally Posted by pavant
...provide an efficient solution to this...

provide you with a solution? What are your ideas so far (except for the one about someone else providing you with a solution)? Will this be a recurring problem, or is it a one time thing? How is the file organized? What is the operating system? Which tools do you intend to use?......

pavant Jul 28th, 2005 5:09 PM

The file contains only Mobile numbers. It requires search the file and find-out the particular number existing or not. The script has already written. The problem i am facing.,awk script shooting CPU usage. Here OS is Linux.

DaWei Jul 28th, 2005 5:59 PM

If the file is sorted, the search should be relatively efficient. I suppose a germane question is: how often is the file modified and how extensive are the modifications? Problems such as this need to be addressed in terms of their specifics whereas general solutions can be applied to less recalcitrant things.

pavant Jul 28th, 2005 6:21 PM

Yes. first i am doing sorting and then applying binary search logic. File content is fixed. It's exactly 100MB size and contains all mobile numbers.,each line one mobile number.
The problem is once if i run the script.,it's shooting CPU usage., but it works fine.

DaWei Jul 28th, 2005 7:09 PM

If the file is modified rarely or in a limited manner, I would think an insertion procedure could maintain the sort, as one would do with a linked list. If it's modified grossly, then a heap sort might be in order. There are other methods that require more storage and an initial organization, but are amenable to more effective operations subsequently. A database is a good example of a sophisticated and complex process that yields ultimate gains. My initial impression is that you should deep-six the general-purpose utility approach and go for an application dedicated to the problem. One hundred megs is not a breathtaking amount these days, but one needs to realize the problems that can arise if improper amounts of cache can lead to thrashing (page swapping unduly) and other undesirable thangys. The fact that each item is on one line is immaterial. A "line" is the result of just another delimiter. It's possible that a mere modification of your approach is in order. Consider a silly extreme: you perform a bubble sort on 10 million items when the data has been modified only slightly or not at all, then you perform a binary search. If something like this is an invariant. then there is much room for improvement. Again, specific suggestions can't be made in the absence of specific information regarding the process as well as the mere organization of the data.

jim mcnamara Jul 29th, 2005 11:32 AM

You can use grep "123-333-1234" <filename>, it is slow.
We have a C routine that can do a find in about 20 I/O operations or less
searching a fixed record length file - kind of a binary search on a file -
This is a simplified version - i tested it on a file with 8 million records.
test results:
:

kcsdev:/home/jmcnama> time ffind 000000000 8000000 /tmp/test

real    0m0.21s
user    0m0.00s
sys    0m0.01s
kcsdev:/home/jmcnama> echo $?
1
kcsdev:/home/jmcnama> time ffind 000050000 8000000 /tmp/test

real    0m0.08s
user    0m0.01s
sys    0m0.02s
kcsdev:/home/jmcnama> echo $?
0
kcsdev:/home/jmcnama> time ffind 000000001 8000000 /tmp/test

real    0m0.07s
user    0m0.00s
sys    0m0.02s


code:
:

/*****************************************
*  ffind.c --
*  find a value in a file using random access
*  usage ffind <value> <nrec> <filename>
*  requires a fixed length record file
*  possibly with a newline for each record,
*  reclen= data + newline
*  returns:
*    0 if the value is found,
*    1 if not found
**********************************/
#define RECL 10

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

FILE *in=NULL;
char _Wvalue[RECL*2]={0x0};

void usage(void)      /* help */
{
    fprintf(stderr,"%s\n",
        "usage: ffind <value to find> <number records> <filename>");
    exit(EXIT_FAILURE);
}

char  *setfilepos(long pos)  /* read from  in the file*/
{
    if(fseek(in,pos*RECL,SEEK_SET)!=0)
    {  /* note we do not check for EINTR which is possible */
            perror("Parameter error - incorrect number of records");
            exit(EXIT_FAILURE);
    }
    memset(_Wvalue,0x0,sizeof(_Wvalue));
    if(fgets(_Wvalue,sizeof(_Wvalue),in)==NULL)
    {
        if(!feof(in))
        {
            perror("File read error");
            exit(EXIT_FAILURE);
        }     
    }
    return _Wvalue;
}

int compar(const char *key,long rec) /* compare key & test value */
{
    char value[RECL+1]={0x0};
    char *p=NULL;

    strcpy(value,setfilepos(rec));
    p=strchr(value,'\n');
    if(p!=NULL) *p=0x0;
    return strcmp(key,value);
}

/* binary search against a fixed Rec Len  file */
int fbsearch(const char *key, long nrecs)
{
    int retval=0;
    long offset=0;
    long guess=0;

    nrecs--;
    guess=(nrecs - offset)/2;
    for(;;)
    {
        retval=compar(key,guess);
        if(retval)
        {
            if(retval>0)
            {
                    offset=guess+1;
            }
            else
            {
                nrecs=guess-1;
            }
            if(offset > nrecs) break;

            guess=offset + (nrecs-offset)/2;
            continue;
        }
        break;
    }
    return retval;
}

int main(int argc, char *argv[])
{
    int result=0;
    if(argc<4)
    {
        usage();
    }
    in=fopen(argv[3],"r");
    if(in==NULL)
    {
        perror("Error opening input file");
        exit(EXIT_FAILURE);
    }
    result=fbsearch(argv[1],atol(argv[2]));
    return (result!=0);
}


pavant Aug 5th, 2005 8:35 AM

yes JIM. Thanks for the same.


All times are GMT -5. The time now is 9:22 PM.

Powered by vBulletin® Version 3.7.0, Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Copyright ©2007 DaniWeb® LLC