Programming Forums
User Name Password Register
 

RSS Feed
FORUM INDEX | TODAY'S POSTS | UNANSWERED THREADS | ADVANCED SEARCH

Reply
 
Thread Tools Display Modes
Old Jul 28th, 2005, 4:54 PM   #1
pavant
Newbie
 
Join Date: Jul 2005
Posts: 4
Rep Power: 0 pavant is on a distinguished road
SCRIPT TO TRACE 100MB FILE(One million records)

Hi,

I have to sort & search for a particular pattern in an 100MB file. ( 1 million records).
Can you please provide an efficient solution to this?

Thanks in advance.

Regards,
Pavan
pavant is offline   Reply With Quote
Old Jul 28th, 2005, 5:00 PM   #2
stevengs
Professional Programmer
 
stevengs's Avatar
 
Join Date: May 2005
Location: Bad Nauheim, Germany
Posts: 436
Rep Power: 4 stevengs is on a distinguished road
Quote:
Originally Posted by pavant
...provide an efficient solution to this...
provide you with a solution? What are your ideas so far (except for the one about someone else providing you with a solution)? Will this be a recurring problem, or is it a one time thing? How is the file organized? What is the operating system? Which tools do you intend to use?......
__________________
-Steven
"Is this a piece of your brain?" - Basil Fawlty
stevengs is offline   Reply With Quote
Old Jul 28th, 2005, 5:09 PM   #3
pavant
Newbie
 
Join Date: Jul 2005
Posts: 4
Rep Power: 0 pavant is on a distinguished road
The file contains only Mobile numbers. It requires search the file and find-out the particular number existing or not. The script has already written. The problem i am facing.,awk script shooting CPU usage. Here OS is Linux.
pavant is offline   Reply With Quote
Old Jul 28th, 2005, 5:59 PM   #4
DaWei
Resident Grouch
 
DaWei's Avatar
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
If the file is sorted, the search should be relatively efficient. I suppose a germane question is: how often is the file modified and how extensive are the modifications? Problems such as this need to be addressed in terms of their specifics whereas general solutions can be applied to less recalcitrant things.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
DaWei is offline   Reply With Quote
Old Jul 28th, 2005, 6:21 PM   #5
pavant
Newbie
 
Join Date: Jul 2005
Posts: 4
Rep Power: 0 pavant is on a distinguished road
Yes. first i am doing sorting and then applying binary search logic. File content is fixed. It's exactly 100MB size and contains all mobile numbers.,each line one mobile number.
The problem is once if i run the script.,it's shooting CPU usage., but it works fine.
pavant is offline   Reply With Quote
Old Jul 28th, 2005, 7:09 PM   #6
DaWei
Resident Grouch
 
DaWei's Avatar
 
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
If the file is modified rarely or in a limited manner, I would think an insertion procedure could maintain the sort, as one would do with a linked list. If it's modified grossly, then a heap sort might be in order. There are other methods that require more storage and an initial organization, but are amenable to more effective operations subsequently. A database is a good example of a sophisticated and complex process that yields ultimate gains. My initial impression is that you should deep-six the general-purpose utility approach and go for an application dedicated to the problem. One hundred megs is not a breathtaking amount these days, but one needs to realize the problems that can arise if improper amounts of cache can lead to thrashing (page swapping unduly) and other undesirable thangys. The fact that each item is on one line is immaterial. A "line" is the result of just another delimiter. It's possible that a mere modification of your approach is in order. Consider a silly extreme: you perform a bubble sort on 10 million items when the data has been modified only slightly or not at all, then you perform a binary search. If something like this is an invariant. then there is much room for improvement. Again, specific suggestions can't be made in the absence of specific information regarding the process as well as the mere organization of the data.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
DaWei is offline   Reply With Quote
Old Jul 29th, 2005, 11:32 AM   #7
jim mcnamara
Hobbyist Programmer
 
Join Date: Jun 2005
Location: New Mexico
Posts: 228
Rep Power: 4 jim mcnamara is on a distinguished road
You can use grep "123-333-1234" <filename>, it is slow.
We have a C routine that can do a find in about 20 I/O operations or less
searching a fixed record length file - kind of a binary search on a file -
This is a simplified version - i tested it on a file with 8 million records.
test results:
kcsdev:/home/jmcnama> time ffind 000000000 8000000 /tmp/test

real    0m0.21s
user    0m0.00s
sys     0m0.01s
kcsdev:/home/jmcnama> echo $?
1
kcsdev:/home/jmcnama> time ffind 000050000 8000000 /tmp/test

real    0m0.08s
user    0m0.01s
sys     0m0.02s
kcsdev:/home/jmcnama> echo $?
0
kcsdev:/home/jmcnama> time ffind 000000001 8000000 /tmp/test

real    0m0.07s
user    0m0.00s
sys     0m0.02s

code:
/*****************************************
*   ffind.c --
*   find a value in a file using random access
*   usage ffind <value> <nrec> <filename>
*   requires a fixed length record file
*   possibly with a newline for each record,
*   reclen= data + newline
*   returns:
*    0 if the value is found,
*    1 if not found
**********************************/
#define RECL 10

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

FILE *in=NULL;
char _Wvalue[RECL*2]={0x0};

void usage(void)       /* help */
{
    fprintf(stderr,"%s\n",
        "usage: ffind <value to find> <number records> <filename>");
    exit(EXIT_FAILURE);
}

char  *setfilepos(long pos)  /* read from  in the file*/
{
     if(fseek(in,pos*RECL,SEEK_SET)!=0)
     {  /* note we do not check for EINTR which is possible */
     	perror("Parameter error - incorrect number of records");
     	exit(EXIT_FAILURE);
     }
     memset(_Wvalue,0x0,sizeof(_Wvalue));
     if(fgets(_Wvalue,sizeof(_Wvalue),in)==NULL)
     {
        if(!feof(in))
        {
            perror("File read error");
            exit(EXIT_FAILURE);
        }      
     }
     return _Wvalue;
}

int compar(const char *key,long rec) /* compare key & test value */
{
    char value[RECL+1]={0x0};
    char *p=NULL;

    strcpy(value,setfilepos(rec));
    p=strchr(value,'\n');
    if(p!=NULL) *p=0x0;
    return strcmp(key,value);
}

/* binary search against a fixed Rec Len  file */
int fbsearch(const char *key, long nrecs)
{
    int retval=0;
    long offset=0;
    long guess=0;

    nrecs--;
    guess=(nrecs - offset)/2;
    for(;;)
    {
        retval=compar(key,guess);
        if(retval)
        {
            if(retval>0)
            {
            	offset=guess+1;
            }
            else
            {
                nrecs=guess-1;
            }
            if(offset > nrecs) break;

            guess=offset + (nrecs-offset)/2;
            continue;
        }
        break;
    }
    return retval;
}

int main(int argc, char *argv[])
{
    int result=0;
    if(argc<4)
    {
        usage();
    }
    in=fopen(argv[3],"r");
    if(in==NULL)
    {
        perror("Error opening input file");
        exit(EXIT_FAILURE);
    }
    result=fbsearch(argv[1],atol(argv[2]));
    return (result!=0);
}
jim mcnamara is offline   Reply With Quote
Old Aug 5th, 2005, 8:35 AM   #8
pavant
Newbie
 
Join Date: Jul 2005
Posts: 4
Rep Power: 0 pavant is on a distinguished road
yes JIM. Thanks for the same.
pavant is offline   Reply With Quote
Reply

Bookmarks

« Previous Thread in Forum | Next Thread in Forum »

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump




DaniWeb IT Discussion Community
All times are GMT -5. The time now is 12:18 AM.

Powered by vBulletin® Version 3.7.0, Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Copyright ©2007 DaniWeb® LLC