Programming Forums

Sane Mar 18th, 2007 2:48 PM

Hacking Python Memory
 
This might sound a little over the edge, but bear with me.

I have a Python program that, at any given time, could be holding a huge amount of data in RAM. However, most of the time only about 1/8th of that data is actually useful.

The obvious choice is keeping this stored in a MySQL database so that MySQL can decide for itself what's important enough to cache, and what can be stored on the hard drive.

However, I don't want to go that route, since the data must be accessed as quickly as possible, and very frequently. MySQL is adequately fast for a single query, but large amounts of varied information are retrieved too often for it to keep up.

Therefore, my idea was to create another layer on top of the memory, and underneath the script execution. When a block of memory hasn't been accessed for a while, the layer will store the block of memory in a new file, and delete that portion from RAM. If the script attempts to access that memory, the layer will retrieve the file's contents, delete the file, and store it in memory again.

This could be very easy to do, or very difficult, depending on what Python has to offer in these regards. Does Python support lookups of memory address locations? Are there any existing libraries that can help?

The memory that's of interest is a list of class instances. Each class instance is storing several strings, integers and more lists.

My first thought is to solve this using Python's decorators, by adding a decorator to every function that will look at these class instances each time the function is called. If a variable is set to None or False, then its corresponding id(var) will have a file with its contents stored in it. Only problem is I don't believe that will lighten the load on the RAM, and that's a big problem.

Any help, advice, or food for thought will be very helpful. Thanks in advance.

Arevos Mar 18th, 2007 3:43 PM

If you're using CPython (the official Python interpreter), then this is relatively simple. CPython uses a reference counting memory management system, which means that the number of references to an object is kept track of, and when this reaches zero, the object is instantly destroyed. This is a very simplistic and somewhat inefficient approach to garbage collection, as it's usually better to dereference a whole block of memory all at once, since freeing memory takes time; however there are advantages to reference counting. For instance, in CPython, you can do something like this:
Code:

lines = open("file.txt").readlines()
Because the file object created by "open" is not referenced, it's destroyed instantly afterwards, and the file is closed. In IronPython, which uses the .NET GC, the object isn't destroyed until sometime later when it is more efficient to discard the memory, and hence the file could be left open for a long time. So whilst reference counting is not efficient, it does result in very predictable behaviour.
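
To make the "destroyed instantly" point concrete, here is a small sketch in Python 3 syntax. The class name Resource and the events list are made up for illustration, and the ordering shown relies on CPython's reference counting; other interpreters may run the callback later.

```python
import weakref


class Resource:
    """Stand-in for an object that holds something expensive (hypothetical)."""


events = []


def demo():
    obj = Resource()
    # The callback runs when the object is actually destroyed.
    ref = weakref.ref(obj, lambda _: events.append("destroyed"))
    events.append("before del")
    del obj                      # the last strong reference goes away here
    events.append("after del")
    return ref


ref = demo()
print(events)         # on CPython: ['before del', 'destroyed', 'after del']
print(ref() is None)  # True: the weak reference no longer points at anything
```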

Essentially, you could create a wrapper class that keeps objects on disk until they are needed, and then expires them after a certain amount of time (perhaps using the "shelve" module as storage). To expire an object, just remove all references to it. You may want to use the weakref module to make sure you don't give out any "real" references that might prevent your objects from being recycled. You could also use the __getattr__ method so that you can access your data like this:
Code:

diskcache.commonvar = 10  # gets from in-memory cache (a dict)
diskcache.rarevar += 10  # gets from disk (via shelve), stores in memory cache

...  # more stuff

# rarevar hasn't been accessed for some time,
# so it's removed when diskcache is accessed again:
diskcache.x += 1    # rarevar removed as x is returned
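
A rough sketch of what such a wrapper could look like, in Python 3 syntax. The class name DiskCache, the max_age parameter and the private attribute names are assumptions made for illustration, not an existing library; it also keeps things simple by only checking for expired entries when the cache is touched.

```python
import os
import shelve
import tempfile
import time


class DiskCache:
    """Attribute-style store: recently used values live in a dict,
    everything else lives only in a shelve database on disk."""

    def __init__(self, path, max_age=30.0):
        # object.__setattr__ bypasses our own __setattr__ defined below.
        object.__setattr__(self, "_shelf", shelve.open(path))
        object.__setattr__(self, "_hot", {})       # name -> value held in RAM
        object.__setattr__(self, "_used", {})      # name -> last access time
        object.__setattr__(self, "_max_age", max_age)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        self._expire()
        if name not in self._hot:
            try:
                self._hot[name] = self._shelf[name]   # unpickled from disk
            except KeyError:
                raise AttributeError(name) from None
        self._used[name] = time.monotonic()
        return self._hot[name]

    def __setattr__(self, name, value):
        self._expire()
        self._hot[name] = value
        self._used[name] = time.monotonic()

    def _expire(self):
        # Write stale entries back to disk and drop them from RAM.
        now = time.monotonic()
        stale = [n for n, t in self._used.items() if now - t > self._max_age]
        for name in stale:
            self._shelf[name] = self._hot.pop(name)
            del self._used[name]

    def close(self):
        for name, value in self._hot.items():
            self._shelf[name] = value
        self._hot.clear()
        self._used.clear()
        self._shelf.close()


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(os.path.join(tmp, "cache"), max_age=0.05)
    cache.commonvar = 10
    cache.rarevar = 1
    time.sleep(0.1)                 # both values are now older than max_age
    cache.commonvar += 1            # expires both, then reloads commonvar
    in_memory = sorted(cache._hot)  # only commonvar is back in RAM
    rare = cache.rarevar            # fetched from disk again on demand
    common = cache.commonvar
    cache.close()

print(in_memory, rare, common)
```

Note that a value handed out by the cache stays alive for as long as the caller holds a reference to it, which is where the weakref suggestion above comes in.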


DaWei Mar 18th, 2007 3:44 PM

This is what a reasonably decent operating system tries to do for you, with its cache/virtual memory. Have you determined by performance measurements that it's really necessary?

Sane Mar 18th, 2007 4:20 PM

@Arevos : That all seems pretty straightforward. But how do I get the memory back when it's attempted to be referenced again? Maybe I don't see how this works.

@DaWei : I haven't yet done any measurements, but this is more because I anticipate that the list of class instances could potentially reach several hundred thousand instances. And my RAM can't possibly handle that cleanly.

Arevos Mar 18th, 2007 6:21 PM

Quote:

Originally Posted by Sane (Post 125422)
@Arevos : That all seems pretty straightforward. But how do I get the memory back when it's attempted to be referenced again? Maybe I don't see how this works.

Well, if I'm understanding right, then you have a series of objects containing data. You could pickle those objects to a file, using something like the shelve module, and keep them in a time-limited cache whenever they're accessed. In order to access the objects, one must always go through the cache.
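
As a minimal sketch of the pickling step on its own (Python 3 syntax; the Link class and the helper names dump_instance and load_instance are hypothetical), an instance can be written to a binary file and read back with its types and attributes preserved:

```python
import os
import pickle
import tempfile


class Link:
    """Hypothetical record with a few strings, integers and lists."""

    def __init__(self, link_id, title, tags):
        self.link_id = link_id
        self.title = title
        self.tags = tags


def dump_instance(obj, path):
    # Serialize the whole instance to a binary file.
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_instance(path):
    # Rebuild the instance from the file written by dump_instance.
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "link_1.pickle")
    dump_instance(Link(1, "example", ["a", "b"]), path)
    restored = load_instance(path)

print(type(restored).__name__, restored.link_id, restored.title, restored.tags)
```

Unpickling runs code chosen by whoever wrote the file, so this only suits files the program created itself.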

Sane Mar 18th, 2007 6:36 PM

Edit :
Never mind. You can probably disregard the original post. So, if I understand this correctly, the shelve module does not increase RAM usage with the number of objects being stored? It stores them on the hard drive, but my program will treat them as traditional variables?

Then I would add a layer before the shelf level, where a cache keeps their contents in RAM?
An off-topic question here: Is there a function that automatically dumps an instance's contents to a binary file, and then reads it right back in with all the types and attributes intact? If not, I could just quickly write one.

Original Post :
I'm not sure if I'm missing something, or if you're missing something, but to make sure we're on the same page, I should probably clarify:

Once a block has been removed from RAM, it could still be needed by the program at any time. When I say that only 1/8th of the RAM might be useful, I only mean at the current time. All 8/8ths of the RAM may be looked at at least once in a 24 hour period.

So once it's saved to the file, the program must still know to look at the file and add its contents back to the memory, if it's ever attempted to be accessed.

Is that what you have been assuming? Or why don't I see how this works?

Arevos Mar 18th, 2007 7:30 PM

Quote:

Originally Posted by Sane (Post 125426)
Edit : Never mind. You can probably disregard the original post. So, if I understand this correctly, the shelve module does not increase RAM usage with the number of objects being stored? It stores them on the hard drive, but my program will treat them as traditional variables?

Yes and no. The shelve module uses a dbm database to store objects serialized by the pickle module. When you request an object, the Shelf class queries the database, extracts the string of data that represents the object, and deserializes it (or "unpickles" it) into a real object. The Shelf class does have the option of having an in-memory cache, but this cache is essentially unlimited in size, so probably not what you want; fortunately, it's off by default.

To show you what I mean:
Code:

import shelve

shelf = shelve.open("somefile.dbm")

# "dog" is read from disk and made into an object, then printed out. Because
# the object has not been assigned a name, it is destroyed the moment the
# print command ends.
print shelf["dog"]

# This line will read the same data in again, and store it only briefly in
# memory before discarding it again.
print shelf["dog"]

# "cat" is read from disk, but this time it is assigned a reference. This means
# that the "cat" object will persist in memory.
cat = shelf["cat"]
print cat

# This command doesn't reread the data from disk - it uses the in-memory
# version of "cat"
print cat

# Now, we can wait for "cat" to fall out of scope, or we can delete it
# manually:
del cat

# No more cat instances exist now
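
The same behaviour can be checked directly. The following is a sketch in Python 3 syntax (the file name and the stored values are made up): with the default settings every lookup unpickles a fresh object, so changes to a fetched object are not saved until it is assigned back, while writeback=True turns on the in-memory cache mentioned above.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pets")

    with shelve.open(path) as shelf:           # writeback=False is the default
        shelf["dog"] = {"legs": 4, "names": ["Rex"]}
        first = shelf["dog"]
        second = shelf["dog"]
        fresh_copies = first is not second     # each lookup unpickles anew
        first["names"].append("Fido")          # changes only the in-memory copy
        on_disk = shelf["dog"]["names"]
        shelf["dog"] = first                   # explicit write-back to disk
        after_store = shelf["dog"]["names"]

    with shelve.open(path, writeback=True) as shelf:
        # With the cache on, repeated lookups return the same cached object.
        cached_same = shelf["dog"] is shelf["dog"]

print(fresh_copies, on_disk, after_store, cached_same)
```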


Sane Mar 18th, 2007 11:28 PM

Okay, everything seems to be working fine, except python.exe's memory usage doesn't seem to drop when something is shelved. When I delete it from RAM, memory usage drops, but then it goes straight back up once it's shelved.


I ran the following commands sequentially in the Python command line, while watching the memory:

Code:

>>> class x:
...    def __init__(self):
...        self.mem = "-" * 1024*100

>>> shelf = shelve.open("test_shelve.dbm")
>>> shelf["a"] = x()


And the memory went waaay up. Shouldn't it only go up a little bit, since shelve won't keep it in the memory?

By the way, this is my current solution, which works but does not lower memory usage:

Code:

class main:

    def __init__(self):

        self.shelf = shelve.open("link_db.dbm")

...

        self.last_db_dump  = self.last_save
        self.db_dump_every = 60
        self.db_dump_age  = 30
       
        self.grab("queue", list)
        self.grab("active", list)
        self.grab("max_link_id", int)

        self.links  = dict()
        self.db_fetch_times = dict()
           
...

    def fetch_link(self, link_id):

        date = int(time.time())
        self.db_fetch_times[link_id] = date

        if not self.links.has_key(link_id):
            # load
            print "Loaded :", link_id
            self.links[link_id] = self.shelf["instance_%s"%(link_id)]

        # do a routine check for unused RAM
        if date >= self.last_db_dump + self.db_dump_every:
            for dump_link_id in self.db_fetch_times.keys():
                if date >= self.db_fetch_times[dump_link_id] + self.db_dump_age:
                    # save
                    self.shelf["instance_%s"%(dump_link_id)] = self.links[dump_link_id]
                    del self.links[dump_link_id]
                    del self.db_fetch_times[dump_link_id]
                    print "Saved :", dump_link_id
            self.last_db_dump = date
           
        return self.links[link_id]
             
    def grab(self, var_name, var_type):
       
        try:
            setattr(self, var_name, self.shelf[var_name])
        except KeyError:
            setattr(self, var_name, var_type())

...

    def add_to_queue(self, link_id):

        link = self.fetch_link(link_id)
        if link.points > 0:
            self.queue.append(link_id)
            link.points -= 1

...


ZenMasterJG Mar 19th, 2007 7:58 AM

Sane:
IMHO, you're optimizing prematurely. DaWei is right. Find out how your algorithm does, and *then* optimize. If your RAM can't handle the number of objects you're creating, trying to think of an alternative solution is probably a better idea than trying to re-write Python's memory management.

Arevos Mar 19th, 2007 8:19 AM

Quote:

Originally Posted by ZenMasterJG (Post 125446)
IMHO, you're optimizing prematurely. DaWei is right. Find out how your algorithm does, and *then* optimize. If your RAM can't handle the number of objects you're creating, trying to think of an alternative solution is probably a better idea than trying to re-write Python's memory management.

Normally, I'd agree, but if the number of objects in memory plainly exceeds the amount of RAM available, then this sort of optimization is necessary. However, RAM tends to be fairly large these days, so one might want to consider whether it really is necessary; but in principle, at least, this might not fall under the category of premature optimization, but more a matter of necessity.

Quote:

Originally Posted by Sane (Post 125437)
And the memory went waaay up. Shouldn't it only go up a little bit, since shelve won't keep it in the memory?

Whilst Python discards memory instantly, your OS usually does not. If a chunk of memory is freed by a program, and you have plenty of free RAM around, then the OS will often keep the chunk of memory around until it's needed.

In order to properly check that the application works, you need to operate on many different objects. For instance:
Code:

x = "-" * 100 * 1024 * 1024    # a single 100M string
del x    # x deleted, but OS doesn't bother to reassign 100M straight away

# The following code will likely result in an out of memory error,
# because all 100 strings (about 10G in total) stay referenced by the list:
l = []
for i in range(100):
    l.append("-" * 100 * 1024 * 1024)

# The following code will not, since only one string is referenced at a
# time and the OS will reassign the freed bits of memory when it deems it
# necessary:
for i in range(100):
    x = "-" * 100 * 1024 * 1024

Try your class with a large number of objects that would normally result in an out-of-memory error if they were all loaded in at once.
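
Since the process size shown by the OS can be misleading, another way to check is to ask Python how much memory its own objects hold. This is only a sketch: it uses the tracemalloc module, which needs Python 3.4 or later and was not available when this thread was written, and the 10 MB CHUNK size is an arbitrary choice for the demo.

```python
import tracemalloc

CHUNK = 10 * 1024 * 1024   # 10 MB per string, arbitrary size for the demo

tracemalloc.start()

held = ["-" * CHUNK for _ in range(5)]         # all five strings stay referenced
with_refs, _ = tracemalloc.get_traced_memory()

del held                                       # drop the only references
after_del, _ = tracemalloc.get_traced_memory()

for _ in range(5):
    x = "-" * CHUNK                            # previous string is freed on rebind
del x
after_loop, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()

print("held by the list:", with_refs // (1024 * 1024), "MB")
print("after del:       ", after_del // (1024 * 1024), "MB")
print("after the loop:  ", after_loop // (1024 * 1024), "MB")
```

If the traced figure drops after entries are shelved and deleted, the cache is releasing its objects even when the process size reported by the OS stays high.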


All times are GMT -5.