Showing posts with label datastore. Show all posts

Tuesday, January 6, 2009

Trivial resolution of Datastore performance

In addition to Model.put(), Datastore has db.put(). I had not noticed that the latter can store several entities at once until Arachnid told me. So in my code I changed this:
for cell in cells:
  cell.put()
To this:
db.put(cells)
That was all it took to fix the performance.
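To see why one batched call can beat many single calls, here is a toy model in plain Python. It is not the App Engine API: FakeDatastore and its round_trips counter are hypothetical names, and the assumption is simply that every call costs one round trip no matter how many entities it carries.

```python
class FakeDatastore(object):
    """Toy stand-in for a remote store; NOT the App Engine API."""

    def __init__(self):
        self.round_trips = 0   # number of simulated remote calls
        self.entities = []     # everything "stored" so far

    def put(self, entity_or_list):
        # Assumption: one call costs one round trip,
        # however many entities it carries.
        self.round_trips += 1
        if isinstance(entity_or_list, list):
            self.entities.extend(entity_or_list)
        else:
            self.entities.append(entity_or_list)


cells = ["cell-%d" % n for n in range(100)]

one_by_one = FakeDatastore()
for cell in cells:
    one_by_one.put(cell)       # 100 separate calls

batched = FakeDatastore()
batched.put(cells)             # a single call for all 100 entities

print(one_by_one.round_trips)  # 100
print(batched.round_trips)     # 1
```

Both stores end up with the same entities; only the number of calls differs.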

Sunday, January 4, 2009

Improved Datastore performance

It looks like the problem with Datastore performance was that the data was very fine-grained. I had created the earlier test following Google's suggestion (look at the tip at the end of the page). So this time I ran the opposite test:
  • Instead of a single integer, each entity holds a text of 10,000 characters
  • Half of the records are written in transactions of 10 records each, and the other half record by record
The results show that the size of an entity had no effect, unlike the number of entities. So it is better to write a few large objects than many small ones.
Also, this time I saw a huge difference between the real appengine server and dev_appserver once the database held many records (the real server was much faster). Grouping a few records in a transaction also helped. This is the test code:
from google.appengine.ext import db
from time import time

print 'Content-Type: text/plain'
print ''

total_t=time()
class Root(db.Model):
    pass

class C(db.Model):
    i=db.TextProperty()

t1000="a"*10000  # 10,000 characters, despite the name

def add_in_transaction(root, text, amount):
    for j in range(amount):
        c=C(parent=root, i=text)
        c.put()

print "with transactions - big"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    db.run_in_transaction(add_in_transaction, root, t1000, 10)
    print time()-t
print "without transactions - big"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    add_in_transaction(None, t1000, 10)
    print time()-t
print "without transactions - small"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    add_in_transaction(None, "a", 10)
    print time()-t

print "total time:", time()-total_t
And this is the result:
with transactions - big
0.161096096039
0.154489994049
0.367100000381
0.152635812759
0.153033971786
without transactions - big
0.315757989883
0.359083890915
0.559228181839
0.360776901245
0.330877780914
without transactions - small
0.279601812363
0.541454076767
0.324053049088
0.311630964279
0.306309938431
total time: 4.67810916901
I think it's worth filing a bug against the appengine documentation so that it mentions these performance considerations.

P.S. I changed the test a little to demonstrate that writing one character or 10K characters makes no difference.
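The timings reported above are easier to compare as averages. This small script only does arithmetic on the numbers already listed; the dictionary name and the mean() helper are mine.

```python
# Timings in seconds, copied from the test output in this post.
results = {
    "with transactions - big": [
        0.161096096039, 0.154489994049, 0.367100000381,
        0.152635812759, 0.153033971786],
    "without transactions - big": [
        0.315757989883, 0.359083890915, 0.559228181839,
        0.360776901245, 0.330877780914],
    "without transactions - small": [
        0.279601812363, 0.541454076767, 0.324053049088,
        0.311630964279, 0.306309938431],
}


def mean(values):
    """Arithmetic mean of a list of floats."""
    return sum(values) / len(values)


for label in sorted(results):
    print("%s: %.3f s" % (label, mean(results[label])))
# with transactions - big: 0.198 s
# without transactions - big: 0.385 s
# without transactions - small: 0.353 s
```

On these numbers, a transactional batch of 10 took roughly half the time of 10 record-by-record writes, while the big and small record-by-record runs were close to each other. With only five runs per group and one slow outlier in each, this is an indication rather than a benchmark.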

Datastore performance

Something is strange with the performance of the AppEngine Datastore. I tried to run the following code:

from google.appengine.ext import db
from time import time

print 'Content-Type: text/plain'
print ''

total_t=time()

class C(db.Model):
    i=db.IntegerProperty()

for i in range(10):
    t=time()
    for j in range(10):
        c=C(i=i)
        c.save()
    print time()-t

print "total time:", time()-total_t
As you can see, this is a complete Python module that does not depend on Django or anything else. Just add a corresponding mapping to app.yaml and you can try it yourself. The output of this code, which adds 100 records to the Datastore, is:
0.307200908661
0.279258012772
0.305376052856
0.310864925385
0.286242008209
0.283288002014
0.299383878708
0.286517858505
0.281584024429
0.268044948578
total time: 2.90873217583

I tried to add 200 records and got a timeout, as AppEngine does not allow long-running requests. I got pretty similar timings on the dev_appserver. This is very slow, and I cannot understand where the catch is.
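To put a number on "very slow", here is the per-record cost worked out from the ten batch timings above. It is plain arithmetic on the reported values; the projection for 200 records assumes the cost per record stays constant, which I have not verified.

```python
# Seconds per batch of 10 saves, copied from the output above.
batch_times = [
    0.307200908661, 0.279258012772, 0.305376052856, 0.310864925385,
    0.286242008209, 0.283288002014, 0.299383878708, 0.286517858505,
    0.281584024429, 0.268044948578,
]
records_per_batch = 10

total = sum(batch_times)
per_record = total / (len(batch_times) * records_per_batch)

print("%.1f ms per record" % (per_record * 1000))               # 29.1 ms per record
print("%.1f s projected for 200 records" % (per_record * 200))  # 5.8 s projected for 200 records
```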

Saturday, January 3, 2009

Querying for None in Datastore

I ran into a weird problem with the GAE Datastore when I tried to search for a None value. If I use GQL, the query works as expected:

from game.models import *
for c in Cell.gql("WHERE game=:g", g=None):
 print c

The code above prints the expected cells, which are not bound to any game. But I need to iterate through the cells of a certain board type, so instead of Cell.gql I start from board.cell_set and try to define a filter on game=None. The following code should give the same outcome as the previous one:

from game.models import *
for c in Cell.all().filter("game=", None):
 print c

But this time I get no results. Why?
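One possible explanation, offered as an assumption rather than a confirmed diagnosis: as far as I can tell, the first argument of filter() is expected to be a property name and an optional operator separated by a space. If so, "game=" would be read as a property literally named "game=" compared with the default = operator, and no entity has such a property. The parser below is a hypothetical simplification written to illustrate that reading; it is not the SDK's code.

```python
import re

# Hypothetical simplification: a property name, then optionally
# whitespace followed by a comparison operator.
OPERATORS = ["<=", ">=", "!=", "<", ">", "=", "in"]
FILTER_RE = re.compile(
    r"^\s*(\S+)(?:\s+(%s))?\s*$" % "|".join(re.escape(op) for op in OPERATORS),
    re.IGNORECASE)


def parse_filter(property_operator):
    """Split a filter string into (property name, operator)."""
    match = FILTER_RE.match(property_operator)
    if match is None:
        raise ValueError("could not parse filter: %r" % property_operator)
    name, operator = match.group(1), match.group(2)
    # With no operator after whitespace, equality is assumed.
    return name, (operator or "=")


print(parse_filter("game ="))  # ('game', '=')
print(parse_filter("game="))   # ('game=', '=')
```

If this reading is right, writing the filter as "game =" (with a space) should make it behave like the GQL version.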

Cached ReferenceProperty: now with round trip

One thing was really missing from CachedReferenceProperty: a cached round trip. Suppose we have the following one-to-many relationship:

class Master(db.Model):
  pass

class Detail(db.Model):
  master=CachedReferenceProperty(Master)

By cached round trip here I mean that when a master holds a cached collection of details, those details reference the same master, so going back and forth from master to details does not make any database hits.

To make this possible, I replaced the collection builder in _CachedReverseReferenceProperty, changing it from this:

  res=[c for c in query]

to this:

  res=[]
  for c in query:
    resolved_name='_RESOLVED_'+self.__prop #WARNING: using internal
    setattr(c, resolved_name, model_instance)
    res.append(c)

Very ugly; I need an idea for how to avoid using the internal attribute. The whole source file is here.

Thursday, December 25, 2008

Cached ReferenceProperty

Piece of cake

Earlier I wrote about my wish to subclass ReferenceProperty so that the collection would not be fetched every time I iterate through it. Well, it was so easy that I can post the whole implementation here.
from google.appengine.ext import db

class CachedReferenceProperty(db.ReferenceProperty):

  def __property_config__(self, model_class, property_name):
    super(CachedReferenceProperty, self).__property_config__(model_class,
                                                       property_name)
    #Just carelessly override what super made
    setattr(self.reference_class,
            self.collection_name,
            _CachedReverseReferenceProperty(model_class, property_name,
                self.collection_name))

class _CachedReverseReferenceProperty(db._ReverseReferenceProperty):

    def __init__(self, model, prop, collection_name):
        super(_CachedReverseReferenceProperty, self).__init__(model, prop)
        self.__collection_name = collection_name

    def __get__(self, model_instance, model_class):
        if model_instance is None:
            return self
        if self.__collection_name in model_instance.__dict__:# why does it get here at all?
            return model_instance.__dict__[self.__collection_name]

        query=super(_CachedReverseReferenceProperty, self).__get__(model_instance,
            model_class)
        #replace the attribute on the instance
        res=[c for c in query]
        model_instance.__dict__[self.__collection_name]=res
        return res

    def __delete__(self, model_instance):
        if model_instance is not None:
            # pop() avoids a KeyError when nothing has been cached yet
            model_instance.__dict__.pop(self.__collection_name, None)
With these classes, we can rewrite the previous example as:
class Master(db.Model):
  pass

class Detail(db.Model):
  master=CachedReferenceProperty(Master)
Run the same loop and you will see it executes instantly, even with 100,000 iterations instead of 1,000.
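The caching idea can be tried without App Engine. The sketch below is a plain-Python analogue, not the classes above: CachedCollection, fake_query and fetch_count are hypothetical names, and fake_query stands in for the datastore query so that the number of fetches can be counted.

```python
class CachedCollection(object):
    """Toy descriptor: runs `loader` once per instance, then serves the cached list."""

    def __init__(self, name, loader):
        self.name = name        # key used in the instance __dict__
        self.loader = loader    # stands in for the datastore query

    def __get__(self, instance, owner):
        if instance is None:
            return self
        if self.name not in instance.__dict__:
            instance.__dict__[self.name] = list(self.loader(instance))
        return instance.__dict__[self.name]

    def __delete__(self, instance):
        # Drop the cached list so the next access reloads it.
        instance.__dict__.pop(self.name, None)


# A one-element list, so the function can update the counter
# without a global statement.
fetch_count = [0]


def fake_query(master):
    fetch_count[0] += 1
    return ["detail-1", "detail-2"]


class Master(object):
    detail_set = CachedCollection("detail_set", fake_query)


m = Master()
for _ in range(1000):
    for d in m.detail_set:
        pass
print(fetch_count[0])   # 1

del m.detail_set        # forget the cached list
m.detail_set            # this access loads it again
print(fetch_count[0])   # 2
```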

Is it a free cake?

Not exactly. Try this:
m=Master()
m.put()
d1=Detail(master=m)
d1.put()
print m.detail_set
d2=Detail(master=m)
d2.put()
print m.detail_set
The second print returns a stale result that does not include d2. So we need a way to reset the cached value and fetch up-to-date values from the datastore. Fortunately, that is easy:
del m.detail_set
print m.detail_set
This is why I implemented _CachedReverseReferenceProperty.__delete__. When m.__dict__ has no key 'detail_set', m.detail_set is dispatched to type(m).__dict__['detail_set'], and there I call the base class to access the datastore. What surprised me is that when m.__dict__['detail_set'] does exist, m.detail_set is still dispatched to Master.__dict__['detail_set']. I don't understand why that happens, so I worked around the problem. I have to learn Python better to answer that question.
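A likely answer, based on Python's attribute lookup rules for new-style classes: a descriptor that defines __set__ or __delete__ is a data descriptor, and a data descriptor found on the class takes precedence over an entry of the same name in the instance __dict__. Since _CachedReverseReferenceProperty defines __delete__, its __get__ would run even when the key is present. The classes below are hypothetical and exist only to demonstrate the rule.

```python
class NonDataDescriptor(object):
    """Defines only __get__, so the instance __dict__ can shadow it."""

    def __get__(self, instance, owner):
        return "from descriptor"


class DataDescriptor(object):
    """Also defines __delete__, which makes it a data descriptor."""

    def __get__(self, instance, owner):
        return "from descriptor"

    def __delete__(self, instance):
        pass


class Holder(object):
    plain = NonDataDescriptor()
    data = DataDescriptor()


h = Holder()
# Write straight into the instance dictionary, bypassing the descriptors.
h.__dict__["plain"] = "from instance dict"
h.__dict__["data"] = "from instance dict"

print(h.plain)  # from instance dict
print(h.data)   # from descriptor
```

Under this rule, the check against model_instance.__dict__ inside __get__ is not a workaround for odd behavior but the expected way for a data descriptor to serve a cached value.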

Wednesday, December 24, 2008

AppEngine Datastore and memcache

I miss Hibernate collections. In the following code I access the collection a thousand times:

class Master(db.Model):
  pass

class Detail(db.Model):
  master=db.ReferenceProperty(Master)

m=Master()
m.put()
d=Detail(master=m)
d.put()

for i in range(1000):
  for tmp_d in m.detail_set:
    pass

The code above takes a few seconds to execute. The reason is that Datastore fetches the collection from storage every time, whereas in Hibernate the collection would be fetched from the database only once until the end of the session. Oops, there are no sessions in Datastore. So the Datastore developers were right when they opted to fetch the collection every time: they don't know when the details change.

This is the reason Master cannot be put in memcache effectively: it would be stored without the Details. Master.detail_set holds only the definition of the query needed to get the details. So I'm thinking of a way to decorate ReferenceProperty to make one-to-many relations suitable for memcache, so that big object trees are read from Datastore once and are then quickly accessible.

Saturday, December 20, 2008

Polymorphism in AppEngine Datastore Models

There is a problem with inherited classes in AppEngine

Let's suppose we have the following models:

class Master(db.Model):
  mp = db.StringProperty()

class Detail(db.Model):
  dp = db.StringProperty()
  master = db.ReferenceProperty(Master)
When these are declared, Datastore automatically adds a detail_set property to Master. So if we do
m=Master(mp='foo')
m.put()
d1=Detail(dp='bar', master=m)
d1.put()
d2=Detail(dp='zee', master=m)
d2.put()
then we have an m.detail_set property which will fetch [d1, d2]. But if we define
class MoreDetail(Detail):
  mdp=db.StringProperty()

d3=MoreDetail(dp='org', mdp='jee', master=m)
d3.put()
then m.detail_set will also fetch d3, but deserialize it as Detail instead of MoreDetail. Here is how I checked it:
>>> for d in m.detail_set.fetch(10):
...  print d.properties()
{'master': <ReferenceProperty object at 0x018B8330>, 'dp': <StringProperty object at 0x023A8C10>}
{'master': <ReferenceProperty object at 0x018B8330>, 'dp': <StringProperty object at 0x023A8C10>}
{'master': <ReferenceProperty object at 0x018B8330>, 'dp': <StringProperty object at 0x023A8C10>}
One of these objects should have an mdp property defined in MoreDetail, but that did not happen.