Saturday, January 31, 2009

Inherited classes in Hibernate

A few days ago I did some refactoring in a Hibernate-based Java EE application. There was a table, and a view on that table that included all of its columns, like this:
CREATE TABLE Person (
  Id NUMBER,
  Name VARCHAR(10),
  BirthDate DATE
);

CREATE VIEW PersonExtended AS
  SELECT p.*, YearsFromNow(p.BirthDate) AS Age FROM Person p;
Assuming we have a corresponding YearsFromNow function, this view includes all columns from Person plus an additional column named Age. Before the refactoring, there were two corresponding entity classes. In the actual code the entities have fully annotated getters and corresponding setters, but for readability I'll use the most compact (and not recommended) format here:
//Person.java

@Entity
class Person {
  @Id long id;
  String name;
  Date birthDate;
}

//PersonExtended.java

@Entity
class PersonExtended {
  @Id long id;
  String name;
  Date birthDate;
  int age;
}
Trying to remove the code duplication, I changed the entities as follows:
//Person.java

@Entity
@Inheritance (strategy=InheritanceType.TABLE_PER_CLASS)
class Person {
  @Id long id;
  String name;
  Date birthDate;
}

//PersonExtended.java

@Entity
class PersonExtended extends Person {
  int age;
}
Apart from the cosmetic improvement, this change also made it possible to use the same code for working with both entities. But the polymorphism also created new problems. For example, a query fetching all data from Person generated the following SQL:
SELECT p.*, NULL as Age, 1 as discriminator from Person p
  UNION
SELECT pe.*, 2 as discriminator from PersonExtended pe;
For clarity I replaced the actual field list from the Hibernate-generated SQL with *. So what happened? Hibernate treats PersonExtended as a kind of Person, so the result of this query is all records from Person followed by all records from PersonExtended! They do get the correct Java type, by the way, thanks to the discriminator column generated by Hibernate. Still, it's not what we wanted, and it's a regression (a new bug) introduced by the refactoring. To fix it we must tell Hibernate that PersonExtended is not to be considered a Person. I used a mapped superclass for that:
//AbstractPersonBase.java

@MappedSuperclass
abstract class AbstractPersonBase {
  @Id long id;
  String name;
  Date birthDate;
}

//Person.java

@Entity
class Person extends AbstractPersonBase {
  //empty, all person data is defined in the superclass
}

//PersonExtended.java

@Entity
class PersonExtended extends AbstractPersonBase {
  int age;
}
This code correctly defines the relation between PersonExtended and Person: they have a common part, but one should not be used in place of the other. The solution with an abstract base has no problem fetching the two entities in the same query. At the same time, it still allows using AbstractPersonBase wherever both entities are processed in the same way in Java.

Friday, January 9, 2009

Python class slots

Today I came across the __slots__ feature of Python. It defines the list of allowed attributes at class creation time, so by default no dictionary is kept for each instance. This can save memory when many such instances are stored in big lists. To use slots, a class should be defined like this:
class Point(object):
  __slots__=["x","y"]
The next example demonstrates the difference between a class with slots and a regular class.
class OldPoint(object):
  pass

p=OldPoint()
p.x=10
p.y=20   # these are OK
p.z=30   # this is OK as well - any attributes are allowed

p=Point()
p.x=10
p.y=20   # these are OK
p.z=30   # this causes AttributeError: 'Point' object has no attribute 'z'
Defining __slots__ affects not only the dictionary of the instances, but also the way they are serialized (or pickled, in Python terminology). Also, a weak reference (__weakref__) is not supported by default (this can be overridden by listing '__weakref__' in __slots__).
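To make the last two points concrete, here is a small sketch in modern Python (the class names Plain, Slotted and SlottedWeak are made up for this example):

```python
import weakref

class Plain(object):
    pass

class Slotted(object):
    __slots__ = ["x", "y"]

class SlottedWeak(object):
    # opt back in to weak references by listing __weakref__ in __slots__
    __slots__ = ["x", "y", "__weakref__"]

print(hasattr(Plain(), "__dict__"))    # True: regular instances carry a dict
print(hasattr(Slotted(), "__dict__"))  # False: slots remove the per-instance dict

weakref.ref(Plain())                   # fine: regular classes support weak references
try:
    weakref.ref(Slotted())             # no __weakref__ slot
except TypeError:
    print("Slotted instances are not weak-referenceable")
weakref.ref(SlottedWeak())             # fine again
```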


Tuesday, January 6, 2009

lj-cut on blogger

LiveJournal has a useful feature called lj-cut. It allows showing only part of a post on the main page and revealing the rest on a separate page. I was looking for a similar feature on Blogger, as my posts with code examples are quite lengthy.
No need to say much more: the solution is here. Don't forget step 1; I started from the red part and scratched my head over what was wrong.

P.S. I ended up changing their tags from span to div, as a span cannot contain <pre> elements, among many other limitations.

Trivial resolution of Datastore performance

In addition to Model.put(), the Datastore has db.put(). I did not notice that the latter can put several entities at once until Arachnid pointed it out. So in my code I changed this:
for cell in cells:
  cell.put()
To this:
db.put(cells)
That's all that was needed to fix the performance.
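Why does this help so much? Each put() call is a separate round trip to the datastore service, while db.put() makes a single trip for the whole list. A toy counting model (FakeDatastore is invented here, not the real App Engine API) shows the shape of the difference:

```python
class FakeDatastore(object):
    """Toy stand-in for the datastore: counts service calls."""
    def __init__(self):
        self.calls = 0
        self.entities = []

    def put(self, entity_or_list):
        # one service call regardless of how many entities are passed
        self.calls += 1
        if isinstance(entity_or_list, list):
            self.entities.extend(entity_or_list)
        else:
            self.entities.append(entity_or_list)

cells = ["cell%d" % i for i in range(100)]

ds = FakeDatastore()
for cell in cells:   # one-by-one: one call per entity
    ds.put(cell)
print(ds.calls)      # 100

ds2 = FakeDatastore()
ds2.put(cells)       # batched: one call for all entities
print(ds2.calls)     # 1
```

The per-call overhead, not the payload size, dominates, which also matches the measurements in the next post.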

Sunday, January 4, 2009

Improved Datastore performance

Looks like the problem with Datastore performance is that the information was very fine-grained. I had created the earlier test following Google's suggestion (see the tip at the end of the page). So this time I made the opposite test:
  • Instead of holding a single integer, each entity holds a text of 10,000 characters
  • Half of the records are written in transactions of 10 records each, and the other half record by record
The results show that the size of an entity has no effect, unlike the entity count. So it's better to write a few large objects than many small ones.
Also, this time there was a huge difference between the real AppEngine server and dev_appserver once the database had many records (the real server was much faster). Grouping a few records in a transaction also helped. This is the test code:
from google.appengine.ext import db
from time import time

print 'Content-Type: text/plain'
print ''

total_t=time()
class Root(db.Model):
    pass

class C(db.Model):
    i=db.TextProperty()

t1000="a"*10000

def add_in_transaction(root, text, amount):
    for j in range(amount):
        c=C(parent=root, i=text)
        c.put()

print "with transactions - big"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    db.run_in_transaction(add_in_transaction, root, t1000, 10)
    print time()-t
print "without transactions - big"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    add_in_transaction(None, t1000, 10)
    print time()-t
print "without transactions - small"
for i in range(5):
    t=time()
    root=Root()
    root.put()
    add_in_transaction(None, "a", 10)
    print time()-t

print "total time:", time()-total_t
And this is the result:
with transactions - big
0.161096096039
0.154489994049
0.367100000381
0.152635812759
0.153033971786
without transactions - big
0.315757989883
0.359083890915
0.559228181839
0.360776901245
0.330877780914
without transactions - small
0.279601812363
0.541454076767
0.324053049088
0.311630964279
0.306309938431
total time: 4.67810916901
I think it's worth opening a bug on the AppEngine documentation so they mention these performance considerations.

P.S. I changed the test a little to demonstrate that writing one character or 10K characters makes no difference.

Datastore performance

Something strange is going on with the performance of the AppEngine Datastore. I tried to run the following code:

from google.appengine.ext import db
from time import time

print 'Content-Type: text/plain'
print ''

total_t=time()

class C(db.Model):
    i=db.IntegerProperty()

for i in range(10):
    t=time()
    for j in range(10):
        c=C(i=i)
        c.save()
    print time()-t

print "total time:", time()-total_t
As you can see, this is a complete Python module, not dependent on Django or anything else. Just add a corresponding mapping to app.yaml and you can try it yourself. The output of this code, which adds 100 records to the Datastore, is:
0.307200908661
0.279258012772
0.305376052856
0.310864925385
0.286242008209
0.283288002014
0.299383878708
0.286517858505
0.281584024429
0.268044948578
total time: 2.90873217583

I tried to add 200 records and got a time-out, as AppEngine does not allow long-running requests. I had pretty similar timings on dev_appserver. This is very slow, and I cannot understand where the catch is.

Saturday, January 3, 2009

Querying for None in Datastore

I ran into a weird problem with the GAE Datastore when I tried to search for a None value. If I use GQL, the query works as expected:

from game.models import *
for c in Cell.gql("WHERE game=:g", g=None):
    print c

The above code prints the expected cells that are not bound to any game. But I need to iterate through the cells of a certain board, so instead of Cell.gql I start from board.cell_set and try to define a filter on game=None. The following code should give the same outcome as the previous one:

from game.models import *
for c in Cell.all().filter("game=", None):
    print c

But this time I get no results. Why?

Cached ReferenceProperty: now with round trip

One thing was really missing in CachedReferenceProperty: a cached round trip. Suppose we have the following one-to-many relationship:

class Master(db.Model):
  pass

class Detail(db.Model):
  master=CachedReferenceProperty(Master)

By a cached round trip I mean that when a master holds a cached collection of details, those details reference back the same master, so going back and forth between master and details does not make any database hits.

To make this possible, I replaced the collection builder in _CachedReverseReferenceProperty from this:

  res=[c for c in query]

to this:

  res=[]
  for c in query:
    resolved_name='_RESOLVED_'+self.__prop #WARNING: using internal
    setattr(c, resolved_name, model_instance)
    res += [c]

Very ugly; I need an idea how to eliminate the use of this internal attribute. The whole source file is here.
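One possible direction, sketched in plain Python (FakeDB, CachedDetailSet and the other names are invented; this is not the real GAE property code): let the detail class own an explicit cache attribute that its master property reads, so the reverse-reference descriptor can fill that cache without reaching into framework internals like _RESOLVED_:

```python
class FakeDB(object):
    """Toy database stand-in that counts queries."""
    def __init__(self):
        self.queries = 0
        self.details = []

    def query_details(self, master_key):
        self.queries += 1
        return [d for d in self.details if d.master_key == master_key]

DB = FakeDB()

class CachedDetailSet(object):
    """Descriptor sketch: caches the detail collection on the master
    and wires each detail back to the same master instance."""
    def __get__(self, master, owner):
        if master is None:
            return self
        if master._details is None:
            details = DB.query_details(master.key)  # the only DB hit
            for d in details:
                d._cached_master = master           # round trip costs nothing
            master._details = details
        return master._details

class Master(object):
    detail_set = CachedDetailSet()
    def __init__(self, key):
        self.key = key
        self._details = None

class Detail(object):
    def __init__(self, key, master_key):
        self.key = key
        self.master_key = master_key
        self._cached_master = None   # public cache slot, part of Detail's contract

m = Master("m1")
DB.details = [Detail("d1", "m1"), Detail("d2", "m1")]

for d in m.detail_set:
    assert d._cached_master is m    # back-reference resolved with no extra hit
print(DB.queries)                   # 1: only the detail query ran
```

The trade-off is that the back-reference cache becomes an explicit part of the detail class instead of an undocumented internal attribute.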