In my last post, I found out that my parsed calendar.ics file consumes more than 7 MB of memory. I decided to look at eds with massif. As I have only this one calendar in Evolution, I expected to get similar results. Here is the massif graph I got:

EDS under massif

It’s about 13.8 MB. Looking closer at the massif logs, I found that almost all allocations were done in the method e_cal_backend_file_open. Further investigation 🙂 led to the finding that the files $HOME/.evolution/calendar/local/system/tasks.ics and $HOME/.evolution/memos/local/system/journal.ics are to blame for that.

Now, the most interesting thing is the content of those files. Both files contain only time zones: a lot of copies of the same time zone. Is that a feature or a bug? 🙂


slusny@turret:~/.evolution$ grep BEGIN:VTIMEZONE ~/.evolution/tasks/local/system/tasks.ics | wc -l
538
slusny@turret:~/.evolution$ grep "TZID:/softwarestudio.org/Olson_20011030_5/Africa/Algiers" ~/.evolution/tasks/local/system/tasks.ics | wc -l
318
slusny@turret:~/.evolution$ grep BEGIN:VTIMEZONE /home/slusny/.evolution/memos/local/system/journal.ics | wc -l
814

Here is the massif graph, when I removed these two files:

EDS under massif with removed journal.ics and tasks.ics

Note 1.8.2008: The files journal.ics and tasks.ics grow only when eds is restarted (either because it was killed or because it segfaulted…).

To reproduce the bug, it is enough to restart Evolution and activate the calendar view.


How much memory is consumed by the calendar? I hope this post helps to answer that question.

I will use my favourite big calendar.ics file. Some basic facts about the calendar:


slusny@turret:~/gsoc/massif logs$ ls -lh calendar.ics
-rw-r--r-- 1 slusny slusny 813K 2008-07-28 14:15 calendar.ics

slusny@turret:~/gsoc/massif logs$ grep VEVENT calendar.ics | wc -l
2550

That said, I consider the calendar a big one. I prepared a simple program that just opens the calendar and parses the ICS file, similarly to the file backend. Valgrind’s massif gave me this graph:


MB
7.192^                                                              ,,:. .
|                                                        . . @ @#:: :.
|                                                    . @@: : @ @#:: ::
|                                                  @@: @@: : @ @#:: ::
|                                               ,, @@: @@: : @ @#:: ::@
|                                            . @@@ @@: @@: : @ @#:: ::@
|                                         .. : @@@ @@: @@: : @ @#:: ::@
|                                      .@ :: : @@@ @@: @@: : @ @#:: ::@
|                                    ,@:@ :: : @@@ @@: @@: : @ @#:: ::@
|                                 ,@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:
|                              ,: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:
|                           , :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:
|                         @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:
|                    ., @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:
|                 . ,:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@:.
|               .@: @:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@::
|            , @:@: @:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@::
|         ,:@@ @:@: @:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@::
|     , @ @:@@ @:@: @:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@::
|    @@ @ @:@@ @:@: @:@ @ @@@ :@: @@ @@:@ :: : @@@ @@: @@: : @ @#:: ::@::
0 +----------------------------------------------------------------------->Mi
0                                                                   287.6

So, the calendar consumes more than 7 MB. I ran the program again without valgrind and looked at the process
with the smem.pl script. This script parses /proc/$pid/smaps into something more useful:

VMSIZE: 18136 kb
RSS:    10784 kb total
         2864 kb shared
          168 kb private clean
         7752 kb private dirty

PRIVATE MAPPINGS
  vmsize   rss clean  rss dirty  file
  7148 kb  0 kb       7148 kb    [heap]
    40 kb  0 kb         40 kb    /usr/local/lib/libORBit-2.so.0.1.0
    44 kb  0 kb         40 kb    /home/slusny/gsoc/evolution-data-server/calendar/libecal/.lib
    88 kb  0 kb         36 kb    [stack]

The heap is shown as smaller here (who knows why?). However, the calendar is quite big, I think. Perhaps we should not hold all events (even prehistoric ones) and all event fields in memory all the time 🙂

In this post I will describe the impact of a new, more efficient data structure on the calendar part of EDS. The new data structure is our famous interval tree. Imagine that we are maintainers of calendar data and an enemy is asking us expensive questions about our data, mostly “give me the events in some time range”. In the past, we took a naive approach to answering these questions. Now we do it more wisely.

The attached picture shows how many events need to be scanned to evaluate a query. Red marks queries BEFORE the optimization – either 1271 or 346 items were scanned (I have two calendars with 1271 and 346 items, and before the optimization all events were scanned to answer a question). The green line shows how many items need to be scanned after the optimization.

Number of events scanned before and after optimization

One can notice that there is a green (optimized) query at the beginning that still scanned through all events. This is a query in which no subexpression of the form *occur-in-range* was
found – it was not a question of the form “give me all events in some time range”. Therefore, the interval tree did not help us.

More results:

  • More numbers – how long did it take to evaluate each query?

    Comparison of performance - time to evaluate query

  • How many items were tested (summed over 423 queries)? For every tested event we had to generate recurrences.
Number of scanned events

Testing interval trees

July 17, 2008

After implementing interval trees, I started to think about testing them. It would be great to have a test script. To do this, I would need to think about the different scenarios that can happen – different ways of inserting and removing data can lead to different rotations, for example. Of course, I did not want to spend too much time on testing.

Therefore I have written a small test program that inserts random events into both an interval tree and a plain list, does several searches, deletes some events and does some searches again. I have found several bugs this way.

But how can I be sure that all interesting situations were tested? I used the great gcov and lcov tools. Almost 90% of the code was covered by the tests, and although the remaining code is not so important, I would like to add more tests in the future. Valgrind did not show any memory leaks. This is my tip for better sleep: test your code.

Interval trees in EDS

July 17, 2008

Interval tree

Finally, I have finished the implementation of interval trees. More or less, I followed the book Introduction to Algorithms, 2nd edition. My implementation is based on red-black trees, too. EDS is often asked for the events in some time range (interval). In the times before interval trees, we had to go through all events, generate recurrences and test whether each event is in the range. With an interval tree, we can do this much faster.

The data in the tree (events, in our case) are accessed by interval queries. The underlying data structure is a red-black tree with the low endpoint of the interval chosen as the key. To make interval search efficient, each node x maintains some additional data – the maximum (max) and minimum (min) value of any interval endpoint stored in the subtree rooted at x. The min and max fields have to be updated when performing rotations and when inserting and deleting data.

The basic operations – insert, delete and search – are simple, as we already know them from red-black trees, and can be done in logarithmic time. The interesting operation is “search all”, which should return all events in a given interval. To search for events, we start at the root of the tree and check whether we should go left or right – the min and max fields of the left and right sons help us make the correct decision. Sometimes we need to go in both directions. The running time is therefore O( min(n, k lg n) ), where k is the number of events in the output list and n is the total number of events.

If someone is interested in the implementation, here are some differences between my implementation and the book:

  • The code in the book keeps only the min field, not max (the max field helps us make more effective decisions)
  • We need to store different events with the same interval and to search by the UID of the event
  • We need to work with open-ended intervals (a special value, -1)