In our experience, the scalability of a GRR installation depends heavily on which data store backend is used. Currently, we provide two data stores, one based on MongoDB and one based on MySQL, that are pretty simple to use but also, sadly, not very scalable. To mitigate this problem, we have recently released a new distributed data store that offers a huge performance improvement over all the other data stores. This data store is a bit harder to set up than the others, but we think it's really worth it performance-wise. In this blog post I'll describe how I installed GRR using a small group of data store servers.
I started out by getting four machines on Amazon EC2, all of them running Ubuntu 14.04.
I planned to have one main server running all the GRR processes:
54.76.238.125
And the rest are data store servers:
54.171.141.180 <-- the first one is the master server.
54.171.117.245
54.171.117.72
First, I installed GRR on all those machines using the provided install script (install_script_ubuntu.sh). This is just a little easier because it automatically installs all the dependencies we need, but in the end I was going to use the latest code from the repository (highly recommended if you want to try the latest features!), so I did a:
> git clone https://github.com/google/grr.git
and
> python setup.py build
> sudo python setup.py install
GRR usually uses two config files:
- /etc/grr/grr-server.yaml : This one comes with every server release and should not be touched.
- /etc/grr/server.local.yaml : This is the one where all the local modifications go. On the main server where I installed the whole GRR package, this already contains my customized settings, like the keys I generated. This is where I'm going to put all my modifications.
The first thing I did was switch away from the MongoDB backend. I wanted to try the SqliteDataStore, so I appended this to the config file:
Datastore.implementation: SqliteDataStore
I did a quick restart using the initctl_switch script and, while I was there, also switched to multiprocessing:
sudo /usr/share/grr/scripts/initctl_switch.sh multi
(If you already have grr running in multiprocess mode, use
sudo /usr/share/grr/scripts/initctl_switch.sh restart
instead.)
I confirmed that all processes were there:
> ps -ef | grep grr
root 13297 1 35 13:34 ? 00:00:00 /usr/bin/python /usr/bin/grr_server --start_http_server --config=/etc/grr/grr-server.yaml
root 13310 1 34 13:34 ? 00:00:00 /usr/bin/python /usr/bin/grr_server --start_ui --config=/etc/grr/grr-server.yaml
root 13323 1 33 13:34 ? 00:00:00 /usr/bin/python /usr/bin/grr_server --start_enroller --config=/etc/grr/grr-server.yaml
root 13336 1 33 13:34 ? 00:00:00 /usr/bin/python /usr/bin/grr_server --start_worker --config=/etc/grr/grr-server.yaml
and that I could reach the GUI at http://54.76.238.125:8000/. It also seemed to be using the SQLite data store (with the default path in /tmp):
> ls /tmp/grr-datastore/
aff4.sqlite config.sqlite files.sqlite foreman.sqlite pmem%2Dsigned.sqlite pmem.sqlite stats_store.sqlite winpmem%2Eamd64%2Esys.sqlite winpmem%2Ex86%2Esys.sqlite
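The file names in that listing look percent-encoded (%2D is a dash, %2E is a dot). As a small sketch, assuming the data store simply URL-quotes the names (an assumption based only on the listing, not on GRR's code), they can be decoded with the Python 3 standard library. The helper decode_name is made up for illustration:

```python
from urllib.parse import unquote

# File names copied from the directory listing above.
names = [
    "pmem%2Dsigned.sqlite",
    "winpmem%2Eamd64%2Esys.sqlite",
    "winpmem%2Ex86%2Esys.sqlite",
]

def decode_name(filename):
    """Strip the .sqlite suffix and undo percent-encoding (assumed encoding)."""
    suffix = ".sqlite"
    stem = filename[: -len(suffix)] if filename.endswith(suffix) else filename
    return unquote(stem)

for name in names:
    print(name, "->", decode_name(name))
```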
After the basic setup worked, I started setting up the remote data store. For this to work I had to set up:
- client credentials, so the GRR server can talk to the data store servers (these need to be set on the GRR server and the master server)
- server credentials, so the slaves can talk to the master (these need to be set on all data store servers)
- the data store master server, which has to know about all the slave servers
- the slave server(s)
I decided to go with two data store servers at first: one master, one slave. First I set up the data store master by putting both servers (the master has to be the first entry!), the client credentials, and the server username and password into its config:
> sudo cat /etc/grr/server.local.yaml
Dataserver.server_list:
- http://54.171.141.180:7000
- http://54.171.117.245:7000
Dataserver.client_credentials:
- myawesomeuser:thepassword:rw
Dataserver.server_username: servergroup
Dataserver.server_password: 65f2271f7
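Each entry in Dataserver.client_credentials packs three fields into one string. Here is a minimal sketch of pulling them apart, assuming the layout is user:password:permissions as the example entry suggests, with rw presumably meaning read and write. The function name is made up for illustration and is not part of GRR:

```python
def parse_client_credential(entry):
    """Split 'user:password:permissions' into its three fields.

    Assumes the user name and the permission string contain no colon;
    the password is whatever sits in between.
    """
    user, rest = entry.split(":", 1)
    password, permissions = rest.rsplit(":", 1)
    return {"user": user, "password": password, "permissions": permissions}

cred = parse_client_credential("myawesomeuser:thepassword:rw")
print(cred)
```

The user name and password recovered this way are the values the GRR server later needs in HTTPDataStore.username and HTTPDataStore.password.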
Starting it up:
> sudo python grr/server/data_server/data_server.py --config=/etc/grr/grr-server.yaml --master
[...]
Starting Data Master/Server on port 7000 ...
Yay, it works! Now I set up the slave server. This one only needs to know the master and the server credentials; client credentials are distributed automatically:
> sudo cat /etc/grr/server.local.yaml
Dataserver.server_list:
- http://54.171.141.180:7000
Dataserver.server_username: servergroup
Dataserver.server_password: 65f2271f7
Starting the slave (I added --verbose to get some more info):
> PYTHONPATH=. python grr/server/data_server/data_server.py --config=grr/config/grr-server.yaml --verbose
[...]
Registering with data master at 54.171.141.180:7000.
[...]
DataServer fully registered.
Starting Data Server on port 7000 ...
At the same time, since our group only consisted of two servers, the master also displayed:
Registered server 54.171.117.245:7000
All data servers have registered!
which indicates that our data store is now good to go. I could then set up the main server to actually use the data store server group:
> sudo cat /etc/grr/server.local.yaml
[...]
Dataserver.server_list:
- http://54.171.141.180:7000
- http://54.171.117.245:7000
Datastore.implementation: HTTPDataStore
HTTPDataStore.username: myawesomeuser
HTTPDataStore.password: thepassword
A quick test with the console confirms that it's working:
> sudo PYTHONPATH=. python grr/tools/console.py --config=/etc/grr/grr-server.yaml
[...]
In [1]:
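Besides the console, a quick way to spot connectivity problems is to check that every entry in Dataserver.server_list accepts TCP connections. This is a rough sketch using only the Python 3 standard library. The helper names are hypothetical, and an open port says nothing about whether the credentials are right:

```python
import socket
from urllib.parse import urlsplit

# Entries copied from Dataserver.server_list above.
SERVER_LIST = [
    "http://54.171.141.180:7000",
    "http://54.171.117.245:7000",
]

def host_and_port(url):
    """Extract (host, port) from an entry like 'http://1.2.3.4:7000'."""
    parts = urlsplit(url)
    return parts.hostname, parts.port

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False

def check_servers(urls, timeout=3.0):
    """Map every URL to True/False depending on TCP reachability."""
    return {url: is_reachable(*host_and_port(url), timeout=timeout) for url in urls}
```

Calling check_servers(SERVER_LIST) from the GRR server returns a dict with one True or False per data store server.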
Now I still had one more server that I wanted to include in my group, so I fired up the data store manager (after making sure that no one was using the data server group):
> sudo PYTHONPATH=. python grr/server/data_server/manager.py --config=grr/config/grr-server.yaml
Manager(2 servers)>servers
Last refresh: Thu Oct 23 12:53:12 2014
Server 0 54.171.141.180:7000 (Size: 100KB, Load: 0)
14 components 7KB average size
Server 1 54.171.117.245:7000 (Size: 111KB, Load: 0)
13 components 8KB average size
Manager(2 servers)> ranges
Server 0 54.171.141.180:7000 50% [00000000000000000000, 09223372036854775808[
Server 1 54.171.117.245:7000 50% [09223372036854775808, 18446744073709551616[
Manager(2 servers)>
So all my servers were registered, they held 100 KB and 111 KB of data (in components averaging 7 KB and 8 KB), and the hash space was distributed evenly:
hex(9223372036854775808) == '0x8000000000000000L'
hex(18446744073709551616) == '0x10000000000000000L'
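The same boundaries fall out of splitting a 64-bit hash space evenly. The sketch below assumes the space runs from 0 to 2**64 and that each server gets one contiguous slice. The helper even_ranges is illustrative, not the manager's actual code. It is written for Python 3, where hex() no longer appends the trailing L shown above:

```python
HASH_SPACE = 2 ** 64  # assumed size of the hash space (upper bound in the ranges output)

def even_ranges(num_servers, space=HASH_SPACE):
    """Return (start, end) pairs, end exclusive, that split the space evenly."""
    bounds = [i * space // num_servers for i in range(num_servers + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

for index, (start, end) in enumerate(even_ranges(2)):
    # Same layout as the manager's output, plus the end boundary in hex.
    print("Server %d [%020d, %020d[ end=%s" % (index, start, end, hex(end)))
```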
To add a third server, we first need to add it to the data store master:
Manager(2 servers)> addserver 54.171.117.72 7000
Master server allows us to add server 54.171.117.72:7000
Do you really want to add server //54.171.117.72:7000? (y/n) y
=============================================
Operation completed.
To rebalance server data you have to do the following:
1. Add '//54.171.117.72:7000' to Dataserver.server_list in your configuration file.
2. Start the new server at 54.171.117.72:7000
3. Run 'rebalance'
The server was added, but it has an empty hash range so far:
Manager(3 servers)> ranges
Server 0 54.171.141.180:7000 50% [00000000000000000000, 09223372036854775808[
Server 1 54.171.117.245:7000 50% [09223372036854775808, 18446744073709551616[
Server 2 54.171.117.72:7000 0% [18446744073709551616, 18446744073709551616[
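Why the new server receives no data yet can be seen from how a half-open range lookup behaves. The sketch below assumes each key is reduced to an integer in [0, 2**64) and sent to the server whose [start, end[ range contains it. How GRR actually hashes keys is not covered here, and server_for and share are made-up helpers:

```python
# Ranges copied from the 'ranges' output above, as (start, end) with end exclusive.
RANGES = [
    (0, 9223372036854775808),                      # Server 0
    (9223372036854775808, 18446744073709551616),   # Server 1
    (18446744073709551616, 18446744073709551616),  # Server 2 (empty range)
]

def server_for(hash_value, ranges=RANGES):
    """Return the index of the server whose half-open range holds hash_value."""
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value < end:
            return index
    raise ValueError("hash value outside the hash space: %d" % hash_value)

def share(start, end, space=2 ** 64):
    """Percentage of the hash space covered by [start, end[ (rounded down)."""
    return 100 * (end - start) // space

for index, (start, end) in enumerate(RANGES):
    print("Server %d covers %d%% of the hash space" % (index, share(start, end)))
```

Since start equals end for Server 2, no hash value can satisfy start <= value < end, so nothing is routed there until the ranges change.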
So I took down all servers, added the third server to the config on the master and the GRR server, and brought everything back up (including the new data store server). Afterwards, I used the data store manager to run a rebalance operation that actually fills the new server with data:
> sudo PYTHONPATH=. python grr/server/data_server/manager.py --config=grr/config/grr-server.yaml
Manager(3 servers)> ranges
Server 0 54.171.141.180:7000 50% [00000000000000000000, 09223372036854775808[
Server 1 54.171.117.245:7000 50% [09223372036854775808, 18446744073709551616[
Server 2 54.171.117.72:7000 0% [18446744073709551616, 18446744073709551616[
Manager(3 servers)> rebalance
The new ranges will be:
Server 0 54.171.141.180:7000 33% [00000000000000000000, 06148914691236516864[
Server 1 54.171.117.245:7000 33% [06148914691236516864, 12297829382473033728[
Server 2 54.171.117.72:7000 33% [12297829382473033728, 18446744073709551616[
Contacting master server to start re-sharding... OK
The following servers will need to move data:
Server 0 moves 30KB
Server 1 moves 69KB
Server 2 moves 14KB
Proceed with re-sharding? (y/n) y
Rebalance with id e03c2a40-06cf-46f1-ab27-dcc30aa55ce6 fully performed.
Manager(3 servers)> sync
Sync done.
Manager(3 servers)> ranges
Server 0 54.171.141.180:7000 33% [00000000000000000000, 06148914691236516864[
Server 1 54.171.117.245:7000 33% [06148914691236516864, 12297829382473033728[
Server 2 54.171.117.72:7000 33% [12297829382473033728, 18446744073709551616[
Manager(3 servers)> servers
Last refresh: Thu Oct 23 13:09:51 2014
Server 0 54.171.141.180:7000 (Size: 70KB, Load: 0)
10 components 7KB average size
Server 1 54.171.117.245:7000 (Size: 72KB, Load: 0)
9 components 8KB average size
Server 2 54.171.117.72:7000 (Size: 83KB, Load: 0)
10 components 8KB average size
Manager(3 servers)>
And it worked: all servers were roughly equally loaded.
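As a rough cross-check of the amounts moved during the rebalance, one can compute which part of each server's old range falls outside its new range. This sketch assumes keys are spread uniformly over the hash space and works with exact fractions of the space rather than the boundaries the manager printed. The helper moved_fraction is hypothetical:

```python
from fractions import Fraction

def moved_fraction(old, new):
    """Fraction of a server's old range that lies outside its new range.

    old and new are (start, end) pairs expressed as fractions of the hash space.
    """
    old_start, old_end = old
    new_start, new_end = new
    old_size = old_end - old_start
    if old_size == 0:
        return Fraction(0)  # empty old range: nothing stored, nothing to move
    # Overlap between old and new range; clamp at zero if they are disjoint.
    kept = max(Fraction(0), min(old_end, new_end) - max(old_start, new_start))
    return (old_size - kept) / old_size

third = Fraction(1, 3)
half = Fraction(1, 2)

# Server 0: [0, 1/2[ -> [0, 1/3[ ; Server 1: [1/2, 1[ -> [1/3, 2/3[
print("Server 0 moves", moved_fraction((0, half), (0, third)))
print("Server 1 moves", moved_fraction((half, 1), (third, 2 * third)))
```

Under that assumption Server 0 would move about a third of its data and Server 1 about two thirds, which is in the same ballpark as the reported 30KB out of 100KB and 69KB out of 111KB.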