Networking

TxMongo - Your Asynchronous MongoDB Twisted Client


We're proud to announce the release of TxMongo 0.6, which brings SSL support using Twisted's SSL context factory, "find with cursor" support just like PyMongo, bug fixes, and updated unit tests! TxMongo is an asynchronous MongoDB client for Twisted, written in Python.

The biggest change is that TxMongo is now sponsored by Amplidata. Through them we were able to fund development, bug fixes, and first-party Twisted sponsorship. We now have continuous integration (CI) across a wide support matrix: py26/py27/pypy against Twisted 12.1 through 14.0 (and trunk). As a result, unit-test code coverage now stands at 78%. This is also the very last release in the 0.x series before we move over to the "year.release" versioning model used by Twisted, and TxMongo will eventually find its way into Twisted's GitHub organization as a first-class library. You can download TxMongo 0.6.0 and other releases here: TxMongo Github Releases

What to expect

We have a list of priorities:

  • Switch the documentation over to Sphinx for readthedocs.org support.
  • Get TxMongo moved over to Twisted's org, with Travis-CI and Coveralls.
  • Get coverage to at least 80%.
  • Research functions found in PyMongo that are missing from TxMongo.
  • Contact various TxMongo forks and gather up bugs/issues/patches from various distros.

Backstory

In evaluating the options for using MongoDB with Twisted, there were two candidates:

  1. PyMongo
  2. TxMongo

The first option, supported by MongoDB themselves, is feature-complete and up to date, but it is synchronous and blocking. To work around that behaviour you have to defer its calls to a thread. The second option, TxMongo, lacks a lot of PyMongo's features but is built for Twisted. Amplidata's only concern was the lack of SSL support in TxMongo; all the main features that we needed were there. Thankfully the original author, Alexandre Fiori, who had put the project into maintenance mode, accepted our patch. We talked a bit about the future of TxMongo and, as it turns out, he is no longer developing it but would love to hand it over to the community to see it further developed and maintained, since he no longer has the time. We included Glyph of Twisted in the conversation to find it a new home, with the driving development work coming from Amplidata. The rest, as they say, is history.

Example code using TxMongo and SSL

First we start up mongodb:

#!/bin/bash
# create the path
mkdir -p /tmp/mongodb
# start mongodb process
mongod --dbpath /tmp/mongodb --sslMode requireSSL --sslPEMKeyFile mongodb.pem
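
The script above assumes a mongodb.pem file already exists. If you do not have one, a self-signed certificate is enough for testing; the following is a minimal sketch, and the file names mongodb.key, mongodb.crt and mongodb.pem simply match the ones used in this post:

#!/bin/bash
# Generate a throwaway self-signed key/certificate pair for testing
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=localhost" \
    -keyout mongodb.key -out mongodb.crt
# mongod expects the key and certificate concatenated into one PEM file
cat mongodb.key mongodb.crt > mongodb.pem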

Second we run this code:


from OpenSSL import SSL
from txmongo.connection import ConnectionPool
from twisted.internet import defer, reactor, ssl


class ServerTLSContext(ssl.DefaultOpenSSLContextFactory):
    def __init__(self, *args, **kw):
        kw['sslmethod'] = SSL.TLSv1_METHOD
        ssl.DefaultOpenSSLContextFactory.__init__(self, *args, **kw)


@defer.inlineCallbacks
def example():
    tls_ctx = ServerTLSContext(privateKeyFileName='./mongodb.key',
                               certificateFileName='./mongodb.crt')
    mongodb_uri = "mongodb://localhost:27017"

    mongo = yield ConnectionPool(mongodb_uri, ssl_context_factory=tls_ctx)

    foo = mongo.foo  # `foo` database
    test = foo.test  # `test` collection

    # fetch some documents
    docs = yield test.find(limit=10)
    for doc in docs:
        print doc


if __name__ == '__main__':
    example().addCallback(lambda ign: reactor.stop())
    reactor.run()


Latency simulation over long fat network


One of our clients asked us how we handle latency, and not just a few milliseconds across racks but two and even three digit millisecond latencies, the kind that indicate geographically separate locations across continents, not just within a country. On top of that, the "pipes" involved are 10Gbps and we had to fill them. We had theories and models of how it would work: we might not be able to fill a 10Gbps link with a single stream, but we could fill it with multiple streams. We had to validate that conclusion, so the question became: how do we test this?

We did our research and there are only a few commercial solutions available, like the Netropy 10G2, a 4 port, 2 lane hardware latency simulator for $30,000 new. Not only is that outside my budget, it is still limited to simulating two 10Gbps pipes, while we need at least 3 lanes (6 ports) and the possibility to expand as necessary. We decided it was cheaper, in terms of total cost, to put the research into building our own latency simulator.

We studied what we could find on Google, specifically the work done by NASA on a "Channel Emulator". They used traffic control (tc) to add delay on the egress of an interface. This means that a packet travelling through is delayed, but the return packet is not and goes right through. In our setup we have one 10Gbps card with 2 ports. We link the two interfaces with bridge control (brctl) to create a layer 2 bridge, then split the round trip time (RTT) delay and apply half of it to each 10Gbps interface. All packets going to and returning from a network then see the full simulated RTT delay; a minimal sketch of this setup follows below.

This type of machine does not need much RAM, as the buffers involved are not large; 4GiB is sufficient. What matters is the CPU operating frequency; other aspects of the CPU are not important, except that there should be 1 core per 10Gbps interface. This is required because part of the network stack is being simulated with the bridge and then processed. For a 3 lane setup, that is 6 ports, so we need at least a 6 core CPU at >= 2.2 GHz to handle the load.
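
Here is a minimal sketch of that mechanism for a single lane, assuming the two bridged ports are eth2 and eth3 and a simulated RTT of 64ms; interface names and values are illustrative, not taken from our actual scripts:

#!/bin/bash
# Bridge the two 10Gbps ports into a transparent layer 2 bridge
brctl addbr lane0
brctl addif lane0 eth2
brctl addif lane0 eth3
ip link set eth2 up
ip link set eth3 up
ip link set lane0 up

# Apply half of the desired RTT to the egress of each interface,
# so traffic in both directions sees the full round trip delay
RTT_MS=64
HALF=$((RTT_MS / 2))
tc qdisc add dev eth2 root netem delay ${HALF}ms
tc qdisc add dev eth3 root netem delay ${HALF}ms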

Latency Simulator

You may be asking why just 3 and not 4 latency lanes: for us there will always be a 'local' data center, and the other 3 connect to it in a star-like network layout, as in the diagram above. Since this is a 'flat' network in the same subnet, any ping from one remote data center to another goes through the 'local' data center. In reality, these 'data center' switches are connected to the latency simulator, which then connects to the local data center switch.

Realistic latencies from the 'local' data center in New York are:

  • California: 32ms
  • England: 80ms
  • Japan: 346ms

Source: Verizon's latency table

Going from California to England would therefore involve first a hop through New York, so the compound delay would be 112ms (32ms + 80ms). With that in mind you can then compute your bandwidth delay product (BDP); see the worked example below.

Once the machine is up and running with whatever Linux distribution you like, make sure that tc and brctl are installed. Here are the scripts that can be used to bring the bridges up and down, and to apply and remove the latencies for the four geographically separate data centers. Files:
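
(The script files attached to the original post are not reproduced here.) As for the bandwidth delay product mentioned above, a quick back-of-the-envelope calculation looks like this; the 10Gbps and 112ms figures are just the example numbers from this post:

#!/bin/bash
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8
BANDWIDTH_BPS=10000000000   # 10Gbps
RTT_S=0.112                 # 112ms round trip time
echo "$BANDWIDTH_BPS $RTT_S" | awk '{ printf "BDP: %.1f MiB\n", ($1 * $2 / 8) / (1024 * 1024) }'
# Prints roughly 133.5 MiB: the amount of data that must be "in flight"
# (and therefore the TCP window needed) to keep the pipe full at that RTT.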

Once in place, we could ping from one side to the other and see the latency being applied. It was then time for baseline testing. First we turned off the latency and used iperf to test end to end that we could fill the 10Gbps pipes and that the latency simulator itself was not the bottleneck; we got around 9.50Gbps point to point. Then we turned on the latency to see its impact directly. The first thing we noticed when running iperf for the default 10s is that TCP slow start and the initial TCP window size limit how much data we can send over the wire. Because of slow start, if you want meaningful numbers for a single stream you need to test for longer than 10s. With 120ms of latency we could not fill the pipe until after roughly 25s of running iperf, by which time we had already transferred something like 15GiB of data. So trying to send a single 1GiB file will never fill the pipe.
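
For reference, the kind of iperf invocation used for these longer tests looks roughly like this; the host address, duration and window size are illustrative:

# On the receiving side
iperf -s

# On the sending side: run for 60s instead of the default 10s,
# with a larger TCP window and, optionally, several parallel streams
iperf -c 10.0.0.2 -t 60 -w 8M
iperf -c 10.0.0.2 -t 60 -w 8M -P 4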

RTT (ms)    MiB/s (default window)    MiB/s (max window)
0           1162                      1157
2           1053                      1136
4           513                       1076
8           248                       1075
16          103                       691
22          91                        366
32          47                        358
44          31                        208
64          8.2                       64
128         0.8                       26
130         0.7                       26

The MAX setting I used is the maximum TCP window size of 1GiB. If you try to go above that, you will find that Linux gets upset and some networking services simply stop working. The sweet spot for us was to set the initial window size to 8MiB, which gave the algorithm enough room to shrink down towards 4096 bytes or to grow in the other direction. Below are the two key tunables, where rmem is the read buffer and wmem is the write buffer of the TCP stack:

sysctl -w net.ipv4.tcp_rmem='4096 8388608 33554432'
sysctl -w net.ipv4.tcp_wmem='4096 8388608 33554432'

However, even with an initial 8MiB TCP window size, you'll never reach this potential immediately, because the initial congestion window (initcwnd) has been set to 10 since 2011, per this git diff. This "slow start" is a congestion avoidance mechanism with exponential growth, a feature not a bug. Below is slow start in action when downloading a Linux source tarball from kernel.org.

slow start congestion control

What you are seeing is the exponential growth of the congestion window, which eventually grows enough for the TCP window size to kick in, after which throughput scales up linearly. You can, however, change this per route, which makes sense because congestion control works on a per network/host level. Examples of setting the initial congestion and receive windows:

ip route change default via x.x.x.x initcwnd 20 initrwnd 20   # update your default gateway
ip route change dev eth0 192.168.1.0/24 proto kernel src 192.168.0.1 initcwnd 20 initrwnd 20   # apply it to just one network

Do not just bump these values and expect fantastic results: if packet loss or real network congestion enters the equation, then values that are too large will make for a painful experience. You will be less able to respond to the pitfalls of the Internet, but if you are on a long fat network then adjusting these values can be a real boon for your throughput. You should now have the tools necessary to implement your own "long fat network" simulator, and an idea of the various things you can look at and adjust to get the most out of your network and applications.
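
Two quick ways to sanity-check these settings, assuming the iproute2 tools are available: ip route confirms that the per-route initcwnd/initrwnd values actually took, and ss shows the live congestion window while a transfer is running.

# Show the routing table entries, including any initcwnd/initrwnd overrides
ip route show

# Watch the congestion window (cwnd) and RTT of established TCP connections
ss -t -i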

Doing battle with a Dell R620 and Ubuntu


We recently got sent a Dell R620 to evaluate, and while its technical specifications are amazing, there are a few things that need to be handled first.

As far as Ubuntu and the Dell R620 go, Precise (12.04) is the only way to go here. Every release before Precise has issues with this hardware in one way or another. This is new hardware, after all.

For our "use case" we downgraded the PERC H710P controller to a H310 controller so we can have direct access to the drives via pass-through. The H310 allows TRIM support for SSDs and SMART data via smartctl to be used without any problems. If you are interested in SMART information and PERC H700 series RAID controller, I posted about possible workarounds at Dell's customer support site.

Let's begin. USB booting: try as we might, we could not get any USB stick to boot on the R620. We've gone through the iDRAC to present virtual drives and looked at BIOS/UEFI methods. The USB stick is recognized, but the R620 just shows us a blank screen. The same stick works in the R610, in VMs and on other machines. We have a ticket open with Dell support and they have yet to resolve the problem. Booting over PXE or CD/DVD are our only options at this point.

Bad MAC Addresses

Intel® Ethernet Server Adapter I350-T4: the igb kernel module in 2.6.35 and 2.6.38 will detect this card and it will get you connectivity, but it behaves oddly. For example, 3 or 4 of the ports will have the same MAC address. You need to download, compile, and install the latest igb sources from Intel before you get full functionality out of your I350-T4. The other option is to install Ubuntu Precise (12.04), as its 3.2 kernel has the updated drivers from Intel.
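
A quick way to spot the duplicate MAC address problem:

# Print each interface and its MAC address; duplicates stand out immediately
for nic in /sys/class/net/*; do
    echo "$(basename "$nic"): $(cat "$nic/address")"
done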

DRHD: handling fault status reg: at some point during boot of a freshly installed Ubuntu with the 2.6.35 kernel, we ran into an error that would loop endlessly and cause the R620 to become unresponsive. We got this:

DRHD INTR-REMAP problems

DRHD: handling fault status reg 502
INTR-REMAP: Request device [42:00.1] fault index b0
INTR-REMAP: [] Present field in the IRTE entry is clear

and it would endlessly print that to the console. This apparently has something to do with the IO-MMU part of the kernel dealing with interrupt remapping. Whatever the problem was, it was fixed in the 2.6.38 kernel and caused no more problems.

Dell SSD: the SSDs are rebranded Samsung drives which do not support TRIM but are at least over-provisioned. These drives have a problem with smartctl: while there is SMART information, the drive itself doesn't (yet) exist in the drivedb.h file. You have to use the latest smartctl version (5.42) to get anything useful out of the drive. Older versions give you things like this:

Log Sense failed, IE page

hdparm: this and other tools like smartctl and lshw have issues getting the required data through the PERC H310, even in pass-through mode. You have to use the latest version of each to even read the serial number off a HDD or SSD. hdparm versions >= 9.37 work; otherwise you get this:

root@node:~# hdparm -I /dev/sda

/dev/sda: HDIO_DRIVE_CMD(identify) failed: Invalid exchange
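
With recent enough versions of the tools, the basics work again; for example (the device name is illustrative):

# Read the identity data, including the serial number, with hdparm >= 9.37
hdparm -I /dev/sda | grep -i 'serial number'

# Drive identity and SMART attributes via a recent smartctl
smartctl -i /dev/sda
smartctl -A /dev/sda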

Once we got all the little inconveniences out of the way, we got down to benchmarking and performance testing. Compared to the Dell R610's 2x Xeon(R) E5606, the R620's 2x Xeon(R) E5-2643 delivered double the CPU performance in our testing. The obvious bottleneck is the two 2-port 10Gbps NICs: even at a theoretical maximum of 40Gbps we would, for our purposes, be network bound. Thankfully there is another PCI-Express slot available, just in case.

Finding and making a network subnet calculator


Doing subnet calculations by hand can be tedious, and thankfully there are tools available online to help with that. One in particular, a Subnet Calculator with a PHP backend, was handy and compact.

I sent a few requests asking if I could rewrite it to run client side, so that it could be used in intranet applications or even be ported to a hand-held application. After a few months of waiting with no response, I repurposed some of their CSS and layout and wrote the JavaScript equivalent.

Mindwerks's Subnet Calculator

The code is released as open source and can be reused per the terms of the license. It is part of the WP-Mindwerks WordPress plugin, but you can also use it as a standalone webpage.
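
The underlying math is straightforward bitwise arithmetic on the address and prefix length. As a rough illustration (not the plugin's JavaScript, just the same calculation sketched in shell with example inputs):

#!/bin/bash
# Example inputs: change these to the address and prefix you care about
ip="192.168.1.130"; prefix=26

# Convert the dotted quad to a 32-bit integer
IFS=. read -r a b c d <<< "$ip"
addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))

mask=$(( 0xFFFFFFFF << (32 - prefix) & 0xFFFFFFFF ))
network=$(( addr & mask ))
broadcast=$(( network | (~mask & 0xFFFFFFFF) ))

to_quad() { printf '%d.%d.%d.%d' $(( $1 >> 24 & 255 )) $(( $1 >> 16 & 255 )) $(( $1 >> 8 & 255 )) $(( $1 & 255 )); }

echo "Network:   $(to_quad "$network")/$prefix"
echo "Broadcast: $(to_quad "$broadcast")"
echo "Hosts:     $(( (1 << (32 - prefix)) - 2 ))"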


Switch your switch with switches


It is not unusual for me to find 10,000 Euro worth of networking equipment on my desk one day. It usually means that I have a long week of reading and testing ahead of me, as I am the only person in the company, let alone the building, who has ever seen or worked with these devices before. That means I am on my own, aside from an Internet connection.

While your mileage may vary, I've had the joy (and horror) of testing these devices as 'drop in replacements' in the test environment that we are using. In some instances things just worked out of the box; however, a few devices needed to be poked a few times to get things moving.

Manufacturers of routers and switches I have tested:

  • Arista: We were able to test the Arista 7000 to validate it against our setup. It was painless to install and, without tweaking, handed Cisco's 3560-X its hat. The latency (cut-through as opposed to store-and-forward) helped and the overall throughput was great.
  • Brocade: The SSE-X24S was an interesting 24 port 10Gbps switch; we were, however, unable to reach the same level of performance as with the Arista.
  • Cisco: Tried and tested Cisco and their IOS, which every other switch tries to emulate. Everything we tried to do with Cisco devices just worked. Even better, they have abandoned the serial console connector and gone USB, which cannot be said for the rest of the devices we tested. We were limited to just two 10G ports per 3560 switch, which is a shame.
  • Extreme Networks: After meeting with their technical sales people, they (twice) gave us the wrong equipment. We asked for a 48 port 1Gbit (Cat5) switch with 4 10Gbps SFP+ ports and received a 48 port 1Gbit SFP switch, which was later replaced with what we needed but with only 2 10Gbit SFP+ ports. After a month of being unable to test their product(s) in our environment, we stopped dealing with Extreme Networks and their purple hardware.
  • Force10: lived up to its name as a force to be reckoned with. Nothing worked at first, as all the ports are shut down by default. You must first use a USB-to-serial adapter with a rollover cable to get in, 'no shutdown' your ports and then tell them they are 'switchports', at which point they start switching packets. Same level of performance as the Arista, meaning that we could saturate our 10Gbps NICs.
  • Interface Masters: the Niagara 2924-24TG switch was the latest to be tested and there isn't much information about them online. Their switches (24x 1Gbps and 24x 10Gbps) were comparable to, if slightly less performant than, Force10. A plus is that they allow you to re-brand their products. We experienced a hardware fault in one of the 10Gbps ports, so we used another port and continued testing. Their technical support was good, but we never did a deep dive into why that one port had so many problems. They sent us a new switch to test and everything worked out.

There is also an issue with SFP+ cable length and Intel cards:

This issue has been witnessed at Intel's POC using a 10Gbps Intel NIC and an Arista DCS-7048T-A switch.

The INTEL LAN department has analyzed that there is a known issue with short SFP+ cables, leading to a flicker in the network signal. Apparently there needs to be a minimum length of 3m.

Replacing all 0.5 m cables with 3m cables solves the issue.

For those looking for numbers and benchmarks: I'm not allowed to post the results. Needless to say, they are not far off from the experiences I've listed above. I'm now a fan of Force10 for being reliable and performant. Arista ranks up there as well, and Interface Masters (a total unknown) can be competitive and allows re-branding, which is interesting for some companies.

A poor man's https, using ssh to secure web traffic


Sometimes you get a web-hosting environment that only serves non-SSL (http) content. If you need to do any kind of management through tools like phpMyAdmin, you can see the problem with this: all it would take is someone on your network or on the Internet sniffing the traffic to retrieve your username and password, and then they too can do a bit of "management" on your site.

If you also have secure shell (SSH) access, then there is a way to manage your site securely by using SSH's venerable port forwarding (SOCKS). The trick is to tell your management tools to only listen or respond to connections coming in over SSH instead of normal traffic. First you need to set up your SSH connection and configure your browser to use your newly created SOCKS proxy; please refer to my post about SSH proxies for more information.

The second part is to configure your application to only accept connections from itself, which is where your browser requests arrive after travelling through the secure tunnel. We can mask it a bit so that you have to look hard to see that there is something of interest going on there. It will also be ignored by Google and other search engines.

You can add this to your PHP code:

/* custom code to deny access to world */
if ($_SERVER["SERVER_ADDR"] != $_SERVER["REMOTE_ADDR"]) {
    header('HTTP/1.1 404 Not Found');
    exit();
}

If the remote IP (your request) is not the same as the server IP, then we return a 404 error; otherwise you get through to your application.
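
A quick way to verify the behaviour from the outside, assuming the SSH server and the web server are the same machine and a SOCKS proxy is already running on local port 1080 (the host name and port are illustrative):

# Direct request from the outside world: the guard code answers 404
curl -I http://example.org/phpmyadmin/

# Same request sent through the SSH SOCKS tunnel, so it reaches the
# application from the server's own address and is allowed through
curl -I --socks5-hostname 127.0.0.1:1080 http://example.org/phpmyadmin/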

SSH as a socks proxy


Recently there was a need to visit a US-based website to verify some personal information. Apparently there are 'rules' about who is geographically allowed to access the site, which means that a citizen of said country cannot access the site from outside the US.

I will not get into the absurdity of such security mandates; instead we will go around the problem and get the information that bureaucracy tried to keep from us.

The general idea is to use a proxy inside the US that will allow us to hop over the geographical firewall. I do not trust open proxies, because of their ability to sniff traffic, but I do have access to a secure shell (SSH) server in the US that I can use. Using this command:

ssh -D 8080 user@your-us-server

will open listening port 8080 on localhost (your computer). You can then use a web browser like Chrome from anywhere in the world and, through an encrypted tunnel, come out the other side on a network based in the US. You need to configure your web browser to use a 'SOCKS proxy' for this to work.

For Chrome, it is as easy as:

chrome --proxy-server="socks5://127.0.0.1:8080"

or you can follow this guide to setting up Chrome with SOCKS.
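
If you use the proxy often, you can persist the setup in your SSH client configuration instead of remembering the -D flag. A minimal sketch, where the host alias, server name and user are placeholders:

# Append a host entry so that "ssh -N us-proxy" brings up the SOCKS proxy
cat >> ~/.ssh/config <<'EOF'
Host us-proxy
    HostName your-us-server.example.com
    User your-user
    DynamicForward 8080
EOF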

To check that it is working, go to Google and ask "What is my IP".

For more detailed information, here is the relevant part of the ssh man page (the -D option):

Specifies a local “dynamic” application-level port forwarding. This works by allocating a socket to listen to port on the local side, optionally bound to the specified bind_address. Whenever a connection is made to this port, the connection is forwarded over the secure channel, and the application protocol is then used to determine where to connect to from the remote machine. Currently the SOCKS4 and SOCKS5 protocols are supported, and ssh will act as a SOCKS server.

Asymmetric networking using Linux


The idea is simple: two subnets (separate networks), with packets routed from one to the other. The environment, however, is not symmetric. We wanted to contact a node on the other subnet and we could see the packets travelling over the switch to the router and back through another switch to the node, but the node itself refused to reply. Each node has two NICs and each NIC is connected to a separate network. If you try to connect to or ping one node from another, Linux is smart enough to go directly over the NIC on the right network. If a NIC should ever fail, the failover is that the packets are routed up one network to the router and then over to the other network. The network looks something like this:

        ----(router)----
        |               |
        |               |
   | switch |__  __| switch |
        |      \/       |
        |      /\       |
        |     /  \      |
        |    x    \     |
   | node1 |/      \| node2 |

note: The x is the broken link.

Going from node1 to node2 is apparently not the problem; node2 just does not respond. This has to do with Reverse Path Filtering, which is enabled by default in the Linux kernel. From the manual:

By default, routers route everything, even packets which 'obviously' don't belong on your network. A common example is private IP space escaping onto the internet. If you have an interface with a route of 195.96.96.0/24 to it, you do not expect packets from 212.64.94.1 to arrive there. ... Basically, if the reply to this packet wouldn't go out the interface this packet came in, then this is a bogus packet and should be ignored.

Armed with this new knowledge, and acknowledging that this system will not be in an Internet-routable environment, we decided to turn off the filtering:

for i in /proc/sys/net/ipv4/conf/*/rp_filter ; do echo 0 > $i; done

This solved the problem and node2 could reply over its NIC to the other network without having to go back through the router.
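
The loop above only changes the running kernel. To make the setting survive a reboot, something along these lines works on distributions with a sysctl.d directory (the file name is arbitrary):

# Persist the rp_filter setting across reboots
cat > /etc/sysctl.d/60-rp-filter.conf <<'EOF'
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
EOF
sysctl --system   # reload all sysctl configuration files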