Tuesday, December 22, 2015

Big data is sometimes a big pain...

EDIT: scroll to bottom for a full script.

To start off, I want to explain the background of this project. I run an ADS-B receiver for FlightRadar24.com. If you don't know what ADS-B is, you should definitely check it out. It is the protocol that aircraft transponders use to broadcast their position and other data. Using an RTL-SDR module, available for around $30, you too can see local aircraft using a program such as dump1090.

Anyway, the equipment I host is a slightly more expensive, professional-grade receiver with much better decoding capabilities. Combined with the supplied antenna, it is much more accurate than an RTL-SDR receiver. There is a little box that sits on my desk (call it "the module"?) containing the receiver circuitry and a small ARM-based Linux system that sends data back to the main website for everyone to view. After installing it and confirming it was working, I pretty much ignored it.

That is, until recently, when I was debugging some port scanning code I'm writing. I was doing service scans on several hosts on my LAN, and decided to scan all 65535 ports of the module. Something then caught my eye:

$ python main.py 10.0.1.75 1-65535
Scanning 1 hosts (65535 ports per host)
Hosts up: 1
10.0.1.75 22 : SSH-2.0-OpenSSH_6.0

10.0.1.75 10685 : 
10.0.1.75 30003 : AIR,,333,1,AB14D9,101,2015/12/22,23:23:43.125,2015/12/22,23:23:43.125

10.0.1.75 30334 : 2RDa{]?/???

Ports 30003 and 30334 didn't show up in any port databases. Netcatting 30334 resulted in a bunch of unprintables (sample hexdump here), but 30003 ended up more promising:

MSG,3,333,8,780214,108,2015/12/22,23:33:14.4294967089,2015/12/22,23:33:14.4294967089,,35000,,,47.94058,-116.75212,,,0,0,0,0
MSG,4,333,8,780214,108,2015/12/22,23:33:14.4294967119,2015/12/22,23:33:14.4294967119,,,550.0,94.6,,,128,,,,,
MSG,3,333,2,A49441,102,2015/12/22,23:33:14.4294967122,2015/12/22,23:33:14.4294967122,,34975,,,47.56073,-113.82550,,,0,0,0,0
MSG,4,333,2,A49441,102,2015/12/22,23:33:14.4294967122,2015/12/22,23:33:14.4294967122,,,523.0,97.5,,,0,,,,,
MSG,4,333,3,A968DF,103,2015/12/22,23:33:14.4294967163,2015/12/22,23:33:14.4294967163,,,410.0,259.9,,,-64,,,,,
MSG,3,333,7,A94B48,107,2015/12/22,23:33:14.001,2015/12/22,23:33:14.001,,32975,,,46.94609,-115.34965,,,0,0,0,0
MSG,3,333,8,780214,108,2015/12/22,23:33:14.222,2015/12/22,23:33:14.222,,35000,,,47.94049,-116.75061,,,0,0,0,0
MSG,3,333,2,A49441,102,2015/12/22,23:33:14.230,2015/12/22,23:33:14.230,,34975,,,47.56056,-113.82371,,,0,0,0,0
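
The same capture can be done from Python instead of netcat. This is only a minimal sketch, assuming the feed on port 30003 is plain newline-terminated ASCII text, as the sample above suggests. The function name `read_lines` and its `limit` and `timeout` parameters are made up for illustration and are not part of any FlightRadar tool.

```python
import socket


def read_lines(host, port=30003, limit=10, timeout=5.0):
    """Yield up to `limit` non-empty text lines from a TCP line feed."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # makefile() wraps the socket in a file-like object so it can be
        # iterated line by line; undecodable bytes are replaced, not fatal.
        with sock.makefile("r", encoding="ascii", errors="replace") as feed:
            count = 0
            for line in feed:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                yield line
                count += 1
                if count >= limit:
                    return


# Example use (the address is the module's LAN address from the scan above):
#     for record in read_lines("10.0.1.75"):
#         print(record)
```

Stopping after `limit` lines keeps the sketch from running forever against a feed that never closes.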


It looks to me like a real-time output of the ADS-B decoder, in CSV format. The only problem is that I have no idea what the columns mean. After capturing several thousand lines of output, I tried to analyze the columns by finding the most common values in each column, using a long shell command:

$ for i in {1..30}; do echo "=============== column $i ==============="; cat flightradar_dump.csv | cut -d ',' -f $i  | sort | uniq -c | sort -rn | head -n20; done

=============== column 1 ===============
6667 MSG
   5 AIR
   4 STA
   3 ID
=============== column 2 ===============
2607 8
1662 4
1650 3
 541 5
 203 1
  12 
   4 6
=============== column 3 ===============
6679 333
=============== column 4 ===============
1595 18
1485 8
1284 20
1127 17
 808 21
 220 6
 125 7
  22 22
  13 19
=============== column 5 ===============
1595 AD4C2A
1485 780214
1284 A48272
1127 AE07FF
 808 AB013C
 220 A81077
 125 A94B48
  22 A53317
  13 AA1F97
=============== column 6 ===============
1595 118
1485 108
1284 120
1127 117
 808 121
 220 106
 125 107
  22 122
  13 119
=============== column 7 ===============
6679 2015/12/22
=============== column 8 ===============
   4 23:45:55.349
   3 23:46:14.257
   3 23:46:10.4294966829
   3 23:46:10.4294966821
   3 23:46:05.4294967104
   3 23:46:05.4294967081
   3 23:46:05.4294967073
   3 23:46:00.4294967272
   3 23:46:00.090
   3 23:45:55.373
   3 23:45:41.221
   3 23:45:41.214
   3 23:45:37.4294966801
   3 23:45:37.4294966793
   3 23:45:32.4294967084
   3 23:45:32.4294967076
   3 23:45:32.252
   3 23:45:28.4294966934
   3 23:45:27.070
   3 23:45:18.4294966925
=============== column 9 ===============
6679 2015/12/22
=============== column 10 ===============
   4 23:45:55.349
   3 23:46:14.257
   3 23:46:10.4294966829
   3 23:46:10.4294966821
   3 23:46:05.4294967104
   3 23:46:05.4294967081
   3 23:46:05.4294967073
   3 23:46:00.4294967272
   3 23:46:00.090
   3 23:45:55.373
   3 23:45:41.221
   3 23:45:41.214
   3 23:45:37.4294966801
   3 23:45:37.4294966793
   3 23:45:32.4294967084
   3 23:45:32.4294967076
   3 23:45:32.252
   3 23:45:28.4294966934
   3 23:45:27.070
   3 23:45:18.4294966925
=============== column 11 ===============
6469 
  48 CPA846
  48 AAL1519
  46 HIRE71
  35 N39WP
  22 SCX285
   4 UAL670
   2 SL
   2 RM
   1 SCX285
   1 N39WP
   1 AAL1519
=============== column 12 ===============
4484 
 630 38000
 452 35000
 308 43000
 280 35975
 215 36000
  98 37975
  73 43025
  57 35025
  23 32975
  21 42975
  13 33000
  11 24000
   5 37025
   5 37000
   4 3675
=============== column 13 ===============
5017 
 163 394.0
 124 393.0
 111 541.0
 103 400.0
 102 543.0
  94 410.0
  92 411.0
  80 396.0
  69 545.0
  67 542.0
  58 395.0
  57 399.0
  55 547.0
  49 407.0
  46 546.0
  42 540.0
  37 406.0
  37 401.0
  35 392.0
=============== column 14 ===============
5017 
 175 273.2
 132 270.3
 125 270.1
 111 272.8
 110 96.1
  96 270.4
  92 272.5
  87 273.3
  84 96.4
  82 96.5
  75 96.6
  51 273.1
  51 272.6
  46 96.2
  42 273.5
  40 272.4
  39 96.3
  27 96.7
  25 272.9
=============== column 15 ===============
5029 
  21 47.47307
  15 47.47311
  15 47.47302
  15 47.47298
  14 47.47293
  14 47.47284
  12 47.47275
  11 47.47243
  10 47.47289
  10 47.47285
  10 47.47270
  10 47.47266
   9 47.47279
   9 47.47234
   9 47.47220
   8 47.47304
   8 47.47183
   7 47.47309
   7 47.47261
=============== column 16 ===============
5029 
   2 -114.02531
   2 -114.01941
   2 -113.90083
   2 -113.83525
   2 -113.78294
   2 -113.61738
   2 -113.53587
   2 -113.51678
   2 -113.48582
   2 -113.46893
   2 -113.46729
   2 -113.46295
   2 -113.45102
   2 -113.42651
   2 -113.41482
   2 -113.35567
   2 -113.34924
   2 -113.29115
   2 -113.28186
=============== column 17 ===============
5017 
 880 0
 388 64
 308 -64
  74 -128
  12 128
=============== column 18 ===============
6675 
   2 1756
   2 1366
=============== column 19 ===============
4484 
2195 0
=============== column 20 ===============
5025 
1654 0
=============== column 21 ===============
4484 
2195 0
=============== column 22 ===============
4582 0
1865 
 220 1
  12 

In each block above, the number on the left is how many times the value on the right occurred in that column.
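
The same per-column tally can be sketched in Python with `collections.Counter`. This is an illustration rather than the command actually used above. `column_counts` is a made-up name, and the file name in the usage comment is the capture file from the shell command.

```python
import csv
from collections import Counter


def column_counts(lines, top=20):
    """Return {column_number: [(value, count), ...]}, columns numbered from 1."""
    counters = {}
    for row in csv.reader(lines):
        for index, value in enumerate(row, start=1):
            counters.setdefault(index, Counter())[value] += 1
    # most_common() sorts by descending count, like `sort | uniq -c | sort -rn`.
    return {index: counter.most_common(top)
            for index, counter in sorted(counters.items())}


# Example use:
#     with open("flightradar_dump.csv", newline="") as dump:
#         for column, values in column_counts(dump).items():
#             print("column", column, values)
```

One difference from the shell loop: a row that is shorter than a given column simply does not contribute to that column's counts here.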

Now allow me to guess using this example message, based on what I know:

MSG,3,333,7,A94B48,107,2015/12/22,23:33:14.001,2015/12/22,23:33:14.001,,32975,,,46.94609,-115.34965,,,0,0,0,0

Column 1 looks like the message type. MSG is the most common, but AIR, STA, and ID were also seen occasionally. No idea what the difference is.

Column 5 is almost certainly the airplane identification number.

Column 6 might be the message length.

Columns 7 and 9 are both date fields. I don't understand why there are two, as they are exactly the same.

Columns 8 and 10 are both time fields. Again they are the same, so having two is strange. In my capture, no line had different values in the two:

>>> for i in open('flightradar_dump.csv').readlines():
...     a = i.strip().split(',')
...     if a[7] != a[9]:
...             print 'different:',a[7],a[9]
(no output)

Column 11 appears to be the flight number, including the airline identifier.

Column 12, I would guess to be the plane's altitude. Whether this is meters or feet is unknown.

I'm relatively sure column 13 is speed in mph.

Column 14 could be the plane's direction in degrees.

Columns 15 and 16 are almost certainly latitude and longitude of the plane.

As for the rest of the columns, I have no idea. The internet was no help in this endeavour, though I might be able to find something out from FlightRadar support. I doubt they'd add this feature if they didn't intend people to use it.
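
To make the guesses above concrete, here is a minimal parsing sketch. The column meanings in `GUESSED_FIELDS` are only the guesses from this post, not a documented specification, and the field names, `NUMERIC`, and `parse_record` are all invented for illustration.

```python
# Column number -> field name. These are guesses, not a documented format.
GUESSED_FIELDS = {
    1: "message_type",
    5: "aircraft_id",
    7: "date",
    8: "time",
    11: "flight",
    12: "altitude",
    13: "speed",
    14: "heading",
    15: "latitude",
    16: "longitude",
}

# Fields that looked numeric in the capture.
NUMERIC = {"altitude", "speed", "heading", "latitude", "longitude"}


def parse_record(line):
    """Split one feed line into a dict; empty or missing columns become None."""
    columns = line.strip().split(",")
    record = {}
    for number, name in GUESSED_FIELDS.items():
        value = columns[number - 1] if number <= len(columns) else ""
        if value == "":
            record[name] = None
        elif name in NUMERIC:
            record[name] = float(value)
        else:
            record[name] = value
    return record
```

Run on the example message above, this gives an altitude of 32975.0 and a position of (46.94609, -115.34965), while the flight, speed, and heading fields come back as `None` because that message leaves them empty.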

EDIT: I've built a simple script to process the streamed data. Its interface is similar to dump1090's, if somewhat messy, but it uses the local feed data. Run it with:

python fr24_feed.py <ip of receiver>

Output looks something like this now:


ID:         Flight          Alt         Speed              Lat              Lon       Heading     Last seen   PPS
A7D3FE:                 11175ft           mph                E                N           deg          0sec     3
A58672:                 34950ft           mph                E                N           deg          1sec     0
A4CA6A:        752      35000ft      484.0mph      -115.41179E        46.45871N      107.2deg          0sec     7
A500CA:                 20450ft           mph                E                N           deg          0sec     2
89911D:                 38000ft           mph                E                N           deg          0sec     3
A5C4E2:         36      35000ft      466.0mph      -113.48465E        47.69261N       88.9deg          0sec     6
86DCE6:       JAL4      40975ft      507.0mph      -113.85406E        48.29695N       96.7deg          0sec     8
AB3FC8:                  8475ft           mph                E                N           deg          0sec     1
AB4508:                      ft           mph                E                N           deg         14sec     0

Bytes/sec: 3075  Packets/sec: 30  Avg bytes/pkt: 102

Planes visible: 9  Total seen: 9
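
A table like this has to merge partial messages, because (as the samples earlier show) some messages carry only a position and others only speed and heading. The following is a minimal sketch of that merging idea and is not the script itself. `AircraftTable`, its methods, and the 60-second `stale_after` cutoff are all hypothetical.

```python
import time


class AircraftTable:
    """Keep the latest known values per aircraft, merging partial messages."""

    def __init__(self, stale_after=60.0):
        self.stale_after = stale_after  # hypothetical cutoff, in seconds
        self.aircraft = {}

    def update(self, aircraft_id, fields, now=None):
        """Merge one message's fields into the entry for this aircraft."""
        now = time.time() if now is None else now
        entry = self.aircraft.setdefault(aircraft_id, {"messages": 0})
        # Only overwrite fields that this message actually carries.
        entry.update({k: v for k, v in fields.items() if v is not None})
        entry["messages"] += 1
        entry["last_seen"] = now
        return entry

    def visible(self, now=None):
        """Return the aircraft heard from within the last `stale_after` seconds."""
        now = time.time() if now is None else now
        return {k: v for k, v in self.aircraft.items()
                if now - v["last_seen"] <= self.stale_after}
```

With this shape, a "planes visible" count would be `len(table.visible())` and a "total seen" count would be `len(table.aircraft)`.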

In conclusion, the FlightRadar receiver is much more accurate than others I have seen. At a roughly calculated rate of 30-40 messages per second, doing a lot of processing on each message may be infeasible. Recording the feed may also be hard, as mine used about 150 kilobytes per minute of disk space.
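
For a rough sense of scale, the arithmetic can be sketched as below. It assumes a steady feed at the 3075 bytes/sec shown in the sample output, which is a single snapshot and not a long-term average. `storage_per_day` is a made-up helper name.

```python
def storage_per_day(bytes_per_second):
    """Return (bytes per minute, megabytes per day) for a steady feed rate."""
    per_minute = bytes_per_second * 60
    per_day_mb = bytes_per_second * 86400 / 1_000_000  # decimal megabytes
    return per_minute, per_day_mb


# 3075 bytes/sec is the snapshot rate from the sample output above.
per_minute, per_day_mb = storage_per_day(3075)
print(f"{per_minute} bytes/min, about {per_day_mb:.0f} MB/day")
```

At that snapshot rate the feed works out to about 184 kilobytes per minute, the same ballpark as the measured 150 kilobytes per minute, and to a few hundred megabytes per day if recorded uncompressed.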

Also here's the script:




