
Brain Games - SCORM/suspend_data and xAPI/state

For one of my customers, I provide an LMS (Learning Management System) which I built myself. It's a neat system that tracks users' progress through annual training. I won't go into much detail because it's an exclusive service I provide to only this one customer. Something I will divulge (because it's necessary to establish context for this post) is that it supports SCORM.

Something I enjoyed very much was writing a SCORM JavaScript/PHP interface. It's one of those love/hate things: I'm really proud of my JS/PHP interface, but I really hate SCORM. Specifically, I hate being required to overcome the limitations built into SCORM, like the CORS/cross-domain limitation...that was a tough one...it took a few days to think my way around it. I might even release the code for that (cross-domain SCORM hosting).

Adding insult to injury, the modules I'm working with are created with Articulate Storyline. The reason this makes it worse is that Articulate has decided to compress/encrypt the data being stored in the LMS. A segment of very useful data is blocked off behind a proprietary storage format. In their forums, they've remarked about how useful the data would be...right before they say "make a feature request"...which they've been saying for at least 4 years while their customers beg for access to the data.

"How is it useful?", you might ask. Well, it contains things like slides viewed, the answers to questions present in the module, time spent....really useful stuff for anyone running an LMS. Personally, I use it to display progress to the users watching the modules (by counting unique slides viewed and dividing by the total number of slides I can present the percentage of completion)....this isn't any proprietary information they're storing, it's really mundane data. In fact, the compression/encryption they've employed isn't a particularly good compression. They could probably reduce the complexity of their code by just storing it in JSON format....like everyone else does.

I only bring up Articulate because of a new requirement from my customer - xAPI/TinCan support. Like its predecessor, it has a data storage requirement, and like its predecessor, Articulate has compressed/encrypted the data...big surprise.

I asked my customer for a large xAPI module to harvest some data from, and harvest I have: 115k worth of state data destined for a server that wasn't paying any attention to it (because I haven't written the LRS code yet). I want to know what's in this data first, so that when I deploy xAPI next to SCORM for my customer, the transition will be seamless. The code I wrote to recover data from the SCORM modules doesn't work on xAPI state data, although the data looks very similar - so the SCORM decoder will get a relative that does the same job for xAPI state. My goal is to refine the process and find the true method to recover the viewed-slide data.

Here's a very small sample of the data I'll be obsessing over for the next few days - the state data ignored by my test server for the first 11 slides.

2M146070ji1001112a0101201112~201r100000000000000000000000000v_player.6RcLlMaxzl8.5yCw4XHlO1J1^1^0000000000000000000
2T16607080on1001212f010120111201212~201r300000000000000000000000000v_player.6RcLlMaxzl8.663onvqKo3V1^1^0000000000000000000
2_1860708090ts1001312k01012011120121201312~201r700000000000000000000000000v_player.6RcLlMaxzl8.6ccmrL9ioUo1^1^0000000000000000000
252a60708090a0yx1001412p0101201112012120131201412~201rf00000000000000000000000000v_player.6RcLlMaxzl8.5vbbOBcgbsm1^1^0000000000000000000
2c2c60708090a0b0DC1001512u010120111201212013120141201512~201rv00000000000000000000000000v_player.6RcLlMaxzl8.5rwUxY42BNH1^1^0000000000000000000
2j2e60708090a0b0c0IH1001612z01012011120121201312014120151201612~201r$00000000000000000000000000v_player.6RcLlMaxzl8.5rykj01k7mc1^1^0000000000000000000
2q2g60708090a0b0c0d0NM1001712E0101201112012120131201412015120161201712~201r$10000000000000000000000000v_player.6RcLlMaxzl8.6onHE6Mpydo1^1^0000000000000000000
2x2i60708090a0b0c0d0e0SR1001812J010120111201212013120141201512016120171201812~201r$30000000000000000000000000v_player.6RcLlMaxzl8.5oFwUyK7Xt21^1^0000000000000000000
2E2k60708090a0b0c0d0e0f0XW1001912O01012011120121201312014120151201612017120181201912~201r$70000000000000000000000000v_player.6RcLlMaxzl8.5XHqRZf2nMZ1^1^0000000000000000000
2O2m60708090a0b0c0d0e0f0g0~201$1001a12T0101201112012120131201412015120161201712018120191201a12~201r$f0000000000000000000000000v_player.6RcLlMaxzl8.6RCS0OVapMt1^1^0000000000000000000
2Y2o60708090a0b0c0d0e0f0g0h0~281~2411001b12Y0101201112012120131201412015120161201712018120191201a1201b12~201r$v0000000000000000000000000v_player.6RcLlMaxzl8.5wzXzefGwxp1^1^0000000000000000000

I've already identified a few different patterns in this data, so this won't take long.

Looking a little closer, the patterns change a bit with a larger dataset. I believe the data I'm looking for starts with the first "100" and ends with "~201r", so I'm just going to show that section from the last state stored for the module. Something I notice is that the first and last pieces of data match - for this line, "1c1g" is found immediately after the start marker and immediately before the end marker. I believe that's the last viewed slide, after looking at a section of data where I pressed "Next", then "Prev", then "Next" again. I haven't identified the next piece of data, but it's always followed by a "010", which I believe to be another kind of start marker.

Here's where it gets interesting. I don't think the delimiter is 1201 but 120...and I also suspect it isn't a fixed 120, but a marker that increments - possibly giving value to the numbers that follow. Whatever it is, it's 3 digits, because the numbers following it form a pattern: 11,12,13,14,15,16,17,18,19,1a,1b,1c. Converted from hex, they're 17,18,19.... It doesn't make sense to start with 11, though - and where the module transitions from 1.13 to 2.1, the delimiter changes to 130 and there's an anomalous 10 (16 in decimal). The first set starts with 11, and all of the rest of the sets start with 10. I also know it's not plain hex - the numbers extend beyond f - but I haven't discovered an upper limit yet, so I'll keep treating it as hex+ until I establish one (which may require asking someone to build me a module with a section holding more than...62? slides, so I can find out what happens when 0-9, a-z and A-Z are exhausted).

1001c1g~2Gc0101201112012120131201412015120161201712018120191201a1201b1201c120101301113012130131301413015130161301713018130191301a1301b1301c1301d1301e1301f1301g1301h1301i1301j1301k130101401114012140131401414015140161401714018140101501115012150131501415015150161501715018150191501a1501b1501c1501d1501e15010160111601216013160141601516016160171601017011170121701317014170151701018011180121801318014180151801618017180101901119012190131901419015190101a0111a0121a0131a0141a0151a0161a0171a0181a0191a01a1a0101b0111b0121b0131b0141b0151b0161b0171b0181b0191b01a1b01b1b01c1b01d1b01e1b01f1b01g1b01h1b01i1b01j1b01k1b01l1b01m1b01n1b01o1b01p1b01q1b01r1b01s1b01t1b0101c0111c0121c0131c0141c0151c0161c0101d0111d0121d0101e0111e0101f0111f0121f0131f0141f0151f0161f0171f0181f0191f0101g0111g0121g0131g0141g0151g0161g0171g0181g0191g01a1g01b1g01c1g~201r

I think I got it wrong again...but the more I look at this, the more it makes sense. The start markers haven't changed, except that second 010 "delimiter"...that's actually data! The first section does start with a "10" like the rest - I just didn't see it at first. I believe the actual delimiter is 0, and each record is 4 characters: two pairs of hex(ish) digits. The progression of numbers suggests the page comes first and the section second. That makes the first entry (1012, or 10/12) translate to 16,18 - followed by 17,18 and 18,18.
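
A quick check of that reading in PHP (treating the pairs as hex for now - I revise this below):

// 4-character record: first pair = page, second pair = section (hex-for-now reading)
list($page, $section) = str_split('1012', 2);
echo hexdec($page) . ',' . hexdec($section); // prints 16,18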

I have a feeling that the first piece of data after the last viewed slide is a counter - time in module, maybe. The first entry is a0 and the last is ~2Gc. I need to figure out what base this number is. Obviously, it contains ~, alpha upper and lower, as well as numeric. If I assume it's base 63 (0-9a-zA-Z~), the first value stored (a0) is equal to 630 milliseconds...and that doesn't sound out of the realm of possibility. Time to click next...maybe time to load the first slide... The last value, ~2Gc, is equal to 258.5585 minutes...if my assumption is correct. I'll have to test this theory.
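
A quick sketch to check that arithmetic (the 0-9a-zA-Z~ digit ordering is my assumption):

// Base-63 guess: digits ordered 0-9, a-z, A-Z, ~ (assumed ordering)
function decode63($s) {
 $digits = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~';
 $value = 0;
 foreach (str_split($s) as $c) {
  $value = ($value * 63) + strpos($digits, $c);
 }
 return $value;
}
echo decode63('a0');           // 630 - milliseconds?
echo decode63('~2Gc') / 60000; // 258.5585 - minutes, if this is time at all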

a0, f0, k0, p0.....~2Gc - it increments at regular intervals: a to f = 5, f to k = 5, k to p = 5 - so it has nothing to do with time. After the last entry above (Y0) it jumps to ~21 - so I think it's a flag - but what are the 4 digits between Y and 2? I think it's safe to assume there's a Z and a 1 in there - Z??12. Time spent would be nice to gather, but this isn't it, and it doesn't seem to have anything to do with the data I'm interested in (slides viewed) - so I'll ignore it for now.....back to the task at hand: slides.

At this point, I can work up some regex to isolate the data I'm interested in. This should get me to the start of the dataset: 100(.*?)[0-9a-zA-Z]{4}[~]{0,1}[0-9a-zA-Z]{1,}?(?=010) - and the end of the set is the constant ~201r, with the dataset in between starting with 0....easy.

preg_match('/100(.*?)[0-9a-zA-Z]{4}[~]{0,1}[0-9a-zA-Z]{1,}?(?=010)(?P<data>(.*?))~201r/',$data,$matches);
print_r($matches);
Array
(
[0] => 1001112a0101201112~201r
[1] =>
[data] => 0101201112
[2] => 0101201112
[3] => 0101201112
)

Now I can access just the data I want to work with in $matches['data'].

Because the data is delimited with 0 and also contains 0 within some elements - I can't use explode. There are other ways.

preg_match_all('/0(?=[0-9a-zA-Z])(?P<item>[0-9a-zA-Z]{4})/',$matches['data'],$items);
print_r($items);
Array
(
[0] => Array
(
[0] => 01012
[1] => 01112
)

[item] => Array
(
[0] => 1012
[1] => 1112
)

[1] => Array
(
[0] => 1012
[1] => 1112
)

)

And now I have an array of viewed slides. It may contain duplicates, so if you need a count rather than breadcrumbs, you can use array_unique to clean it up. Here's the function I'm using in my system.

function slides($data) {
 // isolate the slide dataset between the "100" preamble and the ~201r end marker
 preg_match('/100(.*?)[0-9a-zA-Z]{4}[~]{0,1}[0-9a-zA-Z]{1,}?(?=010)(?P<data>(.*?))~201r/',$data,$matches);
 // split the dataset on its 0 delimiters into 4-character page/section records
 preg_match_all('/0(?=[0-9a-zA-Z])(?P<item>[0-9a-zA-Z]{4})/',$matches['data'],$items);
 return $items['item'];
}
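
A quick usage sketch - this is how I'd derive the completion percentage mentioned earlier ($totalSlides is assumed to come from the module itself):

// Completion percentage: unique slides viewed vs. total ($totalSlides assumed known)
$viewed = count(array_unique(slides($data)));
printf("%.1f%% complete\n", 100 * $viewed / $totalSlides);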

By the way - this should just as easily decode Articulate SCORM suspend_data to retrieve the slide count....or whatever you modify the function to retrieve. Happy trails!
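
For SCORM content, you'd feed in whatever your LMS persisted from cmi.suspend_data. A hypothetical retrieval sketch (the table and columns are placeholders - your storage will differ):

// Hypothetical retrieval - adjust to your own LMS schema
$stmt = $pdo->prepare('SELECT suspend_data FROM scorm_tracking WHERE user_id = ? AND sco_id = ?');
$stmt->execute(array($userId, $scoId));
$viewed = slides($stmt->fetchColumn()); // the same decoder works on suspend_data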

Update - more data has become available. My customer agreed to send me a test module with 65 slides within one section. This should show me the base numbering system used in the stored items. I'll update when I have an accurate picture of that data.

This new data is interesting. 65 slides proved not to be enough, so the module was bumped to 85 slides, and that revealed the base number system has 64 digits (but it's not standard base64) - as a regex class, [0-9a-zA-Z_$]. For some reason I thought it would be bigger, but I'll take what I can get. Something else is interesting about this data: the ~201r end marker is apparently not an end marker after all. I'll have to review each of the state submissions to find where it disappears. Also, the page/section digits are not always 4 digits. When the page rolls over at $, it becomes 3 digits, making the page/section 5 digits - so I'll need to adjust my regex to accommodate the possibility of changing lengths. Knowing the entire charset of the numbering system will let me decode everything, but I'll have to come up with a system to manage the decoded numbers to accommodate a change in the length of digits. Here's what an 85-slide section's data looks like. I'm leaving this uncolored because coloring takes a long time by hand (as I said, I haven't rewritten the regex to cut this up automatically yet).

2Ha~2G260708090a0b0c0d0e0f0g0h0i0j0k0l0m0n0o0p0q0r0s0t0u0v0w0x0y0z0A0B0C0D0E0F0G0H0I0J0K0L0M0N0O0P0Q0R0S0T0U0V0W0X0Y0Z0_0$001112131415161718191a1b1c1d1e1f1g1h1i1j1k1l1m1n1o1p1q1~2e7~2a71002k112~2_60101201112012120131201412015120161201712018120191201a1201b1201c1201d1201e1201f1201g1201h1201i1201j1201k1201l1201m1201n1201o1201p1201q1201r1201s1201t1201u1201v1201w1201x1201y1201z1201A1201B1201C1201D1201E1201F1201G1201H1201I1201J1201K1201L1201M1201N1201O1201P1201Q1201R1201S1201T1201U1201V1201W1201X1201Y1201Z1201_1201$1202011202111202211202311202411202511202611202711202811202911202a11202b11202c11202d11202e11202f11202g11202h11202i11202j11202k112B0v_player.6crDQVV0N7p.5dnr7ivoCLN1^1^00000

This is going to become a set of functions to not only decode and list, but also translate.

OK, I've had some time to play with the data, and I've discovered that the ~201r ending delimiter is not always present. It seems to appear only once you have a certain amount of data, so I can't rely on it being there. After adjusting the regex, I get a reliable result on both datasets.

function slides($data) {
 $matches = array(); // to make my IDE happy
 $items = array(); // to make my IDE happy
 // the preamble after "100" repeats the final record - capture it as <end>, then find it again to mark the end of the data
 $dataregex = '/100(?P<end>[0-9a-zA-Z_$]{4,5}(?=[0-9a-zA-Z_$~]+010))(.*?)(?=010)(?P<data>(.*?)\k<end>)/';
 // records are 4 characters between 0 delimiters (5 once the page rolls past $)
 $itemregex = '/0(?=[0-9a-zA-Z_$])(?P<item>[0-9a-zA-Z_$]{4,}?(?=(0|$)))/';
 preg_match($dataregex, $data, $matches);
 preg_match_all($itemregex, $matches['data'], $items);
 return $items['item'];
}

$slides = slides($data);
print_r($slides);
Array
(
[0] => 1012
[1] => 1112
[2] => 1212
[3] => 1312
[4] => 1412
[5] => 1512
[6] => 1612
[7] => 1712
[8] => 1812
[9] => 1912
[10] => 1a12
[11] => 1b12
[12] => 1c12
[13] => 1d12
[14] => 1e12
[15] => 1f12
[16] => 1g12
[17] => 1h12
[18] => 1i12
[19] => 1j12
[20] => 1k12
[21] => 1l12
[22] => 1m12
[23] => 1n12
[24] => 1o12
[25] => 1p12
[26] => 1q12
[27] => 1r12
[28] => 1s12
[29] => 1t12
[30] => 1u12
[31] => 1v12
[32] => 1w12
[33] => 1x12
[34] => 1y12
[35] => 1z12
[36] => 1A12
[37] => 1B12
[38] => 1C12
[39] => 1D12
[40] => 1E12
[41] => 1F12
[42] => 1G12
[43] => 1H12
[44] => 1I12
[45] => 1J12
[46] => 1K12
[47] => 1L12
[48] => 1M12
[49] => 1N12
[50] => 1O12
[51] => 1P12
[52] => 1Q12
[53] => 1R12
[54] => 1S12
[55] => 1T12
[56] => 1U12
[57] => 1V12
[58] => 1W12
[59] => 1X12
[60] => 1Y12
[61] => 1Z12
[62] => 1_12
[63] => 1$12
[64] => 20112
[65] => 21112
[66] => 22112
[67] => 23112
[68] => 24112
[69] => 25112
[70] => 26112
[71] => 27112
[72] => 28112
[73] => 29112
[74] => 2a112
[75] => 2b112
[76] => 2c112
[77] => 2d112
[78] => 2e112
[79] => 2f112
[80] => 2g112
[81] => 2h112
[82] => 2i112
[83] => 2j112
[84] => 2k112
)

Now that I have the character set for the base numbering system, I'll look into writing a function (or functions) to decode it into something useful. The only problem I foresee is the increasing number of digits without a delimiter. I'll need to track the previous section ID during the conversion. My solution will use my AnyBase PHP class, available on GitHub here: https://github.com/stutteringp0et/AnyBase
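
If you'd rather skip the dependency, the decode direction is just positional notation. A minimal stand-in for the decode call (my sketch, not the actual AnyBase code):

// Minimal stand-in for AnyBase::decode - a sketch, not the real class
function anybaseDecode($charset, $number) {
 $base = strlen($charset);
 $value = 0;
 foreach (str_split($number) as $digit) {
  $value = ($value * $base) + strpos($charset, $digit);
 }
 return $value;
}
echo anybaseDecode('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$', '2a'); // 138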

Update: The most satisfying thing about doing this kind of work is the feeling of gaining understanding over something unknown. Even when I prove myself and my previous assumptions wrong, it's very satisfying to make progress. I mention this because my previous statement about there being no delimiter in the >4-digit page/section was entirely wrong. Using the now-known base number system, I found that there is indeed a delimiter and the page/section remains 4 digits: when the page exceeds 64, a delimiter of "1" is added between the page and the section. Articulate claims this encoding manages the size of the data - but the additional delimiter is completely unnecessary and only adds to the size of the output. A 2-digit value in their 64-character base number system tops out at 4031 pages and sections (each) before either needs to grow to 3 digits.
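
To make the delimiter concrete, here's "2a112" from the dataset above decoded by hand, using the offsets that appear in the translator below:

// "2a112" = page "2a" + delimiter "1" + section "12"
$a64 = new AnyBase('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$');
echo $a64->decode('2a') - 63; // 75 - the page
echo $a64->decode('12') - 65; // 1 - the section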

I've almost finished the translator for the slide array. This has been really fun.

function translateSlides($slides) {
 $a64 = new AnyBase('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$');
 $r = array();
 foreach($slides as $slide) {
  switch(strlen($slide)) {
   case 4: // page and section are 2 digits each
    list($page64,$section64) = str_split($slide,2);
    break;
   default: // 5 digits - skip the "1" delimiter between page and section
    $section64 = substr($slide,-2);
    $page64 = substr($slide,0,2);
    break;
  }
  // offsets found by comparison: sections begin at decode value 66, pages at 64
  $r[$slide] = array('section'=>($a64->decode($section64)-65),'page'=>($a64->decode($page64)-63));
 }
 return $r;
}
$translated = translateSlides($slides);
print_r($translated);
Array
(
[1012] => Array
(
[section] => 1
[page] => 1
)

[1112] => Array
(
[section] => 1
[page] => 2
)

[1212] => Array
(
[section] => 1
[page] => 3
)

[1312] => Array
(
[section] => 1
[page] => 4
)

[1412] => Array
(
[section] => 1
[page] => 5
)

[1512] => Array
(
[section] => 1
[page] => 6
)

[1612] => Array
(
[section] => 1
[page] => 7
)

[1712] => Array
(
[section] => 1
[page] => 8
)

[1812] => Array
(
[section] => 1
[page] => 9
)

[1912] => Array
(
[section] => 1
[page] => 10
)

[1a12] => Array
(
[section] => 1
[page] => 11
)

[1b12] => Array
(
[section] => 1
[page] => 12
)

[1c12] => Array
(
[section] => 1
[page] => 13
)

[1d12] => Array
(
[section] => 1
[page] => 14
)

[1e12] => Array
(
[section] => 1
[page] => 15
)

[1f12] => Array
(
[section] => 1
[page] => 16
)

[1g12] => Array
(
[section] => 1
[page] => 17
)

[1h12] => Array
(
[section] => 1
[page] => 18
)

[1i12] => Array
(
[section] => 1
[page] => 19
)

[1j12] => Array
(
[section] => 1
[page] => 20
)

[1k12] => Array
(
[section] => 1
[page] => 21
)

[1l12] => Array
(
[section] => 1
[page] => 22
)

[1m12] => Array
(
[section] => 1
[page] => 23
)

[1n12] => Array
(
[section] => 1
[page] => 24
)

[1o12] => Array
(
[section] => 1
[page] => 25
)

[1p12] => Array
(
[section] => 1
[page] => 26
)

[1q12] => Array
(
[section] => 1
[page] => 27
)

[1r12] => Array
(
[section] => 1
[page] => 28
)

[1s12] => Array
(
[section] => 1
[page] => 29
)

[1t12] => Array
(
[section] => 1
[page] => 30
)

[1u12] => Array
(
[section] => 1
[page] => 31
)

[1v12] => Array
(
[section] => 1
[page] => 32
)

[1w12] => Array
(
[section] => 1
[page] => 33
)

[1x12] => Array
(
[section] => 1
[page] => 34
)

[1y12] => Array
(
[section] => 1
[page] => 35
)

[1z12] => Array
(
[section] => 1
[page] => 36
)

[1A12] => Array
(
[section] => 1
[page] => 37
)

[1B12] => Array
(
[section] => 1
[page] => 38
)

[1C12] => Array
(
[section] => 1
[page] => 39
)

[1D12] => Array
(
[section] => 1
[page] => 40
)

[1E12] => Array
(
[section] => 1
[page] => 41
)

[1F12] => Array
(
[section] => 1
[page] => 42
)

[1G12] => Array
(
[section] => 1
[page] => 43
)

[1H12] => Array
(
[section] => 1
[page] => 44
)

[1I12] => Array
(
[section] => 1
[page] => 45
)

[1J12] => Array
(
[section] => 1
[page] => 46
)

[1K12] => Array
(
[section] => 1
[page] => 47
)

[1L12] => Array
(
[section] => 1
[page] => 48
)

[1M12] => Array
(
[section] => 1
[page] => 49
)

[1N12] => Array
(
[section] => 1
[page] => 50
)

[1O12] => Array
(
[section] => 1
[page] => 51
)

[1P12] => Array
(
[section] => 1
[page] => 52
)

[1Q12] => Array
(
[section] => 1
[page] => 53
)

[1R12] => Array
(
[section] => 1
[page] => 54
)

[1S12] => Array
(
[section] => 1
[page] => 55
)

[1T12] => Array
(
[section] => 1
[page] => 56
)

[1U12] => Array
(
[section] => 1
[page] => 57
)

[1V12] => Array
(
[section] => 1
[page] => 58
)

[1W12] => Array
(
[section] => 1
[page] => 59
)

[1X12] => Array
(
[section] => 1
[page] => 60
)

[1Y12] => Array
(
[section] => 1
[page] => 61
)

[1Z12] => Array
(
[section] => 1
[page] => 62
)

[1_12] => Array
(
[section] => 1
[page] => 63
)

[1$12] => Array
(
[section] => 1
[page] => 64
)

[20112] => Array
(
[section] => 1
[page] => 65
)

[21112] => Array
(
[section] => 1
[page] => 66
)

[22112] => Array
(
[section] => 1
[page] => 67
)

[23112] => Array
(
[section] => 1
[page] => 68
)

[24112] => Array
(
[section] => 1
[page] => 69
)

[25112] => Array
(
[section] => 1
[page] => 70
)

[26112] => Array
(
[section] => 1
[page] => 71
)

[27112] => Array
(
[section] => 1
[page] => 72
)

[28112] => Array
(
[section] => 1
[page] => 73
)

[29112] => Array
(
[section] => 1
[page] => 74
)

[2a112] => Array
(
[section] => 1
[page] => 75
)

[2b112] => Array
(
[section] => 1
[page] => 76
)

[2c112] => Array
(
[section] => 1
[page] => 77
)

[2d112] => Array
(
[section] => 1
[page] => 78
)

[2e112] => Array
(
[section] => 1
[page] => 79
)

[2f112] => Array
(
[section] => 1
[page] => 80
)

[2g112] => Array
(
[section] => 1
[page] => 81
)

[2h112] => Array
(
[section] => 1
[page] => 82
)

[2i112] => Array
(
[section] => 1
[page] => 83
)

[2j112] => Array
(
[section] => 1
[page] => 84
)

[2k112] => Array
(
[section] => 1
[page] => 85
)
)

To wrap it all up in a nice package, here's an abstract class:
 
/*  @copyright Copyright (C) 2013 - 2018 Michael Richey. All rights reserved.
 *  @license GNU General Public License version 3 or later
 */
 
abstract class decodeStateData {
 
 public static function slides($data) {
  $matches = array();
  $items = array();
  $dataregex = '/100(?P<end>[0-9a-zA-Z_$]{4,5}(?=[0-9a-zA-Z_$~]+010))(.*?)(?=010)(?P<data>(.*?)\k<end>)/';
  $itemregex = '/0(?=[0-9a-zA-Z_$])(?P<item>[0-9a-zA-Z_$]{4,}?(?=(0|$)))/';
  preg_match($dataregex, $data, $matches);
  preg_match_all($itemregex, $matches['data'], $items);
  return $items['item'];
 }
 
 public static function translateSlides($slides) {
  require_once('anybase.php');
  $a64 = new AnyBase('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$');
  $r = array();
  foreach($slides as $slide) {
   switch(strlen($slide)) {
    case 4:
     list($page64,$section64) = str_split($slide,2);
     break;
    default: // anything else
     $section64 = substr($slide,-2);
     $page64 = substr($slide,0,2);
     break;
   }
   $r[$slide] = array('section'=>($a64->decode($section64)-65),'page'=>($a64->decode($page64)-63));
  }
  return $r;
 }
}

Using it is pretty easy:

// First, read your data into the slides method
$slides = decodeStateData::slides($data);
// if you only need the slide count
$numslides = count(array_unique($slides));
// If you want to see the actual slide IDs
$detail = decodeStateData::translateSlides($slides);

I hope this helps you get past the overprotective nonsense. This encoding scheme isn't anything particularly awesome, and it's definitely not something worth protecting as fiercely as it has been. People have been asking for a way to read this data for 4+ years, and Articulate has continued to deny that request. This set of functions will serve to grant access to the data, and perhaps convince Articulate to abandon their death grip on it and just store it in a standardized format.

My functions are written in PHP, because that's what my LMS is written in - but I could easily convert this to other languages if needed....that won't be free though ;)

Final thoughts:

There are still some encoded segments that I haven't identified yet. Some make me scratch my head and ask "why?" (like an alphabet that seems to mirror the progression of sections...what's the purpose of that?). There are several pieces of data that increment at a regular pace (some by 7 per update, some by 5, some by 3). Still other data seems to change with no regularity at all....or perhaps it's too complex for me to identify with my monkey brain.

Did I mention that I write and host custom software for most of my customers? What can I do for you?

Update 10/4/2018 - I've rewritten the Regular Expressions to accommodate old and new data storage methods. This covers xAPI as well as the older SCORM output from various versions of Storyline.

I noticed that the beginning of the data section contains the last data element - sort of a preamble that identifies where the end of the data is. Using that information, I was able to construct a new regex that captures the ending data element and then looks for it later in the expression.

2M146070ji1001112a0101201112~201r100000000000000000000000000v_player.6RcLlMaxzl8.5yCw4XHlO1J1^1^0000000000000000000
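
In the sample above, the "1112" right after the "100" start marker is that preamble: it's the final record, and it shows up again just before ~201r. That's exactly what the named group and backreference in $dataregex are doing - capture the preamble as <end>, then require the data to run up to a second occurrence of it. A toy illustration of the idea (not the real state format):

// Toy illustration of capture-then-backreference (not the real state format)
preg_match('/START(?P<end>\w{3}).*?(?P<data>.*?\k<end>)/', 'STARTxyz-abc-xyz', $m);
echo $m['data']; // "-abc-xyz" - located by re-finding the "xyz" preamble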

Copyright 2018 - Michael Richey

This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.