TL;DR: LFSR cycles every 65535 generated bits. Therefore, the one-time pad is reused every 65535 bits, and one can perform a crib dragging to recover (either manually or automatically) at least 256 bits of keystream. Once that is done, the entire keystream can be recovered.
Checking on the LFSR parameters, we can see it is of size 256. Since the LFSR is used as a simple stream cipher, any recovered keystream bit will roughly yield one seed bit, and we need about 256 of them to recover the entire keystream (or a little less, that can be completed with brute-force).
To get the information about how each character has been encoded, simply use this:
from dahuffman import HuffmanCodec
codec = HuffmanCodec.load('output.codec')
codec.print_code_table()
There are many ways to recover some bits of keystream, at some various computational costs. For instance:
- The flag format,
CWTE{<some ascii text>}
, can directly yield 65 bits of keystream forCWTE{
, and 15 additional bits if the length is guessed correctly with the closing bracket. However, this needs to guess the exact position of the encoded flag, which has multiple thousands possibilities. - All sentences starts with
The
and ends with.
. This is immediately 16 free bits at the very beginning of the stream, and 24 additional bits for guessing correctly the position of the end of a sentence. - Inspecting the codec, we find some rare letters, like
ñ
. In the wordlists used by wonderwords, only the nounjalapeño
contains this letter, showcasing the presence ofjalapeño
in the text. Guessing its position could reveal 54 additional bits of keystream.
Combining multiple one of these tricks could get you close to 256 bits of keystream, which is about what you would need to recover the seed of the LFSR. However, that is still quite computationnally hard.
Let’s check the other LFSR parameters. In particular we can use sage to find its cycle length. We notice that the polynomial is not at all irreducible, and the resulting LFSR will cycle every 65535 bits.
size = 256
taps = 0x8000000000000000000000000000000000000000000000000000000000000001 | (1 << size)
coeffs = [(taps >> (size-i)) & 1 for i in range(size + 1)]
F.<x> = PolynomialRing(GF(2))
polynomial = F(coeffs)
matrix = companion_matrix(polynomial)
print(matrix.multiplicative_order()) # 65535
Using this knowledge, we know the given ciphertext used a keystream that cycled at least 4 times. This is then a reused one-time pad! Checking the library used for the message, we can notice a lot of different patterns that can help performing a crib dragging.
- All sentences starts with
The
and ends with.
- Sentences uses only specific words, from different wordlists.
- Sentences are all 5 words long, and more precisely, in the format
The <adjective> <noun> <verb-at-present-tense> <noun>.
In order to help distinguishing sensible output sentences from random giberrish, we can build a set of all possible 4-grams that can be generated by this library:
from dahuffman import HuffmanCodec
from pickle import load, dump
codec = HuffmanCodec.load('output.codec')
def load_asset(filename):
words = set()
for word in open(filename, 'r'):
words.add(word.strip())
return words
# https://github.com/mrmaxguns/wonderwordsmodule/blob/master/wonderwords/random_sentence.py
VOWELS = ["a", "e", "i", "o", "u"]
def _present_tense(verb: str) -> str:
"""Convert a verb from the infinitive to the present tense 3rd person
form"""
verb = verb.strip().lower()
if verb.endswith(("ss", "ch", "x", "tch", "sh", "zz")):
verb = verb + "es"
elif verb.endswith("y") and not verb.endswith(
tuple([vowel + "y" for vowel in VOWELS])
):
verb = verb[:-1] + "ies"
else:
verb = verb + "s"
return verb
# https://github.com/mrmaxguns/wonderwordsmodule/tree/master/wonderwords/assets
adjectives = load_asset('adjectivelist.txt')
nouns = load_asset('nounlist.txt')
nounsd = {n + '.' for n in nouns}
verbs = load_asset('verblist.txt')
verbs = {_present_tense(verb) for verb in verbs}
the = {'The'}
sentences = [the, adjectives, nouns, verbs, nounsd, the]
grams = set() # all possible 4-letters sequences
for i in range(5):
print(len(sentences[i]) * len(sentences[i+1]))
for a in sentences[i]:
for b in sentences[i+1]:
seq = a + ' ' + b + ' '
for j in range(len(seq) - 3):
grams.add(seq[j:j+4])
dump(grams, open('grams.pkl', 'wb')) # dump for future use
This takes a minute to compute. Then, we can simply check if a sentence is sensible using the following snippet:
grams = load(open('grams.pkl', 'rb'))
print(len(grams), '4-grams loaded')
def isvalid(sentence):
return all(sentence[i:i+4] in grams for i in range(len(sentence) - 3))
Knowing the first sentence must start with "The "
, we can already
get the first 16 bits of keystream:
with open('output.txt', 'rb') as f:
ciphertext = f.read()
bitstream = [(c >> i) & 1 for c in ciphertext for i in range(7, -1, -1)]
order = 65535
cycles = len(bitstream) // order
start = codec.encode('The ')
crib = [(c >> i) & 1 for c in start for i in range(7, -1, -1)]
ks = [a ^ b for a, b in zip(crib, bitstream)]
(Note that this works nicely since the encoded crib is exactly 16 bits, therefore we do not need to truncate the last byte)
Since we have the first 16 bits of keystream, we may read some encoded text content 65535 bits later, and so on. However, their decoded content may be tricky to directly get as we may fall upon the middle of an encoded character.
The following function returns all the decoded text we can get from the
actual keystream, given the three offsets in a global list, or None
if one of the decoded text is unsensible.
offsets = [0,0,0,0]
def display(ks):
rs = []
for i in range(cycles):
start = i * order
m = [a ^ b for a, b in zip(bitstream[start:], ks)]
s = offsets[i]
m = m[s:]
r = []
for k in range(0, len(m) - 7, 8):
r.append(sum(b << (7-j) for j, b in enumerate(m[k:k+8])))
m = codec.decode(bytes(r))
if not isvalid(m):
return None
rs.append(m)
return rs
With the 16 first bits of keystream, one can already fiddle with the offsets to try to get the three unknown offsets right (offsets[0] is always 0).
With the offsets known, we can then backtrack for enough keystream with a simple branch and bound algorithm, which halts when one of the recovered text is not possible.
BEST = len(ks)
def backtrack(ks):
global BEST
if len(ks) > BEST:
print(len(ks), ' ', *ks, sep = '')
BEST = len(ks)
rs = display(ks)
print(*rs, sep = ' | ')
if len(ks) > 256 + 64: # safe margin for errors
return True
for i in range(2):
ks.append(i)
if display(ks) is not None:
if backtrack(ks):
return True
ks.pop()
return False
If we get more than 256 bits of keystream, we can halt and recover the full keystream directly using the PRNG instead.
def solve_with(ks):
seed = sum(k << i for i, k in enumerate(ks))
from challenge import PRNG
prng = PRNG()
prng.drbg.state = seed
result = []
for c in ciphertext:
r = next(prng)
result.append(c ^ r)
result = codec.decode(bytes(result))
print(result)
i = result.index('CWTE')
j = result.index('}')
print(result[i:j+1])
Here is a simple main code to brute-force over possible offsets until we find the flag.
mag = 4
for o1 in range(mag):
for o2 in range(mag):
for o3 in range(mag):
offsets = [0, o1, o2, o3]
print()
print(offsets)
if backtrack(ks):
solve_with(ks[:256])
exit(0)
Note that the cycling property of the LFSR was not the only thing that could allow one to perform a crib dragging.
Since
TAPS = 0x8000000000000000000000000000000000000000000000000000000000000001
we get keystream[i-255] ^ keystream[i] ^ keystream[i+1] == 0
,
therefore allowing one to get an extra bit of keystream every two consecutive
bits of keystream found.
By locating the position of a known pattern (such as the ones described previously), one could immediately get an other plaintext part of approximately the same length 255 bits away from this pattern.