Implementation and cryptanalysis of Vigenère

1. Implementation

1.1 Introduction

The Vigenère cipher is a method of encrypting alphabetic text by using a series of different Caesar ciphers based on the letters of a keyword.

1.2 Lab exercise

1.2.1 Exercise

Implement a program that asks the user for a text, and that displays it. Modify your program so that all letters are converted to lower-case, and all non-letter characters are removed.

1.2.2 Main idea

Getting the input text and keyword. We don’t know the length of input text and keyword, so we just use malloc and realloc function to allocate address space for input text and keyword. If they are capital letter, change them to the lower-case. After that, removing the non-letter characters.

1.2.3 Source code

Input text

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* Getting the input text and covert it using covert function.                               */
/* to get a viginer cipher, you should input the plaintext(the longer, the better) */
origintext = (char*) malloc(1000);
printf("please input origin text \n");
for(i = 0; (*(origintext + i) = getchar()) != '\n'; i++)
{
if((i+1) % 1000 == 0)
{
origintext = (char*)realloc(origintext, strlen(origintext) + 1000);
}
}
origintext[i] = '\0';
plaintext = (char*)malloc(strlen(origintext));
memset(plaintext, '\0', strlen(plaintext));
convert(origintext, plaintext);

Covert the text

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
/* convert function is used to covert the original text to  plaintext, delete  * 
* no-letter char and covert captical letter to lowercase */
void convert(char *origintext, char *plaintext)
{
int i, j;
i = 0;
j = 0;
memset(plaintext, '\0', strlen(plaintext));
while(origintext[i] != '\0')
{
if(origintext[i] <= 'Z' && origintext[i] >= 'A')
{
plaintext[j] = origintext[i] + 32;
j++;
}
if(origintext[i] <= 'z' && origintext[i] >= 'a')
{
plaintext[j] = origintext[i];
j++;
}
i++;
}
}

Input the key word

1
2
3
4
5
6
7
8
9
10
 /*input the key word*/ 
printf("please input the key\n");
for(i = 0,key = (char*)malloc(100); (*(key + i) = getchar()) != '\n'; i++)
{
if((i + 1) % 100 == 0)
{
key=(char*)realloc(key,strlen(key)+100);
}
}
*(key+i)='\0';

1.2.4 Result

The result was shown in figure 1.1. The plain text was about the introduction of viginere, which came from wikipedia-viginere. The keyword was set to isima.
figure 1.1

Figure 1.1 encryption of plain text

2. Cryptanalysis: key length estimation

2.1 Introduction

When the key length is known, a Vigenère cipher can be broken using frequency analysis. The primary weakness of the Vigenère cipher is the repeating nature of its key. If a cryptanalyst correctly guesses the key’s length, then the cipher text can be treated as interwoven Caesar ciphers, which individually are easily broken. The Kasiski and Friedman tests can help determine the key length.

2.2 Babbage and Kasiki method

Friedrich Kasiski was the first to publish a successful general attack on the Vigenère cipher. Charles Babbage was goaded into breaking the Vigenère cipher when John Hall Brock Thwaites submitted a “new” cipher to the Journal of the Society of the Arts. When Babbage showed that Thwaites’ cipher was essentially just another recreation of the Vigenère cipher, Thwaites challenged Babbage to break his cipher encoded twice, with keys of different length.

2.3 Lab work

To calculate the length of key work. I removed the letters whose repetitions were less than 2 times, and then get the distance between two same strings. After that I get divisors of each distance and calculate the weight of each divisor. With highest weight, we get the most probable probable length of key word.

2.4 source code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
/* get_key_size function is used to calculate the keysize of the key                                                               */ 
/* to avoid noise, get_key_size() calculates the weight of each possible keys and the result is */
/* more precise than previous get_key_size(), which just calculates the same divisor of all possible keys */
/* input: cipher text. */
/* output: the most probable length of key word. */
/* Same_Str struct is used to stored the information of the same strings in ciphertext */
struct Same_Str{
char str[1000];
int num; // the repetition time of each string
int start[1000]; //the start position of string
int end[1000]; // the end position of string, distance = end[i] – start[i], i stands for each repetition.
};

int get_key_size(char* ciphertext)
{

/* step 1: initialize work*/

int size = 0;
int i, j, m, k, r;
int len = strlen(ciphertext);
int *keysize = malloc(sizeof(int)*strlen(ciphertext));
memset(keysize, 0, sizeof(int)*strlen(ciphertext)); // if possible key size is 27, set keysize[26]=1;
int malloc_size = 1000;
struct Same_Str *same_str = malloc(sizeof(struct Same_Str) * malloc_size);
int str_num;
str_num = 0;
for(i = 0; i < malloc_size; i++)
{
same_str[i].num = 0;
}

char *block_1 = malloc(len);
char *block_2 = malloc(len);

/* step 2: compare each string, find the same string and store the information into Same_Str struct*/

for(m = 2; m < len/2 ; m++) //m -> the length of blocks to be compared
{
int block_num = len -m;
for(i = 0; i < block_num; i++)
{
memset(block_1, '\0', strlen(block_1));
memcpy(block_1, ciphertext + i, m+1);
int w = 0;
for(j = i + 1 + m; j < block_num; j++)
{
if(j >= block_num) break;
memset(block_2, '\0', strlen(block_2));
memcpy(block_2, ciphertext + j, m+1);
if(strcmp(block_1, block_2) == 0)
{
w++;

int pos = same_str[str_num].num;
same_str[str_num].start[pos] = i;
same_str[str_num].end[pos] = j;
same_str[str_num].num++;

}
}
if(w > 0)
{
sprintf(same_str[str_num].str, "%s", block_1);
printf("No.%d, found same block: %s, %d times\n",str_num, same_str[str_num].str, same_str[str_num].num);
str_num++;
if(str_num >= malloc_size)
{
malloc_size += 1000;
same_str = realloc(same_str, sizeof(struct Same_Str) * malloc_size);
}

}
//usleep(100);
}
}

free(block_1);
free(block_2);

/* step 3: removed the letters whose repetitions were less than 2 times */

/* from code above, we know all possible key sizes (count: str_num), which are stored in same_str array */
/*if some strs whose replication time is more than 2, then it must be the REAL str*/
k = 0;
for(m = 0; m < str_num; m++ )
{
if(same_str[m].num >= 2) // just skip weak string, one replication string
{
/*the REAL str*/
for(r = 0; r < same_str[m].num; r++)
{
int start = same_str[m].start[r];
int end = same_str[m].end[r];
keysize[k] = end - start;
k++;
}
}

}

/*step 4: get the divisors of each distance and calculate their weight */

/* get max number of key size, to malloc the array, each possible key size will be split as an *
* array ptr[i] and be added with an all-0 array: weight, from the weight of each key, we know *
* the most possible keysize */
int **ptr = malloc(sizeof(int*)*k);
int max = 0;
for(i = 0; i < k; i++)
{
int num = keysize[i];
max = (max > num)? max : num;
}

int *weight = malloc(sizeof(int) * max);
for(i = 0; i < max ;i++)
{
weight[i] = 0;
}
for(i = 0; i < k; i++)
{
int num = keysize[i];
ptr[i] = malloc(sizeof(int)*max);
ptr[i][num-1] = 1 ;
memset(ptr[i], 0, max);

/* at first, ptr[i] = "00000...00100..", after get_divisor, ptr[i] = "0..1..00..100..1..100...", *
* where 1 means the match pos is the divisor number of keysize[i] */
get_divisor(ptr[i], num);
for(j = 0; j < max; j++)
{
weight[j] = weight[j] + ptr[i][j];
}
}

/*step 5, get the most probable length of key words through weight information*/

int max_weight = 0;

for(j = 0; j < max; j++)
{
if(weight[j] != 0)
{
printf("weight of %d: %d\n", j+1, weight[j]);
}
if(max_weight < weight[j])
{
size = j+1;
max_weight = weight[j];
}
}

/*step 6, free resources*/

for(i = 0; i < k; i++)
{
free(ptr[i]);
}
free(keysize);
free(weight);
free(ptr);
free(same_str);
return size;
}

2.5 Result

Figure 2.1 shows the same strings that were found. Figure 2.2 shows the weight of each probable key length. We can easily find the the number 5 has highest weight 226, and the key word that we input was “isima”, so we found the right length of key word, which was shown in figure 2.3.
Figure 2.1

Figure 2.1 the same strings that werefound

Figure 2.2

Figure 2.2 the weight of each probable key length

Figure 2.3

Figure 2.3 most probable key length is 5

2.6 Problem

If the length of key word was the number like 10 whose divisors were 1, 2, 5, 10. then the number 5 will have larger or at least equal weight as number 10. It’s a problem. I am sorry but I don’t know how to solve it.

3.Cryptanalysis: probable word method

3.1 Introduction

For most used English words, each character has a different frequency, which was shown in figure 3.1.
figure 3.1

Figure 3.1 the frequency of each character in alphabet

3.2 Lab work

To find the probable key word, we can detach the cipher into “key_length” part, for example, after calculation, the probable key length was 5, then we just detach the cipher into 5 parts, for the letters in each part, they have the same key character.
After that, we choose 1 character from alphabet and try to decrypt each part of cipher text. Then, we compare frequency of the decryption text with the table of the most used English words, which was shown in figure 3.1. In this case, we just calculate the sample variance of the frequencies, the smaller sample variance, the better. Repeat the decryption of each characters in alphabet, we will get a minimum sample variance, then we get the most probable key character.
Repeat “key length” time of the experiment, we will get the most probable key word.

3.3 source code

Analyze the text and get frequencies of each character

For a given text, we can get the frequencies of each character and sample variance using anal_freq function. We can also get the probability that two randomly chosen letters are the same, which was named Ke. One of the result was shown in figure 3.2.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
/* anal_freq function is used to analyze the text, 
input:
1. text to analyze
output:
1. the frequency of each 26 letters ->m_fre[26]
2. The probability that two randomly chosen letters are the same -> Ke
3. return value: the variance, to estimate the quality of decryption test. */

double anal_freq(char *text, double m_fre[26], double *Ke)
{

int i, pos;
int count;

/* the normal frequency for each 26 letters */
double fre[26] = {0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056,
0.02758, 0.00978, 0.02360, 0.00150, 0.01974, 0.00074, };

/* the offset was set to verify that, no matter how *
* you pick up a letter in an offset interval, the *
* Ke results are always approximately equals to 0.067 */

int offset = 1;
double variance = 0; // to estimate the quality of decryption text
count = strlen(text);

/*get frequency of each letter; get Ke*/

for(i = 0; i < 26; i++)
{
m_fre[i] = 0;
}
i = 0;
while(text[i] != '\0' && i < strlen(text))
{
pos = (int)(text[i] - 'a');
m_fre[pos]++;
i = i + offset;
}
*Ke = 0;
for(i = 0; i < 26; i++)
{
*Ke += (m_fre[i] * (m_fre[i] - 1));
}
count = count / offset;

*Ke = *Ke / (double)(count * (count - 1));
for(i = 0; i < 26; i++)
{
m_fre[i] = m_fre[i] / (double)count;
}
//printf("The probability that two randomly chosen letters are the same: %2.4f\n",Ke);

/*calculate the variance*/
for(i = 0; i < 26; i++)
{
variance += pow((m_fre[i] - fre[i]), 2);
}
variance = variance / (double)25;
return variance;
}

figure 3.2

Figure 3.2 analyze the text
Find most probable key character

Repeat the decryption of each characters in alphabet, we will get a minimum sample variance, then we get the most probable key character. The result was shown in figure 3.3.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
/*the find_each_key_letter function is used to find a most probable key letter for a given ciphertext*/ 
/* if the probable key size is 5, then the find_each_key_letter function will be execute for 5 time */
/* input: seperated cipher text, for example, if peobable size is 5, then, the original ciphertext will */
/* be seperated into 5 parts, for all the letters in each part, their key letter are the same */
/* output: the most probable letter, which has the minimum variance */

char find_each_key_letter(char *cipher)
{
int i;
double variance[26];
double Ke[26];
char key;
double Std_Ke = 0.067;
double min = 10000; // set a very big value, to find a min
double min_Ke_off = 1;
char *plaintext = malloc(strlen(cipher) + 1);
for(i = 0; i < 26; i++)
{
char prob_key[2];
prob_key[0] = (char)(i + (int)('a'));
prob_key[1] = '\0';
double freq[26];
decrypt(cipher, prob_key, plaintext);
variance[i] = anal_freq(plaintext, freq, &Ke[i]);
printf("probable key letter : %s, Ke = %f, variance = %f \n",prob_key, Ke[i], variance[i]);
/*method 1: find the smallest variance*/
if(variance[i]< min )
{
key = prob_key[0];
min = variance[i];
}
/*method 2: find the closest Ke, this is wrong, because all the Ke are the same*/
/*double off = (Ke[i] - Std_Ke) > 0 ? (Ke[i] - Std_Ke) : (Std_Ke - Ke[i]);
if(off < min_Ke_off)
{
key = prob_key[0];
min_Ke_off = off;
}*/
}
free(plaintext);
return key; // find the most probable key latter c.
}

Figure 3.3

Figure 3.3 find most probable key character

From the result in figure 3.3, we can also see that no matter what character was set as key character, Ke, the probability that two randomly chosen letters are the same, is always 0.065136. So we can’t get the most probable key character from Ke.
After “key length” times repetition, we can get the most probable key word, which was shown in figure 3.4.
Figure 3.4

Figure 3.4 find the most probable key word
3.4 Decrypt the cipher

For a given key word, we will get the matched plain text using decrypt function, which was shown below. The result was shown in figure 3.5.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
/*decrpty the viginer ciphertext*/ 
void decrypt(char *ciphertext, char *key, char* plaintext)
{
int i ;
int j ;
int offset ;
i = 0 ;
j = 0 ;
while(ciphertext[i] != '\n' && ciphertext[i] != '\0')
{
offset = key[j] -(int)('a');
if((ciphertext[i] - (int)'a' - offset) < 0)
{
plaintext[i] = ciphertext[i] - offset + 26;
}
else
{
plaintext[i] = ciphertext[i] - offset; //it seems that we can't modify plaintext
}
i++;
j++;
if(key[j] == '\n' || key[j] == '\0')
{
j = 0;
}
}
plaintext[i] = '\0';

}

Figure 3.5 decryption

After decryption, we finally break viginere and get the plaintext.

4. Reference

[1]. http://en.wikipedia.org/wiki/Letter_frequency
[2]. http://en.wikipedia.org/wiki/Vigen%C3%A8re_cipher
[3]. http://sancy.univ-bpclermont.fr/~guitton/enseignements/admin.html

Note: All the source code was put in my github