Datasets in Statistics and Cybersecurity

Datasets are fundamental for analyzing and modeling behaviors in cybersecurity. They form the basis for detecting anomalies, training predictive models, and evaluating intrusion detection systems [1].


Types of Datasets

1. Structured

Data organized in tables with rows and columns, such as relational databases.
Example: access logs, user tables [2].

2. Unstructured

Data without a fixed schema, such as text, images, or video. They require advanced processing techniques, like NLP or computer vision [3].

3. Semi-Structured

Data with partial structure, such as JSON or XML files. They contain labels or metadata that facilitate analysis [2].


Example: Tabular Dataset from KDD Cup '99

The KDD Cup '99 dataset is widely used for intrusion detection. It contains simulated network traffic information with 41 variables and various attack labels [6].
Here is a simplified tabular representation:

Duration  Protocol  Service  SrcBytes  DstBytes  Label
0         tcp       http     181       5450      normal
0         udp       domain   105       146       normal
0         tcp       ftp      239       486       attack

This structure allows the application of statistical and machine learning techniques to identify anomalous behaviors.
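As a minimal illustration of that idea, the three sample rows above can be grouped by label and compared on a single variable (a toy computation on the sample, not on the full KDD dataset):

```javascript
// The three sample rows from the table above.
const records = [
  { protocol: 'tcp', service: 'http',   srcBytes: 181, dstBytes: 5450, label: 'normal' },
  { protocol: 'udp', service: 'domain', srcBytes: 105, dstBytes: 146,  label: 'normal' },
  { protocol: 'tcp', service: 'ftp',    srcBytes: 239, dstBytes: 486,  label: 'attack' },
];

// Group connection records by label and compute the mean source bytes,
// a first step toward characterizing normal vs. anomalous traffic.
function meanSrcBytesByLabel(rows) {
  const sums = {}, counts = {};
  for (const r of rows) {
    sums[r.label] = (sums[r.label] || 0) + r.srcBytes;
    counts[r.label] = (counts[r.label] || 0) + 1;
  }
  const means = {};
  for (const label of Object.keys(sums)) means[label] = sums[label] / counts[label];
  return means;
}
```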


Dataset Management

Proper dataset management is crucial for reliable and replicable results. Typical steps include collecting the data, cleaning and validating it, transforming it into a suitable format, and documenting its provenance.


Data Distribution Concepts

Data distribution describes how the values of a variable, or of a set of variables, are spread across their range. Understanding the distribution is essential for detecting anomalies, choosing appropriate models, and interpreting results correctly.

Types of Distribution

Other Relevant Concepts


Mathematical Formulas for Distributions

Measures of location and dispersion
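The standard measures, for observations \(x_1, \dots, x_n\):

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \]

where \(\bar{x}\) is the sample mean, \(s^2\) the sample variance, and \(s\) the standard deviation. The median and the interquartile range are robust alternatives when outliers are present.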

Common probability distributions
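Distributions frequently used in this context include the normal, binomial, and Poisson:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad \text{(normal)}, \]

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{(binomial)}, \]

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{(Poisson)}. \]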

Moments and shape
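Shape is commonly summarized by skewness and excess kurtosis:

\[ \gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}, \qquad \gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3, \]

where \(\gamma_1\) measures asymmetry and \(\gamma_2\) measures tail heaviness relative to the normal distribution.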

Chi-squared statistic

Used to compare observed vs expected frequencies (this statistic also appears in the cipher-scoring implementation later in this document):

\[ \chi^2=\sum_{i}\frac{(O_i-E_i)^2}{E_i}, \]

where \(O_i\) are observed counts and \(E_i\) are expected counts.
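The statistic can be computed directly; here is a generic helper (a sketch, not the tool's exact code):

```javascript
// Chi-squared statistic for observed vs. expected counts.
// Assumes both arrays have the same length and every expected count is positive.
function chiSquared(observed, expected) {
  let chi2 = 0;
  for (let i = 0; i < observed.length; i++) {
    const diff = observed[i] - expected[i];
    chi2 += (diff * diff) / expected[i];
  }
  return chi2;
}
```

For example, `chiSquared([10, 20, 30], [20, 20, 20])` yields 10, quantifying the deviation from a uniform expectation.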


Final Considerations

Proper analysis of datasets and distributions is essential in cybersecurity for identifying threats and optimizing defense systems. The combined use of structured, unstructured, and semi-structured datasets, along with rigorous data management, allows the development of robust and reliable statistical and machine learning models [1][6].


References

  1. Provost, F., & Fawcett, T. (2013). Data Science for Business. O'Reilly Media.
  2. Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques. Elsevier.
  3. Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing. Pearson.
  4. Montgomery, D. C., & Runger, G. C. (2018). Applied Statistics and Probability for Engineers. Wiley.
  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  6. Tavallaee, M., et al. (2009). A detailed analysis of the KDD CUP 99 data set. IEEE Symposium on Computational Intelligence for Security and Defense Applications.

Univariate and Bivariate Distribution on a Dataset

Here is an example based on a simple database created with Access. Using two basic SQL queries, we can calculate both the univariate and bivariate distributions. For the univariate distribution, we consider the Age variable, while for the bivariate distribution we examine Age and Height.

Example tabular data

Univariate Distribution on Age

[Figure: univariate frequency distribution of Age]

SQL Code:

[Figure: SQL query for the univariate distribution]

Bivariate Distribution on Age and Height

[Figure: bivariate frequency distribution of Age and Height]

SQL Code:

[Figure: SQL query for the bivariate distribution]
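Both queries amount to GROUP BY counts. The same computations can be sketched in JavaScript (with hypothetical sample rows, since the Access table itself is not reproduced here):

```javascript
// Hypothetical sample rows standing in for the Access table (Age in years, Height in cm).
const people = [
  { age: 20, height: 170 },
  { age: 20, height: 175 },
  { age: 25, height: 170 },
  { age: 25, height: 175 },
  { age: 25, height: 175 },
];

// Univariate distribution: count of rows per distinct value of one variable
// (the equivalent of SELECT Age, COUNT(*) ... GROUP BY Age).
function univariate(rows, key) {
  const freq = new Map();
  for (const r of rows) freq.set(r[key], (freq.get(r[key]) || 0) + 1);
  return freq;
}

// Bivariate distribution: count of rows per distinct pair of values
// (the equivalent of grouping by Age and Height together).
function bivariate(rows, keyA, keyB) {
  const freq = new Map();
  for (const r of rows) {
    const pair = `${r[keyA]},${r[keyB]}`;
    freq.set(pair, (freq.get(pair) || 0) + 1);
  }
  return freq;
}
```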

Formulas for univariate / bivariate analysis

Useful formulas for the displayed analyses:
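For a variable with distinct values \(x_1, \dots, x_k\), counts \(n_1, \dots, n_k\), and total \(N = \sum_i n_i\):

\[ f_i = \frac{n_i}{N} \quad \text{(univariate relative frequency)}, \]

\[ f_{ij} = \frac{n_{ij}}{N} \quad \text{(joint relative frequency of the pair } (x_i, y_j)\text{)}, \]

and the marginal frequencies are obtained by summing the joint table over one index, e.g. \( f_{i\cdot} = \sum_j f_{ij} \).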

Using Distribution to Break a Caesar Cipher

This section introduces a simple web-based tool and an explanation for working with the Caesar cipher (shift = 3). The tool allows users to generate a cipher from plaintext with a chosen shift, attempt decryption across all candidate shifts, and rank the resulting candidates statistically.

The frequency-based approach provides a statistical guess for the most likely shift (for longer texts), while brute-force remains reliable for shorter ones.

Main Goals of the Tool

Caesar Cipher Generation

Attempted Decryption of All Shifts

Scoring and Ranking of Candidates

Calculates scores based on the percentage of valid common words, letter-rank (Hamming) similarity, chi-squared deviation from expected letter frequencies, and the presence of common bigrams.

Results Display

User Utilities

Caesar Cipher: Generate → Analyze (Improved Ranking)

Enter plaintext, choose a shift, and generate the cipher. Then you can analyze the cipher and update the two letter-frequency charts manually.
Caesar Cipher Generator
Output:
Weights for combined scoring
Results table columns: Shift | Combined | Word% | Hamming | Chi² | Bigram% | Preview

Selected Text

---

Charts: Letter Occurrences (Plaintext & Ciphertext)

Press Update charts to compute the A–Z frequencies for the two texts. There is no automatic update in the background.
Plaintext — percentage per letter (A–Z)
Ciphertext — percentage per letter (A–Z)

Overview of the Caesar Cipher with Statistical Analysis

The Caesar cipher is a monoalphabetic substitution cipher that shifts each letter of the plaintext by a fixed number of positions.

Plaintext:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ciphertext: DEFGHIJKLMNOPQRSTUVWXYZABC
      

Encryption/Decryption Function


// Shift each letter by `shift` positions; shift is assumed to be in 0–25
// (JavaScript's % operator can return negatives for negative operands).
function caesarShift(text, shift){
  return text.split('').map(ch => {
    const code = ch.charCodeAt(0);
    // Uppercase A–Z (ASCII 65–90): rotate within the uppercase range.
    if(code >= 65 && code <= 90) return String.fromCharCode(((code - 65 + shift) % 26) + 65);
    // Lowercase a–z (ASCII 97–122): rotate within the lowercase range.
    if(code >= 97 && code <= 122) return String.fromCharCode(((code - 97 + shift) % 26) + 97);
    // Non-letters (spaces, digits, punctuation) pass through unchanged.
    return ch;
  }).join('');
}

Brute-force Decryption


const results = [];
// Decrypting a cipher produced with shift k is the same as encrypting with
// 26 - k, so trying shifts 1–25 enumerates every possible key.
for(let shift = 1; shift < 26; shift++){
  results.push({shift, text: caesarShift(ciphertext, shift)});
}

Statistical / Distribution-based Decryption

The advanced part of the code uses letter and bigram frequency distributions to estimate the most probable shift automatically.
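A minimal sketch of that estimate: decrypt with every shift, score each candidate by chi-squared against typical English letter frequencies, and keep the lowest score. The frequency table below is an approximation assumed for this sketch, not necessarily the table the tool uses (which also considers Italian and bigrams):

```javascript
// Approximate English letter frequencies in percent (an assumption of this sketch).
const ENGLISH_FREQ = {
  A: 8.2,  B: 1.5,  C: 2.8,  D: 4.3,  E: 12.7, F: 2.2,  G: 2.0,
  H: 6.1,  I: 7.0,  J: 0.15, K: 0.77, L: 4.0,  M: 2.4,  N: 6.7,
  O: 7.5,  P: 1.9,  Q: 0.1,  R: 6.0,  S: 6.3,  T: 9.1,  U: 2.8,
  V: 1.0,  W: 2.4,  X: 0.15, Y: 2.0,  Z: 0.07
};

// Same shifting helper as in the earlier listing.
function caesarShift(text, shift) {
  return text.split('').map(ch => {
    const code = ch.charCodeAt(0);
    if (code >= 65 && code <= 90) return String.fromCharCode(((code - 65 + shift) % 26) + 65);
    if (code >= 97 && code <= 122) return String.fromCharCode(((code - 97 + shift) % 26) + 97);
    return ch;
  }).join('');
}

// Chi-squared distance between a text's letter distribution and ENGLISH_FREQ.
function englishChiSquared(text) {
  const letters = text.toUpperCase().replace(/[^A-Z]/g, '');
  const counts = {};
  for (const ch of letters) counts[ch] = (counts[ch] || 0) + 1;
  let chi2 = 0;
  for (const [letter, pct] of Object.entries(ENGLISH_FREQ)) {
    const expected = letters.length * pct / 100;
    const observed = counts[letter] || 0;
    chi2 += (observed - expected) ** 2 / expected;
  }
  return chi2;
}

// Try all 25 candidate shifts and return the one whose decryption looks
// most like English (lowest chi-squared).
function mostLikelyShift(ciphertext) {
  let best = null;
  for (let shift = 1; shift < 26; shift++) {
    const score = englishChiSquared(caesarShift(ciphertext, shift));
    if (best === null || score < best.score) best = { shift, score };
  }
  return best.shift;
}
```

Since decrypting a cipher produced with shift k means applying shift 26 − k, a reasonably long English text encrypted with shift 3 should be assigned shift 23 by this estimate.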

Metrics Explained

Metric    Description                                        Weight
Word%     Fraction of valid common words (English/Italian)   0.45
Hamming   Letter rank similarity                             0.25
Chi²      Statistical deviation from expected frequencies    0.15
Bigram%   Common bigram presence                             0.15
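The weighted combination itself is straightforward. The sketch below assumes each metric has already been normalized to [0, 1] with higher meaning "more plausible plaintext" (so the chi-squared value must be inverted beforehand; that normalization step is an assumption here, not taken from the tool):

```javascript
// Weights from the table above; metric values are assumed normalized to [0, 1],
// higher = more plausible plaintext (chi-squared inverted before this step).
const WEIGHTS = { wordPct: 0.45, hamming: 0.25, chi2: 0.15, bigramPct: 0.15 };

// Weighted sum of the four normalized metrics.
function combinedScore(metrics) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    score += weight * metrics[name];
  }
  return score;
}
```

Candidates are then sorted by this combined score, so a decryption with many valid words dominates even when its chi-squared term is mediocre.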

Cryptographic Insights

The Caesar cipher is weak because of its small keyspace (25 shifts) and predictable frequency patterns. Statistical analysis quickly reveals the shift.

Conclusion

This implementation bridges cryptography and statistics, showing how data distributions can reveal hidden information. The combined score metric elegantly integrates multiple statistical indicators to automatically prioritize the most probable plaintext.

Code