
Infection Inspection: Predicting the Probability of a Windows Machine Being Infected by Various Families of Malware.

ABOUT THE DATA

The data was obtained from Kaggle and was supplied by Microsoft. The data set comprises real data on millions of Windows machines. The supplied CSVs contain 83 columns and approximately 8.9 million rows.

The data set is available on Kaggle.

Feature Definitions: Explanation of each of the columns in the data set.


  • MachineIdentifier - Individual machine ID
  • ProductName - Defender state information e.g. win8defender
  • EngineVersion - Defender state information e.g. 1.1.12603.0
  • AppVersion - Defender state information e.g. 4.9.10586.0
  • AvSigVersion - Defender state information e.g. 1.217.1014.0
  • IsBeta - Defender state information e.g. false
  • RtpStateBitfield - NA
  • IsSxsPassiveMode - NA
  • DefaultBrowsersIdentifier - ID for the machine's default browser
  • AVProductStatesIdentifier - ID for the specific configuration of a user's antivirus software
  • AVProductsInstalled - NA
  • AVProductsEnabled - NA
  • HasTpm - True if machine has a TPM
  • CountryIdentifier - ID for the country the machine is located in
  • CityIdentifier - ID for the city the machine is located in
  • OrganizationIdentifier - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
  • GeoNameIdentifier - ID for the geographic region a machine is located in
  • LocaleEnglishNameIdentifier - English name of Locale ID of the current user
  • Platform - Calculates platform name (of OS related properties and processor property)
  • Processor - This is the processor architecture of the installed operating system
  • OsVer - Version of the current operating system
  • OsBuild - Build of the current operating system
  • OsSuite - Product suite mask for the current operating system
  • OsPlatformSubRelease - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
  • OsBuildLab - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
  • SkuEdition - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must hand each class
  • IsProtected - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected
  • AutoSampleOptIn - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
  • PuaMode - Pua Enabled mode from the service
  • SMode - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
  • IeVerIdentifier - NA
  • SmartScreen - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry
  • Firewall - This attribute is true (1) for Windows 8.1 and above if Windows Firewall is enabled, as reported by the service
  • UacLuaenable - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA
  • Census_MDC2FormFactor - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
  • Census_DeviceFamily - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
  • Census_OEMNameIdentifier - NA
  • Census_OEMModelIdentifier - NA
  • Census_ProcessorCoreCount - Number of logical cores in the processor
  • Census_ProcessorManufacturerIdentifier - NA
  • Census_ProcessorModelIdentifier - NA
  • Census_ProcessorClass - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
  • Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
  • Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD
  • Census_SystemVolumeTotalCapacity - The size of the partition that the System volume is installed on in MB
  • Census_HasOpticalDiskDrive - True indicates that the machine has an optical disk drive (CD/DVD)
  • Census_TotalPhysicalRAM - Retrieves the Physical RAM in MB
  • Census_ChassisTypeName - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
  • Census_InternalPrimaryDiagonalDisplaySizeInInches - Retrieves the physical diagonal length in inches of the primary display
  • Census_InternalPrimaryDisplayResolutionHorizontal - Retrieves the number of pixels in the horizontal direction of the internal display.
  • Census_InternalPrimaryDisplayResolutionVertical - Retrieves the number of pixels in the vertical direction of the internal display
  • Census_PowerPlatformRoleName - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
  • Census_InternalBatteryType - NA
  • Census_InternalBatteryNumberOfCharges - NA
  • Census_OSVersion - Numeric OS version Example - 10.0.10130.0
  • Census_OSArchitecture - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
  • Census_OSBranch - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0amd64fre.fbl_partner_eeap.140810-0005
  • Census_OSBuildNumber - OS build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
  • Census_OSBuildRevision - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
  • Census_OSEdition - Edition of the current OS. Sourced from HKLM\Software\Microsoft\WindowsNT\CurrentVersion@EditionID in registry. Example: Enterprise
  • Census_OSSkuName - OS edition friendly name (currently Windows only)
  • Census_OSInstallTypeName - Friendly description of what install was used on the machine i.e. clean
  • Census_OSInstallLanguageIdentifier - NA
  • Census_OSUILocaleIdentifier - NA
  • Census_OSWUAutoUpdateOptionsName - Friendly name of the WindowsUpdate auto-update settings on the machine.
  • Census_IsPortableOperatingSystem - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
  • Census_GenuineStateName - Friendly name of OSGenuineStateID. 0 = Genuine
  • Census_ActivationChannel - Retail license key or Volume license key for a machine
  • Census_IsFlightInternal - NA
  • Census_IsFlightsDisabled - Indicates if the machine is participating in flighting
  • Census_FlightRing - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring
  • Census_ThresholdOptIn - NA
  • Census_FirmwareManufacturerIdentifier - NA
  • Census_FirmwareVersionIdentifier - NA
  • Census_IsSecureBootEnabled - Indicates if Secure Boot mode is enabled.
  • Census_IsWIMBootEnabled - NA
  • Census_IsVirtualDevice - Identifies a Virtual Machine (machine learning model)
  • Census_IsTouchEnabled - Is this a touch device?
  • Census_IsPenCapable - Is the device capable of pen input?
  • Census_IsAlwaysOnAlwaysConnectedCapable - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
  • Wdft_IsGamer - Indicates whether the device is a gamer device or not based on its hardware combination.
  • Wdft_RegionIdentifier - NA

Goals:

  1. Using the given information about a computer, how accurately can we predict whether or not malware will be detected on that machine?
  2. Which machine learning techniques apply to this data set?
  3. How do the different ML techniques compare?

CLEANING THE DATA

Although many features were provided, some data that could have been useful, such as forensic indicators and risky user behavior, were not included in this set.

Some fields had information that was duplicated in other fields.

Our first task was to understand what role each feature represented, how the features were related, and which features to include for our analysis.



Inspection

After determining which features were worth including, we had to verify that the values for each feature were usable.

As pictured below, we calculated how many null values were in each column and plotted the ratios for a visualization of the contents of the data set.
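The null-ratio calculation described above can be sketched in plain Python. This is only an illustration, not the code we ran: the `null_ratios` helper, the `MISSING` set of markers, and the four-row sample are all assumptions made up for the example.

```python
from collections import Counter

# Markers treated as "null". This set is an assumption for illustration.
MISSING = {"", "NA", "NaN", "nan", None}

def null_ratios(rows):
    """Return {column: fraction of rows whose value is missing}."""
    rows = list(rows)
    if not rows:
        return {}
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value in MISSING:
                missing[column] += 1
    # Columns are taken from the first row, as csv.DictReader would provide.
    return {column: missing[column] / len(rows) for column in rows[0]}

# Made-up sample rows; the real data has 83 columns and ~8.9 million rows.
sample = [
    {"PuaMode": "", "SmartScreen": "RequireAdmin", "Wdft_IsGamer": "0"},
    {"PuaMode": "", "SmartScreen": "", "Wdft_IsGamer": "1"},
    {"PuaMode": "on", "SmartScreen": "Off", "Wdft_IsGamer": ""},
    {"PuaMode": "", "SmartScreen": "Off", "Wdft_IsGamer": "0"},
]

ratios = null_ratios(sample)
# Print columns from most to least missing, ready to feed into a bar plot.
for column, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{column:<15} {ratio:.2f}")
```

Because the function accepts any iterable of dictionaries, the rows could also come from `csv.DictReader`. For the full data set a dataframe library would likely be more practical.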

After exploring the contents of the data set, we discovered that every feature in this set is categorical. This is important to note, because the lack of continuous variables was a large factor in our choice of machine learning techniques. Rather than removing rows or columns containing null values, we approached each feature on a case-by-case basis to determine whether one-hot encoding, label encoding, or binary encoding would be appropriate. In some cases, we changed the data types in order to preserve ordinality for data that was not purely nominal.
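To make the three encodings concrete, here is a minimal sketch in plain Python. The helper names (`label_encode`, `one_hot`, `binary_encode`) and the sample values are hypothetical and chosen only for illustration. They show the general techniques, not the exact transformations applied to each feature.

```python
def label_encode(values, order=None):
    """Map categories to integers; pass `order` to preserve a known ranking."""
    categories = order if order is not None else sorted(set(values))
    codes = {category: index for index, category in enumerate(categories)}
    return [codes[value] for value in values]

def one_hot(values):
    """One 0/1 column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    return [[int(value == category) for category in categories]
            for value in values]

def binary_encode(values):
    """Write each label code in base 2, using as few bit columns as needed."""
    codes = label_encode(values)
    width = max(1, max(codes).bit_length())
    return [[(code >> bit) & 1 for bit in reversed(range(width))]
            for code in codes]

# Nominal feature with few categories: one-hot encoding.
disk_types = ["HDD", "SSD", "HDD"]
print(one_hot(disk_types))

# Ordered feature: label encoding with an explicit order keeps ordinality.
processor_class = ["low", "high", "mid"]
print(label_encode(processor_class, order=["low", "mid", "high"]))

# Many categories: binary encoding needs far fewer columns than one-hot.
identifiers = ["a", "b", "c", "d", "e"]
print(binary_encode(identifiers))
```

One-hot encoding adds a column per category, so it suits features with few distinct values. Binary encoding grows only with the logarithm of the category count, which helps with identifier-style features.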

Examples of feature engineering methods that were performed:

  1. Changed NaNs in PuaMode to "off"
  2. Changed NaNs in Census_ProcessorClass to "None"
  3. Merged the values "Off" and "On" into "off" and "on", so that capitalization did not create separate categories
  4. Changed NaNs in Wdft_IsGamer to "0"
  5. Changed NaNs in Census_IsFlightInternal to "1"
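The steps listed above could be expressed roughly as follows. This is a sketch under assumptions: the `clean_row` helper and the `MISSING` markers are hypothetical, and applying the "Off"/"On" merge to PuaMode and SmartScreen is a guess, since the list does not name the affected columns.

```python
# Markers treated as NaN. This set is an assumption for illustration.
MISSING = {"", "NA", "NaN", "nan", None}

# Fill values taken from steps 1, 2, 4, and 5.
FILL_VALUES = {
    "PuaMode": "off",
    "Census_ProcessorClass": "None",
    "Wdft_IsGamer": "0",
    "Census_IsFlightInternal": "1",
}

# Columns where "Off"/"On" are merged into "off"/"on" (step 3).
# Which columns this applied to is an assumption.
ON_OFF_COLUMNS = ("PuaMode", "SmartScreen")

def clean_row(row):
    """Return a cleaned copy of one record; the input is left untouched."""
    cleaned = dict(row)
    for column, fill in FILL_VALUES.items():
        if column in cleaned and cleaned[column] in MISSING:
            cleaned[column] = fill
    for column in ON_OFF_COLUMNS:
        value = cleaned.get(column)
        if isinstance(value, str) and value.lower() in {"on", "off"}:
            cleaned[column] = value.lower()
    return cleaned

raw = {
    "PuaMode": "",
    "SmartScreen": "Off",
    "Wdft_IsGamer": "NA",
    "Census_ProcessorClass": None,
    "Census_IsFlightInternal": "0",
}
print(clean_row(raw))
```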





Removed Features: Below we have listed the features from the data that we removed and the reason(s) for their removal.


There are a few features that overlap completely or partially with others. They generally fall into several categories:

### OS versioning Features

There were twelve different features in the dataset that all described the same thing: which version of Windows was installed. We chose Census_OSVersion because it is a combination of OsVer, Census_OSBuildNumber, and Census_OSBuildRevision. The other OS information could be derived from those numbers if needed, or substituted later.

Census_OSVersion, Platform, OsVer, OsBuild, OsSuite, OsPlatformSubRelease, OsBuildLab*, SkuEdition, Census_OSBranch, Census_OSBuildNumber, Census_OSBuildRevision, Census_OSEdition, Census_OSSkuName, Census_OSInstallTypeName, Census_OSWUAutoUpdateOptionsName, Census_IsPortableOperatingSystem, Census_GenuineStateName, Census_ActivationChannel, Census_IsSecureBootEnabled, and Census_IsWIMBootEnabled

\* Note that OsBuildLab also includes information that is in the Hardware category (Processor).
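Since Census_OSVersion bundles the OS version, build number, and build revision into one dotted string, the individual parts can be recovered when needed. The sketch below assumes the four-part format shown in the feature definitions (for example 10.0.10130.0). The `OSVersion` type and `parse_os_version` helper are hypothetical, and treating the first two parts as the OS version is an assumption.

```python
from typing import NamedTuple

class OSVersion(NamedTuple):
    os_ver: str          # first two parts, e.g. "10.0" (assumed mapping)
    build_number: int    # third part, e.g. 10130
    build_revision: int  # fourth part, e.g. 0

def parse_os_version(version: str) -> OSVersion:
    """Split a four-part dotted version string into its components."""
    # Raises ValueError if the string does not have exactly four parts.
    major, minor, build, revision = version.split(".")
    return OSVersion(f"{major}.{minor}", int(build), int(revision))

parsed = parse_os_version("10.0.10130.0")
print(parsed.os_ver, parsed.build_number, parsed.build_revision)
```

Keeping the build number and revision as integers also preserves their ordering, which a plain string comparison would not.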

### Hardware/Firmware Description Features

Many of these hardware features had overlapping values, and many of them depend on the overall hardware configuration. Because characteristics such as screen size, number of cores, and total RAM are related to the device type and form factor, we simplified this category by using the type and form factor features, then later compared them against select individual features such as processor core count and total RAM. Wdft_IsGamer is included here because it is based on the detection of a high-powered video card. We included it because it was one of the few features that gave insight into user behavior.

Processor, Census_MDC2FormFactor, Census_DeviceFamily, Census_OEMNameIdentifier, Census_OEMModelIdentifier, Census_ProcessorCoreCount, Census_ProcessorManufacturerIdentifier, Census_ProcessorModelIdentifier, Census_ProcessorClass, Census_PrimaryDiskTotalCapacity, Census_PrimaryDiskTypeName, Census_SystemVolumeTotalCapacity, Census_TotalPhysicalRAM, Census_ChassisTypeName, Census_InternalPrimaryDiagonalDisplaySizeInInches, Census_InternalPrimaryDisplayResolutionHorizontal, Census_InternalPrimaryDisplayResolutionVertical, Census_PowerPlatformRoleName, Census_InternalBatteryType, Census_InternalBatteryNumberOfCharges, Census_FirmwareVersionIdentifier, Census_IsVirtualDevice, Census_IsTouchEnabled, Census_IsPenCapable, Census_IsAlwaysOnAlwaysConnectedCapable, and Wdft_IsGamer

### Antivirus Components and Configuration Features

AVProductStatesIdentifier was a bitwise composite number that included many of the Defender options, so we chose it to represent this category. Additionally, EngineVersion, AppVersion, and AvSigVersion all seemed to be related to each other.

ProductName, EngineVersion, AppVersion, AvSigVersion, IsBeta, RtpStateBitfield, IsSxsPassiveMode, AVProductStatesIdentifier, AVProductsInstalled, AVProductsEnabled, HasTpm, IsProtected, AutoSampleOptIn, PuaMode, SMode, Firewall, and UacLuaenable

### Browser Configuration Features

These features describe which web browser was the default, which version it was if it was Internet Explorer, and the Internet Explorer settings that are equivalent to Google's Safe Browsing.

DefaultBrowsersIdentifier, IeVerIdentifier, and SmartScreen

### Geographic/Customer Data Features

The Geographic data did not have much signal in it, so it was excluded.

CountryIdentifier, CityIdentifier, OrganizationIdentifier, GeoNameIdentifier, LocaleEnglishNameIdentifier, Census_OSInstallLanguageIdentifier, Census_OSUILocaleIdentifier, and Wdft_RegionIdentifier

### Unknown Windows Settings Features

These features had a substantial number of null values and heavily skewed value distributions.

Census_IsFlightInternal, Census_IsFlightsDisabled, Census_FlightRing, and Census_ThresholdOptIn
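One simple way to quantify how skewed a categorical feature is, sketched here as an illustration rather than as the check we actually ran, is to measure the share of rows taken up by its single most common value. The `dominant_share` helper and the sample column are hypothetical.

```python
from collections import Counter

def dominant_share(values):
    """Return (most common value, fraction of rows holding that value)."""
    values = list(values)
    if not values:
        raise ValueError("cannot measure skew of an empty column")
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Made-up column: 98 rows of "0", one "1", and one missing value.
column = ["0"] * 98 + ["1", ""]
value, share = dominant_share(column)
print(f"most common value {value!r} covers {share:.0%} of rows")
```

A feature where one value covers nearly every row carries very little information for a classifier, which is one reason to consider dropping it.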


THE TEAM