The data was obtained from Kaggle and was supplied by Microsoft. The data set consists of real data from millions of Windows machines. The supplied CSVs contain 83 columns and approximately 8.9 million rows.
Feature Definitions: Explanation of each of the columns in the data set.
Although many features were provided, some data that could have been useful, such as forensic indicators and risky user behavior, was not included in this set.
Some fields had information that was duplicated in other fields.
Our first task was to understand what each feature represented, how the features were related, and which features to include in our analysis.
After determining which features were worth including, we had to verify that the values for each feature were usable.
As pictured below, we calculated how many null values were in each column and plotted the ratios for a visualization of the contents of the data set.
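A minimal sketch of the null-ratio calculation, assuming pandas. The column names come from the data set, but the rows are made up for illustration, and this is not necessarily the exact code we ran.

```python
import pandas as pd

# Made-up rows standing in for the real CSV (illustrative values only).
df = pd.DataFrame({
    "PuaMode": [None, None, "on", None],
    "SmartScreen": ["RequireAdmin", None, "ExistsNotSet", None],
    "HasTpm": [1, 1, 1, 0],
})

# isna() marks each missing cell; the column mean of those booleans
# is the fraction of null values in that column.
null_ratio = df.isna().mean().sort_values(ascending=False)
print(null_ratio)
```

With matplotlib installed, `null_ratio.plot.barh()` is one way to turn these ratios into a bar chart.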
After exploring the contents of the data set, we discovered that every feature in this set is categorical. This is important to note, as the lack of continuous variables was a large factor in our choice of machine learning techniques. Rather than removing rows or columns containing null values, we approached each feature on a case-by-case basis to determine whether one-hot encoding, label encoding, or binary encoding would be appropriate. In some cases, we changed the data types in order to preserve ordinality for data that was not purely nominal.
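The sketch below shows two of these encodings using pandas. It is an illustration under assumptions: the values and the `low < mid < high` ordering are made up for the example, and the choice of encoding per column is not a record of what we applied to each feature.

```python
import pandas as pd

# Made-up rows; column names are from the data set, values are illustrative.
df = pd.DataFrame({
    "Census_PrimaryDiskTypeName": ["HDD", "SSD", None, "HDD"],
    "Census_ProcessorClass": ["mid", "low", "high", None],
})

# One-hot encoding for a purely nominal feature.
# dummy_na=True keeps nulls as their own indicator column instead of dropping them.
one_hot = pd.get_dummies(
    df["Census_PrimaryDiskTypeName"], prefix="disk", dummy_na=True
)

# Label encoding that preserves ordinality: declare the order explicitly,
# then use the integer codes. Nulls receive the code -1.
order = ["low", "mid", "high"]  # assumed ordering, for illustration
proc = pd.Categorical(df["Census_ProcessorClass"], categories=order, ordered=True)
df["ProcessorClass_code"] = proc.codes

print(one_hot)
print(df["ProcessorClass_code"].tolist())
```

Keeping nulls as an explicit category (or a `-1` code) is what lets the rows stay in the data set rather than being dropped.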
Examples of feature engineering methods that were performed:
Removed Features: Below we have listed the features from the data that we removed and the reason(s) for their removal.
There are a few features that overlap completely or partially with others. They generally fall into several categories:
### OS versioning Features
There were twelve different features in the dataset that all described the same thing: which version of Windows was installed. We chose Census_OSVersion as it was a combination of OSVer, Census_OSBuildNumber, and Census_OSBuildRevision. The other OS information could be derived from those numbers if needed, or substituted later.
Census_OSVersion,
Platform,
OSVer,
OsBuild,
OsSuite,
OsPlatformSubRelease,
OsBuildLab*,
SkuEdition,
Census_OSBranch,
Census_OSBuildNumber,
Census_OSBuildRevision,
Census_OSEdition,
Census_OSSkuName,
Census_OSInstallTypeName,
Census_OSWUAutoUpdateOptionsName,
Census_IsPortableOperatingSystem,
Census_GenuineStateName,
Census_ActivationChannel,
Census_IsSecureBootEnabled,
Census_IsWIMBootEnabled,
\* Note that OsBuildLab also includes information that belongs under Hardware (Processor)
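To illustrate how the other OS information can be derived from Census_OSVersion, here is a small sketch. It assumes the value is a dotted four-part string of the form `major.minor.build.revision`; the helper name `split_os_version` and the sample value are hypothetical.

```python
def split_os_version(version: str) -> dict:
    """Split a dotted 'major.minor.build.revision' string into its parts.

    Assumes exactly four numeric components; raises ValueError otherwise.
    """
    major, minor, build, revision = (int(p) for p in version.split("."))
    return {
        "major": major,
        "minor": minor,
        "build": build,        # corresponds to the build number
        "revision": revision,  # corresponds to the build revision
    }


parts = split_os_version("10.0.17134.165")  # sample value for illustration
print(parts)
```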
### Hardware/Firmware Description Features
Many of these hardware features had overlapping values, and many of the other values depend on the hardware configuration. Because things like screen size, number of cores, and total RAM can be related to the device type and form factor, we simplified this category using type and form factor, then later compared the result to select individual features such as processor core count and total RAM. Wdft_IsGamer is included here because it is based on the detection of a high-powered video card. We included it as it was one of the few features that gave insight into user behavior.
Processor,
Census_MDC2FormFactor,
Census_DeviceFamily,
Census_OEMNameIdentifier,
Census_OEMModelIdentifier,
Census_ProcessorCoreCount,
Census_ProcessorManufacturerIdentifier,
Census_ProcessorModelIdentifier,
Census_ProcessorClass,
Census_PrimaryDiskTotalCapacity,
Census_PrimaryDiskTypeName,
Census_SystemVolumeTotalCapacity,
Census_TotalPhysicalRAM,
Census_ChassisTypeName,
Census_InternalPrimaryDiagonalDisplaySizeInInches,
Census_InternalPrimaryDisplayResolutionHorizontal,
Census_InternalPrimaryDisplayResolutionVertical,
Census_PowerPlatformRoleName,
Census_InternalBatteryType,
Census_InternalBatteryNumberOfCharges,
Census_FirmwareVersionIdentifier,
Census_IsVirtualDevice,
Census_IsTouchEnabled,
Census_IsPenCapable,
Census_IsAlwaysOnAlwaysConnectedCapable,
and Wdft_IsGamer
### Antivirus Components and Configuration Features
AVProductStatesIdentifier was a bitwise composite number that included many of the Defender options, so we chose it. Additionally, EngineVersion, AppVersion, and AvSigVersion all seemed to be related to each other.
ProductName,
EngineVersion,
AppVersion,
AvSigVersion,
IsBeta,
RtpStateBitfield,
IsSxsPassiveMode,
AVProductStatesIdentifier,
AVProductsInstalled,
AVProductsEnabled,
HasTpm,
IsProtected,
AutoSampleOptIn,
PuaMode,
SMode,
Firewall,
and UacLuaenable
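One way to check whether two categorical features carry overlapping information is to test whether each value of one column always maps to a single value of the other. The sketch below assumes pandas; the helper `determines` and the version strings are hypothetical and only illustrate the check, not the actual relationship in the data.

```python
import pandas as pd


def determines(df: pd.DataFrame, a: str, b: str) -> bool:
    """Return True if every value of column `a` maps to exactly one value of `b`."""
    distinct_b_per_a = df.groupby(a, dropna=False)[b].nunique(dropna=False)
    return bool((distinct_b_per_a <= 1).all())


# Made-up version strings for illustration only.
df = pd.DataFrame({
    "AvSigVersion": ["1.273.1", "1.273.1", "1.275.2", "1.275.9"],
    "EngineVersion": ["1.1.15100", "1.1.15100", "1.1.15200", "1.1.15200"],
})

print(determines(df, "AvSigVersion", "EngineVersion"))
print(determines(df, "EngineVersion", "AvSigVersion"))
```

If one column fully determines another, the determined column adds no new information and is a candidate for removal.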
### Browser Configuration Features
These features describe which web browser was the default, its version if it was Internet Explorer, and the Internet Explorer settings that are equivalent to Google's Safe Browsing.
DefaultBrowsersIdentifier,
IeVerIdentifier,
and SmartScreen
### Geographic/Customer Data Features
The geographic data did not have much signal in it, so it was excluded.
CountryIdentifier,
CityIdentifier,
OrganizationIdentifier,
GeoNameIdentifier,
LocaleEnglishNameIdentifier,
Census_OSInstallLanguageIdentifier,
Census_OSUILocaleIdentifier,
and Wdft_RegionIdentifier
### Unknown Windows Settings Features
These features had a substantial number of null values, and their non-null values were heavily skewed.
Census_IsFlightingInternal,
Census_IsFlightsDisabled,
Census_FlightRing,
and Census_ThresholdOptIn
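A rough way to quantify both problems for a single column is to report its null ratio together with the share held by its most frequent non-null value. This is a sketch assuming pandas; the helper `column_profile` and the sample values are hypothetical.

```python
import pandas as pd


def column_profile(s: pd.Series) -> dict:
    """Null ratio plus the share of the most common non-null value."""
    shares = s.value_counts(normalize=True, dropna=True)
    top_share = float(shares.iloc[0]) if not shares.empty else float("nan")
    return {"null_ratio": float(s.isna().mean()), "top_share": top_share}


# Made-up values: one null, and a single dominant value among the rest.
flights_disabled = pd.Series(
    [0, 0, 0, None, 0, 0, 1, 0], name="Census_IsFlightsDisabled"
)
profile = column_profile(flights_disabled)
print(profile)
```

A column where `top_share` is close to 1 is nearly constant, so it offers little for a model to separate on.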
Find me on GitHub Click Here