The example code shown in the explanation below can also be found in this example Jupyter notebook.
The Overview visualization is powered by the feature statistics protocol buffer. The feature statistics protocol buffer messages store summary statistics for individual feature columns of a set of input data for an ML system (although the format is general enough to be used for summary statistics of any set of data).
The top-level proto is DatasetFeatureStatisticsList, which is a list of DatasetFeatureStatistics. Each DatasetFeatureStatistics represents the feature statistics for a single dataset and contains a list of FeatureNameStatistics, each of which contains the statistics for a single feature in that dataset.
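The nesting described above can be sketched with plain Python dataclasses. These are stand-ins for illustration only, not the generated protocol buffer classes, and the field names (datasets, features, name, type) are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNameStatistics:
    """Stand-in for the statistics of one feature in one dataset."""
    name: str
    type: str  # illustrative: "numeric", "string" or "bytes"


@dataclass
class DatasetFeatureStatistics:
    """Stand-in for the statistics of a single dataset."""
    name: str
    features: List[FeatureNameStatistics] = field(default_factory=list)


@dataclass
class DatasetFeatureStatisticsList:
    """Stand-in for the top-level list of per-dataset statistics."""
    datasets: List[DatasetFeatureStatistics] = field(default_factory=list)


stats_list = DatasetFeatureStatisticsList(datasets=[
    DatasetFeatureStatistics(name="train", features=[
        FeatureNameStatistics(name="num", type="numeric"),
        FeatureNameStatistics(name="str", type="string"),
    ]),
    DatasetFeatureStatistics(name="test", features=[
        FeatureNameStatistics(name="num", type="numeric"),
        FeatureNameStatistics(name="str", type="string"),
    ]),
])

# Walk the hierarchy: list -> dataset -> feature.
feature_index = [
    (dataset.name, feature.name)
    for dataset in stats_list.datasets
    for feature in dataset.features
]
```

Each (dataset, feature) pair in feature_index corresponds to one FeatureNameStatistics entry in the description above.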
The feature statistics differ depending on the feature data type (numeric, string, or raw bytes). For numeric features, the statistics include metrics such as min, mean, median, max and standard deviation. For string features, the statistics include metrics such as average length, number of unique values and mode.
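To make these metrics concrete, here is a minimal sketch that computes them with the Python standard library. It is not the code the generators use; the helper names are hypothetical, missing values are assumed to be represented as None, and the use of the population standard deviation is an assumption.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric column, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "max": max(present),
        # Population standard deviation (an assumption for this sketch).
        "std_dev": statistics.pstdev(present),
    }


def string_summary(values):
    """Summarize a string column, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "avg_length": sum(len(v) for v in present) / len(present),
        "unique": len(counts),
        "mode": counts.most_common(1)[0][0],
    }


num_stats = numeric_summary([1, 2, 3, 4])
str_stats = string_summary(["a", "a", "b", None])
```

For these two small columns, the numeric summary has a mean and median of 2.5, and the string summary reports two unique values with "a" as the mode.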
Feature statistics includes an optional field for weighted statistics. If the dataset has an example weight feature, it can be used to calculate weighted statistics for every feature in addition to standard statistics. If a proto contains weighted fields, then the visualization will show the weighted statistics and the user will be able to toggle between unweighted and weighted versions of the charts per feature.
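As an illustration of what a weighted statistic means, the following sketch compares an unweighted mean with a weighted mean for one feature. The weighted_mean helper and the example weights are hypothetical and are not taken from the generator code.

```python
def weighted_mean(values, weights):
    """Mean of values where each example counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [1, 2, 3, 4]
# Hypothetical example weights: the last example counts five times as much.
weights = [1.0, 1.0, 1.0, 5.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
```

Here the unweighted mean is 2.5 while the weighted mean is 3.25, because the heavily weighted last example pulls the average upward.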
Feature statistics includes an optional field for custom statistics. If there are additional statistics for features in a dataset that a team wants to track and visualize, they can be added to the custom stats field, which is a map of custom stat names to custom stat values (either numbers or strings). These custom stats will be displayed alongside the standard statistics.
The feature statistics protocol buffer can be created for datasets by the Python code provided in the facets_overview/facets-overview directory.
This code can be installed through pip install facets-overview. TensorFlow should also be installed but is not included as a pip dependency, so as to allow a user to depend on either the tensorflow or tensorflow-gpu package as necessary.
Datasets can be analyzed either from TfRecord files of TensorFlow Example protocol buffers or from pandas DataFrames.
As of version 1.1.0, the facets-overview package requires protobuf version 3.20.0 or later.
To create the proto from a pandas DataFrame, use the ProtoFromDataFrames method of the GenericFeatureStatisticsGenerator class.
To create the proto from a TfRecord file, use the ProtoFromTfRecordFiles method of the FeatureStatisticsGenerator class.
These generators depend on the numpy and pandas Python libraries.
Using the FeatureStatisticsGenerator class also requires TensorFlow to be installed.
See those files for further documentation.
Example code:
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import pandas as pd
df = pd.DataFrame({'num' : [1, 2, 3, 4], 'str' : ['a', 'a', 'b', None]})
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames([{'name': 'test', 'table': df}])
The Python code in this repository for generating feature stats only works on datasets that are small enough to fit into memory on your local machine. For distributed generation of feature stats for large datasets, check out the independently-developed Facets Overview Spark project.
A proto can easily be visualized in a Jupyter notebook using the installed nbextension.
The proto is stringified and then provided as input to a facets-overview Polymer web component, via the protoInput property on the element.
The web component is then displayed in the output cell of the notebook.
Example code (continued from above example):
from IPython.display import display, HTML
import base64
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html" >
<facets-overview id="elem"></facets-overview>
<script>
document.querySelector("#elem").protoInput = "{protostr}";
</script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
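The stringification step in the example above relies on base64 producing text that can be embedded directly in a double-quoted JavaScript string. The sketch below shows this with stand-in bytes instead of a real serialized proto, so it runs without facets-overview installed; the byte values are arbitrary and chosen only to include characters (a quote, a backslash, a newline) that would otherwise break the script.

```python
import base64
import string

# Stand-in for proto.SerializeToString(); arbitrary bytes, not a real proto.
serialized = bytes([0x0A, 0x06, 0x74, 0x65, 0x73, 0x74, 0x22, 0x27, 0x5C, 0x0A])

# Same encoding step as in the notebook example.
protostr = base64.b64encode(serialized).decode("utf-8")

# Standard base64 output only uses letters, digits, '+', '/' and '=' padding,
# so it cannot terminate or escape the surrounding JavaScript string literal.
allowed = set(string.ascii_letters + string.digits + "+/=")
is_js_safe = all(ch in allowed for ch in protostr)

# Substitute the encoded string into a template line, as the example does.
line = 'document.querySelector("#elem").protoInput = "{protostr}";'.format(
    protostr=protostr)

# Decoding recovers the original bytes exactly.
roundtrip = base64.b64decode(protostr)
```

Because the encoding is lossless, the component can recover the exact serialized bytes from the string it receives.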
The protoInput property accepts any of the following three forms of the DatasetFeatureStatisticsList protocol buffer:
The visualization contains two tables: one for numeric features and one for categorical (string) features. Each table contains a row for each feature of that type. The rows contain calculated statistics and charts showing the distribution of values for that feature across the dataset(s).
Potentially problematic statistics, such as a feature that is missing (has no value) for a large number of the examples in a dataset, are shown in red and bolded.
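The kind of check behind such highlighting can be sketched as follows. This is not the visualization's actual logic: the 20% cutoff and the helper names are hypothetical, and missing values are assumed to be represented as None.

```python
# Hypothetical cutoff; the threshold used by the visualization is not stated here.
MISSING_FRACTION_THRESHOLD = 0.2


def missing_fraction(values):
    """Fraction of examples in which the feature has no value."""
    return sum(v is None for v in values) / len(values)


def is_flagged(values, threshold=MISSING_FRACTION_THRESHOLD):
    """True when the feature is missing in more than `threshold` of examples."""
    return missing_fraction(values) > threshold


columns = {
    "num": [1, 2, 3, 4],
    "str": ["a", "a", "b", None],
}
flagged = {name: is_flagged(values) for name, values in columns.items()}
```

With this cutoff, the "str" column (missing in one of four examples) is flagged and the "num" column is not.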
At the top of the visualization are controls that affect the individual tables.
The sort-by dropdown changes the sort order for the features in each table. The options are:
The name filter input box allows filtering the tables by feature names that match the text provided.
The currently-set filter is exposed as the property searchString.
The feature checkboxes allow filtering by the type of value for each feature, such as float, int or string.
Which chart is displayed for the features in a table is controlled by a dropdown above the charts. The options for numeric features are:
The options for string features are:
Additionally, the feature statistics proto allows for custom charts to be stored for any feature. If the input proto to the visualization contains any custom charts, they will be listed in the dropdown as well.
Checkboxes next to the dropdown control some other features of the charts:
There are multiple demos of Overview that can be used as functional tests to ensure new builds are working correctly.
These demos are all found under facets_overview/functional_tests.
To run one, for example the "simple" test, run bazel run facets_overview/functional_tests/simple:devserver and then navigate your browser to "localhost:6006/facets-overview/functional-tests/simple/index.html" to see the resulting visualization.
Run bazel run facets_overview/common/test:devserver and then navigate your browser to "localhost:6006/facets-overview/facets-overview/common/test/runner.html".
The output from the tests can be seen in the developer console.
Run python setup.py bdist_wheel --universal to build the package, and then twine upload dist/* to upload it to PyPI.
After installing the Python package, run python -m feature_statistics_generator_test and python -m generic_feature_statistics_generator_test.