The AppEnricher module processes app information in the app json interchange format (for instance, as produced by AppScanner) and enriches it with additional information not present in the original sources of the app store, producing new app information that is again expressed in the app json interchange format. Three kinds of extra information are calculated and added to each app. First, the total number of reviews. Second, the new uses data structure. Third, the score information (relative to the measures defined in the module). Note that no ranking information is added: this is done in the appranker module (we keep the components separate because rank information is relative to the global dataset in use).
This module has been built for portability and ease of use, so it consists of a single stand-alone Python program. All you need is Python 3 installed on your system; then you can just download the code and run it. The behavior of the module can be tuned with the command line options described below. Being a Python program, it is also easy to inspect the code, see how it works, and modify it if you wish.
Command line: appenricher
The AppEnricher module can be launched directly without any parameters: in that case, it uses the default locations for the input and output directories (see below).
The AppEnricher behaviour can be modified via the following command line options (shorter or longer formats):
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        the input json directory (default: google/results)
  -o OUTPUT, --output OUTPUT
                        the output json directory (default: google/enriched)
  -b BASE, --base BASE  base directory for relative directory paths (default: .)
  -k, --keep            existing files from output directory are kept (not deleted) (default: False)
  -v, --verbose         verbose informative output (default: True)
Some more information about the options:
-i, --input
-o, --output
-b, --base
-k, --keep
-v, --verbose
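The option list above has the shape of standard argparse help output, so a parser with the same options and defaults can be sketched as follows. This is a hedged illustration, not the module's actual code: the function names build_parser and resolve_dirs are hypothetical, and joining the input and output directories onto the base directory is an assumption based on the description of --base.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the documented appenricher options (sketch)."""
    parser = argparse.ArgumentParser(
        prog="appenricher",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-i", "--input", default="google/results",
                        help="the input json directory")
    parser.add_argument("-o", "--output", default="google/enriched",
                        help="the output json directory")
    parser.add_argument("-b", "--base", default=".",
                        help="base directory for relative directory paths")
    parser.add_argument("-k", "--keep", action="store_true",
                        help="existing files from output directory are kept (not deleted)")
    # The documented default is True; how the real module switches verbosity
    # off is not documented, so this flag is effectively always on here.
    parser.add_argument("-v", "--verbose", action="store_true", default=True,
                        help="verbose informative output")
    return parser


def resolve_dirs(args):
    """Assumption: input and output directories are resolved against --base."""
    return (os.path.join(args.base, args.input),
            os.path.join(args.base, args.output))


if __name__ == "__main__":
    in_dir, out_dir = resolve_dirs(build_parser().parse_args())
    print(in_dir, out_dir)
```

Run with no arguments, the sketch falls back to the documented defaults (google/results and google/enriched under the current directory).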
Score measures are defined with special high-level definitions that let you write general scoring measures (so you can easily write your own measures combining privacy, security, and other information about the app).
In order to understand how to write a measure, we can start with a simple example of measure and describe how it works.
Take for instance the sample measure Privacy Aware, as currently defined in appenricher:

    "measure": {
        "name": "Privacy Aware",
        "description": {
            "short": "Avoid implicit marketing tracking",
            "full": "This measure penalizes apps for every user information that is "
                    "obligatorily gathered for advertisement or marketing"
        },
        "id": "org.mobosearch.scores.aware",
        "conditions": [
            { "match": { "uses": { "Advertising or marketing": { "mandatory": "*" } } } },
            { "score": 65536, "match": { "uses": { "Undefined": True } } }
        ]
    }
First, the example shows the basic information structure of a measure: it is defined by a measure property that contains the name of the measure (via the name property). Then there is a textual description of the measure, expressed by the description property and its two subproperties short (the short textual description of the measure) and full (the longer textual description). Finally, the id property contains a string with the unique identifier of the measure.
After this information, the conditions property contains the actual definition of the measure. At a high level, the idea is to use a special kind of lazy structural matching in order to compute the final score. Let's explain how it works by discussing the previous measure. The conditions property contains a list of conditions, each expressed by a match rule (given by the match property) and, optionally, a numerical score (given by the score property). When no explicit score is provided, it defaults to 1: therefore, in the above example, writing the first condition without a score is equivalent to writing "score": 1.
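The structure just described, including the default score of 1, can be made explicit with a small helper. This is only a sketch: the name normalize_measure is hypothetical, and treating name, description, id and conditions as all required is an assumption, since the text does not say which properties are optional.

```python
def normalize_measure(measure):
    """Check the documented properties and make default scores explicit."""
    # Assumption: all four top-level properties are required.
    for key in ("name", "description", "id", "conditions"):
        if key not in measure:
            raise ValueError(f"missing measure property: {key}")
    for key in ("short", "full"):
        if key not in measure["description"]:
            raise ValueError(f"missing description subproperty: {key}")
    # A condition without an explicit score counts as score 1.
    conditions = [
        {"score": cond.get("score", 1), "match": cond["match"]}
        for cond in measure["conditions"]
    ]
    return {**measure, "conditions": conditions}


privacy_aware = {
    "name": "Privacy Aware",
    "description": {
        "short": "Avoid implicit marketing tracking",
        "full": "This measure penalizes apps for every user information that is "
                "obligatorily gathered for advertisement or marketing",
    },
    "id": "org.mobosearch.scores.aware",
    "conditions": [
        {"match": {"uses": {"Advertising or marketing": {"mandatory": "*"}}}},
        {"score": 65536, "match": {"uses": {"Undefined": True}}},
    ],
}

normalized = normalize_measure(privacy_aware)
print([cond["score"] for cond in normalized["conditions"]])
```

After normalization, the first condition carries an explicit score of 1 and the second keeps its 65536.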
The value of each match property is a json structure, which is used as a pattern for structural matching against the app json. Whenever a match is found, the corresponding score is added, so the final score is the sum over all occurring matches (if there are no matches, the default value of zero is returned). Let's see how it works by going back to the previous example.
The first match condition checks whether, among the various data uses of an app, there is the use Advertising or marketing containing a mandatory part of any kind. Here, the string value "*" acts as a structural wildcard pattern: it matches any substructure, all the way down to its final values. In natural language terms, the condition checks whether the app has some data that is unconditionally (mandatorily) used for ads/marketing purposes: for each such data item, the rule matches and produces the corresponding score (in this case, the default score 1), which is added to the overall score. So, after this rule is executed, the score is equal to the number of data items that are mandatorily used for advertisement or marketing purposes (either collected by the developer or shared, without distinction).
The second matching condition, as we can see, is instead checking whether the data uses are undefined: in this case (obviously terrible from a privacy viewpoint, given that the developer is not disclosing what data is collected!) we give a very high penalty score (65536).
Note how this way of defining measures makes it easy to write general or specific rules that depend on any structural information present in the app. To continue with the previous example, we could for instance modify the measure by adding another rule:
{ "score": -1, "match": { "price": { "amount": 0 } } }
This rule checks whether the price of the app is 0 (that is, a free app), and if so it adds a score of -1. In other words, if the app is free we give a bonus point, subtracting 1 from the total penalties calculated by the rest of the measure. This simple example illustrates the general fact that a measure can reason globally on any information component of an app, not just on its privacy parts.
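The matching and summing described above, including the free-app rule, can be sketched in a few lines of Python. This is a simplified, eager illustration and not the module's actual lazy matcher. The function names, the layout of the sample app (collected and shared lists under mandatory) and the handling of patterns with several keys (match counts are multiplied) are all assumptions made for the example.

```python
def count_matches(pattern, data):
    """Count how many times `pattern` structurally matches `data`."""
    if pattern == "*":
        # Wildcard: count every final value below this point.
        if isinstance(data, dict):
            return sum(count_matches("*", v) for v in data.values())
        if isinstance(data, list):
            return sum(count_matches("*", v) for v in data)
        return 1
    if isinstance(pattern, dict):
        if not isinstance(data, dict):
            return 0
        total = 1
        for key, sub in pattern.items():
            if key not in data:
                return 0
            # Assumption: with several keys, match counts are multiplied.
            total *= count_matches(sub, data[key])
        return total
    # Literal value: booleans only match booleans, so True never matches 1.
    if isinstance(pattern, bool) or isinstance(data, bool):
        return int(pattern is data)
    return int(pattern == data)


def score_app(conditions, app):
    """Sum score * number of matches over all conditions (default score 1)."""
    return sum(cond.get("score", 1) * count_matches(cond["match"], app)
               for cond in conditions)


CONDITIONS = [
    {"match": {"uses": {"Advertising or marketing": {"mandatory": "*"}}}},
    {"score": 65536, "match": {"uses": {"Undefined": True}}},
    {"score": -1, "match": {"price": {"amount": 0}}},
]

# Hypothetical app json fragment, invented for this example.
free_app = {
    "uses": {
        "Advertising or marketing": {
            "mandatory": {"collected": ["Email address", "Location"],
                          "shared": ["Device ID"]},
        },
    },
    "price": {"amount": 0},
}
undisclosed_app = {"uses": {"Undefined": True}, "price": {"amount": 0}}

print(score_app(CONDITIONS, free_app))         # 3 mandatory ad items, minus 1 bonus: 2
print(score_app(CONDITIONS, undisclosed_app))  # 65536 penalty, minus 1 bonus: 65535
```

In this sketch the wildcard counts final values, so three mandatory advertising items yield three penalty points before the free-app bonus is applied.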
The source code of the module can be directly downloaded here (note that at public launch we will also provide a GitHub link).