Saturday, January 13, 2018

Crowdsourcing a Labeling task using Amazon Mechanical Turk


Happy New Year! My New Year's resolution for 2018 is, perhaps unsurprisingly, to blog more frequently than I have in 2017.

Despite the recent advances in unsupervised and reinforcement learning, supervised learning remains the most time-tested and reliable method to build Machine Learning (ML) models today, as long as you have enough training data. Among ML models, Deep Learning (DL) has proven particularly effective in many domains. DL's greatest advantage is its ability to learn all sorts of non-linear feature interactions automatically. In return, all it asks for is more processing power and more training data.

With the ubiquity of the computer and the Internet in our everyday lives, it is not surprising that our very act of collective living generates vast amounts of data. In many cases it is possible, with a little bit of ingenuity, to discover implicit labels in this data, making the data usable for training supervised DL models. In most other cases, we are not so lucky and must take explicit steps to generate these labels. Traditionally, people have engaged human experts to do the labeling from scratch, but this is usually very expensive and time-consuming. More recently, the trend is to generate noisy labels using unsupervised techniques, and validate them using human feedback.

Which brings me to the subject of my current post. One way to get this human feedback is through Amazon's Mechanical Turk (AMT or MTurk), where you can post a Human Intelligence Task (HIT) and have people do these HITs in return for micropayments made through the MTurk network. In this post, I describe the process of creating a collection of HITs and making them available for MTurk workers (aka turkers), then collecting the resulting labels.

Problem Description


I was trying to generate tags for snippets of text. These tags are intended to be keywords that are self-contained and describe some aspect of the text. And yes, I realize that this looks like something plain old search could do as well, but bear with me here -- this data is a first step of a larger pipeline and I do need these multi-word labels.

So each record consists of a snippet of text and 5 multi-word candidate labels. The labels are generated using various unsupervised techniques, some rule-based and some that exploit statistical features of language. Because the scores are not comparable across the various techniques, we select the top 10 percent from each set, then randomly choose 5 labels for each snippet from the merged label pool.
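As a rough sketch of that merge-and-sample step (the function name, the scoring format, and the seed below are my own invention for illustration, not the actual pipeline code):

```python
import random

def select_candidates(scored_labels_by_technique, top_pct=0.10, n_sample=5, seed=42):
    """Merge top-scoring labels from each technique, then sample candidates.

    scored_labels_by_technique: list of [(label, score), ...] lists, one list
    per technique. Since scores are not comparable across techniques, we take
    the top `top_pct` fraction from each list separately before merging.
    """
    pool = []
    for scored in scored_labels_by_technique:
        ranked = sorted(scored, key=lambda ls: ls[1], reverse=True)
        top_k = max(1, int(len(ranked) * top_pct))
        pool.extend(label for label, _ in ranked[:top_k])
    # randomly choose up to n_sample labels from the merged pool
    random.seed(seed)
    return random.sample(pool, min(n_sample, len(pool)))
```

In the real pipeline this runs per snippet, with each technique contributing its own scored candidate list.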

The first step is to pre-pay for the HITs and push them to the MTurk site, where they become visible to turkers, some of whom will take them on and complete them. After the specified number of turkers have completed the HITs to assign their crowdsourced labels and we accept their work, they get paid by AMT, and we can download their work. MTurk provides an API that allows you to upload the HITs and retrieve the crowdsourced labels, which is what I will talk about here. My coverage is more from a programming standpoint, so I have done all of this against the MTurk sandbox site, which is free to use.

In terms of required software, I recently upgraded to Anaconda Python3. The other libraries used are boto3 to handle the network connections, the jinja2 templating engine (included with Anaconda) to generate the XML for the HIT in the MTurk request, and xmltodict to parse the XML payloads in the MTurk response into Python data structures. Both boto3 and xmltodict can be installed using pip install. I also had a lot of help from the post Tutorial: A beginner's guide to crowdsourcing ML training data with Python and MTurk on the MTurk blog.

Creating HITs and uploading to MTurk


The unsupervised algorithms are run on our Apache Spark based analytics platform, and the top results from each are merged. A sample of these merged results is downloaded and used as input for creating the HITs. The input data looks like this:
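Each line of the input TSV has a snippet ID, the snippet text, and the 5 candidate keywords, all tab-separated (the layout below mirrors what the parsing code further down expects; the field names are placeholders, not real values):

```
iid <TAB> snippet <TAB> keyword_1 <TAB> keyword_2 <TAB> keyword_3 <TAB> keyword_4 <TAB> keyword_5
```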


The first step is to establish a connection to the MTurk (sandbox) server. For this, you need to have an AWS account, an MTurk development/requester account, and also link your AWS account to the MTurk account. This AWS Documentation page covers these steps in more detail. Once you are done, you should be able to establish a connection to the sandbox and see how much pretend money you have in the sandbox to pay your pretend workers.

from jinja2 import Template
import boto3
import os

# constants
MTURK_SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
MTURK_REGION = "us-east-1"
MTURK_PREVIEW_URL = "https://workersandbox.mturk.com/mturk/preview?groupId={:s}"

DATA_DIR = "../data"
HIT_ID_FILE = os.path.join(DATA_DIR, "best-keywords-hitids.txt")

NUM_QUESTIONS_PER_HIT = 10

# extract AWS credentials from local file
creds = []
CREDENTIALS_FILE = "/path/to/amazon-credentials.txt"
with open(CREDENTIALS_FILE, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        _, cred = line.strip().split("=")
        creds.append(cred)

# verify that we can access MTurk sandbox server
mturk = boto3.client('mturk',
   aws_access_key_id=creds[0],
   aws_secret_access_key=creds[1],
   region_name=MTURK_REGION,
   endpoint_url=MTURK_SANDBOX
)
print("Sandbox account pretend balance: ${:s}".format(
    mturk.get_account_balance()["AvailableBalance"]))

We have (in our example) just 24 snippets with associated keywords. I want to group them into 10 snippets per HIT, so I have 3 HITs with 10, 10 and 4 snippets respectively. In reality you want a larger number for labeling, but since I was in development mode, I was the person doing the HIT each time, and I wanted to minimize my effort. At the same time, I wanted to make sure I could group my input into HITs of 10 snippets each, hence the choice of 24 snippets.

Each HIT needs to be formatted as an HTML form, which is then embedded inside an HTMLQuestion tag that is part of the XML syntax MTurk understands. Since we wanted to put multiple snippets into a single HIT, it was more convenient to use the looping capabilities of the Jinja2 templating engine than to rely on Python's native templating through format() calls. Here is the template for our HIT.

full_xml = Template("""
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
    <HTMLContent><![CDATA[
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
        <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
    </head>
    <body>
        <form name="mturk_form" method="post" id="mturk_form" 
              action="https://www.mturk.com/mturk/externalSubmit">
        <input type="hidden" value="" name="assignmentId" id="assignmentId" />
        <ol>
        {% for row in rows %}
            <input type="hidden" name="iid_{{ row.id }}" value="{{ row.iid }}"/>
            <li>
                <b>Select all keywords appropriate for the snippet below:</b><br/>
                {{ row.snippet }}
                <p>
                <input type="checkbox" name="k_{{ row.id }}_1">{{ row.keyword_1 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_2">{{ row.keyword_2 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_3">{{ row.keyword_3 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_4">{{ row.keyword_4 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_5">{{ row.keyword_5 }}<br/>
                </p>
            </li>
            <hr/>
        {% endfor %}
        </ol>

            <p><input type="submit" id="submitButton" value="Submit"/>
            </p>
        </form>
        <script language='Javascript'>turkSetAssignmentID();</script>
    </body>
</html>
]]>
    </HTMLContent>
    <FrameHeight>600</FrameHeight>
</HTMLQuestion>
""")

We then group our data into batches of 10 rows, build a rows data structure in which each row is a dictionary of field names and values, and render the template above against it. The resulting XML is sent to the MTurk sandbox server using boto3. Each call creates a single HIT, and the server returns a corresponding HIT Id, which we save for later use. It also returns a HIT group Id, which we use to generate a set of preview URLs.

We have modeled each group of 10 snippets as a completely separate HIT, with its own unique title (trailing #n). We could also have run multiple create_hit calls using the same title, in which case a group of HITs is created under the same title. However, I noticed that I sometimes got back duplicate HIT Ids in that case, so I went with the separate-HIT-per-10-snippets strategy.

I also found a good use for the Keywords parameter -- if you put some oddball term in there, you can share it with your team to bring up the list of HITs you want them to look at.

def create_hit(mturk, question, hit_seq):
    hit = mturk.create_hit(
        Title="Best Keywords in Caption #{:d}".format(hit_seq),
        Description="Find best keywords in caption text",
        Keywords="aardvaark",
        Reward="0.10",
        MaxAssignments=1,
        LifetimeInSeconds=172800,
        AssignmentDurationInSeconds=600,
        AutoApprovalDelayInSeconds=14400,
        Question=question
    )
    group_id = hit["HIT"]["HITGroupId"]
    hit_id = hit["HIT"]["HITId"]
    return group_id, hit_id


rows = []
hit_group_ids, hit_ids = [], []
hit_seq = 1
with open(os.path.join(DATA_DIR, "best-keywords.tsv"), "r") as f:
    for lid, line in enumerate(f):
        if lid > 0 and lid % NUM_QUESTIONS_PER_HIT == 0:
            question = full_xml.render(rows=rows)
            hit_group_id, hit_id = create_hit(mturk, question, hit_seq)
            hit_group_ids.append(hit_group_id)
            hit_ids.append(hit_id)
            rows = []
            hit_seq += 1
        iid, snippet, key_1, key_2, key_3, key_4, key_5 = line.strip().split("\t")
        row = {
            "id": (lid + 1),
            "iid": iid, 
            "snippet": snippet,
            "keyword_1": key_1,
            "keyword_2": key_2,
            "keyword_3": key_3,
            "keyword_4": key_4,
            "keyword_5": key_5,
        }
        rows.append(row)
        
if len(rows) > 0:
    question = full_xml.render(rows=rows)
    hit_group_id, hit_id = create_hit(mturk, question, hit_seq)
    hit_group_ids.append(hit_group_id)
    hit_ids.append(hit_id)

# save the HIT Ids to a flat file so we can retrieve results later
with open(HIT_ID_FILE, "w") as fhit:
    for hit_id in hit_ids:
        fhit.write("{:s}\n".format(hit_id))

# preview URLs for the uploaded HIT groups
for group_id in hit_group_ids:
    print(MTURK_PREVIEW_URL.format(group_id))

The code above results in a flat file of HIT Ids that I can use to recall results for these HITs later. You can also see your HITs appear as shown below:


As you might expect, this is one giant form consisting of text snippets, each followed by a checkbox group of 5 candidate keywords, terminated with a single Submit button. I am not sure if you can add Javascript support for more sophisticated use cases, but you can do a lot with plain HTML5 nowadays. Here is what (part of) the form looks like, marked up by the dev turker (me :-)).



Retrieving crowdsourced labels on HITs from MTurk


In a real-life scenario, the HITs would be on the MTurk production server and real humans would (hopefully) find my micro-payment of 10 cents per HIT adequate and do the marking up for me. I have configured my HIT to have MaxAssignments=1, which means I want only 1 worker to work on the HIT -- in reality, you want at least 3 people to work on each HIT so you can do a majority vote (or something more sophisticated) on their labels. In any case, once all your HITs have been handled by the required number of turkers, it is time to download the results.
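For the multi-worker case, a simple majority vote could be sketched like this (the helper below is my own illustration; it consumes tuples shaped like the TSV output this post produces, and the default threshold of a strict majority is an assumption):

```python
from collections import Counter

def majority_vote(records, num_workers, min_votes=None):
    """records: list of (worker_id, snippet_id, [keyword_ids]) tuples.
    Keep a keyword for a snippet if at least min_votes workers selected
    it (default: a strict majority of num_workers)."""
    if min_votes is None:
        min_votes = num_workers // 2 + 1
    votes = {}
    for _, snippet_id, keyword_ids in records:
        counter = votes.setdefault(snippet_id, Counter())
        counter.update(keyword_ids)
    # keep only keywords that cleared the vote threshold
    return {sid: sorted(k for k, c in counter.items() if c >= min_votes)
            for sid, counter in votes.items()}
```

Something more sophisticated, such as weighting workers by their historical agreement, could be dropped in at the same point.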

Results for a HIT can be retrieved using the list_assignments_for_hit() method of the MTurk client -- you need the HIT Id for the HIT that was returned during HIT creation, and which we had stored away for use now. The response from the MTurk server is a JSON response, with the actual Answer value packaged as an XML payload. We use the xmltodict.parse() method to parse this payload into a Python data structure, which we then pick apart to write out the output.

import boto3
import os
import xmltodict

# constants
MTURK_SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
MTURK_REGION = "us-east-1"

DATA_DIR = "../data"
HIT_ID_FILE = os.path.join(DATA_DIR, "best-keywords-hitids.txt")
RESULTS_FILE = os.path.join(DATA_DIR, "best-keywords-results.txt")

# extract AWS credentials from local file
creds = []
CREDENTIALS_FILE = "/path/to/amazon-credentials.txt"
with open(CREDENTIALS_FILE, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        _, cred = line.strip().split("=")
        creds.append(cred)

# verify access to MTurk
mturk = boto3.client('mturk',
   aws_access_key_id=creds[0],
   aws_secret_access_key=creds[1],
   region_name=MTURK_REGION,
   endpoint_url=MTURK_SANDBOX
)
print("Sandbox account pretend balance: ${:s}".format(
    mturk.get_account_balance()["AvailableBalance"]))

# get HIT Ids stored from during HIT creation
hit_ids = []
with open(HIT_ID_FILE, "r") as f:
    for line in f:
        hit_ids.append(line.strip())

# retrieve MTurk results
fres = open(RESULTS_FILE, "w")
for hit_id in hit_ids:
    results = mturk.list_assignments_for_hit(HITId=hit_id, 
        AssignmentStatuses=['Submitted'])
    if results["NumResults"] > 0:
        for assignment in results["Assignments"]:
            # collect one set of answers per assignment (i.e., per worker),
            # so multiple assignments on the same HIT don't clobber each other
            snippet_ids, keyword_ids = {}, {}
            worker_id = assignment["WorkerId"]
            answer_dict = xmltodict.parse(assignment["Answer"])
            answer_dict_2 = answer_dict["QuestionFormAnswers"]["Answer"]
            for answer_pair in answer_dict_2:
                field_name = answer_pair["QuestionIdentifier"]
                field_value = answer_pair["FreeText"]
                if field_name.startswith("iid_"):
                    id = field_name.split("_")[1]
                    snippet_ids[id] = field_value
                    keyword_ids[id] = []
                else:
                    _, iid, kid = field_name.split("_")
                    keyword_ids[iid].append(kid)
            for id, iid in snippet_ids.items():
                selected_kids = ",".join(keyword_ids[id])
                fres.write("{:s}\t{:s}\t{:s}\n".format(worker_id, iid, selected_kids))

fres.close()

The output of this step is a TSV file that contains the worker ID, the snippet ID, and a comma-separated list of keyword IDs that were found to be meaningful by the turker(s). This can now be joined with the original input file to find the preferred labels.

A2AQYARTZTL5EE S0735109710021418-gr5 2,3
A2AQYARTZTL5EE S0894731707005962-gr1 2,4
A2AQYARTZTL5EE S1740677311000118-gr2 1,5
A2AQYARTZTL5EE S0031938414005393-gr2 1,3,4,5
A2AQYARTZTL5EE S1542356515000415-gr2 3
A2AQYARTZTL5EE S1521661616300158-gr8 2
A2AQYARTZTL5EE S0091743514001212-gr2 1,2,3
A2AQYARTZTL5EE S0735109712023662-gr2 1,2,3
A2AQYARTZTL5EE S0026049509000456-gr1 
A2AQYARTZTL5EE S0079610715000103-gr3 1,3
...

This is all I have for today. I hope you enjoyed the post and found it useful. I believe crowdsourcing will become more important as people begin to realize the benefits of weak supervision, and the MTurk API makes it quite easy to set up these kinds of jobs.