MSDialog

Introduction


The MSDialog dataset is a labeled dialog dataset of question answering (QA) interactions between information seekers and answer providers, collected from an online forum on Microsoft products (Microsoft Community). The dataset contains more than 2,000 multi-turn information-seeking conversations with over 10,000 utterances that are annotated with user intent at the utterance level. Annotations were done through crowdsourcing on Amazon Mechanical Turk. MSDialog has several versions, including the complete set (MSDialog-Complete) and a labeled subset (MSDialog-Intent). We also preprocessed the data to produce MSDialog-ResponseRank for conversation response ranking.

MSDialog-Complete


We crawled over 35,000 dialogs from Microsoft Community, a forum that provides technical support for Microsoft products. This well-moderated forum contains user-generated questions with high-quality answers provided by Microsoft staff and other experienced users, including Microsoft Most Valuable Professionals (MVPs). In technical support online forums, a thread is typically initiated by a user-generated question and answered by experienced users (agents). The users may also exchange clarifications with the agents or give feedback based on answer quality. Thus the flow of a technical support thread resembles the information-seeking process if we consider threads as dialogs and posts as turns/utterances in dialogs. In addition to the dialog title and utterances, we also collected rich metadata, including question popularity, answer votes, and user affiliation.

Data Fields


Example Data Format


{
    "20481": {
        "category": "Word",
        "dialog_time": "2017-09-21T04:15:54",
        "title": "Line and paragraph spacing in Office Word 2007",
                        "frequency": "0",
        "utterances": [{
            "affiliation": "Common User",
            "utterance_time": "2017-09-21T04:15:54",
            "utterance_pos": 1,
            "id": 192941,
            "user_id": "Michael",
            "actor_type": "User",
            "utterance": "Hello. Whenever I open a new Office Word ...",
            "is_answer": 0,
            "vote": "Freq_0"
        }, {
            "affiliation": "MVP",
            "utterance_time": "2017-09-21T05:16:23",
            "utterance_pos": 2,
            "id": 192944,
            "user_id": "Robin",
            "actor_type": "Agent",
            "utterance": "When using ...",
            "is_answer": 0,
            "vote": "0"
        }, { more utterances ... }]
    }
}
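
As a minimal sketch of working with this format (assuming the complete set is distributed as a single JSON file, here called "MSDialog.json"; the file name is an assumption, not part of the release), the dialogs can be loaded and iterated as follows:

import json

# Load the complete dialog collection. The file name is an assumption;
# use whatever name the downloaded archive provides.
with open("MSDialog.json", encoding="utf-8") as f:
    dialogs = json.load(f)

for dialog_id, dialog in dialogs.items():
    print(dialog_id, dialog["category"], dialog["title"])
    for utt in dialog["utterances"]:
        # Each utterance records who spoke ("User" or "Agent"), their
        # affiliation, and the utterance text itself.
        print(f'  [{utt["utterance_pos"]}] {utt["actor_type"]} '
              f'({utt["affiliation"]}): {utt["utterance"][:60]}...')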


MSDialog-Intent


Based on MSDialog-Complete, we selected a subset of dialogs for user intent annotation on AMT. To ensure the quality and consistency of the dataset, we selected about 2,400 dialogs that meet the following criteria for annotation: (1) with 3 to 10 turns; (2) with 2 to 4 participants; (3) with at least one correct answer selected by the community; (4) falling into one of the categories Windows, Office, Bing, and Skype, which are major categories of Microsoft products. We classify user intent in dialogs into the 12 classes shown in the following table. Each utterance can be assigned multiple labels because an utterance can be associated with multiple intents (e.g., GG + FQ).

Taxonomy

Code | Label | Description | Example
OQ | Original Question | The first question by a user that initiates the QA dialog. | If a computer is purchased with win 10 can it be downgraded to win 7?
RQ | Repeat Question | Posters other than the original user repeat a previous question. | I am experiencing the same problem ...
CQ | Clarifying Question | Users or agents ask for clarification to get more details. | Your advice is not detailed enough. I'm not sure what you mean by ...
FD | Further Details | Users or agents provide more details. | Hi. Sorry for taking so long to reply. The information you need is ...
FQ | Follow Up Question | Users ask follow-up questions about relevant issues. | Thanks. I really have one more simple question -- if I ...
IR | Information Request | Agents ask for information from users. | What is the make and model of the computer? Have you tried installing ...
PA | Potential Answer | A potential answer or solution provided by agents. | Hi. To change your PIN in Windows 10, you may follow the steps below: ...
PF | Positive Feedback | Users provide positive feedback for working solutions. | Hi. That was exactly the right fix. All set now. Tx!
NF | Negative Feedback | Users provide negative feedback for useless solutions. | Thank you for your help, but the steps below did not resolve the problem ...
GG | Greetings/Gratitude | Users or agents greet each other or express gratitude. | Thank you all for your responses to my question ...
JK | Junk | There is no useful information in the post. | Emojis. Sigh .... Thread closed by moderator ...
O | Others | Posts that cannot be categorized using other classes. | N/A

Data Fields

Same as MSDialog-Complete, with an extra user intent field under "utterances" called "tags". "tags" may include multiple user intent labels separated by spaces (e.g., "GG OQ").
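
For illustration, a hedged sketch of converting the space-separated "tags" string into a multi-hot label vector over the 12 intent codes (the helper name and the ordering of the codes are our own, not part of the release):

# The 12 intent codes from the taxonomy above; the ordering here is arbitrary.
INTENT_CODES = ["OQ", "RQ", "CQ", "FD", "FQ", "IR", "PA", "PF", "NF", "GG", "JK", "O"]

def tags_to_multihot(tags: str) -> list:
    """Convert a space-separated tag string (e.g. "GG OQ") to a multi-hot vector."""
    present = set(tags.split())
    return [1 if code in present else 0 for code in INTENT_CODES]

print(tags_to_multihot("GG OQ"))  # -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]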

Statistics

Items | Min | Max | Mean | Median
# Turns Per Dialog | 3 | 10 | 4.56 | 4
# Participants Per Dialog | 2 | 4 | 2.79 | 3
Dialog Length (Words) | 27 | 1,469 | 296.90 | 241
Utterance Length (Words) | 1 | 939 | 65.16 | 47

Item | MSDialog-Complete | MSDialog-Intent
# Dialogs | 35,000 | 2,199
# Utterances | 300,000 | 10,020
Avg. # Participants | 3.18 | 2.79
Avg. # Turns Per Dialog | 8.94 | 4.56
Avg. # Words Per Utterance | 75.91 | 65.16

We also provide the data split of MSDialog-Intent that we used for our paper "User Intent Prediction in Information-seeking Conversations". The split version is referred to as MSDialog-IntentPred. We also include the feature files; feel free to refer to our source code to see how the feature files are generated.

MSDialog-ResponseRank


We also preprocessed the MSDialog-Complete data to construct a benchmark data set for response ranking in information-seeking conversations. Starting from MSDialog-Complete, we filtered out dialogs whose number of turns falls outside the range [3, 99]. We then split the data into training/validation/testing partitions by question time. Specifically, the training data contains 25,019 dialogs from "2005-11-12" to "2017-08-20", the validation data contains 4,654 dialogs from "2017-08-21" to "2017-09-20", and the testing data contains 5,064 dialogs from "2017-09-21" to "2017-10-04".
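
A minimal sketch of this filtering and time-based split, assuming the dialogs are loaded as in the MSDialog-Complete example above and that the question time corresponds to the "dialog_time" field (boundary dates are treated as inclusive; the exact boundary handling in the released split may differ):

from datetime import datetime

def split_by_time(dialogs):
    """Partition dialogs into train/valid/test by dialog time."""
    train, valid, test = {}, {}, {}
    for did, d in dialogs.items():
        # Keep only dialogs with a number of turns in [3, 99].
        if not 3 <= len(d["utterances"]) <= 99:
            continue
        day = datetime.fromisoformat(d["dialog_time"]).date().isoformat()
        if day <= "2017-08-20":
            train[did] = d
        elif day <= "2017-09-20":
            valid[did] = d
        else:
            test[did] = d
    return train, valid, test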

The next step is to generate the dialog context and response candidates. For each utterance by the "User" (we consider every user utterance except the first, since the first has no associated dialog context), we collected the previous c utterances as the dialog context, where c = min(t-1, 10) and t-1 is the total number of utterances before the t-th utterance. The true response by the "Agent" becomes the positive response candidate. For the negative response candidates, we followed previous work and adopted negative sampling. For each dialog context, we first used the true response as the query to retrieve the top 1,000 results from the whole set of agent responses with BM25, and then randomly sampled 9 of them as the negative response candidates. For more details of the data preprocessing, please see our SIGIR '18 paper on response ranking in information-seeking conversations, listed in the Citations section.
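
The candidate construction can be sketched as follows, using the rank_bm25 package as a stand-in for the BM25 retrieval step described above (the package choice, whitespace tokenization, and function name are our assumptions, not the authors' pipeline):

import random
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def build_candidates(true_response, agent_responses, n_neg=9, pool_size=1000):
    """Return the positive response plus n_neg BM25-sampled negative candidates."""
    tokenized = [r.lower().split() for r in agent_responses]
    bm25 = BM25Okapi(tokenized)
    # Use the true agent response as the query against the full agent response set.
    scores = bm25.get_scores(true_response.lower().split())
    ranked = sorted(range(len(agent_responses)), key=lambda i: -scores[i])
    pool = [agent_responses[i] for i in ranked[:pool_size]
            if agent_responses[i] != true_response]
    negatives = random.sample(pool, min(n_neg, len(pool)))
    return [(1, true_response)] + [(0, r) for r in negatives]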

Data Fields

MSDialog-ResponseRank includes three TSV files for the training, validation, and testing of response ranking models. The format of these files is as follows:

label \t utterance_1 \t utterance_2 \t ...... \t candidate_response

This format is also adopted by the Ubuntu Dialog Corpus used in several papers. Each line corresponds to a conversation context/candidate response pair. Suppose there are n_i tab-separated columns in the i-th line. The first column is a binary label indicating whether the candidate response is the positive candidate returned by the agent or a sampled negative candidate. The next (n_i - 2) columns are the utterances in the conversation context, including the current input utterance by the user. The last column is the candidate response, prefixed with "<<<AGENT>>>:".
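
Under these conventions, a line of the TSV files can be parsed with a few lines of code (a minimal sketch; the helper and file names are ours):

def parse_line(line: str):
    """Split one TSV line into (label, context utterances, candidate response)."""
    cols = line.rstrip("\n").split("\t")
    label = int(cols[0])    # 1 = true agent response, 0 = sampled negative
    context = cols[1:-1]    # all utterances in the conversation context
    response = cols[-1]     # candidate response, prefixed with "<<<AGENT>>>:"
    return label, context, response

with open("train.tsv", encoding="utf-8") as f:  # file name is an assumption
    for line in f:
        label, context, response = parse_line(line)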

Example Data Format

We show an example conversation context/response pair below. For better readability, each column of the line is placed on its own row.
        0
        \t
        I upgraded last week with no apparent problems and used Sticky Notes as recently as two nights ago. ......  Sticky Notes doesn't even appear in the list of programs on this machine.  Argh!  Help!!!!  I have information on those notes I NEED.
        \t
        Hello,  Thank you for contacting Microsoft Community.  I can understand the inconvenience caused, be assured that we are here to help you with your concern.   Method: 1 SFC scan: Method: 2 If the issue persists I would suggest you to put .....         
        \t
        Hi Jenith,  It's an Acer Aspire 5750-P5WE0, previously running Windows 7 Home Premium.  I do not get any error messages if I open C:\Windows.old\Users, but all .....
        \t
         <<<AGENT>>>: We suggest that you perform a Clean Boot and try to reset the app. You may refer to this Microsoft article for more information.  Note: Follow the ......

Statistics

The statistics of the MSDialog-ResponseRank data are as follows:

Items | Train | Valid | Test
# Context-response pairs | 173,680 | 37,210 | 35,110
# Candidates per context | 10 | 10 | 10
# Positive candidates per context | 1 | 1 | 1
Min # turns per context | 2 | 2 | 2
Max # turns per context | 11 | 11 | 11
Avg # turns per context | 5.0 | 4.9 | 4.4
Avg # words per context | 271.4 | 263.2 | 227.4
Avg # words per response | 66.7 | 67.6 | 66.8

Note that the statistics on average words per context/response are based on the preprocessed version of the data, after removing stop words and words that appear fewer than 5 times in the whole corpus.

User ID Anonymization before Data Release

To protect user privacy, we performed user ID anonymization on all versions of MSDialog prior to data release. For MSDialog-Complete and MSDialog-Intent, we replaced the user IDs in the "user_id" data field, and their mentions in utterances, with fake user IDs. For MSDialog-ResponseRank, we used the Stanford Named Entity Recognizer to recognize person names in the data and replaced them with "PERSON_PLACEHOLDER". Note that the anonymization process may affect the results reported in our paper.
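
For readers who want to reproduce a comparable anonymization step on their own data, the following sketch uses the Stanza toolkit as a stand-in for the Stanford Named Entity Recognizer used for the released data (the toolkit choice is ours, and its output will differ from the official release):

import stanza  # pip install stanza; run stanza.download("en") once beforehand

nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

def anonymize(text: str) -> str:
    """Replace recognized person names with PERSON_PLACEHOLDER."""
    doc = nlp(text)
    out = text
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.type == "PERSON":
            out = out[:ent.start_char] + "PERSON_PLACEHOLDER" + out[ent.end_char:]
    return out

print(anonymize("Hi Jenith, thanks for the quick reply."))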

Instructions on Getting the Data


Any researchers interested in using the dataset for internal research should contact info@ciir.cs.umass.edu for access. Please include your name, institution, and the country you will be downloading from. If you are a legitimate researcher, we will supply you with a password for a URL where you can download the data.

By downloading the data, you agree that the data will be used only for internal research and that you will not share the dataset(s) with others.

Citations


If you use MSDialog in your paper, please include citations to the following papers:

BibTeX

        @inproceedings{InforSeek_User_Intent,
            author = {Qu, C. and Yang, L. and Croft, W. B. and Trippas, J. and Zhang, Y. and Qiu, M.},
            title = {Analyzing and Characterizing User Intent in Information-seeking Conversations},
            booktitle = {SIGIR '18},
            year = {2018},
        } 

        @inproceedings{InforSeek_Response_Ranking,
            author = {Yang, L. and Qiu, M. and Qu, C. and Guo, J. and Zhang, Y. and Croft, W. B. and Huang, J. and Chen, H.},
            title = {Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems},
            booktitle = {SIGIR '18},
            year = {2018},
        } 

        @inproceedings{InforSeek_User_Intent_Pred,
            author = {Qu, C. and Yang, L. and Croft, W. B. and Zhang, Y. and Trippas, J. and Qiu, M.},
            title = {User Intent Prediction in Information-seeking Conversations},
            booktitle = {CHIIR '19},
            year = {2019},
        }


Have Questions?


Ask us questions in our Google group or via email to the authors of the papers.

Acknowledgement


This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS-1419693 and NSF grant #IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

MSDialog Agreement


Use of the MSDialog Dataset
By downloading the MSDialog data, you as the “User” agree to use the data distributed by the Center for Intelligent Information Retrieval (CIIR) subject to the following understandings, terms and conditions:

Permitted Uses
The Data may only be used for internal evaluation and research purposes, and the User will not share the dataset(s) with others.

Small excerpts of the information may be displayed to others or published in a scientific or technical context, solely for the purpose of describing the research and development and related issues, provided that the User includes citations to the CIIR publications listed in the MSDialog data ReadMe file if the MSDialog data is used in a research paper. All efforts must be made not to infringe on the rights of any third party, including, but not limited to, the authors and publishers of the excerpts.

Copyright
The Information has been obtained by crawling the Internet. Due to the amount of data it has not been practicable to obtain permission from copyright owners to provide the data for the uses permitted under this Agreement (“Permitted Uses”).

User understands that all the documents in the data are documents which have been at some time made publicly available on the Internet and which have been collected using a process which respects the commonly accepted methods (such as robots.txt) for indicating that the documents should not be so collected.

The copyright holders retain ownership and reserve all rights pertaining to the use and distribution of the information. Except as specifically permitted above and as necessary to use and maintain the integrity of the information on computers used by the organization, the display, reproduction, transmission, distribution or publication of the information is prohibited. Violations of the copyright restrictions on the information may result in legal liability.

Disclaimer of Warranty
USER ACKNOWLEDGES AND AGREES THAT “DATA” RECEIVED ARE PROVIDED BY THE CENTER FOR INTELLIGENT INFORMATION RETRIEVAL AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR DATA CONTRIBUTORS BE LIABLE FOR SPECIAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, INCIDENTAL OR OTHER DAMAGES, LOSSES, COSTS, CHARGES, CLAIMS, DEMANDS, FEES OR EXPENSES OF ANY NATURE OR KIND ARISING IN ANY WAY FROM THE FURNISHING OF OR USER’S USE OF THE DATA RECEIVED.