This page may be out of date. Save your draft before refreshing this page.Submit any pending changes before refreshing this page.

As we looked at patterns across windows of certain sizes, we thought about ways to track trends such as non-decreasing and non-increasing subranges as efficiently as possible.

For this problem, you are given N days of upvote count data, and a fixed window size K. For each window of K days, from left to right, find the number of non-decreasing subranges within the window minus the number of non-increasing subranges within the window.

A window of days is defined as contiguous range of days. Thus, there are exactly [math]N-K+1[/math] windows where this metric needs to be computed. A non-decreasing subrange is defined as a contiguous range of indices [math][a,b][/math], [math]a<b[/math], where each element is at least as large as the previous element. A non-increasing subrange is similarly defined, except each element is at least as large as the next. There are up to [math]K(K-1)/2[/math] of these respective subranges within a window, so the metric is bounded by [math][-K(K-1)/2,K(K-1)/2][/math].

- [math]1 \leq N \leq 100,000[/math] days
- [math]1 \leq K \leq N[/math] days

Line 1: Two integers, [math]N[/math] and [math]K[/math]

Line 2: [math]N[/math] positive integers of upvote counts, each integer less than or equal to [math]10^9[/math].

Line 1..: [math]N-K+1[/math] integers, one integer for each window's result on each line

5 3 1 2 3 1 1

3 0 -2

For the first window of [1, 2, 3], there are 3 non-decreasing subranges and 0

non-increasing, so the answer is 3. For the second window of [2, 3, 1], there is 1 non-decreasing subrange and 1 non-increasing, so the answer is 0. For the third window of [3, 1, 1], there is 1 non-decreasing subrange and 3 non-increasing, so the answer is -2.

For this problem, imagine a simplified version of Quora where each question has only one topic associated with it. In turn, the topics form a simplified

ontology where each topic has a list of children, and all topics are descendants of a single root topic.

Design a system that allows for fast searches of questions under topics. There are [math]N[/math] topics, [math]M[/math] questions, and [math]K[/math] queries, given in this order. Each query has a desired topic as well as a desired string prefix. For each query, return the number of questions that fall under the queried topic and begin with the desired string. When considering topics, we want to include all descendants of the queried topic as well as the queried topic itself. In other words, each query searches over the subtree of the topic.

The topic ontology is given in the form of a flattened tree of topic names, where each topic may optionally have children. If a topic has children, they are listed after it within parentheses, and those topics may have children of their own, etc. See the sample for the exact input format. The tree is guaranteed to have a single root topic.

For ease of parsing, each topic name will be composed of English alphabetical characters, and each question and query text will be composed of English alphabetical characters, spaces, and question marks. Each question and query text will be well behaved: there will be no consecutive spaces or leading/trailing spaces. All queries, however, are case sensitive.

- For 100% of the test data, [math]1 \leq N, M, K \leq 10^5[/math] and the input file is smaller than 5MB
- For 50% of the test data, [math]1 \leq N, M, K \leq 2 \cdot 10^4[/math] and the input file is smaller than 1MB

- Line 1: One integer [math]N[/math]
- Line 2: [math]N[/math] topics arranged in a flat tree (see sample)
- Line 3: One integer [math]M[/math]
- Line 4...M+3: Each line contains a topic name, followed by a colon and a space, and then the question text.
- Line M+4: One integer [math]K[/math]
- Line M+5...M+K+4: Each line contains a topic name, followed by a space, and then the query text.

Line 1...K: Line [math]i[/math] should contain the answer for the [math]i[/math]th query.

6 Animals ( Reptiles Birds ( Eagles Pigeons Crows ) ) 5 Reptiles: Why are many reptiles green? Birds: How do birds fly? Eagles: How endangered are eagles? Pigeons: Where in the world are pigeons most densely populated? Eagles: Where do most eagles live? 4 Eagles How en Birds Where Reptiles Why do Animals Wh

1 2 0 3

The first query corresponds to the green area in the diagram, since it is looking for topics under Eagles, and the query string matches just one question: "How endangered are eagles?" The second query corresponds to the blue area in the diagram, which is the subtree of Birds, and matches two questions that begin with "Where". The third corresponds to the red area, which does not have any questions that begin with "Why do". The final query corresponds to the entire tree, since Animals is the root topic, and matches three questions.

Given that each stuffed animal is a perfect sphere and has an integer cuteness value, help Jerry pick a (possibly empty) subset of the stuffed animals to pick up such that their sum of values is maximized. Note that the stuffed animals can have negative value, so picking up all the animals is not necessarily optimal. Jerry does not want to disorganize the arrangement, so he cannot take any stuffed animal without taking all of the (up to 3) stuffed animals above it.

See the sample input and diagram to see how the stuffed animals are organized as a tetrahedral pyramid and given as input.

- For 100% of the test data, [math]1 \leq N \leq 12[/math]
- For 40% of the test data [math]1 \leq N \leq 6[/math]

- Line 1: One integer [math]N[/math]
- Line 2...N(N+1)/2+1: The formatted pyramid of cuteness values

Line 1: One integer, the maximum sum of values achievable

3 5 -2 -7 -3 1 0 8 0 3 2

`8`

The optimal selection is shown in the diagram in bold. It is suboptimal to take 1 because that would require taking -2 as well, which would decrease the total. On the other hand, 8, 3, and 2 should all be taken because those outweigh -3 and -7.

We have a pretty good machine learning system in place for generating these topics, but it could always be better. Your goal in this task is to design a question topic labeler.

The input data consists of a training dataset of [math]T[/math] questions and an evaluation dataset of [math]E[/math] questions. Your labeler should suggest 10 topics for each question in the evaluation set, ordered by relevance.

The first line consists of two space-separated integers [math]T[/math] and [math]E\, (1 \leq T \leq 20000, 1 \leq E \leq 1000)[/math] .

The next [math]2T[/math] lines provide the training dataset. Each question in the training dataset is described by two lines:

- One integer [math]N\, (1 \leq N \leq 25)[/math] , followed by [math]N[/math] positive integers (below 250) representing the topic IDs of the question in no particular order. On average, there are approximately 3 topics per question in the dataset.
- One string (between 1 and 500 printable ASCII characters), the question text. The average question text length is 70 characters.

The next [math]E[/math] lines provide the evaluation dataset. Each question in the evaluation dataset is described by one line (between 1 and 500 ASCII characters), the question text.

There are [math]T[/math] lines in total. The [math]i[/math]-th line contains 10 space-separated integers, each representing the topic id of a suggestion for the [math]i[/math]-th question in the evaluation dataset. To maximize your score, topic suggestions for a question should be ordered in descending order of relevance.

6 4 3 1 2 4 What is the meaning of life? 7 1 2 5 8 9 11 15 What is Quora? 2 14 178 What are the best Google calendar hacks? 3 117 93 125 Why does government of China not value the freedom of speech? 2 65 164 What is the best piece of design ever? 5 197 183 29 170 143 What was the last conversation you had with your father? Are programming contests fun? What is machine learning? How do you code in C++? Is it possible to sort in linear time?

Note: a more comprehensive sample input (with the same training data made available during evaluation) is available here (input, correct topics). You can use this script to score your output for the sample dataset.

1 2 5 3 7 9 11 10 14 15 2 9 29 3 117 197 1 183 178 15 15 197 170 143 8 5 1 7 2 14 1 8 14 164 125 15 2 65 164 3

Your score for each question is determined as follows:

[math]\frac{{}\sum\limits_{i=0}^9{\sqrt{10 - i} \cdot (guess_i \in questionTopics)}}{\sum\limits_{i=0}^{min(|questionTopics|, 10) - 1}{\sqrt{10 - i}}}[/math]

Your raw score is the sum of each question score.

[math]minScore[/math] = raw score for classifier that guesses 10 most frequent topics.

Your final score is [math]200 \cdot \frac{yourRawScore - minScore}{E - minScore}[/math].

Your program is limited to 512 MB of memory and must run in 60 seconds or less.

As the number of questions on Quora grows, there is an increasing likelihood that a new question may be a duplicate of an existing question. The text of the questions can vary significantly, but semantically might mean the same thing. We rely on human judgment to determine if 2 questions are considered to be duplicates and merge these questions together on Quora.

For this task, given Quora question text and topic data, determine if any 2 given pairs of questions are considered duplicates or not.

The following fields of raw data are given in JSON representing each question:

- question_key (string): Unique identifier for the question.
- question_text (string): Text of the question.
- context_topic (object): The primary topic of a question, if present. Null otherwise. The topic object will contain a name (string) and followers (integer) count.
- topics (array of objects): All topics on a question, including the primary topic. Each topic object will contain a name (string) and followers (integer) count.
- view_count (int): Number of views on the question.

After the raw data for each question, the list of known duplicate questions and non-duplicate questions among all the given questions are shared as pairs of questions with an integer representing whether that pair is a duplicate or not (1 for duplicate and 0 for non-duplicate). All other pairings of any 2 questions that are not in this list are unknown to be duplicates or not and the test set will come from this. Note that all questions referenced by the training and test sets will be present in the list of raw question data.

- The first line will contain an integer [math]1 \leq Q \leq 60,000[/math], which represents the number of lines of questions in the training data to follow.
- The next Q lines will contain JSON encoded fields of raw question data.
- The next line will contain an integer [math]1 \leq D \leq 25,000[/math], which represent the number of lines of duplicate question pairs in the training data to follow.
- The next D lines will each contain 2 known duplicate questions identified by their unique question_key and an integer representing whether the pair are duplicates (1 for duplicate, 0 for non-duplicate) separated by spaces. (e.g. “abc def 1”)
- The next line will contain an integer [math]1 \leq N \leq 3,000[/math] for the number of lines of test question pair data to follow.
- The next N lines will each contain 2 test questions identified by their unique question_key separated by a space.

- N lines of 2 question keys and an integer representing whether the 2 test questions are duplicates (1 for duplicate and 0 for non-duplicate). (e.g. “ghi jkl 0”)

3 {"view_count": 773, "question_text": "Which is the most intelligent alien or alien species in fiction?", "context_topic": {"followers": 130960, "name": "Science Fiction (genre)"}, "topics": [{"followers": 48, "name": "Science Fiction Books"}, {"followers": 130960, "name": "Science Fiction (genre)"}, {"followers": 1182, "name": "Extraterrestrial Intelligence"}, {"followers": 50056, "name": "Extraterrestrial Life"}, {"followers": 3883, "name": "Science Fiction Movies"}], "follow_count": 9, "question_key": "AAEAAJ/qtRMKkzXyA0tvjyz5tPRWgYizvOkCr9Z9CdJ4cood", "age": 413} {"view_count": 3522, "question_text": "What is the best way to keep bookmarks?", "context_topic": {"followers": 513, "name": "Bookmarking"}, "topics": [{"followers": 1136, "name": "Pocket (app)"}, {"followers": 9, "name": "ReadItLater"}, {"followers": 1625, "name": "Pinboard"}, {"followers": 1275, "name": "Social Bookmarking"}, {"followers": 513, "name": "Bookmarking"}, {"followers": 5604, "name": "Delicious (web application)"}, {"followers": 4359, "name": "Instapaper"}, {"followers": 85, "name": "Web Bookmarks"}], "follow_count": 62, "question_key": "AAEAADJKxcVF6l23JZvf1Fz+QrKr35CTlMKayNnZebc8dQAY", "age": 1193} {"view_count": 390, "question_text": "What is best for online bookmarks?", "context_topic": null, "topics": [{"followers": 1275, "name": "Social Bookmarking"}, {"followers": 285, "name": "Social Bookmarking Websites"}, {"followers": 513, "name": "Bookmarking"}], "follow_count": 4, "question_key": "AAEAAO3FKYrsnYH9uKAOnnXfYrGGTVFA3uzHz+Vltm5Ssii3", "age": 1211} 2 AAEAADJKxcVF6l23JZvf1Fz+QrKr35CTlMKayNnZebc8dQAY AAEAAO3FKYrsnYH9uKAOnnXfYrGGTVFA3uzHz+Vltm5Ssii3 1 AAEAADJKxcVF6l23JZvf1Fz+QrKr35CTlMKayNnZebc8dQAY AAEAAJ/qtRMKkzXyA0tvjyz5tPRWgYizvOkCr9Z9CdJ4cood 0 1 AAEAAJ/qtRMKkzXyA0tvjyz5tPRWgYizvOkCr9Z9CdJ4cood AAEAAO3FKYrsnYH9uKAOnnXfYrGGTVFA3uzHz+Vltm5Ssii3

Note: a more comprehensive sample input is available here (input, output). You can use this script to score your output for the sample dataset.

`AAEAAJ/qtRMKkzXyA0tvjyz5tPRWgYizvOkCr9Z9CdJ4cood AAEAAO3FKYrsnYH9uKAOnnXfYrGGTVFA3uzHz+Vltm5Ssii3 0`

You will get 1 point for each pair of questions that you predicted correctly.

Your raw score is the sum of all points for all pairs of questions in the test case ([math]N[/math]).

Your final score is [math]200 \cdot \frac{yourRawScore}{N}[/math].

Your program is limited to 1024 MB of memory and must run in 60 seconds or less.

- This data set is not representative of all the data in Quora but intentionally sampled to make a good challenge to solve.

Quora uses a combination of machine learning algorithms and moderation to ensure high-quality content on the site. High question and answer quality has helped Quora distinguish itself from other Q&A sites on the web.

As we get many questions every day, a challenge we have is to figure out good, interesting and meaningful questions from the bad. What questions are valid ones that can be answered? What questions attract reputable answers that then get upvoted? Can you tell which questions will likely get answers quickly, so that we can surface them in real-time to our users?

The first line contains N. N questions follow, each being a valid json object. The following fields of raw data are given in json.

- question_key (string): Unique identifier for the question.
- question_text (string): Text of the question.
- context_topic (object): The primary topic of a question, if present. Null otherwise. The topic object will contain a name (string) and followers (integer) count.
- topics (array of objects): All topics on a question, including the primary topic. Each topic object will contain a name (string) and followers (integer) count.
- anonymous (boolean): Whether the question was anonymous.
- __ans__ (boolean): Whether the question got an up-voted answer within 1 day.

T questions follow, each being a valid json object.

The json contains all but one field

question_key is of ascii format.

question_text, name in topics and context_topic is of UTF-8 format.

0 <= followers <= 106

9000 <= N <= 45000

1000 <= T <= 5000

Number correct classified / Total input size * 100%

The training and test set each will have approximately an equal number of each boolean type.

Your score will be based only on the hidden input. The sample input is only for your convenience.

Quora uses machine learning algorithms to try to generate interesting news feeds and digest emails for people.

Before a question gets an answer, we’d like to be able to make use of all the available information we have to be able to predict how interesting or relevant the question is to people. Ideally, we’d like to be able to tell this in real-time as soon as a few people have viewed it, by measuring the people following the question as a proxy of interest. Can you tell what questions will get the most followers?

The first line contains N. N questions follow, each being a valid json object. The following fields of raw data are given in json.

- question_key (string): Unique identifier for the question.
- question_text (string): Text of the question.
- context_topic (object): The primary topic of a question, if present. Null otherwise. The topic object will contain a name (string) and followers (integer) count.
- topics (array of objects): All topics on a question, including the primary topic. Each topic object will contain a name (string) and followers (integer) count.
- anonymous (boolean): Whether the question was anonymous.
- __ans__ (float): The ratio of viewers to followers of the question.

This is immediately followed by an integer T.

T questions follow, each being a valid json object.

The json contains all but one field

question_text, name in topics and context_topic is of UTF-8 format.

0 <= followers <= 106

9000 <= N <= 45000

1000 <= T <= 5000

Your score will be based only on the hidden input. The sample input is only for your convenience.

Quora uses a combination of machine learning algorithms and moderation to ensure high-quality content on the site. High question and answer quality has helped Quora distinguish itself from other Q&A sites on the web.

Not all of these questions are as interesting or appealing to people on the web. Can you tell what questions can organically attract the most viewers? What about questions that eventually become viral? Which questions are timeless and can sustain traffic?

- question_key (string): Unique identifier for the question.
- question_text (string): Text of the question.
- anonymous (boolean): Whether the question was anonymous.
- num_answers (integer): The number of visible non-collapsed answers the question has.
- promoted_to (integer): The number of people the question was promoted to.
- __ans__ (float): The ratio of viewers to age of the question in days.

This is immediately followed by an integer T.

T questions follow, each being a valid json object.

The json contains all but one field

question_text, name in topics and context_topic is of UTF-8 format.

0 <= followers <= 106

9000 <= N <= 45000

1000 <= T <= 5000

Sample testcases can be downloaded here and used for offline training if desired.

Your score will be based only on the hidden input. The sample input is only for your convenience.

The search bar at the top of every page on Quora allows you to search the most up-to-date people, topics and questions on our site.

We want to quickly return the most relevant results that match the search query entered into the input text field. Every time a new user, question or topic is added (or old ones deleted), subsequent queries must reflect those changes immediately. We currently use a fast in-memory service to handle this.

Input comes into the service as the following commands:

**ADD <type> <id> <score> <data string that can contain spaces>**

Adds the following <type> of item (user | topic | question | board) with the unique <id> string and <score> float, corresponding to the <data string> that would be used to match queries.**DEL <id>**

Deletes the item specified by unique identifier <id>.**QUERY <number of results> <query string that can contain spaces>**

Queries for the specified integer number of results (up to 20) that match a given query string. For a data item to be considered a matching result to a query, each token in the query must be found in the data string as a case-insensitive prefix to any token in the data string. The results are ranked in descending order of score, and we only take the top few results as specified. When there is a tie in the score, newer items (more recently added) should be ranked higher. If there are no results, just output the empty string on that line.**WQUERY <number of results> <number of boosts> (<type>:<boost>)* (<id>:<boost>)* <query string that can contain spaces>**

Same as QUERY, except that instead of using the raw input score specified by ADD for the particular item, the scores are weighted by the optional number of boosting factors. The boosts are floats that should be multiplied to the raw score, and affect either whole types, or specific items with the given <id>s. If there are multiple boosts applicable, they each are multiplied commutatively to the raw score.

Your task will be to write an equivalent service as a standalone program, with input files that correspond to the queries and updates to the data, and expected output files that correspond to the results obtained for each query.

Your program will be given an integer N on the first line of stdin, followed by N lines of the form:

<command> <command data>

The input commands are: ADD | DEL | QUERY | WQUERY

Types are: user | topic | question | board

Ids are alphanumeric strings without spaces or punctuation and will not include the same strings used for types.

Data strings can be any ASCII character, but are delimited by spaces or tabs. We will not be using anything special unprintable characters or \r and \n in the data string.

Command data for each command is as specified above. For example:

15 ADD user u1 1.0 Adam D’Angelo ADD user u2 1.0 Adam Black ADD topic t1 0.8 Adam D’Angelo ADD question q1 0.5 What does Adam D’Angelo do at Quora? ADD question q2 0.5 How did Adam D’Angelo learn programming? QUERY 10 Adam QUERY 10 Adam D’A QUERY 10 Adam Cheever QUERY 10 LEARN how QUERY 1 lear H QUERY 0 lea WQUERY 10 0 Adam D’A WQUERY 2 1 topic:9.99 Adam D’A DEL u2 QUERY 2 Adam

For each QUERY and WQUERY command, you should output the following line:

<sorted result ids>

Where each line contains the <id>s in descending score order, up to the required number of results, as specified above according to the QUERY and WQUERY commands. If there are no matches, output the empty line. For example:

u2 u1 t1 q2 q1 u1 t1 q2 q1 q2 q2 u1 t1 q2 q1 t1 u1 u1 t1

0 < N < 100000

0 < number of ADDs < 40000

0 < number of DELs < 10000

0 < number of QUERYs < 20000

0 < number of WQUERYs < 1000

0 < number of boosts < 25

0.0 < each score < 100.0

0 < data string length < 100 chars

You should aim to have your algorithm be fast enough to solve our largest test inputs in under 5 seconds, or be as close to that as possible.

In this problem, we are presenting a representative simplified version of our service for the purposes of the contest.

Quora shows a customized feed of recent stories on a user's home page. Stories in Quora refer to activities that happen on the site, for example, when a user posts a note, adds a question, or upvotes an answer. We score each story based on its type and other characteristics, and this score represents the value we think a story brings to the user. We want to be able to generate quickly the feed of best and most recent stories for the user every time they reload their home page.

Your task will be to design the algorithm that picks the stories to display in the feed.

You are given a list of stories, each having a time of publication, a score and a certain height in pixels that it takes to display the story. Given the total number of pixels in the browser available for displaying the feed, you want to maximize the sum of scores for the stories that you can display in the feed at each time the user reloads their home page. We only want to consider recent stories, so only stories that were published in a recent time window from the time of reload should be considered. You do not have to use up all the pixels in the browser.

The first line of input will be 3 positive integers: N the number of events, W the time window representing the window of recent stories, and H the height of the browser in pixels.

There will be N lines following that, each beginning with ‘S’ if it is a story event, or ‘R’ if it is a reload event. A story event will have 3 positive integers: the time of publication, the score and the height in pixels of that story. A reload event will have 1 positive integer: the time of reload.

The events will always be in chronological order, and no two events will happen at the same time.

For example:

9 10 100 S 11 50 30 R 12 S 13 40 20 S 14 45 40 R 15 R 16 S 18 45 20 R 21 R 22

For each reload event given in the input, you are to output a line of integers. First, the maximum score of stories you can show in the feed. This should be followed by the number of stories picked and the id number for each story picked, in increasing order. Stories are given an id starting from 1 in the order of their time of publication.

If two sets of stories give the same score, choose the set with fewer stories. If there is still a tie, choose the set which has the lexicographically smaller set of ids, e.g. choose [1, 2, 5] over [1, 3, 4].

For example:

50 1 1 135 3 1 2 3 135 3 1 2 3 140 3 1 3 4 130 3 2 3 4

Explanation:

There are 4 stories (with ids 1 to 4) and 5 reload events. At the first reload, there is only one story with score of 50 available for display. The next two reloads, we have 3 stories that take up 90 of the 100 pixels available, for a total score of 135. When we reload at time 21, there are 4 stories available for choosing, but only 3 will fit into the browser height. The best set is [1, 3, 4] for a total score of 140. At the last reload event, we no longer consider story 1 when choosing stories because it is more than 10 time units old. The best set is thus [2, 3, 4].

All times are positive integers up to 1,000,000,000.

All scores are positive integers up to 1,000,000.

All heights are positive integers.

0 < N <= 10,000

0 < H <= 2,000

0 < W <= 2,000

You should aim to have your algorithm be fast enough to solve our largest test inputs in under 5 seconds, or be as close to that as possible.

For this contest, we are considering a smaller-scale version of our feed optimization problem with simplified constraints. (The actual feed on our site uses a different algorithm.)

Quora uses a combination of machine learning (ML) algorithms and moderation to ensure high-quality content on the site. High answer quality has helped Quora distinguish itself from other Q&A sites on the web.

Your task will be to devise a classifier that is able to tell good answers from bad answers, as well as humans can. A good answer is denoted by a +1 in our system, and a bad answer is denoted by a -1.

The first line contains N, M. N = Number of training data records, M = number of parameters. Followed by N lines containing records of training data. Then one integer q, q = number of records to be classified, followed by q lines of query data

Training data corresponds to the following format:

`<answer-identifier> <+1 or -1> (<feature-index>:<feature-value>)*`

Query data corresponds to the following format:

`<answer-identifier> (<feature-index>:<feature-value>)*`

The answer identifier is an alphanumeric string of no more than 10 chars. Each identifier is guaranteed unique. All feature values are doubles.

0 < M < 100

0 < N < 50,000

0 < q < 5,000

For example:

5 23 2LuzC +1 1:2101216030446 2:1.807711 3:1 4:4.262680 5:4.488636 6:87.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.891820 12:0 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:2 21:2.197225 22:0.000000 23:0.000000 LmnUc +1 1:99548723068 2:3.032810 3:1 4:2.772589 5:2.708050 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:4.727388 12:5 13:1 14:0 15:0 16:1 17:1 18:0 19:0 20:9 21:2.833213 22:0.000000 23:0.000000 ZINTz -1 1:3030695193589 2:1.741764 3:1 4:2.708050 5:4.248495 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.091042 12:1 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:5 21:2.564949 22:0.000000 23:0.000000 gX60q +1 1:2086220371355 2:1.774193 3:1 4:3.258097 5:3.784190 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.258097 12:0 13:1 14:0 15:0 16:0 17:1 18:0 19:0 20:5 21:2.995732 22:0.000000 23:0.000000 5HG4U -1 1:352013287143 2:1.689824 3:1 4:0.000000 5:0.693147 6:0.000000 7:0.000000 8:0.000000 9:0 10:1 11:1.791759 12:0 13:1 14:1 15:0 16:1 17:0 18:0 19:0 20:4 21:2.197225 22:0.000000 23:0.000000 2 PdxMK 1:340674897225 2:1.744152 3:1 4:5.023881 5:7.042286 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.367296 12:0 13:1 14:0 15:0 16:0 17:0 18:0 19:0 20:12 21:4.499810 22:0.000000 23:0.000000 ehZ0a 1:2090062840058 2:1.939101 3:1 4:3.258097 5:2.995732 6:75.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.433987 12:0 13:1 14:0 15:0 16:1 17:0 18:0 19:0 20:3 21:2.639057 22:0.000000 23:0.000000

This data is completely anonymized and extracted from real production data, and thus will not include the raw form of the

answers. We, however, have extracted as many features as we think are useful, and you can decide which features make sense to be included in your final algorithm. The actual labeling of a good answer and bad answer is done organically on our site, through human moderators.

For each query, you should output q lines to stdout, representing the decision made by your classifier, whether each answer is good or not:

`<answer-identifier> <+1 or -1>`

For example:

PdxMK +1 ehZ0a -1

You are given a relative large sample input dataset offline with its corresponding output to finetune your program with your ML libraries. It can be downloaded here: http://qsf.cf.quoracdn.net/Quora...

Only one very large test dataset will be given for this problem online as input to your program for scoring. This input data set will not be revealed to you.

Output for every classification is awarded points separately. The score for this problem will be the sum of points for each correct classification. To prevent naive solution credit (outputting all +1s, for example), points are awarded only after X correct classifications, where X is number of +1 answers or -1 answers (whichever is greater).

Your program should complete in minutes. Try to achieve as high an accuracy as possible with this constraint in mind.