Nearby
Topics on Quora have location data optionally associated with them, allowing the
Quora Nearby feature for our iPhone app. Each question on Quora can have one or more topics associated with it. This feature allows us to display topics and questions near different locations, be it the user's current location, or a specified location on the map.
Your task will be to write a program that will be able to find topics or questions that are near to a given input location, up to a specified limit.
Input format (read from STDIN):The first line of input will be 3 integers: number of topics
T, number of questions
Q, and number of queries
N.
There will be T lines following that, each with a topic id integer, and two doubles representing that topic's location (you can consider the points to be located on a XY plane, location of a entity is in form of its x and y coordinate in the plane).
There will be
Q lines following that, each with a question id integer and number of topics the question is associated with,
Qn, followed by
Qn number of topic ids associated with that question. These integers are guaranteed to be well-formed existing topics. There may be 0 topics associated with a question, and in such cases, the question should never appear in the query results.
The next
N lines will consist of a char, an integer and two doubles each, representing the type of results required ("t" for topic or "q" for question), the number of results required, and the location to be used as the query.
Sample Input:
1
2
3
4
5
6
7
8
9
10
11
12
| |
Output format (write to STDOUT):For each query line given in the input, you are to output the ids of the nearest entities in ascending order of distance from the query location, up to the specified number of results. When there is a tie between different entities in the ordering due to equal distances (threshold of 0.001 for equality comparisons), the entity with the larger id should be ranked first.
Distance of a question from a point is the minimum of the distance of all topics associated with that question.
Sample Output:
Explanation:
There are 3 topics with ids 0, 1 and 2. There are also 6 questions, with ids 0 to 5. We first ask a nearest topic query.
The closest 2 topics to (0.0, 0.0) are topics 0 and 1.
The next query asks for upto 5 closest questions to (100.0, 100.0). Since questions 5 and 2 are tagged with topic 2 located at (2.0, 2.0), they are closest to the query location.
Because of the tie in distance, we put question 5 before question 2.
The next closest question is question 1, followed by question 0.
We do not output questions 3 or 4 because there are no topics associated with them.
Constraints:1 <= T <= 10000
1 <= Q <= 1000
1 <= N <= 10000
Integer ids are between 0 and 100000 inclusive.
Number of topics associated with a question is not more than 10.
The number of results required for a query is not more than 100.
0.0 <= x,y <= 1000000.0 (10^6)
For the large testcases, all topic x,y co-ordinates will be approximately uniformly distributed over the bounds.
You should aim to have your algorithm be fast enough to solve our largest test inputs in under 5 seconds, or be as close to that as possible.
NotesLocation data will be specified as cartesian co-ordinates on a plane for simplicity. In reality, our infrastructure utilizes the Haversine formula (
http://www.movable-type.co.uk/sc...) to convert these co-ordinates.
Typeahead Search
The search bar at the top of every page on Quora allows you to search the most up-to-date people, topics and questions on our site.
We want to quickly return the most relevant results that match the search query entered into the input text field. Every time a new user, question or topic is added (or old ones deleted), subsequent queries must reflect those changes immediately. We currently use a fast in-memory service to handle this.
Input comes into the service as the following commands:
- ADD <type> <id> <score> <data string that can contain spaces>
Adds the following <type> of item (user | topic | question | board) with the unique <id> string and <score> float, corresponding to the <data string> that would be used to match queries. - DEL <id>
Deletes the item specified by unique identifier <id>.
- QUERY <number of results> <query string that can contain spaces>
Queries for the specified integer number of results (up to 20) that match a given query string. For a data item to be considered a matching result to a query, each token in the query must be found in the data string as a case-insensitive prefix to any token in the data string. The results are ranked in descending order of score, and we only take the top few results as specified. When there is a tie in the score, newer items (more recently added) should be ranked higher. If there are no results, just output the empty string on that line. - WQUERY <number of results> <number of boosts> (<type>:<boost>)* (<id>:<boost>)* <query string that can contain spaces>
Same as QUERY, except that instead of using the raw input score specified by ADD for the particular item, the scores are weighted by the optional number of boosting factors. The boosts are floats that should be multiplied to the raw score, and affect either whole types, or specific items with the given <id>s. If there are multiple boosts applicable, they each are multiplied commutatively to the raw score.
Your task will be to write an equivalent service as a standalone program, with input files that correspond to the queries and updates to the data, and expected output files that correspond to the results obtained for each query.
Input format (read from STDIN):Your program will be given an integer N on the first line of stdin, followed by N lines of the form:
<command> <command data>
The input commands are: ADD | DEL | QUERY | WQUERY
Types are: user | topic | question | board
Ids are alphanumeric strings without spaces or punctuation and will not include the same strings used for types.
Data strings can be any ASCII character, but are delimited by spaces or tabs. We will not be using anything special unprintable characters or \r and \n in the data string.
Command data for each command is as specified above. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| ADD user u1 1.0 Adam D’Angelo
ADD user u2 1.0 Adam Black
ADD topic t1 0.8 Adam D’Angelo
ADD question q1 0.5 What does Adam D’Angelo do at Quora?
ADD question q2 0.5 How did Adam D’Angelo learn programming?
WQUERY 2 1 topic:9.99 Adam D’A
|
Output format (write to STDOUT):For each QUERY and WQUERY command, you should output the following line:
<sorted result ids>
Where each line contains the <id>s in descending score order, up to the required number of results, as specified above according to the QUERY and WQUERY commands. If there are no matches, output the empty line. For example:
Constraints:0 < N < 100000
0 < number of ADDs < 40000
0 < number of DELs < 10000
0 < number of QUERYs < 20000
0 < number of WQUERYs < 1000
0 < number of boosts < 25
0.0 < each score < 100.0
0 < data string length < 100 chars
You should aim to have your algorithm be fast enough to solve our largest test inputs in under 5 seconds, or be as close to that as possible.
NotesIn this problem, we are presenting a representative simplified version of our service for the purposes of the contest.
Scrabble Stepladder
A word ladder is a sequence of words in which each subsequent word is formed by replacing exactly one letter from the previous word with a different letter. Words may not be repeated in a word ladder. For example the following sequence is a valid word ladder:
CAT
COT
DOT
DOG
-----------------------------------------------------------------------------
The Scrabble score of a word is the sum of the Scrabble values of each letter in the word. The Scrabble score of a word ladder is the sum of the Scrabble scores of its words. The values are taken from the official English edition of Scrabble.
1 points: A, E, I, L, N, O, R, S, T, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
For example, CAT has a scrabble score of 5. The word ladder CAT, COT, DOT, DOG has a Scrabble score of 19.
-----------------------------------------------------------------------------------
A Scrabble stepladder is a word ladder with some additional constraints. It must contain an odd number of words. From the first word to the central word each subsequent word must have a strictly greater Scrabble score than the previous word. From the central word to the last word each subsequent word must have a strictly lesser Scrabble score than the previous word. Stepladders that only contain 1 word are valid. All the rules of a general word ladder still apply. For example, the following sequence is a valid Scrabble stepladder:
SOUR (4)
SPUR (6)
SPUD (7)
STUD (5)
STUN (4)
but the following sequence is not:
APES (6)
APED (7)
AXED (12)
AXES (11)
EXES (11)
nor is the following sequence:
SORES (5)
PORES (7)
POXES (14)
LOXES (12)
--------------------------------------------------------------------------
You are provided a dictionary of words from which all words in a Scrabble stepladder must be drawn, and number of letters that each word in the stepladder must contain. Your task is to determine the Scrabble stepladder with the highest possible score - over all valid Scrabble stepladders given the dictionary and the word length.
Input format (read from STDIN):
The first line of the input is a positive integer K: the length of a word in the Scrabble stepladders to consider.
The second line of input is a positive integer N: the number of dictionary words that will follow.
The remaining input is N lines, each containing a word whose characters must be in [A-Z]. This set of N words constitutes the dictionary.
Output format (write to STDOUT):
Output a single integer corresponding to the highest Scrabble stepladder over all valid Scrabble stepladders given the dictionary and word length. If there is no valid Scrabble stepladder, output 0.Feed Optimizer
Quora shows a customized feed of recent stories on a user's home page. Stories in Quora refer to activities that happen on the site, for example, when a user posts a note, adds a question, or upvotes an answer. We score each story based on its type and other characteristics, and this score represents the value we think a story brings to the user. We want to be able to generate quickly the feed of best and most recent stories for the user every time they reload their home page.
Your task will be to design the algorithm that picks the stories to display in the feed.
You are given a list of stories, each having a time of publication, a score and a certain height in pixels that it takes to display the story. Given the total number of pixels in the browser available for displaying the feed, you want to maximize the sum of scores for the stories that you can display in the feed at each time the user reloads their home page. We only want to consider recent stories, so only stories that were published in a recent time window from the time of reload should be considered. You do not have to use up all the pixels in the browser.
Input format (read from STDIN):The first line of input will be 3 positive integers: N the number of events, W the time window representing the window of recent stories, and H the height of the browser in pixels.
There will be N lines following that, each beginning with ‘S’ if it is a story event, or ‘R’ if it is a reload event. A story event will have 3 positive integers: the time of publication, the score and the height in pixels of that story. A reload event will have 1 positive integer: the time of reload.
The events will always be in chronological order, and no two events will happen at the same time.
For example:
Output format (write to STDOUT):For each reload event given in the input, you are to output a line of integers. First, the maximum score of stories you can show in the feed. This should be followed by the number of stories picked and the id number for each story picked, in increasing order. Stories are given an id starting from 1 in the order of their time of publication.
If two sets of stories give the same score, choose the set with fewer stories. If there is still a tie, choose the set which has the lexicographically smaller set of ids, e.g. choose [1, 2, 5] over [1, 3, 4].
For example:
Explanation:
There are 4 stories (with ids 1 to 4) and 5 reload events. At the first reload, there is only one story with score of 50 available for display. The next two reloads, we have 3 stories that take up 90 of the 100 pixels available, for a total score of 135. When we reload at time 21, there are 4 stories available for choosing, but only 3 will fit into the browser height. The best set is [1, 3, 4] for a total score of 140. At the last reload event, we no longer consider story 1 when choosing stories because it is more than 10 time units old. The best set is thus [2, 3, 4].
ConstraintsAll times are positive integers up to 1,000,000,000.
All scores are positive integers up to 1,000,000.
All heights are positive integers.
0 < N <= 10,000
0 < H <= 2,000
0 < W <= 2,000
You should aim to have your algorithm be fast enough to solve our largest test inputs in under 5 seconds, or be as close to that as possible.
NotesFor this contest, we are considering a smaller-scale version of our feed optimization problem with simplified constraints. (The actual feed on our site uses a different algorithm.)
Answer Classifier
Quora uses a combination of machine learning (ML) algorithms and moderation to ensure high-quality content on the site. High answer quality has helped Quora distinguish itself from other Q&A sites on the web.
Your task will be to devise a classifier that is able to tell good answers from bad answers, as well as humans can. A good answer is denoted by a +1 in our system, and a bad answer is denoted by a -1.
Input format (read from STDIN):The first line contains N, M. N = Number of training data records, M = number of parameters. Followed by N lines containing records of training data. Then one integer q, q = number of records to be classified, followed by q lines of query data
Training data corresponds to the following format:
<answer-identifier> <+1 or -1> (<feature-index>:<feature-value>)*
Query data corresponds to the following format:
<answer-identifier> (<feature-index>:<feature-value>)*
The answer identifier is an alphanumeric string of no more than 10 chars. Each identifier is guaranteed unique. All feature values are doubles.
0 < M < 100
0 < N < 50,000
0 < q < 5,000
For example:
1
2
3
4
5
6
7
8
9
| 2LuzC +1 1:2101216030446 2:1.807711 3:1 4:4.262680 5:4.488636 6:87.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.891820 12:0 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:2 21:2.197225 22:0.000000 23:0.000000
LmnUc +1 1:99548723068 2:3.032810 3:1 4:2.772589 5:2.708050 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:4.727388 12:5 13:1 14:0 15:0 16:1 17:1 18:0 19:0 20:9 21:2.833213 22:0.000000 23:0.000000
ZINTz -1 1:3030695193589 2:1.741764 3:1 4:2.708050 5:4.248495 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.091042 12:1 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:5 21:2.564949 22:0.000000 23:0.000000
gX60q +1 1:2086220371355 2:1.774193 3:1 4:3.258097 5:3.784190 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.258097 12:0 13:1 14:0 15:0 16:0 17:1 18:0 19:0 20:5 21:2.995732 22:0.000000 23:0.000000
5HG4U -1 1:352013287143 2:1.689824 3:1 4:0.000000 5:0.693147 6:0.000000 7:0.000000 8:0.000000 9:0 10:1 11:1.791759 12:0 13:1 14:1 15:0 16:1 17:0 18:0 19:0 20:4 21:2.197225 22:0.000000 23:0.000000
PdxMK 1:340674897225 2:1.744152 3:1 4:5.023881 5:7.042286 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.367296 12:0 13:1 14:0 15:0 16:0 17:0 18:0 19:0 20:12 21:4.499810 22:0.000000 23:0.000000
ehZ0a 1:2090062840058 2:1.939101 3:1 4:3.258097 5:2.995732 6:75.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.433987 12:0 13:1 14:0 15:0 16:1 17:0 18:0 19:0 20:3 21:2.639057 22:0.000000 23:0.000000
|
This data is completely anonymized and extracted from real production data, and thus will not include the raw form of the
answers. We, however, have extracted as many features as we think are useful, and you can decide which features make sense to be included in your final algorithm. The actual labeling of a good answer and bad answer is done organically on our site, through human moderators.
Output format (write to STDOUT):For each query, you should output q lines to stdout, representing the decision made by your classifier, whether each answer is good or not:
<answer-identifier> <+1 or -1>
For example:
You are given a relative large sample input dataset offline with its corresponding output to finetune your program with your ML libraries. It can be downloaded here:
http://qsf.cf.quoracdn.net/Quora...ScoringOnly one very large test dataset will be given for this problem online as input to your program for scoring. This input data set will not be revealed to you.
Output for every classification is awarded points separately. The score for this problem will be the sum of points for each correct classification. To prevent naive solution credit (outputting all +1s, for example), points are awarded only after X correct classifications, where X is number of +1 answers or -1 answers (whichever is greater).
TimingYour program should complete in minutes. Try to achieve as high an accuracy as possible with this constraint in mind.
Datacenter Cooling
We have some rooms in our datacenter, and we need to connect them all with a single cooling duct.
Here are the rules:
- The datacenter is represented by a 2D grid.
- Rooms we own are represented by a 0.
- Rooms we do not own are represented by a 1.
- The duct has to start at the air intake valve, which is represented by a 2.
- The duct has to end at the air conditioner, which is represented by a 3.
- The duct cannot go in multiple directions out of the intake or the AC - they must be the two endpoints of the duct.
- The duct must pass through each room exactly once.
- The duct cannot pass through rooms we do not own.
- The duct can connect between rooms horizontally or vertically but not diagonally.
Here is an example datacenter:
There are two possible ways to run the duct here:
or
Write a program to compute the number of possible ways to run the duct. For the above example, the correct answer is 2.
Input format (read from STDIN):The input will be a series of integers separated by whitespace.
The first two integers will be W and H, with width and height of the datacenter. These will be followed by W*H more integers, specifying the 2D grid representing the datacenter.
Output format (write to STDOUT):Your program should write a single integer out: the number of possible ways to run the duct.
See how fast you can make it.
NotesOur best solution (written in C) can solve the following test case in under 5 seconds on a 2.4GHz Pentium 4, but it's pretty good if you can get to within 1-2 orders of magnitude of that.
Python URI
In Python, write a class or module with a bunch of functions for manipulating a URI. For this exercise, pretend that the urllib, urllib2, and urlparse modules don't exist. You can use other standard Python modules, such as re, for this. The focus of the class or module you write should be around usage on the web, so you'll want to have things that make it easier to update or append a querystring var, get the scheme for a URI, etc., and you may want to include ways to figure out the domain for a URL (
british-site.co.uk,
us-site.com, etc.)
We're looking for correctness (you'll probably want to read the relevant RFCs; make sure you handle edge cases), and elegance of your API (does it let you do the things you commonly want to do with URIs in a really straightforward way?,) as well as coding style. If you don't know Python already, then this is also an exercise in learning new things quickly and well. Your code should be well-commented and documented and conform to the guidelines in the PEP 8 Style Guide for Python Code. Include some instructions and examples of usage in your documentation. You may also want to write unit tests.
Email Submission
Send your solutions to challenges@quora.com
Trend Analyzer
Note: This is an open-ended problem geared towards our Data Analyst role. Data analysts at Quora work with data at all levels of our product, allowing our team to make informed decisions about which features to pursue and which areas need improvement. Being an information repository in itself, efficient and accurate use of data is central to everything we do at Quora.At Quora, we deal with a lot of data. As the number of users on our site grows, so does the volume of content and the number of interactions that go on daily.
We want to be able to determine various social and intellectual topics that are trending over time. However, we take privacy very seriously, and so we want to do this in a way that does not put any individual’s personal information at risk.
Your task is to design and implement a trend analysis engine. This engine should be able to take as input a table of raw data and present an interface for trend analysis. You are free to design this interface and choose the data analysis features that you wish to support. It is important that the engine only exposes to the user data that does not have any personal identifiers and protects sensitive attributes from being revealed. We provide some links to resources about how to think about this in the Resources section.
ScoringYour submission will be evaluated according to the following criteria:
- Code Quality and Design (25%):
- Is the code organized well and readable?
- Were there good design decisions made and documented?
- Privacy (25%):
- Is there adequate preservation of privacy?
- Is there a reasonable way to trade off between protecting personal information and getting sufficiently high quality of data for analysis?
- Analytics (25%):
- What analytical metrics do you support?
- How do you detect or compute trends in the data?
- Visualization (25%):
- Are there compelling visual aids for presenting data?
- Are there ways to interactively explore the data?
A Primer on Privacy ProtectionSuppose there is a table with the following raw data:
Table 1: Raw data with personal information exposedThe first field (User ID), or the second and third field if used together (First and Last Name), will uniquely identify the person in this data set. If you are trying to determine aggregate big picture trends of Popularity from this data, there is no need to deal with data of this resolution and granularity.
Instead, if you knew the fields which are identifiers and quasi-identifiers, you could produce the following table after an anonymization process.
Table 2: Raw data after anonymization processGrouped in the following way, you now have a table whose rows are placed into groups where you can no longer distinguish an individual record from the table. And you can still observe that asking of questions is yielding a high popularity score compared to answering questions, while following of topics has a mixed response.
A Primer on Time-Series DataAs another example, here is a different data set that contains time-series data.
Table 3: Raw data with no apparent patterns or trendsAt first glance, it is difficult to detect any patterns or trends in the data. But if you strip out personal info and sort by Location and then by Time, a few things become clear.
Table 4: Raw data after sorting exposes different patterns and trendsFirst, there is a high correlation between the Location of the user and the Object being acted upon. In this case, it happens to be that the users seem to be discussing the public transport systems in their location (MBTA for Boston and Caltrain for Palo Alto). It is also clear that a question was asked in about the Caltrain, and after a short wait, there was a flurry of activity around that topic, with an answer followed by 2 votes. Whereas in Boston, it was just a spike in questions getting asked about the MBTA.
Input FormatYour engine will be tested by us on several data sets, with both synthetic and real data. Your engine should take as input a path to a comma-delimited text file. The first line will be the field names, and the second line will be the field type. The remaining rows will be the raw data.
The table below lists the field types we want you to support. If you have ideas for other types, feel free to include support for them and provide us an input file for us to evaluate them.
Table 5: Field types to be specified in the input data setThe above 2 sample data sets can be described by the following schema.
Example 1User ID, First Name, Last Name, Action Type, Object, Popularity
ID, ID, ID, CAT, CAT/SENSITIVE, CONT
Example 2Time, User ID, Location, Action Type, Object
TIME, ID, CAT, CAT, CAT
Submission FormatYour submission should be the URL to a standalone repository that we can clone, build and run with minimal effort.
You have a week for this problem and you may ask clarification questions by emailing
challenges@quora.com. We will try our best to respond to your questions but given the open-ended nature of this problem, you may also use your best judgement to make decisions and let us know how you dealt with these by documenting them.
ResourcesWe give links below to 3 papers that we believe are excellent starting points to think about preserving privacy in microdata. You may extend these ideas with other state-of-the-art in privacy and data research.
In your submission, you may also use any third party software libraries that are helpful to you, provided you cite them properly in your comments.
[1] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570
http://epic.org/privacy/reidenti...[2] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proc. 22nd Intnl. Conf. Data Engg. (ICDE), page 24, 2006.
http://www.cs.cornell.edu/~vmuth...[3] N. Li, T. Li, and S. Venkatasubramanian. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. IEEE International Conference on Data Engineering (ICDE), 2007
http://www.cs.purdue.edu/homes/n...Sample DataYou may download test datasets as a tarball from:
http://qsf.cf.quoracdn.net/Quora...The original data was sourced from
http://opendata.socrata.com/, but we've touched it up a little and added the relevant field types to make them applicable to this problem. For each dataset in the zip file, we provide both the original raw data and the cleaned annotated version, as well as the script used to do the cleaning.
Email Submission
Send your solutions to challenges@quora.com
Browser Extensions
Note: This is an open-ended problem geared towards our Product Engineer role. Product engineers help drive the Quora product forward by making strong contributions across multiple levels of our web stack, from implementing how users experience and interact with the product to the data models that support it. Quora does not yet have an official full-featured API for the public. However, there are already several ways users may interact with Quora from outside of Quora:
- using a simple API we released for notifications:
- using our “Post to Quora” bookmarklet to post links to boards:
- and by embedding a Quora follow button for users or topics:
The goal of this problem is to create an open-source browser extension for Quora using any or all of the above resources. It is completely up to you to determine the purpose and functionality of the browser extension (and which browsers to support) for Quora users.
Some common scenarios:
- allowing Quora to be accessible on any website without the use of bookmarklets and embedded scripts
- making Quora more useful for some target audience or site
- providing Quora users a richer browser experience
Some wackier ideas:
- automatically embedding images into a post through the right-click context menu
- suggesting relevant boards to post to, while surfing different websites
You may reverse-engineer any of these publicly available resources above within reason, but please ask us first at
challenges@quora.com if there is anything that you are unsure about. There is only one restriction: Your browser extension should not be storing or making use of any state on a third-party server (HTML5 Web Storage is okay).
We will be grading submissions based on the following criteria:
- Good, well organized, readable, and clean code (25%)
- Creativity and novelty (25%)
- Completeness and functionality (25%)
- Visual design and interactions (25%)
Here are links to some existing Quora browser extensions that we know about, contributed by our users Andrew Brown, Sean Owczarek and
Mirza Asif, which you may find useful:
Spend about a week on this problem, and submit a link to the project’s source (feel free to use Github or any other version control repository you prefer).
Email Submission
Send your solutions to challenges@quora.com