Yesterday I presented at the STM Innovations conference in London. In addition to speaking about our recent API release and how publishers can use it to power material discovery on their websites, I was delighted to spend some time talking about our support for the Resource Identification Initiative. Below are my slides (slightly edited) from yesterday’s presentation.
For a number of years Biocompare has been a data provider and partner of the Neuroscience Information Framework (NIF), based at UCSD. The following guest blog post by Anita Bandrowski, project lead on the NIF project, gives an overview of the aims of the Resource Identification Initiative (RII), a collaborative effort that includes publishers, research groups, funding agencies and a number of commercial companies to improve resource visibility in science. In this post Anita gives some background on NIF, the problems the project set out to tackle in 2008, and how the RII fits into those important strategic objectives:
The goal of NIF, which is an initiative of the Blueprint for Neuroscience, is to make resources like data and biomedical tools more discoverable and transparent to bench scientists and program officers. The origins of NIF lie in a visionary moment in the halls of the NIH: two program officers were walking while discussing a grant application when one said to the other, “I know I have funded this before, but can’t find any trace of it”. As a result of this and other similar discussions, NIF was established to keep track of the intellectual outputs of research that do not fit neatly into publications. I came to NIF when the project started its current phase in 2008, with the goal of bringing my experience in data back to neuroscience, my first love and the subject of much of my work in graduate school.
At NIF we have been working for over 5 years to catalog research resources (by which we mean software tools, databases, experimental animal repositories, tissue banks, and antibodies) produced by scientists but not easily traceable as intellectual output. Over this time we have learned that resources are really hard to track in published research. Undoubtedly this is due in no small part to a flawed view, held by many authors and editors, that experimental resources are not important enough intellectual outputs of research and as such do not need to be documented in detail in publications. Aside from this cultural indifference to reporting, and perhaps contributing to it, is a significant problem with the integrity of the data surrounding experimental materials. Routinely, materials circulated in the community and produced commercially come with insufficient information to allow authors to uniquely and clearly cite them in their work.
The shortsightedness of this relaxed attitude to resource documentation, and the seriousness of the problem it creates, have been underlined recently in a number of interesting articles about reproducibility, or the lack thereof, in basic science. The Economist ran a piece with the dramatic headline “Trouble at the lab: Scientists like to think of science as self-correcting. To an alarming degree, it is not”. Another article, which caused a stir when it was published a number of years ago, was that of Glenn Begley et al., pointedly entitled “Drug development: raise standards for preclinical cancer research”. In it Begley describes how he and his team at Amgen embarked on a study to reproduce the published findings of 53 “landmark” pre-clinical oncology projects. The results were startling: only 6 of the 53 studies (11%) proved reproducible. While acknowledging that the reasons for these failures are complex, Begley notes that the “limitations of preclinical tools such as inadequate cancer-cell-line and mouse models make it difficult for even the best scientists working in optimal conditions to make a discovery that will ultimately have an impact in the clinic”.
Historically, editorial policies in high impact journals like Nature have progressively reduced the word count of their articles. This has forced many authors to leave important details out of manuscript submissions, with method sections bearing much of the brunt of this forced reduction. Indeed, in my postdoctoral lab at Stanford we used to joke that to publish a complete and reproducible set of methods in Nature you had to publish a companion paper in another journal. Furthermore, publisher guidelines for method reporting are often insufficient. Routinely, manuscripts that fulfill all editorial requirements are published but contain wholly inadequate information about the materials used by the publishing author(s), making experimental reproduction more difficult.
An article published by Vasilevsky et al. earlier this year highlights the extent of the problem caused by inadequate resource reporting, across a range of research materials, in a single month of publications.
Figure legend: (A) Summary of average fraction identified for each resource type. (B-F) Identifiability of each resource type by discipline. The total number of resources for each type is: (B) antibodies, n = 703; (C) cell lines, n = 104; (D) constructs, n = 258; (E) knockdown reagents, n = 210; (F) organisms, n = 428. The y-axis is the average for each resource type across each domain. Variation from this average is shown by the bars; error bars indicate upper and lower 95% confidence intervals.
Alarmingly, this study reported that “54% of resources are not uniquely identifiable in publications, regardless of domain, journal impact factor, or reporting requirements”.
So what can be done? Are we doomed to suffer inadequate and ambiguous methods?
Perhaps not. Members of the RII are supporting positive change in a number of important areas:
- Changing Editorial Policy and Reporting Culture Amongst Authors
Publishers are starting to pay much more attention to the methods section and increasingly are removing the space limits on it. Wiley, Elsevier and PeerJ have taken an active role in raising awareness amongst editors and authors.
- Improving Available Data: Unique Identifiers
In response to the issue of inadequate or inappropriate material nomenclature, a number of groups are now working together. For example, NIF created the Antibody Registry, which assigns unique identifiers to all known antibodies; these identifiers are increasingly being used by commercial companies to reference specific antibodies. The RII has created a portal for authors to discover unique identifiers for organisms, antibodies and tools.
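To make the idea concrete, here is a minimal sketch of checking that a citation carries a unique identifier of the kind the RII portal surfaces. The RRID-style format shown (e.g. `RRID:AB_2298772` for an Antibody Registry entry) and the prefix convention are assumptions for illustration; consult the portal for the exact syntax.

```javascript
// Sketch: validating a resource identifier of the RRID form, where the
// "RRID:" prefix is followed by a source-authority code (e.g. AB for the
// Antibody Registry) and a numeric accession. Illustrative only.
function isResourceIdentifier(id) {
  return /^RRID:[A-Z]+_\d+$/.test(id);
}
```

A manuscript-checking tool could flag any materials sentence that names an antibody without a matching identifier.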
- Standardising Vocabularies: Resource Ontologies
The eagle-i project has played a central role in developing a standardized vocabulary for describing material resources and how they relate to one another. The eagle-i resource ontology is publicly available and is increasingly being used within the community.
- Technology: Analysis, Citation and Annotation
There are numerous efforts underway to develop technology that allows better identification of resources in published research. The Domeo web annotation toolkit developed by Tim Clark at the MassGeneral Institute for Neurodegenerative Disease is one such example. Another is the citation tool created by the RII, which allows authors to automatically create machine-readable resource identifiers that can be inserted into draft manuscripts.
- Storage and Dissemination of Structured Resource Information
Access to structured, well curated, uniquely identified resource information and data will soon also be available via the scrazzl Developer network. This should make it easier for developers to build tools that help authors to discover consistent well-structured resource information that can be cited in new articles.
The RII pilot, which is being launched on the 12th of November at the Society for Neuroscience meeting in San Diego, involves a wide range of collaborators who aim to make three types of common resources machine identifiable in a selection of new publications in Neuroscience.
How can I help?
- If you are in the process of drafting a new paper for submission, check out the citation tool created by the RII, which allows you to automatically create resource identifiers.
- Put pressure on publishers and editors to improve their method reporting standards.
- Be more demanding of your suppliers: tell them you want precise information about the products they sell published on their websites, including an Antibody Registry unique identifier. If they don’t provide sufficient data, don’t buy from them.
- Spread the word #RII #reproducibility
Back in June, CompareNetworks (CN) acquired scrazzl and since then we have been really busy finalizing some exciting features that have been in gestation for a while. For scrazzl, joining forces with CN has meant a couple of things:
- Engineering: Access to a large team of excellent engineers who are very familiar with common data issues in the life science space.
- Customers: CN has been the leading provider of discovery tools for the life science market for over 10 years and has a great relationship with its customers.
- Data: In this game it’s all about the data. CN has the most comprehensive and well-curated database of research products in the life science and healthcare sectors. By teaming up with CN, scrazzl is now in a position to provide highly structured research product information to developers interested in building product information into their web services and applications.
What can I do with the API?
- DOI: Publishers or content aggregators can query the API using a digital object identifier (DOI). This query will return a list of specific products that map to that digital object.
- Product Identifier: Product identifiers such as the product name and supplier catalogue ID can also be used to query. Returned results will include all available product and article information that maps to that query.
- Gene/Protein/CAS identifiers: It will soon be possible to query the API using a range of specific biological and chemical identifiers such as UniProt IDs, GenBank accession numbers and ChemSpider IDs.
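As a rough illustration of the DOI query type described above, the sketch below builds a lookup URL and extracts product names from a response. The endpoint host, path, parameter name and response shape are all invented for illustration; the real interface is documented for Developer network members.

```javascript
// Hypothetical sketch of a DOI-based product lookup. None of these names
// come from the actual API documentation.
const API_BASE = 'https://api.example.com/v1'; // placeholder host

// The DOI must be URL-encoded because it contains reserved characters ("/").
function productsByDoiUrl(doi) {
  return `${API_BASE}/products?doi=${encodeURIComponent(doi)}`;
}

// An assumed response shape mapping the article to the products it mentions:
// { doi: "10.1016/...", products: [{ name, supplier, catalogNumber }, ...] }
function productNames(response) {
  return response.products.map(p => p.name);
}
```

A publisher widget could call a URL like this on page load and render the returned product list alongside the article.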
Why join the scrazzl Developer network?
There are many benefits to joining the scrazzl Developer network not least being access to a comprehensive product database and related article metadata. This information can be used to build valuable new features for your audience. Importantly, members of the Developer network can opt to join our syndication programme. The syndication programme allows our partners to earn valuable ancillary revenue from the applications that they build using scrazzl data.
Please get in touch if you are interested in learning more about the scrazzl Developer network, syndication opportunities, or if you have any ideas for improvements or features.
This week, we were delighted to announce the acquisition of scrazzl.com by CompareNetworks. The journey that we have travelled together as a team over the past number of years has been exciting and at times challenging. However, over that time we have tried to stay true to our vision for scrazzl, which has been to build great online tools that help scientists make better and more efficient decisions about the products that they use in their research. Our stellar team of engineers, Paul Phillips, Daniel Hunt, Mauro Iovino and Ben Crowe brought passion, ingenuity and determination to our little start-up and for that I am incredibly grateful.
I am delighted that in CompareNetworks we have found a partner with a common vision and purpose, who shares our desire and passion to help scientists all around the world make better decisions for their research. Over the next few months we will be announcing some exciting new developments, publisher partnerships and features. Some more information about the recent deal can be viewed here and here
Two weeks ago I was delighted to be invited to speak at the ALPSP International Industry conference. ALPSP is one of the main industry associations for scholarly publishers, so the audience was comprised mainly of publishers and those with a direct interest in the publishing industry. I was included in the industry updates parallel session with Jan Reichelt (Mendeley), Ryan Jones (Pubget) and Daniel Mayer (Temis). All the sessions are provided here, with mine excerpted below.
Scrazzl was featured this week as part of siliconrepublic’s One to Watch series. Carmel Doyle explains that the “One to Watch series will look at Irish start-ups from the technology and scientific spaces that are fast making waves in their respective industries and could be on the cusp of something big for their niche innovations”. The video below contains some extracts from that interview.
I was recently invited to pitch at Tech All Stars. Instead of the usual investor deck, I wanted to try something different. Inevitably, the focus of an investor deck is on the hard numbers, the tangibles: investors want to know what traction you have and how you are going to make lots of money for them. That is fair enough, given that you are usually asking those same investors to invest in your business.
On the face of it, our business is a relatively simple proposition: for scientists, we provide search for research materials, and we sell advertising solutions to the companies that make those materials. In the middle are the publishers we partner with: they give us access to their content and traffic, and we make money for them. This is pretty much my standard investor deck in three sentences.
The frustrating thing is that this is not the whole story, and making money for publishers and suppliers is not what has motivated our team for the past three years. For Tech All Stars I prepared two slide decks: one for investors, and one for us. The following is the pitch I wanted to give but never did…
At the core of what we do at scrazzl is data mining and document analysis. With more than 5 million full-text articles and more than 20 million abstracts, we have no shortage of raw data. Undoubtedly, analysis of content and content usage has the potential to provide insights of great benefit to the research community. Globally, efforts to develop alternative metrics (altmetrics) to supplement impact factors and citation counting are gathering pace, spearheaded by organisations like PLoS, CiteULike and Mendeley. In parallel, publisher-led initiatives like the Usage Factor project also promise to provide useful insights into article usage and quality.
It has been clear for some time that the traditional methods of establishing article quality and impact, namely peer review, JIF (journal impact factor) and citation count, are straining under the weight of new content being produced every year. The emergence of altmetrics as a supplementary measure of quality, taking account of storage, links, bookmarks and conversations, certainly has the potential to alleviate some of the problems caused by the continuing data tsunami. Collectively, journal and article level metrics are designed to help readers assess research quality at the level of the article.
Apart from the question “Is this article worth reading?” there are quite a number of other questions asked by researchers that could in future benefit from improvements in analysis of content and content usage. For example, DataCite is working to promote citation and attribution of primary datasets, making it easier for researchers to source good primary data. At scrazzl our focus is on the provision of qualitative metrics to support decision making in experimental science and medicine. Our working hypothesis is that by collating data related to material usage in a large sample of articles, it is possible to derive qualitative measures and insights into the use and application of those materials. Through analysis of articles in combination with metrics related to article quality and researcher influence, we are working to help our users answer questions like:
- How good is this product? Should I use it or prescribe it? Recommendation engines like TripAdvisor have revolutionised how we make decisions about the hotels we book or the places we visit. However, the fragmented nature of the scientific supply market, coupled with the fact that scientists do not purchase like ordinary consumers (usually they buy through procurement systems), has meant that there is little by way of qualitative information relating to experimental materials. Knowing that other scientists have successfully used the tools and technologies you are interested in can have a significant bearing on your decision to buy a specific product. In the competitive world of research, the cost of failed experiments is high. An initial focus of our article analysis work has been on extracting usage data for experimental materials. We render these derivative metrics through our Product Metrics Widget.
- What are the optimal conditions for the experiment that I plan to run? Slightly more technically challenging than gathering usage statistics for materials in articles is gathering the optimal working conditions for those same materials. Take, for example, a specific antibody whose use has been reported more than 500 times in published research. As a researcher, you are interested in the range of dilutions that have been successful for this antibody. Providing this normally fragmented and disparate information in a structured, actionable form is a current R&D focus.
- Who are the researchers that have the experience or resources that I need? Sites like LinkedIn have been successful because they have built a community of skilled individuals and made it really easy to create a network within that community. In science there have been many efforts to create a “facebook for science”, most of which have failed. Those that are succeeding, Mendeley and CiteULike for example, are doing so because their core value proposition has never been to create a facebook for science. These platforms solve other problems, such as reference management and social bookmarking, and allow users to build a social network as a by-product. For scientists seeking specific material resources or technical experts, arguably the most complete source of information is the existing literature. At scrazzl, by mapping the associations between scientists and the experimental tools that they document in their research, we are making it easier for scientists to find the people and answers that they require.
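The dilution-range idea above can be sketched simply: collect the dilutions reported across many articles, parse them, and summarise the range. The input data and helper names here are invented for illustration; real extraction would of course parse full methods text rather than clean strings.

```javascript
// Sketch: collating antibody dilutions reported across articles into a
// summary range. A dilution like "1:500" is reduced to its denominator.
function parseDilution(text) {
  const m = /^1:(\d+)$/.exec(text.trim());
  return m ? Number(m[1]) : null; // null for unparseable reports
}

function dilutionSummary(reports) {
  const values = reports
    .map(parseDilution)
    .filter(v => v !== null)
    .sort((a, b) => a - b);
  return {
    n: values.length,
    min: values[0],                  // most concentrated reported dilution
    max: values[values.length - 1],  // most dilute
    median: values[Math.floor(values.length / 2)],
  };
}
```

For an antibody reported 500+ times, a summary like this turns hundreds of scattered methods sections into one actionable starting range.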
Currently we are at the start of a transformation in how the quality and impact of research is assessed. As new standards of research qualification (usage factor, altmetrics) mature, it is likely that researchers will demand more tools that can unlock the collective insights housed within content and content usage to enable better decision making. There will be greater demand for standardised metrics of quality in many areas of scientific enquiry. For our part, we will be working hard to unlock the benefits of large-scale article analysis in all areas of experimental science.
At scrazzl.com we love big data, and recently we were having a discussion about how it might be cool to look at the number of research articles published by Reed-Elsevier that mention particularly nasty cancers. Why would we do this, I hear you ask? Well, our working hypothesis was that cancer research output, as measured by the “mention frequency” of different cancer types in peer-reviewed journal articles, should closely mirror incidence and/or mortality statistics. However, what we found was quite unexpected!
Figures 1 and 2 below trace the death rates and incidence, respectively, per 100,000 of population in the US between 2003 and 2007, as reported by the Centers for Disease Control and Prevention (CDC). The cancers resulting in the largest number of deaths over that period included lung cancer, prostate cancer, breast cancer, colon cancer, pancreatic cancer, ovarian cancer and leukemia. As expected, lung cancer resulted in the most deaths, killing on average twice as many people as the next nearest cancers, breast and prostate. We next took a look at major cancer incidence per 100,000: consistently, over the time period examined, prostate cancer had the highest incidence rate, with breast cancer a close second. So, what conclusions can be drawn from this data?
- Lung cancer is a particularly problematic cancer and while the overall mortality trend is down it still kills a lot of people in the US every year.
- The incidence of prostate cancer is trending upwards and has the highest incidence of all cancers, albeit for a sub-set of the population (men).
- Breast cancer still has significantly high mortality rates but the overall mortality trend is downward (-7.14% between 2003 and 2007).
- Leukemia has the 7th highest mortality rate and the incidence of leukemia does not feature in the CDC top ten.
Figure 1 Age-Adjusted Cancer Death Rates for the 6 Primary Sites with the Highest Rates within Race- and Ethnic-Specific Categories (Data provided by Centers for Disease Control and Prevention).
Figure 2 Age-Adjusted Invasive Cancer Incidence Rates for the 10 Primary Sites with the Highest Rates within Race- and Ethnic-Specific Categories (Data provided by Centers for Disease Control and Prevention).
After bringing the CDC data together, we ran a “phrase analysis” on the text content of more than 4 million research articles using scrazzl Analytics™. We defined our keywords as the major cancers resulting in death as reported by the CDC, namely lung cancer, prostate cancer, breast cancer, colon cancer and leukemia. Figure 3 documents our results.
Figure 3 scrazzl Analytics™ showing the number of Reed-Elsevier articles that specifically mention notable cancers between 2003 and 2007.
scrazzl Analytics™ was configured in this instance to count a single phrase occurrence per article: if “leukemia” appeared 10 times in an article, it would only be counted once. This is important because we are interested in article count, not simply phrase count. The results are striking: leukemia, which in relative terms has low mortality and incidence rates, has the highest article count. Breast cancer, which has a significantly lower incidence rate than prostate cancer, also features prominently. It is important to note that our analysis has not taken account of phrase aliases, so there is a margin of error in our results. Nevertheless, the obvious disparity between death/incidence rates and research effort as measured by article count raises the question: are research efforts focusing on the wrong cancers?
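The once-per-article counting rule can be sketched as follows. The keywords and article snippets here are invented; real matching runs over full article text with alias handling.

```javascript
// Sketch: an article contributes at most one hit per keyword, no matter
// how many times the phrase appears in its text.
function articleCounts(articles, keywords) {
  const counts = Object.fromEntries(keywords.map(k => [k, 0]));
  for (const article of articles) {
    const text = article.toLowerCase();
    for (const k of keywords) {
      if (text.includes(k.toLowerCase())) counts[k] += 1; // once per article
    }
  }
  return counts;
}
```

Counting articles rather than raw phrase occurrences prevents a single review that repeats “leukemia” fifty times from dominating the totals.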
While this is a somewhat simplified approach to measuring trends in cancer research investment vs the most dangerous cancers in the population, the results prompt further discussion nonetheless. Similar to popular culture, scientific research can follow unexpected trends. Certain diseases become “popular” to research, perhaps because they are particularly emotive or have high incidence in population subsets that are vulnerable (leukemia and children). It is also likely that cancers such as breast cancer and leukemia attract disproportionate amounts of research funding and therefore attract greater numbers of researchers. It is also notable that the various charities that support research efforts on these types of cancers, leukemia and breast cancer, are very well organised and well supported within the global community.
When making decisions about research focus it is important that researchers, funding agencies and journal editorial staff are mindful of disease macrostatistics. When prioritising research individual scientists should look at the cold hard facts i.e. which cancers are killing the most people and occurring most frequently in populations. Our analysis of research output suggests that currently this may not be happening. Prostate cancer, lung cancer, and colon cancer are killing large numbers of people but appear to be under-researched when compared to the research output relating to breast cancer and leukemia.
To explore these findings further and improve the accuracy of this analysis we are currently looking at a number of solutions and are keen to get some suggestions from you guys. There are two issues that we think may be impacting the accuracy:
- Aliases and disease variants: Many cancers have more than one name or exist as multiple variants; for this reason we are seeking to incorporate a disease ontology into our analysis.
- Limited data coverage: This analysis has been performed on Elsevier data only, and while this represents a significant proportion of research content, it is not all-encompassing. Over the coming months we hope to add more data sources, allowing more accurate analysis.
In my next post I’m going to take a look at some more interesting trends that we are seeing in disease related article data and how we are making it possible to view this data in new and useful ways.
Having just passed the two-month mark in my time here with scrazzl, I figure it’s about time I jot down a few thoughts about what we’re getting up to in here, and what it’s like to join such a young company early on.
Who am I?
My name is Daniel Hunt, and I’ve joined the scrazzl tech team as a senior engineer.
What do I do?
More specifically, I’m a dedicated techy who likes to keep his fingers in all the pies :)
What pies do I ♥?
I ♥ JS.
Ask me to choose between JS and PHP and I’ll cry. Ask me to choose between JS and anything else, and I’ll hug you with curly braces.
Naturally enough, this has recently developed into a heavy interest in all things Node, but I’ve yet to find an excuse to build anything scrazzl related in it :)
I ♥ databases
There’s just something about the hidden world of database schema design/maintenance/improvement that floats my boat.
The strictly normalised approach of standard relational databases (such as MySQL) has been my bread and butter for years now.
I ♥ JS & databases
As you’ll read a little later, I’ve recently come to love the combination of the other 2 things I love most.
I ♥ PHP
At this stage, PHP is a mature language. There are plenty of other upstarts claiming the mindshare amongst new developers, but I’ve yet to come across something to pull me away from its allure.
When it comes to web development, rapid prototyping and easy maintainability, PHP holds its own with elegant style.
I’ve been lucky enough to have been surrounded by many, many great PHP developers in the past, who have helped me climb the development ladder. As a result, now that I’ve joined a startup, I want to make sure that others get the same chances I got.
I ♥ the cloud
I’ve been lucky enough to have been exposed to Amazon’s EC2 regularly for the past few years - almost as soon as it was available, if I remember correctly.
Without it, nothing I’ve built over that time would have had the flexibility, capability or reliability that it does. The cloud, as they say, is the future.
But it’s not. It’s the present. We’ve already arrived at the future.
Why did I join scrazzl?
Since entering the workforce a number of years ago, I’ve found myself gravitating towards smaller companies with every job-hop I made. More recently, I had found myself happy with the work I had done in my previous role, as the lead developer on DeviceAtlas.
Having spent a number of years there, honing my (admittedly, considerable ;) ) skills, I decided it was time to look for a new challenge. scrazzl popped up on my radar not long after I started scouting for one of those challenges, piquing my interest immediately.
The combination of an interesting problem to solve, a clear requirement in the market for the product itself, an enthusiastic CTO and a chance to help build the tech-stack from the ground up turned out to be too good an opportunity for me to skip.
So, I took the plunge.
2 months in
As I mentioned above, I’m now 2 months into my time here. In that time, I’ve been exposed to a whole array of technologies that I’ve been looking forward to trying out for a very long time, some of which I’ve mentioned below.
We are, first and foremost, a PHP dev house.
Our core libraries are written in PHP, as are our toolkits and anything else we find needs doing.
The wealth of development knowledge in scrazzl is incredible, and the elegance with which the main systems have been created knows no bounds. (too much? yeah. too much…. ;) )
Our main website is built using Zend Framework
Having only had limited exposure to this framework before joining scrazzl, I was a little apprehensive about just how easily I’d be able to jump in and start adding to the codebase.
Luckily enough, scrazzl has a pretty strict coding standard, so not only was everything well documented, but the existing code more or less steered me through the forest that was the initial learning curve.
We’ve used MongoDB extensively
As it relies on JSON-style documents (BSON under the hood), I feel right at home when playing about in MongoDB data. Easy to understand, and easy to get started with, this has been one of the highlights of working at scrazzl so far.
Schemaless design feels very natural once you shake off the shackles of SQL normalisation - it’s obviously not suitable for everything, but we’ve found it ideal for our analytics system.
Its speed is truly remarkable: even when dealing with very large data sets I’ve yet to see it crumble under the strain.
Knowing what I know now about NoSQL data storage mechanisms, I’ll be very reluctant to jump straight into a standard MySQL install without first considering the reasons for using a relational database!
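A small sketch of why the schemaless model suits an analytics workload: events of different types can live in one collection with whatever fields they need, and no schema migration is required when a new event type appears. The event shapes and field names below are invented for illustration.

```javascript
// Sketch: heterogeneous analytics events coexisting in one collection.
// In MongoDB each object would be inserted as-is; here we just show that
// documents with different fields can be processed together.
const events = [
  { type: 'search', term: 'anti-GFP antibody', results: 42, ts: '2012-05-01T10:00:00Z' },
  { type: 'product_view', productId: 'p-123', referrerDoi: '10.1000/xyz', ts: '2012-05-01T10:01:00Z' },
];

// Aggregate over the one field every event shares, ignoring the rest.
function countByType(docs) {
  return docs.reduce((acc, d) => {
    acc[d.type] = (acc[d.type] || 0) + 1;
    return acc;
  }, {});
}
```

In a strictly normalised SQL design, each new event type would mean a new table or a sparse column set; here it is just another document.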
Our servers run on Amazon EC2
I mentioned how much I love EC2 above.
It’s all true.
- I love the flexibility it provides, by allowing us to bring up/tear down servers at will.
- I love the redundancy it allows you to build into your systems, by forcing you to think of WHEN something will fail, not IF.
- I love how easy it is to build scalable solutions (such an easy thing to say, I know….).
But most of all?
I love that scrazzl uses it.
What’s it like in scrazzl?
scrazzl is a young company. So young, in fact, that we’ve been looking for more people for a few months now, and are hoping to grow very fast, very soon.
The technology we get to play with daily is interesting, self-rewarding and because we’re building our stack from the ground up, completely under our own control. I’m really looking forward to expanding what we have as the team grows - we have some very big plans in place for every corner of our systems! :)
The work itself is very rewarding, and absolutely every line of code written has a direct impact on what we do.
With so much work to do, it’s very important that everyone gets on well together. I think I’m very lucky to have managed to move all the way from huge multinational companies, down to a small startup and found myself on a great team of people. We pump through an amazing amount of work, are constantly evaluating one another to make sure that no one is getting left behind in the (organised) mayhem, and somehow we still manage to have a laugh each and every day.
If you’re interested in what we do, and would like to join us, feel free to contact us directly - firstname.lastname@example.org
We’re looking for great people to help grow the fantastic atmosphere we’ve already started to build here - so don’t be shy!
I’m looking forward to the following months, where we will be launching our site officially, signing up a whole raft of customers, building our team significantly and partying until the small hours.
Well, I’ve got to have something to aim for, don’t I?