Extracting a string of text, sometimes between other text.
Hello all,
I'm looking to build a formula that will return the digital object identifier (doi) string from a paper citation. The problem is, sometimes it the last piece of the citation, sometimes there is additional text at the end, and sometimes there isn't a doi to return.
I've been playing around with a few formulas that I've found in the community here - with variable success. Here is the data that I'm looking at. I'm seeking to put a formula into the DOI column that will extract the text following "doi: " (if this exists, otherwise return N/A) and before either ". Epub" or ". eCollection" or ". Erratum" or ". No abstract" or ". Print". I would also like to exclude the final "." to return the information as listed in the example below.
=RIGHT(citation@row, LEN(citation@row) - FIND(" doi:", citation@row) - 4)
seems to return everything after " doi:" if it exists, but doesn't exclude any text after the string.
=LEFT([DOI formula]@row, (FIND(". Epub ", [DOI formula]@row) - 1))
will exclude everything after ". Epub" if I combine it with the result of the first formula (another column I've labeled [DOI formula]).
I'm having trouble setting up the right logic to consider all possible text after the doi string.
I've attempted to explain this to the AI generator - with no luck. So I'm either not a great AI communicator - or this one is for the humans out there.
Anyone super skilled in this sort of text extraction?
Thank you!
Meredith
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
Answers
-
Hey @Meredith Rhodes
Will this work for you?
=IF(FIND("doi:", citation@row) > 0, MID(citation@row, FIND("doi:", citation@row) + 4, IF(CONTAINS(". Epub", citation@row), FIND(". Epub", citation@row), IF(CONTAINS(". Erratum", citation@row), FIND(". Erratum", citation@row), IF(CONTAINS(". No abstract", citation@row), FIND(". No abstract", citation@row), IF(CONTAINS(". eCollection", citation@row), FIND(". eCollection", citation@row), IF(CONTAINS(". Print", citation@row), FIND(". Print", citation@row), LEN(citation@row)))))) - 8 - FIND("doi:", citation@row) + 4), "N/A")
Kelly
-
Thank you @Kelly Moore - that does take care of the doi or not issue.
Is there an easy way to remove the "." (period) at the end of the citation from here?
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
-
Hey @Meredith Rhodes
Just to verify that you're using this more recent formula, not the initial one I submitted
=IF(FIND("doi:", citation@row) > 0, MID(citation@row, FIND("doi:", citation@row) + 4, IF(CONTAINS(". Epub", citation@row), FIND(". Epub", citation@row), IF(CONTAINS(". Erratum", citation@row), FIND(". Erratum", citation@row), IF(CONTAINS(". No abstract", citation@row), FIND(". No abstract", citation@row), IF(CONTAINS(". eCollection", citation@row), FIND(". eCollection", citation@row), IF(CONTAINS(". Print", citation@row), FIND(". Print", citation@row), LEN(citation@row)))))) - 8 - FIND("doi:", citation@row) + 4), "N/A")
When I test the period is removed. Would you send me some actual text that I can copy/paste into my test sheet please.
Kelly
-
WHOA!
Thank you @Kelly Moore! This works beautifully! You are a formula artist :)
I'm going to have to work on 'reading' this one so I can learn from your logic.
I appreciate this!
Meredith
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
-
Hey @Meredith Rhodes
Happy it works for you,
The formula isn't as clean as I would have liked it to be.
The first step determines if 'doi' even exists.
The main function is the MID function. MID(text, startposition, num_chars)
MID is used to extract strings in the middle of a larger string. The 'middle' is relative - it can actually be left, right, whatever. There are 3 components to it, vs just 2 components of the RIGHT or LEFT functions. For the MID, you must also specify Where you are starting from.
The first FIND ("doi:", citation@row) finds the starting position. All of the CONTAINS finds if any of the other terms exist and then their respect FIND locates where they are positioned in the string. The number of characters will be the difference between the next term and the "doi:" starting position.
So it looks like a complicated formula but it actually just looks more intimidating than it really is.
Kelly
-
Wonder if you can help me with a similar query - a formula to extract the date closest to the end of the citation to automatically populate the 'date published' column. I'm most interested in the month / year - in other words, all dates can be the first of the month in the given year.
You can see in the above screenshot that some dates have a 'day' and some do not.
Is this figure-out-able?
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
-
Hey @Meredith Rhodes
It might be do-able but it will be an ugly formula- and not a promise I can get it to work. But I have an idea. Just eyeballing it, you're looking for a FIND('some month') that is greater than your FIND("doi). You will also have to have a nested IF that is twelve deep to convert your Month names into Month numbers, assuming we locate the month names. If the month names aren't in the citation I don't see a way to locate a year.
Are you open to adding a couple of helper columns, if needed, just to break the formula into smaller pieces? These columns can be hidden and slung out of the way.
I can give it a shot this weekend. Would you mind pasting a couple of your citations here (not a screenshot) so I can paste real text into my sheet to play with.
Kelly
-
Thank you, @Kelly Moore - Here are a few citations - all with dates, not necessarily doi: or after the doi: which makes it interesting.
Bailey HH, Attia S, Love RR, Fass T, Chappell R, Tutsch K, Harris L, Jumonville A, Hansen R, Shapiro GR, Stewart JA. Phase II trial of daily oral perillyl alcohol (NSC 641066) in treatment-refractory metastatic breast cancer. Cancer Chemother Pharmacol. 2008 Jun;62(1):149-57. doi: 10.1007/s00280-007-0585-6. Epub 2007 Sep 21.
McManus BM, Robert SA, Albanese A, Sadek-Badawi M, Palta M. Racial disparities in health-related quality of life in a cohort of very low birthweight 2- and 3-year-olds with and without cerebral palsy. Dev Med Child Neurol. 2011 May;53(5):467-9. doi: 10.1111/j.1469-8749.2011.03923.x. Epub 2011 Mar 17.
Palta M, Gabbert D, Fryback D, Widjaja I, Peters ME, Farrell P, Johnson J. Development and validation of an index for scoring baseline respiratory disease in the very low birth weight neonate. Severity Index Development and Validation Panels and Newborn Lung Project. Pediatrics. 1990 Nov;86(5):714-21.
McManus BM, Robert S, Albanese A, Sadek-Badawi M, Palta M. Predictors of receiving therapy among very low birth weight 2-year olds eligible for Part C early intervention in Wisconsin. BMC Pediatr. 2013 Jul 11;13:106. doi: 10.1186/1471-2431-13-106.
For sure I don't mind helper columns - I'm playing with the concept you proposed to try to get the number of the month listed. I'm not there yet - but I've nested a bunch of things:
=FIND((OR (="Jan", ="Feb", ="Mar", ="Apr", ="May", ="Jun", ="Jul", ="Aug", ="Sep", ="Oct", ="Nov", ="Dec"), [citation]@row, > "doi:"), IF([citation]@row = "Jan", 1, IF([citation]@row = "Feb", 2, IF([citation]@row = "Mar", 3, IF([citation]@row = "Apr", 4, IF([citation]@row = "May",, 5, IF([citation]@row = "Jun", 6, IF([citation]@row = "Jul", 7, IF([citation]@row = "Aug", 8, IF([citation]@row = "Sep", 9, IF([citation]@row = "Oct", 10, IF([citation]@row = "Nov", 11, IF([citation]@row = "Dec", 12))
Friday afternoon is my smartsheet playtime. ;)
Thank you!
Meredith
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
-
Hey @Meredith Rhodes
Here's what I came up.
I did use a Helper column. I decided to grab the month and it's position in one formula to make it easier bringing the data into the Date column.
Here's the Helper column formula- It first looks to see if the Month is showing up after a 'doi' since occassionally random month names would show up. If there is not a month name after 'doi' (or no 'doi') then it will Find the first month name (if any) in the citation.
=IF(FIND("Jan", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 1 + "~" + FIND("Jan", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Feb", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 2 + "~" + FIND("Feb", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Mar", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 3 + "~" + FIND("Mar", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Apr", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 4 + "~" + FIND("Apr", RIGHT(citation@row, LEN(citation@row))), IF(FIND("May", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 5 + "~" + FIND("May", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Jun", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 6 + "~" + FIND("Jun", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Jul", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 7 + "~" + FIND("Jul", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Aug", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 8 + "~" + FIND("Aug", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Sep", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 9 + "~" + FIND("Sep", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Oct", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 10 + "~" + FIND("Oct", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Nov", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 11 + "~" + FIND("Nov", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Dec", RIGHT(citation@row, LEN(citation@row) - FIND("doi", citation@row))) > 0, 12 + "~" + FIND("Dec", RIGHT(citation@row, LEN(citation@row))), IF(FIND("Jan", citation@row) > 0, 1 + "~" + FIND("Jan", citation@row), IF(FIND("Feb", citation@row) > 0, 2 + "~" + FIND("Feb", citation@row), IF(FIND("Mar", citation@row) > 0, 3 + "~" + FIND("Mar", citation@row), IF(FIND("Apr", citation@row) > 0, 4 + "~" + FIND("Apr", citation@row), IF(FIND("May", citation@row) > 0, 5 + "~" + FIND("May", citation@row), IF(FIND("Jun", citation@row) > 0, 6 + "~" + FIND("Jun", citation@row), IF(FIND("Jul", citation@row) > 0, 7 + "~" + FIND("Jul", citation@row), IF(FIND("Aug", citation@row) > 0, 8 + "~" + FIND("Aug", citation@row), IF(FIND("Sep", citation@row) > 0, 9 + "~" + FIND("Sep", citation@row), IF(FIND("Oct", citation@row) > 0, 10 + "~" + FIND("Oct", citation@row), IF(FIND("Nov", citation@row) > 0, 11 + "~" + FIND("Nov", citation@row), IF(FIND("Dec", citation@row) > 0, 12 + "~" + FIND("Dec", citation@row)))))))))))))))))))))))))
Here's the Date column for your published Date
=IFERROR(DATE(VALUE(MID(citation@row, VALUE(RIGHT([Helper Column]@row, LEN([Helper Column]@row) - FIND("~", [Helper Column]@row))) - 5, 4)), VALUE(LEFT([Helper Column]@row, FIND("~", [Helper Column]@row) - 1)), 1), "")
This might not find all of the dates since your text is unstructured, but it will find most of them. Will this work for you?
Kelly
-
That is most impressive @Kelly Moore! Wow, this works well for most of my citations :)
Now, is there a way to tag this post for other people working with citations, I wonder?
Can you confirm what the number to the right of the ~ means in the helper column? I see that the month is extracted to the left...
Thank you!
Meredith
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
-
FIND() returns a position number of whatever it was looking for. The number to the right of the tilde is that position in your citation string. We needed to use the MID function later to insert the data into the DATE function, and MID needs to know the position number it's going to start with. So rather than have to do two IF statements to look up if a Month name exists I added more text to the result of the original IF. The '~' is a great character to use as a marker, or anchor, when you're extracting data. It is uncommon in general text so you know if you find it is probably the correct one that you intentionally inserted. If you come across other examples of extraction formulas by me, it wouldn't be surprising to find the tilde in the formula.
Does that make sense? If not, ask me again
Kelly
-
Yes, thank you for this explanation @Kelly Moore!
Meredith Rhodes, PhD
ClinicalTrials.gov Specialist
UW School of Medicine & Public Health
UW Clinical Trials Institute
mkrhodes@clinicaltrials.wisc.edu
Help Article Resources
Categories
- All Categories
- 14 Welcome to the Community
- Customer Resources
- 65.1K Get Help
- 444 Global Discussions
- 142 Industry Talk
- 472 Announcements
- 5K Ideas & Feature Requests
- 83 Brandfolder
- 150 Just for fun
- 71 Community Job Board
- 489 Show & Tell
- 33 Member Spotlight
- 2 SmartStories
- 301 Events
- 36 Webinars
- 7.3K Forum Archives
Check out the Formula Handbook template!