README.md: 7 additions & 25 deletions
@@ -6,7 +6,7 @@ SynapseML (previously known as MMLSpark), is an open-source library that simplif

With SynapseML, you can build scalable and intelligent systems to solve challenges in domains such as anomaly detection, computer vision, deep learning, text analytics, and others. SynapseML can train and evaluate models on single-node, multi-node, and elastically resizable clusters of computers. This lets you scale your work without wasting resources. SynapseML is usable across Python, R, Scala, Java, and .NET. Furthermore, its API abstracts over a wide variety of databases, file systems, and cloud data stores to simplify experiments no matter where data is located.

-SynapseML requires Scala 2.12, Spark 3.4+, and Python 3.8+.
+SynapseML requires Scala 2.12, Spark 3.5+, and Python 3.8+.

| Topics | Links |
| :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -117,48 +117,32 @@ In Microsoft Fabric notebooks SynapseML is already installed. To change the vers

### Synapse Analytics

In Azure Synapse notebooks, place the following in the first cell of your notebook.

-- For Spark 3.5 Pools:
+- For Spark 4.0 Pools:

```bash
%%configure -f
{
  "name": "synapseml",
  "conf": {
-    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.1.0",
+    "spark.jars.packages": "com.microsoft.azure:synapseml_2.13:1.1.0-spark4.0",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
-    "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
+    "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.13,org.scalactic:scalactic_2.13,org.scalatest:scalatest_2.13,com.fasterxml.jackson.core:jackson-databind",
    "spark.yarn.user.classpath.first": "true",
    "spark.sql.parquet.enableVectorizedReader": "false"
  }
}
```

-- For Spark 3.4 Pools:

```bash
-%%configure -f
-{
-  "name": "synapseml",
-  "conf": {
-    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.15",
-    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
-    "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
-    "spark.yarn.user.classpath.first": "true",
-    "spark.sql.parquet.enableVectorizedReader": "false"
-  }
-}
```

-- For Spark 3.3 Pools:
+- For Spark 3.5 Pools:

```bash
%%configure -f
{
  "name": "synapseml",
  "conf": {
-    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3",
+    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.1.0",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
    "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
    "spark.yarn.user.classpath.first": "true",
    "spark.sql.parquet.enableVectorizedReader": "false"
  }
}
```



To install at the pool level instead of the notebook level [add the spark properties listed above to the pool configuration](https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/how-to-set-spark-pyspark-custom-configs-in-synapse-workspace/ba-p/2114434).
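
Either way, it can help to confirm that the configuration actually reached the session. A minimal sanity check from a Scala notebook cell, assuming only the stock Spark conf API and the `spark` session the notebook provides:

```scala
// Read back the jar coordinates the session was started with; if these return
// the fallback value, the %%configure cell (or pool config) was not applied.
println(spark.conf.get("spark.jars.packages", "not set"))
println(spark.conf.get("spark.jars.repositories", "not set"))
```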

### Databricks
@@ -16,31 +16,6 @@ import spray.json.DefaultJsonProtocol._

import scala.language.existentials

-trait HasPromptInputs extends HasServiceParams {
-  val prompt: ServiceParam[String] = new ServiceParam[String](
-    this, "prompt", "The text to complete", isRequired = false)
-
-  def getPrompt: String = getScalarParam(prompt)
-
-  def setPrompt(v: String): this.type = setScalarParam(prompt, v)
-
-  def getPromptCol: String = getVectorParam(prompt)
-
-  def setPromptCol(v: String): this.type = setVectorParam(prompt, v)
-
-  val batchPrompt: ServiceParam[Seq[String]] = new ServiceParam[Seq[String]](
-    this, "batchPrompt", "Sequence of prompts to complete", isRequired = false)
-
-  def getBatchPrompt: Seq[String] = getScalarParam(batchPrompt)
-
-  def setBatchPrompt(v: Seq[String]): this.type = setScalarParam(batchPrompt, v)
-
-  def getBatchPromptCol: String = getVectorParam(batchPrompt)
-
-  def setBatchPromptCol(v: String): this.type = setVectorParam(batchPrompt, v)
-
-}

trait HasMessagesInput extends Params {
  val messagesCol: Param[String] = new Param[String](
    this, "messagesCol", "The column messages to generate chat completions for," +

This file was deleted.
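
If, as the removal of HasPromptInputs above and the deleted OpenAICompletion match clauses further down suggest, this PR retires the legacy text-completions path, callers that bound a bare prompt column would migrate to the messages-based API that remains. A rough sketch, assuming the SynapseML 1.x import path; only setMessagesCol is confirmed by the HasMessagesInput trait shown above, the other setters follow usual SynapseML conventions:

```scala
import com.microsoft.azure.synapse.ml.services.openai.OpenAIChatCompletion

// Before (removed): a completion service configured via setPromptCol("prompt").
// After: chat-style messages; the column holds role/content message structs.
val chat = new OpenAIChatCompletion()
  .setDeploymentName("gpt-4o")   // assumption: any chat-capable deployment name
  .setMessagesCol("messages")    // the param kept by this PR (HasMessagesInput)
  .setOutputCol("chat")          // assumption: standard output-column param
```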

@@ -20,7 +20,6 @@ import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap, ParamValidators
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.{Column, DataFrame, Dataset, Row, functions => F, types => T}
-import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, typedLit, udf}
import org.apache.spark.sql.types.{DataType, StructField, StructType}
@@ -108,7 +107,7 @@ class OpenAIPrompt(override val uid: String) extends Transformer
    set(postProcessingOptions, v.asScala.toMap)

  val dropPrompt = new BooleanParam(
-    this, "dropPrompt", "whether to drop the column of prompts after templating (when using legacy models)")
+    this, "dropPrompt", "whether to drop the column of prompts after templating")

  def getDropPrompt: Boolean = $(dropPrompt)

@@ -229,32 +228,25 @@ class OpenAIPrompt(override val uid: String) extends Transformer
      createMessagesUDF: UserDefinedFunction,
      attachmentsColumn: Column
  ): (DataFrame, String, OpenAIServicesBase with HasTextOutput) = {
-    service match {
-      case c: OpenAICompletion =>
-        if (isSet(responseFormat)) {
-          throw new IllegalArgumentException("responseFormat is not supported for Completion.")
-        }
-        import com.microsoft.azure.synapse.ml.core.schema.DatasetExtensions._
-        val promptColName = df.withDerivativeCol("prompt")
-        (df.withColumn(promptColName, promptCol), promptColName, c.setPromptCol(promptColName))
-
-      case c: HasMessagesInput =>
-        if (isSet(responseFormat)) {
-          // Pass through responseFormat without forcing a single shape here.
-          // Each service validates according to its API (chat_completions vs responses).
-          c match {
-            case cc: OpenAIChatCompletion => cc.setResponseFormat(getResponseFormat)
-            case resp: OpenAIResponses => resp.setResponseFormat(getResponseFormat)
-          }
-        }
-        val messageColName = getMessagesCol
-        (
-          df.withColumn(messageColName, createMessagesUDF(promptCol, attachmentsColumn)),
-          messageColName,
-          c.setMessagesCol(messageColName)
-        )
-    }
+    if (isSet(responseFormat)) {
+      service match {
+        case cc: OpenAIChatCompletion => cc.setResponseFormat(getResponseFormat)
+        case resp: OpenAIResponses => resp.setResponseFormat(getResponseFormat)
+        case _ => // AIFoundryChatCompletion does not currently expose responseFormat.
+      }
+    }
+
+    val messageColName = getMessagesCol
+    val configuredService = service match {
+      case c: HasMessagesInput => c.setMessagesCol(messageColName)
+      case other => other
+    }
+
+    (
+      df.withColumn(messageColName, createMessagesUDF(promptCol, attachmentsColumn)),
+      messageColName,
+      configuredService
+    )
  }

  private def usageMappingFor(service: OpenAIServicesBase with HasTextOutput)

@@ -491,21 +483,12 @@

  private[openai] def hasAIFoundryModel: Boolean = this.isDefined(model)

-  //deployment name can be set by user, it doesn't have to match with model name
-  private val legacyModels = Set("ada", "babbage", "curie", "davinci",
-    "text-ada-001", "text-babbage-001", "text-curie-001", "text-davinci-002",
-    "text-davinci-003", "code-cushman-001", "code-davinci-002")
-
  private def getOpenAIChatService: OpenAIServicesBase with HasTextOutput = {

    val completion: OpenAIServicesBase with HasTextOutput =
      if (hasAIFoundryModel) {
        new AIFoundryChatCompletion()
-      }
-      else if (legacyModels.contains(getDeploymentName)) {
-        new OpenAICompletion()
-      }
-      else {
+      } else {
        // Use the apiType parameter to decide between chat_completions and responses
        getApiType match {
          case "responses" => new OpenAIResponses()
@@ -530,8 +513,6 @@
        chatCompletion.prepareEntity(r)
      case chatCompletion: OpenAIChatCompletion =>
        chatCompletion.prepareEntity(r)
-      case completion: OpenAICompletion =>
-        completion.prepareEntity(r)
    }
}

@@ -556,8 +537,6 @@
        chatCompletion.transformSchema(schema.add(getMessagesCol, StructType(Seq())))
      case chatCompletion: OpenAIChatCompletion =>
        chatCompletion.transformSchema(schema.add(getMessagesCol, StructType(Seq())))
-      case completion: OpenAICompletion =>
-        completion.transformSchema(schema)
    }

    val outputDataType: DataType = {

This file was deleted.
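
Taken together, the OpenAIPrompt changes mean every prompt now flows through a chat-style service: AIFoundryChatCompletion when a model is set, otherwise OpenAIResponses or OpenAIChatCompletion according to the apiType parameter, with responseFormat passed through for the chosen service to validate. A minimal usage sketch under that reading; setApiType, setResponseFormat, and the template and key setters are inferred from this diff and SynapseML's param conventions rather than verified signatures:

```scala
import com.microsoft.azure.synapse.ml.services.openai.OpenAIPrompt
import org.apache.spark.sql.DataFrame

// df is assumed to have a "text" column referenced by the template below.
def classify(df: DataFrame): DataFrame = {
  val prompt = new OpenAIPrompt()
    .setSubscriptionKey(sys.env("AZURE_OPENAI_KEY")) // assumption: key supplied via env var
    .setDeploymentName("gpt-4o")                     // legacy models (e.g. text-davinci-003) are no longer special-cased
    .setApiType("responses")                         // "responses" -> OpenAIResponses, else OpenAIChatCompletion
    .setPromptTemplate("Classify the sentiment of: {text}")
    .setOutputCol("sentiment")
  prompt.transform(df)
}
```

One consequence of the pass-through design: an unsupported responseFormat now surfaces as a service-side validation error rather than an up-front IllegalArgumentException, since the completion-only guard was deleted along with OpenAICompletion.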
